
Triplet loss


Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was introduced by Google researchers for their prominent FaceNet algorithm for face recognition.

The triplet loss function minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Triplet loss is designed to support metric learning: it assists in training models to learn an embedding (a mapping into a feature space) in which similar data points lie closer together and dissimilar ones lie farther apart, enabling robust discrimination across varied conditions. In the context of face recognition, data points correspond to images.

Definition

The loss function is defined over triplets of training points of the form $(A, P, N)$. In each triplet, $A$ (called an "anchor point") denotes a reference point of a particular identity, $P$ (called a "positive point") denotes another point of the same identity as $A$, and $N$ (called a "negative point") denotes a point whose identity differs from that of $A$ and $P$.

Let $x$ be some point and let $f(x)$ be its embedding in a finite-dimensional Euclidean space. The embeddings are assumed to be normalized so that the L2-norm of $f(x)$ is unity (the L2-norm of a vector $X$ in a finite-dimensional Euclidean space is denoted $\Vert X \Vert$). We assemble $m$ triplets of points from the training dataset. The goal of training is that, after learning, the following condition (called the "triplet constraint") is satisfied by every triplet $(A^{(i)}, P^{(i)}, N^{(i)})$ in the training data set:

$\Vert f(A^{(i)}) - f(P^{(i)}) \Vert_2^2 + \alpha < \Vert f(A^{(i)}) - f(N^{(i)}) \Vert_2^2$

The variable $\alpha$ is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, its value was set to 0.2.

Thus, the full form of the function to be minimized is the following:

$L = \sum_{i=1}^{m} \max\!\Big( \Vert f(A^{(i)}) - f(P^{(i)}) \Vert_2^2 - \Vert f(A^{(i)}) - f(N^{(i)}) \Vert_2^2 + \alpha,\ 0 \Big)$
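
As a concrete illustration, the summation above can be written in a few lines of NumPy. This is a minimal sketch rather than FaceNet's actual implementation: the function and variable names are illustrative, the embeddings are assumed to be precomputed as rows of (m, d) arrays, and the default margin follows the 0.2 value used in FaceNet.

import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    # Scale embeddings to unit L2 norm, as assumed in the definition above.
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # f_a, f_p, f_n: (m, d) arrays holding f(A^(i)), f(P^(i)), f(N^(i)).
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)  # ||f(A) - f(P)||_2^2
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)  # ||f(A) - f(N)||_2^2
    # Hinge term per triplet, summed over the m triplets.
    return np.sum(np.maximum(d_pos - d_neg + alpha, 0.0))

A triplet contributes zero loss once its negative lies farther from the anchor than its positive by at least the margin $\alpha$; only triplets that violate (or nearly violate) the triplet constraint produce a gradient.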

Selection of triplets

In general, the number of triplets of the form $(A^{(i)}, P^{(i)}, N^{(i)})$ is very large. To make computation faster, the Google researchers considered only those triplets that violate the triplet constraint. For a given anchor image $A^{(i)}$, they chose the positive image $P^{(i)}$ for which $\Vert f(A^{(i)}) - f(P^{(i)}) \Vert_2^2$ is maximal (such a positive image is called a "hard positive image") and the negative image $N^{(i)}$ for which $\Vert f(A^{(i)}) - f(N^{(i)}) \Vert_2^2$ is minimal (such a negative image is called a "hard negative image"). Since using the whole training data set to determine the hard positive and hard negative images would be computationally expensive and infeasible, the researchers experimented with several methods for selecting the triplets:

  • Generate triplets offline, computing the hard positives and hard negatives on a subset of the data.
  • Generate triplets online, selecting the hard positive/negative examples from within a mini-batch (a minimal sketch of this approach follows the list).
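
The online strategy is often called "batch hard" mining. The following NumPy sketch (with illustrative names, and assuming L2-normalized embeddings and integer identity labels) treats every embedding in the mini-batch as an anchor and pairs it with its farthest same-identity point and its closest different-identity point:

import numpy as np

def batch_hard_triplet_loss(embeddings, labels, alpha=0.2):
    # embeddings: (n, d) array of embeddings; labels: (n,) identity labels.
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)

    # Pairwise squared Euclidean distances within the mini-batch.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)                   # shape (n, n)

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self                          # same identity, different point
    neg_mask = ~same                                    # different identity

    # Hardest positive: maximum distance among same-identity points.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: minimum distance among different-identity points.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)

    return np.sum(np.maximum(hardest_pos - hardest_neg + alpha, 0.0))

An anchor with no positive (or no negative) in the batch picks up an infinite mask value and therefore contributes zero loss, so batches are normally sampled with several images per identity.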

Comparison and Extensions

In computer vision tasks such as re-identification, a prevailing belief was that triplet loss is inferior to surrogate losses (i.e., typical classification losses) followed by separate metric learning steps. However, work published in 2017 showed that, for models trained from scratch as well as for pretrained models, a variant of triplet loss performing end-to-end deep metric learning outperformed most other methods published up to that time.

Additionally, triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., a ladder) of distance inequalities. This leads to the Ladder Loss, which has been demonstrated to improve the performance of visual-semantic embedding in learning-to-rank tasks.
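
As an illustrative sketch of the idea (the notation here is simplified and does not reproduce the paper's exact formulation): suppose the candidates for a query $q$ are partitioned into groups $N_1, N_2, \ldots, N_L$ of strictly decreasing relevance. The ladder then requires each group to sit farther from $q$ than the previous one, separated by margins $\alpha_1, \alpha_2, \ldots$:

$\Vert f(q) - f(n_1) \Vert_2^2 + \alpha_1 < \Vert f(q) - f(n_2) \Vert_2^2, \qquad \Vert f(q) - f(n_2) \Vert_2^2 + \alpha_2 < \Vert f(q) - f(n_3) \Vert_2^2, \qquad \ldots$

for $n_\ell \in N_\ell$. The loss sums hinge terms over these inequalities, just as the triplet loss does for its single inequality.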

In natural language processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture.
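
A hedged sketch of what such fine-tuning can look like with the sentence-transformers library that accompanies SBERT is given below; the model name, example sentences, and hyperparameters are placeholders rather than the settings used in the paper.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base encoder; SBERT fine-tunes a pretrained BERT-style model.
model = SentenceTransformer("bert-base-uncased")

# Each training example supplies (anchor, positive, negative) sentences.
train_examples = [
    InputExample(texts=["anchor sentence", "a paraphrase of it", "an unrelated sentence"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# TripletLoss pulls the positive closer to the anchor than the negative, by a margin.
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)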

Other extensions involve specifying multiple negatives (multiple negatives ranking loss).
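
One common way to write such an objective (a sketch; exact formulations vary between implementations) scores the anchor $a$ against its positive $p$ and a set of $k$ negatives $n_1, \ldots, n_k$ and applies a softmax cross-entropy instead of a hinge, so that the positive must out-score every negative simultaneously:

$L = -\log \dfrac{\exp\big(s(a, p)\big)}{\exp\big(s(a, p)\big) + \sum_{j=1}^{k} \exp\big(s(a, n_j)\big)}$

where $s(\cdot, \cdot)$ is a similarity score between embeddings, such as the cosine similarity or the negative squared Euclidean distance. In practice the negatives are often taken to be the positives of the other examples in the same mini-batch.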

References

  1. Schroff, Florian; Kalenichenko, Dmitry; Philbin, James (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 815–823. arXiv:1503.03832.
  2. Hermans, Alexander; Beyer, Lucas; Leibe, Bastian (2017-03-22). "In Defense of the Triplet Loss for Person Re-Identification". arXiv:1703.07737.
  3. Zhou, Mo; Niu, Zhenxing; Wang, Le; Gao, Zhanning; Zhang, Qilin; Hua, Gang (2020-04-03). "Ladder Loss for Coherent Visual-Semantic Embedding" (PDF). Proceedings of the AAAI Conference on Artificial Intelligence. 34 (7): 13050–13057. doi:10.1609/aaai.v34i07.7006. ISSN 2374-3468. S2CID 208139521.
  4. Reimers, Nils; Gurevych, Iryna (2019-08-27). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084.