Loss Functions
Before looking at the loss/objective functions we need to look at different problem setting because in ML setting loss function is a mathmatical representation of the problem we are trying to solve.
Problem Settings
- Classification: The classification can be of two types i.e multi-class or multi-label. In multiclass setting the general choice of loss function is cross entropy loss. In multi-label setting summation of binary cross entropy loss for each class.
- Regression: Generally mean squared error is used as loss function.
- Metric Learning: In metric learning we try to learn a distance function between two data points. The loss function is important loss functions are triplet loss , Pairwise Ranking Loss(Contrastive Loss) .
- Reinforcement Learning:
Loss Functions
-
Binary Cross Entropy Loss:
As the name suggest there are only two classes so we need the probability of only one class. It is given by:
\[L_{BCE} = -\sum_{i=1}^{n} y_i log(p_i) + (1-y_i) log(1-p_i)\]where $y_i$ is the ground truth probability and $p_i$ is the predicted probability for class $i$. This will be used in binary classification problems.
-
Cross Entropy Loss:
Logistic Loss and Multinomial Logistic Loss are other names for Cross-Entropy loss. The cross entropy loss can be used when the output of the model is a probability value between 0 and 1. It is given by:
\[L_{CE} = -\sum_{i=1}^{n} y_i log(p_i)\]Multi-label classification the loss function is the sum of the binary cross entropy loss for each class. It is given by:
\[L_{CE} = -\sum_{i=1}^{n} y_i log(p_i) + (1-y_i) log(1-p_i)\]Also weights can be introduced to the loss function deal with imbalanced dataset.
- Focal Loss:
-
Pairwise Ranking Loss(Contrastive Loss):
The objective is to learn representations with a small distance between them for positive pairs, and greater distance than some margin value m for negative pairs. Pairwise Ranking Loss forces representations to have 0 distance for positive pairs, and a distance greater than a margin for negative pairs. Being \(r_a\)(anchor), \(r_x\) can be (Positive) or (negative) sample and d a distance function, we can write:
\[L_{pairwise}(y,r_a,r_x) = y.d(r_a,r_x) + (1-y)*max(0,m-d(r_a,r_x))\]For positive pairs, only the first part of the loss will usined as y==1 and it will be 0 only when the net produces representations for both the two elements in the pair have no distance between them. For negative pairs i.e y==0 , the loss will be 0 when the distance between the representations of the two pair elements is greater than the margin m. The function of the margin is that, when the representations produced for a negative pair are distant enough, no efforts are wasted on enlarging that distance, so further training can focus on more difficult pairs. The most common distance function is the euclidean distance. The main difference with triple loss is that we use pairs instead of triplets.ref
-
Triplet Loss:
We use triplets of training data samples, instead of pairs. The triplets are formed by an anchor sample \(x_a\), a positive sample \(x_p\) and a negative sample \(x_n\). The objective is that the distance between the anchor sample and the negative sample representations d(ra,rn) is greater (and bigger than a margin m) than the distance between the anchor and positive representations d(ra,rp). With the same notation, we can write:
\[L(r_a,r_n,r_p) = max(0, m + d(r_a,r_p) - d(r_a,r_n))\]Triples can be for three types:
- Easy Triplets: \(d(r_a,r_n)>d(r_a,r_p)+m\)
- Hard Triplets: \(d(r_a,r_n)<d(r_a,r_p)\)
- Semi-Hard Triplets: \(d(r_a,r_p)<d(r_a,r_n)<d(r_a,r_p)+m\)
Strategy of triplet selection is very important because it will affect the model performace and convergence.