Metrics

Classification Metrics

  1. Accuracy:

    This is the simplest measure, defined as the number of correct predictions divided by the total number of predictions. It is not a reliable metric when the class labels are imbalanced, because it is biased towards the majority class: a model that always predicts the majority class can still score well. In multi-class problems it is also uninformative on its own, since it does not reveal which classes are being predicted well and which are not.

  2. Confusion Matrix:

    This is a table that shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the data. The matrix is NxN, where N is the number of target values (classes). Performance of such models is commonly evaluated using the data in the matrix. This gives us a complete picture of how well our model works with respect to all the classes.

    2x2 Confusion Matrix

    For a binary classifier the ground truth takes one of two values, True (positive) or False (negative), and the prediction is likewise either positive or negative. The confusion matrix therefore has 4 values: TP, TN, FP, FN.

    TP: True Positive, the ground truth is positive and the prediction is positive.

    TN: True Negative, the ground truth is negative and the prediction is negative.

    FP: False Positive, the ground truth is negative but the prediction is positive.

    FN: False Negative, the ground truth is positive but the prediction is negative.

                  Prediction 1    Prediction 0    Ground truth total
    Label 1       TP              FN              TP + FN
    Label 0       FP              TN              FP + TN
    Predicted     TP + FP         FN + TN
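
    As a concrete illustration, here is a minimal, dependency-free Python sketch (the label encoding 1 = positive, 0 = negative and the toy labels are assumptions for illustration) that tallies the four cells from a list of ground-truth labels and predictions:

    ```python
    # Minimal sketch: tally TP, FP, FN, TN for a binary problem.
    # Assumes labels are encoded as 1 (positive) and 0 (negative).
    def confusion_counts(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        return tp, fp, fn, tn

    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
    ```
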
  3. Precision:

    Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It answers the question: of all the positive predictions, how many are actually positive? This metric is useful when the cost of a False Positive is high. For example, in the case of the death penalty, we would want to be very sure that the person is guilty before sentencing, so we would want high precision. (A code sketch after the F-score discussion below computes precision together with the other metrics.)

    \[Precision = \frac{TP}{(TP+FP)}\]
  4. Recall:

    Recall is the ratio of correctly predicted positive observations to all observations that actually belong to the positive class. It answers the question: of all the actual positives, how many did we predict correctly? It is also called Sensitivity. Recall is a good measure to use when the cost of a False Negative is high. For example, in the case of a natural calamity we want to alert as many affected people as possible: a false alarm is not very harmful, but a missed alert can be.

    \[Recall = \frac{TP}{(TP+FN)}\]
  5. \(F_\beta\) Score

    The \(F_\beta\) score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The parameter \(\beta\) sets the relative importance of recall versus precision: recall is weighted \(\beta\) times as much as precision. \(\beta = 1.0\) means recall and precision are equally important; for \(\beta < 1.0\) precision counts for more than recall, and for \(\beta > 1.0\) recall counts for more.

    \[F_{\beta} = \frac{1}{ \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{Recall} + \frac{1}{1 + \beta^2} \cdot \frac{1}{Precision} }\] \[F_{\beta} = (1+\beta^2) \frac{Precision*Recall}{(\beta^2*Precision)+Recall}\]

    When \(\beta\) = 1.0, it is called F1 score, which is the harmonic mean of precision and recall.

    \[F1 = 2 \frac{Precision*Recall}{Precision+Recall}\]

    When \(\beta\) = 0.5, it is called F0.5 score, which is the weighted harmonic mean of precision and recall.

    \[F0.5 = \frac{1.25*Precision*Recall}{(0.25*Precision)+Recall}\]

    When \(\beta\) = 2.0, it is called F2 score, which is the weighted harmonic mean of precision and recall.

    \[F2 = \frac{5*Precision*Recall}{(4*Precision)+Recall}\]

    F1 Score for Multiclass Classification

    • F1 Macro

      We calculate the F1 score for each class and then take the unweighted average of the per-class F1 scores. This is called F1 Macro.

    • F1 Micro

      We pool the TP, FP and FN counts over all classes and compute a single F1 from these global counts. This is called F1 Micro.

    • F1 Weighted

      We have to calculate the F1 score for each class and then take the weighted average of all the F1 scores. Weights are the number of samples in each class or can be user defined.
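
    To make these averages concrete, the following sketch (pure Python; the three-class TP/FP/FN counts are made up for illustration) computes per-class precision, recall and F1 and then the macro, micro and weighted averages described above:

    ```python
    # Sketch: per-class precision/recall/F1 and macro/micro/weighted averages.
    # The per-class TP/FP/FN counts below are illustrative, not from a real model.
    counts = {            # class -> (TP, FP, FN)
        "A": (40, 10, 5),
        "B": (30, 5, 20),
        "C": (10, 15, 10),
    }

    def f_beta(p, r, beta=1.0):
        if p == 0 and r == 0:
            return 0.0
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    per_class = {}
    for cls, (tp, fp, fn) in counts.items():
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        per_class[cls] = (p, r, f_beta(p, r))

    # Macro: unweighted mean of the per-class F1 scores.
    f1_macro = sum(f for _, _, f in per_class.values()) / len(per_class)

    # Micro: pool TP/FP/FN over all classes, then compute a single F1.
    TP = sum(tp for tp, _, _ in counts.values())
    FP = sum(fp for _, fp, _ in counts.values())
    FN = sum(fn for _, _, fn in counts.values())
    f1_micro = f_beta(TP / (TP + FP), TP / (TP + FN))

    # Weighted: per-class F1 weighted by class support (TP + FN).
    support = {cls: tp + fn for cls, (tp, _, fn) in counts.items()}
    f1_weighted = sum(per_class[c][2] * support[c] for c in counts) / sum(support.values())

    print(f1_macro, f1_micro, f1_weighted)
    ```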

  6. AUC-ROC Curve:

    The AUC-ROC is a performance measurement for binary classifiers that output a probability or score. The ROC (receiver operating characteristic) curve plots the true positive rate (TPR) against the false positive rate (FPR): applying a threshold to the scores yields predicted labels and hence one (FPR, TPR) point, and sweeping the threshold traces out the curve. It shows the trade-off between sensitivity and specificity. The closer the curve follows the upper left-hand border of the ROC space, the better the model, because points near that corner correspond to a high TPR at a low FPR. The closer the curve comes to the 45-degree diagonal of the ROC space, the closer the model is to random guessing. The area under the ROC curve (AUC) is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).
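
    Since AUC equals that ranking probability, it can be estimated directly from scores without tracing the curve. A minimal sketch under that interpretation (pure Python; the scores and labels are toy values, and ties are counted as half, which is the usual convention):

    ```python
    # Sketch: AUC as the probability that a random positive outranks a random negative.
    def auc_from_scores(y_true, scores):
        pos = [s for y, s in zip(y_true, scores) if y == 1]
        neg = [s for y, s in zip(y_true, scores) if y == 0]
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    y_true = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
    print(auc_from_scores(y_true, scores))  # 8/9 ≈ 0.889
    ```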

Questions

  1. What is the difference between accuracy and F1 score when the classes are balanced?

    In single-label multi-class classification, micro-averaged F1 is always equal to accuracy, because the pooled TP count is just the number of correct predictions. Macro F1, on the other hand, gives every class equal weight, so it exposes classes that are predicted poorly even when the overall accuracy looks good. When the classes are balanced and the model performs similarly on all of them, accuracy and macro F1 will be close; when per-class performance differs, macro F1 drops while accuracy (and micro F1) can still look good.

Regression Metrics

  1. Mean Absolute/Squared Error and Root Mean Squared Error

    The mean absolute error (MAE) is the average absolute difference between the actual and predicted values. The mean squared error (MSE) is very similar, but it squares each difference before averaging instead of taking the absolute value, which means that large errors are penalized more than with the MAE. The root mean squared error (RMSE) is the square root of the MSE. It is more popular than the MSE because the RMSE is interpretable in the units of “y”. (A small sketch at the end of this section computes these metrics, along with \(R^{2}\), on toy data.)

  2. R Squared

    A dataset has n values marked \(y_1,\dots,y_n\) (collectively known as \(y_i\) or as a vector \(y = [y_1,\dots,y_n]^T\)), each associated with a fitted (or modeled, or predicted) value \(f_1,\dots,f_n\) (known as \(f_i\), or sometimes \(\hat{y_i}\), as a vector \(f\)).

    Define the residuals as \(e_i = y_i − f_i\) (forming a vector e).

    If \(\bar {y}\) is the mean of the observed data then the variability of the data set can be measured with two sums of squares formulas:

    • The sum of squares of residuals also called the residual sum of squares:

      \[SS_{\text{res}}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^{2}\]
    • The total sum of squares (proportional to the variance of the data):

      \[SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2}\]

    The most general definition of the coefficient of determination is \(R^{2}=1-{SS_{\rm {res}} \over SS_{\rm {tot}}}\)

    In the best case, the modeled values exactly match the observed values, which results in \(SS_{\text{res}}=0\) and \(R^{2}=1\). A baseline model, which always predicts \(\bar {y}\), will have \(R^{2}=0\). Models that have worse predictions than this baseline will have a negative \(R^{2}\).

    Fraction of variance unexplained

    In a general form, \(R^{2}\) can be seen to be related to the fraction of variance unexplained (FVU): in the definition above, \(SS_{\text{res}}\) represents the unexplained variance and \(SS_{\text{tot}}\) the total variance, and the FVU is their ratio, \(SS_{\text{res}}/SS_{\text{tot}}\). Hence \(R^{2} = 1 - \text{FVU}\), which can equivalently be written as:

    \[R^{2}= \frac{ SS_{\rm {tot}} - SS_{\rm {res} }}{ SS_{\rm {tot}}}\]

    Inflated \(R^{2}\)

    When a new explanatory variable is added to the regression model, \(R^{2}\) goes up or at least stays the same, even if the variable adds no real explanatory power. This makes plain \(R^{2}\) unsuitable for comparing models with different numbers of features.

  3. Adjusted R Squared:

    The use of an adjusted \(R^{2}\) (one common notation is \({\bar R}^{2}\), pronounced “R bar squared”; another is \(R_{\text{a}}^{2}\) or \(R_{\text{adj}}^{2}\)) is an attempt to account for the phenomenon of \(R^{2}\) automatically increasing when extra explanatory variables are added to the model. There are many different ways of adjusting ([14]). By far the most used one, to the point that it is typically just referred to as the adjusted \(R^{2}\), is the correction proposed by Mordecai Ezekiel. The adjusted \(R^{2}\) is defined as

    \[{\bar {R}}^{2}={1-{SS_{\text{res}}/{\text{df}}_{\text{res}} \over SS_{\text{tot}}/{\text{df}}_{\text{tot}}}}\]

    where \(\text{df}_{\text{res}}\) is the degrees of freedom of the estimate of the population variance around the model, and \(\text{df}_{\text{tot}}\) is the degrees of freedom of the estimate of the population variance around the mean. \(\text{df}_{\text{res}}\) is given in terms of the sample size n and the number of variables p in the model: \(\text{df}_{\text{res}} = n - p\). \(\text{df}_{\text{tot}}\) is given in the same way, but with p being unity for the mean, i.e. \(\text{df}_{\text{tot}} = n - 1\).

    The adjusted \(R^{2}\) can be negative, and its value will always be less than or equal to that of \(R^{2}\).
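
    A minimal sketch of the regression metrics above (pure Python; the observed values, fitted values and the parameter count p are illustrative assumptions):

    ```python
    import math

    # Sketch: MAE, MSE, RMSE, R^2 and adjusted R^2 on toy data.
    y     = [3.0, 5.0, 7.0, 9.0]    # observed values (toy)
    y_hat = [2.5, 5.5, 7.5, 8.0]    # fitted values (toy)
    n = len(y)
    p = 2                           # number of model parameters incl. the intercept (assumed)

    mae  = sum(abs(a - b) for a, b in zip(y, y_hat)) / n
    mse  = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    rmse = math.sqrt(mse)

    y_bar  = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    r2     = 1 - ss_res / ss_tot

    # Adjusted R^2 with df_res = n - p and df_tot = n - 1, as defined above.
    r2_adj = 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

    print(mae, rmse, r2, r2_adj)
    ```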

Retrieval Metrics

  1. Precision@K (P@K) is the number of relevant documents among the first K retrieved documents, divided by K.

  2. Recall@K (R@K) is the ratio of the number of relevant documents among the first K retrieved documents to the total number of relevant documents.

  3. MRR (Mean Reciprocal Rank)

    It is the mean, over all queries, of the inverse rank of the first relevant document (RD) returned for each query, where n is the number of queries: \(MRR = \frac{1}{n} \sum_{i=1}^{n}\frac{1}{\text{rank of first RD for query } i}\)

  4. MAP (Mean Average Precision)

    First, we calculate the average precision (AP) for each query and then compute the mean of the AP values over all queries. To calculate AP for a single query, we average the precision evaluated at the rank of each relevant document (RD). For example, suppose a query has \(RD_1\) ranked at 3, \(RD_2\) ranked at 7 and \(RD_3\) ranked at 10: its AP is the average of the precision values at ranks 3, 7 and 10. We compute AP for each of the \(n\) queries and take their mean to get MAP (see the sketch below).

    \[MAP = \frac{1}{n} \sum_{i=1}^{n}AP_i\]
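
    A minimal sketch of these ranking metrics, assuming binary relevance, a ranked list of document ids per query, and made-up data:

    ```python
    # Sketch: Precision@K, Recall@K, MRR and MAP with binary relevance (toy data).
    queries = {
        "q1": {"ranked": ["d3", "d1", "d7", "d2", "d9"], "relevant": {"d1", "d2", "d8"}},
        "q2": {"ranked": ["d4", "d5", "d6", "d1", "d2"], "relevant": {"d5"}},
    }

    def precision_at_k(ranked, relevant, k):
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def recall_at_k(ranked, relevant, k):
        return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

    def reciprocal_rank(ranked, relevant):
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                return 1.0 / i
        return 0.0

    def average_precision(ranked, relevant):
        hits, total = 0, 0.0
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                hits += 1
                total += hits / i          # precision at the rank of this relevant doc
        # Average over the relevant docs that were retrieved, as in the example above
        # (some definitions divide by the total number of relevant docs instead).
        return total / hits if hits else 0.0

    mrr = sum(reciprocal_rank(q["ranked"], q["relevant"]) for q in queries.values()) / len(queries)
    map_ = sum(average_precision(q["ranked"], q["relevant"]) for q in queries.values()) / len(queries)
    print(precision_at_k(queries["q1"]["ranked"], queries["q1"]["relevant"], 3), mrr, map_)
    ```
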
  5. NDCG (Normalized Discounted Cumulative Gain) We start with the definition of cumulative gain (CG): it is the sum of the gains of the items returned for a search query. With binary relevance the gain \(rel_i\) is zero or one, so CG simply counts the number of relevant items in the top k results.

    \[CG = \sum_{i=1}^{k}rel_i\]

    The problem with CG is that it does not take the position of a relevant item into account. DCG therefore sums the gains of the items in the top k results but discounts the gain at lower positions; the discount factor is the logarithm of the item's position.

    \[DCG = \sum_{i=1}^{k}\frac{rel_i}{\log_2(i+1)}\]

    The problem with DCG is that it is not normalized: different queries can have different numbers of relevant items and different relevance grades, so their DCG values are not directly comparable. NDCG therefore divides the DCG by the ideal DCG (IDCG), which is the DCG obtained when the items are sorted in decreasing order of relevance (see the sketch below).

    \[NDCG = \frac{DCG}{IDCG}\]

    Wikipedia
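
    A minimal sketch of DCG and NDCG with graded relevance (the relevance grades are toy values; the gains are used directly, although an exponential gain \(2^{rel}-1\) is also common, and the ideal DCG here is computed from the returned items only):

    ```python
    import math

    # Sketch: DCG, ideal DCG and NDCG@k for one ranked list of graded relevances.
    def dcg(rels, k):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

    def ndcg(rels, k):
        ideal = dcg(sorted(rels, reverse=True), k)   # DCG of the best possible ordering
        return dcg(rels, k) / ideal if ideal > 0 else 0.0

    rels = [3, 2, 0, 1, 2]   # relevance grades of the returned items, in ranked order (toy)
    print(ndcg(rels, k=5))
    ```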

Summarization Metrics

  1. BLEU

    BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations. Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.

    This score is calculated by comparing the n-grams of the predicted text with the n-grams of the reference text. It is a combination of n-gram precision and a brevity penalty.

    Precision in text:

    Ref Sentence 1: This is a small house.

    Predicted Sentence 2: The house is small.

    The unigram precision of Sentence 2 is 3/4, since three of its four words appear in the reference. But a degenerate prediction like “house house house” would score 3/3 = 1 by the same count, even though it is useless. To resolve this issue we use clipping: the count of each predicted n-gram is capped at the maximum number of times it appears in any single reference. For example:

    Ref Sentence 1: This is a small house.

    Ref Sentence 2: This house is small.

    Predicted Sentence 3: A house house house is very small.

    Now “house” matches both S1 and S2 and appears 3 times in the prediction, so its raw count is 3; “is” has count 1, and similarly “small”; “a” matches S1 once, so its count is 1; “very” has count 0. After clipping, every count above the maximum reference count (1 for each of these words) becomes 1, so the total clipped count is 4. The precision is therefore 4/7, with 7 being the length of the output.

    This can be extended to n-grams. The BLEU score is the geometric mean of the clipped precisions over all the n-gram orders, multiplied by the brevity penalty.

    \[\text{Geometric Mean} = \prod_{n=1}^{N} (precision_{n})^{w_n}\]

    Generally \(w_n = 1/N\), where \(N\) is the number of n-gram orders used (i.e. uniform weights), but different weights can be chosen to give more importance to particular n-gram orders.

    Brevity Penalty

    \[\text{Brevity Penalty} = \begin{cases} 1 & \text{if } c \geq r \\ e^{1 - \frac{r}{c}} & \text{if } c < r \end{cases}\]

    where c is the length of the candidate (predicted) text and r is the reference length. The brevity penalty penalizes candidates that are shorter than the reference: because BLEU is precision-based and uses clipped counts, a very short output containing only a few safe n-grams can achieve high precision while leaving most of the reference uncovered. Since BLEU has no recall term, the brevity penalty is used to compensate for this.

    So the formula for BLEU score is

    \[BLEU = \text{Brevity Penalty} \times \text{Geometric Mean}\]

    There are different variants of the BLEU score, such as BLEU-1, BLEU-2, BLEU-3 and BLEU-4: BLEU-1 uses only unigrams, BLEU-2 uses unigrams and bigrams, and so on. BLEU-4 is the most commonly used variant; it combines uni-, bi-, tri- and 4-grams. (A toy implementation appears after the list of drawbacks below.)

    Drawbacks of BLEU

    1. The BLEU score is based on precision only, so it ignores recall: it does not check whether all the words (n-grams) of the reference are covered by the predicted text.

    2. It fails to capture the meaning and order of words.
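
    The following sketch puts the pieces above together for one candidate against multiple references: clipped n-gram precision, the geometric mean with uniform weights and the brevity penalty. It is a toy re-implementation for illustration, not the exact BLEU of any particular library, and it applies no smoothing when a precision is zero.

    ```python
    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def clipped_precision(candidate, references, n):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        return clipped / total if total else 0.0

    def bleu(candidate, references, max_n=4):
        precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
        if min(precisions) == 0:
            return 0.0                      # no smoothing in this sketch
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
        c = len(candidate)
        r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]  # closest ref length
        bp = 1.0 if c >= r else math.exp(1 - r / c)
        return bp * geo_mean

    cand = "a house house house is very small".split()
    refs = ["this is a small house".split(), "this house is small".split()]
    print(clipped_precision(cand, refs, 1))   # 4/7, as worked out above
    print(bleu(cand, refs))                   # 0.0 here: no matching trigrams and no smoothing
    ```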

  2. ROUGE

    ROUGE-N: Overlap of n-grams[2] between the system and reference summaries.

    • ROUGE-1 refers to the overlap of unigrams (each word) between the system and reference summaries.

    • ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.

    ROUGE-L: statistics based on the Longest Common Subsequence (LCS)[3]. The longest common subsequence naturally takes sentence-level structure similarity into account and automatically identifies the longest co-occurring in-sequence n-grams.
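
    A minimal sketch of ROUGE-1-style unigram overlap (toy sentences; this is not the official ROUGE script):

    ```python
    from collections import Counter

    # Sketch: ROUGE-1 (unigram overlap) precision, recall and F1 for one summary pair.
    def rouge_1(candidate, reference):
        cand, ref = Counter(candidate.split()), Counter(reference.split())
        overlap = sum(min(c, ref[w]) for w, c in cand.items())
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return precision, recall, f1

    print(rouge_1("the cat sat on the mat", "the cat is on the mat"))
    ```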

  3. METEOR

  4. BERTScore

  5. Perplexity This is a measure that is generally used in language modeling. It can be thought of as a measure of uncertainty or “surprise” related to the outcomes: a low perplexity indicates that our model captures the underlying distribution of the data well. The formula is

    \[Perplexity = 2^{-\sum_{x} p(x)\log_{2}p(x)}\]

    where x is the word and p(x) is the probability of the word.
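
    A small sketch of the formula above, i.e. perplexity as 2 raised to the entropy of a word distribution (the distribution itself is made up):

    ```python
    import math

    # Sketch: perplexity of a toy word distribution, 2 ** entropy, as in the formula above.
    p = {"the": 0.4, "cat": 0.3, "sat": 0.2, "mat": 0.1}
    entropy = -sum(prob * math.log2(prob) for prob in p.values())
    perplexity = 2 ** entropy
    print(perplexity)   # a uniform distribution over 4 words would give exactly 4
    ```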

Clustering Metrics

  1. Silhouette Score The Silhouette Coefficient is defined for each sample and is composed of two scores:

    a: The mean distance between a sample i and all other points in the same cluster.

    b: The mean distance between a sample i and all points in the nearest cluster that i is not a part of.

    The Silhouette Coefficient s for a single sample is then given as:

    \[s = \frac{b - a}{max(a, b)}\]

    The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
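
    A minimal sketch computing the silhouette coefficient for one sample from the definitions above (toy 1-D points and cluster assignments; b is taken with respect to the nearest other cluster):

    ```python
    # Sketch: silhouette coefficient of one sample, using distances on toy 1-D points.
    clusters = {
        "c1": [1.0, 1.2, 0.8],
        "c2": [5.0, 5.5, 6.0],
    }

    def mean_dist(x, points):
        return sum(abs(x - q) for q in points) / len(points)

    def silhouette(x, own, clusters):
        a = mean_dist(x, [q for q in clusters[own] if q != x])                        # same cluster
        b = min(mean_dist(x, pts) for name, pts in clusters.items() if name != own)   # nearest other
        return (b - a) / max(a, b)

    print(silhouette(1.0, "c1", clusters))   # close to 1 -> well matched to its own cluster
    ```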

  2. Inter and Intra Cluster Distance ratio

    Inter Cluster Distance: The distance between the centroids of two clusters.

    Intra Cluster Distance: The average distance between all the data points within a cluster.

    The ratio of Inter Cluster Distance to Intra Cluster Distance is called Inter/Intra Cluster Distance Ratio. The higher the ratio, the better the clustering.
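
    A minimal sketch of the ratio for two toy 1-D clusters (centroids taken as means; the intra-cluster distance is the average pairwise distance within a cluster, averaged over clusters):

    ```python
    from itertools import combinations

    # Sketch: inter/intra cluster distance ratio for two toy 1-D clusters.
    clusters = [[1.0, 1.2, 0.8], [5.0, 5.5, 6.0]]
    centroids = [sum(c) / len(c) for c in clusters]

    # Inter-cluster distance: distance between the two cluster centroids.
    inter = abs(centroids[0] - centroids[1])

    # Intra-cluster distance: average pairwise distance within a cluster, averaged over clusters.
    def avg_pairwise(points):
        pairs = list(combinations(points, 2))
        return sum(abs(a - b) for a, b in pairs) / len(pairs)

    intra = sum(avg_pairwise(c) for c in clusters) / len(clusters)
    print(inter / intra)   # a larger ratio means better separated, tighter clusters
    ```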

Segmentation Metrics

  1. IoU

  2. Precision

  3. Recall

  4. F1 Score

  5. MAP
