SVM
SVM for Classification
Let’s say we have +ve and -ve classes what SVM tries to do is find a hyperplane such that it maximizes the margin between the +ve and -ve classes.
Geometric Intuition
- Convex hull of the +ve and -ve classes.
- Find the shortest line that connects the convex hulls of the +ve and -ve classes.
- The line that is perpendicular to the shortest line is the separating hyperplane that maximizes the margin between the +ve and -ve classes.
Formulation
We assume a separating hyperplane of the form \(w^Tx + b = 0\) where \(w\) is the normal vector to the hyperplane and \(b\) is the bias. We assume that the +ve class is on one side of the hyperplane at distance d and the -ve class is on the other side of the hyperplane at a distance c these are the margins and are parallel to the separating hyperplan. We can write the equation of the margins as follows:
\[w^Tx + b = d\] \[w^Tx + b = -c\]we can divide the first equation by d and the second equation by c as in doesn’t change the formulation and the d and c are absorved in the w and b. Now our goal is to maximize the margins which is the distance between the two paralle plane which is given by $\frac{d_1-d_2}{\sqrt{a^2+b^2}}$. Which would be the same as minimizing \(\frac{2}{\left \| w \right \|}\).
Hence the Objective function becomes:
\[w^*,b^* = \underset{w,b}{argmax}\left (\frac{2}{||w||}\right )\] \[s.t \:\:\: y_i(w^Tx_i + b) \geq 1 \:\:\: \forall i\]The important part is that the data points which have to be linearly separable for the above formulation to work. But in real world data is not linearly separable so we introduce the concept of slack variables. The slack variables are the distance of the data points from the margins. The objective function becomes:
\[w^*,b^* = \underset{w,b}{argmin}\left (\frac{||w||}{2} + C \frac{1}{n}\sum_{i}^{n}\varepsilon_i\right )\] \[s.t \:\:\: y_i(w^Tx_i + b) \geq 1 - \varepsilon_i \:\:\: , \varepsilon_i \geq 0 , \forall i \:\:\]If we look at the formulation we have added a penalty for error in the earlier formulation error was not possible because of the constraints of the formulation which is equivalent to C being infinity. Now we have a tradeoff between the margin and the error. If we increase C we are penalizing the error more and if we decrease C we are penalizing the margin more.
We will convert the above formulation into a Lagrangian formulation and solve it using the Lagrangian multipliers. The Lagrangian formulation is as follows:
\[L(w,b,\varepsilon,\alpha,\mu) = \frac{||w||^{2}}{2} + C \frac{1}{n}\sum_{i}^{n}\varepsilon_i - \sum_{i}^{n}\alpha_i(y_i(w^Tx_i + b) - 1 + \varepsilon_i) - \sum_{i}^{n}\mu_i\varepsilon_i\]We take the partial derivative of the Lagrangian formulation with respect to w,b and \(\varepsilon\) and equate them to 0. We get the following equations:
\[w = \sum_{i}^{n}\alpha_iy_ix_i\] \[0 = \sum_{i}^{n}\alpha_iy_i\] \[\alpha_i = C - \mu_i\]We substitute the above equations in the Lagrangian formulation and we get the following formulation:
\[L(w,b,\varepsilon,\alpha,\mu) = \sum_{i}^{n}\alpha_i - \frac{1}{2}\sum_{i}^{n}\sum_{j}^{n}\alpha_i\alpha_jy_iy_jx_i^Tx_j\] \[s.t \:\:\: \sum_{i}^{n}\alpha_iy_i = 0\] \[\alpha_i \geq 0\] \[\mu_i \geq 0\] \[\varepsilon_i \geq 0\] \[0 = \sum_{i}^{n}\alpha_iy_i\] \[0 = \sum_{i}^{n}\mu_i\varepsilon_i\]The \(L(w,b,\varepsilon,\alpha,\mu)\) looks the same as the formulation for the dual problem of the Hard SVM the only differece is the constrains. The constrains \(\alpha_i \geq 0\) , \(\mu_i \geq 0\) and \(\alpha_i = C - \mu_i\) we combine these constrains and we get \(0 \leq \alpha_i \leq C\).
Model Prediction
One’s we have identified the support vectors we can calculate the prediction on test as follows:
\[y = sign(\sum_{i}^{n}\alpha_iy_ix_i^Tx + b)\]As only the support vectors contribute to the prediction we only need to store the support vectors and the corresponding \(\alpha\) values. The \(\alpha\) values for non support vectors are 0.
The other important thing to note is that the dot product \(x_i^Tx_j\) can be replaced by a kernel function \(K(x_i,x_j)\).
Kernel Trick
The kernel trick is a technique in which we can replace the dot product \(x_i^Tx_j\) with a kernel function \(K(x_i,x_j)\) which is a function which takes two vectors as input and outputs a scalar. The kernel function should satisfy the following properties:
\[K(x_i,x_j) = K(x_j,x_i)\] \[K(x_i,x_j) \geq 0\] \[\sum_{i}^{n}\sum_{j}^{n}c_ic_jK(x_i,x_j) \geq 0\]i.e. Symmetric, Positive Semi-definite.
The kernel function can be used to calculate the similarity between two vectors in a higher dimensional space without actually calculating the dot product in the higher dimensional space.
Example:
Kernel Functions
- Linear Kernel: \(K(x_i,x_j) = x_i^Tx_j\)
- Polynomial Kernel: \(K(x_i,x_j) = (x_i^Tx_j + 1)^d\)
- Gaussian Kernel(rbf kernel): \(K(x_i,x_j) = exp(-\frac{ \left \| x_i - x_j \right \| ^2}{2\sigma^2})\)
Gaussian Kernel is the most commonly used kernel function. The hyperparameter \(\sigma\) is the bandwidth of the kernel function.This is seen as \(\left \| x_i - x_j \right \| ^2\) is the squared distance between the two vectors. The hyperparameter \(\sigma\) controls the kernel function. If \(\sigma\) is small then the kernel function is very sensitive to the distance between the two vectors (only the vectors which have a distance almost zero will be considered similar) and if \(\sigma\) is large then the kernel function is not very sensitive to the distance between the two vectors. Visit. RBF Kernel is equivalent to an infinite dimensional feature space because of the Taylor series expansion of the exponential function. Higher the value of \(\sigma\) gives a smoother decision boundary and lower the value of \(\sigma\) gives a more complex decision boundary. \(\sigma\) is inverse of \(\gamma\) DEMO
SVM for Regression
Pros
- SVM is a convex optimization problem hence we are guaranteed to find the global minima.
- SVM is a sparse model as only the support vectors contribute to the prediction.
- SVM can be non-linear using the kernel trick.
- SVM is effective in cases where the number of dimensions is greater than the number of samples.
Cons
- Scaling is important in SVM as it is sensitive to the scale of the features.
- SVM is computationally expensive and parallelization is not possible hence not suitable for large datasets. This is because of the quadratic nature of the algorithm distance between all the datapoints. Linear SVM is faster than non-linear SVM. As a rule of thumb I use: O(n_samples^2 * n_features) for RBF kernel using SMO solver and n_sample * n_features for linear SVMs
- SVM is sensitive to outliers.
- SVM is not probabilistic in nature. (We can use Platt scaling to get the probability estimates) also based on the distances from the hyperplane we can get the probability estimates using logistic regression.