KNN
We select k nearest neighbors and assign the class to the new data point based on the majority of the class of the k nearest neighbors. The distance between the data points is calculated using the Euclidean distance. The problem with this approach is that even when the data points are far away they are considered neighbors. To address this we have weighted k nearest neighbors in which we assign weights to the neighbors based on the distance. The other problem is that for every data point we have calculated the distance to all the other data points. To address this we have k-d trees which are binary trees that partition the data points into regions.
Interview Questions
Q: Which algorithm can be used for value imputation in both categorical and continuous categories of data?
KNN can be used for value imputation in both categorical and continuous categories of data.
Q: Bias and Variance in KNN?
A high bias means we have a very simple model which does not fit the data well and leads to underfitting. Variance is the variability of the model prediction for a given data point. A high variance means that our model changes a lot based on the training data and leads to overfitting. KNN has high variance and low bias.
Q: How to select the value of k in KNN?
We can use cross-validation to select the value of k in KNN. We can also use the elbow method with respect to the loss to select the value of k.
Q: How to handle categorical features in KNN?
The best way to deal with categorical features is to convert them into one-hot encoded features. Converting them to numerical values might not be a good choice as in categorical features their is no sence of order.
Q: Impact of feature scaling on KNN?
The features with large spread/variation and large values will dominate the distance calculation. So, it is important to scale the features before applying KNN.
Q: Imapact of high dimensionality on KNN?
The curse of dimensionality the distance functions like Euclidean distance become less meaningful in higher dimensions. There is little difference in the distances between different pairs of points. ref
Q: Impact of outliers on KNN?
KNN is sensitive to outliers as the distance calculation is affected by the outliers. So, it is important to remove the outliers before applying KNN.