KNN Algorithm


Hi there, hope you're doing great ! ^^

Today we will look at the K-Nearest Neighbors algorithm which, even though pretty basic, remains one of the essential machine learning algorithms to know for data science, since it is easy and fast to apply and gives us a first quick analysis of our dataset. As always, the first part will be dedicated to the theory and the second to the application of this algorithm to a real-life problem, but enough talking, let's get down to it ^^


Theory

The K-Nearest Neighbors (KNN) algorithm is a non-parametric supervised learning technique in which we try to assign a data point to a given category with the help of a training set. More specifically, the KNN algorithm uses "feature similarity" to predict the values of new data points, meaning that a new data point is assigned a value based on how closely it matches the points in the training set. But let's do a quick example to explain a bit more thoroughly how the KNN algorithm works:

Let's suppose we have the height, weight and T-shirt size of some customers as follows and that we need to predict the T-shirt size of a new customer with a height of 161 and a weight of 61.

 

First we need to choose which distance function to use. Here we will use the most common one, which is the Euclidean distance. Keep in mind, however, that this is not a mandatory choice and that you could just as well work with another one, such as the Manhattan distance, which is also pretty popular for continuous variables.
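To make the two distance functions concrete, here is a minimal sketch in plain Python. The function names and the example customer (158, 58) are illustrative assumptions, not values taken from the table above.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

new_customer = (161, 61)      # height and weight of the new customer
example_customer = (158, 58)  # hypothetical customer, for illustration only

# Euclidean: sqrt(3**2 + 3**2) = sqrt(18), roughly 4.24
print(euclidean_distance(example_customer, new_customer))
# Manhattan: 3 + 3 = 6
print(manhattan_distance(example_customer, new_customer))
```

Both functions work for any number of features, as long as the two points have the same length.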

 

 

Alright, so now that we have our distance function, it's time to measure the distance (similarity) between our new sample and each observation of our training set in order to find the k closest customers to this new customer in terms of height and weight.

Note: here, for the sake of simplicity, we will use K = 5.

 

Calculation example:

The Euclidean distance between the first observation and the new observation is equal to:

 

Once we are done calculating the distances for our training set, we get the following results:

 

Finally, as you can see, four of the 5 closest data points to our new customer are blue, which corresponds to size M. By majority vote, we can conclude that the prediction for our new customer is a T-shirt size M.
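The whole procedure (compute the distances, keep the K closest points, take a majority vote) can be sketched in a few lines of plain Python. The customer records below are made up for illustration and are not the values from the table; the helper name `knn_predict` is also just an assumption for this sketch.

```python
import math
from collections import Counter

# Made-up (height, weight, T-shirt size) records, for illustration only.
training_data = [
    (158, 58, "M"),
    (160, 59, "M"),
    (160, 60, "M"),
    (163, 61, "M"),
    (163, 64, "L"),
    (166, 62, "L"),
    (168, 63, "L"),
    (170, 68, "L"),
]

def knn_predict(training_data, new_point, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # 1. Euclidean distance from the new point to every training observation.
    distances = [
        (math.dist((height, weight), new_point), size)
        for height, weight, size in training_data
    ]
    # 2. Sort by distance and keep the labels of the k closest observations.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [size for _, size in distances[:k]]
    # 3. The most frequent label among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

prediction = knn_predict(training_data, (161, 61), k=5)
print(prediction)  # M (four of the five nearest made-up neighbors are M)
```

Note that an odd K is a handy choice with two classes, since it avoids ties in the vote.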

Alright, so now that we have reviewed the theory behind the KNN algorithm and illustrated it with a quick example, let's implement it in a real-life application with the help of Python.

Application

The KNN algorithm can be used for both classification and regression. We will therefore first look at the classifier application before moving on to the regressor one.

Application 1 : Classification

First, as always, we import the needed packages:

 

Then we load the Iris dataset from the UCI machine learning repository and reformat the dataset a bit in order to have something clean to work with:

 

From there we create our X and Y variables and split our dataset into a training set and a test set:
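As a rough sketch of this step, assuming scikit-learn is installed: the snippet below uses the copy of Iris bundled with scikit-learn instead of the UCI download (a substitution made here so the example runs offline), and the 80/20 split and `random_state` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris copy stands in for the UCI file.
# X holds the 4 measurements, y holds the species encoded as 0, 1, 2.
X, y = load_iris(return_X_y=True)

# Keep 20% of the rows for testing; stratify so each species stays balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```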

 

 

However, given that KNN relies on distances, variables measured in different units or on different scales would not weigh equally in the result, so it is important here to standardize our variables before calculating the distance. To do so we usually use one of the three following methods:
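Two widely used rescaling formulas are z-score standardization and min-max scaling. Here is a small hand-written sketch of both, with made-up heights; the function names are assumptions chosen for this illustration.

```python
import statistics

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

heights = [158, 160, 163, 165, 170]  # made-up values, for illustration only

print(z_score(heights))  # values centered on 0 with unit standard deviation
print(min_max(heights))  # values between 0 and 1
```

Either way, the goal is the same: no single feature should dominate the distance just because its numbers happen to be bigger.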

 

 

Thankfully, in Python we already have a tool to apply standardization without having to reinvent the wheel:
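One such tool is scikit-learn's `StandardScaler`; whether it is the exact one used in the original code is an assumption. A minimal sketch with made-up height and weight values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (height, weight) rows, for illustration only.
X_train = np.array([[158.0, 58.0], [160.0, 59.0], [163.0, 61.0], [170.0, 68.0]])
X_test = np.array([[161.0, 61.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation on the training set only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then reuse those same statistics on the test set.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

Fitting the scaler on the training set only keeps information from the test set from leaking into the model.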

 

 

Alright, so now that our data is set up, let's create and fit our KNN model with K = 8:

 

From there, the only remaining thing to do is to use our classifier to make predictions on our test dataset and print out the results, as shown below:
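Putting the classification steps together, a possible end-to-end sketch looks like this. It assumes scikit-learn is installed and uses its bundled copy of Iris rather than the UCI download; the split ratio, `random_state` and the choice of metrics are illustrative assumptions, only K = 8 comes from the text.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris copy stands in for the UCI file.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize using statistics learned on the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and fit the KNN classifier with K = 8.
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)

# Predict the species of the test flowers and summarize the results.
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy:.2f}")
```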

 

 

And that's it for our classifier application with KNN.

Application 2 : Regression

First we load our packages:

 

 

Then we import our data and reformat it properly:

 

 

Once that's done, we create and fit our model on our training dataset:
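Here is a minimal sketch of the regressor case, assuming scikit-learn is installed. Since the dataset used in the original code is not shown here, a synthetic one generated with `make_regression` stands in for it; K = 5 and the split settings are arbitrary choices as well.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumption: a synthetic dataset replaces the article's own data.
X, y = make_regression(n_samples=300, n_features=2, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each prediction is the average target value of the 5 nearest neighbors.
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
print(y_pred[:5])
```

The only real difference with the classifier is the last step: instead of a majority vote, the neighbors' target values are averaged.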

 

 

Finally, we compute the MSE (mean squared error) in order to check the accuracy of our model:
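As a reminder of what this metric measures, the MSE is the average of the squared differences between the true values and the predictions. A hand-written sketch with made-up numbers (the helper name `mse_by_hand` is just for this illustration; scikit-learn's `mean_squared_error` computes the same quantity):

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mse_by_hand(y_true, y_pred)
print(mse)  # (0.25 + 0 + 4 + 1) / 4 = 1.3125
```

The lower the MSE, the closer the predictions are to the true values. Keep in mind that it is expressed in squared units of the target.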

 

 

And that's it for the regressor case ^^ So don't hesitate to fine-tune these examples and apply them to different datasets to get a good grasp of this algorithm and feel comfortable using it regularly.

As usual full code can be found here.

Take care ✌️