PCA method
Hi there, hope everyone is doing fine ^^
As we saw in the previous series of tutorials on the 7 basic algorithms used in data science, machine learning generally works wonders for building good predictive models on large but concise datasets. However, in most cases reality is not so simple: we can literally end up with dozens of variables riddled with inconsistencies and redundant features, which eventually becomes quite problematic since it dramatically increases the overall computation time, sometimes to the point where we cannot even fit the model anymore ^^'
Thankfully, different methods exist to decrease the number of features of a given dataset while only retaining the most important ones, and that will be the focus of today's tutorial as we look at the most commonly used of them: Principal Component Analysis (PCA).
So first, what is PCA?
PCA is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without losing any important information. It follows these 4 steps:
Step 1: Standardization of the data
Step 2: Computing the covariance matrix
Step 3: Calculating the eigenvectors and eigenvalues
Step 4: Reducing the dimensions of the data set
So let's take a closer look at each of those steps:
Step 1: Standardization of the data
Here the idea is to scale our data in such a manner that all the variables and their values lie within a similar range.
Example:
Let's consider 2 variables in a data set, one with values ranging from 0 to 100 and the second with values between 1000 and 15000. In such a scenario, it seems obvious that the output calculated using these predictor variables will be biased, since the variable with the larger range will have far more impact on the outcome. That's why it is important to standardize our data set first, in order to avoid this biased outcome.
To standardize, we simply subtract the mean from each value and then divide the result by the standard deviation of the dataset, as shown below:
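The illustration from the original post isn't reproduced here; the operation is the classic z-score, z = (x − mean) / standard deviation. A minimal NumPy sketch (the values are made up purely for illustration):

```python
import numpy as np

# two features on very different scales, as in the example above
X = np.array([[10.0, 1200.0],
              [50.0, 8000.0],
              [90.0, 15000.0]])

# z-score standardization: per column, subtract the mean
# and divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for every column
print(X_std.std(axis=0))   # 1 for every column
```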
Step 2: Computing the covariance matrix
In this second step we simply compute the covariance matrix of our data set in order to identify the correlations between our different features.
Quick recap:
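The recap image from the original post isn't included here; in short, the covariance of two features is positive when they tend to increase together, negative when one increases as the other decreases, and close to zero when they are linearly unrelated. A small sketch of the computation, where the random matrix is just a stand-in for an already standardized data set:

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 3))  # stand-in for standardized data

# rows are observations and columns are features, hence rowvar=False
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix.shape)  # (3, 3): one covariance per pair of features
```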
Step 3: Calculating the eigenvectors and eigenvalues
We calculate the eigenvectors and eigenvalues of this matrix in order to spot the directions in which our data has the most variance, since more variance denotes more information about the data. Then, once we have computed all our eigenvectors and eigenvalues, all we have to do is sort them in descending order of eigenvalue to get our principal components by order of significance.
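A sketch of this step, reusing the stand-in data from the previous snippet:

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 3))  # stand-in for standardized data
cov_matrix = np.cov(X_std, rowvar=False)

# eigh is the right routine here because a covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# sort in descending order of eigenvalue: the first eigenvector is then
# the direction of greatest variance, i.e. the first principal component
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
```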
Step 4: Reducing the dimensions of the data set
Once we have all our principal components, the last thing we have to do is choose how many of them we want to keep in our case, and re-arrange our original data along these final principal components, which carry the most significant information of the data set. To do so, we multiply the transpose of the obtained feature vector by the transpose of the standardized data set, as sketched below.
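Continuing the same sketch, keeping the top 2 components:

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 3))  # stand-in for standardized data
cov_matrix = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]

# feature vector: the top k eigenvectors, one column per component
k = 2
W = eigenvectors[:, order][:, :k]

# (W.T @ X_std.T).T is exactly the same as X_std @ W
X_reduced = X_std @ W
print(X_reduced.shape)  # (100, 2): down from 3 dimensions to 2
```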
Alright, now that we have a better understanding of how the PCA process works, let's try it for ourselves by implementing it on a real data set with Python.
Application
First things first, let's import the needed libraries:
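The original import cell isn't shown, so here is a plausible reconstruction based on what the rest of the tutorial uses (the breast cancer dataset is an assumption, explained in the next step):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
```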
From there, we also import our data and reshape it in order to have a clean dataframe to work with:
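The loading cell isn't shown either; since the numbers quoted further down (thirty features, 44.2% and 19% of explained variance) match sklearn's built-in breast cancer dataset, that's what this sketch assumes:

```python
# load the data and wrap it in a dataframe with named columns
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.shape)  # (569, 30): 569 samples, 30 features
```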
Now, as we discussed before in our presentation of the PCA process, before doing anything else we standardize our dataset:
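A sketch of that step, continuing from the cells above (variable names are mine, not necessarily the original's):

```python
# standardize: every feature ends up with mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
```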
Once that's done, we can easily compute our principal components with the help of the sklearn.decomposition module:
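Something along these lines:

```python
# fit a PCA keeping 2 principal components and project the data onto them
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
```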
N.B. The choice of the number of principal components here is totally arbitrary, so feel free to experiment for yourself!
By checking the results here:
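One way to check them, assuming the setup above:

```python
# fraction of the total variance carried by each principal component
print(pca.explained_variance_ratio_)
# with the breast cancer data this prints roughly [0.443, 0.190]
```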
we can see that principal component 1 holds 44.2% of the information while principal component 2 holds only 19% of it, meaning that by projecting a thirty-dimensional data set onto a two-dimensional one we lose approximately 36% of the initial information.
Bonus: here is the code to plot your results in Colab:
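The original Colab snippet isn't reproduced here, so this is a minimal stand-in built on the cells above:

```python
# scatter plot of the two principal components, colored by diagnosis
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target,
            cmap='coolwarm', alpha=0.7)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(label='target (0 = malignant, 1 = benign)')
plt.show()
```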
And that's it for me on the subject of the PCA process ^^. I hope that you enjoyed this tutorial introducing PCA, and as always don't hesitate to go further in order to get a deeper understanding of the notions we just introduced.
As always, the full code is available here on the website's GitHub.
I'll catch you on the next one, in the meantime take care and happy coding ✌️