Data can be more or less complex. Imagine a data set with many features, say a hundred. How do you know which features matter the most, and how could you possibly project the data onto a 2D plane? One popular technique for answering this question is principal component analysis, or PCA for short. PCA transforms the data into a new attribute space where the features are uncorrelated and ranked by the amount of variance they explain.

Let me show you this by painting some data. Although this data is in 2D, we could identify the position of each point just by knowing its coordinate on a new, tilted axis. This would be our principal component. Its direction is defined by the vector PC1. If our data lie in a many-dimensional space, maybe only a couple of principal components are enough to explain it.

Let's see this in action. This time, I'll be using the 'wine' data set with 13 features. A 13-dimensional space is difficult to grasp, so we'll use PCA to transform the data into fewer dimensions. How do you know how many principal components to go with? A good choice is to select the first few principal components that explain, say, 80% of the variability. Orange shows the proportion of explained variance in a scree diagram. In my data set, five principal components already explain slightly more than 80% of the variability. I can check the transformed data set in the data table.

Now let's see what our transformed data looks like in a scatter plot. I will plot the data using just the first two components. The three different wines are really nicely separated. It turns out that the chemical compounds called flavanoids define the first component the most, followed by phenols.

Today we've learned how to transform our data into a set of linearly uncorrelated features with principal component analysis. Next time, we'll show you another way to rank features, with the Rank widget.
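The transcript walks through this workflow in Orange's graphical interface. As a rough sketch of the same analysis in code, here is how it might look with scikit-learn, which ships the same 13-feature wine data set; the library choice and variable names are assumptions, not what the video shows:

```python
# Sketch of the PCA workflow from the tutorial, using scikit-learn
# instead of the Orange GUI (an assumption for illustration).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# The wine data set: 13 chemical features, 3 wine cultivars.
X, y = load_wine(return_X_y=True)

# Standardize features so each contributes comparably to the variance.
X_std = StandardScaler().fit_transform(X)

# Fit PCA with all 13 components and look at the explained variance,
# i.e. the numbers behind the scree diagram.
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of leading components covering at least 80% of variance.
n_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(f"Components needed for 80% of variance: {n_80}")

# Project onto the first two principal components for a scatter plot.
X_2d = PCA(n_components=2).fit_transform(X_std)
print(X_2d.shape)  # one (PC1, PC2) pair per wine sample
```

Plotting `X_2d` colored by `y` reproduces the scatter plot from the video, with the three wine classes well separated along the first two components.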