13 Aug

# Getting Started With Orange 09: Principal Component Analysis

Data can be more or less complex. Imagine a data set with many features, say a hundred. How do you know which features matter the most, and how could you possibly project the data onto a 2D plane? One popular technique for answering these questions is principal component analysis, or PCA for short. PCA transforms the data into a new attribute space where the features are uncorrelated and ranked by the proportion of variance they explain.

Let me show you this by painting some data. Although this data is in 2D, we could approximately identify the position of each point just by knowing its coordinate on a new, tilted axis. This axis would be our first principal component. Its direction is defined by the vector PC1. If our data lie in a many-dimensional space, maybe only a couple of principal components are enough to explain it.

Let’s see this in action. This time, I’ll be using the ‘wine’ data set with 13 features. A 13-dimensional space is difficult to grasp, so we’ll be using PCA to transform the data into fewer dimensions. How do you know how many principal components to go with? A common choice is to select the first few principal components that together explain, say, 80% of the variability. Orange shows the proportion of explained variance in a scree diagram. Five principal components in my data set already explain slightly more than 80% of the variability. I can check the transformed data set in the data table.

Now let’s see what our transformed data looks like in a scatter plot. I will plot the data using just the first two components. The three different wines are really nicely separated. It turns out that the chemical components called flavanoids are the ones that define the first component the most, followed by phenols.

Today we’ve learned how to transform our data into a set of linearly uncorrelated features with principal component analysis. Next time, we’ll show you another way to rank features with the Rank widget.
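For readers who prefer code to widgets, the same steps can be sketched in Python. This is not what Orange runs internally, only a rough equivalent under a few assumptions: it uses scikit-learn instead of Orange, it assumes scikit-learn's bundled `load_wine` data is the same 13-feature wine data set used in the video, and it standardizes the features first on the assumption that the PCA widget's normalization was left on. The 80% threshold is the one mentioned in the video.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's wine data matches Orange's 'wine' data set.
wine = load_wine()

# Standardize each feature so that no single unit dominates the variance.
X = StandardScaler().fit_transform(wine.data)

# Fit PCA with all components to see how the variance is distributed.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print("components needed for 80% variance:", n_components)

# Rank the original features by the absolute size of their PC1 weight.
# The sign of a component is arbitrary, so only the magnitude matters here.
pc1 = pca.components_[0]
order = np.argsort(-np.abs(pc1))
top = [wine.feature_names[i] for i in order[:2]]
print("strongest contributors to PC1:", top)

# Project the data onto the first two components for a 2D scatter plot.
scores = pca.transform(X)[:, :2]
print("projected shape:", scores.shape)
```

The cumulative sum plays the role of the scree diagram, and the rows of `pca.components_` correspond to the component weights shown in the widget's output. Features are ranked by the absolute value of their weight, so a large negative weight counts as much as a large positive one.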

• Aaron Butler says:

omg I love programming

• Mana ammai says:

Great videos, I really love Orange! Can you make a video on reading data from SQL DB tables and doing some recommendation work using machine learning algorithms?

• Rafael Calastro says:

Orange does make data less complex. Thanks =)

I am building a model to predict a numeric target accurately (R2) from an Excel file with 100+ features (nominal and numerical) and ~25.000 records using Orange 3.

Since it is a big dataset, Orange needs time to process it in its "live mode". I am using the Sample operator to reduce the number of records and was wondering about alternatives for reducing/optimizing the columns by their relevance to the target.

Could you explain, by and large, in which scenarios we should normally use PCA instead of the Rank operator? Thank you

• SB012 says:

Fantastic set of tutorials on Orange. Looking forward to many more!

• Franz A says:

Help. I need to install a module for time series.

• Kirk Ramble says:

I heart Orange! Maybe I am missing it, but in the tutorial, how did we know which components were the biggest drivers?

• Jeddie Eddie says:

wow they make this so easy… i feel like im cheating to use it…

• John Zhang says:

The lighting is so bright I think she is illuminating.

• PSchaff2 says:

Hey everyone!

Let me say that I'm not really a data scientist, but I'd like to work my way into it. Orange seems to be a great help and simplification, so thanks for that to the developer team!

But watching the videos, I now have a question:

In the video "Getting Started With Orange 09: Principal Component Analysis", at minute 2:40 she says that Flavanoids is the most important Principal Component, followed by Phenols.

And sorry, I don't get it. Yes, Flavanoids have the highest (negative) value, but the second largest here is not phenols. Not even in the little part that is visible on screen. So why does she conclude that phenols are the second most important?? I looked at the table countless times and I have no clue 🙂