# K Means Clustering Algorithm | K Means Clustering Example | Machine Learning Algorithm | Intellipaat

Hey guys, welcome back to the 9th module of this data science course. In the previous video we discussed unsupervised machine learning and its working mechanism, and we also mentioned algorithms used in unsupervised learning, such as k-means and PCA. However, we did not discuss what these algorithms are or how they work.

In today’s module we’ll be diving deep into the k-means clustering algorithm. We will first understand the concept of clustering, then look at the different types of clustering, and then move on to k-means clustering, the main topic of our discussion today. Finally, we will understand the working mechanism of the k-means algorithm.

Before we explore k-means clustering, let us first understand what clustering is. In simple words, clustering is the process of dividing a dataset into groups of similar data points or features. More formally, clustering is the process of dividing a dataset into groups such that the data points in the same group are as similar as possible and the data points in different groups are as dissimilar as possible.

Now you might be thinking: why do we need clustering, and where is it really applicable? Well, we use clustering in our day-to-day activities. For example,

whenever you go to a supermarket, you will find that all the vegetables are grouped in one row, fruits in another row, and packaged foods in a different row. They are clustered into different groups, and this in turn helps speed up your shopping. Another example is the list of products on Amazon or Flipkart: these online shopping apps recommend products based on your past history, suggesting products similar or related to what you have viewed before. How do they find similar products? Again, the concept behind this is clustering.

So, is there any criterion for applying clustering? Well,

mainly there are two things to keep in mind: the inter-cluster distance between two groups should be very large, and the intra-cluster distance between the members of a group should be very small. That means when you compare two different groups, they should be as dissimilar as possible, and when you compare two data points within a single group, they should be as similar as possible.

Now let me discuss the different types of clustering. There are three main types of clustering techniques, namely exclusive clustering, overlapping clustering, and hierarchical clustering. Let us start with exclusive clustering. Exclusive clustering is a hard clustering technique in which each data point belongs exclusively to just one cluster.

You can see here that all the orange triangles belong to one group, all the purple triangles belong to another group, and these two clusters are entirely distinct from each other.

Next we have overlapping clustering. In this case, some of the data points belong to multiple groups. As you can see over here, some of the triangles belong to both the orange and purple groups and are represented in green.

Then we have hierarchical clustering. This algorithm starts with every data point assigned to a cluster of its own; the two nearest clusters are then merged, and the algorithm terminates when only a single cluster is left. To understand this, let’s consider an example. As you can see here, A and B, and D and E, are combined first based on their similarity. In the next step, the combination of A and B is quite similar to C, so they are grouped into one cluster; similarly, the combination of D and E is similar to F, so they are grouped into one segment. Finally, in the fourth step, the tree includes all the clusters combined into one single cluster.

Well, now that we know what clustering is and its different types, let us move on to k-means clustering, the main topic of our discussion today. K-means clustering is an algorithm whose main goal is to group similar data points into clusters, where the number of groups or clusters is represented by K. K-means clustering runs on distance calculations, using the Euclidean distance: for two points (x₁, y₁) and (x₂, y₂), the Euclidean distance is √((x₁ − x₂)² + (y₁ − y₂)²), as shown on the screen.

Now let’s take an analogy and understand

what k-means clustering is. Consider that you’ve been given a bunch of books and asked to arrange them in a library. How will you start segregating them? Let’s say you meet a lecturer and come to know that all of these books belong to three subjects. You would then pick three books at random, and those books become the starting points for three separate clusters. Next, you go through the massive initial pile of books and look at each book to check which cluster it belongs to: you check the characteristics of each book, like author, publication, and year, to place it in one of the clusters. Technically, each starting book you chose becomes a centroid. We then pick the books that are closest to each centroid, group them with it, and repeat this step until we get the desired result.

So now let’s move on to the algorithm of k-means

clustering. Given K, the k-means algorithm can be executed in the following steps. In step one, we partition the objects into K non-empty subsets. In step two, we identify the cluster centroids of the current partition. In step three, we assign each point to a specific cluster: we compute the distance from each point to every centroid and allot each point to the cluster whose centroid is at minimum distance. In the last step, after re-allotting the points, we find the centroid of each newly formed cluster. If this is too confusing, let’s take an example to understand

that in a better way. Consider that you have started a new home delivery service for a supermarket and you want to open three delivery centers in the city. First, you need to identify the possible challenges you would face: you need to identify the areas from which orders are placed frequently, then you need to determine the number of delivery centers required to cover a specific area, and finally you need to discover locations for the delivery centers that keep the distance between the supermarket and the delivery points minimal. Answering these questions requires a lot of analysis and mathematics, but clustering makes life easy by solving such real-life challenges. Let’s see how. Consider these points as the most probable locations from which orders are placed

frequently. Now suppose these three points are chosen as the delivery centers that cover the area most efficiently, i.e., so that the distance from the delivery centers to the delivery points is least; these act as the centroids, or cluster centers. Now let us calculate the distance from each delivery location to our cluster centers. Each delivery location is colored with the color of its nearest cluster center: if a delivery location is nearest to the orange cluster center, it is colored orange, and similarly we do this for all delivery locations. Now every delivery location is assigned to some cluster center, so all the delivery locations are clustered into three clusters.

Next, we’ll go ahead and calculate the centroid of all the points lying within each cluster. These centroids may come out to be the same as the current cluster centers, or they may not; if they are the same, it is the ideal scenario. Here the centroids come out to be different from the current cluster centers, so we will take these new points as centroids and ignore the previous ones.
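The assign-and-recompute step just described can be sketched in Python. This is a minimal illustration with made-up 2-D delivery locations and two clusters; the coordinates and variable names are assumptions for the example, not values from the video:

```python
import math

# Hypothetical delivery locations as (x, y) points -- made-up example data
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (9, 9)]
# Initial cluster centers (e.g. two locations picked at random)
centroids = [(1, 1), (9, 8)]

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Assignment step: each point goes to its nearest centroid
clusters = {i: [] for i in range(len(centroids))}
for p in points:
    nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
    clusters[nearest].append(p)

# Update step: each centroid moves to the mean of its assigned points
new_centroids = []
for i in range(len(centroids)):
    members = clusters[i]
    cx = sum(p[0] for p in members) / len(members)
    cy = sum(p[1] for p in members) / len(members)
    new_centroids.append((cx, cy))

print(new_centroids)  # the recomputed cluster centers
```

Repeating these two steps until the centroids stop moving is exactly the iteration walked through next.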

Now again we’ll calculate the distance from each delivery location to the new cluster centers. Here we may find that some delivery locations are now closer to a different cluster center, so we assign those delivery locations to their new cluster center and color them accordingly. Then we again find the new centroid of all the points in each cluster. If it comes out to be the same as the previous cluster center, we are done; otherwise, we have to repeat until the clusters converge. Since the cluster centers here do not converge yet, we repeat the same process to find new centroids, and we stop only when the cluster centers converge. There is one more

thing which we need to take care of, and it is called distortion. If the distortion is low, it means the clustering is good. Generally, you would run k-means several times with different random initializations and choose the clustering with the lowest distortion. Distortion is calculated using the formula shown here: the sum of the squared distances from each point to the centroid of its assigned cluster.

So this time we repeat the steps keeping distortion in mind; in this case we’ll denote the distortion by the variable X. Now we will increase the value of K and check whether the clusters improve or the distortion changes. Checking with a value of K of five, we see that as the value of K increases, the distortion decreases. Now if we decrease the value of K, say to K = 2, then the clusters would look like this, and the distortion increases. So we should choose the value of K such that even if we increase K further, the distortion remains roughly constant. This is called the ideal value of K, and it can be identified using the elbow method: the point beyond which the distortion remains nearly constant is taken as the ideal value of K. And this is how k-means clustering works.

Now let me summarize the algorithm for you. Initially, we choose the number of clusters K and the initial centroids for a given dataset. Then we compute the distance of each point from the centroids and group the points based on the minimum distance, and we repeat these steps until the centroids become fixed. So guys, this is all about the k-means algorithm, and this brings us to the end of the session. Thanks for attending, and let’s meet in the next class.
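To tie the session together, here is a from-scratch Python sketch of the full k-means loop with distortion and multiple random restarts, run for several values of K. The data points, seeds, and function names are illustrative assumptions; in practice you would more likely use a library implementation such as scikit-learn’s KMeans:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: returns (centroids, distortion), where distortion is
    the sum of squared distances from each point to its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # centroids are fixed -> converged
            break
        centroids = new
    distortion = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids) for p in points)
    return centroids, distortion

def best_kmeans(points, k, restarts=20):
    """Run k-means with several random initializations and keep the
    result with the lowest distortion, as suggested in the session."""
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda result: result[1])

# Three well-separated, made-up blobs of points
points = [(0, 0), (1, 0), (0, 1),
          (10, 10), (11, 10), (10, 11),
          (20, 0), (21, 0), (20, 1)]
for k in (1, 2, 3):
    _, d = best_kmeans(points, k)
    print(k, round(d, 2))  # distortion shrinks as K grows
```

Plotting the distortion against K and picking the point where the curve flattens is the elbow method described above.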

Got a question on K Means Clustering or Data Science? Do write it in the comments and you will get a response immediately from our expert. For Data Science training & certification, call us at US: 1800-216-8930 (Toll Free) or India: +917022374614. You can also write to us at [email protected]
