8
Nov

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning Algorithm | Intellipaat


Hey guys welcome back to the 9th module
of this data science course. So in the previous video we discussed unsupervised
machine learning and it’s working mechanism. Then we also discussed many
algorithms like k-means or pca and unsupervised learning. However we did not
discuss what are these algorithms and how do they work.
Well in today’s module we’ll be deep diving into the k-means clustering
algorithm. So in this module we will first understand the concept of clustering
then we will look at the different types of clustering after which we will move
on to k-means clustering. The main topic of our discussion today and finally we
will understand the working mechanism of k-means algorithm so before we explore
k-means clustering let us first understand what is clustering. So in
simple words clustering is a process of dividing the dataset into groups of
similar data points or features so now let’s look at the definition of
clustering. So clustering is a process of dividing a dataset into points where the
data points in the same group are as similar as possible and the data points
in the different groups are as dissimilar as possible so now you might
be thinking why do we need clustering or where is it really applicable. Well to tell you we do use clustering in our day-to-day activities for example
whenever you go to any supermarket you will find that all vegetables are
grouped in one row, fruits in another row and other packaged foods in a different
row. So they are clustered into different groups.
Now this in turn will help you to fasten your shopping process so another example
would be the list of products on Amazon or Flipkart. So these online shopping
apps recommend you the products based on your past history. So what have you
watched in your past they recommend you the similar products or products related
to it. So how do they recommend you the similar products? Well again the concept
behind this is clustering. So is there any criteria to apply clustering? so
mainly these are the two things to be kept in mind. The inter distance between
the two groups should be very large and the intra distance between the members
of the group should be very small that means when you compare two different
groups there should be as dissimilar as possible
and when you compare two data points of a single group they should be as similar
as possible. So now I will discuss the different types of clustering. So there
are three main types of clustering techniques namely exclusive clustering,
overlapping clustering and hierarchical clustering. So let us start with
exclusive clustering. So exclusive clustering is a hard clustering
technique in which the data points exclusively belong to just one cluster.
So you can see here that all the orange triangles belong to one group and all
the purple triangles belong to another group and both these clusters are
entirely different from each other so next we have overlapping clustering.
so in this case some of the data points belong to multiple groups. So as you can
see over here some of the orange and purple triangles belong to both the
groups and are represented in green color. Then we have hierarchical
clustering. So this algorithm starts with all the data points assigned to a
cluster of their own and then the two nearest clusters are merged into the
same cluster and in the end this algorithm terminates when there is only
a single cluster left and to understand this let’s consider an example so as you
can see here A and B and D and E are combined based on some similarities
and in the next step the combination of A and B is quite similar to C and hence
grouped them in one cluster similarly in the next step the combination of D and E
is similar to F and hence grouped in one segment and finally in the fourth step
we see that the final tree includes all the clusters combined into one single
cluster. Well now that we know what is clustering and its different types let
us move on to K means clustering the main topic of our discussion today. So
k-means clustering is an algorithm whose main goal is to group similar data
points into a cluster and the number of groups or clusters are represented by K
so k-means clustering runs on distance calculations which uses euclidean
distance to calculate distance and the formula for euclidean distance is as
shown on the screen Now let’s take an analogy and understand
what k-means clustering is. So consider that you’ve been given a
bunch of books and asked to arrange them in a library so now how will you start
segregating them? So let’s say you meet a lecturer and come to know that all of
these books belong to three subjects so now you will just take three books at
random and those books will become a starting point for those three separate
clusters. So again you have to go through the massive initial group of books and
look at each of the books to check under which cluster this book belongs to. Now
if you need to check all the characteristics of books like author,
publication and year to place them in one cluster. So technically the starting
book which you will choose will become a centroid. Now we will pick those books
which are closer to the centroid and group them and we repeat this step till
we get the desired result. So now let’s move on to the algorithm of k-means
clustering so where k is given the k-means algorithm can be executed in the
following steps. So in step one we partition the objects into k non-empty
subsets in step two we identify the cluster centroids of the current
partition and step three we assign each point to a specific cluster and in step four we compute the distances from each point and a lot points to the
cluster by the distance from the centroid is minimum and in the last step
after realloting the points we find the centroid of the new cluster which is
formed and if it’s too confusing let’s actually take an example to understand
that in a better way. So consider that you started a new home delivery service
for a supermarket and you want to open three delivery centers in the city so
first you need to identify the possible challenges that you would face so you need to
identify the areas from where orders are placed frequently then you need to
identify the number of required delivery centers to cover in the specific area
and finally you need to discover the locations for delivery centers to be
established in order to keep the distance between supermarket and
delivery points minimum so now answering these questions requires a lot of
analysis and mathematics however clustering makes a life easy by solving
out such real-life challenges so let’s see how. Now consider these points as
most probable locations in which the order is placed
frequently. So now these three points are considered as a delivery centers which
cover the area more efficiently that is the distance from delivery centers to
delivery point as least so these would act as the centroids or the cluster
centers. sSo now let us calculate the distance from each delivery location to
our cluster centers or delivery centers. So the distance which is least will be
colored in its color i.e. the delivery center is near to orange
cluster center then the delivery location is colored in orange color and
similarly they will do it for all delivery
locations so now we have all the delivery locations assigned to some
cluster centers. So now we have all the delivery centers clustered into three
clusters. So we’ll go ahead and calculate the centroid of all the points lying
within a particular cluster. So these centroids may come to be seen as the
current cluster centers or they may not and if they are same then it is an ideal
scenario so here the centroids come out to be
different from the current cluster centers so we will take this new cluster
points as centroids and ignore the previous ones.
Now again we’ll calculate the distance from each delivery location to the
cluster center so here again we might find that so the delivery locations are
closer to the cluster centers so we will just assign those delivery location to
the new cluster center and color them accordingly. So again we have to find the
new cluster centroid for all of the points and if this comes out to be same
as the previous cluster center then it’s fine
else you have to repeat until the clusters converge and since the cluster
centers do not converge we repeat the same process to find the new centroids
and we stop only when the cluster centers converge so there is one more
thing which we need to take care of and it is called as distortion so if the
distortion is low it means the clustering is good so generally you
would run came in several times with different random initializations and
choose the clustering the lowest distortion so distortion is calculated
by using this formula given him so this time we repeat the step keeping
distortion in mind so in this case we’ll just name the distance by using the
distortion variable X now we will increase the
number of key and check whether clusters improve or distort so now let’s check
with the value of key to be fine so we see that as the value of K increases the
distortion decreases now if we decrease the value of key let’s say K equals 2
then the clusters would look like this and distortion increases so we should
choose the value of K such that even if we increase the value of K the
distortion remains constant and this is called as a ideal value for K and this
can be identified by using the elbow method to the point where distortion
remains constant is considered as an ideal value of K and this is how K means
clustering works now let me summarize the algorithm for you so initially we
need to find the number of clusters and centroid for a given data set then we
identify the distance from the centroid and group them based on the minimum
distance and we repeat this step until we get the fixed Android so guys this is
all about the k-means algorithm and this brings us to the end of the session
thanks for attending and let’s meet in the next class

Tags: , , , , , , , , , , , , , , , , , , ,

3 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *