Hierarchical Clustering in Machine Learning

Introduction to Hierarchical Clustering in Machine Learning

Hierarchical clustering is an unsupervised machine learning technique that groups data points into clusters on the basis of their similarity. It works through a series of successive partitions to arrive at the final clusters, and the resulting structure looks like a tree. Two types of algorithm are used:

  1. Agglomerative Hierarchical Clustering
  2. Divisive Hierarchical Clustering

Agglomerative Clustering

This algorithm works bottom-up. It starts with every data point as its own cluster and repeatedly merges the two nearest clusters, where nearness is measured from the pairwise distances between the data points. The resulting hierarchy is more informative than the unstructured set of clusters returned by flat clustering. The algorithm also does not require us to specify the number of clusters in advance.

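There are several methods (linkage criteria) for measuring the distance between two clusters when grouping the data.

  1. Single linkage: the nearest distance between a point in one cluster and a point in the other.
  2. Complete linkage: the farthest distance between a point in one cluster and a point in the other.
  3. Average linkage: the average of all pairwise distances between the two clusters.
  4. Centroid distance: the distance between the centroids of the two clusters.
  5. Ward's method: merge the pair of clusters that gives the smallest increase in the within-cluster sum of squared Euclidean distances.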
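To make the linkage criteria above concrete, here is a minimal sketch that computes each of them for two tiny clusters. The helper name `linkage_distances` and the toy points are made up for illustration. For Ward's method the sketch reports the increase in the within-cluster sum of squares that the merge would cause, using the standard formula n_a * n_b / (n_a + n_b) times the squared centroid distance.

```python
import math
from itertools import product

def linkage_distances(cluster_a, cluster_b):
    """Return the distance between two clusters under several linkage criteria."""
    # All pairwise Euclidean distances between a point in A and a point in B
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

    def centroid(cluster):
        # Coordinate-wise mean of the points in the cluster
        return [sum(coords) / len(cluster) for coords in zip(*cluster)]

    centroid_gap = math.dist(centroid(cluster_a), centroid(cluster_b))
    n_a, n_b = len(cluster_a), len(cluster_b)

    return {
        "single": min(pairwise),                   # nearest pair
        "complete": max(pairwise),                 # farthest pair
        "average": sum(pairwise) / len(pairwise),  # mean over all pairs
        "centroid": centroid_gap,                  # distance between centroids
        # Increase in within-cluster sum of squares if A and B were merged
        "ward_increase": n_a * n_b / (n_a + n_b) * centroid_gap ** 2,
    }

# Hypothetical toy clusters lying on a line
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
distances = linkage_distances(a, b)
for name, value in distances.items():
    print(name, value)
```

Single linkage only looks at the closest pair, so it tends to chain clusters together, while complete linkage looks at the farthest pair and tends to give more compact clusters.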

The data is grouped in this way until a single cluster is formed. We then use the dendrogram to decide how many clusters to keep.

The steps for agglomerative hierarchical clustering are as follows:

  • Suppose we are given the data X = {x1, x2, x3, …, xn}.

1) Start with the disjoint clustering in which every data point is its own cluster, with level L(0) = 0 and sequence number m = 0.

[Figure: each data point as its own cluster]

2) Find the closest pair of clusters in the current clustering, say (r) and (s), such that d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering. Merge (r) and (s) into one cluster, and repeat this step until only one cluster remains.

[Figure: clusters merged step by step]

3) Form the dendrogram from the sequence of merges. In the figure below, the dendrogram joins the nearest clusters one by one.

[Figure: example dendrogram]

In the dendrogram above, suppose B and C are the nearest pair of points, so they are merged first. A is then the closest point to the B and C cluster, so it joins next. This process is repeated until one cluster is formed.

4) The next step is to determine the number of clusters. In the dendrogram, the x-axis shows the data points and the y-axis shows the Euclidean distance at which clusters are merged. Look for the longest vertical line that is not crossed by any horizontal line, and draw a horizontal line through it (the red line in the figure below). The number of vertical lines that this red line crosses is the number of clusters. In the figure, the number of clusters is 5.
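The longest uncrossed vertical line roughly corresponds to the largest gap between two consecutive merge heights, so the same rule can be applied numerically. The sketch below does this with SciPy's linkage matrix. The synthetic three-blob data is an assumption made purely for illustration, and this is one simple heuristic rather than the only way to pick the number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data (an assumption for illustration): three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Each row of Z records one merge; column 2 is the height of that merge
Z = linkage(points, method="ward")
heights = Z[:, 2]

# Find the largest gap between two consecutive merge heights
gaps = np.diff(heights)
i = int(np.argmax(gaps))

# Cutting inside that gap keeps the first i + 1 merges and undoes the rest
n_clusters = len(points) - (i + 1)
cut_height = (heights[i] + heights[i + 1]) / 2

# Flat cluster labels obtained by cutting the tree at cut_height
labels = fcluster(Z, t=cut_height, criterion="distance")
print("clusters from largest gap:", n_clusters)
print("clusters after cutting:", len(np.unique(labels)))
```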

[Figure: dendrogram with a red horizontal cut line]
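The merging steps described above can be sketched in plain Python. This is a naive version that uses single linkage and rescans every pair of clusters on each pass, so it is only meant to show the idea on small data. The function name `agglomerative` and the four labelled points are hypothetical.

```python
import math

def agglomerative(points):
    """Naive single-linkage agglomerative clustering.

    Returns the merges in order as (cluster_r, cluster_s, distance) tuples,
    where each cluster is a tuple of point indices.
    """
    # Step 1: every point starts as its own cluster
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def single_link(r, s):
        # Distance between the closest pair of points across the two clusters
        return min(math.dist(points[i], points[j]) for i in r for j in s)

    # Step 2: repeatedly merge the closest pair until one cluster remains
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        r, s = clusters[a], clusters[b]
        merges.append((r, s, d))
        # Replace the two merged clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(tuple(sorted(r + s)))
    return merges

# Hypothetical data: B and C are closest, then A joins them, then D
points = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (3.0, 0.0), "D": (10.0, 0.0)}
names = list(points)
merges = agglomerative(list(points.values()))
for r, s, d in merges:
    print([names[i] for i in r], "+", [names[i] for i in s], "at distance", d)
```

The recorded merge distances are exactly the heights at which the branches of the dendrogram join.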

Divisive Hierarchical Clustering

Divisive hierarchical clustering works in the opposite direction to agglomerative hierarchical clustering. It is a top-down approach: the whole data set starts as one cluster, and clusters are split repeatedly until each data point is a separate cluster.

In the first step, all data points form one cluster, as in the figure below.

[Figure: all data points in a single cluster]

Splitting continues until each data point is its own cluster.

[Figure: each data point as its own cluster]

Because the clusters are divided step by step, the method is called divisive hierarchical clustering.
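The top-down idea can be sketched as follows. The splitting rule used here is a simplified heuristic chosen for illustration, not a standard algorithm from this blog or from a library: it always splits the cluster with the largest diameter, using its two farthest points as seeds. The function names and the toy points are hypothetical.

```python
import math
from itertools import combinations

def farthest_pair(points, cluster):
    """Return (distance, i, j) for the two most distant points in a cluster."""
    return max((math.dist(points[i], points[j]), i, j)
               for i, j in combinations(cluster, 2))

def divisive(points, n_clusters):
    """Top-down clustering: split the widest cluster until n_clusters remain."""
    # Start with one cluster that holds every point index
    clusters = [list(range(len(points)))]
    while len(clusters) < n_clusters:
        # Only clusters with at least two points can be split
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        # Pick the cluster with the largest diameter and its two farthest points
        target = max(splittable, key=lambda c: farthest_pair(points, c)[0])
        _, seed_a, seed_b = farthest_pair(points, target)
        # Assign each member to the nearer seed; seed_b always goes to the right
        left = [i for i in target
                if i != seed_b
                and math.dist(points[i], points[seed_a])
                <= math.dist(points[i], points[seed_b])]
        right = [i for i in target if i not in left]
        clusters.remove(target)
        clusters.extend([left, right])
    return sorted(sorted(c) for c in clusters)

# Hypothetical points on a line
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (25.0, 0.0)]
print(divisive(points, 1))            # everything in one cluster
print(divisive(points, 3))            # stop early at three clusters
print(divisive(points, len(points)))  # every point on its own
```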

Agglomerative hierarchical clustering is widely used in industry, so we will focus on that algorithm in this blog.

Implementation of Agglomerative Hierarchical Clustering

# Import the required libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
data = pd.read_csv('Wholesale customers.csv')
data.head()
   Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0        2       3  12669  9656     7561     214              2674        1338
1        2       3   7057  9810     9568    1762              3293        1776
2        2       3   6353  8808     7684    2405              3516        7844
3        1       3  13265  1196     4221    6404               507        1788
4        2       3  22615  5410     7198    3915              1777        5185
# Inspect the data: column info and summary statistics
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 8 columns):
Channel             440 non-null int64
Region              440 non-null int64
Fresh               440 non-null int64
Milk                440 non-null int64
Grocery             440 non-null int64
Frozen              440 non-null int64
Detergents_Paper    440 non-null int64
Delicassen          440 non-null int64
dtypes: int64(8)
memory usage: 27.6 KB
data.describe()
          Channel      Region          Fresh          Milk       Grocery        Frozen  Detergents_Paper    Delicassen
count  440.000000  440.000000     440.000000    440.000000    440.000000    440.000000        440.000000    440.000000
mean     1.322727    2.543182   12000.297727   5796.265909   7951.277273   3071.931818       2881.493182   1524.870455
std      0.468052    0.774272   12647.328865   7380.377175   9503.162829   4854.673333       4767.854448   2820.105937
min      1.000000    1.000000       3.000000     55.000000      3.000000     25.000000          3.000000      3.000000
25%      1.000000    2.000000    3127.750000   1533.000000   2153.000000    742.250000        256.750000    408.250000
50%      1.000000    3.000000    8504.000000   3627.000000   4755.500000   1526.000000        816.500000    965.500000
75%      2.000000    3.000000   16933.750000   7190.250000  10655.750000   3554.250000       3922.000000   1820.250000
max      2.000000    3.000000  112151.000000  73498.000000  92780.000000  60869.000000      40827.000000  47943.000000
# Perform data preprocessing
# Apply StandardScaler for feature scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc_data = sc.fit_transform(data)
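Scaling matters here because hierarchical clustering is distance based, so a column with large values such as Fresh would otherwise dominate the Euclidean distance. As a small check of what StandardScaler does, the sketch below standardizes a made-up two-column array and compares the result with the manual formula (x - mean) / std. The numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data on very different scales
raw = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(raw)

# StandardScaler computes (x - mean) / std per column, using the population std
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```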
# Load the library needed to draw the dendrogram

import scipy.cluster.hierarchy as sch

# Pass data into dendrogram function
plt.figure(figsize=(10,8))
dendrogram = sch.dendrogram(sch.linkage(sc_data,method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()

# Fit hierarchical clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the `affinity` parameter was removed in scikit-learn 1.4)
hc = AgglomerativeClustering(n_clusters=2, linkage='ward')

# Get the predicted cluster labels
y_hc = hc.fit_predict(sc_data)
# Evaluate the model
# The silhouette score lies between -1 and 1; multiply by 100 to read it as a percentage
from sklearn.metrics import silhouette_score
silhouette_score(sc_data, y_hc) * 100
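The silhouette score can also be used to compare several candidate values of n_clusters. The sketch below does this on synthetic data with three blobs, which is an assumption made so that the example runs without the CSV file. On the real data you would pass sc_data instead of X.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption): three separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(30, 2)) for c in centers])

# Fit one model per candidate number of clusters and record its score
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

A higher score means that points are closer to their own cluster than to the neighbouring one, so the value of k with the highest score is a reasonable candidate.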

# Set the threshold value to 30 to cut the dendrogram
plt.figure(figsize=(10, 7))  
plt.title("Dendrograms")  
dend = sch.dendrogram(sch.linkage(sc_data, method='ward'))
plt.axhline(y=30, color='r', linestyle='--')
plt.show()
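Drawing the line only shows the cut. To turn a threshold into cluster labels, SciPy's fcluster can be used with criterion="distance". The sketch below uses synthetic two-blob data as an assumption, so the threshold of 30 happens to separate these blobs but is not tuned to the wholesale data above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data (an assumption): two well-separated blobs
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [12.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

Z = linkage(X, method="ward")

# Points joined below the threshold share a label; merges above it are undone
threshold = 30
labels = fcluster(Z, t=threshold, criterion="distance")

# Show each cluster label with its number of points
print(np.unique(labels, return_counts=True))
```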

Conclusion

After reading this blog, you should have a better understanding of hierarchical clustering, its agglomerative and divisive approaches, and how to use it to solve a clustering problem.
