All About Programming: Improved Seeding For Clustering With K-Means++

Improved Seeding For Clustering With K-Means++ | The Data Science Lab

Clustering data into subsets is an important task for many data science applications. At The Data Science Lab we have illustrated how Lloyd's algorithm for k-means clustering works, including snippets of Python code to visualize the iterative clustering steps. One issue with the procedure is that the algorithm gives no information about which k is optimal; that has to be determined by other methods, so we went a step further and coded up the gap statistic to find the proper k for k-means clustering. In combination with the clustering algorithm, the gap statistic allows us to estimate the best value of k among those in a given range. An additional problem with the standard k-means procedure remains, though, as shown by the image on the right, where a poor random initialization of the centroids leads to suboptimal clustering.
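To make the seeding idea in the title concrete, here is a minimal sketch of k-means++ initialization as it is commonly described: the first centroid is drawn uniformly at random, and each further centroid is drawn with probability proportional to the squared distance from a point to its nearest already-chosen centroid. This is not the code from the original post; the names `squared_distance` and `kmeanspp_seeds` are hypothetical, and points are assumed to be tuples of numbers.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centroids from points using k-means++ style seeding.

    Illustrative sketch only (hypothetical helper, not the article's code).
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First centroid: uniform random choice among the data points.
    centers = [rng.choice(points)]

    # d2[i] holds the squared distance from points[i] to its nearest centroid.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen centroid: fall back to uniform.
            new_center = rng.choice(points)
        else:
            # Far-away points are more likely to be chosen as the next centroid.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # Update nearest-centroid distances with the newly added centroid.
        d2 = [min(old, squared_distance(p, new_center))
              for old, p in zip(d2, points)]

    return centers
```

A call such as `kmeanspp_seeds(points, 3, random.Random(0))` returns three starting centroids that can be handed to Lloyd's algorithm in place of a purely uniform random initialization; passing a seeded `random.Random` makes the choice reproducible.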

Read full article from Improved Seeding For Clustering With K-Means++ | The Data Science Lab

