#C13270. Implement K-Means Clustering with Metrics
Implement K-Means Clustering with Metrics
Implement K-Means Clustering with Metrics
In this problem, you are required to implement the K-Means clustering algorithm from scratch along with two important clustering metrics: the silhouette score and the Within-Cluster Sum of Squares (WCSS). Given a set of data points in a d-dimensional space and an integer k, you must cluster the points into k clusters.
You are to create a KMeans
class with the following methods:
fit()
: Partition the input data points into k clusters using the iterative k-means algorithm. Use a maximum of 300 iterations and stop early if the centroid movement is less than a tolerance of \(10^{-4}\).predict()
: Given new data points, assign each point to its closest cluster centroid.
You must also implement two functions:
calculate_silhouette_score()
: Compute the average silhouette score of the clustering. The silhouette score for a single point is defined as \(s = \frac{b - a}{\max(a, b)}\), where \(a\) is the average distance to all other points in the same cluster and \(b\) is the minimum average distance to points in any other cluster. If a cluster contains only a single point, treat its silhouette score as 0.calculate_wcss()
: Compute the sum of squared distances from each point to the centroid of its corresponding cluster.
Input/Output:
The program must read from standard input and write to standard output.
inputFormat
The input is given via standard input in the following format:
n d k x[0][0] x[0][1] ... x[0][d-1] ... x[n-1][0] x[n-1][1] ... x[n-1][d-1] m q[0][0] q[0][1] ... q[0][d-1] ... q[m-1][0] q[m-1][1] ... q[m-1][d-1]
Where:
- n is the number of data points, d is the number of dimensions, and k is the number of clusters.
- The next n lines each contain d floating-point numbers representing a data point.
- m is the number of query points.
- The next m lines each contain d floating-point numbers representing a query point.
outputFormat
The output should be printed to standard output in the following format:
Note: The cluster indices range from 0 to k-1.
## sample4 2 2
1 1
1 2
10 10
10 11
2
1 1.5
10 10.5
0.9213
1.0000
0 1
</p>