#C13270. Implement K-Means Clustering with Metrics

    ID: 42790 Type: Default 1000ms 256MiB

Implement K-Means Clustering with Metrics

Implement K-Means Clustering with Metrics

In this problem, you are required to implement the K-Means clustering algorithm from scratch along with two important clustering metrics: the silhouette score and the Within-Cluster Sum of Squares (WCSS). Given a set of data points in a d-dimensional space and an integer k, you must cluster the points into k clusters.

You are to create a KMeans class with the following methods:

  • fit(): Partition the input data points into k clusters using the iterative k-means algorithm. Use a maximum of 300 iterations and stop early if the centroid movement is less than a tolerance of \(10^{-4}\).
  • predict(): Given new data points, assign each point to its closest cluster centroid.

You must also implement two functions:

  • calculate_silhouette_score(): Compute the average silhouette score of the clustering. The silhouette score for a single point is defined as \(s = \frac{b - a}{\max(a, b)}\), where \(a\) is the average distance to all other points in the same cluster and \(b\) is the minimum average distance to points in any other cluster. If a cluster contains only a single point, treat its silhouette score as 0.
  • calculate_wcss(): Compute the sum of squared distances from each point to the centroid of its corresponding cluster.

Input/Output:

The program must read from standard input and write to standard output.

inputFormat

The input is given via standard input in the following format:

n d k
x[0][0] x[0][1] ... x[0][d-1]
...
x[n-1][0] x[n-1][1] ... x[n-1][d-1]
m
q[0][0] q[0][1] ... q[0][d-1]
...
q[m-1][0] q[m-1][1] ... q[m-1][d-1]

Where:

  • n is the number of data points, d is the number of dimensions, and k is the number of clusters.
  • The next n lines each contain d floating-point numbers representing a data point.
  • m is the number of query points.
  • The next m lines each contain d floating-point numbers representing a query point.

outputFormat

The output should be printed to standard output in the following format:



Note: The cluster indices range from 0 to k-1.

## sample
4 2 2
1 1
1 2
10 10
10 11
2
1 1.5
10 10.5
0.9213

1.0000 0 1

</p>