#C14899. Iris K-Means Clustering Challenge

    ID: 44598 Type: Default 1000ms 256MiB

Iris K-Means Clustering Challenge

Iris K-Means Clustering Challenge

You are given a dataset with n samples, each having m features. Your task is to implement a k-means clustering algorithm with several required steps:

  • Data Preprocessing: Standardize the dataset (i.e. perform z−score normalization) on each feature. For each feature, subtract its mean and divide by its standard deviation.
  • K-Means Clustering:
    • initialize_centroids: Randomly select k distinct data points from the dataset as the initial centroids. Use a fixed random seed of 42 for reproducibility.
    • assign_clusters: Assign each data point to the nearest centroid using the Euclidean distance.
    • update_centroids: Update each centroid as the mean of all data points assigned to it.
    • Iterate the assignment and update steps until convergence or until a provided maximum number of iterations (max_iters) is reached.
  • Silhouette Score: Compute the overall silhouette score of the clustering. For each sample i belonging to a cluster, define:

$$ s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}} $$

  • a(i) is the average distance from i to all other points in the same cluster.
  • b(i) is the minimum average distance from i to points in a different cluster.

The final silhouette score is the average of s(i) for all samples. If a cluster contains only one sample, its silhouette score is defined as 0.

Input & Output: Your program will read the dataset and parameters from standard input (stdin) and print the computed silhouette score (rounded to 6 decimal places) to standard output (stdout).

Input Format:
The first line contains two integers: n (number of samples) and m (number of features).
Then follow n lines, each with m space‐separated float numbers representing a sample.
The next line contains two integers: k (number of clusters) and max_iters (maximum iterations for k-means).

Note: Use a fixed random seed of 42 for centroid initialization to ensure consistent results.

inputFormat

The first line contains two integers, n and m. The following n lines each contain m space‐separated float numbers representing the features of a sample. The last line contains two integers: k (the number of clusters) and max_iters (the maximum number of iterations).

outputFormat

Output a single float: the silhouette score of the clustering (rounded to 6 decimal places).

## sample
4 2
1 2
3 4
5 6
7 8
2 10
0.250000