#C14899. Iris K-Means Clustering Challenge
Iris K-Means Clustering Challenge
Iris K-Means Clustering Challenge
You are given a dataset with n samples, each having m features. Your task is to implement a k-means clustering algorithm with several required steps:
- Data Preprocessing: Standardize the dataset (i.e. perform z−score normalization) on each feature. For each feature, subtract its mean and divide by its standard deviation.
- K-Means Clustering:
- initialize_centroids: Randomly select
k
distinct data points from the dataset as the initial centroids. Use a fixed random seed of 42 for reproducibility. - assign_clusters: Assign each data point to the nearest centroid using the Euclidean distance.
- update_centroids: Update each centroid as the mean of all data points assigned to it.
- Iterate the assignment and update steps until convergence or until a provided maximum number of iterations (
max_iters
) is reached.
- initialize_centroids: Randomly select
- Silhouette Score: Compute the overall silhouette score of the clustering. For each sample i belonging to a cluster, define:
$$ s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}} $$
- a(i) is the average distance from i to all other points in the same cluster.
- b(i) is the minimum average distance from i to points in a different cluster.
The final silhouette score is the average of s(i)
for all samples. If a cluster contains only one sample, its silhouette score is defined as 0.
Input & Output: Your program will read the dataset and parameters from standard input (stdin) and print the computed silhouette score (rounded to 6 decimal places) to standard output (stdout).
Input Format:
The first line contains two integers: n
(number of samples) and m
(number of features).
Then follow n
lines, each with m
space‐separated float numbers representing a sample.
The next line contains two integers: k
(number of clusters) and max_iters
(maximum iterations for k-means).
Note: Use a fixed random seed of 42 for centroid initialization to ensure consistent results.
inputFormat
The first line contains two integers, n
and m
.
The following n
lines each contain m
space‐separated float numbers representing the features of a sample.
The last line contains two integers: k
(the number of clusters) and max_iters
(the maximum number of iterations).
outputFormat
Output a single float: the silhouette score of the clustering (rounded to 6 decimal places).
## sample4 2
1 2
3 4
5 6
7 8
2 10
0.250000