#C12900. Efficient CSV Data Processing and Aggregation

    ID: 42379 Type: Default 1000ms 256MiB

You are given a CSV file provided via standard input. Your task is to process the CSV data in a memory‐efficient way by performing the following operations:

  1. Remove duplicate rows.
  2. Fill missing values according to a specified strategy. Three strategies are supported: mean, median, and mode. The strategy is given in the input; use it to compute each column's replacement value from that column's non‐missing entries. For example, the mean is computed as \(\text{mean} = \frac{\text{sum}}{\text{count}}\).
  3. Group the data by a specified column (the group-by column) and compute aggregation values for each remaining numerical column. For each such column, calculate the sum, mean, and count of numbers in that group.
  4. Output the aggregated data in CSV format with a header row. The header starts with the group-by column name, followed by the name of every other column, in original order, each suffixed with _sum, _mean, and _count.
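
Step 2 above can be sketched as follows. This is a minimal illustration rather than a reference solution: the helper names `fill_value` and `fill_column` are hypothetical, missing entries are assumed to have been parsed to `None`, and the statement does not say how ties for the mode are broken, so the sketch simply takes whatever `statistics.mode` returns (the first value encountered, in Python 3.8 and later).

```python
import statistics

def fill_value(values, strategy):
    """Return the replacement for missing entries, given the non-missing values."""
    if not values:
        raise ValueError("no non-missing values to compute a fill value from")
    if strategy == "mean":
        return sum(values) / len(values)
    if strategy == "median":
        return statistics.median(values)
    if strategy == "mode":
        # Assumption: ties go to the value seen first (Python 3.8+ behaviour).
        return statistics.mode(values)
    raise ValueError(f"unknown strategy: {strategy!r}")

def fill_column(column, strategy):
    """Replace None entries in a list of numbers with the strategy's value."""
    present = [v for v in column if v is not None]
    replacement = fill_value(present, strategy)
    return [replacement if v is None else v for v in column]

print(fill_column([5, 6, 7, 8, None, 6, 7], "mean"))  # [5, 6, 7, 8, 6.5, 6, 7]
```

The mean needs only a running sum and count, while median and mode need the column's values (or their frequencies) to be kept, which matters when memory is tight.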

This task simulates processing large CSV files in a competitive programming environment. The input is read from stdin and the output is written to stdout.
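
Duplicate removal (step 1) can be done in a single pass over the rows. The sketch below uses a hypothetical generator `unique_rows` and assumes two rows are duplicates when all of their fields match as text; the statement does not say whether values such as `1` and `1.0` should count as equal.

```python
def unique_rows(rows):
    """Yield each distinct row once, in first-seen order."""
    seen = set()
    for row in rows:
        key = tuple(row)  # lists are unhashable; tuples can go in a set
        if key not in seen:
            seen.add(key)
            yield row

rows = [["1", "5", "10"], ["2", "6", "12"], ["1", "5", "10"]]
print(list(unique_rows(rows)))  # [['1', '5', '10'], ['2', '6', '12']]
```

Because it is a generator, rows can be streamed through it, and the memory used grows with the number of distinct rows, not with the total input size.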

## inputFormat

The input is provided via standard input in the following format:

  1. A line containing a string that specifies the column name by which the data should be grouped.
  2. A line containing a string that specifies the missing value fill strategy (mean, median, or mode).
  3. A line containing an integer n representing the total number of lines of CSV data that follow (including the header row).
  4. n lines, each representing a row in the CSV file. The first of these lines is the header, containing comma-separated column names; the remaining n - 1 lines are data rows. Missing values are represented as empty fields.
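
One way to read this layout is sketched below. The function name `parse_input` is hypothetical, and representing empty fields as `None` is a choice made for the sketch, not something the statement prescribes. Fields are left as strings here; converting them to numbers would be a separate step.

```python
import csv
import io

def parse_input(stream):
    """Read group column, strategy, n, then n CSV lines from a text stream."""
    group_col = stream.readline().strip()
    strategy = stream.readline().strip()
    n = int(stream.readline())
    lines = [stream.readline().rstrip("\r\n") for _ in range(n)]
    reader = csv.reader(lines)
    header = next(reader)
    # Empty fields mark missing values; represent them as None.
    rows = [[field if field != "" else None for field in row] for row in reader]
    return group_col, strategy, header, rows

sample = io.StringIO("A\nmean\n3\nA,B,C\n1,5,10\n5,,15\n")
print(parse_input(sample))
# ('A', 'mean', ['A', 'B', 'C'], [['1', '5', '10'], ['5', None, '15']])
```

In an actual submission, `sys.stdin` would be passed in place of the `io.StringIO` object.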

## outputFormat

The output should be printed to standard output in CSV format. The first line must be a header row. The header row begins with the group-by column name, followed by, for each of the other columns, three column names formatted as {col}_sum, {col}_mean, and {col}_count. Each subsequent row should contain the aggregated values for one unique value in the group-by column, sorted in ascending order of the group-by key. The mean values should be printed as floating-point numbers.
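
The grouping and output rules can be sketched as follows, keeping only a running sum and count per group and column. The names `aggregate` and `fmt_sum` are hypothetical, and two points are assumptions read off the sample rather than stated rules: group keys are numeric and sorted numerically, and sums are printed without a decimal point when they are whole numbers.

```python
from collections import defaultdict

def fmt_sum(x):
    """Print whole-number values without a decimal point, as in the sample."""
    return str(int(x)) if float(x).is_integer() else str(float(x))

def aggregate(header, rows, group_col):
    """rows: lists of numbers, already deduplicated and filled."""
    g = header.index(group_col)
    others = [i for i in range(len(header)) if i != g]
    # group key -> column index -> [running sum, count]
    acc = defaultdict(lambda: {i: [0.0, 0] for i in others})
    for row in rows:
        for i in others:
            cell = acc[row[g]][i]
            cell[0] += row[i]
            cell[1] += 1
    suffixes = ("sum", "mean", "count")
    out = [[group_col] + [f"{header[i]}_{s}" for i in others for s in suffixes]]
    for key in sorted(acc):  # ascending order of the group-by key
        line = [fmt_sum(key)]
        for i in others:
            total, count = acc[key][i]
            line += [fmt_sum(total), str(total / count), str(count)]
        out.append(line)
    return [",".join(line) for line in out]

for line in aggregate(["A", "B", "C"], [[1, 5, 10], [2, 6, 12], [2, 7, 12]], "A"):
    print(line)
# A,B_sum,B_mean,B_count,C_sum,C_mean,C_count
# 1,5,5.0,1,10,10.0,1
# 2,13,6.5,2,24,12.0,2
```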

## sample
A
mean
8
A,B,C
1,5,10
2,6,12
2,7,12
4,8,14
5,,15
6,6,16
6,7,16
A,B_sum,B_mean,B_count,C_sum,C_mean,C_count
1,5,5.0,1,10,10.0,1
2,13,6.5,2,24,12.0,2
4,8,8.0,1,14,14.0,1
5,6.5,6.5,1,15,15.0,1
6,13,6.5,2,32,16.0,2
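
As a check on the sample, assume the fill value is taken over all non-missing entries of column B. The statement does not spell this out, but it is consistent with the expected output. The one missing B value, in the row with A = 5, is then replaced by

\[
\text{mean}_B = \frac{5 + 6 + 7 + 8 + 6 + 7}{6} = \frac{39}{6} = 6.5
\]

so the group with A = 5 has B_sum = 6.5, B_mean = 6.5, and B_count = 1, matching the fourth output row.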
