#C12900. Efficient CSV Data Processing and Aggregation
You are given a CSV file provided via standard input. Your task is to process the CSV data in a memory‐efficient way by performing the following operations:
- Remove duplicate rows.
- Fill missing values according to a specified strategy. Three strategies are supported: mean, median, and mode. The strategy is given in the input; use it to compute the replacement value for each column from that column's non-missing entries. For example, the mean is computed as \(\text{mean} = \frac{\text{sum}}{\text{count}}\).
- Group the data by a specified column (the group-by column) and compute aggregation values for each remaining numerical column. For each such column, calculate the sum, mean, and count of numbers in that group.
- Output the aggregated data in CSV format with a header row. The header should start with the group-by column, followed by each other column appended with `_sum`, `_mean`, and `_count`, in the order they originally appear.
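The deduplication and fill steps can be sketched in Python as follows. This is a minimal sketch, not a reference solution: the helper names (`parse_number`, `drop_duplicates`, `fill_value`, `fill_missing`) are hypothetical, and it assumes that duplicates are exact row matches removed before filling, that every column (including the group-by column) is filled from its own non-missing entries, and that ties for `mode` go to the first value seen (the behavior of `statistics.mode` in Python 3.8+). The statement does not pin down any of these details.

```python
import statistics


def parse_number(field):
    """Return an int or float for a numeric field, or None for an empty one."""
    field = field.strip()
    if field == "":
        return None
    try:
        return int(field)
    except ValueError:
        return float(field)


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_value(values, strategy):
    """Compute the replacement for missing entries from the non-missing ones."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # assumption: an all-missing column is left unfilled
    if strategy == "mean":
        return sum(present) / len(present)
    if strategy == "median":
        return statistics.median(present)
    if strategy == "mode":
        return statistics.mode(present)  # first-seen value wins ties (3.8+)
    raise ValueError(f"unknown strategy: {strategy}")


def fill_missing(rows, strategy):
    """Replace None cells column by column with that column's fill value."""
    columns = list(zip(*rows))
    fills = [fill_value(col, strategy) for col in columns]
    return [
        [fills[i] if v is None else v for i, v in enumerate(row)]
        for row in rows
    ]
```

Note that `fill_missing` needs each column's values before it can fill anything, so this sketch holds the deduplicated rows in memory; a two-pass approach over the input would be an alternative for very large files.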
This task simulates processing large CSV files in a competitive programming environment. The input is read from `stdin` and the output is written to `stdout`.
Input Format
The input is provided via standard input in the following format:
- A line containing a string that specifies the column name by which the data should be grouped.
- A line containing a string that specifies the missing value fill strategy (`mean`, `median`, or `mode`).
- A line containing an integer n representing the total number of lines of CSV data that follow (including the header row).
- n lines, each representing a row in the CSV file. The first of these lines is the header, containing column names separated by commas, followed by data rows. Missing values are represented as empty fields.
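Reading this layout might look like the sketch below. The function name `read_problem_input` is hypothetical, and the code assumes that n is at least 1 (so a header line exists) and that no CSV field contains an embedded newline, so that n physical lines correspond to n CSV records. It returns the data rows as a lazy iterator rather than a list, which keeps memory use low until the rows are actually consumed.

```python
import csv
from itertools import islice


def read_problem_input(stream):
    """Parse the three control lines, then the CSV header and data rows."""
    group_col = stream.readline().strip()
    strategy = stream.readline().strip()
    n = int(stream.readline())
    # islice stops after n lines, so anything after the CSV block is ignored.
    reader = csv.reader(islice(stream, n))
    header = next(reader)
    # The remaining records are produced on demand as the caller iterates.
    return group_col, strategy, header, reader
```

A typical call would be `group_col, strategy, header, rows = read_problem_input(sys.stdin)` after `import sys`. Missing values arrive as empty strings, for example `["5", "", "15"]`.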
Output Format
The output should be printed to standard output in CSV format. The first line must be a header row. The header row begins with the group-by column name, followed by, for each of the other columns, three column names formatted as `{col}_sum`, `{col}_mean`, and `{col}_count`. Each subsequent row should contain the aggregated values for one unique value in the group-by column, sorted in ascending order of the group-by key. The `mean` values should be printed as floating-point numbers.
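One way to produce this output is sketched below. The function name `aggregate` is hypothetical, and the code assumes the rows have already been parsed into numbers, deduplicated, and filled. It keeps only a running sum and count per group and column, so memory grows with the number of groups rather than the number of rows. Two details are inferred rather than stated: group keys are sorted by Python's natural numeric order, and numbers are formatted with `str()`, so an integer sum prints as `13` while a mean prints as `6.5` or `12.0`.

```python
from collections import defaultdict


def aggregate(header, rows, group_col):
    """Return the output CSV lines for rows grouped by group_col."""
    g = header.index(group_col)
    value_idx = [i for i in range(len(header)) if i != g]

    # For each group key: one [sum, count] accumulator per value column.
    totals = defaultdict(lambda: [[0, 0] for _ in value_idx])
    for row in rows:
        acc = totals[row[g]]
        for slot, i in enumerate(value_idx):
            acc[slot][0] += row[i]
            acc[slot][1] += 1

    # Header: group-by column, then sum/mean/count names in original order.
    out_header = [group_col]
    for i in value_idx:
        name = header[i]
        out_header += [f"{name}_sum", f"{name}_mean", f"{name}_count"]

    lines = [",".join(out_header)]
    for key in sorted(totals):  # ascending order of the group-by key
        fields = [str(key)]
        for total, count in totals[key]:
            # True division always yields a float, e.g. 10 / 1 -> 10.0
            fields += [str(total), str(total / count), str(count)]
        lines.append(",".join(fields))
    return lines
```

Printing the result is then a matter of `print("\n".join(aggregate(header, rows, group_col)))`.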
Sample Input
A
mean
8
A,B,C
1,5,10
2,6,12
2,7,12
4,8,14
5,,15
6,6,16
6,7,16

Sample Output
A,B_sum,B_mean,B_count,C_sum,C_mean,C_count
1,5,5.0,1,10,10.0,1
2,13,6.5,2,24,12.0,2
4,8,8.0,1,14,14.0,1
5,6.5,6.5,1,15,15.0,1
6,13,6.5,2,32,16.0,2
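
As a check on the example data above: the only missing field is B in the row with A = 5. With the mean strategy, and assuming the fill value is computed over the whole column, its replacement is the mean of the six non-missing B values:

\[
\text{mean}_B = \frac{5 + 6 + 7 + 8 + 6 + 7}{6} = \frac{39}{6} = 6.5
\]

That row is the only member of the group with key 5, so its B sum and B mean are both 6.5 with a count of 1, which matches the output row for key 5.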