#C12380. Balancing Imbalanced Datasets with Resampling Strategies
Many machine learning datasets suffer from class imbalance, where one class significantly outnumbers the other. This imbalance can bias a model toward the majority class, so that it performs poorly on the minority class. In this problem, you are given a feature matrix \(X\) and a label vector \(y\) (with labels 0 and 1). Your task is to implement two resampling strategies:
- Upsampling (Over-sampling): Increase the number of samples in the minority class by duplicating samples (with replacement) to match the count of the majority class.
- Downsampling (Under-sampling): Decrease the number of samples in the majority class by randomly removing samples (without replacement) to match the count of the minority class.
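The two strategies above can be sketched with Python's standard `random` module. This is a minimal illustration, not a reference solution: the helper names `split_by_class`, `upsample`, and `downsample` are hypothetical, the sketch assumes both classes are present, and it reads "duplicating with replacement" as keeping every original minority sample and drawing only the shortfall (drawing the full majority count with replacement would be an equally plausible reading and gives the same class counts).

```python
import random


def split_by_class(y):
    """Return the sample indices of class 0 and of class 1."""
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    return idx0, idx1


def upsample(X, y, rng):
    """Grow the minority class to the majority count by duplicating samples."""
    idx0, idx1 = split_by_class(y)
    minority, majority = sorted((idx0, idx1), key=len)
    # Assumption: the minority class is non-empty (the statement does not
    # say what to do when a class is missing entirely).
    # random.choices draws WITH replacement, so duplicates are allowed.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    keep = majority + minority + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def downsample(X, y, rng):
    """Shrink the majority class to the minority count by random removal."""
    idx0, idx1 = split_by_class(y)
    minority, majority = sorted((idx0, idx1), key=len)
    # random.sample draws WITHOUT replacement, so no sample is kept twice.
    kept_majority = rng.sample(majority, k=len(minority))
    keep = sorted(minority + kept_majority)  # preserve the original order
    return [X[i] for i in keep], [y[i] for i in keep]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the resampling reproducible. Which rows are chosen varies with the seed, but the resulting class counts do not.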
For both methods, the balanced dataset should have equal numbers of samples from each class. More formally, let \(n_0\) and \(n_1\) be the numbers of samples in class 0 and class 1, respectively:
- For upsampling, if \(n_0 > n_1\), the minority class (with \(n_1\) samples) is increased to \(n_0\) samples, so the total becomes \(2n_0\). If \(n_1 > n_0\), class 0 is upsampled to \(n_1\) samples instead, and the total becomes \(2n_1\).
- For downsampling, if \(n_0 > n_1\), the majority class is reduced to \(n_1\) samples, so the total becomes \(2n_1\). If \(n_1 > n_0\), class 1 is reduced to \(n_0\) samples instead, and the total becomes \(2n_0\).
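The arithmetic in these two rules reduces to taking the larger class count for upsampling and the smaller one for downsampling. A small sketch, in which the function name `balanced_counts` is a hypothetical helper and not part of the problem statement:

```python
def balanced_counts(n0, n1, mode):
    """Return (total, class-0 count, class-1 count) after rebalancing.

    mode 1 = upsampling, mode 2 = downsampling, as in the input format.
    """
    if mode == 1:
        target = max(n0, n1)  # minority grows to the majority count
    elif mode == 2:
        target = min(n0, n1)  # majority shrinks to the minority count
    else:
        raise ValueError("mode must be 1 (upsampling) or 2 (downsampling)")
    return 2 * target, target, target
```

For example, with \(n_0 = 7\) and \(n_1 = 3\), `balanced_counts(7, 3, 1)` gives `(14, 7, 7)` and `balanced_counts(7, 3, 2)` gives `(6, 3, 3)`. When the classes are already equal, both modes leave the counts unchanged.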
You need to implement a program that reads the dataset and the desired resampling mode from standard input, performs the rebalancing, and outputs the total number of samples in the balanced dataset along with the counts of class 0 and class 1.
Input Format
The input is given via standard input (stdin) and has the following format:
- The first line contains an integer indicating the mode: 1 for upsampling and 2 for downsampling.
- The second line contains two integers \(n\) and \(m\), where \(n\) is the number of samples and \(m\) is the number of features.
- The next \(n\) lines each contain \(m\) space-separated floating-point numbers representing the features of each sample.
- The last line contains \(n\) space-separated integers representing the labels (each label is either 0 or 1).
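One way to read this layout is to split the whole input on whitespace and consume tokens in order, which avoids depending on exact line breaks. The sketch below assumes well-formed input, and `parse_input` is a hypothetical name:

```python
def parse_input(text):
    """Parse the problem input into (mode, X, y).

    Assumes the input is well formed: a mode, then n and m, then n * m
    feature values, then n labels.
    """
    tokens = text.split()
    mode = int(tokens[0])
    n, m = int(tokens[1]), int(tokens[2])
    pos = 3
    X = []
    for _ in range(n):
        # Each sample contributes m consecutive feature tokens.
        X.append([float(t) for t in tokens[pos:pos + m]])
        pos += m
    # The remaining n tokens are the class labels.
    y = [int(t) for t in tokens[pos:pos + n]]
    return mode, X, y
```

Calling `parse_input(sys.stdin.read())` in a full program would return the mode, a list of `n` feature rows, and the list of `n` labels.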
Output Format
Print three space-separated integers on a single line to standard output (stdout), in this order:
- The total number of samples in the balanced dataset.
- The number of samples in class 0.
- The number of samples in class 1.
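Because only these three counts are printed, a program can produce the required output from the label counts alone, without materializing the resampled rows. The following is a sketch under that observation, not an official solution. The name `solve` is hypothetical, and the behavior when one class is entirely absent is an assumption, since the statement does not cover that case.

```python
import sys


def solve(text):
    """Return the output line for one problem input given as a string."""
    tokens = text.split()
    mode = int(tokens[0])
    n, m = int(tokens[1]), int(tokens[2])
    # Skip the n * m feature values; the labels are the n tokens after them.
    start = 3 + n * m
    labels = [int(t) for t in tokens[start:start + n]]
    n0 = labels.count(0)
    n1 = len(labels) - n0
    # Mode 1 (upsampling) targets the larger class, mode 2 the smaller one.
    # Assumption: if a class is absent, this simply follows the same formula.
    target = max(n0, n1) if mode == 1 else min(n0, n1)
    return f"{2 * target} {target} {target}"


if __name__ == "__main__":
    data = sys.stdin.read()
    if data.strip():  # do nothing on empty input
        print(solve(data))
```

If the resampled data itself were needed later (for training, say), the feature rows would have to be duplicated or removed explicitly; the shortcut here works only because the output is limited to counts.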
Sample Input
1
10 1
0
0
0
0
1
1
0
0
0
1
0 0 0 0 1 1 0 0 0 1
Sample Output
14 7 7