#C13837. Synthetic Dataset Generation for Binary Classification

    ID: 43419 Type: Default 1000ms 256MiB

Synthetic Dataset Generation for Binary Classification

Synthetic Dataset Generation for Binary Classification

You are given a task to generate a synthetic dataset for binary classification. In this problem, you are provided with four parameters: (n) (the number of samples), (m) (the number of features per sample), (\text{class_balance}) (the proportion of samples belonging to class 1), and (r) (the noise percentage).

The dataset is generated as follows:

  • Use a fixed random seed of 42 for reproducibility.
  • Create a label vector \(y\) with \(n_1 = \lfloor n \times \text{class\_balance} \rfloor\) ones (representing class 1) and \(n - n_1\) zeros (representing class 0), then shuffle \(y\) randomly.
  • Generate an \(n \times m\) feature matrix \(X\) where each element is sampled from the standard normal distribution \(\mathcal{N}(0,1)\).
  • If \(r > 0\), add noise to \(X\) as follows: for each element, add a noise value computed as \(\text{noise} = r \times \sigma\) where \(\sigma\) is the standard deviation of the original \(X\) (i.e. \(\sigma = \sqrt{\frac{1}{nm}\sum_{i,j}(X_{ij}-\mu)^2}\) with \(\mu\) being the mean of \(X\)).
Your program should then output the generated feature matrix and label vector. The feature matrix should be printed as \(n\) lines, each containing \(m\) floating-point numbers rounded to 6 decimal places. Finally, output one line with \(n\) space-separated labels (0 or 1).

inputFormat

The input consists of a single line containing four space-separated values:

  • \(n\) (number of samples)
  • \(m\) (number of features per sample)
  • \(\text{class\_balance}\) (a float representing the proportion of class 1 samples)
  • \(r\) (noise percentage)
For example: 2 1 0.5 0.0

outputFormat

Output the synthetic dataset as follows:

  • Print \(n\) lines, each line containing \(m\) space-separated floating-point numbers representing the features of each sample. Each number should be rounded to 6 decimal places.
  • Print one additional line containing \(n\) space-separated integers (0 or 1), which are the labels for the dataset.
For example, for input 2 1 0.5 0.0, a valid output is:
0.496714
-0.138264
0 1
## sample
2 1 0.5 0.0
0.496714

-0.138264 0 1

</p>