#C13837. Synthetic Dataset Generation for Binary Classification
Your task is to generate a synthetic dataset for binary classification from four parameters: \(n\) (the number of samples), \(m\) (the number of features per sample), \(\text{class\_balance}\) (the proportion of samples belonging to class 1), and \(r\) (the noise percentage).
The dataset is generated as follows:
- Use a fixed random seed of 42 for reproducibility.
- Create a label vector \(y\) with \(n_1 = \lfloor n \times \text{class\_balance} \rfloor\) ones (representing class 1) and \(n - n_1\) zeros (representing class 0), then shuffle \(y\) randomly.
- Generate an \(n \times m\) feature matrix \(X\) where each element is sampled from the standard normal distribution \(\mathcal{N}(0,1)\).
- If \(r > 0\), add noise to \(X\): add the value \(\text{noise} = r \times \sigma\) to every element, where \(\sigma\) is the population standard deviation of the original \(X\), i.e. \(\sigma = \sqrt{\frac{1}{nm}\sum_{i,j}(X_{ij}-\mu)^2}\) with \(\mu\) the mean of all elements of \(X\). Because \(r\) and \(\sigma\) are scalars, the same offset is added to each element.
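The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function name `generate_dataset` is hypothetical, it uses the standard-library `random` module, and it draws the shuffle before the features because that is the order in which the steps are listed. The exact numbers depend on the random number generator and on the order of the draws, so this sketch is not expected to reproduce the sample values.

```python
import math
import random


def generate_dataset(n, m, class_balance, r, seed=42):
    """Return (X, y) following the generation steps listed above.

    Hypothetical helper: the name and the use of the stdlib `random`
    module are assumptions made for illustration.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility

    # Labels: floor(n * class_balance) ones, the rest zeros, then shuffled.
    n1 = math.floor(n * class_balance)
    y = [1] * n1 + [0] * (n - n1)
    rng.shuffle(y)

    # Features: n x m matrix of standard normal draws.
    X = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]

    if r > 0 and n * m > 0:
        # Population standard deviation over all n*m entries of the original X.
        values = [v for row in X for v in row]
        mu = sum(values) / len(values)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
        offset = r * sigma  # the same value is added to every element
        X = [[v + offset for v in row] for row in X]

    return X, y
```

Calling `generate_dataset(10, 3, 0.3, 0.0)` twice returns identical results, since each call builds its own generator from the fixed seed.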
## inputFormat
The input consists of a single line containing four space-separated values:
- \(n\) (number of samples)
- \(m\) (number of features per sample)
- \(\text{class\_balance}\) (a float representing the proportion of class 1 samples)
- \(r\) (noise percentage)
For example: 2 1 0.5 0.0
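Reading this input could look like the sketch below. The helper name `parse_params` is hypothetical, and the sketch assumes that \(n\) and \(m\) are integers while the other two values are floats.

```python
def parse_params(line):
    """Split one input line into (n, m, class_balance, r)."""
    n_str, m_str, balance_str, r_str = line.split()
    return int(n_str), int(m_str), float(balance_str), float(r_str)


print(parse_params("2 1 0.5 0.0"))  # prints (2, 1, 0.5, 0.0)
```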
## outputFormat
Output the synthetic dataset as follows:
- Print \(n\) lines, each line containing \(m\) space-separated floating-point numbers representing the features of each sample. Each number should be rounded to 6 decimal places.
- Print one additional line containing \(n\) space-separated integers (0 or 1), which are the labels for the dataset.
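The output layout described above can be sketched as follows. The helper name `format_dataset` is hypothetical, and the values passed in are the ones from the sample in this statement, not freshly generated data.

```python
def format_dataset(X, y):
    """Render one line per sample (6 decimal places), then one line of labels."""
    lines = [" ".join(f"{v:.6f}" for v in row) for row in X]
    lines.append(" ".join(str(label) for label in y))
    return "\n".join(lines)


print(format_dataset([[0.496714], [-0.138264]], [0, 1]))
# prints:
# 0.496714
# -0.138264
# 0 1
```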
For example, for the input 2 1 0.5 0.0, a valid output is:
0.496714
-0.138264
0 1
## sample
2 1 0.5 0.0
0.496714
-0.138264
0 1