#C13167. Housing Price Prediction with Linear Regression
You are given a dataset representing housing features and prices. The dataset is in CSV format with a header row. The first input line is the target column name (for example, price), and the second line is an integer m indicating the number of data rows. The third line contains the CSV header and the following m lines contain the data rows. Each row contains numerical values separated by commas. One of the columns represents the target variable (house price) and the rest are features (for example, number of bedrooms, size in square feet, proximity to a landmark, etc.).
Your task is to perform a linear regression on the training set and evaluate the model on the test set. Use the following procedure:
- Let n be the number of data rows (n = m, the integer on the second input line). Set the test set size to k = max(1, floor(0.2 * n)). The training set consists of the first n - k data rows, and the test set consists of the last k rows.
- Construct the training matrix X by taking all features from the training rows and adding a column of ones to account for the intercept. Let y be the vector of target values from the training set.
- Solve for the regression coefficients using the normal equation: $$\beta = (X^T X)^{-1} X^T y.$$
- For each test sample, compute its predicted target value \(\hat{y}\) using the computed coefficients.
- Compute the Root Mean Squared Error (RMSE) as $$RMSE = \sqrt{\frac{1}{k}\sum_{i=1}^{k}(y_i - \hat{y}_i)^2}.$$
- Compute the coefficient of determination (R²) using $$R^2 = 1 - \frac{\sum_{i=1}^{k}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{k}(y_i - \bar{y})^2},$$ where \(\bar{y}\) is the mean of the test target values. If the denominator is 0, define \(R^2 = 1\).
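The split and the normal equation from the steps above can be sketched as follows. This is a minimal illustration rather than a reference solution: the helper names `split_rows` and `solve_normal_equation` are hypothetical, the intercept column is assumed to be appended last, and \(X^T X\) is assumed to be invertible. The system \((X^T X)\beta = X^T y\) is solved by Gauss-Jordan elimination instead of forming the inverse explicitly. Using `fractions.Fraction` keeps the arithmetic exact, which fits the guarantee of a perfect linear relation.

```python
from fractions import Fraction


def split_rows(rows):
    """Return (train, test) with k = max(1, floor(0.2 * n)) test rows."""
    n = len(rows)
    k = max(1, n // 5)  # integer form of floor(0.2 * n)
    return rows[:n - k], rows[n - k:]


def solve_normal_equation(X, y):
    """Solve (X^T X) beta = X^T y by Gauss-Jordan elimination.

    Assumes X^T X is invertible; otherwise next() raises StopIteration.
    """
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, p) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        # Scale the pivot row so the pivot becomes 1.
        A[col] = [v / A[col][col] for v in A[col]]
        # Eliminate this column from every other row.
        for r in range(p):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Usage: features followed by a trailing 1 for the intercept.
X = [[Fraction(1), Fraction(1)], [Fraction(2), Fraction(1)], [Fraction(3), Fraction(1)]]
y = [Fraction(5), Fraction(7), Fraction(9)]
beta = solve_normal_equation(X, y)  # slope 2, intercept 3
```

A prediction for a test row is then the dot product of its features (with the trailing 1) and `beta`.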
Print the RMSE and R² in one line separated by a space. It is guaranteed that the input data follows a perfect linear relation so that the computed prediction on the test set will match the actual target values exactly (i.e. RMSE = 0 and R² = 1). Your submitted solution must read input from stdin and write output to stdout.
Note: All formulas are expressed in LaTeX format.
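The RMSE and R² definitions from the procedure, including the zero-denominator convention, might be computed as in the sketch below. The function name `rmse_and_r2` is hypothetical, and the inputs are assumed to be equal-length, non-empty sequences of numbers.

```python
import math


def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) over the k test samples."""
    k = len(y_true)
    # Residual sum of squares: sum of (y_i - y_hat_i)^2.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the test targets.
    mean = sum(y_true) / k
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / k)
    # Convention from the statement: R^2 = 1 when the denominator is 0.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return rmse, r2
```

When k = 1, which happens for every n below 10, the single test target equals its own mean, so the denominator is 0 and R² is 1 by the convention.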
## inputFormat
The input has the following format:
- A line containing the target column name (a string).
- A line containing an integer m, the number of data rows.
- A CSV header line with column names separated by commas.
- m lines of CSV data, each line containing numerical values separated by commas.
You can assume that all numbers are valid and the CSV columns align with the header. The target column is one of the header fields.
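One way to read this format is sketched below. The function name `parse_input` is hypothetical. It assumes plain comma-separated fields with no quoting, and it skips blank lines as a defensive choice that the statement does not require.

```python
def parse_input(text):
    """Split the raw input into feature rows and target values."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    target = lines[0]                      # target column name
    m = int(lines[1])                      # number of data rows
    header = [h.strip() for h in lines[2].split(",")]
    t = header.index(target)               # position of the target column
    features, targets = [], []
    for ln in lines[3:3 + m]:
        values = [float(v) for v in ln.split(",")]
        targets.append(values[t])
        # Every column except the target is a feature.
        features.append(values[:t] + values[t + 1:])
    return features, targets
```

In a submission the text would come from `sys.stdin.read()`. The target column can sit anywhere in the header, so the sketch locates it by name rather than assuming it is last.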
## outputFormat
Print one line containing two numbers: the RMSE and the coefficient of determination (R²), separated by a space. The values should be printed with 6 decimal places.
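The required formatting can be produced with a fixed-point format specifier. The helper name `format_result` below is hypothetical.

```python
def format_result(rmse, r2):
    """Format both metrics with exactly 6 decimal places."""
    return f"{rmse:.6f} {r2:.6f}"


print(format_result(0.0, 1.0))
```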
## sample
price
6
bedrooms,size_in_sqft,proximity_to_landmark,price
2,1500,10,250000
3,2000,8,370000
4,2500,7,480000
3,1800,9,340000
5,3000,6,590000
4,2200,8,440000
0.000000 1.000000
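
As a check on the sample, assuming the coefficients are ordered as in the header with the intercept last: m = 6 gives k = max(1, floor(1.2)) = 1, so the first five rows form the training set and the last row is the only test sample. The five training rows are fit exactly by

$$\beta = (50000,\ 100,\ -10000,\ 100000)^T,$$

and the prediction for the test row (4, 2200, 8) is

$$\hat{y} = 50000 \cdot 4 + 100 \cdot 2200 - 10000 \cdot 8 + 100000 = 440000,$$

which equals the actual price, so RMSE = 0. With a single test sample the R² denominator is 0, so R² = 1 by the stated convention.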