#C14500. House Prices: Linear Regression Analysis
In this problem, you are given a dataset of housing information in CSV format. First, clean the data by filling missing numerical values with the column mean and missing categorical values with the column mode. Then apply one-hot encoding to all categorical features and split the dataset into training and testing sets using an 80-20 split. Train a linear regression model on the training set to predict house prices (the SalePrice column). Finally, evaluate the model by computing the Mean Squared Error (MSE) and the R-squared score, defined as \( \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \) and \( R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \), where \( y_i \) is the actual price, \( \hat{y}_i \) is the predicted price, \( \bar{y} \) is the mean of the actual prices, and \( n \) is the number of evaluated records.
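The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference solution: it assumes pandas and scikit-learn are available, and the function name `evaluate`, the `seed` parameter, the shuffled split, and scoring on the held-out 20% are all assumptions, since the statement does not fix any of them.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(csv_text: str, seed: int = 42) -> tuple[float, float]:
    """Clean, encode, split, fit, and score. Returns (mse, r2)."""
    df = pd.read_csv(io.StringIO(csv_text))

    # Fill gaps: mean for numeric columns, mode for all other columns.
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            mode = df[col].mode()
            if not mode.empty:
                df[col] = df[col].fillna(mode.iloc[0])

    # One-hot encode categorical features; numeric columns pass through.
    X = pd.get_dummies(df.drop(columns=["SalePrice"]), dtype=float)
    y = df["SalePrice"]

    # 80-20 split. Shuffling with a fixed seed is an assumption.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)

    # Scoring on the held-out rows is an assumption of this sketch.
    return float(mean_squared_error(y_test, pred)), float(r2_score(y_test, pred))
```

To run it as a stdin/stdout program, one could add `import sys` and call `print(*evaluate(sys.stdin.read()))`. On very small inputs the test split may contain a single row, in which case the denominator of \( R^2 \) is zero and the score is undefined.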
## Input Format
The input is provided via standard input (stdin) as CSV-formatted text. The first line is the header containing the column names, and each subsequent line is one data record. The target column is always 'SalePrice'.
## Output Format
Print two space-separated floating-point numbers to standard output (stdout): first the Mean Squared Error (MSE), then the R-squared score. For example: 0.0 1.0
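Both output values follow directly from the MSE and \( R^2 \) formulas in the statement. As a small worked example, the sketch below computes them in plain Python and prints them in the required layout. The helper names `mse` and `r_squared` and the three value pairs are illustrative assumptions, and the default float formatting is also an assumption, since the statement does not specify a precision.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; undefined when all actual values are equal."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values: predictions that match the actual prices exactly.
y_true = [208500.0, 181500.0, 223500.0]
y_pred = [208500.0, 181500.0, 223500.0]

print(f"{mse(y_true, y_pred)} {r_squared(y_true, y_pred)}")  # prints "0.0 1.0"
```

A perfect fit gives a zero residual sum, hence an MSE of 0.0 and an \( R^2 \) of 1.0.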
## sample
OverallQual,GrLivArea,GarageCars,GarageArea,TotalBsmtSF,1stFlrSF,FullBath,YearBuilt,SalePrice
7,1710,2,548,856,856,2,2003,208500
6,1262,2,460,1262,1262,2,1976,181500
7,1786,2,608,920,920,2,2001,223500
7,1717,3,642,756,756,1,1915,140000
8,2198,3,836,1145,1145,2,2000,250000
0.0 1.0