Project 3: Linear Regression
The goal of this project was to answer one main question: “What features most strongly influence a home’s sale price, and how accurately can we predict that price using linear regression?”
To explore this, I used the House Prices housing dataset from Kaggle, which contains information about nearly 1,500 homes. Each home comes with many details, such as square footage, number of bedrooms and bathrooms, neighborhood, and year built. My goal wasn’t just to predict prices, but to see which factors really matter and how much a simple linear model could learn from them.
Looking at the Data
The first thing I did was take a look at the dataset to understand what I was working with. Each row represented a home, and the target column “SalePrice” was what I wanted to predict. Some homes were small and affordable, while others were large and luxurious. When I plotted the prices, I noticed there were a few very expensive houses pulling the average upward.
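To make the "pulling the average upward" point concrete, here is a minimal sketch using made-up prices (not values from the dataset). It compares the mean and the median when a couple of luxury homes sit far above the rest.

```python
from statistics import mean, median

# Hypothetical sale prices in dollars: mostly mid-range homes plus two luxury outliers.
prices = [120_000, 135_000, 150_000, 160_000, 175_000,
          180_000, 195_000, 210_000, 620_000, 755_000]

avg = mean(prices)
mid = median(prices)

print(f"mean:   {avg:,.0f}")    # mean:   270,000
print(f"median: {mid:,.0f}")    # median: 177,500
print(f"mean exceeds median by {avg - mid:,.0f}")
```

The same comparison on the real data would be df["SalePrice"].mean() against df["SalePrice"].median(). A mean well above the median is a quick sign of right skew.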
I also checked for missing values. Some homes didn’t have garages or basements, so certain columns were blank. The features were also on very different scales: lot area was in the thousands, while quality ratings were single digits. This gave me an early idea of what kind of data cleaning would be needed before modeling.
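# Imports used throughout the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Loading the data
df = pd.read_csv('train.csv')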
print(df.shape)
df.head()
TARGET = "SalePrice"
y = df[TARGET].copy()
X = df.drop(columns=[TARGET])
# Getting a glimpse of the missing data
nulls = X.isna().sum().sort_values(ascending=False)
nulls_head = nulls[nulls > 0].head(20)
nulls_head
plt.figure(figsize=(10, 5))
sns.barplot(x=nulls_head.values, y=nulls_head.index)
plt.title("Top 20 Features by Missing Values")
plt.xlabel("Count Missing"); plt.ylabel("Feature")
plt.tight_layout()
plt.show()

Preprocessing
Before any model could be built, the data needed to be cleaned and transformed. I used three main preprocessing steps:
- Handling missing values – For numeric columns, I filled missing values with the median since it’s less affected by extreme outliers. For categorical columns, I used the most common category so that missing labels wouldn’t cause errors.
- Encoding categories – Many columns, like “Neighborhood”, contained text labels. Models can’t interpret those directly, so I used one-hot encoding to turn them into numeric yes/no columns.
- Scaling numeric values – Since some features had much larger numbers than others, I standardized all numeric features. This step makes sure that no single feature takes over the model just because it’s on a bigger scale.
# Splitting the data into numeric and categorical data types
# so I can process and then combine later for training
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(exclude=[np.number]).columns.tolist()
numeric_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler(with_mean=True, with_std=True)),
])
categorical_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])
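To show what the numeric steps amount to, here is a small by-hand sketch in NumPy. The lot areas are made up for illustration, and this is only the arithmetic behind median imputation and standardization, not a replacement for the pipeline.

```python
import numpy as np

# Hypothetical lot areas (square feet) with one missing entry.
lot_area = np.array([8450.0, 9600.0, np.nan, 11250.0, 14260.0])

# Step 1: median imputation. nanmedian ignores the missing entry.
fill_value = np.nanmedian(lot_area)
imputed = np.where(np.isnan(lot_area), fill_value, lot_area)

# Step 2: standardization, z = (x - mean) / std, using the population
# standard deviation (ddof=0), which is what StandardScaler uses.
z = (imputed - imputed.mean()) / imputed.std()

print("fill value:", fill_value)
print("standardized:", np.round(z, 3))
print("mean ~ 0:", np.isclose(z.mean(), 0.0), "| std ~ 1:", np.isclose(z.std(), 1.0))
```

In the real pipeline the median, mean, and standard deviation are learned when the pipeline is fit, so fitting it on the training split only keeps information from the test homes out of the preprocessing.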
Experiments
Experiment 1
My first experiment was a plain linear regression model using all the cleaned data. I didn’t apply any transformations or new features yet because I just wanted to see how well the model could do on its own.
The baseline performed reasonably well but had one clear problem: it tended to underestimate expensive houses and overestimate cheaper ones. That told me the model wasn’t handling the uneven distribution of prices very well. This first step gave me a baseline score to compare against future experiments.
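# RMSE helper used by all three experiments
def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model1 = Pipeline([
    ("prep", preprocessor),
    ("model", LinearRegression()),
])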
model1.fit(X_train, y_train)
pred1 = model1.predict(X_test)
rmse1 = rmse(y_test, pred1)
print(f"Experiment 1 RMSE: {rmse1:,.2f}")
# Experiment 1 RMSE: 29,475.33
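One way to check the "underestimates expensive homes, overestimates cheap ones" pattern is to average the residuals within price bands. The sketch below does that with a hypothetical helper, mean_residual_by_band, and made-up numbers chosen to mimic the pattern. They are not the project's actual predictions.

```python
import numpy as np

def mean_residual_by_band(y_true, y_pred, n_bands=4):
    """Average (actual - predicted) within equal-sized bands of actual price, cheapest first."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_true)
    residuals = (y_true - y_pred)[order]
    return [float(band.mean()) for band in np.array_split(residuals, n_bands)]

# Hypothetical values for illustration only.
actual = np.array([90_000, 110_000, 150_000, 170_000, 210_000, 240_000, 380_000, 450_000])
predicted = np.array([105_000, 118_000, 152_000, 168_000, 205_000, 236_000, 340_000, 395_000])

bands = mean_residual_by_band(actual, predicted)
for i, r in enumerate(bands, start=1):
    print(f"band {i}: mean residual = {r:+,.0f}")
```

A negative mean residual means the band is over-predicted, and a positive one means it is under-predicted. On the real data the call would be mean_residual_by_band(y_test, pred1).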
Experiment 2
For my second experiment, I made just one small change. I applied a logarithmic transformation to the target variable: instead of predicting prices directly, the model predicted the log of the price.
This adjustment helps when the target values are skewed, because it compresses the gap between small and large numbers. Essentially, it lets the model focus on percentage differences instead of absolute dollar differences. After transforming the target, test RMSE dropped from 29,475.33 to 22,743.88, and the predictions became more balanced, meaning the model no longer struggled as much with higher-priced homes.
y_log = np.log1p(y)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y_log, test_size=0.2, random_state=42)
model2 = Pipeline([
    ("prep", preprocessor),
    ("model", LinearRegression()),
])
model2.fit(X_train2, y_train2)
pred2 = np.expm1(model2.predict(X_test2))
# Score in dollars against this experiment's own held-out targets
rmse2 = rmse(np.expm1(y_test2), pred2)
print(f"Experiment 2 RMSE (log target): {rmse2:,.2f}")
# Experiment 2 RMSE (log target): 22,743.88
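The "percentage changes instead of absolute differences" idea can be seen in a tiny example. The sketch below uses two made-up homes, each over-predicted by 10%, and compares the error in dollars with the error in log space. It also checks that np.expm1 undoes np.log1p.

```python
import numpy as np

# Two hypothetical homes, each over-predicted by 10%.
actual = np.array([100_000.0, 500_000.0])
predicted = actual * 1.10

# In dollars, the second error is five times the first...
dollar_error = predicted - actual
# ...but in log space both errors are almost exactly log(1.1),
# because log(1.1 * p) - log(p) = log(1.1) regardless of p.
log_error = np.log1p(predicted) - np.log1p(actual)

# expm1 is the inverse of log1p, so log-scale values map back to dollars.
round_trip = np.expm1(np.log1p(actual))

print("dollar errors:", dollar_error)
print("log errors:   ", np.round(log_error, 4))
print("round trip ok:", np.allclose(round_trip, actual))
```

This is why a model trained on the log target treats a 10% miss on a cheap home and a 10% miss on an expensive home as roughly equally bad.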
Experiment 3
For my final experiment, I added a single, simple feature called TotalSF, which stands for total square footage. I calculated it by adding up the first-floor, second-floor, and basement areas for each home.
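This new feature is meant to give the model a better sense of a home’s overall size, which is usually a strong predictor of price. One caveat: the RMSE I recorded for this experiment is identical to Experiment 2’s (22,743.88), so this run does not show an improvement. The cause is that the preprocessor was built from the original column lists, and ColumnTransformer drops any column it was not given (remainder="drop" is its default), so TotalSF never reached the model. The code below rebuilds the preprocessor so the new column is included. Keep in mind that TotalSF is an exact sum of three columns the model already has, so a plain linear regression may still gain little from it; the experiment has to be re-run to find out.
X3 = X.copy()
X3["TotalSF"] = X3["1stFlrSF"] + X3["2ndFlrSF"].fillna(0) + X3["TotalBsmtSF"].fillna(0)
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y_log, test_size=0.2, random_state=42)
# ColumnTransformer drops columns that are not listed, so TotalSF must be
# added to the numeric list or it never reaches the model
preprocessor3 = ColumnTransformer(transformers=[
    ("num", numeric_pipe, numeric_features + ["TotalSF"]),
    ("cat", categorical_pipe, categorical_features),
])
model3 = Pipeline([
    ("prep", preprocessor3),
    ("model", LinearRegression()),
])
model3.fit(X_train3, y_train3)
pred3 = np.expm1(model3.predict(X_test3))
# Score in dollars against this experiment's own held-out targets
rmse3 = rmse(np.expm1(y_test3), pred3)
print(f"Experiment 3 RMSE (with TotalSF): {rmse3:,.2f}")
# Recorded before the preprocessor fix (TotalSF was dropped):
# Experiment 3 RMSE (with TotalSF): 22,743.88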
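One thing worth keeping in mind about a feature like TotalSF: it is an exact sum of columns the model already sees, assuming the component columns stay in the feature set, as they do here. For ordinary least squares, a column that is a linear combination of existing columns adds no new information. The sketch below illustrates this with synthetic data (not the housing dataset) and a hypothetical fit_predict helper built on np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept, solved via np.linalg.lstsq."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

def with_total(M):
    """Append the row-wise sum as an extra column, analogous to TotalSF."""
    return np.column_stack([M, M.sum(axis=1)])

# Synthetic stand-ins for first-floor, second-floor, and basement area
# (in thousands of square feet), with price in thousands of dollars.
X = rng.uniform(0, 2, size=(200, 3))
y = 50 * X[:, 0] + 40 * X[:, 1] + 30 * X[:, 2] + rng.normal(0, 5, size=200)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

pred_without = fit_predict(X_train, y_train, X_test)
pred_with = fit_predict(with_total(X_train), y_train, with_total(X_test))

print("max prediction difference:", np.abs(pred_with - pred_without).max())
```

The two sets of predictions agree up to floating-point noise. A summed feature is more likely to help when it replaces its components, or when the model is regularized or nonlinear. Those are variations this project did not try.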
What I Learned
Through these three simple experiments, I learned that good modeling is about curiosity and attention to detail. Each small change taught me something: cleaning the data prevented errors, transforming the target made the predictions fairer, and engineering a feature like TotalSF showed me how to put what I know about homes into the model’s inputs.
By the end, my model could predict home prices fairly accurately, with a test RMSE of about $22,700, using nothing more than linear regression. More importantly, I gained a deeper understanding of what really drives a home’s value. Even the simplest model can tell a meaningful story when you take the time to build it step by step. This project also shows how prediction models can make real estate data more transparent and accessible for everyone.
Dataset Link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv
GitHub Link: https://github.com/aborland123/project3house
