Project 2: Can a Model Detect What Is Spam and What Is Not?
For this project, I wanted to answer a question I’ve always been curious about: how do email providers or phones actually know when a message is spam? I get random texts and emails that look sketchy, and somehow Gmail or my phone filters them before I even notice. That made this project both practical and interesting to me. The goal was to build a classification model that could take in a text message and decide whether it was spam or not.
I went looking for datasets and landed on the SMS Spam Collection Dataset (https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset). It has about 5,500 text messages labeled as either “spam” or “ham” (not spam). Each row is pretty simple with just the label and the actual message.
Before training, I cleaned up the data by dropping any null values and converting the labels into numbers (spam as 1, ham as 0). Then I split the data into training and testing sets using an 80/20 split, stratified by label so that both sets kept the same ratio of spam to ham. Since machine learning models can’t directly work with raw text, I used TF-IDF vectorization to transform the words into numerical features. This step was crucial because it gives more weight to words that are distinctive to a message and less weight to words that show up in almost every message. Spam was the minority class in the dataset, so I also applied balanced class weights for Logistic Regression.
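To make the TF-IDF idea above concrete, here is a minimal pure-Python sketch on three made-up messages. It is not the code used in the project (that relies on scikit-learn's TfidfVectorizer); the function name `tfidf`, the whitespace tokenization, and the toy messages are all assumptions for illustration. The smoothed IDF formula and L2 normalization are modeled on scikit-learn's defaults as I understand them.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document, using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many messages does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
               for w, c in counts.items()}
        # L2-normalize so long and short messages are comparable
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({w: v / norm for w, v in raw.items()})
    return weights

# Hypothetical messages, not taken from the dataset
toy_messages = [
    "win a free prize now",
    "are you free now",
    "see you at home now",
]
toy_weights = tfidf(toy_messages)
print(toy_weights[0])
```

In the first message, "now" appears in every message and ends up with the lowest weight, "free" appears in two and sits in the middle, and "prize" appears only once and gets the highest weight.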
For modeling, I decided to try two algorithms: Multinomial Naive Bayes and Logistic Regression. Naive Bayes is known for being fast and effective on text data, even though it assumes that, within each class, words appear independently of one another. Logistic Regression, on the other hand, learns a weight for each word and often does a better job when the data has more nuance. Both are widely used as baselines in text classification, so I thought it would be interesting to compare them.
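The independence assumption mentioned above is easier to see in code. The following is a toy multinomial Naive Bayes written from scratch, not scikit-learn's MultinomialNB and not the model trained in this project. The function names `train_nb` and `predict_nb`, the four messages, and the smoothing value are assumptions made up for the sketch.

```python
import math
from collections import Counter

def train_nb(messages, labels, alpha=1.0):
    """Fit a tiny multinomial Naive Bayes on whitespace-split words."""
    vocab = {w for m in messages for w in m.lower().split()}
    model = {}
    for cls in set(labels):
        cls_msgs = [m for m, y in zip(messages, labels) if y == cls]
        counts = Counter(w for m in cls_msgs for w in m.lower().split())
        total = sum(counts.values())
        model[cls] = {
            "log_prior": math.log(len(cls_msgs) / len(messages)),
            # Additive smoothing keeps an unseen word from zeroing out a class
            "log_lik": {
                w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                for w in vocab
            },
        }
    return model

def predict_nb(model, message):
    """Pick the class with the highest summed log-probability."""
    scores = {}
    for cls, params in model.items():
        # Independence assumption: every word adds its own term,
        # no matter which other words appear next to it.
        scores[cls] = params["log_prior"] + sum(
            params["log_lik"][w]
            for w in message.lower().split()
            if w in params["log_lik"]  # words outside the vocabulary are skipped
        )
    return max(scores, key=scores.get)

# Hypothetical training messages, not taken from the dataset
toy_messages = ["free prize call now", "claim free prize",
                "see you at home", "call me later ok"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_model = train_nb(toy_messages, toy_labels)
toy_prediction = predict_nb(toy_model, "free prize")
print(toy_prediction)
```

Because each word contributes separately, the model cannot learn that a pair of words together means something different from the two words apart, which is one reason Logistic Regression with bigram features can pick up more nuance.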
# The imports for the entire project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    average_precision_score
)
import seaborn as sns
import matplotlib.pyplot as plt
# Read in the dataset and preview the first 5 rows
# (the code below expects columns named "label" and "message")
df = pd.read_csv("spam.csv")
df.head()
# Dropping the Null Values
df = df.dropna()
# Getting a few statistics
df.describe()
df.info()
Training the Model
# Features (message text) and binary labels (spam = 1, ham = 0)
X = df["message"]
y = (df["label"].str.lower() == "spam").astype(int)
# Train/test split into 80% training/20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Pipelines
# Naive Bayes
nb = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", MultinomialNB(alpha=0.5)),
])
# Logistic Regression
lr = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000, solver="liblinear",
                               class_weight="balanced")),
])
def evaluate(name, pipe):
    """Fit the pipeline, print test-set metrics, and return the fitted pipeline."""
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"\n== {name} ==")
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
    print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    # Average precision needs a continuous score, not hard 0/1 predictions
    clf = pipe.named_steps["clf"]
    if hasattr(clf, "predict_proba"):
        y_score = pipe.predict_proba(X_test)[:, 1]
    elif hasattr(clf, "decision_function"):
        y_score = pipe.decision_function(X_test)
    else:
        y_score = None
    if y_score is not None:
        ap = average_precision_score(y_test, y_score)
        print("PR-AUC (Average Precision):", round(ap, 3))
    return pipe
# Evaluate both
nb_trained = evaluate("MultinomialNB", nb)
lr_trained = evaluate("LogisticRegression", lr)
The Result:

The results were pretty telling. Naive Bayes reached about 97.2% accuracy, but it only caught about 80% of spam messages. Logistic Regression did better on both counts, hitting 98.4% accuracy and catching 92% of spam. The accuracy gap looks small, but the recall gap is not: the share of spam slipping through drops from about 20% to about 8%. For this project I weighted missed spam more heavily than an occasional flagged normal text, so Logistic Regression felt like the more reliable choice.
I also looked at precision-recall curves and confusion matrices, and the visuals confirmed what the numbers showed: Logistic Regression was simply better at finding spam while keeping false positives low.
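The gap between accuracy and spam recall comes straight out of the confusion matrix when one class is rare. The sketch below uses hypothetical counts, NOT the project's actual confusion matrices, and the helper name `summarize` is my own.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy, spam precision, and spam recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # of messages flagged as spam, how many really were
        "recall": tp / (tp + fn),     # of real spam messages, how many got flagged
    }

# Hypothetical test set: 900 ham and 100 spam,
# where the filter misses 20 spam messages and wrongly flags 5 ham messages.
example = summarize(tn=895, fp=5, fn=20, tp=80)
print(example)
```

With these made-up counts the filter is 97.5% accurate while still missing one in five spam messages, because the 900 ham messages dominate the accuracy calculation. That is why recall and precision on the spam class are worth reporting alongside accuracy.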
Further Analysis
# Top tokens for spam vs ham (from LR)
vec = lr_trained.named_steps["tfidf"]
clf = lr_trained.named_steps["clf"]
feat = vec.get_feature_names_out()
coef = clf.coef_[0]
top_spam = feat[coef.argsort()[-20:]][::-1]
top_ham = feat[coef.argsort()[:20]]
print("\nTop spam-weighted tokens:\n", list(top_spam))
print("\nTop ham-weighted tokens:\n", list(top_ham))
Result:
Top spam-weighted tokens: ['call', 'txt', 'free', 'text', 'to', 'reply', 'uk', 'stop', 'www', 'mobile', 'claim', 'your', 'from', '150p', 'service', 'com', 'won', 'chat', 'now', '50']
Top ham-weighted tokens: ['my', 'me', 'ok', 'that', 'but', 'll', 'it', 'gt', 'lt', 'so', 'can', 'da', 'home', 'come', 'when', 'lt gt', 'at', 'later', 'how', 'then']
Plotting the data
# Get the top 20 spam and top 20 ham tokens with their weights
spam_idx = coef.argsort()[-20:][::-1]
ham_idx = coef.argsort()[:20]
spam_tokens = feat[spam_idx]
spam_weights = coef[spam_idx]
ham_tokens = feat[ham_idx]
ham_weights = coef[ham_idx]
# Use a separate name so the original dataset in df is not overwritten
tokens_df = pd.DataFrame({
    "token": list(spam_tokens) + list(ham_tokens),
    "weight": list(spam_weights) + list(ham_weights),
    "class": ["spam"] * len(spam_tokens) + ["ham"] * len(ham_tokens)
})
# Plot
plt.figure(figsize=(10, 8))
sns.barplot(
    data=tokens_df,
    y="token", x="weight", hue="class",
    dodge=False, palette={"spam": "red", "ham": "blue"}
)
plt.title("Top Tokens for Spam vs Ham (Logistic Regression)")
plt.xlabel("Coefficient Weight")
plt.ylabel("Token")
plt.tight_layout()
plt.show()

One of the most interesting parts was looking at which words the model thought were most “spammy” versus “hammy.” Spam messages were strongly associated with words like “call,” “txt,” “free,” “claim,” “stop,” and “service,” while ham messages leaned on casual, conversational words like “ok,” “my,” “home,” and “later.”
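To show how those word weights turn into a spam decision, here is a minimal sketch of the Logistic Regression scoring step. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not the coefficients learned in this project, and `spam_probability` is a made-up helper name.

```python
import math

# Hypothetical weights: positive pushes toward spam, negative toward ham
toy_weights = {"free": 2.0, "claim": 1.5, "ok": -1.0, "later": -1.2}
toy_bias = -0.5

def spam_probability(features, weights, bias):
    """Logistic regression score: sigmoid of the bias plus the weighted feature sum."""
    logit = bias + sum(weights.get(tok, 0.0) * val for tok, val in features.items())
    return 1 / (1 + math.exp(-logit))

# Feature values stand in for TF-IDF scores of the words in each message
p_spammy = spam_probability({"free": 0.7, "claim": 0.7}, toy_weights, toy_bias)
p_casual = spam_probability({"ok": 0.7, "later": 0.7}, toy_weights, toy_bias)
print(round(p_spammy, 3), round(p_casual, 3))
```

A message full of positively weighted tokens lands above 0.5 and is flagged as spam, while a message made of negatively weighted tokens lands below it.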
Overall, I learned that even simple models like Naive Bayes and Logistic Regression can do a great job of detecting spam if the text is preprocessed well. It also showed me that preprocessing is just as important as the choice of model. Without a vectorization step like TF-IDF, the algorithms would have had no numerical input to learn from, let alone a way to tell a scammy text from a casual “see you later” message. At the same time, I also realized that models like this aren’t perfect. Spam changes over time, so models need to be retrained with newer data. There are also risks: if the filter is too strict, important messages could get lost in the spam folder.
Still, I thought it was pretty cool that with a relatively small dataset and a few lines of code, I was able to build something that catches about 92% of spam at 98.4% accuracy. If I continued this project, I’d try adding more advanced features like character-level analysis (to catch messages that misspell words on purpose) or tuning the decision threshold for even better recall. The real-world impact of this kind of model is what we already see in things like Gmail: it is being used to protect people from being scammed.
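As a rough idea of what threshold tuning could look like, the sketch below picks the strictest cutoff that still reaches a target spam recall. This is not something the project implements. The helper `pick_threshold` and the scores and labels are hypothetical; in practice the scores would come from something like `predict_proba` on a held-out validation set.

```python
def pick_threshold(scores, labels, target_recall):
    """Return (threshold, precision) for the highest cutoff that reaches target_recall."""
    n_spam = sum(labels)  # assumes at least one spam example is present
    # Try each observed score as a cutoff, from strictest to most lenient
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        recall = sum(flagged) / n_spam
        if recall >= target_recall:
            precision = sum(flagged) / len(flagged)
            return threshold, precision
    return None  # only reached if target_recall is above 1.0

# Hypothetical spam probabilities and true labels (1 = spam), for illustration only
toy_scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0]
threshold, precision = pick_threshold(toy_scores, toy_labels, target_recall=1.0)
print(threshold, precision)
```

On this toy data, catching every spam message means lowering the cutoff far enough that one ham message gets flagged too, which is the same strictness trade-off mentioned earlier.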
Code Link: https://github.com/aborland123/ScamClassificationModel
