Project 4: What patterns can be found among tech companies that experienced layoffs, and how do factors like company size, funding stage, and industry cluster together?

For this project, I’m analyzing publicly reported tech layoffs from 2020 to 2024 to see if companies naturally group together based on how and when they reduced staff. My goal is to find patterns among company size, funding stage, industry, and timing to understand how layoffs spread across the tech world. I’m especially curious if venture-backed startups tend to have higher percentage cuts, or if there were certain years where layoffs were concentrated across similar industries. By using clustering instead of assuming any specific labels or causes, I can look for natural patterns that emerge in the data.

Pre-Processing the Data:

The dataset I’m using is the Tech Layoffs Dataset (2020–2024) from Kaggle. It includes company name, industry, location, date of layoff, number of employees laid off, total company size, percentage of workforce laid off, total funding raised, and company stage (like startup or public). I also created new features such as year and quarter to analyze layoffs over time, log-scaled funding and size to handle large variations, and simplified stage categories like “Seed/Angel,” “Early (A/B),” “Late (C+),” and “Public.” This dataset gives a broad look at how different types of companies experienced layoffs during recent years.

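import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so display() is defined outside a notebook cell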
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 160)

CSV_PATH = "layoffs.csv"  

df_raw = pd.read_csv(CSV_PATH)
print("Raw shape:", df_raw.shape)
display(df_raw.head(3))

df = df_raw.copy()
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

if "percentage_laid_off" in df.columns:
    df["percentage_laid_off"] = (
        df["percentage_laid_off"].astype(str).str.strip().str.replace("%", "", regex=False)
    )
    df["percentage_laid_off"] = pd.to_numeric(df["percentage_laid_off"], errors="coerce")

def parse_funds(x):
    if pd.isna(x): 
        return np.nan
    s = str(x).strip().lower().replace("$", "").replace(",", "")
    mult = 1.0
    if s.endswith("b"):
        mult = 1_000.0  # billions -> millions
        s = s[:-1]
    elif s.endswith("m"):
        mult = 1.0
        s = s[:-1]
    try:
        return float(s) * mult
    except ValueError:  # unparseable strings become missing values
        return np.nan

df["funding_total_musd"] = df["funds_raised"].apply(parse_funds) if "funds_raised" in df.columns else np.nan

rename_map = {
    "total_laid_off": "num_laid_off",
    "percentage_laid_off": "pct_laid_off"
}
df = df.rename(columns={k:v for k,v in rename_map.items() if k in df.columns})

if "num_laid_off" in df.columns and "pct_laid_off" in df.columns:
    df["company_size"] = np.where(
        df["pct_laid_off"].notna() & (df["pct_laid_off"] > 0) & df["num_laid_off"].notna(),
        df["num_laid_off"] / (df["pct_laid_off"] / 100.0),
        np.nan
    )

# Building new dataframe
df_feat = pd.DataFrame({
    "company": df.get("company"),
    "industry": df.get("industry"),
    "country": df.get("country"),
    "city": df.get("location"),
    "date": pd.to_datetime(df.get("date"), errors="coerce"),
    "num_laid_off": pd.to_numeric(df.get("num_laid_off"), errors="coerce"),
    "company_size": pd.to_numeric(df.get("company_size"), errors="coerce"),
    "pct_laid_off": pd.to_numeric(df.get("pct_laid_off"), errors="coerce"),
    "funding_total_usd": pd.to_numeric(df.get("funding_total_musd"), errors="coerce"),
    "stage": df.get("stage")
})

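df_feat["year"] = df_feat["date"].dt.year
df_feat["quarter"] = df_feat["date"].dt.quarter

# Log-scale the heavily skewed numeric columns (log1p handles zeros safely).
# Note: funding_total_usd holds millions of USD, as returned by parse_funds.
for col in ["company_size", "num_laid_off", "funding_total_usd"]:
    df_feat[f"log1p_{col}"] = np.log1p(df_feat[col].clip(lower=0))

# Collapse raw stage labels into the simplified buckets.
# The raw labels matched here ("Seed", "Series A", "Post-IPO", ...) are assumed;
# adjust the rules to the values that actually appear in the dataset.
def bucket_stage(stage):
    if pd.isna(stage):
        return np.nan
    s = str(stage).strip().lower()
    if "seed" in s or "angel" in s:
        return "Seed/Angel"
    if s in {"series a", "series b"}:
        return "Early (A/B)"
    if s.startswith("series"):
        return "Late (C+)"
    if "ipo" in s or "public" in s:
        return "Public"
    return "Other"

df_feat["stage_bucket"] = df_feat["stage"].apply(bucket_stage)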

TOPK = 15
top_industries = df_feat["industry"].value_counts().head(TOPK).index
top_countries  = df_feat["country"].value_counts().head(TOPK).index

df_model = df_feat.copy()
df_model["industry_top"] = np.where(df_model["industry"].isin(top_industries), df_model["industry"], "Other")
df_model["country_top"]  = np.where(df_model["country"].isin(top_countries),  df_model["country"],  "Other")

feature_cols_num = ["pct_laid_off", "log1p_company_size", "log1p_num_laid_off", "log1p_funding_total_usd", "year", "quarter"]
feature_cols_cat = ["stage_bucket", "industry_top", "country_top"]

numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocess = ColumnTransformer(transformers=[("num", numeric_transformer, feature_cols_num), ("cat", categorical_transformer, feature_cols_cat)])
X_prepared = preprocess.fit_transform(df_model)
X_dense = X_prepared.toarray() if hasattr(X_prepared, "toarray") else X_prepared
print("Prepared matrix shape:", X_dense.shape)

Before modeling, I explored the data through visualizations and descriptive summaries. Many variables, like company size and total funding, were highly skewed, so I applied log transformations and standard scaling to make clustering more stable. For missing values, I filled in numeric gaps with the median and categorical ones with the most frequent value. I also reduced the number of unique industries and countries by grouping less common ones into “Other,” which helps avoid noisy clusters. Through this exploration, I noticed that early-stage companies seemed to have more drastic percentage layoffs than larger, established ones, which supports part of my initial question.
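To make the two preprocessing ideas above concrete, here is a small sketch on made-up values (the numbers and industry names are hypothetical, not taken from the layoffs dataset). It shows how a log1p transform pulls in a long right tail, and how rare categories can be folded into "Other" using the same value_counts pattern as the project code.

```python
import numpy as np
import pandas as pd

# Hypothetical company sizes spanning several orders of magnitude.
sizes = pd.Series([12, 40, 85, 150, 900, 12_000, 150_000], dtype=float)
log_sizes = np.log1p(sizes)

# Skewness shrinks once the long right tail is compressed.
print("skew before:", round(sizes.skew(), 2))
print("skew after: ", round(log_sizes.skew(), 2))

# Hypothetical industry labels: keep the 2 most frequent, group the rest.
industries = pd.Series(
    ["Finance", "Finance", "Retail", "Retail", "Retail", "Aerospace", "Legal"]
)
top = industries.value_counts().head(2).index
grouped = industries.where(industries.isin(top), "Other")
print(grouped.tolist())
```

Without the log transform, a single very large company would dominate the Euclidean distances that k-means relies on, even after standard scaling.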

After preprocessing, I selected key features related to the research question: percentage laid off, company size, number laid off, total funding, year, quarter, stage, industry, and country. I standardized the numeric values and one-hot encoded the categorical ones to prepare them for clustering. Then I used k-means to test different numbers of clusters (from 2 to 10) and evaluated each with silhouette, Davies-Bouldin, and Calinski-Harabasz scores. These metrics helped me find the best balance between well-separated and compact clusters. I then compared the results to agglomerative clustering with the same number of clusters to see how the structures differed. Finally, I visualized the results in two dimensions using PCA and plotted a dendrogram to show how the agglomerative clusters formed.
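As a rough illustration of how the three metrics are read, the sketch below runs the same k sweep on synthetic data with three clearly separated blobs (generated with make_blobs, so none of it comes from the layoffs data). Higher is better for silhouette and Calinski-Harabasz, and lower is better for Davies-Bouldin.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic toy data: three well-separated groups in 2D.
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = {
        "silhouette": silhouette_score(X_toy, labels),
        "davies_bouldin": davies_bouldin_score(X_toy, labels),
        "calinski_harabasz": calinski_harabasz_score(X_toy, labels),
    }

# Each metric picks its own preferred k; on clean data they tend to agree.
best_by_silhouette = max(scores, key=lambda k: scores[k]["silhouette"])
best_by_db = min(scores, key=lambda k: scores[k]["davies_bouldin"])
best_by_ch = max(scores, key=lambda k: scores[k]["calinski_harabasz"])
print(best_by_silhouette, best_by_db, best_by_ch)
```

On real data the three metrics often disagree, which is why the project code ranks by silhouette first and uses Calinski-Harabasz only as a tie-breaker.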

Visualizing The Data

Average Layoffs by Company Stage

stage_mean = (
    df_feat.dropna(subset=["pct_laid_off"])
          .groupby("stage_bucket")["pct_laid_off"]
          .mean()
          .sort_values(ascending=False)
)
plt.figure(figsize=(6, 4))
plt.bar(stage_mean.index.astype(str), stage_mean.values, color="#3b7dd8")
plt.xticks(rotation=25, ha="right")
plt.ylabel("Average % of Workforce Laid Off")
plt.title("Average Layoffs by Company Stage", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()

Industries Most Affected by Layoffs

ind_counts = (
    df_feat["industry"]
    .fillna("Unknown")
    .value_counts()
    .head(15)
    .sort_values()
)
plt.figure(figsize=(6, 6))
plt.barh(ind_counts.index.astype(str), ind_counts.values, color="#4e79a7")
plt.xlabel("Number of Layoff Events")
plt.title("Industries Most Affected by Layoffs", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()

Employees Laid Off and Events

ts = df_feat.dropna(subset=["date"]).copy()
ts["month"] = ts["date"].dt.to_period("M").dt.to_timestamp()
monthly = ts.groupby("month").agg(
    events=("company", "count"),
    total_laid_off=("num_laid_off", "sum")
).fillna(0)

plt.figure(figsize=(7, 3))
plt.plot(monthly.index, monthly["events"], color="#1f77b4")
plt.title("Monthly Count of Layoff Events (2020–2024)", fontsize=13, fontweight="bold")
plt.xlabel("Month")
plt.ylabel("Number of Events")
plt.tight_layout()
plt.show()

plt.figure(figsize=(7, 3))
plt.plot(monthly.index, monthly["total_laid_off"], color="#d62728")
plt.title("Total Employees Laid Off per Month (2020–2024)", fontsize=13, fontweight="bold")
plt.xlabel("Month")
plt.ylabel("Employees Laid Off")
plt.tight_layout()
plt.show()

Company Size vs. Layoff Severity

mask = df_feat["pct_laid_off"].notna() & df_feat["log1p_company_size"].notna()
plt.figure(figsize=(6, 4))
plt.scatter(df_feat.loc[mask, "log1p_company_size"], df_feat.loc[mask, "pct_laid_off"], s=10, alpha=0.5, color="#ff7f0e")
plt.xlabel("log1p(Company Size)")
plt.ylabel("% of Workforce Laid Off")
plt.title("Relationship Between Company Size and Layoff Severity", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()

Clustering

Clustering is an unsupervised learning method that groups data points so that those within a cluster are more similar to each other than to points in other clusters. I use two methods in this analysis. The first is k-means, which takes a fixed number of clusters (k), assigns each layoff record to the nearest cluster center, recomputes each center as the mean of its assigned points, and repeats until the centers stabilize. It works best when clusters are compact and roughly spherical. The second is agglomerative clustering, which starts by treating every record as its own cluster and then repeatedly merges the two closest clusters. The sequence of merges forms a tree that can be drawn as a dendrogram, and depending on the linkage rule it can capture less regular cluster shapes than k-means.
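The two procedures can be sketched on a tiny made-up dataset. The lloyd_kmeans function below is a hypothetical teaching helper, not the scikit-learn implementation used in this project; it simply alternates between assigning points and updating centers. The agglomerative side uses SciPy's linkage with Ward's method and cuts the resulting tree into two clusters with fcluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers have stabilized
        centers = new_centers
    return labels, centers


# Made-up 2D points forming two tight groups.
X_toy = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)

km_labels, km_centers = lloyd_kmeans(X_toy, k=2)

# Agglomerative: build the merge tree, then cut it into 2 clusters.
Z_toy = linkage(X_toy, method="ward")
agg_toy_labels = fcluster(Z_toy, t=2, criterion="maxclust")

print(km_labels, agg_toy_labels)
```

Both methods recover the same two groups here because the toy data is cleanly separated; the cluster numbering itself is arbitrary and may differ between the two.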

results = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    labels = km.fit_predict(X_dense)
    sil = silhouette_score(X_dense, labels) if len(set(labels)) > 1 else np.nan
    db  = davies_bouldin_score(X_dense, labels)
    ch  = calinski_harabasz_score(X_dense, labels)
    results.append({"k": k, "silhouette": sil, "davies_bouldin": db, "calinski_harabasz": ch})

res_df = pd.DataFrame(results)
display(res_df)

plt.figure()
plt.plot(res_df["k"], res_df["silhouette"], marker="o")
plt.title("Silhouette Score by Number of Clusters", fontsize=13, fontweight="bold")
plt.xlabel("k"); plt.ylabel("Silhouette Score")
plt.show()

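best_k = int(res_df.sort_values(["silhouette", "calinski_harabasz"], ascending=[False, False]).iloc[0]["k"])
print("Chosen k:", best_k)

# Fit the final models with the chosen k so their labels can be plotted and compared
kmeans_labels = KMeans(n_clusters=best_k, n_init=20, random_state=42).fit_predict(X_dense)
agg_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X_dense)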

pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_dense)

def scatter_2d(X2, labels, title):
    plt.figure()
    plt.scatter(X2[:,0], X2[:,1], s=10, alpha=0.7, c=labels)
    plt.title(title, fontsize=13, fontweight="bold")
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.show()

scatter_2d(X_2d, kmeans_labels, "K-Means Clusters (PCA Projection)")
scatter_2d(X_2d, agg_labels, "Agglomerative Clusters (PCA Projection)")

sample_n = min(400, X_dense.shape[0])
idx = np.random.RandomState(42).choice(X_dense.shape[0], size=sample_n, replace=False)
Z = linkage(X_dense[idx], method="ward")
plt.figure(figsize=(10,4))
dendrogram(Z, truncate_mode="lastp", p=30, leaf_rotation=90., leaf_font_size=8.)
plt.title("Agglomerative Clustering Dendrogram (Truncated)", fontsize=13, fontweight="bold")
plt.show()

The clustering results revealed interesting trends. Some clusters contained mostly early-stage startups that had smaller teams but higher percentage layoffs, especially during 2022 and 2023. Others contained large public companies that had massive layoffs in absolute numbers but smaller percentages relative to their workforce. Another cluster appeared to group companies outside the U.S. with moderate layoffs and lower funding levels. These insights help show how the tech contraction played out differently depending on company stage, region, and funding level. By profiling each cluster’s median values and most common industries, I was able to tie the findings back to my original questions about size, stage, and timing.
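The profiling step mentioned above could look roughly like the sketch below. The profile_clusters helper and the toy rows are hypothetical stand-ins for illustration; only the idea (per-cluster medians plus the most common category) comes from the write-up.

```python
import pandas as pd


def profile_clusters(frame, labels, num_cols, cat_col):
    """Summarize each cluster: row count, numeric medians, most common category."""
    tmp = frame.copy()
    tmp["cluster"] = labels
    grouped = tmp.groupby("cluster")
    sizes = grouped.size().rename("n_events")
    medians = grouped[num_cols].median()
    top_cat = grouped[cat_col].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else None
    ).rename(f"top_{cat_col}")
    return pd.concat([sizes, medians, top_cat], axis=1)


# Made-up rows: two small high-percentage cuts and three large low-percentage cuts.
toy = pd.DataFrame({
    "pct_laid_off": [40.0, 50.0, 5.0, 8.0, 6.0],
    "num_laid_off": [20, 30, 1000, 1500, 1200],
    "industry": ["Crypto", "Crypto", "Retail", "Consumer", "Retail"],
})
toy_labels = [0, 0, 1, 1, 1]

profile = profile_clusters(toy, toy_labels, ["pct_laid_off", "num_laid_off"], "industry")
print(profile)
```

In the project itself, a call along the lines of profile_clusters(df_model, kmeans_labels, ["pct_laid_off", "num_laid_off"], "industry_top") would produce the same kind of table, assuming those variables are defined as in the code above.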

This project has meaningful social and ethical implications. Understanding layoff patterns can help job seekers, analysts, and policymakers identify which parts of the tech world were hit hardest and which recovered fastest. However, there are also risks. Clustering can oversimplify complex situations, and labeling a group as “high-risk” could unfairly stigmatize companies or industries. The data may also be biased since it only includes publicly reported layoffs, meaning smaller or private layoffs might be underrepresented. To minimize these risks, I avoid making causal claims and emphasize that the clusters only show patterns—not reasons—behind layoffs.

In conclusion, this clustering analysis provided a deeper, data-driven view of how tech layoffs unfolded across different types of companies from 2020 to 2024. Even though it can’t capture every human story behind the numbers, it highlights important patterns in how the tech industry has evolved during recent years of instability.

Dataset: Tech Layoffs (Kaggle, 2020–2024)

GitHub: https://github.com/aborland123/Project4