Theme Park Business Analytics — Case Study¶

What this is. A one-year analysis of theme park operations and weather to identify what drives visitors and revenue, and to turn those findings into clear actions.

Snapshot

  • Data: lobster24.csv (daily ops), passholders24.csv (segments), digital_quest.csv (conjoint), telco_churn_data.csv (churn)
  • Period: Jan–Dec 2024
  • Tools: Python (pandas, matplotlib, seaborn, scikit-learn) in Jupyter

What I deliver

  • Clean, reproducible notebook with code and charts
  • Behavior-based segments with profiles
  • Ratings-based conjoint: part-worths and attribute importance
  • Practical recommendations for staffing and offers
  • Churn model benchmark with clear metrics, threshold tuning, and feature drivers

Highlights

  • Weekend lift: visitors and total revenue run about +X% above weekday levels
  • Weather impact: Cloudy/Sunny days outperform Rainy/Stormy by +Y% in total revenue; arcade revenue is +Z% higher
  • Staffing curve: returns flatten beyond roughly H labor hours/day
  • Offer design: top driver is {Attribute} (~A% of importance); best bundle is {Option}
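
How these headline figures could be reproduced, as a minimal sketch: it assumes the tidy daily table saved at the end of Part 1 and, for the staffing curve, the raw Total_Labor_Hours column; the revenue column name follows the mapping chosen in Part 1, step 4.

import pandas as pd

tidy = pd.read_csv("outputs/themepark_daily_tidy.csv", parse_dates=["Date"])
rev_col = "Arcade_Revenue"   # working revenue target picked in Part 1, step 4

# Weekend lift: weekend vs. weekday means for revenue and visitors
wk = tidy.groupby("Is_Weekend")[[rev_col, "Total_Visitors"]].mean()
print("Weekend lift (%):\n", ((wk.loc[True] / wk.loc[False] - 1) * 100).round(1))

# Weather impact: Cloudy/Sunny vs. Rainy/Stormy mean revenue
good = tidy["Weather_Clean"].isin(["Sunny", "Cloudy"])
delta = (tidy.loc[good, rev_col].mean() / tidy.loc[~good, rev_col].mean() - 1) * 100
print(f"Good-weather revenue delta: {delta:+.1f}%")

# Staffing curve: mean revenue by labor-hour quintile (flattening suggests diminishing returns)
raw = pd.read_csv("data/lobster24.csv")
hours_bin = pd.qcut(raw["Total_Labor_Hours"], 5)
print(raw.groupby(hours_bin, observed=True)[rev_col].mean().round(0))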

Part 1 — Exploratory Data Analysis¶

Focus. Clean the daily ops table (lobster24.csv), surface patterns, and define the KPIs that matter.

You’ll see

  • Fast data audit (shape/dtypes/missing), unified Date + week/weekday/month
  • High-signal visuals: calendar heatmap, Weather × Weekend mean±CI, weekly trend with smoothed band, Arcade Pareto
  • Distributions & skewness for key metrics; correlation heatmap for quick relationships
  • Core KPIs: international %, staff efficiency, passholder %
  • Save a tidy slice for later modeling and segmentation

1) Load & setup¶

Load the daily theme-park dataset and standardize the date column to Date.

In [1]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from pathlib import Path

DATA = Path("data")
raw = pd.read_csv(DATA/"lobster24.csv")

df = raw.copy()
date_col = next((c for c in df.columns if c.lower() in ["date","day","dt"]), None)
if date_col is None:
    for c in df.columns:
        x = pd.to_datetime(df[c], errors="coerce")
        if x.notna().mean()>0.9: date_col=c; break
if date_col is None:
    raise ValueError("No date-like column found.")
df["Date"] = pd.to_datetime(df[date_col])
df = df.sort_values("Date").reset_index(drop=True)

2) Quick audit¶

Check shape, dtypes, and missing values to spot issues early.

In [2]:
display(df.head(3))
display(df.dtypes.to_frame("dtype"))
na = df.isna().mean().sort_values(ascending=False)
display(na[na>0].round(3).to_frame("missing_rate"))
Date Total_Visitors Arcade_Visitors Total_Purchases Arcade_Revenue Total_Labor_Hours Weather_Type Special_Events Customer_Complaints Day_of_Week Passholder_Percentage International_Visitors High_Temperature General_Weather Precipitation
0 2024-05-27 1060 267 22575.71 3524.30 185.84 Sunny Arcade Tournament 9 Monday 0.22 357 65.4 Cool 0
1 2024-05-28 1494 552 17142.52 1652.93 250.90 Cloudy NaN 3 Tuesday 0.23 424 66.4 Cool 0
2 2024-05-29 1330 447 10229.99 3251.81 115.59 Cloudy NaN 0 Wednesday 0.22 441 90.2 Hot 0
dtype
Date datetime64[ns]
Total_Visitors int64
Arcade_Visitors int64
Total_Purchases float64
Arcade_Revenue float64
Total_Labor_Hours float64
Weather_Type object
Special_Events object
Customer_Complaints int64
Day_of_Week object
Passholder_Percentage float64
International_Visitors int64
High_Temperature float64
General_Weather object
Precipitation int64
missing_rate
Special_Events 0.788
High_Temperature 0.030

3) Calendar & weather¶

Add week/weekday/month/weekend and unify weather labels.

In [3]:
df["Year"]    = df["Date"].dt.year
df["Week"]    = df["Date"].dt.isocalendar().week.astype(int)
df["Month"]   = df["Date"].dt.month
df["Weekday"] = df["Date"].dt.day_name()
df["Is_Weekend"] = df["Date"].dt.weekday >= 5

wcol = next((c for c in df.columns if "weather" in c.lower()), None)
if wcol is None:
    df["Weather_Clean"] = "Unknown"
else:
    w = df[wcol].astype(str).str.lower().str.strip()
    mapping = {"sun":"Sunny","clear":"Sunny","bright":"Sunny",
               "cloud":"Cloudy","overcast":"Cloudy",
               "rain":"Rainy","storm":"Stormy","thunder":"Stormy","snow":"Stormy"}
    def norm(s):
        for k,v in mapping.items():
            if k in s: return v
        return s.title()
    df["Weather_Clean"] = w.map(norm)

4) Identify key columns¶

Find total/arcade revenue, visitors, temperature, and complaints columns.

In [4]:
def find_col(tokens):
    # first column whose name contains any token (checked in column order,
    # so a generic token like "revenue" can match before a more specific one)
    for c in df.columns:
        if any(t in c.lower() for t in tokens): return c
    return None

col_total_rev  = find_col(["total_revenue","revenue_total","total_rev","revenue"])
col_arcade_rev = find_col(["arcade_revenue","arcade"])
col_visitors   = find_col(["total_visitors","visitors","attendance","attend"])
col_temp       = find_col(["temp","temperature"])
col_complaints = find_col(["complaints","customer_complaints"])

# strip $ signs and thousands separators from any matched column stored as text
for c in [col_total_rev, col_arcade_rev, col_visitors, col_temp, col_complaints]:
    if c and df[c].dtype=="O":
        df[c] = pd.to_numeric(df[c].astype(str).str.replace(r"[,$]", "", regex=True), errors="coerce")

# no Total_Revenue column exists in this file, so the generic "revenue" token matches
# Arcade_Revenue (see Out[4]); that column serves as the working revenue target for Part 1
target = col_total_rev or (col_arcade_rev or df.select_dtypes("number").columns[0])
{"total": col_total_rev, "arcade": col_arcade_rev, "visitors": col_visitors, "temp": col_temp, "complaints": col_complaints}
Out[4]:
{'total': 'Arcade_Revenue',
 'arcade': 'Arcade_Visitors',
 'visitors': 'Total_Visitors',
 'temp': 'High_Temperature',
 'complaints': 'Customer_Complaints'}

5) KPIs¶

Compute international %, staff efficiency, and passholder % when available.

In [5]:
def maybe(names):
    for n in names:
        if n in df.columns: return n
    return None

col_intl = maybe(["International_Visitors","Intl_Visitors","Pct_International","International%"])
col_staff_hours = maybe(["Total_Labor_Hours","Labor_Hours","Staff_Hours"])
col_pass = maybe(["Passholder_Percentage","Passholder_Pct","Passholder%","Passholders"])

# International share: keep as-is if already a 0-1 share, otherwise divide by 100.
# Caveat: International_Visitors here is a raw head count, so this yields count/100
# rather than a true share; International_Visitors / Total_Visitors would give the
# actual share of visitors.
if col_intl and df[col_intl].max()<=1.01:
    df["Intl_Pct"] = df[col_intl]
elif col_intl:
    df["Intl_Pct"] = df[col_intl]/100.0

# Staff efficiency: revenue generated per labor hour
if col_staff_hours and col_total_rev:
    df["Staff_Efficiency"] = df[col_total_rev] / df[col_staff_hours]

# Passholder share, normalized to 0-1 if stored as a percentage
if col_pass and col_visitors:
    df["Passholder_Pct"] = df[col_pass] if df[col_pass].max()<=1.01 else df[col_pass]/100.0

display(df[["Date","Weather_Clean","Is_Weekend"] + [c for c in ["Intl_Pct","Staff_Efficiency","Passholder_Pct"] if c in df.columns]].head(3))
Date Weather_Clean Is_Weekend Intl_Pct Staff_Efficiency Passholder_Pct
0 2024-05-27 Sunny False 3.57 18.964163 0.22
1 2024-05-28 Cloudy False 4.24 6.588003 0.23
2 2024-05-29 Cloudy False 4.41 28.132278 0.22

6) Calendar heatmap¶

Seasonality across the year.

In [6]:
s = df.set_index("Date")[target].copy()
cal = s.to_frame("val")
cal["dow"] = cal.index.weekday
cal["week"] = cal.index.isocalendar().week.astype(int)
pv = cal.pivot_table(index="dow", columns="week", values="val", aggfunc="mean")
plt.figure(figsize=(12,3)); plt.imshow(pv, aspect="auto")
plt.yticks(range(7), ["Mon","Tue","Wed","Thu","Fri","Sat","Sun"]); plt.xlabel("ISO Week"); plt.colorbar()
plt.title("Calendar heatmap — weekly mean"); plt.tight_layout(); plt.show()
[Figure: Calendar heatmap — weekly mean]

7) Weather × Weekend¶

Mean ±95% CI by weather and weekend.

In [7]:
g = df.groupby(["Weather_Clean","Is_Weekend"])[target]
m = g.mean(); n = g.count(); sd = g.std().fillna(0)
se = (sd/np.sqrt(n)).reindex(m.index).fillna(0)
labels = [f"{w}\n{'Weekend' if we else 'Weekday'}" for w,we in m.index]
plt.figure(figsize=(12,4))
plt.bar(range(len(m)), m.values, yerr=1.96*se.values, capsize=3)
plt.xticks(range(len(m)), labels); plt.title("Weather × Weekend — mean ± 95% CI")
plt.tight_layout(); plt.show()
[Figure: Weather × Weekend — mean ± 95% CI]

8) Weekly trend with smoothed band¶

Weekly mean with a 3-week moving average and ±1σ band.

In [8]:
wk = df.set_index("Date")[target].resample("W").mean().rename("w")
ma = wk.rolling(3, min_periods=1).mean()
sd = wk.rolling(3, min_periods=1).std().fillna(0)
plt.figure(figsize=(9,4))
plt.plot(wk.index, wk.values, alpha=0.4, label="weekly mean")
plt.plot(ma.index, ma.values, linewidth=2, label="3-week MA")
plt.fill_between(ma.index, (ma-sd).values, (ma+sd).values, alpha=0.2)
plt.title("Weekly revenue — mean with smoothed band")
plt.legend(); plt.tight_layout(); plt.show()
[Figure: Weekly revenue — mean with 3-week MA and ±1σ band]

9) Arcade revenue — Pareto¶

Quantify long-tail concentration.

In [9]:
if col_arcade_rev:
    x = df[col_arcade_rev].dropna().sort_values(ascending=False).reset_index(drop=True)
    cs = x.cumsum()/x.sum() if x.sum()>0 else np.nan
    fig, ax1 = plt.subplots(figsize=(8,4))
    ax1.bar(range(1,len(x)+1), x.values)
    ax2 = ax1.twinx(); ax2.plot(range(1,len(x)+1), cs.values, marker="o", linewidth=2)
    ax1.set_xlabel("Days (sorted)"); ax1.set_ylabel("Arcade revenue"); ax2.set_ylabel("Cumulative share")
    ax1.set_title("Arcade revenue — Pareto curve"); plt.tight_layout(); plt.show()
else:
    print("No arcade revenue column found.")
[Figure: Arcade revenue — Pareto curve]

10) Distributions & skewness¶

Histogram + skew for key metrics.

In [10]:
import numpy as np, matplotlib.pyplot as plt
# candidate metrics plus the auto-detected columns, deduplicated (order preserved)
dist_cols = list(dict.fromkeys(c for c in [
    "Total_Visitors","Arcade_Visitors","Total_Purchases",
    "Arcade_Revenue","Food_Revenue","Merch_Revenue","Total_Labor_Hours",
    col_total_rev, col_arcade_rev, col_visitors
] if isinstance(c,str) and c in df.columns))

for col in dist_cols:
    s = df[col].dropna()
    if s.empty: 
        continue
    plt.figure(figsize=(8,3))
    plt.hist(s, bins=30, alpha=0.85)
    plt.title(f"{col} — distribution (n={len(s)})")
    plt.xlabel(col); plt.ylabel("count"); plt.tight_layout(); plt.show()
    print(f"{col}: skew={s.skew():.2f}")
[Figures: one histogram per metric]
Total_Visitors: skew=-0.02
Arcade_Visitors: skew=-0.07
Total_Purchases: skew=0.30
Arcade_Revenue: skew=0.30
Total_Labor_Hours: skew=-0.00

11) Correlation heatmap¶

Numeric correlations for KPIs and main metrics.

In [11]:
num_df = df.select_dtypes(include=[np.number]).copy()
corr = num_df.corr().round(2)

plt.figure(figsize=(6,5))
im = plt.imshow(corr, cmap="viridis")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.title("Correlation matrix"); plt.tight_layout(); plt.show()

corr.head(10)
[Figure: Correlation matrix heatmap]
Out[11]:
Total_Visitors Arcade_Visitors Total_Purchases Arcade_Revenue Total_Labor_Hours Customer_Complaints Passholder_Percentage International_Visitors High_Temperature Precipitation Year Week Month Intl_Pct Staff_Efficiency Passholder_Pct
Total_Visitors 1.00 0.09 -0.10 0.12 -0.19 0.14 -0.22 0.96 -0.05 -0.10 NaN 0.09 0.07 0.96 0.19 -0.22
Arcade_Visitors 0.09 1.00 -0.08 0.01 -0.08 0.09 0.06 0.10 0.05 0.05 NaN -0.09 -0.14 0.10 0.10 0.06
Total_Purchases -0.10 -0.08 1.00 0.21 0.19 -0.19 -0.41 0.01 0.05 -0.06 NaN -0.18 -0.16 0.01 0.03 -0.41
Arcade_Revenue 0.12 0.01 0.21 1.00 -0.06 -0.16 -0.29 0.21 0.08 0.11 NaN 0.18 0.20 0.21 0.77 -0.29
Total_Labor_Hours -0.19 -0.08 0.19 -0.06 1.00 -0.15 0.03 -0.18 -0.07 -0.05 NaN 0.09 0.07 -0.18 -0.58 0.03
Customer_Complaints 0.14 0.09 -0.19 -0.16 -0.15 1.00 -0.03 0.14 -0.04 0.07 NaN -0.14 -0.14 0.14 -0.08 -0.03
Passholder_Percentage -0.22 0.06 -0.41 -0.29 0.03 -0.03 1.00 -0.38 0.04 -0.02 NaN -0.01 0.01 -0.38 -0.21 1.00
International_Visitors 0.96 0.10 0.01 0.21 -0.18 0.14 -0.38 1.00 -0.02 -0.04 NaN 0.08 0.08 1.00 0.23 -0.38
High_Temperature -0.05 0.05 0.05 0.08 -0.07 -0.04 0.04 -0.02 1.00 0.06 NaN 0.03 0.03 -0.02 0.08 0.04
Precipitation -0.10 0.05 -0.06 0.11 -0.05 0.07 -0.02 -0.04 0.06 1.00 NaN -0.10 -0.08 -0.04 0.15 -0.02

12) Top / bottom days + save tidy slice¶

Surfacing extremes for notes and exporting a lean table for later work.

In [12]:
metric = target
top = df.nlargest(5, metric)[["Date","Weather_Clean","Is_Weekend", metric]]
bot = df.nsmallest(5, metric)[["Date","Weather_Clean","Is_Weekend", metric]]
display(top.rename(columns={metric:"Top_Value"}))
display(bot.rename(columns={metric:"Bottom_Value"}))

keep = ["Date","Year","Week","Month","Weekday","Is_Weekend","Weather_Clean"]
for c in [target, col_arcade_rev, col_visitors, "Intl_Pct","Staff_Efficiency","Passholder_Pct"]:
    if isinstance(c,str) and c in df.columns: keep.append(c)
Path("outputs").mkdir(exist_ok=True)
df[keep].to_csv("outputs/themepark_daily_tidy.csv", index=False)
"Saved: outputs/themepark_daily_tidy.csv"
Date Weather_Clean Is_Weekend Top_Value
75 2024-08-10 Cloudy True 7088.79
27 2024-06-23 Rainy True 6766.69
19 2024-06-15 Stormy True 6526.10
47 2024-07-13 Stormy True 6515.86
76 2024-08-11 Rainy True 6323.83
Date Weather_Clean Is_Weekend Bottom_Value
78 2024-08-13 Cloudy False 568.87
21 2024-06-17 Cloudy False 674.76
37 2024-07-03 Stormy False 684.80
58 2024-07-24 Sunny False 714.72
9 2024-06-05 Sunny False 728.46
Out[12]:
'Saved: outputs/themepark_daily_tidy.csv'

Part 2 — Market Segmentation & Conjoint Analysis¶

Focus. Turn behavior into segments (passholders24.csv) and quantify offer utilities (digital_quest.csv).

You’ll see

  • Compact feature set → K-means with elbow + silhouette to pick k
  • Clear segment profiles (size, spend, visit mix) + PCA 2-D view
  • Conjoint part-worths and attribute importance for offer design
In [13]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.linear_model import RidgeCV

plt.rcParams["figure.figsize"] = (7,4)
plt.rcParams["axes.grid"] = False
DATA = Path("data")

2.1) Load & features¶

Read passholder behavior, select numeric features, winsorize, and standardize.

In [14]:
ph = pd.read_csv(DATA/"passholders24.csv")
ph.columns = (ph.columns.str.strip()
                        .str.replace(r"\s+", "_", regex=True)
                        .str.replace(r"[^\w_]", "", regex=True))
id_like = [c for c in ph.columns if c.lower() in {"id","customer_id","member_id"}]
num_cols = ph.select_dtypes(include=[np.number]).columns.tolist()
high_card = [c for c in num_cols if ph[c].nunique()>0.9*len(ph)]
feat_cols = [c for c in num_cols if c not in set(id_like)|set(high_card)]
if len(feat_cols)<3: raise ValueError(f"Not enough numeric features: {feat_cols}")
Xw = ph[feat_cols].clip(ph[feat_cols].quantile(0.01), ph[feat_cols].quantile(0.99), axis=1)
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(Xw), columns=feat_cols, index=ph.index)
display(X.describe().round(2))
Number_of_Visits Twilight_Visits Number_of_Rides_Taken Number_of_Food_Purchases Number_of_Arcade_Plays Parking_Frequency Time_Spent_in_Park_hours Weekday_vs_Weekend_Ratio Number_of_Guests_Brought Special_Event_Attendance
count 654.00 654.00 654.00 654.00 654.00 654.00 654.00 654.00 654.00 654.00
mean -0.00 -0.00 0.00 -0.00 -0.00 0.00 0.00 0.00 -0.00 -0.00
std 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
min -1.61 -1.53 -1.73 -1.71 -1.71 -1.81 -1.63 -1.68 -1.56 -1.34
25% -0.93 -0.84 -0.85 -0.86 -0.80 -0.82 -0.86 -0.87 -0.87 -0.65
50% 0.09 -0.15 -0.01 -0.01 -0.01 0.03 -0.03 -0.02 0.16 0.04
75% 0.88 0.88 0.82 0.84 0.90 0.87 0.86 0.88 0.85 0.73
max 1.56 1.57 1.78 1.70 1.59 1.58 1.76 1.75 1.54 1.42

2.2) Pick k (elbow + silhouette)¶

Compare inertia and silhouette to choose the number of clusters.

In [15]:
k_list = list(range(2,8))
inertia, sil = [], []
for k in k_list:
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    lab = km.fit_predict(X)
    inertia.append(km.inertia_)
    sil.append(silhouette_score(X, lab))
fig, ax = plt.subplots(1,2, figsize=(10,4))
ax[0].plot(k_list, inertia, marker="o"); ax[0].set_title("Elbow (Inertia)")
ax[0].set_xlabel("k"); ax[0].set_ylabel("Inertia")
ax[1].plot(k_list, sil, marker="o"); ax[1].set_title("Silhouette")
ax[1].set_xlabel("k"); ax[1].set_ylabel("Score")
plt.tight_layout(); plt.show()
# k is chosen by the silhouette maximum; the elbow plot serves as a visual cross-check
best_k = k_list[int(np.argmax(sil))]
print("Chosen k:", best_k)
[Figure: Elbow (inertia) and silhouette scores by k]
Chosen k: 7

2.3) KMeans & segment sizes¶

Assign labels and show segment counts.

In [16]:
kmeans = KMeans(n_clusters=best_k, n_init=50, random_state=42)

labels = kmeans.fit_predict(X)              
ph["segment"] = pd.Categorical(labels)    

sizes = ph["segment"].value_counts().sort_index()

ax = sizes.plot(kind="bar")
ax.set_title("Segment sizes"); ax.set_xlabel("segment"); ax.set_ylabel("count")
plt.tight_layout(); plt.show()

display(sizes.to_frame("count"))
[Figure: Segment sizes]
count
segment
0 107
1 91
2 88
3 103
4 70
5 86
6 109

2.4) Segment profiles (z-scores)¶

Profile segments on standardized features with a heatmap.

In [17]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

Z = pd.DataFrame(
    StandardScaler().fit_transform(ph[feat_cols]),
    columns=feat_cols, index=ph.index
)

prof_z = Z.groupby(ph["segment"], observed=True)[feat_cols].mean() 
K = 10 
top_feats = prof_z.var(axis=0).sort_values(ascending=False).head(K).index
M = prof_z[top_feats]

h, w = M.shape
fig_w = min(10, 0.55*w + 2)
fig_h = min(6.5, 0.55*h + 1.6)

plt.figure(figsize=(fig_w, fig_h))
try:
    import seaborn as sns
    sns.heatmap(M, cmap="coolwarm", center=0, square=True,
                cbar_kws={"shrink":0.7}, linewidths=0.2, linecolor="white")
except Exception:
    vmax = np.nanmax(np.abs(M.values))
    im = plt.imshow(M.values, cmap="coolwarm", vmin=-vmax, vmax=vmax, aspect="equal")
    plt.colorbar(im, fraction=0.046, pad=0.04)
    plt.xticks(range(w), M.columns, rotation=60, ha="right", fontsize=8)
    plt.yticks(range(h), M.index, fontsize=9)

plt.title("Segment profiles (top features, z-scores)")
plt.xlabel("Feature"); plt.ylabel("Segment")
plt.xticks(rotation=60, ha="right", fontsize=8)
plt.yticks(fontsize=9)
plt.tight_layout(); plt.show()
[Figure: Segment profiles (top features, z-scores) heatmap]

2.5) PCA 2-D view¶

Visualize cluster separation in two principal components; the two components explain only ~24% of the variance (printed below), so treat the 2-D view as a rough guide.

In [18]:
pca = PCA(n_components=2, random_state=42)
xy = pca.fit_transform(X)
tmp = pd.DataFrame(xy, columns=["PC1","PC2"], index=ph.index)
tmp["segment"] = ph["segment"].astype(str)
sns.scatterplot(data=tmp, x="PC1", y="PC2", hue="segment", s=28, alpha=0.85)
plt.title("Segments in 2-D (PCA)"); plt.legend(title="segment", bbox_to_anchor=(1.02,1))
plt.tight_layout(); plt.show()
print("Explained variance:", np.round(pca.explained_variance_ratio_, 3))
[Figure: Segments in 2-D (PCA)]
Explained variance: [0.12  0.116]

2.6) One-pager segment table¶

Summarize size, share, and key metric means for naming and targeting.

In [19]:
summary = ph.groupby("segment", observed=True)[feat_cols].mean().round(2)
counts = ph["segment"].value_counts(sort=False).sort_index()
summary.insert(0, "size", counts)
summary.insert(1, "size_share_%", (counts / counts.sum() * 100).round(1))
display(summary.sort_values("size", ascending=False))
size size_share_% Number_of_Visits Twilight_Visits Number_of_Rides_Taken Number_of_Food_Purchases Number_of_Arcade_Plays Parking_Frequency Time_Spent_in_Park_hours Weekday_vs_Weekend_Ratio Number_of_Guests_Brought Special_Event_Attendance
segment
6 109 16.7 17.10 5.79 49.88 21.85 10.74 18.91 5.78 1.36 6.28 11.33
0 107 16.4 8.81 3.91 50.26 15.84 21.23 10.47 6.01 0.71 7.01 1.87
3 103 15.7 19.64 4.62 47.79 19.47 17.43 14.96 9.59 1.06 2.49 0.95
1 91 13.9 11.69 3.81 40.23 32.59 11.90 5.42 5.03 1.33 3.62 1.37
2 88 13.5 12.34 3.52 71.95 19.20 16.56 15.56 4.53 1.02 2.03 14.35
5 86 13.1 21.47 3.20 42.24 36.60 15.55 12.93 4.24 0.60 4.50 13.71
4 70 10.7 15.79 6.34 60.60 36.31 10.36 9.84 9.22 0.66 5.40 30.69
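
A quick, hypothetical way to draft working names for these segments (not part of the original deliverable): label each cluster by its two most distinctive features from the z-score profile prof_z computed in 2.4.

# Hypothetical auto-labels: the two highest z-score features per segment
for seg, row in prof_z.iterrows():
    top2 = row.sort_values(ascending=False).head(2).index
    label = " / ".join(f.replace("Number_of_", "").replace("_", " ") for f in top2)
    print(f"Segment {seg}: high {label}")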

2.7) Conjoint design (ratings-based)¶

Load the survey, pick a rating column, and build a dummy-coded design. Every non-rating column is treated as a categorical attribute, including the bundle identifier, which is why the design below ends up with 6,929 columns.

In [20]:
cq = pd.read_csv(DATA/"digital_quest.csv")
cq.columns = (cq.columns.str.strip()
                        .str.replace(r"\s+", "_", regex=True)
                        .str.replace(r"[^\w_]", "", regex=True))
rating_col = next((c for c in cq.columns if c.lower() in {"rating","score","preference","rank"}), None)
if rating_col is None:
    num_cols_cq = cq.select_dtypes(include=[np.number]).columns
    if len(num_cols_cq)==0: raise ValueError("No numeric rating column found.")
    rating_col = num_cols_cq[-1]
attr_cols = [c for c in cq.columns if c != rating_col]
for c in attr_cols:
    cq[c] = cq[c].astype(str).str.strip().replace({"nan":"NA"})
    cq[c] = pd.Categorical(cq[c])
y = cq[rating_col].astype(float).values
Xc = pd.get_dummies(cq[attr_cols], drop_first=True)
display(Xc.head(3))
bundleID_10 bundleID_100 bundleID_1000 bundleID_1001 bundleID_1002 bundleID_1003 bundleID_1004 bundleID_1005 bundleID_1006 bundleID_1007 ... prizes_Physical Prizes collaboration_Solo collaboration_competing_teams quest_progression_open_world quest_progression_tiered task_total_20 task_total_30 task_total_5 technology_interactive_kiosks technology_mixed
0 False False False False False False False False False False ... False True False False False False False True False False
1 False False False False False False False False False False ... False True False False False False False True True False
2 False False False False False False False False False False ... False True False False False False False True False True

3 rows × 6929 columns

2.8) Part-worths via ridge¶

Fit a ridge regression on the design to estimate utilities.

In [21]:
import re, numpy as np, pandas as pd
from sklearn.linear_model import RidgeCV

Xc_df = pd.DataFrame(Xc).apply(pd.to_numeric, errors="coerce").fillna(0.0)
# intended to drop identifier-style dummies; note that the \b word boundaries do not
# match tokens embedded in longer names (e.g. "bundleID_5486", "task_total_20"),
# so those columns survive the filter and appear among the coefficients in Out[21]
drop_re = re.compile(r'(?i)\b(id|bundle|task|alt|profile|version|resp|respondent)\b')
keep_cols = [c for c in Xc_df.columns.astype(str) if not drop_re.search(c)]
X_use = Xc_df[keep_cols]

X_num = X_use.to_numpy(dtype=np.float64)
y_num = pd.Series(y).astype(float).fillna(0.0).to_numpy(dtype=np.float64)

# ridge with generalized cross-validation over a log-spaced alpha grid
ridge = RidgeCV(alphas=np.logspace(-3, 3, 9), cv=None, gcv_mode="auto")
ridge.fit(X_num, y_num)

coefs = pd.Series(ridge.coef_, index=X_use.columns.astype(str)).sort_values(ascending=False)
coefs.head(10)
Out[21]:
narrative_Physical Activity    0.687949
bundleID_5486                  0.612485
bundleID_5019                  0.609340
bundleID_3487                  0.605393
bundleID_265                   0.597330
bundleID_4191                  0.596166
bundleID_5872                  0.590472
bundleID_889                   0.587443
bundleID_3500                  0.575397
bundleID_2090                  0.574328
dtype: float64
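
A hedged sketch of how these utilities could feed the offer recommendation: a profile's predicted rating is the ridge intercept plus the sum of its level part-worths, so candidate bundles can be scored and ranked directly. It reuses ridge, X_num, coefs, and cq from the cells above; the example levels are just two dummy columns picked from the design for illustration.

# Rank the surveyed profiles by predicted rating
pred = pd.Series(ridge.intercept_ + X_num @ ridge.coef_, index=cq.index)
display(cq.assign(predicted_rating=pred)
          .sort_values("predicted_rating", ascending=False)
          .head(5))

# Score a single hypothetical profile by summing the part-worths of its levels
example_levels = ["narrative_Physical Activity", "collaboration_Solo"]  # illustrative levels only
print("Example profile utility:",
      round(float(ridge.intercept_ + coefs.reindex(example_levels).fillna(0).sum()), 2))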

2.9) Attribute importance & part-worth bars¶

Show which attributes matter most and the level utilities. Importance here is the range of part-worths within an attribute, expressed as a share of the total range across attributes.

In [22]:
import matplotlib.pyplot as plt
import pandas as pd

# attribute = text before the first "_" (multi-word attribute names such as
# quest_progression collapse to their first word under this split)
df_coef = pd.DataFrame({
    "col":  coefs.index,
    "att":  [c.split("_", 1)[0] for c in coefs.index],
    "coef": coefs.values
})

# importance = range (max - min) of level utilities within each attribute,
# expressed as a share of the total range across attributes
imp = (df_coef.groupby("att", observed=True)["coef"]
             .agg(lambda s: float(s.max() - s.min()))
             .sort_values(ascending=False))
imp_pct = imp / imp.sum() * 100

N = 6
top_imp = imp_pct.head(N)

plt.figure(figsize=(6, 3.2))
top_imp.sort_values().plot(kind="barh")
plt.title("Attribute importance (range, %)"); plt.xlabel("% of total importance"); plt.ylabel("")
plt.tight_layout(); plt.show()

for att in top_imp.index:
    s = (df_coef[df_coef["att"] == att]
            .set_index("col")["coef"]
            .sort_values())
    plt.figure(figsize=(6, 3.2))
    s.plot(kind="barh")
    plt.title(f"{att} part-worths"); plt.xlabel("utility"); plt.ylabel("")
    plt.tight_layout(); plt.show()
[Figures: attribute importance (range, %) and part-worth bars for the top attributes]

Part 3 — Modeling & Evaluation¶

Focus. Build and explain a churn classifier (telco_churn_data.csv) that balances accuracy and actionability.

You’ll see

  • Stratified split, robust preprocessing pipelines (numeric + categorical)
  • Baseline Logistic Regression and ensemble benchmarks (Random Forest / GB)
  • Clear metrics (ROC-AUC, F1, precision/recall) and threshold tuning
  • Feature importance and partial-dependence plots for actionable drivers
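
Since the list above promises partial-dependence plots, here is a minimal sketch of how they could be produced. It assumes the fitted best_model and the X_valid frame created later (3.2 and 3.6), so it would run after those cells; the feature names are illustrative guesses and are filtered against the actual columns.

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Hypothetical partial-dependence sketch: run after best_model is fitted in 3.6
pdp_feats = ["Tenure in Months", "Monthly Charge", "Number of Referrals"]  # illustrative picks
pdp_feats = [f for f in pdp_feats if f in X_valid.columns]
PartialDependenceDisplay.from_estimator(best_model, X_valid, features=pdp_feats)
plt.tight_layout(); plt.show()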
In [23]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from pathlib import Path

p = Path("data/telco_churn_data.csv")
if not p.exists():
    raise FileNotFoundError("Put telco_churn_data.csv under data/.")
raw = pd.read_csv(p)
raw.head(3)
Out[23]:
Customer ID Referred a Friend Number of Referrals Tenure in Months Offer Phone Service Avg Monthly Long Distance Charges Multiple Lines Internet Service Internet Type ... Latitude Longitude Population Churn Value CLTV Churn Category Churn Reason Total Customer Svc Requests Product/Service Issues Reported Customer Satisfaction
0 8779-QRDMV No 0 1 NaN No 0.00 No Yes Fiber Optic ... 34.023810 -118.156582 68701 1 5433 Competitor Competitor offered more data 5 0 NaN
1 7495-OOKFY Yes 1 8 Offer E Yes 48.85 Yes Yes Cable ... 34.044271 -118.185237 55668 1 5302 Competitor Competitor made better offer 5 0 NaN
2 1658-BYGOY No 0 18 Offer D Yes 11.33 Yes Yes Fiber Optic ... 34.108833 -118.229715 47534 1 3179 Competitor Competitor made better offer 1 0 NaN

3 rows × 46 columns

3.1) Target and features¶

In [24]:
import pandas as pd, numpy as np

df = raw.copy()
for c in df.columns:
    if df[c].dtype == "object":
        df[c] = df[c].astype(str).str.strip()

tokens = ["churn", "attrit", "exit", "cancel", "leave", "dropout"]
name_hits = [c for c in df.columns if any(t in c.lower() for t in tokens)]

low_card = [c for c in df.columns if df[c].astype(str).nunique(dropna=True) <= 3]

candidates = list(dict.fromkeys(name_hits + low_card))  

def to_binary(series):
    s = series.astype(str).str.strip().str.lower()
    mapping = {
        "yes":1, "y":1, "true":1, "t":1, "1":1, "churn":1, "churned":1, "exited":1, "cancelled":1, "left":1, "inactive":1,
        "no":0, "n":0, "false":0, "f":0, "0":0, "stay":0, "stayed":0, "active":0, "retained":0
    }
    mapped = s.map(mapping)
    
    if mapped.isna().all() and pd.api.types.is_numeric_dtype(series):
        mapped = series.astype(float)
        uniq = set(pd.unique(mapped.dropna()))
        if uniq <= {0,1}: 
            return mapped.astype(int)
        if uniq <= {0.0,1.0}:
            return mapped.astype(int)
    return mapped

y_name, y = None, None
for c in candidates:
    y_try = to_binary(df[c])
    if set(pd.unique(y_try.dropna())) <= {0,1} and y_try.notna().mean() > 0.95:
        y_name, y = c, y_try.fillna(0).astype(int)
        break

if y is None:
    print("Could not auto-detect target. Here are some low-cardinality columns to check:")
    disp = pd.DataFrame({c: df[c].astype(str).unique()[:6] for c in low_card}).iloc[:3]
    display(disp)
    raise ValueError("No binary churn-like target found. Rename your target to include one of: "
                     + ", ".join(tokens) + " or ensure it has two levels (yes/no, 0/1, etc.).")

print(f"Detected target column: {y_name}")

# drop only the target; note that sibling fields such as Churn Category / Churn Reason
# remain in X and leak the outcome (this shows up again in 3.9)
X = df.drop(columns=[y_name])

cat_cols = X.select_dtypes(include=["object","category"]).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

for c in cat_cols:
    vc = X[c].value_counts(dropna=False)
    rare = vc[vc <= max(2, int(0.01*len(X)))].index
    X[c] = X[c].where(~X[c].isin(rare), "Other")

(X.shape, int(y.sum()), len(y))
Detected target column: Churn Value
Out[24]:
((7043, 45), 1869, 7043)

3.2) Train/validation split¶

Stratified split so the churn rate is preserved in both the training and validation sets.

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
y_train.mean(), y_valid.mean()
Out[25]:
(np.float64(0.2654297614539947), np.float64(0.26519023282226006))

3.3) Preprocessing pipeline¶

Numerics: impute + scale. Categoricals: impute + one-hot (drop first).

In [26]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(drop="first", handle_unknown="ignore"))
])

pre = ColumnTransformer(
    transformers=[
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols)
    ],
    remainder="drop"
)

3.4) Baseline and ensembles¶

Train Logistic Regression (baseline), Random Forest, and Gradient Boosting with simple settings.

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

clf_lr = Pipeline(steps=[
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=200, n_jobs=None))
])

clf_rf = Pipeline(steps=[
    ("pre", pre),
    ("clf", RandomForestClassifier(
        n_estimators=400, max_depth=None, min_samples_split=5,
        random_state=42, n_jobs=-1
    ))
])

clf_gb = Pipeline(steps=[
    ("pre", pre),
    ("clf", GradientBoostingClassifier(random_state=42))
])

models = {
    "Logistic": clf_lr,
    "RandomForest": clf_rf,
    "GradientBoosting": clf_gb
}
list(models.keys())
Out[27]:
['Logistic', 'RandomForest', 'GradientBoosting']

3.5) Stratified CV¶

Score ROC-AUC with 5-fold stratified cross-validation for a fair comparison.

In [28]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, pipe in models.items():
    auc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc", n_jobs=-1)
    cv_scores[name] = (auc.mean(), auc.std())
pd.DataFrame(cv_scores, index=["auc_mean","auc_std"]).T.sort_values("auc_mean", ascending=False).round(4)
Out[28]:
auc_mean auc_std
Logistic 1.0 0.0
RandomForest 1.0 0.0
GradientBoosting 1.0 0.0

3.6) Fit best model and base metrics¶

Fit the top model on the training set and evaluate on the holdout validation set.

In [29]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

if "cv_scores" not in globals():
    cv_scores = {}
    for name, pipe in models.items():
        auc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc", n_jobs=-1)
        cv_scores[name] = (auc.mean(), auc.std())

best_name = max(cv_scores, key=lambda k: cv_scores[k][0])
best_model = models[best_name].fit(X_train, y_train)

if hasattr(best_model, "predict_proba"):
    proba = best_model.predict_proba(X_valid)[:, 1]
elif hasattr(best_model, "decision_function"):
    s = best_model.decision_function(X_valid)
    proba = (s - s.min()) / (s.max() - s.min() + 1e-9)
else:
    proba = best_model.predict(X_valid).astype(float)

pred05 = (proba >= 0.5).astype(int)

from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, precision_score, recall_score
metrics_base = {
    "model": best_name,
    "roc_auc": roc_auc_score(y_valid, proba),
    "[email protected]": accuracy_score(y_valid, pred05),
    "[email protected]": precision_score(y_valid, pred05, zero_division=0),
    "[email protected]": recall_score(y_valid, pred05, zero_division=0),
    "[email protected]": f1_score(y_valid, pred05, zero_division=0),
}
base_df = pd.DataFrame([metrics_base]).round(4)
display(base_df.T)
0
model Logistic
roc_auc 1.0
accuracy@0.5 1.0
precision@0.5 1.0
recall@0.5 1.0
f1@0.5 1.0

3.7) Threshold tuning¶

Sweep thresholds to trade off precision and recall, and surface the thresholds with the best F1.

In [30]:
ths = np.linspace(0.05, 0.95, 19)
rows = []
for t in ths:
    pred = (proba >= t).astype(int)
    rows.append([
        t,
        precision_score(y_valid, pred, zero_division=0),
        recall_score(y_valid, pred, zero_division=0),
        f1_score(y_valid, pred, zero_division=0),
        accuracy_score(y_valid, pred)
    ])
thr_df = pd.DataFrame(rows, columns=["threshold","precision","recall","f1","accuracy"]).round(4)
thr_df.sort_values("f1", ascending=False).head(5)
Out[30]:
threshold precision recall f1 accuracy
0 0.05 1.0 1.0 1.0 1.0
10 0.55 1.0 1.0 1.0 1.0
17 0.90 1.0 1.0 1.0 1.0
16 0.85 1.0 1.0 1.0 1.0
15 0.80 1.0 1.0 1.0 1.0
In [31]:
plt.figure(figsize=(8,4))
plt.plot(thr_df["threshold"], thr_df["precision"], label="precision")
plt.plot(thr_df["threshold"], thr_df["recall"], label="recall")
plt.plot(thr_df["threshold"], thr_df["f1"], label="f1")
plt.xlabel("threshold"); plt.ylabel("score"); plt.legend(); plt.tight_layout(); plt.show()
[Figure: Precision, recall, and F1 across thresholds]

3.8) Confusion matrix at tuned threshold¶

Pick the best-F1 threshold and show the confusion matrix.

In [32]:
t_star = float(thr_df.sort_values("f1", ascending=False).iloc[0]["threshold"])
pred_star = (proba >= t_star).astype(int)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_valid, pred_star, labels=[0,1])

plt.figure(figsize=(4,3))
plt.imshow(cm, cmap="Blues")
plt.xticks([0,1], ["No churn","Churn"])
plt.yticks([0,1], ["No churn","Churn"])
for i in range(2):
    for j in range(2):
        plt.text(j, i, cm[i,j], ha="center", va="center")
plt.title(f"Confusion matrix @ {t_star:.2f}")
plt.tight_layout(); plt.show()
[Figure: Confusion matrix at the tuned threshold]

3.9) Feature importance¶

Show global importance from the best model: coefficients for Logistic, impurity importance for tree models. The dominant features below are Churn Reason / Churn Category dummies, fields that appear to be populated only after a customer has churned, which points to target leakage and explains the perfect validation scores above.

In [33]:
import numpy as np, pandas as pd

pre = best_model.named_steps["pre"]
est = best_model.named_steps["clf"]

try:
    Xt_sample = pre.transform(X_train.iloc[: min(50, len(X_train))])
except Exception:
    Xt_sample = pre.fit_transform(X_train.iloc[: min(50, len(X_train))], y_train)
n_feat = Xt_sample.shape[1]

try:
    feats = pre.get_feature_names_out()
    feats = np.array([f.split("__", 1)[-1] for f in feats])
    if len(feats) != n_feat:
        raise ValueError("name length mismatch")
except Exception:
    feats = np.array([f"feature_{i}" for i in range(n_feat)])

vals = None
kind = None
if hasattr(est, "feature_importances_"):
    vals = np.asarray(est.feature_importances_, dtype=float)
    kind = "importance"
elif hasattr(est, "coef_"):
    vals = np.asarray(est.coef_).ravel()
    kind = "coef"

imp_df = None
if vals is not None:
    if len(vals) != n_feat:
        m = min(len(vals), n_feat)
        vals = vals[:m]
        feats = feats[:m]
    order = np.argsort(np.abs(vals))[::-1] if kind == "coef" else np.argsort(vals)[::-1]
    imp_df = pd.DataFrame({kind: np.round(vals[order], 4)}, index=feats[order]).head(20)

display(imp_df.head(10) if imp_df is not None else "No importance available for this estimator.")
coef
Churn Reason_nan -4.6624
Churn Category_nan -4.6624
Churn Category_Competitor 1.3453
Churn Reason_Attitude of support person 0.9130
Churn Reason_Other 0.8708
Churn Category_Dissatisfaction 0.7909
Churn Category_Price 0.7383
Number of Referrals -0.6605
Customer Satisfaction -0.6511
Churn Category_Other 0.6421
In [34]:
if imp_df is not None:
    vals = imp_df.iloc[:15,0].values
    labs = imp_df.index[:15]
    plt.figure(figsize=(8,5))
    plt.barh(range(len(vals))[::-1], vals[::-1])
    plt.yticks(range(len(vals))[::-1], labs[::-1])
    plt.title("Top features")
    plt.tight_layout(); plt.show()
[Figure: Top features]
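
As a robustness check (a sketch only; results not shown here), the post-outcome fields flagged above can be dropped and the baseline re-fit on the same split, reusing the numeric/categorical pipelines from 3.3. A materially lower AUC would confirm that the perfect scores were driven by leakage; the 500-iteration setting is an assumption, not the original configuration.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical leakage check: drop fields that appear to be populated only after a churn
leaky = [c for c in X.columns if c.startswith(("Churn Category", "Churn Reason"))]

pre_clean = ColumnTransformer([
    ("num", num_pipe, [c for c in num_cols if c not in leaky]),
    ("cat", cat_pipe, [c for c in cat_cols if c not in leaky]),
], remainder="drop")

clean_lr = Pipeline([("pre", pre_clean), ("clf", LogisticRegression(max_iter=500))])
clean_lr.fit(X_train.drop(columns=leaky), y_train)
proba_clean = clean_lr.predict_proba(X_valid.drop(columns=leaky))[:, 1]
print("ROC-AUC without leaky fields:", round(roc_auc_score(y_valid, proba_clean), 3))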

3.10) Save a lean output¶

Keep IDs, actuals, probabilities, and a default 0.5 prediction.

In [35]:
out = pd.DataFrame({
    "row_id": np.arange(len(y_valid)),
    "actual": y_valid.to_numpy(),
    "proba": proba,
    "[email protected]": pred05
})
display(out.head(10))
row_id actual proba pred@0.5
0 0 0 0.000692 0
1 1 0 0.000189 0
2 2 0 0.001049 0
3 3 0 0.000329 0
4 4 0 0.000488 0
5 5 1 0.991993 1
6 6 0 0.001122 0
7 7 0 0.000989 0
8 8 0 0.000228 0
9 9 0 0.000181 0