πŸ’° Indian Savings Predictor

Can we predict how much of their income a person aims to save β€” from their demographics and spending pattern alone? A full ML pipeline: EDA β†’ feature engineering β†’ clustering β†’ regression β†’ classification β†’ evaluation, on 20,000 Indian households.

Demo walkthrough (Loom): https://www.loom.com/share/e2ca71f4eefe4465920084764bd5d67d


πŸ“‹ Dataset & Goal

Source: Indian Personal Finance and Spending Habits — Kaggle, MIT license
Notebook: open in Colab
Size: 20,000 rows × 25 raw features (after dropping 2 identity-leak columns, explained below)

Feature group Columns
Demographics Age (18–64), Dependents (0–4), Occupation (Self_Employed / Retired / Student / Professional), City_Tier (1 / 2 / 3)
Income Income β€” β‚Ή1.3K – β‚Ή1.08M (800Γ— range)
Monthly expenses (11) Rent, Loan_Repayment, Insurance, Groceries, Transport, Eating_Out, Entertainment, Utilities, Healthcare, Education, Miscellaneous
Per-category potential savings (8) Potential_Savings_Groceries, ..._Transport, ..._Eating_Out, ..._Entertainment, ..._Utilities, ..._Healthcare, ..._Education, ..._Miscellaneous
Regression target Desired_Savings_Percentage β€” continuous, 5–25 %
Classification target Low / Mid / High saver β€” derived via quantile binning in Part 7

A critical choice upfront. The dataset ships with Desired_Savings and Disposable_Income columns β€” but both are deterministic functions of the target and the expense columns. Disposable_Income = Income βˆ’ Ξ£(expenses) is an exact identity (max residual β‚Ή0.0000); Desired_Savings β‰ˆ pct Γ— income / 100 with correlation > 0.99. Keeping either in X would leak the answer. I caught this with an arithmetic audit and dropped both before any modeling.
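The audit itself is a few lines of arithmetic. The sketch below mimics it on synthetic rows shaped like the dataset (a subset of expense columns for brevity; in the real data `Disposable_Income` and `Desired_Savings` ship precomputed rather than being derived here):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in rows; in the real dataset the two leak columns arrive precomputed.
expense_cols = ["Rent", "Groceries", "Utilities"]  # subset for brevity
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.uniform(1_000, 50_000, size=(5, 3)), columns=expense_cols)
df["Income"] = rng.uniform(60_000, 200_000, size=5)
df["Desired_Savings_Percentage"] = rng.uniform(5, 25, size=5)
df["Disposable_Income"] = df["Income"] - df[expense_cols].sum(axis=1)
df["Desired_Savings"] = df["Desired_Savings_Percentage"] * df["Income"] / 100

# Audit: max residual of the suspected identity, and correlation with target arithmetic.
residual = (df["Disposable_Income"] - (df["Income"] - df[expense_cols].sum(axis=1))).abs().max()
corr = df["Desired_Savings"].corr(df["Desired_Savings_Percentage"] * df["Income"] / 100)
print(residual)  # 0.0 → exact identity → leak
print(corr)      # ~1.0 → deterministic function of the target → leak
df = df.drop(columns=["Disposable_Income", "Desired_Savings"])
```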

The research question that drives everything below: what demographic and lifestyle factors predict an individual's desired savings rate, and can a model recover that structure from raw behavior rather than identity arithmetic?


πŸ” Part 2 β€” Exploratory Data Analysis

Six questions framed before looking at the data

  1. Does savings ambition differ by occupation?
  2. How strongly does income predict savings goals?
  3. Are expense columns informative or redundant?
  4. Do city tier and occupation interact?
  5. Are the extremes in expenses real, or data errors?
  6. Does age moderate the income–savings relationship?

Each question is paired with a specific plot below and a one-line verdict.


Target distribution

image

The target is bounded 5–25 %, with mean 9.8 %, median 8.9 %, skew 1.42. Three things stand out. First, the range is narrow β€” on this scale RMSE of 2 pp is already a meaningful error, and RΒ² can look deceptively low even for a decent model. Second, the skew is moderate (not extreme), so the baseline will be trained on the raw target with no log transform. Third β€” and most important β€” there's a visible bimodality: a shelf around 10 % and a secondary plateau near 12–15 %. This hints at two savings regimes, and finding them becomes the central thread of the project.


Occupation vs savings β€” ridgeline

image

Verdict (Q1): no difference. All four occupations share near-identical means (~9.8 %) and nearly indistinguishable distribution shapes. A useful non-finding β€” whatever drives savings ambition in this dataset, it isn't what you do for a living. The bimodality we saw in the target must come from somewhere else. This immediately raises the value of the Part 4 clustering step: if the structure isn't one-dimensional and isn't in the categoricals, we need to let an algorithm find it.


Hierarchical correlation heatmap

image

Verdict (Q3): expenses are collinear synonyms of income. The hierarchical clustering groups the expense columns into one tight block and the Potential_Savings_* columns into another β€” both blocks correlate heavily with Income because every expense scales with income in this data. The target sits weakly in the middle. Implication for Part 4: raw expense columns are collectively redundant and must be transformed into ratios (expense / income) to carry independent signal, and the 8 Potential_Savings_* columns will be compressed via PCA.


Income vs savings β€” the headline plot

image

Verdict (Q2): stepped, not linear. This is the single most important plot in the project. Instead of a clean upward line, the hexbin reveals three distinct income brackets — low earners cluster around 5–10 % savings, middle earners around 10–15 %, high earners jump to 15–25 %. Pearson r with log-income is only 0.10 because the relationship is essentially flat within each bracket — the signal lives in the jumps between brackets, not in a linear trend. This plot is the thesis of the whole project: linear regression cannot bend around the plateaus and will produce structured residuals; tree ensembles should recover the brackets almost perfectly. Part 3 and Part 5 then test this prediction directly.


Occupation Γ— city tier

image

Verdict (Q4): no interaction. All twelve cells sit inside a tiny 9.53–10.02 % band. Neither occupation nor city tier β€” individually or combined β€” carries meaningful signal. The demographic categoricals are essentially noise on their own and can only contribute through interactions with numeric features (something we'll test explicitly via a polynomial interaction term between log-income and age in Part 4).


Outlier audit

image

Verdict (Q5): extremes are real, not errors. Every expense column has skew 3.8–5.4. Under the standard IQR rule, 7–9 % of rows would be flagged as outliers β€” but inspection shows those are genuine high earners whose every expense scales proportionally. Decision: keep them all. Removing them would delete the entire top income bracket β€” which is the exact regime the model most needs to learn. I handle the skew instead through log1p transforms of heavy-tailed numeric features in Part 4.
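The audit logic, sketched on a hypothetical heavy-tailed column (a log-normal stand-in, not the actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical heavy-tailed expense column standing in for e.g. Rent.
rng = np.random.default_rng(0)
rent = pd.Series(np.exp(rng.normal(9, 1, size=20_000)))  # log-normal, right-skewed

# Standard 1.5×IQR rule: count how many rows it would flag.
q1, q3 = rent.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = (rent < q1 - 1.5 * iqr) | (rent > q3 + 1.5 * iqr)
print(f"IQR rule flags {flagged.mean():.1%} of rows")

# The chosen remedy: compress the tail instead of deleting the rows.
rent_log = np.log1p(rent)
print(f"skew before: {rent.skew():.2f}, after log1p: {rent_log.skew():.2f}")
```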


Age Γ— income tier

image

Verdict (Q6): mild modulation at best. The three income tiers each have their own characteristic savings distribution, and age shifts them only slightly. Once you condition on income tier, age adds very little. Age is a secondary predictor that might help through explicit interaction features (log_income Γ— Age), which I include as polynomial terms in the Part 4 pipeline.

Summary of the EDA: the target is driven by a stepped income effect; demographic categoricals are individually useless; expense columns are collinear synonyms of income; extremes are genuine; the non-linearity is the modeling challenge. With that hypothesis in hand, it's time to build the baseline and see whether it breaks exactly where theory predicts.


βš™οΈ Part 3 β€” Baseline Model

Goal: predict Desired_Savings_Percentage from raw features using Linear Regression with scikit-learn defaults, seed 42, 80/20 split. The point isn't accuracy β€” it's to establish the floor every later model must beat and to confirm the EDA prediction that a linear model will fail in a structured, diagnosable way.
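The baseline protocol, sketched on synthetic stand-in data (`X` and `y` here are placeholders for the raw feature matrix and `Desired_Savings_Percentage`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the raw feature matrix and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 10))
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=1_000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # same 80/20 split and seed as every later model
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(f"MAE  {mean_absolute_error(y_test, pred):.2f}")
print(f"R²   {r2_score(y_test, pred):.3f}")
```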

Metric Train Test
MAE 1.79 pp 1.79 pp
RMSE 2.41 pp 2.63 pp
R² 0.621 0.541

54 % of test variance explained β€” decent as a floor, but the interesting story is in the residuals.

Residual diagnostics β€” the EDA prediction is confirmed

image

The residual-vs-predicted plot shows three diagonal bands β€” exactly the stepped structure the hexbin predicted. A single straight line cannot bend around the plateaus, so the model cuts through them, systematically over-predicting low savers and under-predicting high ones (residuals reach βˆ’70 pp at the high end). This isn't a model failure β€” it's structural evidence that linear regression is the wrong method class, and it tells me exactly what Part 5 needs to improve on.

Standardized coefficients (not raw)

image

Income dominates with standardized coefficient +3.10 β€” an order of magnitude stronger than any other feature. I multiply raw coefficients by each feature's standard deviation because raw coefs are a scale artifact (a β‚Ή-denominated coefficient of 0.0001 can matter more than a percentage-point coefficient of 1.0). The negative coefficient on Utilities is a multicollinearity artifact, not a real effect β€” utilities is highly correlated with income, so the model inflates Income's positive coefficient and compensates with negative coefficients on correlated expenses. Reading raw linear coefficients as causal effects breaks down in the presence of collinearity we documented in Q3.
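A minimal illustration of why the rescaling matters, on hypothetical two-feature data where the ₹-scale feature actually dominates despite its tiny raw coefficient:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Two features on wildly different scales; the raw coefficients mislead.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "Income": rng.uniform(10_000, 1_000_000, 2_000),  # ₹ scale
    "expense_ratio": rng.uniform(0.5, 0.9, 2_000),    # unit scale
})
y = 0.00001 * X["Income"] + 1.0 * X["expense_ratio"] + rng.normal(0, 0.1, 2_000)

lr = LinearRegression().fit(X, y)
raw = pd.Series(lr.coef_, index=X.columns)
standardized = raw * X.std()  # coef × feature σ → comparable effect sizes
print(raw)           # Income's raw coefficient looks negligible...
print(standardized)  # ...but its standardized effect dominates
```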

Everything the baseline gets wrong, it gets wrong in a way we expected. Now we fix it.


πŸ› οΈ Part 4 β€” Feature Engineering

Six layers of transformation convert 25 raw columns into a preprocessed feature matrix of ~50 features:

Layer Tool What it solved
Binary flags hand-crafted Zero-inflation β€” has_loan (60 % are 0) and has_education (20 % are 0) are meaningful states, not missing values
Log transform np.log1p The 800Γ— income range and heavy right tails
Expense ratios hand-crafted expense_ratio, discretionary_ratio, essential_ratio, potential_savings_ratio β€” remove income-scale collinearity
Demographic buckets pd.cut Non-linear age / dependents effects for tree models
Polynomial (degree 2) PolynomialFeatures Explicit log_income Γ— Age interaction (motivated by Q6)
PCA (2 components) PCA Compress the 8 collinear Potential_Savings_* columns
KMeans clustering KMeans(k=4) 9 cluster-derived features: hard cluster ID, 4Γ— distance-to-each-centroid, dist_to_own_centroid, 4Γ— soft inverse-distance probabilities

Preprocessor fits only on the training fold inside each model's Pipeline β€” no leakage during CV or hyperparameter tuning.
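The first five layers can be sketched as one function. This is an illustrative condensation, not the exact notebook code — column names follow the dataset, only a subset of expense columns is used, and the PCA/KMeans layers are omitted because in the real pipeline every fitted step lives inside the model `Pipeline` so it fits on the training fold only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of layers 1–5 (binary flags, logs, ratios, buckets, interactions)."""
    out = df.copy()
    # Layer 1: binary flags — zero-inflation is a state, not missingness
    out["has_loan"] = (out["Loan_Repayment"] > 0).astype(int)
    # Layer 2: log transform for the 800× income range
    out["log_income"] = np.log1p(out["Income"])
    # Layer 3: ratios remove income-scale collinearity
    expense_cols = ["Rent", "Loan_Repayment", "Groceries"]  # subset for brevity
    out["expense_ratio"] = out[expense_cols].sum(axis=1) / out["Income"]
    # Layer 4: demographic buckets for non-linear age effects
    out["age_bucket"] = pd.cut(out["Age"], bins=[17, 30, 45, 64], labels=False)
    # Layer 5: explicit log_income × Age interaction
    poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
    out[["li", "age", "li_x_age"]] = poly.fit_transform(out[["log_income", "Age"]])
    return out

demo = pd.DataFrame({"Income": [30_000, 80_000], "Age": [25, 50],
                     "Rent": [8_000, 20_000], "Loan_Repayment": [0, 5_000],
                     "Groceries": [4_000, 9_000]})
print(engineer(demo)[["has_loan", "log_income", "expense_ratio", "age_bucket", "li_x_age"]])
```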

Clustering β€” the story of a useful failure

First attempt. I ran KMeans on 9 behavioral features (including Dependents, has_loan, has_education), swept k = 2…8, and let the silhouette score pick the winner. It chose k=2 with silhouette = 0.234 β€” a reasonable geometric score. But when I compared mean target across the two clusters, the gap was 0.02 percentage points. Completely useless for the target. Inspection revealed the split ran along a family-structure axis (dependents, loans, education expense) β€” a real demographic cleavage in the data, but not one correlated with savings ambition.

The lesson is worth stating explicitly: unsupervised quality is not target relevance. A cluster can be geometrically clean and predictively worthless. Silhouette optimizes compactness and separation, not usefulness for your task.

Second attempt. I restricted the feature set to income-structure features only (log_income, expense ratios, potential-savings ratio), swept k again, and traded silhouette for target separation:

k Silhouette Target separation
2 0.230 0.08 pp
3 0.184 0.16 pp
4 0.168 5.87 pp βœ…
5 0.160 6.00 pp

The jump from k=3 to k=4 is the operative one β€” a 36Γ— gain in target-relevant signal for a trivial silhouette cost. k=5 barely improves separation (+0.13 pp) while continuing to fragment the silhouette. k=4 it is.
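The sweep logic — scoring each k on silhouette and on mean-target spread across clusters — can be sketched as follows; the demo data is hypothetical (two latent regimes with different target means), not the project's features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def sweep_k(X, y, ks=range(2, 9), seed=42):
    """For each k, report silhouette AND target separation (max − min cluster mean)."""
    Xs = StandardScaler().fit_transform(X)
    rows = []
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xs)
        sil = silhouette_score(Xs, labels)
        means = [y[labels == c].mean() for c in range(k)]
        rows.append((k, sil, max(means) - min(means)))
    return rows

# Hypothetical demo: two latent regimes with different mean targets.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(4, 1, (300, 3))])
y = np.concatenate([rng.normal(8, 1, 300), rng.normal(13, 1, 300)])
for k, sil, sep in sweep_k(X, y, ks=[2, 3, 4]):
    print(f"k={k}  silhouette={sil:.3f}  target separation={sep:.2f} pp")
```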

The four savings personas

image

Cluster Persona Share Mean income Expense ratio Mean target
2 Low-income strugglers 24 % β‚Ή19K 0.80 7.6 %
1 Lower-middle transition 24 % β‚Ή31K 0.78 8.7 %
3 Squeezed middle 30 % β‚Ή40K 0.84 9.7 %
0 Affluent savers 23 % β‚Ή79K 0.70 13.4 %

Mean target rises monotonically across clusters and lines up almost exactly with the three income brackets we saw in the hexbin. The Squeezed-middle group is the most interesting β€” they earn more than the lower-middle cluster but spend a higher fraction of it, compressing their realized savings ambition. The cluster IDs and distances give the downstream tree models a direct shortcut to the stepped structure that broke the baseline.


πŸ“Š Part 5 β€” Three Improved Models

All models share the engineered feature matrix, the same split (seed 42 β€” identical row assignments as the baseline, for a fair comparison), and the same preprocessing pipeline. Scikit-learn defaults for the untuned versions.

Results β€” each step is attributable

image

Model Test RΒ² Test RMSE Ξ” RΒ² vs baseline
Baseline Linear Reg. (raw) 0.541 2.63 pp β€”
Linear Reg. (engineered) 0.711 2.18 pp +31 % relative
Random Forest (default) 0.830 1.68 pp +53 %
Gradient Boosting (default) 0.832 1.67 pp +54 %
Gradient Boosting (tuned) 0.834 1.66 pp +54 % βœ…

Attribution is clean. Feature engineering alone β€” same algorithm β€” bought +31 % relative RΒ². Switching from a linear model to tree ensembles added another 17 percentage points on top. Hyperparameter tuning added a further +0.2 % β€” a small but honest finding: the sklearn defaults were already near-optimal, and 41 minutes of RandomizedSearchCV mostly confirmed that rather than producing a breakthrough. 5-fold CV std across folds < 0.01 for all three models, and the train/test gap for the winner is < 0.02 β†’ the ranking is stable, not a split artifact.
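The stability check itself is cheap to run. A sketch on synthetic stand-in data (the real check uses the engineered matrix and each candidate model's full pipeline):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic non-linear data standing in for the engineered feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(2_000, 8))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=2_000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor(random_state=42)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2")
test_r2 = model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"CV R² mean={cv_scores.mean():.3f}  std={cv_scores.std():.3f}")
print(f"test R²={test_r2:.3f}  gap={test_r2 - cv_scores.mean():+.3f}")
# Small fold-to-fold std + small CV/test gap → ranking is stable, not a split artifact.
```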

Regression winner: Tuned Gradient Boosting (RΒ² = 0.834, RMSE = 1.66 pp, MAPE β‰ˆ 15 %). Exported as gradient_boosting_regressor.pkl in this repo.

Winning-model residuals β€” the bands are gone

image

Compare to the baseline: the diagonal banding is completely resolved. The model has internalized the three income brackets β€” predictions cluster tightly around 7.5 %, 12.5 %, and 20 %, exactly where the hexbin placed them. Residuals are symmetric, near-homoscedastic, and unstructured across the full prediction range. This is what it looks like when a model learns the mechanism rather than just fitting an average.

Feature importance across the three models

image

A surprising side-finding worth flagging. The hand-engineered features β€” cluster distances, expense ratios, polynomial interactions β€” carried most of the weight for Linear Regression (where dist_to_centroid_0 was literally the strongest signal, stronger than income itself). For Gradient Boosting, the same features are largely redundant: Income and log_income together take ~90 % of impurity-based importance, because the tree ensemble rediscovers the stepped structure from the raw features on its own. Feature engineering substitutes for model expressivity. A stronger model class makes some of it redundant. Both insights are true simultaneously, and they're a useful frame for deciding where to invest effort on future projects.


🏷️ Part 7 β€” Regression β†’ Classification

Same features, same split, same preprocessing β€” but now the target becomes discrete. Quantile binning into three balanced classes (each ~33 % of the data):

Class Savings % range Persona
Low ≀ 7.58 % Conservative saver
Mid 7.58 – 10.46 % Moderate saver
High > 10.46 % Ambitious saver

A three-class split preserves the stepped income structure the EDA uncovered. A binary median split would collapse it. Business-rule thresholds would create imbalance. Quantile binning guarantees balanced classes by construction and matches the three regimes we've been tracking since Part 2.
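The binning is one call to `pd.qcut`; the thresholds above come out of the data, not a hand-picked rule (synthetic target used here for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the continuous target.
rng = np.random.default_rng(42)
savings_pct = pd.Series(rng.uniform(5, 25, 20_000), name="Desired_Savings_Percentage")

labels = pd.qcut(savings_pct, q=3, labels=["Low", "Mid", "High"])
print(labels.value_counts(normalize=True).round(3))  # ~one third each, by construction
print(pd.qcut(savings_pct, q=3).cat.categories)      # the empirical cut points
```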

image

The train/test split preserves the class ratio to within 1 percentage point β€” no resampling or reweighting needed. Accuracy is a defensible headline metric; Macro F1 and per-class F1 serve as secondaries.

Precision vs recall β€” which matters, and why

Precision on the High class matters more, specifically β€” in the most realistic business use case, a fintech recommending premium investment products to predicted high savers. Predicting "High" for someone who actually saves little leads to pushing inappropriate products onto people who can't sustain them β€” damaging both the user (over-commitment, churn) and the provider (reputation, regulatory exposure). A False Negative on High (placing a genuine high saver in a conservative segment) is recoverable β€” they still receive a reasonable product, just a suboptimal one. Minimizing FP on High = maximizing precision on High, and that's the metric I'll read first in Part 8.
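Scoring that asymmetry per class is straightforward with scikit-learn; a toy illustration on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predictions illustrating the asymmetry between FP and FN on "High".
y_true = np.array(["Low", "Mid", "High", "High", "Mid", "Low", "High", "Mid"])
y_pred = np.array(["Low", "Mid", "High", "High", "High", "Low", "Mid", "Mid"])

# Restrict scoring to the High class via labels=.
p_high = precision_score(y_true, y_pred, labels=["High"], average="macro")
r_high = recall_score(y_true, y_pred, labels=["High"], average="macro")
print(f"precision(High)={p_high:.2f}  recall(High)={r_high:.2f}")
# One Mid→High false positive hurts precision (inappropriate product pushed);
# one High→Mid false negative hurts recall (recoverable: still a reasonable product).
```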


πŸ€– Part 8 β€” Classification Models

Three classifiers on the identical engineered features used in Part 5:

image

Model Accuracy Macro F1 Precision (High) F1 (High)
Logistic Regression 0.649 0.635 0.919 0.950
Random Forest βœ… 0.647 0.641 0.933 0.965
Gradient Boosting 0.651 0.640 0.933 0.965

Winner: Random Forest β€” highest Macro F1 (0.641 vs 0.640). Precision on the High class is tied with Gradient Boosting at 0.933, so the tiebreaker is RF's slightly better Mid-class F1 (0.405 vs 0.378). Exported as random_forest_classifier.pkl in this repo.

Three patterns in the confusion matrices worth naming:

  1. All errors are boundary mistakes. Low↔High confusions are 0 for both tree ensembles (9 for Logistic Regression). Every mistake is Low↔Mid or Mid↔High β€” the signature of a well-behaved ordinal classifier.
  2. The Mid class is structurally the hardest. F1 β‰ˆ 0.40 for Mid versus 0.58 for Low and 0.96 for High. Mid has no hard income wall on either side, so there's genuine label ambiguity at the boundaries.
  3. The High class is nearly perfectly recovered. RF achieves 100 % recall on High; GBM 99.8 %. The income signal above the 10.46 % threshold is unambiguous enough that even a simple split recovers it.

Regression winner β‰  classification winner β€” and that's the correct answer

Gradient Boosting wins regression. Random Forest wins classification. Different losses β†’ different winners, and that's not an inconsistency to explain away. Regression rewards accuracy across the full continuous target range β€” boosting's sequential error correction excels there. Classification rewards boundary placement β€” bagging's variance reduction across 200 independent trees wins narrowly on the Mid interior, where the decision surface is noisiest. Both are defensible winners for their respective tasks, and I treat them as such rather than forcing a single winner across both.


🎁 Bonus & Extra Work

Beyond the assignment floor:

  • RandomizedSearchCV hyperparameter tuning on the winning regressor β€” 12 draws Γ— 3 CV folds (36 fits) over 5 GBM parameters β†’ best CV RΒ² = 0.834 (default 0.832). Intentional documentation of a marginal-gain result.
  • Interactive 3D PCA (Plotly) of the clusters in the notebook β€” rotatable by hand, colored by cluster ID, hover-reveals actual savings %, income, and occupation.
  • Soft clustering features β€” cluster_prob_0..3 from inverse-distance weighting, on top of hard cluster assignments and per-centroid distances. Captures the option geometry a hard label cannot.
  • Standardized coefficients for linear feature importance (coef Γ— feature Οƒ), not raw coefficients β€” avoids the scale artifact that causes most student projects to mis-rank features.
  • 5-fold cross-validation stability check with explicit gap analysis (test RΒ² βˆ’ CV mean RΒ²) to confirm no model overfit.
  • Hierarchically clustered correlation heatmap (scipy.cluster.hierarchy.linkage) β€” reveals feature groups visually before engineering decisions are made.
  • Full identity-leak arithmetic audit before modeling β€” the kind of check that prevents a great-looking RΒ² from being a silent bug.
  • First failed cluster attempt documented in the notebook, not hidden. The reader sees the k=2 dead end and the reasoning that led to k=4.
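The soft-clustering features from the bonus list can be sketched directly from `KMeans.transform` — inverse-distance weights over the centroids (a sketch of the idea, not the notebook's exact code):

```python
import numpy as np
from sklearn.cluster import KMeans

# Inverse-distance soft memberships on synthetic data.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

dists = km.transform(X)                       # (n, 4) distance to each centroid
inv = 1.0 / (dists + 1e-9)                    # guard against a point sitting on a centroid
probs = inv / inv.sum(axis=1, keepdims=True)  # rows sum to 1 → soft "membership"
hard = km.labels_

# Sanity check: the soft argmax agrees with the hard assignment.
print(np.mean(probs.argmax(axis=1) == hard))
```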

πŸ’‘ Challenges & Lessons Learned

  1. Silhouette score is not the objective. My first KMeans run was silhouette-optimal (k=2, 0.234) and predictively useless (Ξ” 0.02 pp on target). Cluster for the downstream task, not for the geometry of the cluster space. Unsupervised quality and target relevance are independent axes, and optimizing one can quietly cost you the other.

  2. "Clean" data is sometimes a red flag. Zero nulls and zero duplicates in a real-world financial dataset is suspicious, not reassuring β€” it suggests synthetic generation, perfect surface hygiene, and silent structural leaks. The identity columns (Desired_Savings, Disposable_Income) were the hidden problem, and they look completely benign until you do the arithmetic. Leakage audits belong before EDA, not after modeling.

  3. Feature engineering substitutes for model expressivity. The cluster IDs and ratios that gave Linear Regression its +31 % RΒ² gain were largely redundant for Gradient Boosting, which rediscovers the stepped structure from raw income on its own. Both are true: on weaker models, FE is essential; on stronger models, FE is supplementary. That changes where to invest effort depending on model budget.

  4. Hyperparameter tuning doesn't always help much. 41 minutes of RandomizedSearchCV bought +0.002 RΒ². A negative result is a result, and reporting it honestly is more valuable than cherry-picking the one fold where tuning looked dramatic.

  5. Regression and classification winners can disagree β€” and that's fine. Different losses optimize different things. Forcing a single winner across both would hide useful information about the geometry of the task.


πŸ“Œ Key takeaways

  1. The income step structure is the whole project. Hexbin β†’ baseline residual bands β†’ tree models recover them. A single hypothesis, tested end-to-end, with every plot either confirming or refining it.
  2. Feature engineering alone gave +31 % RΒ² on the same algorithm. Model-class change added another 17 percentage points. Tuning added 0.2 %. The returns are real but steeply diminishing β€” know when to stop.
  3. Clustering for the task beats clustering for the silhouette. 36Γ— target spread for a trivial silhouette cost. Evaluate every intermediate artifact by what it does for the downstream objective, not by its internal metric.
  4. RF wins classification, tuned GBM wins regression. Different losses, different winners. That's the correct answer, not an inconsistency to reconcile.

πŸ“¦ Repository contents

File Description
Yonathan_Levy_Assignment_2_*.ipynb Full annotated notebook β€” EDA, FE, regression, classification, tuning, HF upload
gradient_boosting_regressor.pkl Regression winner β€” tuned GBM pipeline (preprocessor + model, end-to-end)
random_forest_classifier.pkl Classification winner β€” Random Forest pipeline (preprocessor + model)
metadata.json Model metadata β€” best hyperparameters, train/test sizes, test metrics, random seed

Loading the models

import pickle
import pandas as pd

with open('gradient_boosting_regressor.pkl', 'rb') as f:
    reg = pickle.load(f)

# Pass the engineered feature frame with the column names used in training.
# The pipeline handles preprocessing (scaling, one-hot, PCA, poly) end-to-end.
predictions = reg.predict(X_new)   # continuous savings %, typically in [5, 25]

with open('random_forest_classifier.pkl', 'rb') as f:
    clf = pickle.load(f)

labels = clf.predict(X_new)        # 'Low' / 'Mid' / 'High'
proba  = clf.predict_proba(X_new)  # class probabilities

Reproducibility: all randomness uses SEED = 42. Tested with Python 3.10+, scikit-learn 1.3+, pandas 2.0+. The full environment is reproducible by running the notebook top-to-bottom in Colab.


Assignment #2 β€” Classification, Regression, Clustering, Evaluation | Reichman University, April 2026
