Predict lifespan from daily lifestyle habits — Random Forest, Gradient Boosting, Ridge & Lasso with GridSearchCV
R² Score
R² (coefficient of determination) measures how well the model explains variance in the target. 1.0 = perfect fit, 0 = predicts the mean, negative = worse than the mean.
MAE
Mean Absolute Error — average absolute difference between predicted and actual age at death, in years. Lower is better. Robust to outliers.
MSE
Mean Squared Error — average squared difference between predicted and actual values. Penalizes large errors more heavily than MAE. Lower is better.
RMSE
Root Mean Squared Error — square root of MSE, expressed in the same units as the target (years). Easier to interpret than MSE. Lower is better.
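All four metrics come straight from scikit-learn. A minimal sketch with illustrative ages (not values from the dataset):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Toy actual/predicted ages at death, purely for illustration
y_true = np.array([72.0, 65.0, 80.0, 58.0, 77.0])
y_pred = np.array([70.0, 66.0, 78.0, 61.0, 75.0])

r2 = r2_score(y_true, y_pred)              # fraction of variance explained
mae = mean_absolute_error(y_true, y_pred)  # average |error|, in years
mse = mean_squared_error(y_true, y_pred)   # squared error, punishes outliers
rmse = np.sqrt(mse)                        # back in the target's units (years)
```

RMSE is taken as `np.sqrt(mse)` rather than via a `squared=` flag, which keeps the sketch compatible across scikit-learn versions.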
Actual vs Predicted Age at Death
Each point compares the true age at death (x-axis) to the model's prediction (y-axis). Points on the dashed yellow line are perfect predictions. Tighter clustering around the line means better accuracy.
Feature Importance
For tree-based models, importance reflects how much each feature reduces prediction error across all splits. For Ridge/Lasso, it shows the absolute coefficient magnitude after scaling. Higher = more influential.
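The two importance notions described above map to different estimator attributes. A sketch on synthetic data (the four features stand in for the real lifestyle columns):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled lifestyle features
X, y = make_regression(n_samples=200, n_features=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Tree-based: impurity-decrease importances, normalised to sum to 1
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
tree_importance = forest.feature_importances_

# Linear: absolute coefficient magnitude, comparable because inputs are scaled
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
linear_importance = np.abs(ridge.coef_)
```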
Train vs Test Metrics
Compares model performance on training data versus held-out test data. Large gaps between train and test indicate overfitting. R² bars may look tiny because they share an axis with MAE/MSE/RMSE, which are on much larger scales.
Residuals vs Fitted
Each point is one test sample: x = predicted age, y = actual − predicted. A well-behaved model shows points scattered randomly around the dashed zero line with no pattern. Funneling or curves indicate heteroscedasticity or non-linearity.
Residuals Distribution
Histogram of residuals (actual − predicted). A good regression model produces residuals that are roughly normally distributed and centred near zero. Heavy tails or skew suggest systematic bias.
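Both residual diagnostics reduce to one subtraction. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Toy test-set values, for illustration only
y_true = np.array([72.0, 65.0, 80.0, 58.0, 77.0])
y_pred = np.array([70.0, 66.0, 78.0, 61.0, 75.0])

residuals = y_true - y_pred        # actual − predicted; plotted vs y_pred
mean_residual = residuals.mean()   # should sit near zero for an unbiased model
```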
Learning Curve
Training R² and cross-validation R² plotted against increasing training set size. Shaded bands show ±1 standard deviation across folds. A converging gap means the model generalises well; a persistent gap indicates overfitting that more data may help close.
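This curve comes from scikit-learn's `learning_curve` helper. A sketch on synthetic data (the estimator and grid of sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=42)

sizes, train_scores, cv_scores = learning_curve(
    GradientBoostingRegressor(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 5 increasing training sizes
    cv=3, scoring="r2", n_jobs=-1,
)
train_mean = train_scores.mean(axis=1)  # one point per training size
cv_mean = cv_scores.mean(axis=1)
gap = train_mean - cv_mean              # shrinking gap = better generalisation
```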
Exploratory Data Analysis
README.md
Regression pipeline predicting age at death from daily lifestyle habits on the Quality of Life dataset (10,000 samples). Switch between four models using the toggle — each runs GridSearchCV independently with results cached for instant switching. EDA plots load in parallel.
Models
- Random Forest — GridSearch over `n_estimators`, `max_depth`, `min_samples_split`
- Gradient Boosting — GridSearch over `n_estimators`, `learning_rate`, `max_depth`
- Ridge — L2 regularization; GridSearch over `alpha`
- Lasso — L1 regularization, promotes sparsity; GridSearch over `alpha`
ML Pipeline
- Feature / label separation — drop `id`; target = `age_at_death`
- Encoding — one-hot encode `gender` and `occupation_type`
- Train / test split — 80 / 20, `random_state=42`
- Scaling — StandardScaler fit on train, transform applied to both sets
- GridSearchCV — 3-fold CV, `scoring="r2"`, `n_jobs=-1`
- Evaluation — R², MAE, MSE, RMSE reported for both train and test sets
- Diagnostics — residuals vs fitted, residuals distribution, and learning curve
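The steps above can be sketched end to end. The DataFrame here is a synthetic stand-in for the Quality of Life dataset (column names from this README; the generating formula and the reduced grid are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "id": np.arange(n),
    "avg_work_hours_per_day": rng.uniform(4, 12, n),
    "avg_sleep_hours_per_day": rng.uniform(4, 10, n),
    "gender": rng.choice(["male", "female"], n),
    "occupation_type": rng.choice(["office", "manual", "remote"], n),
})
df["age_at_death"] = (60 + 2 * df["avg_sleep_hours_per_day"]
                      - df["avg_work_hours_per_day"] + rng.normal(0, 2, n))

# 1. Feature / label separation: drop id, target = age_at_death
X = df.drop(columns=["id", "age_at_death"])
y = df["age_at_death"]

# 2. One-hot encode the categoricals
X = pd.get_dummies(X, columns=["gender", "occupation_type"])

# 3. 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. Scale: fit on train only, transform both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 5. GridSearchCV (reduced grid for brevity)
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50], "max_depth": [None, 5]},
    cv=3, scoring="r2", n_jobs=-1,
).fit(X_train, y_train)

# 6. Evaluate on the held-out set
pred = search.predict(X_test)
test_r2 = r2_score(y_test, pred)
test_mae = mean_absolute_error(y_test, pred)
```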
EDA
- Correlation heatmap — lower-triangle Pearson correlations across all numeric features
- Stacked histogram — age at death distribution broken down by occupation type
- Dot plot — mean lifestyle metrics and age at death per occupation, sorted by longevity
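The correlation heatmap above is a standard seaborn pattern; a sketch with a synthetic frame standing in for the numeric columns:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the dataset's numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_work_hours_per_day": rng.uniform(4, 12, 200),
    "avg_sleep_hours_per_day": rng.uniform(4, 10, 200),
    "age_at_death": rng.uniform(50, 90, 200),
})

corr = df.corr(numeric_only=True)               # Pearson by default
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the upper triangle
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm")
plt.close()
```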
Features
- avg_work_hours_per_day — daily hours worked
- avg_rest_hours_per_day — daily hours of rest
- avg_sleep_hours_per_day — daily hours of sleep
- avg_exercise_hours_per_day — daily hours of exercise
- gender — one-hot encoded
- occupation_type — one-hot encoded
Tech Stack
- scikit-learn — RandomForestRegressor, GradientBoostingRegressor, Ridge, Lasso, GridSearchCV, StandardScaler
- seaborn / matplotlib — correlation heatmap, stacked histogram, PairGrid dot plot
- pandas / numpy — data wrangling and preprocessing
- Chart.js — actual vs predicted scatter, feature importance, train/test metrics bar
- Flask — serves the page plus the `/run?model=` and `/plots` API endpoints