Predict lifespan from daily lifestyle habits — Random Forest, Gradient Boosting, Ridge & Lasso with GridSearchCV
R² Score
R² (coefficient of determination) measures how well the model explains variance in the target. 1.0 = perfect fit, 0 = predicts the mean, negative = worse than the mean.
MAE
Mean Absolute Error — average absolute difference between predicted and actual age at death, in years. Lower is better. Robust to outliers.
MSE
Mean Squared Error — average squared difference between predicted and actual values. Penalizes large errors more heavily than MAE. Lower is better.
RMSE
Root Mean Squared Error — square root of MSE, expressed in the same units as the target (years). Easier to interpret than MSE. Lower is better.
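All four metrics come straight from scikit-learn. A minimal sketch with illustrative ages (not values from the dataset):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Toy actual/predicted ages at death, purely for illustration
y_true = np.array([72.0, 65.0, 80.0, 58.0, 77.0])
y_pred = np.array([70.0, 66.0, 78.0, 61.0, 75.0])

r2 = r2_score(y_true, y_pred)              # fraction of variance explained
mae = mean_absolute_error(y_true, y_pred)  # average |error|, in years
mse = mean_squared_error(y_true, y_pred)   # squared error, punishes outliers
rmse = np.sqrt(mse)                        # back in the target's units (years)
```

RMSE is taken as `np.sqrt(mse)` rather than via a `squared=` flag, which keeps the sketch compatible across scikit-learn versions.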
Actual vs Predicted Age at Death
Each point compares the true age at death (x-axis) to the model's prediction (y-axis). Points on the dashed yellow line are perfect predictions. Tighter clustering around the line means better accuracy.
Feature Importance
For tree-based models, importance reflects how much each feature reduces prediction error across all splits. For Ridge/Lasso, it shows the absolute coefficient magnitude after scaling. Higher = more influential.
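The two importance notions described above map to different estimator attributes. A sketch on synthetic data (the four features stand in for the real lifestyle columns):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled lifestyle features
X, y = make_regression(n_samples=200, n_features=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Tree-based: impurity-decrease importances, normalised to sum to 1
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
tree_importance = forest.feature_importances_

# Linear: absolute coefficient magnitude, comparable because inputs are scaled
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
linear_importance = np.abs(ridge.coef_)
```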
Train vs Test Metrics
Compares model performance on training data versus held-out test data. Large gaps between train and test indicate overfitting. R² bars may look tiny because they share an axis with MAE/MSE/RMSE, which are on much larger scales.
Residuals vs Fitted
Each point is one test sample: x = predicted age, y = actual − predicted. A well-behaved model shows points scattered randomly around the dashed zero line with no pattern. Funneling or curves indicate heteroscedasticity or non-linearity.
Residuals Distribution
Histogram of residuals (actual − predicted). A good regression model produces residuals that are roughly normally distributed and centred near zero. Heavy tails or skew suggest systematic bias.
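Both residual diagnostics reduce to one subtraction. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Toy test-set values, for illustration only
y_true = np.array([72.0, 65.0, 80.0, 58.0, 77.0])
y_pred = np.array([70.0, 66.0, 78.0, 61.0, 75.0])

residuals = y_true - y_pred        # actual − predicted; plotted vs y_pred
mean_residual = residuals.mean()   # should sit near zero for an unbiased model
```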
Learning Curve
Training R² and cross-validation R² plotted against increasing training set size. Shaded bands show ±1 standard deviation across folds. A converging gap means the model generalises well; a persistent gap indicates overfitting that more data may help close.
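This curve comes from scikit-learn's `learning_curve` helper. A sketch on synthetic data (the estimator and grid of sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=42)

sizes, train_scores, cv_scores = learning_curve(
    GradientBoostingRegressor(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 5 increasing training sizes
    cv=3, scoring="r2", n_jobs=-1,
)
train_mean = train_scores.mean(axis=1)  # one point per training size
cv_mean = cv_scores.mean(axis=1)
gap = train_mean - cv_mean              # shrinking gap = better generalisation
```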
Exploratory Data Analysis
README.md
Regression pipeline predicting age at death from daily lifestyle habits on the Quality of Life dataset (10,000 samples). Switch between four models using the toggle — each runs GridSearchCV independently with results cached for instant switching. EDA plots load in parallel.
Models
- Random Forest — GridSearch over `n_estimators`, `max_depth`, `min_samples_split`
- Gradient Boosting — GridSearch over `n_estimators`, `learning_rate`, `max_depth`
- Ridge — L2 regularization; GridSearch over `alpha`
- Lasso — L1 regularization, promotes sparsity; GridSearch over `alpha`
ML Pipeline
- Feature / label separation — drop `id`; target = `age_at_death`
- Encoding — one-hot encode `gender` and `occupation_type`
- Train / test split — 80 / 20, `random_state=42`
- Scaling — StandardScaler fit on train, transform applied to both sets
- GridSearchCV — 3-fold CV, `scoring="r2"`, `n_jobs=-1`
- Evaluation — R², MAE, MSE, RMSE reported for both train and test sets
- Diagnostics — residuals vs fitted, residuals distribution, and learning curve
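The steps above can be sketched end to end. The DataFrame here is a synthetic stand-in for the Quality of Life dataset (column names from this README; the generating formula and the reduced grid are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "id": np.arange(n),
    "avg_work_hours_per_day": rng.uniform(4, 12, n),
    "avg_sleep_hours_per_day": rng.uniform(4, 10, n),
    "gender": rng.choice(["male", "female"], n),
    "occupation_type": rng.choice(["office", "manual", "remote"], n),
})
df["age_at_death"] = (60 + 2 * df["avg_sleep_hours_per_day"]
                      - df["avg_work_hours_per_day"] + rng.normal(0, 2, n))

# 1. Feature / label separation: drop id, target = age_at_death
X = df.drop(columns=["id", "age_at_death"])
y = df["age_at_death"]

# 2. One-hot encode the categoricals
X = pd.get_dummies(X, columns=["gender", "occupation_type"])

# 3. 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. Scale: fit on train only, transform both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 5. GridSearchCV (reduced grid for brevity)
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50], "max_depth": [None, 5]},
    cv=3, scoring="r2", n_jobs=-1,
).fit(X_train, y_train)

# 6. Evaluate on the held-out set
pred = search.predict(X_test)
test_r2 = r2_score(y_test, pred)
test_mae = mean_absolute_error(y_test, pred)
```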
EDA
- Correlation heatmap — lower-triangle Pearson correlations across all numeric features
- Stacked histogram — age at death distribution broken down by occupation type
- Dot plot — mean lifestyle metrics and age at death per occupation, sorted by longevity
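The correlation heatmap above is a standard seaborn pattern; a sketch with a synthetic frame standing in for the numeric columns:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the dataset's numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_work_hours_per_day": rng.uniform(4, 12, 200),
    "avg_sleep_hours_per_day": rng.uniform(4, 10, 200),
    "age_at_death": rng.uniform(50, 90, 200),
})

corr = df.corr(numeric_only=True)               # Pearson by default
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the upper triangle
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm")
plt.close()
```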
Features
- avg_work_hours_per_day — daily hours worked
- avg_rest_hours_per_day — daily hours of rest
- avg_sleep_hours_per_day — daily hours of sleep
- avg_exercise_hours_per_day — daily hours of exercise
- gender — one-hot encoded
- occupation_type — one-hot encoded
Tech Stack
- scikit-learn — RandomForestRegressor, GradientBoostingRegressor, Ridge, Lasso, GridSearchCV, StandardScaler
- seaborn / matplotlib — correlation heatmap, stacked histogram, PairGrid dot plot
- pandas / numpy — data wrangling and preprocessing
- Chart.js — actual vs predicted scatter, feature importance, train/test metrics bar
- Flask — serves the page plus the `/run?model=` and `/plots` API endpoints