Logistic Regression Pipeline Example
This page walks through a minimal pipeline for a logistic regression classifier trained on the 2015 season.
1 Setup
1.1 Load Packages
```python
import pandas as pd
import numpy as np
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression

from src import utils  # see src/ folder in the project repo
from src.data import make_dataset
```
1.2 Helper Functions
```python
print_df = utils.create_print_df_fcn(tablefmt='html')
show_fig = utils.create_show_fig_fcn(img_dir='models/classifier_pipeline_example/')
```
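`src.utils` lives in the project repo and is not shown here. As a rough sketch of what the two factories might look like (the bodies below are assumptions, not the repo's actual code; they assume `tabulate` for table rendering and `matplotlib` for figures):

```python
# Hypothetical sketch of the helper factories in src/utils.
import os
from tabulate import tabulate
import matplotlib.pyplot as plt

def create_print_df_fcn(tablefmt='html'):
    """Return a function that renders a DataFrame in the chosen table format."""
    def print_df(df):
        print(tabulate(df, headers='keys', tablefmt=tablefmt))
    return print_df

def create_show_fig_fcn(img_dir):
    """Return a function that saves the current figure under img_dir, then shows it."""
    def show_fig(name):
        os.makedirs(img_dir, exist_ok=True)
        plt.savefig(os.path.join(img_dir, name))
        plt.show()
    return show_fig
```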
1.3 Load Data
```python
data = make_dataset.get_train_data_v1(2015)
print_df(data.head())
```
|   | season | daynum | numot | tourney | team1 | team2 | score1 | score2 | loc | team1win | seed1 | seednum1 | seed2 | seednum2 | seeddiff | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 11 | 0 | 0 | 1103 | 1420 | 74 | 57 | 1103 | 1 | nan | nan | nan | nan | nan | 2015_1103_1420 |
| 1 | 2015 | 11 | 0 | 0 | 1104 | 1406 | 82 | 54 | 1104 | 1 | nan | nan | nan | nan | nan | 2015_1104_1406 |
| 2 | 2015 | 11 | 0 | 0 | 1112 | 1291 | 78 | 55 | 1112 | 1 | Z02 | 2 | nan | nan | nan | 2015_1112_1291 |
| 3 | 2015 | 11 | 0 | 0 | 1113 | 1152 | 86 | 50 | 1113 | 1 | nan | nan | nan | nan | nan | 2015_1113_1152 |
| 4 | 2015 | 11 | 0 | 0 | 1102 | 1119 | 78 | 84 | 1119 | 0 | nan | nan | nan | nan | nan | 2015_1102_1119 |
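The `ID` column follows the `<season>_<team1>_<team2>` pattern, with `team1` always the lower team ID. A hypothetical reconstruction from the other columns (an illustrative check, not part of the pipeline):

```python
# Illustrative check: team1 holds the lower ID, and ID is built from its parts.
assert (data['team1'] < data['team2']).all()
ids = (data['season'].astype(str) + '_' +
       data['team1'].astype(str) + '_' +
       data['team2'].astype(str))
assert (ids == data['ID']).all()
```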
1.4 Process Data
We'll process the data for a logistic regression on a single feature, `seeddiff`, the difference between the two teams' tournament seed numbers.
```python
X_train = data.loc[data.tourney == 0, ['seeddiff']].dropna()
X_test = data.loc[data.tourney == 1, ['seeddiff']].dropna()
y_train = data.loc[X_train.index, 'team1win']
y_test = data.loc[X_test.index, 'team1win']
print_df(X_train.head())
```
|   | seeddiff |
|---|---|
| 26 | 2 |
| 42 | -9 |
| 47 | 0 |
| 88 | -4 |
| 108 | -8 |
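Note that `dropna()` keeps only games where `seeddiff` is available, which presumably requires both teams to be seeded and therefore discards most regular-season games. A quick check along these lines (a sketch) shows the surviving sample size and the class balance:

```python
# How many regular-season games survive the dropna(), and how balanced are
# the labels? (team1win should be roughly 50/50 since ordering by ID is arbitrary)
print(len(data[data.tourney == 0]), '->', len(X_train))
print(y_train.value_counts(normalize=True))
```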
2 Models
2.1 Simple Logistic Regression
We fit a logistic regression classifier on the regular-season games.

- The intercept is fixed at 0: which team gets the lower ID is arbitrary, so it should not affect the winning probability when all other factors are balanced.
```python
clf = LogisticRegression(penalty='l2', fit_intercept=False, C=0.0001,
                         verbose=False, max_iter=1000, solver='lbfgs')
clf.fit(X_train, y_train)

# Column 1 of predict_proba() is P(team1win == 1), i.e. the probability
# that the lower-ID team wins.
pred_train = pd.DataFrame({'ID': data.loc[X_train.index, 'ID'],
                           'Pred': clf.predict_proba(X_train)[:, 1],
                           'Train': True})
pred_test = pd.DataFrame({'ID': data.loc[X_test.index, 'ID'],
                          'Pred': clf.predict_proba(X_test)[:, 1],
                          'Train': False})
pred = pd.concat([pred_train, pred_test])[['ID', 'Pred', 'Train']]
print_df(pred.head())
```
|   | ID | Pred | Train |
|---|---|---|---|
| 26 | 2015_1186_1411 | 0.51759 | True |
| 42 | 2015_1214_1234 | 0.421469 | True |
| 47 | 2015_1248_1352 | 0.5 | True |
| 88 | 2015_1295_1400 | 0.464864 | True |
| 108 | 2015_1308_1455 | 0.430074 | True |
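Since the intercept is fixed at zero, a seed difference of 0 must map to a predicted probability of exactly 0.5, which matches row 47 above. A quick sanity check with the fitted `clf`:

```python
# With fit_intercept=False, seeddiff == 0 gives sigmoid(0) == 0.5 exactly.
print(clf.predict_proba(pd.DataFrame({'seeddiff': [0.0]})))  # [[0.5 0.5]]
```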
3 Evaluation
3.1 LogLoss
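Log loss (binary cross-entropy) is the evaluation metric here. For predicted win probabilities $p_i$ and outcomes $y_i \in \{0, 1\}$ over $N$ games,

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right].$$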
```python
train_loss = log_loss(y_train, pred.loc[pred.Train, 'Pred'])
test_loss = log_loss(y_test, pred.loc[~pred.Train, 'Pred'])
print('train log_loss:{:0.2f}\ttest log_loss:{:0.2f}'.format(train_loss, test_loss))
```
train log_loss:0.75 test log_loss:0.77
3.2 Accuracy
Although accuracy is not the evaluation metric, it may be useful for ensembling the predictions.

- ROC and PR curves are irrelevant for this data representation: having the lower team ID is arbitrary, so 0.5 is the only sensible classification threshold (see the check below).
```python
train_acc = np.mean(y_train == clf.predict(X_train))
test_acc = np.mean(y_test == clf.predict(X_test))
print('train accuracy:{:0.2f}\ttest accuracy:{:0.2f}'.format(train_acc, test_acc))
```
train accuracy:0.72 test accuracy:0.79
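As a quick confirmation of the fixed 0.5 threshold, `clf.predict` should agree with thresholding the predicted win probability; a small check using the fitted `clf` from above:

```python
# predict() thresholds the decision function at zero, which for binary
# logistic regression is the same as thresholding P(team1win == 1) at 0.5.
manual = (clf.predict_proba(X_test)[:, 1] > 0.5).astype(int)
assert (manual == clf.predict(X_test)).all()
```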
4 Next Steps
4.1 Data
- Use more features
- Perform feature engineering
4.2 Models
- Fit more complex models
  - Expand the feature set
  - Try black-box models
  - Ensemble models
- Create a model API and save predictions (for the automated evaluation below)
4.3 Evaluation
- Automate evaluation via cross-validation (a sketch follows this list)
  - Split the data into folds
  - Call the model API to save predictions on each fold
    - Do this for many models with various hyperparameter settings
  - Load the predictions and calculate metrics to compare performance
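A minimal sketch of that loop, reusing the objects defined above; the five-fold split and the candidate settings are illustrative assumptions, not the project's actual model API:

```python
# Sketch: compare candidate models by out-of-fold log loss.
from sklearn.model_selection import KFold

def cv_log_loss(model, X, y, n_splits=5, seed=0):
    """Mean out-of-fold log loss for one model."""
    losses = []
    for train_idx, valid_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = model.predict_proba(X.iloc[valid_idx])[:, 1]
        losses.append(log_loss(y.iloc[valid_idx], proba))
    return np.mean(losses)

candidates = {'C=0.0001': LogisticRegression(fit_intercept=False, C=0.0001),
              'C=1': LogisticRegression(fit_intercept=False, C=1.0)}
for name, model in candidates.items():
    print('{}: {:0.3f}'.format(name, cv_log_loss(model, X_train, y_train)))
```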