Logistic Regression Pipeline Example
This page walks through a minimal pipeline for a logistic regression classifier trained on the 2015 season.
1 Setup
1.1 Load Packages
```python
import pandas as pd
import numpy as np
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression

from src import utils  # see src/ folder in the project repo
from src.data import make_dataset
```
1.2 Helper Functions
```python
print_df = utils.create_print_df_fcn(tablefmt='html')
show_fig = utils.create_show_fig_fcn(img_dir='models/classifier_pipeline_example/')
```
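`src.utils` lives in the project repo and is not shown here. As a rough sketch of what the two factories might look like (the bodies below are assumptions, not the repo's actual code; they assume `tabulate` for table rendering and `matplotlib` for figures):

```python
# Hypothetical sketch of the helper factories in src/utils.
import os
from tabulate import tabulate
import matplotlib.pyplot as plt

def create_print_df_fcn(tablefmt='html'):
    """Return a function that renders a DataFrame in the chosen table format."""
    def print_df(df):
        print(tabulate(df, headers='keys', tablefmt=tablefmt))
    return print_df

def create_show_fig_fcn(img_dir):
    """Return a function that saves the current figure under img_dir, then shows it."""
    def show_fig(name):
        os.makedirs(img_dir, exist_ok=True)
        plt.savefig(os.path.join(img_dir, name))
        plt.show()
    return show_fig
```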
1.3 Load Data
```python
data = make_dataset.get_train_data_v1(2015)
print_df(data.head())
```
|   | season | daynum | numot | tourney | team1 | team2 | score1 | score2 | loc | team1win | seed1 | seednum1 | seed2 | seednum2 | seeddiff | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 11 | 0 | 0 | 1103 | 1420 | 74 | 57 | 1103 | 1 | nan | nan | nan | nan | nan | 2015_1103_1420 |
| 1 | 2015 | 11 | 0 | 0 | 1104 | 1406 | 82 | 54 | 1104 | 1 | nan | nan | nan | nan | nan | 2015_1104_1406 |
| 2 | 2015 | 11 | 0 | 0 | 1112 | 1291 | 78 | 55 | 1112 | 1 | Z02 | 2 | nan | nan | nan | 2015_1112_1291 |
| 3 | 2015 | 11 | 0 | 0 | 1113 | 1152 | 86 | 50 | 1113 | 1 | nan | nan | nan | nan | nan | 2015_1113_1152 |
| 4 | 2015 | 11 | 0 | 0 | 1102 | 1119 | 78 | 84 | 1119 | 0 | nan | nan | nan | nan | nan | 2015_1102_1119 |
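The `ID` column follows the `<season>_<team1>_<team2>` pattern, with `team1` always the lower team ID. A hypothetical reconstruction from the other columns (an illustrative check, not part of the pipeline):

```python
# Illustrative check: team1 holds the lower ID, and ID is built from its parts.
assert (data['team1'] < data['team2']).all()
ids = (data['season'].astype(str) + '_' +
       data['team1'].astype(str) + '_' +
       data['team2'].astype(str))
assert (ids == data['ID']).all()
```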
1.4 Process Data
We'll process the data for a logistic regression on a single feature, `seeddiff`, the difference between the two teams' tournament seed numbers.
```python
X_train = data.loc[data.tourney == 0, ['seeddiff']].dropna()
X_test = data.loc[data.tourney == 1, ['seeddiff']].dropna()
y_train = data.loc[X_train.index, 'team1win']
y_test = data.loc[X_test.index, 'team1win']
print_df(X_train.head())
```
|   | seeddiff |
|---|---|
| 26 | 2 |
| 42 | -9 |
| 47 | 0 |
| 88 | -4 |
| 108 | -8 |
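Note that `dropna()` keeps only games where `seeddiff` is available, which presumably requires both teams to be seeded and therefore discards most regular-season games. A quick check along these lines (a sketch) shows the surviving sample size and the class balance:

```python
# How many regular-season games survive the dropna(), and how balanced are
# the labels? (team1win should be roughly 50/50 since ordering by ID is arbitrary)
print(len(data[data.tourney == 0]), '->', len(X_train))
print(y_train.value_counts(normalize=True))
```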
2 Models
2.1 Simple Logistic Regression
We fit a logistic regression classifier on the regular-season games.

- The intercept is fixed at 0: which team gets the lower ID is arbitrary, so it should not affect the winning probability when all other factors are balanced.
```python
clf = LogisticRegression(penalty='l2', fit_intercept=False, C=0.0001,
                         verbose=False, max_iter=1000, solver='lbfgs')
clf.fit(X_train, y_train)

# Column 1 of predict_proba() is P(team1win == 1), i.e. the probability
# that the lower-ID team wins.
pred_train = pd.DataFrame({'ID': data.loc[X_train.index, 'ID'],
                           'Pred': clf.predict_proba(X_train)[:, 1],
                           'Train': True})
pred_test = pd.DataFrame({'ID': data.loc[X_test.index, 'ID'],
                          'Pred': clf.predict_proba(X_test)[:, 1],
                          'Train': False})
pred = pd.concat([pred_train, pred_test])[['ID', 'Pred', 'Train']]
print_df(pred.head())
```
|   | ID | Pred | Train |
|---|---|---|---|
| 26 | 2015_1186_1411 | 0.51759 | True |
| 42 | 2015_1214_1234 | 0.421469 | True |
| 47 | 2015_1248_1352 | 0.5 | True |
| 88 | 2015_1295_1400 | 0.464864 | True |
| 108 | 2015_1308_1455 | 0.430074 | True |
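Since the intercept is fixed at zero, a seed difference of 0 must map to a predicted probability of exactly 0.5, which matches row 47 above. A quick sanity check with the fitted `clf`:

```python
# With fit_intercept=False, seeddiff == 0 gives sigmoid(0) == 0.5 exactly.
print(clf.predict_proba(pd.DataFrame({'seeddiff': [0.0]})))  # [[0.5 0.5]]
```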
3 Evaluation
3.1 LogLoss
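Log loss (binary cross-entropy) is the evaluation metric here. For predicted win probabilities $p_i$ and outcomes $y_i \in \{0, 1\}$ over $N$ games,

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right].$$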
```python
train_loss = log_loss(y_train, pred.loc[pred.Train, 'Pred'])
test_loss = log_loss(y_test, pred.loc[~pred.Train, 'Pred'])
print('train log_loss:{:0.2f}\ttest log_loss:{:0.2f}'.format(train_loss, test_loss))
```
train log_loss:0.75 test log_loss:0.77
3.2 Accuracy
Although accuracy is not the evaluation metric, it may be useful for ensembling the predictions.

- ROC and PR curves are irrelevant for this data representation: having the lower team ID is arbitrary, so 0.5 is the only sensible classification threshold (see the check below).
```python
train_acc = np.mean(y_train == clf.predict(X_train))
test_acc = np.mean(y_test == clf.predict(X_test))
print('train accuracy:{:0.2f}\ttest accuracy:{:0.2f}'.format(train_acc, test_acc))
```
train accuracy:0.72 test accuracy:0.79
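As a quick confirmation of the fixed 0.5 threshold, `clf.predict` should agree with thresholding the predicted win probability; a small check using the fitted `clf` from above:

```python
# predict() thresholds the decision function at zero, which for binary
# logistic regression is the same as thresholding P(team1win == 1) at 0.5.
manual = (clf.predict_proba(X_test)[:, 1] > 0.5).astype(int)
assert (manual == clf.predict(X_test)).all()
```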
4 Next Steps
4.1 Data
- Use more features
- Perform feature engineering
4.2 Models
- Fit more complex models
  - Expand the feature set
  - Try black-box models
  - Ensemble models
- Create a model API and save predictions (for the automated evaluation below)
4.3 Evaluation
- Automate evaluation via cross-validation (a sketch follows this list)
  - Split the data into folds
  - Call the model API to save predictions on each fold
    - Do this for many models with various hyperparameter settings
  - Load the predictions and calculate metrics to compare performance
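A minimal sketch of that loop, reusing the objects defined above; the five-fold split and the candidate settings are illustrative assumptions, not the project's actual model API:

```python
# Sketch: compare candidate models by out-of-fold log loss.
from sklearn.model_selection import KFold

def cv_log_loss(model, X, y, n_splits=5, seed=0):
    """Mean out-of-fold log loss for one model."""
    losses = []
    for train_idx, valid_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = model.predict_proba(X.iloc[valid_idx])[:, 1]
        losses.append(log_loss(y.iloc[valid_idx], proba))
    return np.mean(losses)

candidates = {'C=0.0001': LogisticRegression(fit_intercept=False, C=0.0001),
              'C=1': LogisticRegression(fit_intercept=False, C=1.0)}
for name, model in candidates.items():
    print('{}: {:0.3f}'.format(name, cv_log_loss(model, X_train, y_train)))
```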