This page walks through a minimal pipeline that fits a logistic regression classifier on the 2015 season.

1 Setup

1.1 Load Packages

import pandas as pd
import numpy as np
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from src import utils  # see src/ folder in project repo
from src.data import make_dataset

1.2 Helper Functions

print_df = utils.create_print_df_fcn(tablefmt='html')
show_fig = utils.create_show_fig_fcn(img_dir='models/classifier_pipeline_example/')
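
The helper implementations live in src/utils.py and are omitted here. The following is a hypothetical sketch of what the two factories might look like, assuming tabulate renders the tables and matplotlib produces the figures; names beyond the two used above are illustrative:

# Hypothetical sketch of src/utils.py; the real implementations may differ.
import os
import matplotlib.pyplot as plt
from tabulate import tabulate

def create_print_df_fcn(tablefmt='html'):
    """Return a function that renders a DataFrame in the given table format."""
    def print_df(df):
        print(tabulate(df, headers='keys', tablefmt=tablefmt))
    return print_df

def create_show_fig_fcn(img_dir='models/classifier_pipeline_example/'):
    """Return a function that saves the current matplotlib figure to img_dir."""
    def show_fig(name):
        os.makedirs(img_dir, exist_ok=True)
        plt.savefig(os.path.join(img_dir, name))
        plt.show()
    return show_fig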

1.3 Load Data

data = make_dataset.get_train_data_v1(2015)
print_df(data.head())
   season  daynum  numot  tourney  team1  team2  score1  score2   loc  team1win seed1  seednum1 seed2  seednum2  seeddiff              ID
0    2015      11      0        0   1103   1420      74      57  1103         1   nan       nan   nan       nan       nan  2015_1103_1420
1    2015      11      0        0   1104   1406      82      54  1104         1   nan       nan   nan       nan       nan  2015_1104_1406
2    2015      11      0        0   1112   1291      78      55  1112         1   Z02         2   nan       nan       nan  2015_1112_1291
3    2015      11      0        0   1113   1152      86      50  1113         1   nan       nan   nan       nan       nan  2015_1113_1152
4    2015      11      0        0   1102   1119      78      84  1119         0   nan       nan   nan       nan       nan  2015_1102_1119
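
get_train_data_v1 is defined in src/data/make_dataset.py and is not reproduced here. Its central transformation is re-keying each (winner, loser) result so that team1 is always the lower team ID, which is what makes team1win and the season_team1_team2 ID meaningful. A sketch of that step, using hypothetical raw column names (season, wteam, lteam, wscore, lscore), not the repo's actual schema:

# Hypothetical sketch of the re-keying step inside make_dataset.
def rekey_by_team_id(results):
    """Return one row per game with team1 < team2 and a team1win label."""
    team1_won = results['wteam'] < results['lteam']
    out = pd.DataFrame({
        'team1': results['wteam'].where(team1_won, results['lteam']),
        'team2': results['lteam'].where(team1_won, results['wteam']),
        'score1': results['wscore'].where(team1_won, results['lscore']),
        'score2': results['lscore'].where(team1_won, results['wscore']),
        'team1win': team1_won.astype(int),
    })
    out['ID'] = (results['season'].astype(str) + '_'
                 + out['team1'].astype(str) + '_' + out['team2'].astype(str))
    return out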

1.4 Process Data

We'll process the data for a logistic regression on a single feature, seeddiff, the difference between the two teams' seed numbers.

X_train = data.loc[data.tourney == 0, ['seeddiff']].dropna()
X_test = data.loc[data.tourney == 1, ['seeddiff']].dropna()
y_train = data.loc[X_train.index, 'team1win']
y_test = data.loc[X_test.index, 'team1win']
print_df(X_train.head())
     seeddiff
26          2
42         -9
47          0
88         -4
108        -8

2 Models

2.1 Simple Logistic Regression

We fit a logistic regression classifier on the regular season games.

  • The intercept is fixed at 0 because having the lower team ID should not, by itself, affect the winning probability when all other factors are balanced (a quick symmetry check follows the predictions below).
clf = LogisticRegression(penalty='l2', fit_intercept=False, C=0.0001,
                         verbose=False, max_iter=1000, solver='lbfgs')
clf.fit(X_train, y_train)
# Column 1 of predict_proba is P(team1win = 1), the probability that the
# lower-ID team wins; column 0 would invert the predictions.
pred_train = pd.DataFrame({'ID': data.loc[X_train.index, 'ID'],
                           'Pred': clf.predict_proba(X_train)[:, 1],
                           'Train': True})
pred_test = pd.DataFrame({'ID': data.loc[X_test.index, 'ID'],
                          'Pred': clf.predict_proba(X_test)[:, 1],
                          'Train': False})
pred = pd.concat([pred_train, pred_test])[['ID', 'Pred', 'Train']]
print_df(pred.head())
                 ID      Pred Train
26   2015_1186_1411   0.51759  True
42   2015_1214_1234  0.421469  True
47   2015_1248_1352       0.5  True
88   2015_1295_1400  0.464864  True
108  2015_1308_1455  0.430074  True
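
Since the intercept is fixed at zero and seeddiff flips sign when the two teams are swapped, the fitted probabilities satisfy p(x) = 1 - p(-x). A quick check:

# With no intercept, sigmoid(-w * x) = 1 - sigmoid(w * x), so negating
# seeddiff (i.e. swapping team1 and team2) flips the predicted probability.
check = pd.DataFrame({'seeddiff': [2.0, -9.0, 4.0]})
p_pos = clf.predict_proba(check)[:, 1]
p_neg = clf.predict_proba(-check)[:, 1]
assert np.allclose(p_pos + p_neg, 1.0)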

3 Evaluation

3.1 LogLoss

train_loss = log_loss(y_train, pred.loc[pred.Train, 'Pred'])
test_loss = log_loss(y_test, pred.loc[~pred.Train, 'Pred'])
print('train log_loss:{:0.2f}\ttest log_loss:{:0.2f}'.format(train_loss, test_loss))
train log_loss:0.75	test log_loss:0.77
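
Log loss is the negative mean log-likelihood of the observed labels under the predicted probabilities; computing it by hand gives the same value as sklearn:

# log_loss(y, p) = -mean(y*log(p) + (1 - y)*log(1 - p))
p = pred.loc[pred.Train, 'Pred'].to_numpy()
y = y_train.to_numpy()
manual_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
assert np.isclose(manual_loss, train_loss)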

3.2 Accuracy

Although accuracy is not the evaluation metric here (log loss is), it might be useful when ensembling the predictions.

  • ROC and PR curves are irrelevant for this data representation: which team gets the lower ID is arbitrary, so we should always use 0.5 as the classification threshold (verified below).
train_acc = np.mean(y_train == clf.predict(X_train))
test_acc = np.mean(y_test == clf.predict(X_test))
print('train accuracy:{:0.2f}\ttest accuracy:{:0.2f}'.format(train_acc, test_acc))
train accuracy:0.72	test accuracy:0.79
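
As a sanity check, clf.predict is equivalent to thresholding the predicted probability at 0.5:

# predict() thresholds the decision function at 0, which for logistic
# regression is the same as thresholding P(team1win = 1) at 0.5.
manual_labels = (clf.predict_proba(X_test)[:, 1] > 0.5).astype(int)
assert np.array_equal(manual_labels, clf.predict(X_test))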

4 Next Steps

4.1 Data

  • Use more features
  • Perform feature engineering

4.2 Models

  • Fit more complex models
    • expand features
    • black-box models
    • ensemble
  • Create a model API and save predictions (for automated evaluation below)

4.3 Evaluation

  • Automate evaluation via cross-validation (see the sketch after this list)
    • Split data into folds
    • Call model API to save predictions on each fold
      • do this for many models with various hyperparameter settings
    • Load predictions and calculate metrics to compare performance
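
A minimal sketch of that cross-validation loop, assuming models expose a scikit-learn-style fit/predict_proba API; the evaluate_cv name and fold count are illustrative, not part of the project:

# Hypothetical sketch of automated evaluation; not the project's actual API.
from sklearn.model_selection import KFold

def evaluate_cv(model, X, y, n_splits=5, seed=0):
    """Return the mean validation log loss across folds."""
    losses = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        p_val = model.predict_proba(X.iloc[val_idx])[:, 1]
        losses.append(log_loss(y.iloc[val_idx], p_val))
    return np.mean(losses)

# e.g. compare hyperparameter settings on the regular-season data
for C in (0.0001, 0.01, 1.0):
    model = LogisticRegression(fit_intercept=False, C=C, solver='lbfgs')
    print('C={:g}\tcv log_loss:{:0.4f}'.format(C, evaluate_cv(model, X_train, y_train)))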