EDA - Season 2015 | Kaggle - Men’s March Madness 2019

1. Motivation
2. Setup
3. Visualizations
4. Questions
- 4.1. How many times does a pair of teams play each other in a season?
- 4.2. Does the outcome of a regular season game between team1 and team2 predict the outcome in the tournament?

1 Motivation

Here we explore the data from the 2015 season. 2015 was chosen arbitrarily among recent years. There are a few reasons for visualizing a single season at a time.

- It keeps the code simple. We can later write functions to do the same for other seasons or for multiple seasons combined.
- It reflects my general modeling approach: stay simple until there are reasons not to.

Our main goal is to predict the outcomes of the 2019 tournament. Regular season results from the 2019 season might help with this prediction, but it's not clear what information can be gained from previous seasons, given that rosters change significantly each season. For instance, what can we actually learn from the 2000 season that helps us predict the outcomes of the 2019 tournament? Useful information may well be shared across seasons, so we'll explore the incremental benefits of using larger datasets and more complicated models.

Focusing on a single season at a time helps us retain simplicity and interpretability. The goal, of course, is to expand our models, and we'll do that in a way that helps us gain insight about the problem domain.

2 Setup

2.1 Load Packages

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from tabulate import tabulate
from src import utils  # see src/ folder in project repo
from src.data import make_dataset

2.2 Helper Functions

print_df = utils.create_print_df_fcn(tablefmt='html');
show_fig = utils.create_show_fig_fcn(img_dir='eda/eda_2015/');

2.3 Load Data

data = make_dataset.get_train_data_v1(season=2015)
# difference in scores
data['scorediff'] = data['score1'] - data['score2']
# winning and losing scores
data['score_w'] = np.where(data.team1win == 1, data.score1, data.score2)
data['score_l'] = np.where(data.team1win == 0, data.score1, data.score2)
print('Data size = {}'.format(data.shape))
print_df(data.head())

Data size = (5421, 19)

	season	daynum	team1	team2	score1	score2	loc	team1win	seed1	seednum1	seed2	seednum2	seeddiff	ID	scorediff	score_w	score_l
0	2015	11	1103	1420	74	57	1103	1	nan	nan	nan	nan	nan	2015_1103_1420	17	74	57
1	2015	11	1104	1406	82	54	1104	1	nan	nan	nan	nan	nan	2015_1104_1406	28	82	54
2	2015	11	1112	1291	78	55	1112	1	Z02	2	nan	nan	nan	2015_1112_1291	23	78	55
3	2015	11	1113	1152	86	50	1113	1	nan	nan	nan	nan	nan	2015_1113_1152	36	86	50
4	2015	11	1102	1119	78	84	1119	0	nan	nan	nan	nan	nan	2015_1102_1119	-6	84	78

2.4 Basic Description

n_missing = data.isna().sum().rename('n_missing')
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same table
print_df(pd.concat([data.describe(), n_missing.to_frame().T]))

	season	daynum	numot	tourney	team1	team2	score1	score2	team1win	seednum1	seednum2	seeddiff	scorediff	score_w	score_l	ID	loc	seed1	seed2
count	5421	5421	5421	5421	5421	5421	5421	5421	5421	1166	1161	376	5421	5421	5421	nan	nan	nan	nan
mean	2015	71.5853	0.0791367	0.0123593	1224.32	1343.95	66.3785	67.3331	0.472607	8.57719	8.62532	0.361702	-0.954621	72.6838	61.0279	nan	nan	nan	nan
std	0	36.0027	0.326858	0.110494	85.1581	84.8386	12.0862	11.9573	0.499295	4.77696	4.60865	5.70773	14.6983	10.541	10.5097	nan	nan	nan	nan
min	2015	11	0	0	1101	1106	26	26	0	1	1	-15	-62	38	26	nan	nan	nan	nan
25%	2015	40	0	0	1154	1285	58	59	0	4	5	-3	-10	65	54	nan	nan	nan	nan
50%	2015	74	0	0	1210	1359	66	67	0	9	9	0	-2	72	61	nan	nan	nan	nan
75%	2015	103	0	0	1281	1414	74	75	1	13	12	4	9	79	68	nan	nan	nan	nan
max	2015	154	4	1	1460	1464	116	126	1	16	16	15	69	126	111	nan	nan	nan	nan
n_missing	0	0	0	0	0	0	0	0	0	4255	4260	5045	0	0	0	0	0	4255	4260

3 Visualizations

3.1 Distribution of points from each team

fig, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True, sharey=True)
for i, (is_tourney, df) in enumerate(data.groupby('tourney')):
    color = '#1f77b4' if is_tourney == 0 else '#ff7f0e'
    axes[i,0].scatter(df.score1, df.score2, s=1, c=color)
    axes[i,1].hexbin(df.score1, df.score2, bins='log', gridsize=50)
    lims = [20, 125]
    axes[i,0].set_ylabel('Team 2 Score ({})'.format('Regular' if is_tourney == 0 else 'Tourney'))
    axes[i,1].set_xlim(lims)
    axes[i,1].set_ylim(lims)
    for j in range(2):
        axes[i,j].plot(lims, lims, c='r', lw=0.5)
axes[1,0].set_xlabel('Team 1 Score')
axes[1,1].set_xlabel('Team 1 Score')
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.05, hspace=0.05)
plt.suptitle('Bivariate Distribution of Points')
show_fig('scatter_points.png')

This is an initial look at the distribution of points scored by team 1 (the team with the lower ID) and team 2 (the team with the higher ID). As expected, nothing stands out, which suggests that team IDs are probably assigned arbitrarily.
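One way to probe whether ID order is unrelated to outcomes is to compare team 1's share of wins with one half. The sketch below uses a normal approximation to the binomial; `win_share_z` is a hypothetical helper, and in the notebook its inputs would presumably be `data['team1win'].sum()` and `len(data)`. The numbers in the example are toy values, not the 2015 results.

```python
import math

def win_share_z(wins, games):
    """z-score of the observed win count against a 50% null (normal approximation)."""
    expected = 0.5 * games
    std = math.sqrt(games * 0.25)  # binomial standard deviation with p = 0.5
    return (wins - expected) / std

# Toy inputs, not taken from the data
print(win_share_z(50, 100))  # 0.0, exactly half of the games won
print(win_share_z(60, 100))  # 2.0, ten wins above expectation with std 5
```

A z-score far from zero would hint that the team 1 / team 2 ordering is related to outcomes in some way, for example through how the rows were constructed.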

3.2 Proportion of wins by difference in seeds

df_tmp = data.groupby(['tourney', 'seeddiff'])['team1win'].agg(['sum', 'size']).reset_index()
fig, axes = plt.subplots(2, 1, figsize = (7, 10), sharex=True)
for i, (is_tourney, df) in enumerate(df_tmp.groupby('tourney')):
    axes[i].bar(df.seeddiff, df['size'], label='Loss')
    axes[i].bar(df.seeddiff, df['sum'], label='Win')
    axes[i].set_title('Regular' if is_tourney == 0 else 'Tourney')
axes[1].set_xlabel('Difference in Seeds')
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='right')
show_fig('bar_win_by_seeddiff.png')

This figure shows the results of games between seeded teams. Wins, losses, and the difference in seeds are from the perspective of team 1, the team with the lower ID. For example, there were three tournament games in which team 1 was the underdog by 15 seed points, and all three were losses. There were also two tournament upsets in which a team with an 11-seed-point advantage lost.
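The bars in the figure are counts: the full bar is the number of games and the 'Win' bar is drawn over it, so the visible 'Loss' part is the remainder. To read off an actual win proportion per seed difference, the same groupby can be extended as in this sketch. The toy rows and the `wins`, `games`, `losses`, and `win_rate` column names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy games, not the 2015 data
toy = pd.DataFrame({
    "tourney":  [0, 0, 0, 0, 0, 0],
    "seeddiff": [-3, -3, -3, -3, 4, 4],
    "team1win": [1, 1, 1, 0, 0, 1],
})

summary = (toy.groupby(["tourney", "seeddiff"])["team1win"]
              .agg(wins="sum", games="size")
              .reset_index())
summary["losses"] = summary["games"] - summary["wins"]
summary["win_rate"] = summary["wins"] / summary["games"]
print(summary)
```

In the toy data, team 1 wins three of the four games with `seeddiff == -3` and one of the two games with `seeddiff == 4`.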

3.3 Difference in scores by difference in seeds

sns.lmplot(x='seeddiff', y='scorediff', hue='tourney', data=data, aspect=1.5)
show_fig('scatter_scorediff_by_seeddiff.png')

There isn't a huge difference, but the regression line of scorediff on seeddiff is less steep for tournament games. This means that, for a given difference in seeds, games tend to be closer in the tournament than in the regular season. The difference in slopes might be used to quantify the increase in competitiveness in the tournament.
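To put a number on the difference in slopes, one option is to fit a separate least-squares line per group. This is a minimal sketch on synthetic data: `slopes_by_group` is a hypothetical helper, and the slopes of -1.2 and -0.8 are made-up values chosen only so that the two groups differ.

```python
import numpy as np
import pandas as pd

def slopes_by_group(df, x="seeddiff", y="scorediff", group="tourney"):
    """Return the least-squares slope of y on x for each value of group."""
    out = {}
    # Rows without a seed difference (unseeded teams) cannot be used in the fit
    for key, g in df.dropna(subset=[x, y]).groupby(group):
        slope, _intercept = np.polyfit(g[x], g[y], deg=1)
        out[key] = slope
    return out

# Synthetic games, not the 2015 data: slope -1.2 in group 0, -0.8 in group 1
rng = np.random.default_rng(0)
seeddiff = rng.integers(-15, 16, size=400).astype(float)
tourney = np.repeat([0, 1], 200)
true_slope = np.where(tourney == 0, -1.2, -0.8)
scorediff = true_slope * seeddiff + rng.normal(0, 5, size=400)
toy = pd.DataFrame({"seeddiff": seeddiff, "scorediff": scorediff, "tourney": tourney})

slopes = slopes_by_group(toy)
print(slopes)
```

Applied to the real `data`, the ratio or difference of the two slopes would be one candidate measure of how much closer tournament games are.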

3.4 Win vs. difference in seeds

sns.lmplot(x='seeddiff', y='team1win', hue='tourney', data=data,
           scatter_kws={"s": 5}, y_jitter=0.03, logistic=True, aspect=1.5)
show_fig('scatter_win_by_seeddiff.png')

The result is similar to the one above, this time for the logistic regression curve. The difference in seeds has less impact on the probability of winning in the tournament than in the regular season.
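The curves can also be compared through their fitted coefficients. The sketch below fits a one-predictor logistic regression with Newton's method in plain NumPy, so it is a stand-in rather than the routine seaborn calls internally. `fit_logistic` and `simulate` are hypothetical helpers, the data are synthetic, and the sign convention (a negative `seeddiff` means team 1 holds the better seed) is an assumption.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton's method."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept column plus predictor
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = p * (1.0 - p)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta  # array([intercept, slope])

# Synthetic games, not the 2015 data. Assumed convention: a negative seeddiff
# means team 1 holds the better seed, so the true slopes are negative.
rng = np.random.default_rng(0)

def simulate(true_slope, n=2000):
    seeddiff = rng.integers(-15, 16, size=n)
    p_win = 1.0 / (1.0 + np.exp(-true_slope * seeddiff))
    team1win = (rng.random(n) < p_win).astype(int)
    return seeddiff, team1win

_, slope_regular = fit_logistic(*simulate(-0.20))
_, slope_tourney = fit_logistic(*simulate(-0.10))
print(slope_regular, slope_tourney)
```

A slope closer to zero means that a one-point change in seed difference moves the win probability less, which is the pattern described for tournament games.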

3.5 Distribution of winning vs. losing points

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 6))
axes[0].scatter(data['score_l'], data['score_w'], s=1)
axes[1].hexbin(data['score_l'], data['score_w'], bins='log', gridsize=50)
plt.subplots_adjust(left=0.1, bottom=0.2, right=None, top=None, wspace=0.05, hspace=None)
plt.suptitle('Winning vs. Losing Scores')
fig.text(0.5, 0.04, 'Losing Score', ha='center')
fig.text(0.04, 0.5, 'Winning Score', va='center', rotation='vertical')
show_fig('scatter_winscore_by_losescore.png')

When the losing team scores high, games are more competitive in the sense that the score difference is smaller. This is intuitive because the fixed game length and the shot clock put a soft cap on the total points scored in a game.

3.6 Distribution of score difference by losing team's points

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 6))
axes[0].scatter(data['score_l'], data['scorediff'].abs(), s=1)
axes[1].hexbin(data['score_l'], data['scorediff'].abs(), bins='log', gridsize=50)
plt.subplots_adjust(left=0.1, bottom=0.2, right=None, top=None, wspace=0.05, hspace=None)
plt.suptitle('Score Difference vs. Losing Score')
fig.text(0.5, 0.04, 'Losing Team Points', ha='center')
fig.text(0.04, 0.5, 'Score Difference', va='center', rotation='vertical')
show_fig('scatter_scorediff_by_losingpoints.png')

This plot shows the same information as the previous one, with the winning score on the vertical axis replaced by the margin of victory.
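The equivalence follows from the identity |scorediff| = score_w - score_l: each point in this plot is the corresponding point in the previous plot shifted down by its losing score. A small check, using three score pairs copied from the first rows of the table in section 2.3 and the same column definitions as in the setup:

```python
import numpy as np
import pandas as pd

# Three score pairs copied from the first rows of the table in section 2.3
toy = pd.DataFrame({"score1": [74, 82, 78], "score2": [57, 54, 84]})
toy["team1win"] = (toy["score1"] > toy["score2"]).astype(int)
toy["scorediff"] = toy["score1"] - toy["score2"]
toy["score_w"] = np.where(toy["team1win"] == 1, toy["score1"], toy["score2"])
toy["score_l"] = np.where(toy["team1win"] == 0, toy["score1"], toy["score2"])

margin = toy["scorediff"].abs()
print(margin.tolist())  # [17, 28, 6]
print((margin == toy["score_w"] - toy["score_l"]).all())  # True
```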

4 Questions

4.1 How many times does a pair of teams play each other in a season?

num_encounters = data.groupby(['team1', 'team2']).size().value_counts()
print_df(pd.DataFrame({'num_encounters':num_encounters.index,
                       'count': num_encounters,
                       'prop': num_encounters / num_encounters.sum()})
         .set_index('num_encounters'))

num_encounters	count	prop
1	2464	0.642839
2	1150	0.300026
3	219	0.0571354

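Among pairs of teams that met at least once in 2015, about 64% played each other once, about 30% played twice, and about 5.7% played three times. These counts include tournament games, since data contains both regular season and tournament games.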
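As a sanity check on the table, the pair counts weighted by the number of meetings should add up to the 5421 games reported in section 2.3. The snippet below redoes that arithmetic from the three rows of the table; the variable names are illustrative.

```python
# Number of meetings per pair -> number of pairs, copied from the table above
pair_counts = {1: 2464, 2: 1150, 3: 219}

n_pairs = sum(pair_counts.values())
n_games = sum(k * v for k, v in pair_counts.items())
props = {k: v / n_pairs for k, v in pair_counts.items()}

print(n_pairs)  # 3833
print(n_games)  # 5421
print({k: round(p, 4) for k, p in props.items()})
```

The proportions are therefore shares of the 3833 pairs that actually played each other, not of all possible pairs of teams.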

4.2 Does the outcome of a regular season game between team1 and team2 predict the outcome in the tournament?

tourney_matchups = data.loc[data.tourney == 1]
regular_matchups = data.loc[data.tourney == 0]
joined_matchups = pd.merge(regular_matchups, tourney_matchups, on=['team1', 'team2'], suffixes=('_regular', '_tourney'))
print_df(joined_matchups[['team1', 'team2', 'seednum1_tourney', 'seednum2_tourney', 'team1win_tourney', 'team1win_regular']])

	team1	team2	seednum1_tourney	seednum2_tourney	team1win_tourney	team1win_regular
0	1181	1277	1	7	1	1
1	1412	1417	14	11	0	0
2	1181	1458	1	1	1	1
3	1211	1417	2	11	1	1
4	1257	1301	4	8	1	0

In 2015, only five regular season games were played between teams that later met in the tournament. In four of the five cases, the tournament outcome agreed with the regular season outcome. Five games are too few to answer the question with any confidence.
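The agreement rate can be computed directly from the joined table. This sketch rebuilds the five rows shown above by hand rather than calling the merge, and the `agree` name is illustrative. Note that the merge yields one row per regular season game, so a pair that met twice before the tournament would appear twice; that does not happen in these five rows.

```python
import pandas as pd

# The five rows of the joined table, copied by hand
joined = pd.DataFrame({
    "team1": [1181, 1412, 1181, 1211, 1257],
    "team2": [1277, 1417, 1458, 1417, 1301],
    "team1win_tourney": [1, 0, 1, 1, 1],
    "team1win_regular": [1, 0, 1, 1, 0],
})

agree = joined["team1win_tourney"] == joined["team1win_regular"]
print(agree.sum(), len(agree))  # 4 5
print(agree.mean())  # 0.8
```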
