Predicting Soccer Results Using Data Mining and Statistical Analysis

Matt Radiuk's SENG 474 Data Mining Project

Introduction

I am a huge sports fan and have always been fascinated by data. With this data mining project, I finally got a chance to combine the two. I was also recently introduced to sports betting, which got me thinking: could I combine the enormous amount of available data with some Python to build a working model and predict the outcome of games? I never really had the tools to do this, but after taking SENG 474 I was finally able to wrap my head around it and give it a go.

Soccer has always been my favourite sport to play and watch, so that is the sport I am going to focus on. In soccer the winner is the team that scores the most goals, and each goal is always worth exactly 1, which makes the Poisson distribution a prime candidate to build my model around. After training on ten years of historical data, the model will be able to estimate the expected number of goals scored and conceded by each team against each opponent, and therefore the probability of each outcome in future games. This can potentially give us an edge against the bookmaker and more opportunities to make profitable bets. The formula for a Poisson distribution is:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

where X is the number of goals scored in 90 minutes, k is a specific goal count (0, 1, 2, ...), and λ is the expected number (average rate) of goals per 90 minutes, which we will be calculating later.
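
To make the formula concrete, here is a minimal sketch that evaluates it directly with Python's standard library. The rate of 1.5 goals per game is an arbitrary illustrative value, not a figure from the dataset, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 1.5  # illustrative rate, assumed for this example only
probs = [poisson_pmf(k, lam) for k in range(6)]
for k, p in enumerate(probs):
    print(f"P({k} goals) = {p:.3f}")
```

The `poisson.pmf` function from `scipy.stats`, which the notebook uses when simulating matches, computes the same quantity.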

Data Collection

For the sake of this project, and because of the amount of data available, I am going to focus on the English Premier League. The dataset was very easy to find: it comes from a GitHub repository called football-csv, where a ton of soccer data has been collected and uploaded for anyone to use. Although data is available for the past 100 years, to keep things relevant I am only going to use data from the past 10 years to train my model, which should be plenty.

First things first, let's import all of the Python necessities into our notebook.

In [6]:
import pandas as pd
import numpy as np
import re
import warnings
import glob
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import poisson

Load the entire dataset into memory as a DataFrame

  1. One .csv file for each season 2010-2019
  2. Read and concatenate the files into one DataFrame
  3. One row corresponds to one game.
  4. Rows are sorted chronologically by the Date column
In [7]:
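# glob returns paths in arbitrary order, so sort them to load the seasons chronologically
files = sorted(glob.glob('eng/2010s/*/eng.1.csv'))
epl = pd.concat([pd.read_csv(f) for f in files])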
epl
Out[7]:
Round Date Team 1 FT HT Team 2
0 ? (Sat) 14 Aug 2010 (W32) Aston Villa FC (1) 3-0 2-0 West Ham United FC (1)
1 ? (Sat) 14 Aug 2010 (W32) Blackburn Rovers FC (1) 1-0 1-0 Everton FC (1)
2 ? (Sat) 14 Aug 2010 (W32) Bolton Wanderers FC (1) 0-0 0-0 Fulham FC (1)
3 ? (Sat) 14 Aug 2010 (W32) Chelsea FC (1) 6-0 2-0 West Bromwich Albion FC (1)
4 ? (Sat) 14 Aug 2010 (W32) Sunderland AFC (1) 2-2 1-0 Birmingham City FC (1)
... ... ... ... ... ... ...
115 ? (Sat) 9 Nov 2019 (W45) Southampton FC (12) 1-2 0-1 Everton FC (12)
116 ? (Sat) 9 Nov 2019 (W45) Tottenham Hotspur FC (12) 1-1 0-0 Sheffield United FC (12)
117 ? (Sun) 10 Nov 2019 (W45) Liverpool FC (12) 3-1 2-0 Manchester City FC (12)
118 ? (Sun) 10 Nov 2019 (W45) Manchester United FC (12) 3-1 2-0 Brighton & Hove Albion FC (12)
119 ? (Sun) 10 Nov 2019 (W45) Wolverhampton Wanderers FC (12) 2-1 1-0 Aston Villa FC (12)

3540 rows × 6 columns

These are the first and last 5 rows of the training dataset, which contains 3540 games in total.

Pre-processing

Since the dataset is rather large, I want to strip out duplicate or unnecessary information. The following are redundant and will be removed:

  1. Round column
  2. The day and week number from the Date column
  3. The trailing "FC" after each team name, as well as the game number
  4. HT column
In [8]:
epl = epl.drop(columns=['Round', 'HT'])
# Remove bracketed annotations such as "(Sat)", "(W32)" and "(1)", plus a directly preceding "FC"
epl.replace(to_replace=r'\s?F?C?\s*\(.*?\)', value="", regex=True, inplace=True)
epl.head()
Out[8]:
Date Team 1 FT Team 2
0 14 Aug 2010 Aston Villa 3-0 West Ham United
1 14 Aug 2010 Blackburn Rovers 1-0 Everton
2 14 Aug 2010 Bolton Wanderers 0-0 Fulham
3 14 Aug 2010 Chelsea 6-0 West Bromwich Albion
4 14 Aug 2010 Sunderland A 2-2 Birmingham City
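
To see what the regular expression in the cell above does to individual values, here is a small standalone check using Python's `re` module on strings taken from the raw table. It is only a sketch: `clean` is a hypothetical helper, and the assumption is that pandas applies the pattern to each string cell in the same way `re.sub` does.

```python
import re

# Same pattern as the notebook cell, written as a raw string.
PATTERN = r'\s?F?C?\s*\(.*?\)'

def clean(value: str) -> str:
    """Strip bracketed annotations and a directly preceding 'FC'."""
    return re.sub(PATTERN, "", value)

samples = [
    "Aston Villa FC (1)",
    "Sunderland AFC (1)",
    "(Sat) 14 Aug 2010 (W32)",
]
for s in samples:
    print(repr(s), "->", repr(clean(s)))
```

The second sample shows why Sunderland appears as "Sunderland A" in the cleaned table: only the "FC" part of "AFC" is removed. The names stay unique, so the model is unaffected, but it is worth remembering when matching team names against another data source.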

That is starting to look a little better. Next, I need to remove the dash and split up the values from the FT column so there are separate columns for the Home and Away scores.

In [9]:
df = epl[["Date", "Team 1", "Team 2"]].copy()  # explicit copy, so adding columns triggers no SettingWithCopyWarning
new = epl["FT"].str.split("-", n=1, expand=True)  # "3-0" -> column 0 (home) and column 1 (away)
df["HomeFT"] = new[0]
df["AwayFT"] = new[1]
df = df[["Date", "Team 1", "HomeFT", "AwayFT", "Team 2"]]
df.columns = ["Date", "Home", "Hscore", "Ascore", "Away"]
df = df.astype({"Hscore": int, "Ascore": int})
df.head()
Out[9]:
Date Home Hscore Ascore Away
0 14 Aug 2010 Aston Villa 3 0 West Ham United
1 14 Aug 2010 Blackburn Rovers 1 0 Everton
2 14 Aug 2010 Bolton Wanderers 0 0 Fulham
3 14 Aug 2010 Chelsea 6 0 West Bromwich Albion
4 14 Aug 2010 Sunderland A 2 2 Birmingham City

Now that the data is formatted and looks just as we wanted it to, we can begin to play with it a little.

Visualization

For my model to be accurate, I am going to have to gather a few important metrics. These include:

1. Average number of goals scored by home team per game
2. Average number of goals scored by away team per game
3. Total number of goals scored per game
In [10]:
dfgraph = df[["Hscore", 'Ascore']]
dfgraph=dfgraph.astype(int)
means = dfgraph.mean()
means
Out[10]:
Hscore    1.55678
Ascore    1.19435
dtype: float64

It's clear that, on average, the home team scores more goals than the away team. This is the so-called "home field advantage", and it is one of the main features that my model is going to be built around. Using this information, I can look at the exact distribution of how many goals the home team scores versus the visiting team.

In [11]:
dfgraph['Total Goals'] = dfgraph.apply(lambda row: row['Hscore'] + row['Ascore'], axis=1)
dfgraph=dfgraph.groupby(dfgraph.columns.tolist()).size().reset_index().rename(columns={0:'Count'})
dfgraph['Rate'] = dfgraph['Count'].transform(lambda x: x / x.sum())
dfgraph1 = dfgraph.groupby('Ascore')['Rate'].sum()
dfgraph2 = dfgraph.groupby('Hscore')['Rate'].sum()
result = pd.concat([dfgraph1, dfgraph2], axis=1, sort=False)
result.columns = ['Ascore', 'Hscore']  # dfgraph1 is grouped by Ascore, dfgraph2 by Hscore
result.fillna(0)
Out[11]:
Ascore Hscore
0 0.334181 0.223446
1 0.329944 0.321751
2 0.200847 0.247740
3 0.094350 0.128531
4 0.028249 0.051412
5 0.009040 0.018079
6 0.002825 0.006497
7 0.000282 0.001412
8 0.000000 0.001130
9 0.000282 0.000000

The leftmost column is the number of goals scored in a game, and the other two columns show the fraction of games in which the home and away teams scored that many goals.

Interestingly enough, history was made while I was working on this project. In this dataset, no away team had scored even 8 goals in a single game until Leicester City scored 9 away at Southampton on October 25th, 2019! That one game represents 0.028 percent of our dataset, and it is the only time in the past 10 years that a team has scored 9 goals in a game.

Moving on to more important things...

In [24]:
result.plot.bar(y=['Hscore', 'Ascore'])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f49eab5b0f0>

We can see pretty easily from this graph that the "home team advantage" is very real. Away teams are about 11 percentage points more likely to fail to score at all (33.4% of games versus 22.3% for home teams), while home teams are more likely to score 2, 3, 4, or 5.

In [13]:
dfgraph3=dfgraph.groupby('Total Goals')['Rate'].sum()
dfgraph3.plot.bar(x='Total Goals',y='Rate',color='red')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f49ebcfa9e8>

This graph removes the distinction between home and away, and shows us the total distribution of the number of goals scored.
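
One quick sanity check on the Poisson assumption is to compare observed frequencies with what a Poisson distribution of the same mean would predict. The sketch below does this for away goals, using the frequencies from the table earlier in this section for the side that averages 1.19435 goals per game. It is an illustrative check rather than a formal goodness-of-fit test.

```python
from math import exp, factorial

# Observed share of games with 0..9 away goals, copied from the table above.
observed = [0.334181, 0.329944, 0.200847, 0.094350, 0.028249,
            0.009040, 0.002825, 0.000282, 0.000000, 0.000282]

# Mean of the observed distribution (should be close to 1.19435).
mean_goals = sum(k * p for k, p in enumerate(observed))

# Poisson probabilities with the same mean.
expected = [mean_goals ** k * exp(-mean_goals) / factorial(k)
            for k in range(len(observed))]

print(f"mean = {mean_goals:.4f}")
for k, (obs, pred) in enumerate(zip(observed, expected)):
    print(f"{k} goals: observed {obs:.3f}, Poisson {pred:.3f}")
```

In this comparison the Poisson curve gives goalless games a probability of roughly 0.30 against an observed 0.33, so the fit is close but not exact.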

Data Mining

Now we get to start the fun stuff. Using the prepared data from the previous steps, we fit our model on the home and away teams' attacking and defending stats so that it can compute a probability for each potential outcome.

In [14]:
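# Stack the home and away views of each match into one long table: one row per team per game
home_rows = df[['Home', 'Away', 'Hscore']].assign(home=1).rename(
    columns={'Home': 'team', 'Away': 'opponent', 'Hscore': 'goals'})
away_rows = df[['Away', 'Home', 'Ascore']].assign(home=0).rename(
    columns={'Away': 'team', 'Home': 'opponent', 'Ascore': 'goals'})
Data = pd.concat([home_rows, away_rows])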
In [15]:
model = smf.glm(formula="goals ~ home + team + opponent", data=Data, family=sm.families.Poisson()).fit()
model.summary()
Out[15]:
Generalized Linear Model Regression Results
Dep. Variable: goals No. Observations: 7080
Model: GLM Df Residuals: 7008
Model Family: Poisson Df Model: 71
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -10232.
Date: Fri, 27 Dec 2019 Deviance: 7920.4
Time: 18:18:32 Pearson chi2: 6.89e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 0.3777 0.092 4.101 0.000 0.197 0.558
team[T.Arsenal] 0.3092 0.079 3.934 0.000 0.155 0.463
team[T.Aston Villa] -0.2843 0.094 -3.036 0.002 -0.468 -0.101
team[T.Birmingham City] -0.3747 0.179 -2.097 0.036 -0.725 -0.024
team[T.Blackburn Rovers] -0.1206 0.124 -0.970 0.332 -0.364 0.123
team[T.Blackpool] 0.0423 0.152 0.279 0.781 -0.255 0.340
team[T.Bolton Wanderers] -0.0808 0.123 -0.659 0.510 -0.321 0.159
team[T.Brighton & Hove Albion] -0.3409 0.129 -2.648 0.008 -0.593 -0.089
team[T.Burnley] -0.2797 0.103 -2.703 0.007 -0.483 -0.077
team[T.Cardiff City] -0.4364 0.141 -3.100 0.002 -0.712 -0.161
team[T.Chelsea] 0.2850 0.079 3.617 0.000 0.131 0.439
team[T.Crystal Palace] -0.1528 0.091 -1.678 0.093 -0.331 0.026
team[T.Everton] 0.0253 0.082 0.309 0.757 -0.135 0.185
team[T.Fulham] -0.1687 0.096 -1.756 0.079 -0.357 0.020
team[T.Huddersfield Town A] -0.7050 0.157 -4.488 0.000 -1.013 -0.397
team[T.Hull City A] -0.3335 0.118 -2.826 0.005 -0.565 -0.102
team[T.Leicester City] 0.1016 0.090 1.135 0.256 -0.074 0.277
team[T.Liverpool] 0.3251 0.078 4.144 0.000 0.171 0.479
team[T.Manchester City] 0.4759 0.077 6.179 0.000 0.325 0.627
team[T.Manchester United] 0.2548 0.079 3.222 0.001 0.100 0.410
team[T.Middlesbrough] -0.6148 0.204 -3.009 0.003 -1.015 -0.214
team[T.Newcastle United] -0.1355 0.086 -1.582 0.114 -0.303 0.032
team[T.Norwich City] -0.2637 0.103 -2.568 0.010 -0.465 -0.062
team[T.Queens Park Rangers] -0.2968 0.116 -2.563 0.010 -0.524 -0.070
team[T.Reading] -0.1861 0.167 -1.112 0.266 -0.514 0.142
team[T.Sheffield United] -0.1905 0.286 -0.666 0.505 -0.751 0.370
team[T.Southampton] -0.0641 0.087 -0.740 0.459 -0.234 0.106
team[T.Stoke City] -0.2386 0.088 -2.711 0.007 -0.411 -0.066
team[T.Sunderland A] -0.2566 0.091 -2.823 0.005 -0.435 -0.078
team[T.Swansea City A] -0.1597 0.089 -1.793 0.073 -0.334 0.015
team[T.Tottenham Hotspur] 0.2338 0.079 2.947 0.003 0.078 0.389
team[T.Watford] -0.1652 0.100 -1.644 0.100 -0.362 0.032
team[T.West Bromwich Albion] -0.1848 0.087 -2.120 0.034 -0.356 -0.014
team[T.West Ham United] -0.0679 0.085 -0.802 0.423 -0.234 0.098
team[T.Wigan Athletic] -0.2066 0.112 -1.845 0.065 -0.426 0.013
team[T.Wolverhampton Wanderers] -0.1539 0.107 -1.440 0.150 -0.363 0.056
opponent[T.Arsenal] -0.3740 0.078 -4.800 0.000 -0.527 -0.221
opponent[T.Aston Villa] -0.0428 0.078 -0.546 0.585 -0.197 0.111
opponent[T.Birmingham City] -0.1265 0.145 -0.873 0.383 -0.411 0.158
opponent[T.Blackburn Rovers] 0.0579 0.105 0.552 0.581 -0.148 0.264
opponent[T.Blackpool] 0.1881 0.129 1.461 0.144 -0.064 0.440
opponent[T.Bolton Wanderers] 0.0302 0.106 0.285 0.776 -0.178 0.238
opponent[T.Brighton & Hove Albion] -0.1487 0.106 -1.403 0.161 -0.356 0.059
opponent[T.Burnley] -0.1885 0.089 -2.123 0.034 -0.362 -0.015
opponent[T.Cardiff City] 0.0953 0.103 0.926 0.354 -0.106 0.297
opponent[T.Chelsea] -0.4997 0.080 -6.246 0.000 -0.657 -0.343
opponent[T.Crystal Palace] -0.2017 0.081 -2.491 0.013 -0.360 -0.043
opponent[T.Everton] -0.3176 0.077 -4.135 0.000 -0.468 -0.167
opponent[T.Fulham] -0.0082 0.082 -0.100 0.920 -0.169 0.153
opponent[T.Huddersfield Town A] 0.0119 0.105 0.113 0.910 -0.194 0.218
opponent[T.Hull City A] -0.0592 0.095 -0.623 0.533 -0.245 0.127
opponent[T.Leicester City] -0.2411 0.085 -2.822 0.005 -0.409 -0.074
opponent[T.Liverpool] -0.4207 0.079 -5.349 0.000 -0.575 -0.267
opponent[T.Manchester City] -0.6143 0.082 -7.454 0.000 -0.776 -0.453
opponent[T.Manchester United] -0.5209 0.080 -6.484 0.000 -0.678 -0.363
opponent[T.Middlesbrough] -0.2208 0.150 -1.473 0.141 -0.515 0.073
opponent[T.Newcastle United] -0.1266 0.076 -1.676 0.094 -0.275 0.021
opponent[T.Norwich City] 0.0021 0.085 0.024 0.981 -0.164 0.168
opponent[T.Queens Park Rangers] 0.0218 0.093 0.234 0.815 -0.161 0.204
opponent[T.Reading] 0.1246 0.132 0.945 0.344 -0.134 0.383
opponent[T.Sheffield United] -0.8717 0.339 -2.573 0.010 -1.536 -0.208
opponent[T.Southampton] -0.2272 0.079 -2.877 0.004 -0.382 -0.072
opponent[T.Stoke City] -0.2108 0.077 -2.727 0.006 -0.362 -0.059
opponent[T.Sunderland A] -0.1309 0.078 -1.674 0.094 -0.284 0.022
opponent[T.Swansea City A] -0.1701 0.079 -2.158 0.031 -0.325 -0.016
opponent[T.Tottenham Hotspur] -0.4187 0.079 -5.333 0.000 -0.573 -0.265
opponent[T.Watford] -0.0681 0.086 -0.792 0.428 -0.237 0.100
opponent[T.West Bromwich Albion] -0.1555 0.077 -2.032 0.042 -0.305 -0.006
opponent[T.West Ham United] -0.1132 0.075 -1.501 0.133 -0.261 0.035
opponent[T.Wigan Athletic] 0.0088 0.094 0.094 0.925 -0.175 0.192
opponent[T.Wolverhampton Wanderers] -0.0324 0.092 -0.354 0.724 -0.212 0.147
home 0.2650 0.020 12.962 0.000 0.225 0.305
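
Because the model uses a log link, the expected number of goals is the exponential of the sum of the relevant coefficients: intercept, the team's attacking term, the opponent's defensive term, and the home term when the team plays at home. The sketch below recomputes this by hand for Tottenham at home to Arsenal, using values copied from the summary above. `expected_goals` is a hypothetical helper, not part of statsmodels, and the dictionaries hold only the handful of coefficients needed here. AFC Bournemouth does not appear among the coefficients, which suggests it is the baseline team with an implicit coefficient of zero.

```python
from math import exp

# Coefficients copied from the regression summary above.
intercept = 0.3777
home_adv = 0.2650
attack = {"Tottenham Hotspur": 0.2338, "Arsenal": 0.3092}      # team[T.*]
defence = {"Tottenham Hotspur": -0.4187, "Arsenal": -0.3740}   # opponent[T.*]

def expected_goals(team: str, opponent: str, at_home: bool) -> float:
    """exp of the linear predictor: intercept + team + opponent (+ home)."""
    linear = intercept + attack[team] + defence[opponent]
    if at_home:
        linear += home_adv
    return exp(linear)

spurs_home = expected_goals("Tottenham Hotspur", "Arsenal", at_home=True)
arsenal_away = expected_goals("Arsenal", "Tottenham Hotspur", at_home=False)
print(f"Spurs (home): {spurs_home:.2f}, Arsenal (away): {arsenal_away:.2f}")
```

Read the same way, the home coefficient of 0.2650 means a team is expected to score about 30% more goals at home than away, since exp(0.2650) is roughly 1.30.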

Now that we have our model for the attacking and defensive abilities of each team for home and away, we can start to simulate some matches...

In [16]:
def matchSim(homeTeam, awayTeam):
    """Return a matrix whose [i, j] entry is P(home scores i, away scores j)."""
    # Expected goals for each side from the fitted model
    hAvg = model.predict(pd.DataFrame(data={'team': homeTeam, 'opponent': awayTeam, 'home': 1}, index=[1])).values[0]
    aAvg = model.predict(pd.DataFrame(data={'team': awayTeam, 'opponent': homeTeam, 'home': 0}, index=[1])).values[0]
    # Poisson probabilities of 0 to 17 goals for each side
    pred = [[poisson.pmf(i, tAvg) for i in range(0, 18)] for tAvg in [hAvg, aAvg]]
    return np.outer(np.array(pred[0]), np.array(pred[1]))
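
The matrix returned by this function treats home and away goals as independent, so each entry is simply the product of two Poisson probabilities. The sketch below builds a smaller version of that matrix from two illustrative goal rates (assumed values, not model output) and shows how the lower triangle, diagonal, and upper triangle correspond to a home win, a draw, and an away win. The `pmf` helper is a hypothetical stand-in for `poisson.pmf`.

```python
from math import exp, factorial
import numpy as np

def pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k goals at rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Illustrative expected goals, assumed for this example only.
home_rate, away_rate = 1.6, 1.2
max_goals = 10

home_probs = np.array([pmf(i, home_rate) for i in range(max_goals + 1)])
away_probs = np.array([pmf(j, away_rate) for j in range(max_goals + 1)])

# matrix[i, j] = P(home scores i) * P(away scores j)
matrix = np.outer(home_probs, away_probs)

home_win = np.sum(np.tril(matrix, -1))  # below the diagonal: i > j
draw = np.sum(np.diag(matrix))          # on the diagonal: i == j
away_win = np.sum(np.triu(matrix, 1))   # above the diagonal: i < j

# Most likely exact scoreline.
i, j = np.unravel_index(np.argmax(matrix), matrix.shape)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
print(f"most likely score: {i}-{j}")
```

The three probabilities sum to just under 1 because the matrix is truncated at a maximum goal count; with a cap of 10 or more the missing mass is negligible.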

Time to test my model with my favourite team against their bitter rivals.

In [17]:
testGame = matchSim("Tottenham Hotspur", 'Arsenal')
print("Probability of Spurs winning at home: " + str(np.sum(np.tril(testGame, -1))))
print("Probability of a draw: " + str(np.sum(np.diag(testGame))))
print("Probability of Arsenal winning away: " + str(np.sum(np.triu(testGame, 1))))
Probability of Spurs winning at home: 0.45554015885622784
Probability of a draw: 0.24088913894330144
Probability of Arsenal winning away: 0.30357070220018745

Tottenham are the deserved favourites but more importantly, my model seems to be functional. Now it is time to test it against the online betting lines and to see if we can make some money.

In [18]:
odds = pd.read_csv('eplodds.csv')

I got this file from a site called Cheap Data Feeds. They scrape the odds from the reputable bookmaker Pinnacle.com and share them for free. The prices should reflect the "true odds" of each event happening and are a good baseline to test our model against. This dataframe needs a small amount of processing before we can compare the two.

In [19]:
dfodds = odds[["homeTeam", "awayTeam", "gameMoneylineHomePriceEU", "gameMoneylineDrawPriceEU", "gameMoneylineAwayPriceEU"]]
dfodds=dfodds.dropna()
In [20]:
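# Implied probability = 1 / decimal odds; the flat 0.01 taken off each outcome is presumably a rough offset for the bookmaker's margin
dfodds['probabilityHome'] = (1/dfodds['gameMoneylineHomePriceEU'])-0.01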
dfodds['probabilityDraw'] = (1/dfodds['gameMoneylineDrawPriceEU'])-0.01
dfodds['probabilityAway'] = (1/dfodds['gameMoneylineAwayPriceEU'])-0.01

oddsComparison = dfodds[["homeTeam", "awayTeam", "probabilityHome", "probabilityDraw", "probabilityAway"]]
oddsComparison
Out[20]:
homeTeam awayTeam probabilityHome probabilityDraw probabilityAway
0 Crystal Palace Bournemouth 0.446621 0.278184 0.280698
1 Burnley Manchester City 0.078889 0.142672 0.783021
2 Leicester City Watford 0.722064 0.182308 0.096157
3 Wolves West Ham United 0.545556 0.254550 0.197469
4 Manchester United Tottenham Hotspur 0.360370 0.269330 0.367358
5 Chelsea Aston Villa 0.757460 0.157504 0.085877
6 Southampton Norwich City 0.555291 0.230385 0.212717
8 Liverpool Everton 0.705308 0.185312 0.110337
9 Sheffield United Newcastle United 0.525045 0.271690 0.200526
10 Arsenal Brighton and Hove Albion 0.611504 0.224742 0.163010
11 Everton Chelsea 0.277356 0.249067 0.470769
12 Watford Crystal Palace 0.394858 0.286736 0.315733
13 Bournemouth Liverpool 0.143374 0.194918 0.662495
14 Tottenham Hotspur Burnley 0.694225 0.195339 0.111359
15 Manchester City Manchester United 0.758049 0.161821 0.080909
16 Newcastle United Southampton 0.349712 0.294878 0.352319
17 Norwich City Sheffield United 0.328983 0.267008 0.401523
18 Aston Villa Leicester City 0.207391 0.236914 0.553380
19 Brighton and Hove Albion Wolves 0.354964 0.301526 0.339650
20 West Ham United Arsenal 0.266243 0.223645 0.504668

This dataframe contains each Premier League matchup in the next two weeks, along with the probabilities for each possible outcome according to Pinnacle's bookmakers.
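
Subtracting a flat 0.01 from each implied probability is a rough way to account for the bookmaker's margin, and the three values in each row do not sum to exactly 1. A common alternative, shown here only as a sketch and not what this notebook does, is to divide each raw implied probability by their sum. The decimal odds below are illustrative values and `implied_probabilities` is a hypothetical helper.

```python
def implied_probabilities(odds_home: float, odds_draw: float, odds_away: float):
    """Convert decimal odds to probabilities that sum to exactly 1."""
    raw = [1 / odds_home, 1 / odds_draw, 1 / odds_away]
    overround = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [p / overround for p in raw], overround

# Illustrative decimal odds, assumed for this example only.
probs, overround = implied_probabilities(2.19, 3.47, 3.44)
print(f"overround: {overround:.4f}")
print("home/draw/away:", [round(p, 4) for p in probs])
```

Proportional scaling removes more of the margin from favourites than from long shots in absolute terms, which is one reason it can give different results from a flat subtraction.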

In [21]:
# Handle a few edge cases where team names differ between the two datasets
oddsComparison = oddsComparison.replace("Bournemouth", "AFC Bournemouth")
oddsComparison = oddsComparison.replace("Wolves", "Wolverhampton Wanderers")
oddsComparison = oddsComparison.replace("Brighton and Hove Albion", "Brighton & Hove Albion")

# Loop through each row (game), compute the probability of each outcome, and add it to the table
for index, row in oddsComparison.iterrows():
    match = matchSim(row.homeTeam, row.awayTeam)
    oddsComparison.at[index, 'predictionHome'] = np.sum(np.tril(match, -1))
    oddsComparison.at[index, 'predictionDraw'] = np.sum(np.diag(match))
    oddsComparison.at[index, 'predictionAway'] = np.sum(np.triu(match, 1))
    
oddsComparison
Out[21]:
homeTeam awayTeam probabilityHome probabilityDraw probabilityAway predictionHome predictionDraw predictionAway
0 Crystal Palace AFC Bournemouth 0.446621 0.278184 0.280698 0.476202 0.244736 0.279062
1 Burnley Manchester City 0.078889 0.142672 0.783021 0.141815 0.209624 0.648561
2 Leicester City Watford 0.722064 0.182308 0.096157 0.605308 0.214103 0.180589
3 Wolverhampton Wanderers West Ham United 0.545556 0.254550 0.197469 0.404315 0.253410 0.342275
4 Manchester United Tottenham Hotspur 0.360370 0.269330 0.367358 0.494316 0.247723 0.257960
5 Chelsea Aston Villa 0.757460 0.157504 0.085877 0.762314 0.154636 0.083050
6 Southampton Norwich City 0.555291 0.230385 0.212717 0.584896 0.230161 0.184943
8 Liverpool Everton 0.705308 0.185312 0.110337 0.591807 0.219169 0.189024
9 Sheffield United Newcastle United 0.525045 0.271690 0.200526 0.580386 0.277075 0.142538
10 Arsenal Brighton & Hove Albion 0.611504 0.224742 0.163010 0.719846 0.176092 0.104062
11 Everton Chelsea 0.277356 0.249067 0.470769 0.315406 0.262128 0.422466
12 Watford Crystal Palace 0.394858 0.286736 0.315733 0.400292 0.269940 0.329768
13 AFC Bournemouth Liverpool 0.143374 0.194918 0.662495 0.233139 0.214543 0.552318
14 Tottenham Hotspur Burnley 0.694225 0.195339 0.111359 0.670989 0.202277 0.126734
15 Manchester City Manchester United 0.758049 0.161821 0.080909 0.561185 0.229067 0.209748
16 Newcastle United Southampton 0.349712 0.294878 0.352319 0.394028 0.267667 0.338306
17 Norwich City Sheffield United 0.328983 0.267008 0.401523 0.187461 0.305808 0.506731
18 Aston Villa Leicester City 0.207391 0.236914 0.553380 0.276359 0.253021 0.470620
19 Brighton & Hove Albion Wolverhampton Wanderers 0.354964 0.301526 0.339650 0.418474 0.275161 0.306365
20 West Ham United Arsenal 0.266243 0.223645 0.504668 0.263234 0.233234 0.503531

Now that we have the bookmaker's numbers and our model's predicted numbers, it's time to see just how different they are and what we can learn from them.

In [22]:
difference = oddsComparison.copy()  # copy so that oddsComparison itself is not modified
difference['differenceHome'] = difference['predictionHome'] - difference['probabilityHome']
difference['differenceDraw'] = difference['predictionDraw'] - difference['probabilityDraw']
difference['differenceAway'] = difference['predictionAway'] - difference['probabilityAway']

difference
Out[22]:
homeTeam awayTeam probabilityHome probabilityDraw probabilityAway predictionHome predictionDraw predictionAway differenceHome differenceDraw differenceAway
0 Crystal Palace AFC Bournemouth 0.446621 0.278184 0.280698 0.476202 0.244736 0.279062 0.029581 -0.033448 -0.001636
1 Burnley Manchester City 0.078889 0.142672 0.783021 0.141815 0.209624 0.648561 0.062926 0.066952 -0.134460
2 Leicester City Watford 0.722064 0.182308 0.096157 0.605308 0.214103 0.180589 -0.116756 0.031795 0.084432
3 Wolverhampton Wanderers West Ham United 0.545556 0.254550 0.197469 0.404315 0.253410 0.342275 -0.141240 -0.001140 0.144806
4 Manchester United Tottenham Hotspur 0.360370 0.269330 0.367358 0.494316 0.247723 0.257960 0.133946 -0.021606 -0.109398
5 Chelsea Aston Villa 0.757460 0.157504 0.085877 0.762314 0.154636 0.083050 0.004854 -0.002868 -0.002827
6 Southampton Norwich City 0.555291 0.230385 0.212717 0.584896 0.230161 0.184943 0.029605 -0.000224 -0.027774
8 Liverpool Everton 0.705308 0.185312 0.110337 0.591807 0.219169 0.189024 -0.113501 0.033856 0.078687
9 Sheffield United Newcastle United 0.525045 0.271690 0.200526 0.580386 0.277075 0.142538 0.055341 0.005385 -0.057988
10 Arsenal Brighton & Hove Albion 0.611504 0.224742 0.163010 0.719846 0.176092 0.104062 0.108342 -0.048650 -0.058949
11 Everton Chelsea 0.277356 0.249067 0.470769 0.315406 0.262128 0.422466 0.038050 0.013061 -0.048304
12 Watford Crystal Palace 0.394858 0.286736 0.315733 0.400292 0.269940 0.329768 0.005434 -0.016796 0.014035
13 AFC Bournemouth Liverpool 0.143374 0.194918 0.662495 0.233139 0.214543 0.552318 0.089765 0.019625 -0.110177
14 Tottenham Hotspur Burnley 0.694225 0.195339 0.111359 0.670989 0.202277 0.126734 -0.023236 0.006938 0.015375
15 Manchester City Manchester United 0.758049 0.161821 0.080909 0.561185 0.229067 0.209748 -0.196865 0.067246 0.128839
16 Newcastle United Southampton 0.349712 0.294878 0.352319 0.394028 0.267667 0.338306 0.044315 -0.027211 -0.014013
17 Norwich City Sheffield United 0.328983 0.267008 0.401523 0.187461 0.305808 0.506731 -0.141522 0.038800 0.105208
18 Aston Villa Leicester City 0.207391 0.236914 0.553380 0.276359 0.253021 0.470620 0.068968 0.016107 -0.082760
19 Brighton & Hove Albion Wolverhampton Wanderers 0.354964 0.301526 0.339650 0.418474 0.275161 0.306365 0.063511 -0.026366 -0.033286
20 West Ham United Arsenal 0.266243 0.223645 0.504668 0.263234 0.233234 0.503531 -0.003009 0.009590 -0.001137

This table combines everything that I've been working towards, but it's not exactly what I would call 'readable'.

In [23]:
dfinal = difference[['homeTeam', 'awayTeam', 'differenceHome', 'differenceDraw', 'differenceAway']]
dfinal.style.background_gradient(cmap="RdBu")
Out[23]:
homeTeam awayTeam differenceHome differenceDraw differenceAway
0 Crystal Palace AFC Bournemouth 0.0295806 -0.0334482 -0.00163555
1 Burnley Manchester City 0.062926 0.0669524 -0.13446
2 Leicester City Watford -0.116756 0.031795 0.0844319
3 Wolverhampton Wanderers West Ham United -0.14124 -0.00114008 0.144806
4 Manchester United Tottenham Hotspur 0.133946 -0.0216061 -0.109398
5 Chelsea Aston Villa 0.00485387 -0.00286784 -0.0028272
6 Southampton Norwich City 0.0296049 -0.000223657 -0.0277741
8 Liverpool Everton -0.113501 0.0338565 0.078687
9 Sheffield United Newcastle United 0.0553408 0.00538518 -0.0579879
10 Arsenal Brighton & Hove Albion 0.108342 -0.0486499 -0.0589486
11 Everton Chelsea 0.0380498 0.0130609 -0.0483035
12 Watford Crystal Palace 0.00543388 -0.0167964 0.0140354
13 AFC Bournemouth Liverpool 0.0897648 0.0196254 -0.110177
14 Tottenham Hotspur Burnley -0.023236 0.00693792 0.0153747
15 Manchester City Manchester United -0.196865 0.0672459 0.128839
16 Newcastle United Southampton 0.0443154 -0.0272114 -0.014013
17 Norwich City Sheffield United -0.141522 0.0387996 0.105208
18 Aston Villa Leicester City 0.0689679 0.0161069 -0.08276
19 Brighton & Hove Albion Wolverhampton Wanderers 0.0635109 -0.0263657 -0.0332856
20 West Ham United Arsenal -0.00300906 0.00958961 -0.00113655

And there we have it: the final table with the next two weeks' worth of fixtures. Dark blue cells represent a bet with good value (our calculated probability is higher than the bookmaker's) and dark red cells represent bad value. Lighter coloured cells indicate situations where our model and the bookmaker are roughly in agreement.
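
A positive difference only says that the model rates an outcome as more likely than the bookmaker does. One way to express "value" in betting terms is the expected profit per unit staked, which is the model probability times the decimal odds, minus 1. The sketch below is a hedged illustration with assumed numbers. `expected_value` and `is_value_bet` are hypothetical helpers, and the result is only meaningful to the extent that the model's probability is trusted.

```python
def expected_value(model_prob: float, decimal_odds: float) -> float:
    """Expected profit per unit staked, if model_prob were the true probability."""
    return model_prob * decimal_odds - 1

def is_value_bet(model_prob: float, decimal_odds: float) -> bool:
    """A bet has positive expected value when model_prob > 1 / decimal_odds."""
    return expected_value(model_prob, decimal_odds) > 0

# Illustrative numbers, assumed for this example only.
model_prob = 0.49
decimal_odds = 2.70
print(f"EV per unit staked: {expected_value(model_prob, decimal_odds):+.3f}")
print("value bet" if is_value_bet(model_prob, decimal_odds) else "no value")
```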

Conclusion

While the code works and the probabilities it produces are somewhat realistic, my model is still very primitive and not ready for prime time quite yet. It only takes the teams and their goals into consideration, while a real-life match is affected by much more than that. Player fitness, injuries, the weather, and the state of the playing surface are only a few of the things that often play a big part in the outcome of a football match, and they are also much tougher to find concrete data for. Despite its inherent flaws, this simplistic Poisson model makes for a good starting point and a nice introduction to data mining and statistical modelling.