I am a huge sports fan and I have always been fascinated by data. With this data mining project, I finally got a chance to combine the two. I was also recently introduced to sports betting, and it got me thinking: could I merge the enormous amount of available data with some Python to build a working model and predict the outcome of games? I never really had the tools to do this, but after taking SENG 474 I was finally able to wrap my head around it and give it a go.
Soccer has always been my favourite sport to play and watch, so that is the sport I am going to focus on. In soccer the winner is the team that scores the most goals, and each goal is always worth exactly 1, which makes the Poisson distribution a prime candidate to build my model around. After training on ten years of historical data, the model will be able to estimate the number of goals each team is likely to score and concede against any opponent, and therefore the probability of each outcome for future games. This can potentially give us an edge over the bookmaker and surface more opportunities to make profitable bets. The formula for a Poisson distribution is:
P(X = k) = (λ^k * e^(-λ)) / k!
Where X represents the number of goals per 90 minutes, k is a particular goal count, and λ is the expected number (average rate) of goals per 90 minutes, which we will calculate later.
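As a quick sanity check, the formula can be evaluated directly and compared against scipy's poisson.pmf; the rate λ = 1.5 here is just an illustrative number, not one derived from the dataset:

```python
import math
from scipy.stats import poisson

lam = 1.5  # illustrative goals-per-90 rate, not from the dataset

for k in range(4):
    manual = (lam ** k) * math.exp(-lam) / math.factorial(k)
    assert abs(manual - poisson.pmf(k, lam)) < 1e-12  # hand formula matches scipy
    print(f"P(X = {k}) = {manual:.4f}")
```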
For the sake of this project, and given the resources available, I am going to focus specifically on building this project around the English Premier League. The dataset I am using was very easy to find: it comes from a GitHub repository called football-csv, where a ton of soccer data has been collected and uploaded for anyone to use. Although data is available for the past 100 years, in an attempt to keep things relevant I am only going to use data from the past 10 years to train my model, which should be plenty.
First things first, let's import all of the Python necessities into our notebook.
import pandas as pd
import numpy as np
import re
import warnings
import glob
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import poisson
epl = pd.concat([pd.read_csv(f) for f in glob.glob('eng/2010s/*/eng.1.csv')])
epl
These are the top and bottom 5 rows of the training dataset, 3540 games in total.
Since the dataset is rather large, I need to remove some duplicate or unnecessary information. The following columns are redundant and will be removed.
epl = epl.drop(columns=['Round', 'HT'])
epl.replace(to_replace=r'\s?F?C?\s*\(.*?\)', value="", regex=True, inplace=True)
epl.head()
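To see what that replace call actually strips, here is the same regex applied to a few made-up strings in the raw data's format (the club names and the codes in parentheses are just examples, not rows from the dataset):

```python
import re

pattern = r'\s?F?C?\s*\(.*?\)'  # the same pattern used on the dataframe above

samples = ["Arsenal FC (ARS)", "Chelsea FC (CHE)", "Everton (EVE)"]
cleaned = [re.sub(pattern, "", s) for s in samples]
print(cleaned)  # → ['Arsenal', 'Chelsea', 'Everton']
```

The optional `F?C?` lets the pattern swallow a trailing "FC" along with the parenthesised suffix in one pass.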
That is starting to look a little better. Next, I need to remove the dash and split up the values from the FT column so there are separate columns for the Home and Away scores.
df = epl[["Date", "Team 1", "Team 2"]].copy()  # copy so we can add columns without a SettingWithCopyWarning
new = epl["FT"].str.split("-", n=1, expand=True)
df["HomeFT"] = new[0]
df["AwayFT"] = new[1]
df = df[["Date", "Team 1", "HomeFT", "AwayFT", "Team 2"]]
df.columns = ['Date', 'Home', 'Hscore', 'Ascore', "Away"]
df = df.astype({"Hscore": int, 'Ascore': int})
df.head()
Now that the data is formatted and looks just as we wanted it to, we can begin to play with it a little.
For my model to be accurate, I am going to have to gather a couple important metrics. These include:
1. Average number of goals scored by home team per game
2. Average number of goals scored by away team per game
3. Total number of goals scored per game
dfgraph = df[["Hscore", "Ascore"]].copy()  # already ints after the earlier astype
means = dfgraph.mean()
means
It's pretty clear that, on average, the home team scores more goals than the away team. This is the so-called "home field advantage", and it is one of the main features my model is built around. Using this information, I can see the exact distribution of how many goals the home team scores versus the visiting team.
dfgraph['Total Goals'] = dfgraph['Hscore'] + dfgraph['Ascore']
dfgraph = dfgraph.groupby(dfgraph.columns.tolist()).size().reset_index(name='Count')
dfgraph['Rate'] = dfgraph['Count'] / dfgraph['Count'].sum()
dfgraph1 = dfgraph.groupby('Ascore')['Rate'].sum()
dfgraph2 = dfgraph.groupby('Hscore')['Rate'].sum()
result = pd.concat([dfgraph2, dfgraph1], axis=1, sort=False)  # home rates first, so the labels below line up
result.columns = ['Hscore', 'Ascore']
result = result.fillna(0)
result
The leftmost column here represents the number of goals scored in a game, and the other two show how often the home and away teams scored exactly that many goals, as a fraction of all games.
Interestingly enough, history was made while I was working on this project. No away team had ever scored more than 8 goals in a single game until Leicester City scored 9 against Southampton on October 25th, 2019! We can see that this game represents 0.028 percent of our dataset, and it was the only time a team scored 9 goals in a game over the past 10 years.
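As an aside, the count-then-normalise step that produced the Rate column can be seen in isolation on a toy frame (the four scorelines below are invented):

```python
import pandas as pd

# four invented scorelines; 1-0 occurs twice
toy = pd.DataFrame({"Hscore": [1, 1, 2, 0], "Ascore": [0, 0, 1, 0]})

# count each distinct scoreline, then divide by the total number of games
counts = toy.groupby(["Hscore", "Ascore"]).size().reset_index(name="Count")
counts["Rate"] = counts["Count"] / counts["Count"].sum()
print(counts)
# 1-0 appears in two of the four games, so its Rate is 0.5; the others get 0.25
```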
Moving on to more important things...
result.plot.bar(y=['Hscore', 'Ascore'])
We can see pretty easily from this graph that the "home team advantage" is very real. Away teams are over 10% more likely to fail to score a single goal, while home teams are more likely to score 2, 3, 4, or 5.
dfgraph3 = dfgraph.groupby('Total Goals')['Rate'].sum()
dfgraph3.plot.bar(color='red')
This graph removes the distinction between home and away, and shows us the total distribution of the number of goals scored.
Now we get to start the fun stuff. Using the prepared data from the previous steps, we fit our model with the home and away teams' attacking and defending stats to compute a probability for each potential outcome.
Data = pd.concat([
    df[['Home', 'Away', 'Hscore']].assign(home=1).rename(
        columns={'Home': 'team', 'Away': 'opponent', 'Hscore': 'goals'}),
    df[['Away', 'Home', 'Ascore']].assign(home=0).rename(
        columns={'Away': 'team', 'Home': 'opponent', 'Ascore': 'goals'}),
])
model = smf.glm(formula="goals ~ home + team + opponent", data=Data, family=sm.families.Poisson()).fit()
model.summary()
Now that we have our model for the attacking and defensive abilities of each team for home and away, we can start to simulate some matches...
def matchSim(homeTeam, awayTeam):
    # expected goals (lambda) for each side, from the fitted Poisson regression
    hAvg = model.predict(pd.DataFrame(data={'team': homeTeam, 'opponent': awayTeam, 'home': 1}, index=[1])).values[0]
    aAvg = model.predict(pd.DataFrame(data={'team': awayTeam, 'opponent': homeTeam, 'home': 0}, index=[1])).values[0]
    # P(scoreline i-j) = P(home scores i) * P(away scores j), up to 17 goals each
    pred = [[poisson.pmf(i, tAvg) for i in range(0, 18)] for tAvg in [hAvg, aAvg]]
    return np.outer(np.array(pred[0]), np.array(pred[1]))
Time to test my model: my favourite team against their bitter rivals.
testGame = matchSim("Tottenham Hotspur", 'Arsenal')
print("Probability of Spurs winning at home: " + str(np.sum(np.tril(testGame, -1))))
print("Probability of a draw: " + str(np.sum(np.diag(testGame))))
print("Probability of Arsenal winning away: " + str(np.sum(np.triu(testGame, 1))))
Tottenham are the deserved favourites, but more importantly, my model seems to be functional. Now it is time to test it against the online betting lines and see if we can make some money.
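The triangle sums deserve a word: entry (i, j) of the score matrix is P(home scores i and away scores j), so everything below the diagonal is a home win, the diagonal is a draw, and everything above is an away win. A tiny standalone check with made-up λ values (not real model output):

```python
import numpy as np
from scipy.stats import poisson

h_avg, a_avg = 1.6, 1.1  # invented expected-goal rates
max_goals = 18           # same cutoff as matchSim

home_probs = [poisson.pmf(i, h_avg) for i in range(max_goals)]
away_probs = [poisson.pmf(i, a_avg) for i in range(max_goals)]
matrix = np.outer(home_probs, away_probs)  # matrix[i, j] = P(home i, away j)

p_home = np.sum(np.tril(matrix, -1))  # below the diagonal: home goals > away goals
p_draw = np.sum(np.diag(matrix))      # on the diagonal: equal scores
p_away = np.sum(np.triu(matrix, 1))   # above the diagonal: away goals > home goals

# the three outcomes partition essentially all the probability mass
print(round(p_home + p_draw + p_away, 4))  # ≈ 1.0 (mass beyond 17 goals is negligible)
```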
odds = pd.read_csv('eplodds.csv')
I got this file from a site called Cheap Data Feeds. They scrape the odds from the reputable bookmaker Pinnacle.com and share them for free. The prices should reflect the "true odds" of each event fairly closely and are a good baseline to test our model against. This dataframe needs a small amount of processing before we can compare the two.
dfodds = odds[["homeTeam", "awayTeam", "gameMoneylineHomePriceEU", "gameMoneylineDrawPriceEU", "gameMoneylineAwayPriceEU"]]
dfodds = dfodds.dropna()
# convert decimal odds to implied probabilities, subtracting ~1% as a rough
# adjustment for the bookmaker's margin
dfodds['probabilityHome'] = (1 / dfodds['gameMoneylineHomePriceEU']) - 0.01
dfodds['probabilityDraw'] = (1 / dfodds['gameMoneylineDrawPriceEU']) - 0.01
dfodds['probabilityAway'] = (1 / dfodds['gameMoneylineAwayPriceEU']) - 0.01
oddsComparison = dfodds[["homeTeam", "awayTeam", "probabilityHome", "probabilityDraw", "probabilityAway"]]
oddsComparison
This dataframe contains each Premier League matchup in the next two weeks, along with the probabilities for each possible outcome according to Pinnacle's bookmakers.
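The flat 0.01 subtraction above is one rough way to strip the bookmaker's margin (the "vig"). A small standalone sketch with invented odds shows the basic conversion, along with the common alternative of normalising the raw implied probabilities so they sum to 1:

```python
# invented decimal (EU) odds for a single match: home / draw / away
odds_eu = [2.10, 3.40, 3.75]

raw = [1 / o for o in odds_eu]        # implied probabilities, vig still included
overround = sum(raw)                  # > 1; the excess is the bookmaker's margin
normalised = [p / overround for p in raw]

print(round(overround, 3))                # a few percent above 1
print([round(p, 3) for p in normalised])  # sums to exactly 1 after normalising
```

Normalising spreads the margin proportionally across the three outcomes instead of assuming a fixed 1% on each.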
# Fix some edge cases where the odds feed's team names differ from the dataset's
oddsComparison = oddsComparison.replace("Bournemouth", "AFC Bournemouth")
oddsComparison = oddsComparison.replace("Wolves", "Wolverhampton Wanderers")
oddsComparison = oddsComparison.replace("Brighton and Hove Albion", "Brighton & Hove Albion")
# Loop through each row (game), compute the probability of each outcome, and add it to the table
for index, row in oddsComparison.iterrows():
    match = matchSim(row.homeTeam, row.awayTeam)
    oddsComparison.at[index, 'predictionHome'] = np.sum(np.tril(match, -1))
    oddsComparison.at[index, 'predictionDraw'] = np.sum(np.diag(match))
    oddsComparison.at[index, 'predictionAway'] = np.sum(np.triu(match, 1))
oddsComparison
Now that we have the bookmaker's numbers and our model's predictions, it's time to see just how different they are and what we can learn from them.
difference = oddsComparison.copy()  # copy so the comparison table isn't modified in place
difference['differenceHome'] = difference['predictionHome'] - difference['probabilityHome']
difference['differenceDraw'] = difference['predictionDraw'] - difference['probabilityDraw']
difference['differenceAway'] = difference['predictionAway'] - difference['probabilityAway']
difference
This table combines everything that I've been working towards, but it's not exactly what I would call 'readable'.
dfinal = difference[['homeTeam', 'awayTeam', 'differenceHome', 'differenceDraw', 'differenceAway']]
dfinal.style.background_gradient(cmap="RdBu")
And there we have it, the final table with the next two weeks worth of fixtures. Dark blue cells represent a bet with good value (our calculated probability is higher than the bookies) and dark red represents bad value. Lighter coloured cells indicate situations where our model and the online exchange are in agreement.
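To turn a "good value" cell into an actual betting decision, the usual yardstick is expected value: if the model's probability is p and the bookmaker pays decimal odds o, a one-unit stake returns p·o − 1 on average, and a positive number means a profitable bet in the long run. A quick sketch with invented numbers:

```python
def expected_value(p_model: float, decimal_odds: float) -> float:
    """Expected profit per unit staked, given our model's probability."""
    return p_model * decimal_odds - 1

# hypothetical: our model says 45% home win, the bookmaker offers decimal odds of 2.50
ev = expected_value(0.45, 2.50)
print(round(ev, 3))  # → 0.125, i.e. +12.5% per unit staked on average
```

This is the sense in which a dark blue cell is attractive: our probability times the payout exceeds 1, while a dark red cell implies a negative expectation.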
While the code works and the probabilities it spits out are somewhat realistic, my model is still very primitive and not ready for prime time quite yet. It only takes the teams and goals into consideration, while a real-life match is affected by much, much more than that. Player fitness, injuries, the weather, and the state of the playing surface are only a few of the things that often play a big factor in the outcome of a football match, and they are also much tougher to find concrete data for. Despite its inherent flaws, this simplistic Poisson model makes for a good starting point and a nice introduction to data mining and statistical modelling.