Stats Project - Soccer
by Flor Nightgale
For this project we used secondary data about the Premier League (soccer).
Team Statistics
import pandas as pd
data = pd.read_html('https://www.espn.com/soccer/table/_/league/eng.1')
teams = data[0].join(data[1]) # join the two data tables together
teams
| | 2020-2021 | GP | W | D | L | F | A | GD | P |
---|---|---|---|---|---|---|---|---|---|
0 | 1LEILeicester City | 2 | 2 | 0 | 0 | 7 | 2 | 5 | 6 |
1 | 2EVEEverton | 2 | 2 | 0 | 0 | 6 | 2 | 4 | 6 |
2 | 3ARSArsenal | 2 | 2 | 0 | 0 | 5 | 1 | 4 | 6 |
3 | 4LIVLiverpool | 2 | 2 | 0 | 0 | 6 | 3 | 3 | 6 |
4 | 5CRYCrystal Palace | 2 | 2 | 0 | 0 | 4 | 1 | 3 | 6 |
5 | 6TOTTottenham Hotspur | 2 | 1 | 0 | 1 | 5 | 3 | 2 | 3 |
6 | 7MNCManchester City | 1 | 1 | 0 | 0 | 3 | 1 | 2 | 3 |
7 | 8BHABrighton & Hove Albion | 2 | 1 | 0 | 1 | 4 | 3 | 1 | 3 |
8 | 9AVLAston Villa | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 |
9 | 10LEELeeds United | 2 | 1 | 0 | 1 | 7 | 7 | 0 | 3 |
10 | 11CHEChelsea | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 |
11 | 12WOLWolverhampton Wanderers | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 |
12 | 13NEWNewcastle United | 2 | 1 | 0 | 1 | 2 | 3 | -1 | 3 |
13 | 14BURBurnley | 1 | 0 | 0 | 1 | 2 | 4 | -2 | 0 |
14 | 15MANManchester United | 1 | 0 | 0 | 1 | 1 | 3 | -2 | 0 |
15 | 16WHUWest Ham United | 2 | 0 | 0 | 2 | 1 | 4 | -3 | 0 |
16 | 17SHUSheffield United | 2 | 0 | 0 | 2 | 0 | 3 | -3 | 0 |
17 | 18FULFulham | 2 | 0 | 0 | 2 | 3 | 7 | -4 | 0 |
18 | 19SOUTSouthampton | 2 | 0 | 0 | 2 | 2 | 6 | -4 | 0 |
19 | 20WBAWest Bromwich Albion | 2 | 0 | 0 | 2 | 2 | 8 | -6 | 0 |
Columns in the data set are:
GP: Games Played
W: Wins
D: Draws
L: Losses
F: Goals For
A: Goals Against
GD: Goal Difference
P: Points
Notice that the rankings (the index values) start at zero. The team names also got combined with their ranks and abbreviations, so let’s cut those out and leave just the team names.
In each raw string, the rank and the all-caps abbreviation come first, and the second character of the actual team name is always a lowercase letter. So we’ll find the first lowercase letter in the string, then keep everything from one character before it to the end.
We’ll also rename the columns.
for i, row in teams.iterrows():
    raw_name = row.iloc[0]  # e.g. '1LEILeicester City'
    for position, character in enumerate(raw_name):
        if character.islower():  # we've found the first lowercase letter
            start_here = position - 1  # the team name starts one character earlier
            teams.iloc[i, 0] = raw_name[start_here:]
            break  # stop looking through the team name
teams.columns = ['Team','Games Played','Wins','Draws','Losses','Goals For','Goals Against','Goal Difference','Points']
teams
| | Team | Games Played | Wins | Draws | Losses | Goals For | Goals Against | Goal Difference | Points |
---|---|---|---|---|---|---|---|---|---|
0 | Leicester City | 2 | 2 | 0 | 0 | 7 | 2 | 5 | 6 |
1 | Everton | 2 | 2 | 0 | 0 | 6 | 2 | 4 | 6 |
2 | Arsenal | 2 | 2 | 0 | 0 | 5 | 1 | 4 | 6 |
3 | Liverpool | 2 | 2 | 0 | 0 | 6 | 3 | 3 | 6 |
4 | Crystal Palace | 2 | 2 | 0 | 0 | 4 | 1 | 3 | 6 |
5 | Tottenham Hotspur | 2 | 1 | 0 | 1 | 5 | 3 | 2 | 3 |
6 | Manchester City | 1 | 1 | 0 | 0 | 3 | 1 | 2 | 3 |
7 | Brighton & Hove Albion | 2 | 1 | 0 | 1 | 4 | 3 | 1 | 3 |
8 | Aston Villa | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 |
9 | Leeds United | 2 | 1 | 0 | 1 | 7 | 7 | 0 | 3 |
10 | Chelsea | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 |
11 | Wolverhampton Wanderers | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 |
12 | Newcastle United | 2 | 1 | 0 | 1 | 2 | 3 | -1 | 3 |
13 | Burnley | 1 | 0 | 0 | 1 | 2 | 4 | -2 | 0 |
14 | Manchester United | 1 | 0 | 0 | 1 | 1 | 3 | -2 | 0 |
15 | West Ham United | 2 | 0 | 0 | 2 | 1 | 4 | -3 | 0 |
16 | Sheffield United | 2 | 0 | 0 | 2 | 0 | 3 | -3 | 0 |
17 | Fulham | 2 | 0 | 0 | 2 | 3 | 7 | -4 | 0 |
18 | Southampton | 2 | 0 | 0 | 2 | 2 | 6 | -4 | 0 |
19 | West Bromwich Albion | 2 | 0 | 0 | 2 | 2 | 8 | -6 | 0 |
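The name-cleaning loop above can also be written without iterating over rows. The following is only a sketch, assuming every raw string has the form rank + abbreviation + name and that the name begins with a capital letter followed by a lowercase letter. The three sample strings are copied from the first table; `raw_names` and `clean_names` are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Three raw values copied from the first column of the scraped table
raw_names = pd.Series([
    '1LEILeicester City',
    '8BHABrighton & Hove Albion',
    '19SOUTSouthampton',
])

# Keep everything from the first "capital letter followed by a lowercase
# letter" to the end of the string
clean_names = raw_names.str.extract(r'([A-Z][a-z].*)$', expand=False)
print(clean_names.tolist())
```

Like the loop, this pattern relies on the second character of the name being lowercase, so a name that starts with an all-caps word would lose that word under either approach.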
Statistical Calculations
The describe() method does some statistical calculations for us.
team_stats = teams.describe()
team_stats
| | Games Played | Wins | Draws | Losses | Goals For | Goals Against | Goal Difference | Points |
---|---|---|---|---|---|---|---|---|
count | 20.000000 | 20.000000 | 20.0 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 |
mean | 1.800000 | 0.900000 | 0.0 | 0.900000 | 3.350000 | 3.350000 | 0.000000 | 2.700000 |
std | 0.410391 | 0.788069 | 0.0 | 0.788069 | 2.084403 | 2.158825 | 3.077935 | 2.364207 |
min | 1.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | -6.000000 | 0.000000 |
25% | 2.000000 | 0.000000 | 0.0 | 0.000000 | 2.000000 | 2.000000 | -2.250000 | 0.000000 |
50% | 2.000000 | 1.000000 | 0.0 | 1.000000 | 3.000000 | 3.000000 | 0.000000 | 3.000000 |
75% | 2.000000 | 1.250000 | 0.0 | 1.250000 | 5.000000 | 4.000000 | 2.250000 | 3.750000 |
max | 2.000000 | 2.000000 | 0.0 | 2.000000 | 7.000000 | 8.000000 | 5.000000 | 6.000000 |
We can also find the median values.
teams.median(numeric_only=True)  # skip the non-numeric Team column
Games Played 2.0
Wins 1.0
Draws 0.0
Losses 1.0
Goals For 3.0
Goals Against 3.0
Goal Difference 0.0
Points 3.0
dtype: float64
The Goal Difference column is probably the most interesting, since it has the largest range and highest standard deviation. The top teams scored a lot more than they were scored on, and the bottom teams were scored on a lot more than they scored.
Since we are looking at data for all of the teams, the mean number of wins is equal to the mean number of losses, because every win for one team is a loss for another. The same goes for goals scored and goals scored against.
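As a quick check of that reasoning, the column totals can be compared directly. The lists below are copied by hand from the Wins, Losses, Goals For, and Goals Against columns of the table above; the variable names are only illustrative.

```python
# Values copied from the team table above, in rank order
wins = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
losses = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
goals_for = [7, 6, 5, 6, 4, 5, 3, 4, 1, 7, 3, 3, 2, 2, 1, 1, 0, 3, 2, 2]
goals_against = [2, 2, 1, 3, 1, 3, 1, 3, 0, 7, 3, 3, 3, 4, 3, 4, 3, 7, 6, 8]

# Equal totals over the same 20 teams imply equal means
print(sum(wins), sum(losses))              # 18 18
print(sum(goals_for), sum(goals_against))  # 67 67
print(sum(wins) / len(wins))               # 0.9, matching describe()
```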
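import plotly.express as px  # Plotly Express now ships inside the plotly package
fig = px.bar(team_stats.iloc[3:], y='Goal Difference', title='Goal Difference: Minimum, Quartiles, and Maximum')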
fig.show()
If we want to see which teams scored more than the mean value of “Goals For”, we can use the following code.
gf_mean = teams['Goals For'].mean()
teams[teams['Goals For'] > gf_mean]
| | Team | Games Played | Wins | Draws | Losses | Goals For | Goals Against | Goal Difference | Points |
---|---|---|---|---|---|---|---|---|---|
0 | Leicester City | 2 | 2 | 0 | 0 | 7 | 2 | 5 | 6 |
1 | Everton | 2 | 2 | 0 | 0 | 6 | 2 | 4 | 6 |
2 | Arsenal | 2 | 2 | 0 | 0 | 5 | 1 | 4 | 6 |
3 | Liverpool | 2 | 2 | 0 | 0 | 6 | 3 | 3 | 6 |
4 | Crystal Palace | 2 | 2 | 0 | 0 | 4 | 1 | 3 | 6 |
5 | Tottenham Hotspur | 2 | 1 | 0 | 1 | 5 | 3 | 2 | 3 |
7 | Brighton & Hove Albion | 2 | 1 | 0 | 1 | 4 | 3 | 1 | 3 |
9 | Leeds United | 2 | 1 | 0 | 1 | 7 | 7 | 0 | 3 |
In general, but not always, the top teams scored more than the average number of goals.
Mean is probably the best measure of central tendency here. Filtering on the median would simply split the teams into two halves, and the mode wouldn’t be useful because no single value stands out in the column: 2 and 3 goals each occur four times.
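To compare the three measures on this column, here is a small sketch using Python's standard `statistics` module rather than pandas. The `goals_for` list is copied from the table above, and the other names are illustrative.

```python
from statistics import mean, median, multimode

# Goals For column, copied from the team table in rank order
goals_for = [7, 6, 5, 6, 4, 5, 3, 4, 1, 7, 3, 3, 2, 2, 1, 1, 0, 3, 2, 2]

gf_mean = mean(goals_for)                 # 3.35
gf_median = median(goals_for)             # 3.0
gf_modes = sorted(multimode(goals_for))   # every value tied for most frequent

# How many teams sit strictly above each cutoff?
above_mean = sum(g > gf_mean for g in goals_for)
above_median = sum(g > gf_median for g in goals_for)

print(gf_mean, gf_median, gf_modes)
print(above_mean, above_median)
```

With goals being whole numbers, the mean of 3.35 and the median of 3 select the same eight teams here, so the choice between them matters less for this particular filter than it might on a larger data set.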
Let’s see if the top teams had fewer than the mean number of goals scored against them.
ga_mean = teams['Goals Against'].mean()
teams[teams['Goals Against'] < ga_mean]  # compare against ga_mean, not gf_mean
| | Team | Games Played | Wins | Draws | Losses | Goals For | Goals Against | Goal Difference | Points |
---|---|---|---|---|---|---|---|---|---|
0 | Leicester City | 2 | 2 | 0 | 0 | 7 | 2 | 5 | 6 |
1 | Everton | 2 | 2 | 0 | 0 | 6 | 2 | 4 | 6 |
2 | Arsenal | 2 | 2 | 0 | 0 | 5 | 1 | 4 | 6 |
3 | Liverpool | 2 | 2 | 0 | 0 | 6 | 3 | 3 | 6 |
4 | Crystal Palace | 2 | 2 | 0 | 0 | 4 | 1 | 3 | 6 |
5 | Tottenham Hotspur | 2 | 1 | 0 | 1 | 5 | 3 | 2 | 3 |
6 | Manchester City | 1 | 1 | 0 | 0 | 3 | 1 | 2 | 3 |
7 | Brighton & Hove Albion | 2 | 1 | 0 | 1 | 4 | 3 | 1 | 3 |
8 | Aston Villa | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 |
10 | Chelsea | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 |
11 | Wolverhampton Wanderers | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 |
12 | Newcastle United | 2 | 1 | 0 | 1 | 2 | 3 | -1 | 3 |
14 | Manchester United | 1 | 0 | 0 | 1 | 1 | 3 | -2 | 0 |
16 | Sheffield United | 2 | 0 | 0 | 2 | 0 | 3 | -3 | 0 |
Again, it is generally true that the top teams had fewer goals scored against them.
Teams Visualizations
Let’s create some plots of Wins, Losses, and Draws versus team rank.
columns = ['Wins', 'Losses', 'Draws']
for column in columns:
    fig = px.scatter(teams, x=teams.index, y=column, title=column+' vs Rank', hover_data=['Team'])
    fig.show()
Player Data
We are also going to look at individual player data for scoring and assists. We’ll download both and then look first at the top 10 rows, head(10), of the scorers data table.
stats = pd.read_html('https://www.espn.com/soccer/stats/_/league/ENG.1/view/scoring')
scorers = stats[0]
assists = stats[1]
scorers.head(10)
| | RK | Name | Team | P | G |
---|---|---|---|---|---|
0 | 1.0 | Dominic Calvert-Lewin | Everton | 2 | 4 |
1 | NaN | Son Heung-Min | Tottenham Hotspur | 2 | 4 |
2 | 3.0 | Mohamed Salah | Liverpool | 2 | 3 |
3 | NaN | Wilfried Zaha | Crystal Palace | 2 | 3 |
4 | 5.0 | Hélder Costa | Leeds United | 2 | 2 |
5 | NaN | Aleksandar Mitrovic | Fulham | 2 | 2 |
6 | NaN | Neal Maupay | Brighton & Hove Albion | 2 | 2 |
7 | NaN | Sadio Mané | Liverpool | 2 | 2 |
8 | NaN | Patrick Bamford | Leeds United | 2 | 2 |
9 | NaN | Raúl Jiménez | Wolverhampton Wanderers | 2 | 2 |
Columns:
RK: Ranking
P: Games played
G: Goals scored
A: Assists
There are quite a few missing (NaN) values in the RK column. A missing rank means that the player is tied with the player above them, so we can use ffill(), which means “forward fill”, to replace each missing value with the last value above it. (Older versions of pandas wrote this as fillna(method='ffill').)
scorers = scorers.ffill()
scorers.head(10)
| | RK | Name | Team | P | G |
---|---|---|---|---|---|
0 | 1.0 | Dominic Calvert-Lewin | Everton | 2 | 4 |
1 | 1.0 | Son Heung-Min | Tottenham Hotspur | 2 | 4 |
2 | 3.0 | Mohamed Salah | Liverpool | 2 | 3 |
3 | 3.0 | Wilfried Zaha | Crystal Palace | 2 | 3 |
4 | 5.0 | Hélder Costa | Leeds United | 2 | 2 |
5 | 5.0 | Aleksandar Mitrovic | Fulham | 2 | 2 |
6 | 5.0 | Neal Maupay | Brighton & Hove Albion | 2 | 2 |
7 | 5.0 | Sadio Mané | Liverpool | 2 | 2 |
8 | 5.0 | Patrick Bamford | Leeds United | 2 | 2 |
9 | 5.0 | Raúl Jiménez | Wolverhampton Wanderers | 2 | 2 |
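To see what forward filling does in isolation, here is a minimal sketch on a hand-made rank column. The values mimic the pattern of the RK column above; `rk` and `filled` are illustrative names.

```python
import pandas as pd

# A rank column where tied players have a missing rank
rk = pd.Series([1.0, None, 3.0, None, 5.0, None, None])

# Each missing value is replaced by the most recent non-missing value above it
filled = rk.ffill()
print(filled.tolist())  # [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0]
```

Note that forward filling a whole data frame fills gaps in every column, which is fine here only because RK is the column with missing values in the rows shown.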
assists = assists.ffill()  # forward fill tied ranks (same effect as fillna(method='ffill') in older pandas)
assists.head()
| | RK | Name | Team | P | A |
---|---|---|---|---|---|
0 | 1.0 | Harry Kane | Tottenham Hotspur | 2 | 4 |
1 | 2.0 | Daniel Podence | Wolverhampton Wanderers | 2 | 2 |
2 | 2.0 | Willian | Arsenal | 2 | 2 |
3 | 2.0 | Richarlison | Everton | 2 | 2 |
4 | 5.0 | Tariq Lamptey | Brighton & Hove Albion | 2 | 1 |
Let’s create histograms for these two data sets.
fig1 = px.histogram(scorers, x='G', title='Histogram of Goals Scored by Top Players')
fig1.show()
fig2 = px.histogram(assists, x='A', title='Histogram of Assists by Top Players')
fig2.show()
Both of these histograms show that most players have only a few goals (or assists) and only a handful have many, so the distributions are skewed to the right rather than normally distributed.
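One way to put a number on that asymmetry is the sample skewness, which is positive when the long tail is on the right. The sketch below uses only the ten goal totals displayed in the scorers table above, not the full table, so it illustrates the method rather than giving a result for the whole data set; `goals_top10` is an illustrative name.

```python
import pandas as pd

# G column for the ten scorers displayed above (the full table is longer)
goals_top10 = pd.Series([4, 4, 3, 3, 2, 2, 2, 2, 2, 2])

# How often each goal total occurs, from most to least common
print(goals_top10.value_counts())

# Sample skewness: a value above zero indicates a longer right-hand tail
skewness = goals_top10.skew()
print(skewness > 0)
```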
Research Question
Does having more top-scoring or top-assisting players on a team mean that team has a higher standing?
To answer this question, we will need to group the player data by team and merge the two data tables together. We’ll also drop the columns that we don’t need.
# group the data by team
scorers_team = scorers.groupby('Team').count().drop(columns=['RK', 'Name', 'P'])
assists_team = assists.groupby('Team').count().drop(columns=['RK', 'Name', 'P'])
# merge the players data tables
players = scorers_team.merge(assists_team, on='Team')
# create a column that adds goals and assists
players['Goals and Assists'] = players['G']+players['A']
# sort the values, create an index column, and display the data
players = players.sort_values('Goals and Assists', ascending=False).reset_index()
players
| | Team | G | A | Goals and Assists |
---|---|---|---|---|
0 | Leeds United | 5 | 5 | 10 |
1 | Brighton & Hove Albion | 4 | 5 | 9 |
2 | Arsenal | 4 | 4 | 8 |
3 | Leicester City | 5 | 3 | 8 |
4 | Crystal Palace | 3 | 4 | 7 |
5 | Everton | 3 | 4 | 7 |
6 | Chelsea | 3 | 2 | 5 |
7 | Liverpool | 3 | 2 | 5 |
8 | Manchester City | 3 | 2 | 5 |
9 | Newcastle United | 2 | 3 | 5 |
10 | Southampton | 2 | 3 | 5 |
11 | West Bromwich Albion | 2 | 3 | 5 |
12 | Fulham | 2 | 2 | 4 |
13 | Tottenham Hotspur | 2 | 2 | 4 |
14 | Wolverhampton Wanderers | 2 | 2 | 4 |
15 | Burnley | 2 | 1 | 3 |
16 | West Ham United | 1 | 2 | 3 |
17 | Aston Villa | 1 | 1 | 2 |
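It is worth being explicit about what these columns hold: because the grouping uses count(), G and A are the number of listed players per team, not the number of goals or assists. The sketch below contrasts counting with summing, using only the ten scorer rows displayed earlier; `top10`, `players_listed`, and `goals_total` are illustrative names.

```python
import pandas as pd

# Team and G columns for the ten scorers displayed earlier
top10 = pd.DataFrame({
    'Team': ['Everton', 'Tottenham Hotspur', 'Liverpool', 'Crystal Palace',
             'Leeds United', 'Fulham', 'Brighton & Hove Albion', 'Liverpool',
             'Leeds United', 'Wolverhampton Wanderers'],
    'G': [4, 4, 3, 3, 2, 2, 2, 2, 2, 2],
})

# count() answers "how many listed players does each team have?"
players_listed = top10.groupby('Team')['G'].count()

# sum() answers "how many goals did those players score in total?"
goals_total = top10.groupby('Team')['G'].sum()

print(players_listed['Liverpool'], goals_total['Liverpool'])  # 2 5
```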
Now we need to merge this data table with the teams data table from earlier.
combined_data = teams.merge(players, on='Team', how='left') # left join: keep every row of teams, in order, even if it has no match in players
combined_data
| | Team | Games Played | Wins | Draws | Losses | Goals For | Goals Against | Goal Difference | Points | G | A | Goals and Assists |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Leicester City | 2 | 2 | 0 | 0 | 7 | 2 | 5 | 6 | 5.0 | 3.0 | 8.0 |
1 | Everton | 2 | 2 | 0 | 0 | 6 | 2 | 4 | 6 | 3.0 | 4.0 | 7.0 |
2 | Arsenal | 2 | 2 | 0 | 0 | 5 | 1 | 4 | 6 | 4.0 | 4.0 | 8.0 |
3 | Liverpool | 2 | 2 | 0 | 0 | 6 | 3 | 3 | 6 | 3.0 | 2.0 | 5.0 |
4 | Crystal Palace | 2 | 2 | 0 | 0 | 4 | 1 | 3 | 6 | 3.0 | 4.0 | 7.0 |
5 | Tottenham Hotspur | 2 | 1 | 0 | 1 | 5 | 3 | 2 | 3 | 2.0 | 2.0 | 4.0 |
6 | Manchester City | 1 | 1 | 0 | 0 | 3 | 1 | 2 | 3 | 3.0 | 2.0 | 5.0 |
7 | Brighton & Hove Albion | 2 | 1 | 0 | 1 | 4 | 3 | 1 | 3 | 4.0 | 5.0 | 9.0 |
8 | Aston Villa | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 1.0 | 1.0 | 2.0 |
9 | Leeds United | 2 | 1 | 0 | 1 | 7 | 7 | 0 | 3 | 5.0 | 5.0 | 10.0 |
10 | Chelsea | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 | 3.0 | 2.0 | 5.0 |
11 | Wolverhampton Wanderers | 2 | 1 | 0 | 1 | 3 | 3 | 0 | 3 | 2.0 | 2.0 | 4.0 |
12 | Newcastle United | 2 | 1 | 0 | 1 | 2 | 3 | -1 | 3 | 2.0 | 3.0 | 5.0 |
13 | Burnley | 1 | 0 | 0 | 1 | 2 | 4 | -2 | 0 | 2.0 | 1.0 | 3.0 |
14 | Manchester United | 1 | 0 | 0 | 1 | 1 | 3 | -2 | 0 | NaN | NaN | NaN |
15 | West Ham United | 2 | 0 | 0 | 2 | 1 | 4 | -3 | 0 | 1.0 | 2.0 | 3.0 |
16 | Sheffield United | 2 | 0 | 0 | 2 | 0 | 3 | -3 | 0 | NaN | NaN | NaN |
17 | Fulham | 2 | 0 | 0 | 2 | 3 | 7 | -4 | 0 | 2.0 | 2.0 | 4.0 |
18 | Southampton | 2 | 0 | 0 | 2 | 2 | 6 | -4 | 0 | 2.0 | 3.0 | 5.0 |
19 | West Bromwich Albion | 2 | 0 | 0 | 2 | 2 | 8 | -6 | 0 | 2.0 | 3.0 | 5.0 |
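The NaN values for Manchester United and Sheffield United come from the left join: those teams have no row in the players table. A small sketch of that behaviour, using three teams and values taken from the tables above; the `_small` frames are illustrative and `indicator=True` is added here only to show where each row came from.

```python
import pandas as pd

teams_small = pd.DataFrame({
    'Team': ['Burnley', 'Manchester United', 'West Ham United'],
    'Points': [0, 0, 0],
})
players_small = pd.DataFrame({
    'Team': ['Burnley', 'West Ham United'],
    'Goals and Assists': [3, 3],
})

# A left join keeps every row of teams_small; unmatched rows get NaN
merged = teams_small.merge(players_small, on='Team', how='left', indicator=True)
print(merged)

# One possible treatment: a team with no listed players has a count of zero
merged['Goals and Assists'] = merged['Goals and Assists'].fillna(0)
```

Whether to fill with zero or leave the gaps is a judgment call; zero is defensible here only because the column is a count of listed players.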
To see if there is a relationship between Goals and Assists and team rank, let’s create another scatterplot.
fig = px.scatter(combined_data, y='Goals and Assists', x=combined_data.index, hover_data=['Team'], title='Goals and Assists vs Team Rank')
fig.show()
Conclusion
It looks like higher-ranked teams (lower \(x\) values) tend to have more players on the top scorers and top assists lists, although there is a fair amount of variation in the data.
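The visual impression can be backed up with a correlation coefficient. This is only a sketch: the values are copied from the combined table above, the two teams with no listed players are dropped, and a Pearson correlation is used for simplicity even though rank is an ordinal variable. The names `df` and `r` are illustrative.

```python
import pandas as pd

nan = float('nan')
# Goals and Assists column from the combined table, in rank order (index 0 to 19)
goals_and_assists = [8, 7, 8, 5, 7, 4, 5, 9, 2, 10,
                     5, 4, 5, 3, nan, 3, nan, 4, 5, 5]

df = pd.DataFrame({'Rank': range(20), 'Goals and Assists': goals_and_assists})
df = df.dropna()  # drop the teams with no listed players

# A negative value means the count tends to fall as the rank index rises
r = df['Rank'].corr(df['Goals and Assists'])
print(round(r, 2))
```

A negative coefficient of moderate size would be consistent with the scatterplot: a downward trend with plenty of scatter, based on only one or two games per team.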
Perhaps we could look at a similar analysis using a larger data set from a league such as the National Hockey League where there are more games played by more teams.