Stats Project - Soccer

by Flor Nightgale

For this project we used secondary data about the English Premier League (soccer).

Team Statistics

import pandas as pd
data = pd.read_html('https://www.espn.com/soccer/table/_/league/eng.1')
teams = data[0].join(data[1]) # join the two data tables together
teams
2020-2021 GP W D L F A GD P
0 1LEILeicester City 2 2 0 0 7 2 5 6
1 2EVEEverton 2 2 0 0 6 2 4 6
2 3ARSArsenal 2 2 0 0 5 1 4 6
3 4LIVLiverpool 2 2 0 0 6 3 3 6
4 5CRYCrystal Palace 2 2 0 0 4 1 3 6
5 6TOTTottenham Hotspur 2 1 0 1 5 3 2 3
6 7MNCManchester City 1 1 0 0 3 1 2 3
7 8BHABrighton & Hove Albion 2 1 0 1 4 3 1 3
8 9AVLAston Villa 1 1 0 0 1 0 1 3
9 10LEELeeds United 2 1 0 1 7 7 0 3
10 11CHEChelsea 2 1 0 1 3 3 0 3
11 12WOLWolverhampton Wanderers 2 1 0 1 3 3 0 3
12 13NEWNewcastle United 2 1 0 1 2 3 -1 3
13 14BURBurnley 1 0 0 1 2 4 -2 0
14 15MANManchester United 1 0 0 1 1 3 -2 0
15 16WHUWest Ham United 2 0 0 2 1 4 -3 0
16 17SHUSheffield United 2 0 0 2 0 3 -3 0
17 18FULFulham 2 0 0 2 3 7 -4 0
18 19SOUTSouthampton 2 0 0 2 2 6 -4 0
19 20WBAWest Bromwich Albion 2 0 0 2 2 8 -6 0

Columns in the data set are:

  • GP: Games Played

  • W: Wins

  • D: Draws

  • L: Losses

  • F: Goals For

  • A: Goals Against

  • GD: Goal Difference

  • P: Points

Notice that the rankings (index values) start at zero. Also, each team name has been combined with its rank and abbreviation, so let’s cut those out and leave just the team name.

The rank and abbreviation contain only digits and uppercase letters, and the second character of every team name is a lowercase letter. So we’ll find the first lowercase letter, then keep the characters from one before it to the end of the string.

We’ll also rename the columns.

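for i, row in teams.iterrows():
    raw_name = row.iloc[0] # for example '1LEILeicester City'
    team_name = raw_name # keep the raw text if no lowercase letter is found
    for position, character in enumerate(raw_name):
        if character.islower(): # we've found the first lowercase letter
            team_name = raw_name[position-1:] # start one character before it
            break # stop looking through the team name
    teams.iloc[i,0] = team_name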
teams.columns = ['Team','Games Played','Wins','Draws','Losses','Goals For','Goals Against','Goal Difference','Points']
teams
Team Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points
0 Leicester City 2 2 0 0 7 2 5 6
1 Everton 2 2 0 0 6 2 4 6
2 Arsenal 2 2 0 0 5 1 4 6
3 Liverpool 2 2 0 0 6 3 3 6
4 Crystal Palace 2 2 0 0 4 1 3 6
5 Tottenham Hotspur 2 1 0 1 5 3 2 3
6 Manchester City 1 1 0 0 3 1 2 3
7 Brighton & Hove Albion 2 1 0 1 4 3 1 3
8 Aston Villa 1 1 0 0 1 0 1 3
9 Leeds United 2 1 0 1 7 7 0 3
10 Chelsea 2 1 0 1 3 3 0 3
11 Wolverhampton Wanderers 2 1 0 1 3 3 0 3
12 Newcastle United 2 1 0 1 2 3 -1 3
13 Burnley 1 0 0 1 2 4 -2 0
14 Manchester United 1 0 0 1 1 3 -2 0
15 West Ham United 2 0 0 2 1 4 -3 0
16 Sheffield United 2 0 0 2 0 3 -3 0
17 Fulham 2 0 0 2 3 7 -4 0
18 Southampton 2 0 0 2 2 6 -4 0
19 West Bromwich Albion 2 0 0 2 2 8 -6 0
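The same clean-up can also be expressed as a single regular expression. The sketch below is an alternative to the loop above, not the method used in this project; the helper name clean_team_name is hypothetical, and the pattern assumes every raw value is digits, then an uppercase abbreviation, then a name that starts with an uppercase letter followed by a lowercase one.

```python
import re

def clean_team_name(raw_name):
    """Strip a leading rank and uppercase abbreviation from a raw table value."""
    # ^\d+    : the rank digits at the start of the string
    # [A-Z]+  : the uppercase abbreviation
    # (?=...) : stop just before an uppercase letter followed by a lowercase one
    return re.sub(r'^\d+[A-Z]+(?=[A-Z][a-z])', '', raw_name)

# Three raw values copied from the first table above
raw_names = ['1LEILeicester City', '19SOUTSouthampton', '8BHABrighton & Hove Albion']
clean_names = [clean_team_name(name) for name in raw_names]
print(clean_names)
```

With pandas, the same pattern could be passed to the Series.str.replace method with regex=True to clean the whole column at once.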

Statistical Calculations

The describe() method does some statistical calculations for us.

team_stats = teams.describe()
team_stats
Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points
count 20.000000 20.000000 20.0 20.000000 20.000000 20.000000 20.000000 20.000000
mean 1.800000 0.900000 0.0 0.900000 3.350000 3.350000 0.000000 2.700000
std 0.410391 0.788069 0.0 0.788069 2.084403 2.158825 3.077935 2.364207
min 1.000000 0.000000 0.0 0.000000 0.000000 0.000000 -6.000000 0.000000
25% 2.000000 0.000000 0.0 0.000000 2.000000 2.000000 -2.250000 0.000000
50% 2.000000 1.000000 0.0 1.000000 3.000000 3.000000 0.000000 3.000000
75% 2.000000 1.250000 0.0 1.250000 5.000000 4.000000 2.250000 3.750000
max 2.000000 2.000000 0.0 2.000000 7.000000 8.000000 5.000000 6.000000

We can also find the median values.

teams.median(numeric_only=True) # skip the Team column, which contains text
Games Played       2.0
Wins               1.0
Draws              0.0
Losses             1.0
Goals For          3.0
Goals Against      3.0
Goal Difference    0.0
Points             3.0
dtype: float64

The Goal Difference column is probably the most interesting, since it has the largest range and highest standard deviation. The top teams scored a lot more than they were scored on, and the bottom teams were scored on a lot more than they scored.

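Since we are looking at data for all of the teams, the mean number of wins is equal to the mean number of losses: every match that one team wins is a match that another team loses. The same goes for goals scored and goals scored against, which is why the mean Goal Difference is zero.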

import plotly.express as px
fig = px.bar(team_stats.iloc[3:], y='Goal Difference', title='')
fig.show()
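The equal means for wins and losses, and for goals scored and conceded, can be checked directly from the column totals. This is a small sketch using values copied by hand from the team table above rather than the live teams data table.

```python
# Values copied from the Wins, Losses, Goals For and Goals Against columns above
wins = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
losses = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
goals_for = [7, 6, 5, 6, 4, 5, 3, 4, 1, 7, 3, 3, 2, 2, 1, 1, 0, 3, 2, 2]
goals_against = [2, 2, 1, 3, 1, 3, 1, 3, 0, 7, 3, 3, 3, 4, 3, 4, 3, 7, 6, 8]

# Every win is someone else's loss, so the totals (and the means) must match
total_wins = sum(wins)
total_losses = sum(losses)

# Every goal scored is a goal conceded by the opponent
total_for = sum(goals_for)
total_against = sum(goals_against)

# Goal difference for each team, which therefore sums to zero
goal_difference = [f - a for f, a in zip(goals_for, goals_against)]

print(total_wins, total_losses, total_for, total_against, sum(goal_difference))
```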

If we want to see which teams scored more than the mean value of “Goals For”, we can use the following code.

gf_mean = teams['Goals For'].mean()
teams[teams['Goals For'] > gf_mean]
Team Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points
0 Leicester City 2 2 0 0 7 2 5 6
1 Everton 2 2 0 0 6 2 4 6
2 Arsenal 2 2 0 0 5 1 4 6
3 Liverpool 2 2 0 0 6 3 3 6
4 Crystal Palace 2 2 0 0 4 1 3 6
5 Tottenham Hotspur 2 1 0 1 5 3 2 3
7 Brighton & Hove Albion 2 1 0 1 4 3 1 3
9 Leeds United 2 1 0 1 7 7 0 3

In general, but not always, the top teams scored more than the average number of goals.

Mean is probably the best measure of central tendency here, since using the median would just give us the top half of the teams. Mode wouldn’t be useful because there aren’t a lot of repeated values in the column.
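To compare the three measures side by side, here is a minimal sketch that uses Python's built-in statistics module on the Goals For values copied from the table above. It is only a quick check on this small snapshot, not part of the original analysis.

```python
from statistics import mean, median, multimode

# Goals For values copied from the team table above, in rank order
goals_for = [7, 6, 5, 6, 4, 5, 3, 4, 1, 7, 3, 3, 2, 2, 1, 1, 0, 3, 2, 2]

gf_mean = mean(goals_for)                 # arithmetic average
gf_median = median(goals_for)             # middle value of the sorted data
gf_modes = sorted(multimode(goals_for))   # most frequent value(s), ties included

print('mean:', gf_mean)
print('median:', gf_median)
print('mode(s):', gf_modes)
```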

Let’s see if the top teams had fewer than the mean number of goals scored against them.

ga_mean = teams['Goals Against'].mean()
teams[teams['Goals Against'] < ga_mean]
Team Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points
0 Leicester City 2 2 0 0 7 2 5 6
1 Everton 2 2 0 0 6 2 4 6
2 Arsenal 2 2 0 0 5 1 4 6
3 Liverpool 2 2 0 0 6 3 3 6
4 Crystal Palace 2 2 0 0 4 1 3 6
5 Tottenham Hotspur 2 1 0 1 5 3 2 3
6 Manchester City 1 1 0 0 3 1 2 3
7 Brighton & Hove Albion 2 1 0 1 4 3 1 3
8 Aston Villa 1 1 0 0 1 0 1 3
10 Chelsea 2 1 0 1 3 3 0 3
11 Wolverhampton Wanderers 2 1 0 1 3 3 0 3
12 Newcastle United 2 1 0 1 2 3 -1 3
14 Manchester United 1 0 0 1 1 3 -2 0
16 Sheffield United 2 0 0 2 0 3 -3 0

Again, it is generally true that the top teams had fewer goals scored against them.

Team Visualizations

Let’s create some plots of Wins, Losses, Draws versus team rank.

columns = ['Wins', 'Losses', 'Draws']
for column in columns:
    fig = px.scatter(teams, x=teams.index, y=column, title=column+' vs Rank', hover_data=['Team'])
    fig.show()

Player Data

We are also going to look at individual player data for scoring and assists. We’ll download both tables and then look at the first 10 rows of the scorers data table using head(10).

stats = pd.read_html('https://www.espn.com/soccer/stats/_/league/ENG.1/view/scoring')
scorers = stats[0]
assists = stats[1]
scorers.head(10)
RK Name Team P G
0 1.0 Dominic Calvert-Lewin Everton 2 4
1 NaN Son Heung-Min Tottenham Hotspur 2 4
2 3.0 Mohamed Salah Liverpool 2 3
3 NaN Wilfried Zaha Crystal Palace 2 3
4 5.0 Hélder Costa Leeds United 2 2
5 NaN Aleksandar Mitrovic Fulham 2 2
6 NaN Neal Maupay Brighton & Hove Albion 2 2
7 NaN Sadio Mané Liverpool 2 2
8 NaN Patrick Bamford Leeds United 2 2
9 NaN Raúl Jiménez Wolverhampton Wanderers 2 2

Columns:

  • RK: Ranking

  • P: Games played

  • G: Goals scored

  • A: Assists

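There are quite a few missing (NaN) values in the RK column. A missing rank means that the player is tied with the player above them, so we can use ffill(), which means “forward fill”, to replace each missing value with the last valid value above it. (Older code writes this as fillna(method='ffill'), which is deprecated in recent versions of pandas.)

scorers = scorers.ffill()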
scorers.head(10)
RK Name Team P G
0 1.0 Dominic Calvert-Lewin Everton 2 4
1 1.0 Son Heung-Min Tottenham Hotspur 2 4
2 3.0 Mohamed Salah Liverpool 2 3
3 3.0 Wilfried Zaha Crystal Palace 2 3
4 5.0 Hélder Costa Leeds United 2 2
5 5.0 Aleksandar Mitrovic Fulham 2 2
6 5.0 Neal Maupay Brighton & Hove Albion 2 2
7 5.0 Sadio Mané Liverpool 2 2
8 5.0 Patrick Bamford Leeds United 2 2
9 5.0 Raúl Jiménez Wolverhampton Wanderers 2 2
assists = assists.ffill()
assists.head()
RK Name Team P A
0 1.0 Harry Kane Tottenham Hotspur 2 4
1 2.0 Daniel Podence Wolverhampton Wanderers 2 2
2 2.0 Willian Arsenal 2 2
3 2.0 Richarlison Everton 2 2
4 5.0 Tariq Lamptey Brighton & Hove Albion 2 1
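To see forward fill in isolation, here is a small sketch on a hypothetical mini table that mirrors the first five rows of the scorers data. It also recomputes the tied ranks from the goals column with rank(method='min'), as a check that the filled values agree with how ties are shown in the table; that comparison is an illustration, not something the ESPN page documents.

```python
import pandas as pd

# Hypothetical mini table mirroring the first five rows of the scorers data
sample = pd.DataFrame({
    'RK': [1.0, None, 3.0, None, 5.0],  # None becomes NaN, as in the download
    'G': [4, 4, 3, 3, 2],
})

# Forward fill: each NaN takes the last valid value above it
filled = sample['RK'].ffill()

# Recompute ranks from goals; tied players share the lowest rank in their group
recomputed = sample['G'].rank(method='min', ascending=False)

print(filled.tolist())
print(recomputed.tolist())
```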

Let’s create histograms for these two data sets.

fig1 = px.histogram(scorers, x='G', title='Histogram of Goals Scored by Top Players')
fig1.show()
fig2 = px.histogram(assists, x='A', title='Histogram of Assists by Top Players')
fig2.show()

Both of these histograms show that most players have only a few goals (or assists) and only a handful have many, so the distributions are skewed to the right rather than normally distributed.
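One rough way to check the shape numerically is to compare the mean with the median and to look at the sign of the third central moment. The sketch below uses only the ten goal totals visible in the scorers.head(10) output above, so it illustrates the idea rather than describing the full data set.

```python
from collections import Counter
from statistics import mean, median

# Goals for the ten players shown in the scorers.head(10) output above
goals = [4, 4, 3, 3, 2, 2, 2, 2, 2, 2]

# Frequency of each goal total, which is what the histogram bars show
frequencies = Counter(goals)

goals_mean = mean(goals)
goals_median = median(goals)

# Third central moment: positive when the long tail is on the right
third_moment = sum((g - goals_mean) ** 3 for g in goals) / len(goals)

print(dict(frequencies))
print('mean > median:', goals_mean > goals_median)
print('third central moment > 0:', third_moment > 0)
```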

Research Question

Does having more top scoring or top assisting players on a team mean that team has a higher standing?

To answer this question, we will need to group the player data by team and merge the two data tables together. We’ll also drop the columns that we don’t need.

# group the data by team
scorers_team = scorers.groupby('Team').count().drop(columns=['RK', 'Name', 'P'])
assists_team = assists.groupby('Team').count().drop(columns=['RK', 'Name', 'P'])
# merge the players data tables
players = scorers_team.merge(assists_team, on='Team')
# create a column that adds the number of top scorers and top assisters on each team
players['Goals and Assists'] = players['G']+players['A']
# sort the values, create an index column, and display the data
players = players.sort_values('Goals and Assists', ascending=False).reset_index()
players
Team G A Goals and Assists
0 Leeds United 5 5 10
1 Brighton & Hove Albion 4 5 9
2 Arsenal 4 4 8
3 Leicester City 5 3 8
4 Crystal Palace 3 4 7
5 Everton 3 4 7
6 Chelsea 3 2 5
7 Liverpool 3 2 5
8 Manchester City 3 2 5
9 Newcastle United 2 3 5
10 Southampton 2 3 5
11 West Bromwich Albion 2 3 5
12 Fulham 2 2 4
13 Tottenham Hotspur 2 2 4
14 Wolverhampton Wanderers 2 2 4
15 Burnley 2 1 3
16 West Ham United 1 2 3
17 Aston Villa 1 1 2
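Note that count() counts rows, so the G and A columns above hold the number of listed players per team, not the number of goals or assists. The difference is easy to see on a hypothetical mini table (the player and team names below are made up for illustration).

```python
import pandas as pd

# Hypothetical mini table: one row per player on a top-scorers list
sample_scorers = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Team': ['Team X', 'Team X', 'Team Y'],
    'G': [4, 2, 3],
})

# count() gives the number of listed players on each team
players_per_team = sample_scorers.groupby('Team')['G'].count()

# sum() gives the total goals scored by those players
goals_per_team = sample_scorers.groupby('Team')['G'].sum()

print(players_per_team)
print(goals_per_team)
```

Which of the two is appropriate depends on the question; the research question here asks about the number of top players, so a count fits.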

Now we need to merge this data table with the Teams data table from earlier.

combined_data = teams.merge(players, on='Team', how='left') # left means keep every row of the teams data table, in its original order
combined_data
Team Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points G A Goals and Assists
0 Leicester City 2 2 0 0 7 2 5 6 5.0 3.0 8.0
1 Everton 2 2 0 0 6 2 4 6 3.0 4.0 7.0
2 Arsenal 2 2 0 0 5 1 4 6 4.0 4.0 8.0
3 Liverpool 2 2 0 0 6 3 3 6 3.0 2.0 5.0
4 Crystal Palace 2 2 0 0 4 1 3 6 3.0 4.0 7.0
5 Tottenham Hotspur 2 1 0 1 5 3 2 3 2.0 2.0 4.0
6 Manchester City 1 1 0 0 3 1 2 3 3.0 2.0 5.0
7 Brighton & Hove Albion 2 1 0 1 4 3 1 3 4.0 5.0 9.0
8 Aston Villa 1 1 0 0 1 0 1 3 1.0 1.0 2.0
9 Leeds United 2 1 0 1 7 7 0 3 5.0 5.0 10.0
10 Chelsea 2 1 0 1 3 3 0 3 3.0 2.0 5.0
11 Wolverhampton Wanderers 2 1 0 1 3 3 0 3 2.0 2.0 4.0
12 Newcastle United 2 1 0 1 2 3 -1 3 2.0 3.0 5.0
13 Burnley 1 0 0 1 2 4 -2 0 2.0 1.0 3.0
14 Manchester United 1 0 0 1 1 3 -2 0 NaN NaN NaN
15 West Ham United 2 0 0 2 1 4 -3 0 1.0 2.0 3.0
16 Sheffield United 2 0 0 2 0 3 -3 0 NaN NaN NaN
17 Fulham 2 0 0 2 3 7 -4 0 2.0 2.0 4.0
18 Southampton 2 0 0 2 2 6 -4 0 2.0 3.0 5.0
19 West Bromwich Albion 2 0 0 2 2 8 -6 0 2.0 3.0 5.0
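The NaN rows are teams that have no match in the players table. One possible way to handle them, sketched below on hypothetical mini tables, is to flag unmatched rows with indicator=True and then replace the missing counts with zero. Treating a missing value as zero is an assumption: it only holds if a team without a match really has no players on the downloaded lists.

```python
import pandas as pd

# Hypothetical mini versions of the teams and players tables
teams_sample = pd.DataFrame({'Team': ['Team X', 'Team Y', 'Team Z'],
                             'Points': [6, 3, 0]})
players_sample = pd.DataFrame({'Team': ['Team X', 'Team Y'],
                               'Goals and Assists': [8, 5]})

# indicator=True adds a _merge column showing where each row was found
merged = teams_sample.merge(players_sample, on='Team', how='left', indicator=True)

# Assumption: no match means zero listed players for that team
merged['Goals and Assists'] = merged['Goals and Assists'].fillna(0).astype(int)

print(merged)
```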

To see if there is a relationship between Goals and Assists and team rank, let’s create another scatterplot.

fig = px.scatter(combined_data, y='Goals and Assists', x=combined_data.index, hover_data=['Team'], title='Goals and Assists vs Team Rank')
fig.show()

Conclusion

It looks like higher-ranked teams (lower \(x\) values) tend to have more players on the top scorers and top assists lists, although there is a fair amount of variation in the data.
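A correlation coefficient is one way to put a number on how strong this trend is. The sketch below computes a Pearson correlation by hand from the values in the combined table above, leaving out the two teams with missing values; the helper function pearson is written for this illustration, and with only two games played per team the result should be read with caution.

```python
from math import sqrt

# Rank (index) and 'Goals and Assists' copied from the combined table above;
# Manchester United (14) and Sheffield United (16) are omitted because of NaN
ranks = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 18, 19]
counts = [8, 7, 8, 5, 7, 4, 5, 9, 2, 10, 5, 4, 5, 3, 3, 4, 5, 5]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

r = pearson(ranks, counts)
print(round(r, 2))
```

A negative value means the count tends to fall as the rank number rises, and a value closer to zero means a weaker relationship.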

Perhaps we could look at a similar analysis using a larger data set from a league such as the National Hockey League where there are more games played by more teams.
