Is soccer more a game of chance, or a game of skill?

With the FIFA World Cup in my recent memory and the English Premier League (EPL) kicking off this Friday (see here for match schedules), I’ve been thinking a bit about the mathematics/statistics of the beautiful game. In this post, I want to answer the following question: Is soccer more a game of chance, or a game of skill?

I’m interested in this because I want to get a sense of how much randomness is inherent in the game of soccer. To be more precise: in a perfect game of chance (e.g. coin flipping), the better team will beat the weaker team exactly 50% of the time, since there is no element of skill. At the other extreme, in a perfect game of skill, the better team will beat the weaker team 100% of the time.

Where does soccer lie on this continuum? Anyone who’s watched a game of soccer knows that (i) the team which scores more goals wins, and (ii) goals are rare! With the outcome of the game hinging on just a few events, my initial guess was that soccer might be closer to the “game of chance” end than teams would like us to believe.

At this point you might raise the question: what does it mean for a team to be better anyway? For this post, I will take a team’s ranking at the end of the season as a measure of its quality: a team with a higher ranking is deemed better than a team with a lower ranking. I will also be assuming that the team’s quality is constant throughout the season.

For this analysis, I looked at EPL data for 9 seasons, beginning from 2008-2009 to 2016-2017. (One of my data sources did not have 2017-2018 data yet.) I used data from 2 sources:

1. england.rda from jalapic’s engsoccerdata repository. This is a real treasure trove of English football results, containing match statistics for the top 4 tiers of English football all the way back to 1888!
2. Standings from this Google sheet. Again, another treasure trove of data! Unfortunately, the owner has disabled the ability to download or copy the data, so I had to record them manually.
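To make the setup concrete, here is a minimal Python sketch of the kind of tally this post relies on. It is not the R code behind the analysis: the function name `better_team_rates`, the tuple layout of the matches and the toy results are all assumptions made for illustration. A lower rank number means a better final league position.

```python
def better_team_rates(matches, final_rank):
    """Tally results from the point of view of the better-ranked team.

    matches: iterable of (home, away, home_goals, away_goals) tuples.
    final_rank: dict mapping team -> end-of-season position (1 is best).
    Returns (win rate, win-or-draw rate, win rate excluding draws).
    """
    wins = draws = losses = 0
    for home, away, home_goals, away_goals in matches:
        # goal difference seen from the better-ranked team's side
        if final_rank[home] < final_rank[away]:
            diff = home_goals - away_goals
        else:
            diff = away_goals - home_goals
        if diff > 0:
            wins += 1
        elif diff == 0:
            draws += 1
        else:
            losses += 1
    total = wins + draws + losses
    return wins / total, (wins + draws) / total, wins / (wins + losses)


# made-up results for three hypothetical teams, purely for illustration
toy_rank = {"A": 1, "B": 2, "C": 3}
toy_matches = [
    ("A", "B", 2, 1),  # better team (A) wins
    ("B", "A", 1, 1),  # draw
    ("C", "A", 1, 0),  # better team (A) loses
    ("B", "C", 3, 0),  # better team (B) wins
]
win, win_or_draw, win_given_decisive = better_team_rates(toy_matches, toy_rank)
print(win, win_or_draw, win_given_decisive)
```

The third number ignores drawn games entirely, so it is only defined when at least one game has a winner.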

Now for the analysis. (R code for the analysis can be found here.) It turns out that in these 9 seasons, the better-ranked team won 53.8% of the time, and won or drew 79.2% of the time. These figures are fairly stable across seasons, as we can see in the figure below:

How do we compare this with the sliding scale of 50% win for games of chance vs. 100% for games of skill? Well… we can’t! At least not directly. There is the issue of how to deal with draws: our sliding scale assumes that the outcome of the game is either a W or an L.

There are 2 ways we can fix this issue. The first is to compare the EPL results with a modified sliding scale, where the probability of a draw is equal to the proportion of EPL games that end in a draw. As an example: if 50% of games end in a draw, then in a game of chance the better team will win 25% of the time, and win/draw 75% of the time. In a game of skill, the better team will win 50% of the time, and win/draw 100% of the time.
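The modified sliding scale is simple arithmetic, sketched below in Python. The function name `baselines` and the dictionary layout are my own choices for illustration, and the sketch assumes the draw rate is the same for a game of chance and a game of skill, as described above.

```python
def baselines(draw_rate):
    """Return (win, win-or-draw) baselines for chance and for skill.

    draw_rate: proportion of games that end in a draw, between 0 and 1.
    """
    decisive = 1.0 - draw_rate  # proportion of games with a winner
    # game of chance: decisive games are split evenly between the two teams
    chance = (decisive / 2, decisive / 2 + draw_rate)
    # game of skill: the better team wins every decisive game
    skill = (decisive, 1.0)
    return {"chance": chance, "skill": skill}


print(baselines(0.5))
```

With a draw rate of 0.5 this reproduces the worked example: 25% and 75% for a game of chance, 50% and 100% for a game of skill.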

With this, we can update the baseline in the figure above (dashed line below is for game of chance, dashed line above is for game of skill):

The second way to fix this issue is to do the analysis conditional on the game outcome being a win or a loss. (This is tantamount to throwing away games which end in a draw.) If we do this, then our original sliding scale (chance 50% skill 100%) applies. The figure shows the results of the conditional analysis:

So, is soccer closer to a game of chance or a game of skill? Draw your own conclusions!

[Note 1: This is a crude, first-pass model that does not capture more complex ideas. For example, we implicitly assume that all that matters in determining a better team’s win percentage is the fact that it is better. This is overly simplistic: a better team is going to win much more often if it is playing a vastly weaker opponent compared to playing an opponent that is just slightly weaker.]

[Note 2: This analysis was done with just EPL data. It would be interesting to see if we get similar results for leagues in other countries.]


FIFA World Cup analyses

It’s been a week since the FIFA World Cup ended but I’m still feeling hungover from it…

I’ve done some data analyses on the 2018 FIFA World Cup; you can see them at the following links:

World cup group stage outcomes

With the group stages of World Cup 2018 drawing to a close, I was wondering which final score configurations are attainable in a group (e.g. 9, 6, 3, 0 for Group A), and how many different match outcomes result in each score configuration. With just $3^6 = 729$ possibilities (“win”, “draw” or “loss” for each of 6 games), this was easy to code up.

There are 40 different possible group score configurations, with 7, 4, 4, 1 and 6, 4, 4, 3 being the most “common”, in the sense that they are the most common result if each of “win”, “draw” and “loss” was equally likely for each game. The table below shows the full list:

| Score configuration | No. of match outcomes (share of 729) |
| --- | --- |
| 7, 4, 4, 1 | 36 (4.9%) |
| 6, 4, 4, 3 | 36 (4.9%) |
| 9, 6, 3, 0 | 24 (3.3%) |
| 9, 4, 3, 1 | 24 (3.3%) |
| 9, 4, 2, 1 | 24 (3.3%) |
| 7, 6, 4, 0 | 24 (3.3%) |
| 7, 6, 3, 1 | 24 (3.3%) |
| 7, 6, 2, 1 | 24 (3.3%) |
| 7, 5, 4, 0 | 24 (3.3%) |
| 7, 5, 3, 1 | 24 (3.3%) |
| 7, 5, 2, 1 | 24 (3.3%) |
| 7, 4, 3, 3 | 24 (3.3%) |
| 7, 4, 3, 2 | 24 (3.3%) |
| 7, 4, 3, 1 | 24 (3.3%) |
| 7, 4, 2, 2 | 24 (3.3%) |
| 6, 6, 4, 1 | 24 (3.3%) |
| 6, 6, 3, 3 | 24 (3.3%) |
| 6, 5, 4, 1 | 24 (3.3%) |
| 6, 4, 4, 2 | 24 (3.3%) |
| 5, 5, 4, 1 | 24 (3.3%) |
| 5, 4, 4, 3 | 24 (3.3%) |
| 5, 4, 4, 2 | 24 (3.3%) |
| 5, 4, 3, 2 | 24 (3.3%) |
| 9, 6, 1, 1 | 12 (1.6%) |
| 9, 4, 4, 0 | 12 (1.6%) |
| 7, 7, 3, 0 | 12 (1.6%) |
| 7, 3, 2, 2 | 12 (1.6%) |
| 6, 5, 2, 2 | 12 (1.6%) |
| 5, 5, 3, 2 | 12 (1.6%) |
| 5, 5, 3, 1 | 12 (1.6%) |
| 5, 5, 2, 2 | 12 (1.6%) |
| 5, 3, 3, 2 | 12 (1.6%) |
| 9, 3, 3, 3 | 8 (1.1%) |
| 6, 6, 6, 0 | 8 (1.1%) |
| 4, 4, 4, 3 | 8 (1.1%) |
| 7, 7, 1, 1 | 6 (0.8%) |
| 4, 4, 4, 4 | 6 (0.8%) |
| 9, 2, 2, 2 | 4 (0.5%) |
| 5, 5, 5, 0 | 4 (0.5%) |
| 3, 3, 3, 3 | 1 (0.1%) |

The code I used to produce the table above is below:

```python
import collections
import itertools


def update_score(scores, home, away, outcome):
    """Update scores in place; outcome is from the home team's point of view."""
    if outcome == "win":
        scores[home] += 3
    elif outcome == "loss":
        scores[away] += 3
    else:
        scores[home] += 1
        scores[away] += 1


# maps each sorted score configuration to the number of outcomes producing it
score_dict = collections.defaultdict(int)

for outcome in itertools.product(["win", "draw", "loss"], repeat=6):
    # compute points for each of the teams
    scores = [0, 0, 0, 0]
    update_score(scores, 0, 1, outcome[0])
    update_score(scores, 0, 2, outcome[1])
    update_score(scores, 0, 3, outcome[2])
    update_score(scores, 1, 2, outcome[3])
    update_score(scores, 1, 3, outcome[4])
    update_score(scores, 2, 3, outcome[5])

    # team identity does not matter, only the sorted points
    score_dict[tuple(sorted(scores, reverse=True))] += 1

# sort configurations from most to least common
score_list = [(v, k) for k, v in score_dict.items()]
score_list.sort(reverse=True)

for item in score_list:
    # configuration, count, percentage of the 729 outcomes
    print(item[1], item[0], round(item[0] / 729.0 * 100, 1))

# number of distinct configurations
print(len(score_list))
```

World cup 2018 FAQ

World cup fever is underway! I’ve assembled a little FAQ below on some questions that I thought about while watching the group stages of the tournament. (I wrote a similar post back in 2014 which you can view here.)

1. What is the minimum number of points needed to guarantee qualifying for the knockout stage?

A team needs 7 points to guarantee qualifying. The maximum number of points the 4 teams can earn together is 18 (6 games of 3 points each). While it’s possible for 3 teams to get 6 points each, it’s not possible for 3 teams to get at least 7 points each, since that would take at least 21 points.
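This can be double-checked by brute force over all 729 outcomes. The sketch below is only an illustration: the helper names are made up, and it treats tiebreakers as unfavourable, so a points total counts as unsafe whenever two other teams can finish level with or above it.

```python
import itertools

# the six group matches as (team, team) index pairs
MATCHES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


def final_points(outcome):
    """Points of the 4 teams; outcome[i] is 'win', 'draw' or 'loss'
    from the point of view of the first team in MATCHES[i]."""
    points = [0, 0, 0, 0]
    for (first, second), result in zip(MATCHES, outcome):
        if result == "win":
            points[first] += 3
        elif result == "loss":
            points[second] += 3
        else:
            points[first] += 1
            points[second] += 1
    return points


def unsafe_totals():
    """Point totals with which a team can fail to be clearly in the top 2."""
    unsafe = set()
    for outcome in itertools.product(["win", "draw", "loss"], repeat=6):
        points = final_points(outcome)
        for team, p in enumerate(points):
            # rivals that finish level with or above this team
            rivals = sum(1 for other, q in enumerate(points)
                         if other != team and q >= p)
            if rivals >= 2:
                unsafe.add(p)
    return unsafe


unsafe = unsafe_totals()
print(sorted(unsafe))
```

The largest unsafe total should be 6, which agrees with the answer above: 7 points is the smallest total that cannot be matched by two other teams.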

2. Is 2 wins enough to guarantee qualification for the knockout stage?

Somewhat surprisingly, no! It is possible to get knocked out even with 2 wins (i.e. 6 points). Let’s say our 4 teams are A, B, C and D. A beats B, B beats C, C beats A, and all 3 teams beat D. In this case, A, B and C all have 6 points but only two of them can qualify.

Interestingly enough, this might happen in this year’s Group F (Mexico, Germany, Sweden, South Korea).

3. Is it possible for the group to be decided after Matchday 2 (i.e. first 4 matches)?

Yes, in the sense that after Matchday 2, we know which 2 teams go through to the knockout rounds and which two get knocked out. For example, if A beats C and D, and B beats C and D, then A and B are through for sure. This happened in this year’s Group A (Russia, Uruguay, Egypt, Saudi Arabia) and Group G (England, Belgium, Tunisia, Panama).

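Having A and B beat C and D (as above) is the only way for the group to be decided after Matchday 2 in the sense above. In that scenario the group winner is still undecided, since A and B have 6 points each and have yet to play each other. (First place on its own can be settled early in a different scenario: if A beats C and D while B draws with both, A has 6 points and no other team can finish with more than 5.)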
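Here is a rough brute-force check of the Matchday 2 question, written as a sketch under a few assumptions of mine: the fixture order in `FIRST_FOUR` and `LAST_TWO` is hypothetical, all names are made up, and a place only counts as settled if it is clear on points alone, whatever happens on Matchday 3. It also counts the states in which a single team is already certain to finish first on points, which is a separate question from whether both qualifiers are known.

```python
import itertools

# assumed schedule: matchdays 1 and 2 first, then matchday 3
FIRST_FOUR = [(0, 1), (2, 3), (0, 2), (1, 3)]
LAST_TWO = [(0, 3), (1, 2)]
RESULTS = ["win", "draw", "loss"]


def add_points(points, pairs, outcome):
    """Return a copy of points updated with the given match results."""
    points = list(points)
    for (first, second), result in zip(pairs, outcome):
        if result == "win":
            points[first] += 3
        elif result == "loss":
            points[second] += 3
        else:
            points[first] += 1
            points[second] += 1
    return points


def count_settled():
    """Count matchday-2 states where the top 2 / the winner are already fixed."""
    top_two_fixed = winner_fixed = 0
    for early in itertools.product(RESULTS, repeat=4):
        after_md2 = add_points([0, 0, 0, 0], FIRST_FOUR, early)
        finals = [add_points(after_md2, LAST_TWO, late)
                  for late in itertools.product(RESULTS, repeat=2)]
        # teams that are strictly first in every completion
        sure_winners = [t for t in range(4) if all(
            f[t] > max(f[o] for o in range(4) if o != t) for f in finals)]
        # teams with at most one rival level or above in every completion
        sure_top_two = [t for t in range(4) if all(
            sum(f[o] >= f[t] for o in range(4) if o != t) <= 1 for f in finals)]
        winner_fixed += bool(sure_winners)
        top_two_fixed += len(sure_top_two) == 2
    return top_two_fixed, winner_fixed


print(count_settled())
```

The first count covers the 81 possible states after four matches in which both qualifiers are certain; the second covers those in which first place is certain.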

4. Is 2 losses enough to guarantee elimination?

Interestingly, it is not enough! This is the case with this year’s Group F (Mexico, Germany, Sweden, South Korea). Here is a possible configuration: A beats B, C and D to top the group with 9 points. B beats C, C beats D, and D beats B, so they each have 3 points and 2 losses. Since the top 2 teams go through, one of the teams with 2 losses will go into the knockout stages.

This is the only possible configuration for a team with 2 losses to go through. Let’s say D loses to A and B. The maximum number of points D can obtain at the end of the group stage is 3 points. In order for D to advance, we cannot have two other teams scoring more than 3 points. However, A and B already have 3 points, with two other matches (B vs. C and A vs. C) from which they can earn points.

• If A and B both win or draw these games, they will both have at least 4 points and D cannot advance.
• If A and B both lose these games, C will have 6 points and at least one of A and B will take points from A vs. B on Matchday 3, so D cannot advance.
• If A draws and B loses, then A and C will have 4 points, and D cannot advance. (Similarly, if B draws and A loses.)
• If A wins and B loses, then A has 6 points, B has 3 points and C has 3 points. The only way D can advance is for A to beat B in the last remaining match, and that is the configuration above. (Similarly, if A loses and B wins.)
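The case analysis above can also be checked by enumeration. The following is a sketch only: the function name is hypothetical, and it assumes a team can still advance whenever at most one other team finishes strictly above it on points (so any tie is allowed to break in its favour).

```python
import itertools

# the six group matches as (team, team) index pairs
MATCHES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


def two_loss_survivors():
    """Sorted final tables in which a team with 2+ losses can still advance."""
    tables = set()
    for outcome in itertools.product(["win", "draw", "loss"], repeat=6):
        points = [0, 0, 0, 0]
        losses = [0, 0, 0, 0]
        for (first, second), result in zip(MATCHES, outcome):
            if result == "win":
                points[first] += 3
                losses[second] += 1
            elif result == "loss":
                points[second] += 3
                losses[first] += 1
            else:
                points[first] += 1
                points[second] += 1
        for team in range(4):
            # teams that finish strictly above this one on points
            above = sum(1 for other in range(4) if points[other] > points[team])
            if losses[team] >= 2 and above <= 1:
                tables.add(tuple(sorted(points, reverse=True)))
    return tables


print(two_loss_survivors())
```

The only table returned should be 9, 3, 3, 3, matching the configuration described in the answer.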

p^2-q and q^2-p prime

“$p$ and $q$ are 2 prime numbers. $p^2 - q$ and $p - q^2$ are also prime. If you divide $p^2 - q$ by a composite number $n$, where $n < p$, you’ll get a remainder of 14. If you divide $p - q^2 + 14$ by the same number, what will you get as the remainder?” – Akash

Statistical odds and ends

Between school and family duties, I’ve been finding it hard to find any time to indulge in olympiad math blogging 😦 At the same time, I’ve missed the feeling of typing up stuff that I find interesting and sharing it with others.

To that end, I just started a new blog Statistical Odds and Ends! The idea for this began when I found myself spending a lot of time googling relatively simple things in the course of my studies and research. For example:

• Why does the ridge regression solution exist and why is it unique?
• What is the formula for the matrix $P$ such that the projection of the vector $v$ onto the column space of a matrix $A$ is $Pv$?
• Can I switch supremums and expectations and still have equality? If not, can I get an inequality instead?
• How can I derive the bias-variance decomposition?

I was often googling for the same things over and over again, and trying to re-understand what others were writing.

Hence the idea of Statistical Odds and Ends. The blog will be a place for me to pen down my understanding of these statistical tidbits, and to share it with others. Hopefully some of the material there will be of interest to you! If the content is relevant to this audience, I will cross-post over on this blog too.