Birthday Paradox
June 20, 2014
The birthday paradox, which we studied in a previous exercise, states that in any group of 23 people there is about a 50% chance that at least two of them share a birthday. The BBC recently published an article that shows 16 of the 32 World Cup teams, each consisting of 23 players, have shared birthdays, thus demonstrating the paradox precisely. Today’s exercise asks you to recreate their calculation.
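As a quick sanity check on the 50% figure, here is a minimal sketch of the standard calculation. It assumes 365 equally likely birthdays and ignores leap days and seasonal effects; the function name is only illustrative.

```python
from __future__ import division, print_function


def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes every day of the year is equally likely (no leap days).
    """
    p_all_distinct = 1.0
    for i in range(n):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


p23 = p_shared_birthday(23)
expected_teams = 32 * p23  # expected number of squads with a shared birthday

print(round(p23, 4))             # 0.5073
print(round(expected_teams, 1))  # 16.2
```

Under those assumptions, about 16 of 32 squads of 23 players would be expected to contain a shared birthday.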
You can obtain the same listing of player birthdays that the BBC used from FIFA. Another source is the player rosters at Wikipedia.
Your task is to demonstrate that the 2014 World Cup rosters honor the birthday paradox. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
In Python. I took the lazy approach. Starting from the FIFA pdf file, I selected all the text and copied it to a file. Then a regular expression on the dates gives exactly 32 * 23 dates. After stripping off the year and splitting by team, the result is quickly obtained.
import re

FILE = "allplayer.txt"  # from FIFA pdf
m = re.findall(r"\d\d\.\d\d\.19\d\d", open(FILE).read())
assert len(m) == 32 * 23
mm = [mi[:-5] for mi in m]  # strip off the year
score = 0
for t in range(32):
    M = mm[23*t:23*(t+1)]
    if len(set(M)) < 23:
        score += 1
print score  # -> 16

I’m so sorry.
sum(len(set(_))<23 for _ in zip(*[iter(__import__('re').findall(r'(\d\d? (?:January|February|March|April|May|June|July|August|September|October|November|December)) \d{4} ', __import__('urllib2').urlopen('http://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads').read()))]*23))

Praxis, did you over-edit the example output? The players from France and Honduras shouldn’t be a pair, but there appears to be a different pair for each of them in the Wikipedia data (assuming my scripts don’t make stuff up).
This is an XSL Transform that extracts the data from the Wikipedia page (when given the Wikipedia page as input) and writes Scheme code. Three extra entries at the end are not teams and have empty “player lists”, which I let be. I didn’t write the Scheme code to count coincidences.
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>(define ros '(</xsl:text>
    <xsl:for-each select='//h3[span[@class="mw-headline"]]'>
      <xsl:text>("</xsl:text><xsl:value-of select="span"/><xsl:text>" </xsl:text>
      <xsl:for-each select="following-sibling::table[1]/tr/td/table/tr[td]">
        <xsl:text>("</xsl:text><xsl:value-of select="td[3]/a/@title"/><xsl:text>" . "</xsl:text>
        <xsl:value-of select="td[4]/span/span[@class]" />
        <xsl:text>")</xsl:text>
      </xsl:for-each>
      <xsl:text>)</xsl:text>
    </xsl:for-each>
    <xsl:text>))</xsl:text>
  </xsl:template>
</xsl:transform>

Instead, I edited my XSL Transform to write Python code, thus:
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>ros = {</xsl:text>
    <xsl:for-each select='//h3[span[@class="mw-headline"]]'>
      <xsl:text>"</xsl:text><xsl:value-of select="span"/><xsl:text>" : {</xsl:text>
      <xsl:for-each select="following-sibling::table[1]/tr/td/table/tr[td]">
        <xsl:text>"</xsl:text><xsl:value-of select="td[3]/a/@title"/><xsl:text>" : "</xsl:text>
        <xsl:value-of select="td[4]/span/span[@class]" />
        <xsl:text>", </xsl:text>
      </xsl:for-each>
      <xsl:text>}, </xsl:text>
    </xsl:for-each>
    <xsl:text>}</xsl:text>
  </xsl:template>
</xsl:transform>

This is run from the shell so:
And after importing the result to a Python session, I used it like this (some linebreaks added for presentation):
>>> [ key for key in ros.keys()
      if len(set(date.split('-', 1)[1] for date in ros[key].values()))
         < len(ros[key]) ]
['Brazil', 'Netherlands', 'Colombia', 'France', 'United States',
 'Croatia', 'Iran', 'Switzerland', 'Honduras', 'Argentina',
 'Cameroon', 'Nigeria', 'Australia', 'Algeria', 'Russia', 'Germany',
 'Bosnia and Herzegovina', 'Chile', 'South Korea', 'Spain']
>>> len([ key for key in ros.keys()
          if len(set(date.split('-', 1)[1] for date in ros[key].values()))
             < len(ros[key]) ])
20

The count of twenty countries agrees with the official example result because I have not corrected the data in any way.
According to the Wikipedia page on Jose Rojas, Jose Rojas from Chile was born on 23.06.1983, which agrees with the FIFA list. However, the Wikipedia page on the FIFA squads gives a birth date of 03.06.1983. There are clearly inconsistencies in Wikipedia. Maybe also in the FIFA list?!?
It appears there are two answers, depending on the input data. In the script below I calculate the probabilities for the number of teams with shared birthdays. As expected, the highest probability is for 16 teams, but the distribution is pretty wide. Twenty teams is still a likely outcome (its probability is about half of the probability for 16 teams). So even if the answer is 20, the birthday paradox has still been “proven”. Most people would expect only a few teams to have shared birthdays.
from __future__ import division
from random import randrange
from collections import defaultdict

NR_TEAMS = 32
NR_PLAYERS = 23

def simulate(N):
    """find distribution of number of teams with equal birthdays
    32 teams with 23 players
    """
    scores = defaultdict(int)
    for n in range(N):
        score = 0
        for t in range(NR_TEAMS):
            dates = [randrange(1, 366) for p in range(NR_PLAYERS)]
            if len(set(dates)) < NR_PLAYERS:
                score += 1
        scores[score] += 1
    s = 0
    for k in sorted(scores.keys()):
        a = scores[k] / N
        s += a
        print "{:2d} {:5.3f} {:5.3f}".format(k, a, s)

simulate(1000000)

"""
    freq  cumul
 3 0.000 0.000
 4 0.000 0.000
 5 0.000 0.000
 6 0.000 0.000
 7 0.001 0.001
 8 0.002 0.003
 9 0.005 0.008
10 0.013 0.021
11 0.026 0.047
12 0.047 0.094
13 0.074 0.168
14 0.103 0.271
15 0.127 0.399
16 0.139 0.538
17 0.135 0.673
18 0.116 0.788
19 0.088 0.876
20 0.059 0.935
21 0.035 0.970
22 0.018 0.987
23 0.008 0.995
24 0.003 0.999
25 0.001 1.000
26 0.000 1.000
27 0.000 1.000
28 0.000 1.000
"""

A really practical explanation of the birthday paradox is here: http://betterexplained.com/articles/understanding-the-birthday-paradox/
@krups. The explanation given in your link is nice, but unfortunately the formula to calculate the probability for 23 people is not correct. The correct formula can be found on Wikipedia.
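For reference, the exact probability for 23 people can be computed directly, and from it the distribution of the number of teams with a shared birthday follows as a binomial with 32 trials. This is only a sketch under the usual assumptions (365 equally likely days, teams independent of each other); the function names are illustrative and not taken from any of the solutions above.

```python
from __future__ import division, print_function
from math import factorial


def p_shared_birthday(n, days=365):
    """Exact probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    comb = factorial(n) // (factorial(k) * factorial(n - k))
    return comb * p ** k * (1 - p) ** (n - k)


p = p_shared_birthday(23)  # roughly 0.507
pmf16 = binomial_pmf(16, 32, p)
pmf20 = binomial_pmf(20, 32, p)

print(16, round(pmf16, 3))
print(20, round(pmf20, 3))
print("ratio", round(pmf20 / pmf16, 2))
```

The values for 16 and 20 teams should come out close to the simulated frequencies of 0.139 and 0.059 listed earlier, so both reported counts are plausible outcomes under this model.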