Birthday Paradox

June 20, 2014

I decided to use the Wikipedia roster, and downloaded the “view source” to file Then I used awk, which is a good choice for data laundry like this, to scrape the player data from the file:

# rosters.awk

BEGIN { OFS = "|" }

/Captain/ { next }

/Defender/ { next }

/headline/ { # country
    c = $0 }

/title/ { # player
    p = $0 }

/bday/ { # birthday
    # print c, p, $0 }
    b = substr($0,length($0)-5)
    if (c == "Algeria" && p ~ /Lacen/) b = "1984-03-15"
    if (c == "Chile" && p ~ /Manuel Rojas/) b = "1983-06-23"
    if (c == "Chile" && p ~ /Beausejour/) b = "1984-06-01"
    if (c == "Croatia" && p ~ /Badelj/) b = ""
    if (c == "Germany" && p ~ /Kramer/) b = "1991-02-19"
    if ( birthdays[c,b] != "" ) {
        print birthdays[c,b]
        print c,p,$0
        print ""
        countries[c]++ }
    birthdays[c,b] = c OFS p OFS $0 }

END { for (i in countries) pairs++
      print pairs }

The program looks for three bits of information. The /headline/ action extracts a country name from the file. The /title/ action extracts a player name from the file. The /bday/ action extracts the player’s birthday, then looks for matches among the country/birthday pairs already seen. The /Captain/ and /Defender/ actions remove unneeded lines from the output.

The five lines in the middle of the /bday/ action are interesting. When I first ran the program, without those lines, I had 20 teams with birthday pairs instead of the 16 teams of the BBC. Examining the data showed several errors in the WikiPedia data (I’m assuming the FIFA data is correct), which I fixed with those five lines of code. Here’s the output, edited to fix unicode errors:

$ awk -f rosters.awk
Brazil|Hulk |1986-07-25
Brazil|Paulinho |1988-07-25

Cameroon|Eric Maxim Choupo-Moting|1989-03-23
Cameroon|Eyong Enoh|1986-03-23

Australia|Mathew Ryan|1992-04-08
Australia|Dario Vidosic|1987-04-08

Netherlands|Daryl Janmaat|1989-07-22
Netherlands|Dirk Kuyt|1980-07-22

Spain|Koke |1992-01-08
Spain|David Silva|1986-01-08

Colombia|A.C. Milan|1976-01-13
Colombia|Santiago Arias|1992-01-13

France|Remy Cabella|1984-07-29
Honduras|Wilson Palacios|1984-07-29

Switzerland|Stephan Lichtsteiner|1984-01-16
Switzerland|Reto Ziegler|1986-01-16

Switzerland|Valentin Stocker|1989-04-12
Switzerland|Blerim Dzemaili|1986-04-12

Argentina|Sergio Romero|1987-02-22
Argentina|Enzo Perez|1986-02-22

Argentina|Fernando Gago|1986-04-10
Argentina|Augusto Fernandez|1986-04-10

Bosnia_and_Herzegovina|Asmir Begovic|1987-06-20
Bosnia_and_Herzegovina|Sead Kolasinac|1993-06-20

Iran|Amir Hossein Sadeghi|1981-09-06
Iran|Pejman Montazeri|1983-09-06

Iran|Reza Haghighi|1989-02-01
Iran|Steven Beitashour|1987-02-01

Nigeria|Victor Moses|1990-12-12
Nigeria|Ramon Azeez|1992-12-12

United_States|John Anthony Brooks|1993-01-28
United_States|Chris Wondolowski|1983-01-28

Russia|Sergei Ignashevich|1979-07-14
Russia|Maksim Kanunnikov|1991-07-14

South_Korea|Kwak Tae-hwi|1981-07-08
South_Korea|Son Heung-min|1992-07-08

South_Korea|Kim Young-gwon|1990-02-27


You can run the program at

About these ads

Pages: 1 2

9 Responses to “Birthday Paradox”

  1. Paul said

    In Python. I took the lazy approach. Starting from the FIFA pdf file, I selected all text and copied it to a file. The an re on the dates give exactly 32 * 23 dates. Stripping off the year and splitting by team, the result is quickly obtained.

    import re
    FILE = "allplayer.txt"   # from FIFA pdf
    m = re.findall(r"\d\d\.\d\d\.19\d\d", open(FILE).read())
    assert len(m) == 32 * 23
    mm = [mi[:-5] for mi in m]   # strip off the year
    score = 0
    for t in range(32):
        M = mm[23*t:23*(t+1)]
        if len(set(M)) < 23:
            score += 1
    print score   # -> 16
  2. Nick said

    I’m so sorry.

    sum(len(set(_))<23 for _ in zip(*[iter(__import__('re').findall(r'(\d\d? (?:January|February|March|April|May|June|July|August|September|October|November|December)) \d{4} ', __import__('urllib2').urlopen('').read()))]*23))
  3. Jussi Piitulainen said

    Praxis, did you over-edit the example output? The players from France and Honduras shouldn’t be a pair, but there appears to be a different pair for each of them in the Wikipedia data (assuming my scripts don’t make stuff up).

  4. Jussi Piitulainen said

    This is an XSL Transform that extracts the data from the Wikipedia page (when given the Wikipedia page as input) and writes Scheme code. Three extra entries at end are not teams and have empty “player lists”, which I let be. I didn’t write the Scheme code to count coincidences.

    <xsl:transform version="1.0" xmlns:xsl="">
      <xsl:output method="text"/>
      <xsl:template match="/"><xsl:text>(define ros '(</xsl:text>
        <xsl:for-each select='//h3[span[@class="mw-headline"]]'>
          <xsl:text>("</xsl:text><xsl:value-of select="span"/><xsl:text>" </xsl:text>
          <xsl:for-each select="following-sibling::table[1]/tr/td/table/tr[td]">
            <xsl:text>("</xsl:text><xsl:value-of select="td[3]/a/@title"/><xsl:text>" . "</xsl:text>
            <xsl:value-of select="td[4]/span/span[@class]" />
  5. Jussi Piitulainen said

    Instead, I edited my XSL Transform to write Python code, thus:

    <xsl:transform version="1.0" xmlns:xsl=""><xsl:output method="text"/>
      <xsl:template match="/"><xsl:text>ros = {</xsl:text>
        <xsl:for-each select='//h3[span[@class="mw-headline"]]'>
          <xsl:text>"</xsl:text><xsl:value-of select="span"/><xsl:text>" : {</xsl:text>
          <xsl:for-each select="following-sibling::table[1]/tr/td/table/tr[td]">
            <xsl:text>"</xsl:text><xsl:value-of select="td[3]/a/@title"/><xsl:text>" : "</xsl:text>
            <xsl:value-of select="td[4]/span/span[@class]" />
            <xsl:text>", </xsl:text>
          <xsl:text>}, </xsl:text>

    This is run from the shell so:

    $ xsltproc -o roster.xsl

    And after importing the result to a Python session, I used it like this (some linebreaks added for presentation):

    >>> [ key for key in ros.keys()
          if len(set(date.split('-', 1)[1]
                 for date in ros[key].values())) < len(ros[key]) ]
    ['Brazil', 'Netherlands', 'Colombia', 'France', 'United States',
    'Croatia', 'Iran', 'Switzerland', 'Honduras', 'Argentina', 'Cameroon',
    'Nigeria', 'Australia', 'Algeria', 'Russia', 'Germany', 'Bosnia and
    Herzegovina', 'Chile', 'South Korea', 'Spain']
    >>> len([ key for key in ros.keys()
              if len(set(date.split('-', 1)[1]
                     for date in ros[key].values())) < len(ros[key]) ])

    Twenty countries agrees with the official example result because I have not corrected the data in any way.

  6. Paul said

    According to Wikipedia page on Jose Rojas Jose Rojas from Chile was born on 23.06.1983, which checks with the FIFA list on
    FIFA list.
    However on Wikipedia FIFA squads there is a birth date of 03.06.1983. There are clearly inconsistencies in Wikipedia. Maybe also in the FIFA list?!?

  7. Paul said

    It appears there are 2 answers depending on the input data. In the script below I calculate the probabillities for the number of teams with equal birthdays. As expected the highest probabillity is for 16 teams, but the distribution is pretty wide. Twenty teams is still a likely outcome (probabillity is about half of the probability for 16 teams). So, even if the answer is 20, then the birthday paradox has been still “proven”. Most people would think, that only a few teams would have equal birthdays.

    from __future__ import division
    from random import randrange
    from collections import defaultdict
    NR_TEAMS = 32
    NR_PLAYERS = 23
    def simulate(N):
        """find distribution of number of teams with equal birthdays
           32 teams with 23 players
        scores = defaultdict(int)
        for n in range(N):
            score = 0
            for t in range(NR_TEAMS):
                dates = [randrange(1, 366) for p in range(NR_PLAYERS)]
                if len(set(dates)) < NR_PLAYERS:
                    score += 1
            scores[score] += 1
        s = 0
        for k in sorted(scores.keys()):
            a = scores[k] / N
            s += a
            print "{:2d} {:5.3f} {:5.3f}".format(k, a, s)
       freq  cumul
     3 0.000 0.000
     4 0.000 0.000
     5 0.000 0.000
     6 0.000 0.000
     7 0.001 0.001
     8 0.002 0.003
     9 0.005 0.008
    10 0.013 0.021
    11 0.026 0.047
    12 0.047 0.094
    13 0.074 0.168
    14 0.103 0.271
    15 0.127 0.399
    16 0.139 0.538
    17 0.135 0.673
    18 0.116 0.788
    19 0.088 0.876
    20 0.059 0.935
    21 0.035 0.970
    22 0.018 0.987
    23 0.008 0.995
    24 0.003 0.999
    25 0.001 1.000
    26 0.000 1.000
    27 0.000 1.000
    28 0.000 1.000"""    
  8. krups said

    A really practical explanation on birthday paradox is here.:

  9. Paul said

    @krups. The explanation given in your link is nice, but unfortunately the formula to calculate the probability for 23 people is not correct. The correct formula can be found on Wikipedia.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

Join 615 other followers

%d bloggers like this: