Parsing Football Records from IHSA.org with Python and BeautifulSoup

If you run the Python code below, you’ll output a little over 36k records: every high school football season that the IHSA website has a record for. The IHSA website is fantastic for historical data, but the data is not easy to manipulate or download, so I spent some time learning the Python library BeautifulSoup to force the issue. I haven’t spent much time playing with this first project in Excel yet, but here are a few things I noticed:

  • There are 2041 seasons listed in the IHSA team record section that do not list a head coach. I’m wondering how difficult it would be to send emails out to each school to see if they could help complete the set.
  • It’s now easy (and really interesting) to select a single coach and see how their whole career went across schools. That was pretty labor-intensive before.
    Coal City 1984-85 5 4 Hal Chiodo
    Coal City 1985-86 4 5 Hal Chiodo
    Lexington 1992-93 1A Q 9 3 Hal Chiodo
    Lexington 1993-94 1A Q 6 4 Hal Chiodo
    Lexington 1994-95 1A Q 2 11 3 Hal Chiodo
    Morton 1995-96 4A Q 6 4 Hal Chiodo
    Morton 1996-97 4A Q 7 3 Hal Chiodo
    Morton 1997-98 4A Q 9 3 Hal Chiodo
    Morton 1998-99 4A Q 7 3 Hal Chiodo
    Morton 1999-00 3 6 Hal Chiodo
    Morton 2000-01 4A Q 6 4 Hal Chiodo
    Morton 2001-02 5A Q 5 5 Hal Chiodo
    Morton 2002-03 5A Q 8 2 Hal Chiodo
    Morton 2003-04 5A Q 8 2 Hal Chiodo
    Morton 2004-05 5A Q 6 4 Hal Chiodo
    Morton 2005-06 5A Q 8 3 Hal Chiodo
    Morton 2006-07 4A Q 7 3 Hal Chiodo
    Morton 2007-08 5A Q 5 5 Hal Chiodo
    Highland Park 2009-10 6A Q 7 4 Hal Chiodo
    West Chicago (H.S.) 2009-10 0 9 Hal Chiodo
    Highland Park 2010-11 7A Q 5 5 Hal Chiodo
    Highland Park 2011-12 5 4 Hal Chiodo
    Highland Park 2012-13 3 6 Hal Chiodo
    Highland Park 2013-14 7A Q 7 3 Hal Chiodo
  • The earliest season on file:
    Jacksonville (Illinois School for the Deaf) 1885-86
  • Only two teams have ever won a state championship with four losses. Teams with four losses never made the playoffs before the eight-class era. It’s not surprising that both are private schools.
    Elmhurst (IC Catholic) 2008-09
    Lombard (Montini) 2009-10
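
For anyone who would rather skip Excel, here is a minimal sketch of the first two checks above in Python. It assumes each line of Output.txt is the school name followed by the eight table cells, with the season in the first cell and the head coach in the last. That column order is an assumption, not something confirmed against the IHSA pages, and the function names are made up for illustration.

```python
import csv

def parse_rows(lines):
    """Yield (school, season, coach) from comma-separated record lines.

    Assumed layout: school name, then eight cells, with the season
    first and the head coach last. Counting from the end tolerates
    commas inside the school name.
    """
    for row in csv.reader(lines):
        if len(row) < 9:
            continue  # blank or incomplete line
        school = ",".join(row[:-8])
        yield school, row[-8].strip(), row[-1].strip()

def coach_career(lines, coach):
    """All seasons credited to one coach, oldest first."""
    seasons = [r for r in parse_rows(lines) if r[2] == coach]
    return sorted(seasons, key=lambda r: r[1])

def count_missing_coach(lines):
    """Number of seasons whose coach cell is empty."""
    return sum(1 for r in parse_rows(lines) if not r[2])
```

With the script’s output on disk, passing the open file and a coach’s name to coach_career should give a list like the Hal Chiodo one above, provided the column assumption holds.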

Here’s the raw text output of the script if you’d rather just have it directly.

from bs4 import BeautifulSoup
import requests

# One IHSA summary page per letter. The letter list is kept exactly as in the
# original script (j, q, v, x, y and z are absent from it).
letters = "abcdefghiklmnoprstuw"
schoolrecordslinks = ['http://www.ihsa.org//data/fb/records/sum-' + letter + '.htm'
                      for letter in letters]

# The with block closes Output.txt even if a request fails partway through.
with open("Output.txt", "w") as text_file:
    for schoolrecordlink in schoolrecordslinks:
        result = requests.get(schoolrecordlink)
        result.raise_for_status()  # stop on an HTTP error instead of parsing an error page
        soup = BeautifulSoup(result.content, "html.parser")

        # School names are in <h3> headings and each school's seasons are in a
        # <table>; the two lists are paired up by position.
        schoolnames = soup.find_all('h3')
        schooltables = soup.find_all('table')

        for schoolname, schooltable in zip(schoolnames, schooltables):
            school = schoolname.text.strip()
            # Skip the first two class-less rows of each table.
            for tr in schooltable.find_all('tr', attrs={'class': None})[2:]:
                tds = tr.find_all('td')
                # Only rows with exactly eight cells are written out.
                if len(tds) == 8:
                    recordrow = ",".join([school] + [td.text for td in tds])
                    print(recordrow)
                    text_file.write(recordrow + "\n")
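
One caveat with joining cells by hand: a comma inside a cell (a coach listed as “Last, First”, say) would shift every column after it. A possible alternative, sketched here under the assumption that each record is already a list of strings, is to let the standard csv module do the quoting. The helper name write_records is hypothetical and not part of the script above.

```python
import csv
import io

def write_records(records, fileobj):
    """Write each record (a list of strings) as one CSV line.

    csv.writer quotes any field that contains a comma or a quote, so
    csv.reader can split the line back into the same fields.
    """
    writer = csv.writer(fileobj, lineterminator="\n")
    for record in records:
        writer.writerow(record)

# Round-trip check on an in-memory file
buffer = io.StringIO()
write_records([["School", "Season", "Last, First"]], buffer)
print(buffer.getvalue())  # the third field comes out quoted
```

The trade-off is that the output is no longer a plain comma split, so anything reading it back should use csv.reader as well.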

As satisfying as this has been, the next steps are going to take a lot more work. I want to attach a public/private indicator and enrollment numbers to as many years as possible, but they’re all on different pages, and I’m not sure I can use the school name as a key to link them (in fact, I’m pretty sure the match is going to fail quite a bit). Also, I was unaware that the concept of “football enrollment” was abandoned long ago, which might make apples-to-apples comparisons difficult. Anyway, once I have the table as complete as I can make it, I want to create some visualizations of what it means for football success to be near the class cutoff, and come up with lists of teams that beat the odds spectacularly. Later (and this is ambitious, so we’ll see), I’d like to cross-reference public health data and census data to try to spot trends. I’d also like to identify “dead” schools and, using some of the same sources, see what happened after they lost football. I’ve made a lot of assumptions over the years, and I’d like to see if any of them are backed up.
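
To get a feel for how often the school name would fail as a key, one option is to normalize names on both sides and keep a list of whatever does not match. The sketch below is only that: the normalization rules, the function names, and the shape of the lookup table (a dict from school name to whatever extra information is being attached) are all assumptions made for illustration.

```python
import re

def name_key(name):
    """Reduce a school name to a rough matching key (hypothetical rules)."""
    key = re.sub(r"[^a-z0-9]+", " ", name.lower())  # punctuation becomes spaces
    return " ".join(key.split())                    # collapse repeated spaces

def link_by_name(records, lookup):
    """Split (school, season) records into matched and unmatched lists.

    lookup maps a school name, as spelled on the other page, to the
    information to attach (for example a public/private flag).
    """
    index = {name_key(name): info for name, info in lookup.items()}
    matched, unmatched = [], []
    for school, season in records:
        info = index.get(name_key(school))
        if info is None:
            unmatched.append((school, season))
        else:
            matched.append((school, season, info))
    return matched, unmatched
```

The unmatched list is the useful part here: its length shows how badly the name key fails, and its contents show which spellings need a hand-built mapping.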
