Extracting table data with Scrapy

Submitted by olaf on 2015-04-05

Today, I wanted to get some population data per federal state. There is a government site providing such data, but only as an HTML table. I had already used Scrapy for a small project a few days ago, so I wanted to give it another try and become a bit more familiar with it.

Installation

Installation on Ubuntu is straightforward and described in detail at Ubuntu packages.
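
If you are not on Ubuntu, installing Scrapy from PyPI works just as well; a minimal alternative, assuming pip is available

pip install scrapy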

Setup

You start by defining the desired data, as shown in Scrapy at a glance. In my case, it’s the state and its associated population

import scrapy

class PopulationItem(scrapy.Item):
    state = scrapy.Field()
    population = scrapy.Field()
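
Items behave much like Python dicts, except that only declared fields can be set. A quick sketch of what that means (the values and the area field are made up for illustration)

item = PopulationItem(state=u'Bayern')
item['population'] = u'12.6'
item['area'] = 70550  # raises KeyError: 'area' is not a declared field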

Next comes the spider skeleton, which is equally simple

from scrapy import log

class PopulationSpider(scrapy.Spider):
    name = 'population'
    allowed_domains = ['example.com']
    start_urls = ('http://www.example.com/population.php',)

    def parse(self, response):
        log.msg('parse(%s)' % response.url, level=log.DEBUG)
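
As a side note, spiders also provide a log() shortcut, so self.log('parse(%s)' % response.url) does the same without importing scrapy.log yourself.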

Running the spider to see how it works is done with

scrapy runspider population.py

and shows a lot of log messages, including our log message from the parse() method

2015-04-05 15:39:13+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: scrapybot)
2015-04-05 15:39:13+0200 [scrapy] INFO: Optional features available: ssl, http11, boto, django

2015-04-05 15:39:16+0200 [population] DEBUG: Crawled (200) <GET http://www.example.com/population.php> (referer: None)
2015-04-05 15:39:16+0200 [scrapy] DEBUG: parse(http://www.example.com/population.php)
2015-04-05 15:39:16+0200 [population] INFO: Closing spider (finished)
2015-04-05 15:39:16+0200 [population] INFO: Dumping Scrapy stats:

Extracting the data

Now everything is in place, and we can figure out how to extract the table data. To get an idea of where the table sits in the DOM tree, you can analyze the web page with Firefox’s Page Inspector.

Scrapy supports selecting DOM elements with XPath. Getting the selectors right takes some time, so Scrapy’s interactive shell comes in handy: it avoids running the spider and downloading the web page over and over again. To invoke the shell, you just say

scrapy shell http://www.example.com/population.php

The shell loads the web page and shows a prompt to enter Python code. The web page is available as response, and to select parts of it, you can enter

response.xpath('//table[@id="tbl00"]')

This already gives the population table in question. But I am only interested in a small part of it, so I further restrict the selection to the table rows

response.xpath('//table[@id="tbl00"]/tbody/tr')

which shows as

[<Selector xpath='//table[@id="tbl00"]/tbody/tr' data=u'<tr><th>Baden-W\xfcrttemberg </th><td class'>,
 <Selector xpath='//table[@id="tbl00"]/tbody/tr' data=u'<tr><th>Bayern</th><td class="noWrap">70'>,

 <Selector xpath='//table[@id="tbl00"]/tbody/tr' data=u'<tr><th>Th\xfcringen</th><td class="noWrap"'>,
 <Selector xpath='//table[@id="tbl00"]/tbody/tr' data=u'<tr class="heading"><th>Deutschland</th>'>]
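
Still in the shell, you can drill into a single row to try the column selectors before putting them into the spider

rows = response.xpath('//table[@id="tbl00"]/tbody/tr')
rows[0].xpath('th/text()').extract()

which returns the state name of the first row as [u'Baden-W\xfcrttemberg '] (note the trailing whitespace in the source data).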

I can use this in the parse method to iterate over the rows of the table

    def parse(self, response):
        log.msg('parse(%s)' % response.url, level=log.DEBUG)
        rows = response.xpath('//table[@id="tbl00"]/tbody/tr')
        for row in rows:
            pass

Now there’s only a small part left: extracting the relevant columns from each row. Each row looks like <tr><th>State name</th><td>Column 2</td><td>Population</td>...</tr>. Translating this into code

for row in rows:
    item = PopulationItem()
    item['state'] = row.xpath('th/text()').extract()
    item['population'] = row.xpath('td[2]/text()').extract()
    yield item

This fills item with the contents of the first (th) and third (td[2]) column of the table row. Note that extract() returns a list of strings, so each field ends up holding a one-element list.
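
If you prefer plain strings instead of one-element lists, a small variation does it; a sketch, assuming every row really has both cells, with strip() removing the trailing whitespace seen in the shell output above

for row in rows:
    item = PopulationItem()
    item['state'] = row.xpath('th/text()').extract()[0].strip()
    item['population'] = row.xpath('td[2]/text()').extract()[0].strip()
    yield item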

We’re done. You can pick up the population data and store it in a CSV file with

scrapy runspider population.py -o population.csv
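
The export format is derived from the file extension, so the same spider can just as well write JSON

scrapy runspider population.py -o population.json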
