Debugging Scrapy by rolling your own

Submitted by olaf on 2015-07-12
Tags: scrapy python

About two weeks ago, I wanted to automate downloading some reports from a web site. Again, I used Scrapy, because I’m familiar with it by now.

But despite all of my efforts, I couldn’t get past the login page. I tried to find out why I wasn’t able to log into this web site. I tried sending proper User-Agent and Referer headers, looked for cookies, and analyzed the login form, but couldn’t find anything special.

Eventually, I gave up and decided to implement my own little web scraper.

“Fortunately”, I made the same error that prevented Scrapy from logging into the web site. But this time I found the missing piece, maybe because of the much smaller code size, the help of HttpFox, or both.

Analysis

This particular login page uses a button instead of an input control to submit the form. This is perfectly valid and works both with and without JavaScript enabled.

To submit a form with Scrapy, you create a FormRequest and return it from your Spider’s callback function:

import scrapy


class ExampleSpider(scrapy.Spider):
    # ...
    def parse(self, response):
        # Pre-fill the login form found in the response with our credentials
        req = scrapy.FormRequest.from_response(
            response,
            formdata={'username': self.account,
                      'password': self.password},
            callback=self.after_login
        )

        return req

    def after_login(self, response):
        # check login succeeded before going on
        # ...
        pass

To see how FormRequest.from_response works, we need to look into scrapy/http/request/form.py:

# get the form from the response
form = _get_form(response, formname, formnumber, formxpath)
# populate form data from the `value` attributes
formdata = _get_inputs(form, formdata, dont_click, clickdata, response)

_get_inputs is also where the submit controls are retrieved, by calling _get_clickable. And right there, in its first line

clickables = [el for el in form.xpath('.//input[@type="submit"]')]

we can see that only input controls are considered. To include submit buttons as well, the XPath query must be extended.
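To see the limitation in isolation, here is a minimal sketch that runs the same XPath expression with lxml directly, outside of Scrapy. The login form markup is a made-up example, not the markup of the site in question.

```python
import lxml.html

# Hypothetical login page that submits through a <button>, not an <input>.
LOGIN_HTML = """
<html><body>
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <button type="submit" name="login" value="1">Log in</button>
</form>
</body></html>
"""

form = lxml.html.fromstring(LOGIN_HTML).xpath('//form')[0]

# The query quoted above: only <input type="submit"> elements can match.
clickables = [el for el in form.xpath('.//input[@type="submit"]')]

# The button is in the form, but the query never looks at it.
buttons = form.xpath('.//button')

print(len(clickables), len(buttons))
```

With this form the list of clickables stays empty, so the button’s name and value never end up in the submitted form data.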

… and improvement

query = ('descendant::input[@type="submit"]'
         '|descendant::button[not(@type) or @type="submit"]')
clickables = [el for el in form.xpath(query)]

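This already enables buttons as submit controls, but doesn’t take into account missing name attributes or disabled controls. A control without a name contributes nothing to the submitted data, and a disabled control cannot be clicked, so both should be excluded: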

query = ('descendant::input[@name and not(@disabled) and @type="submit"]'
         '|descendant::button[@name and not(@disabled)'
         ' and (not(@type) or @type="submit")]')
clickables = [el for el in form.xpath(query)]
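The difference between the two queries is easiest to see side by side. The following sketch applies both to a made-up form containing a regular submit button, a disabled one, an unnamed one, and two non-submit buttons. The variable names loose and strict are mine, and the markup is only an illustration.

```python
import lxml.html

# Hypothetical form with a mix of button controls.
HTML = """
<html><body>
<form action="/login" method="post">
  <input type="text" name="username">
  <button type="submit" name="login" value="1">Log in</button>
  <button name="cancel" disabled>Cancel</button>
  <button type="submit">Unnamed</button>
  <button type="reset" name="reset">Reset</button>
  <button type="button" name="help">Help</button>
</form>
</body></html>
"""

form = lxml.html.fromstring(HTML).xpath('//form')[0]

# First version: accepts every submit-type control.
loose = ('descendant::input[@type="submit"]'
         '|descendant::button[not(@type) or @type="submit"]')

# Second version: additionally requires a name and skips disabled controls.
strict = ('descendant::input[@name and not(@disabled) and @type="submit"]'
          '|descendant::button[@name and not(@disabled)'
          ' and (not(@type) or @type="submit")]')

loose_names = [el.get('name') for el in form.xpath(loose)]
strict_names = [el.get('name') for el in form.xpath(strict)]

print(loose_names)
print(strict_names)
```

The first query also picks up the disabled and the unnamed button, while the second one keeps only the control that can actually be clicked and submitted. The reset and plain buttons are rejected by both.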

That’s all it takes to allow button controls to submit a form. There’s one final implementation-specific caveat to consider. The selected submit control is returned by

return (el.name, el.value)

However, this works only with an lxml.html.InputElement (<input>), which provides name and value properties, and not with a plain lxml.html.HtmlElement (<button>). To fix this, I changed it to

return (el.get('name'), el.get('value') or '')

which works with both input and button, and additionally handles a missing value attribute by falling back to an empty string.
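As a quick check of that last change, the sketch below applies the same expression to an input and to a button that has no value attribute. The helper name_and_value and the markup are hypothetical. They only mirror the patched return statement and are not part of Scrapy.

```python
import lxml.html

# Hypothetical form with one control of each kind; the button has no value.
HTML = """
<html><body>
<form action="/login" method="post">
  <input type="submit" name="go" value="Go">
  <button type="submit" name="login">Log in</button>
</form>
</body></html>
"""


def name_and_value(el):
    """Illustrative helper: attribute lookup works on any element class."""
    return (el.get('name'), el.get('value') or '')


form = lxml.html.fromstring(HTML).xpath('//form')[0]
submit_input = form.xpath('descendant::input')[0]
submit_button = form.xpath('descendant::button')[0]

# lxml gives <input> a dedicated class, while <button> is a generic element.
button_is_input_element = isinstance(submit_button, lxml.html.InputElement)

pairs = [name_and_value(el) for el in (submit_input, submit_button)]
print(pairs, button_is_input_element)
```

Reading the attributes through get() avoids depending on the element class, and the `or ''` fallback keeps the value a string even when the attribute is absent.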

1 Comment

Pietro on 2015-09-22 00:04:00 +0200

Thank you very much, this is a really good post
