Introduction

A website may have hundreds, thousands, or even millions of public-facing pages. When we are responsible for maintaining such a website, it's impractical to traverse it manually looking for broken links. We need an automated testing tool: one which can scan the whole website and log any broken links, so we can get them fixed sooner rather than later.

In this post, I am going to describe a web crawler project which can achieve this goal easily and efficiently. The primary technologies used in this project are Scrapy and Docker.

Why Scrapy?

Scrapy is a web crawling framework which does most of the heavy lifting in developing a web crawler. You can build and run the web crawler in a fast and simple way.

Why Docker?

Docker is a tool designed to create, deploy, and run applications by using containers. It allows us to build, deploy, and run the crawler in the same way regardless of the host platform.

Explore the project

The source code of the simple prototype is available on GitHub.

Project structure

[Figure: Scrapy project structure]

Build the project

Please refer to the installation guide of the Scrapy documentation for how to install Scrapy. Once Scrapy is installed successfully, you can create a project by running the command:

$ scrapy startproject mycrawler

and then create the spider named demospider by:

$ scrapy genspider -t crawl demospider mydomain.com

A demospider.py file is created and the spider class is like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DemospiderSpider(CrawlSpider):
  name = 'demospider'
  allowed_domains = ['mydomain.com']
  start_urls = ['http://mydomain.com/']

  rules = (
    Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
  )

  def parse_item(self, response):
    i = {}
    #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
    #i['name'] = response.xpath('//div[@id="name"]').extract()
    #i['description'] = response.xpath('//div[@id="description"]').extract()
    return i

  • DemospiderSpider is a subclass of CrawlSpider, generated from the pre-defined crawl template (the -t crawl option in the command above).
  • The allowed_domains defines which domain the spider is allowed to crawl.
  • The start_urls is where the spider should start the crawl.
  • The rules is a list of Rule objects which provides a convenient mechanism for following links by defining a certain behaviour for crawling the site.
  • The LinkExtractor defines how links will be extracted from each crawled page.
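
To make the two filters above concrete, here is a minimal sketch in plain Python (no Scrapy required) of how allowed_domains and the allow pattern decide whether a link is followed. It is a simplified model, not Scrapy's actual implementation; the helper names (is_on_allowed_domain, matches_allow_pattern, would_follow) and the sample URLs are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Values copied from the generated spider above.
ALLOWED_DOMAINS = ['mydomain.com']
ALLOW_PATTERN = re.compile(r'Items/')


def is_on_allowed_domain(url, allowed_domains=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


def matches_allow_pattern(url, pattern=ALLOW_PATTERN):
    """True if the pattern occurs anywhere in the URL (a search, not a full match)."""
    return pattern.search(url) is not None


def would_follow(url):
    """Hypothetical helper: both filters must pass for the link to be followed."""
    return is_on_allowed_domain(url) and matches_allow_pattern(url)


# Hypothetical sample URLs.
for url in [
    'http://mydomain.com/Items/42',
    'http://shop.mydomain.com/Items/',
    'http://mydomain.com/about',
    'http://notmydomain.com/Items/1',
]:
    print(url, would_follow(url))
```

With the generated rule, only URLs containing Items/ are followed, which is why the customised spider below drops the allow pattern: a broken-link check needs to visit every page.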

As the goal is to navigate the whole website and yield a log of the broken pages, I have customised the demospider.py code like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DemospiderSpider(CrawlSpider):
  handle_httpstatus_list = [403, 404]
  name = 'demospider'
  allowed_domains = ['mydomain.com']
  start_urls = ['http://mydomain.com/']
  custom_settings = {
    'LOG_FILE': 'logs/demospider.log',
    'LOG_LEVEL': 'DEBUG'
  }

  rules = (
    Rule(
      LinkExtractor(
        tags='a',
        attrs='href',
        unique=True
      ),
      callback='parse_item',
      follow=True
    ),
  )

  def parse_item(self, response):
    pass

  • By default, only successful responses (those whose status codes are in the 200-299 range) are passed to the spider. As I also need to see responses outside that range, I list the extra status codes in handle_httpstatus_list.
  • The LOG_FILE and LOG_LEVEL defined in custom_settings are used for logging output to a file located in the logs directory.
  • The LinkExtractor tells the crawler to look for links from the href attribute of all of the ‘a’ tags in the page.
  • follow=True tells the crawler to keep following the links extracted from every page that this rule matches, so the crawl continues until no new links are found.
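
Because parse_item does nothing here, the crawl only produces status counts and debug log lines. If you also want a record of which page is broken and where it was linked from, parse_item could build one. The following is a minimal sketch, not part of the original project: the helper name broken_link_record is hypothetical, and SimpleNamespace is used only as a stand-in so the snippet runs without Scrapy. It assumes the response object exposes status, url, and request.headers, and that a Referer header is present on the request.

```python
from types import SimpleNamespace

# Mirrors handle_httpstatus_list in the spider above.
BROKEN_STATUSES = {403, 404}


def broken_link_record(response):
    """Return a dict describing a broken page, or None if the page is fine."""
    if response.status not in BROKEN_STATUSES:
        return None
    referer = response.request.headers.get('Referer')
    # Header values may arrive as bytes, so decode them for readable output.
    if isinstance(referer, bytes):
        referer = referer.decode('utf-8', errors='replace')
    return {'url': response.url, 'status': response.status, 'referer': referer}


# Stand-in object for illustration only; in the spider this would be the
# response passed to parse_item.
fake_response = SimpleNamespace(
    status=404,
    url='http://mydomain.com/missing',
    request=SimpleNamespace(headers={'Referer': b'http://mydomain.com/'}),
)
print(broken_link_record(fake_response))
```

Inside the spider, parse_item could call this helper and write any non-empty record to the log with self.logger.warning, so broken links stand out from the DEBUG output.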

Some websites have implemented ways to restrict bots from crawling. To avoid getting banned, there are a few tips in the Common Practices section of the Scrapy documentation. In this project, I limit the number of concurrent requests and specify a minimum time between requests. I set the following configurations in the settings.py file to make the crawler behave more like a human visitor:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) Safari/537.36'
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1

  • DOWNLOAD_DELAY can be used to throttle the crawling speed to avoid hitting servers too hard. It is two seconds in this case.
  • CONCURRENT_REQUESTS_PER_DOMAIN defines the maximum number of concurrent requests that will be performed to any single domain.
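
These two settings trade speed for politeness, so it helps to estimate how long a crawl will take. The sketch below is a rough back-of-the-envelope calculation, assuming requests go out one at a time with the full delay between them and ignoring the server's response time; the function name rough_crawl_time and the request count are illustrative assumptions.

```python
from datetime import timedelta


def rough_crawl_time(num_requests, download_delay=2):
    """Rough crawl duration: one request at a time, download_delay seconds apart.

    Ignores network latency and server response time, so a real crawl
    will usually take longer.
    """
    return timedelta(seconds=num_requests * download_delay)


# Hypothetical example: a site that needs 2130 requests in total.
print(rough_crawl_time(2130))  # 1:11:00
```

A site of a couple of thousand pages therefore takes over an hour at these settings, which is worth keeping in mind when scheduling the crawl.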

Run the crawler

There are two ways of running the crawler in Scrapy. It can be run from the command line using $ scrapy crawl demospider, or via the API from a script.

We can run a single crawler in a script (go-spider.py) using the API like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from mycrawler.spiders.demospider import DemospiderSpider


process = CrawlerProcess(get_project_settings())
process.crawl(DemospiderSpider)
process.start()

When the crawling is complete, you can inspect the log file to learn if any broken links were found.

The last section of the log file shows a summary of results, including the count of each response status:

'downloader/response_status_count/200': 2000,
'downloader/response_status_count/301': 100,
'downloader/response_status_count/403': 20,
'downloader/response_status_count/404': 10,
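
Reading these counts by eye is fine for one run, but a short script can pull them out of the log automatically. This is a minimal sketch that assumes the summary lines look exactly like the ones above; the function names status_counts and broken_total are hypothetical, and the sample text stands in for the contents of logs/demospider.log.

```python
import re

# Matches summary lines such as: 'downloader/response_status_count/404': 10,
STATUS_LINE = re.compile(r"'downloader/response_status_count/(\d{3})':\s*(\d+)")


def status_counts(log_text):
    """Map each HTTP status code found in the summary to its count.

    If the log holds several summaries, the last one wins.
    """
    return {int(code): int(count) for code, count in STATUS_LINE.findall(log_text)}


def broken_total(counts, broken=(403, 404)):
    """Total number of responses whose status is considered broken."""
    return sum(counts.get(code, 0) for code in broken)


# Sample text standing in for the real log file contents.
sample_log = """
'downloader/response_status_count/200': 2000,
'downloader/response_status_count/301': 100,
'downloader/response_status_count/403': 20,
'downloader/response_status_count/404': 10,
"""

counts = status_counts(sample_log)
print(counts)                # {200: 2000, 301: 100, 403: 20, 404: 10}
print(broken_total(counts))  # 30
```

A non-zero total could be used to fail a scheduled job, so broken links are noticed without anyone opening the log.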

Build and run in Docker container

Dockerfile

# As Scrapy runs on Python, I choose the official Python 3 Docker image.
FROM python:3

# Set the working directory to /usr/src/app.
WORKDIR /usr/src/app

# Copy the file from the local host to the filesystem of the container at the working directory.
COPY requirements.txt ./

# Install Scrapy specified in requirements.txt.
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the project source code from the local host to the filesystem of the container at the working directory.
COPY . .

# Run the crawler when the container launches.
CMD [ "python3", "./go-spider.py" ]

Note: In this sample code, the project source code is copied from localhost into the container. You can also put the source code into a Git repository and then git-clone it into the container.

Build

Run the build command to create the Docker image which is tagged as mycrawler so it has a friendly name.

$ docker build -t mycrawler .

Run

Run the container to start crawling.

$ docker run mycrawler

Further on

Since we have containerized the crawler, it is simple to run it on any platform, including in the cloud. Some of the scenarios I feel this tool can help with:

  • As part of a regression test, it will find out whether all the public pages are accessible.
  • In a CMS where the page content is updated very frequently, it can help locate any stale links in certain pages.
  • When developers build new features or fix defects in the website, it can check if any of the functionality changes affect existing pages.

Conclusion

Scrapy provides a simple and flexible way of creating a web crawler with minimal effort. Wrapping it in a Docker container makes the build and deployment quite handy. The combination can be useful in discovering the annoying problems of missing pages and broken links so we can eliminate them.

Acknowledgements

Portions of this post are based on:
