
Running a Web Crawler in a Docker Container
13 Sep 2018
Introduction
A website may have hundreds, thousands, or even millions of public-facing pages. When we are responsible for maintaining such a website, it’s impractical to traverse it manually looking for broken links. We need an automated testing tool: one which can scan the whole website and log any broken links, so we can get them fixed sooner rather than later.
In this blog post, I am going to describe a web crawler project which achieves this goal easily and efficiently. The primary technologies used in this project are Scrapy and Docker.
Why Scrapy?
Scrapy is a web crawling framework which does most of the heavy lifting involved in developing a web crawler, so you can build and run one quickly and simply.
Why Docker?
Docker is a tool designed to create, deploy, and run applications by using containers. It allows us to build, deploy and run the crawler easily, regardless of the host platform.
Explore the project
The source code of this simple prototype is available on GitHub.
Project structure
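Scrapy generates most of this layout itself; the sketch below shows roughly how the finished project is organised, with the extra files (go-spider.py, requirements.txt, the Dockerfile and the logs directory) added by hand as described later in this post. The exact generated files may vary slightly with the Scrapy version.
mycrawler/
    scrapy.cfg             # project configuration file
    Dockerfile             # Docker build instructions
    requirements.txt       # Python dependencies installed into the image
    go-spider.py           # script that runs the crawler via the API (see below)
    logs/                  # log output from the crawler (LOG_FILE points here)
    mycrawler/             # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py        # project settings (USER_AGENT, DOWNLOAD_DELAY, ...)
        spiders/
            __init__.py
            demospider.py  # the spider described below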
Build the project
Please refer to the installation guide in the Scrapy documentation for how to install Scrapy. Once Scrapy is installed successfully, you can create a project by running the command:
$ scrapy startproject mycrawler
and then generate a spider named demospider from the crawl template:
$ scrapy genspider -t crawl demospider mydomain.com
A demospider.py file is created and the spider class looks like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DemospiderSpider(CrawlSpider):
    name = 'demospider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
- The DemospiderSpider class is a subclass of CrawlSpider, generated from the pre-defined crawl template.
- The allowed_domains lists the domains the spider is allowed to crawl.
- The start_urls lists the URLs where the spider starts crawling.
- The rules is a list of Rule objects, which provide a convenient mechanism for following links by defining how the site should be crawled.
- The LinkExtractor defines how links will be extracted from each crawled page.
As the goal is to navigate the whole website and yield a log of the broken pages, I have customised the demospider.py code like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DemospiderSpider(CrawlSpider):
    handle_httpstatus_list = [403, 404]
    name = 'demospider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    custom_settings = {
        'LOG_FILE': 'logs/demospider.log',
        'LOG_LEVEL': 'DEBUG'
    }

    rules = (
        Rule(
            LinkExtractor(
                tags='a',
                attrs='href',
                unique=True
            ),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        pass
- Only successful responses (those whose status codes are in the 200-300 range) are processed by default. As I also need to check for response codes outside that range, I can specify the response codes in handle_httpstatus_list.
- The LOG_FILE and LOG_LEVEL defined in custom_settings are used for logging output to a file located in the logs directory.
- The LinkExtractor tells the crawler to look for links from the href attribute of all of the ‘a’ tags in the page.
- The follow=True specifies that links should continue to be followed from each response extracted by this rule, so the crawl covers the whole site.
Some websites have implemented ways to restrict bots from crawling. To avoid getting banned, there are a few tips in the Common Practices section of the Scrapy documentation. In this project, I limit the number of concurrent requests and specify a minimum delay between requests. I set the following configuration in the settings.py file to make the crawler behave more like a human visitor:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) Safari/537.36'
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1
- DOWNLOAD_DELAY can be used to throttle the crawling speed to avoid hitting servers too hard. It is two seconds in this case.
- CONCURRENT_REQUESTS_PER_DOMAIN defines the maximum number of concurrent requests that will be performed to any single domain.
Run the crawler
There are two ways of running the crawler in Scrapy. It can be run from the command line using
$ scrapy crawl demospider
or via the API from a script.
We can run a single crawler in a script (go-spider.py) using the API like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from mycrawler.spiders.demospider import DemospiderSpider

process = CrawlerProcess(get_project_settings())
process.crawl(DemospiderSpider)
process.start()
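If you want to try the script outside Docker first, it can be run directly with Python. This assumes go-spider.py lives in the project root, next to scrapy.cfg, so that get_project_settings() can locate the project settings:
$ python3 go-spider.py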
When the crawling is complete, you can inspect the log file to learn if any broken links were found.
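Broken links appear in the log as crawled responses with a 403 or 404 status. Assuming the usual Scrapy log format, which records lines such as "Crawled (404)" together with the referring page, a quick way to pull them out is:
$ grep -E 'Crawled \((403|404)\)' logs/demospider.log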
The last section of the log file shows a summary of results, including the count of each response status:
'downloader/response_status_count/200': 2000,
'downloader/response_status_count/301': 100,
'downloader/response_status_count/403': 20,
'downloader/response_status_count/404': 10,
Build and run in a Docker container
Dockerfile
# As Scrapy runs on Python, I choose the official Python 3 Docker image.
FROM python:3

# Set the working directory to /usr/src/app.
WORKDIR /usr/src/app

# Copy the file from the local host to the filesystem of the container at the working directory.
COPY requirements.txt ./

# Install Scrapy specified in requirements.txt.
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the project source code from the local host to the filesystem of the container at the working directory.
COPY . .

# Run the crawler when the container launches.
CMD [ "python3", "./go-spider.py" ]
Note: In this sample code, the project source code is copied from the local host into the container. You can also put the source code into a Git repository and then git-clone it into the container.
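For completeness, the requirements.txt referenced by the Dockerfile only needs to declare Scrapy; pinning a version keeps the image build reproducible. The version below is just an illustrative choice:
Scrapy==1.5.1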
Build
Run the build command to create the Docker image, which is tagged as mycrawler so it has a friendly name.
$ docker build -t mycrawler .
Run
Run the container to start crawling.
$ docker run mycrawler
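Note that the log file is written inside the container (under /usr/src/app/logs, since the working directory is /usr/src/app). If you want to keep the log on the host, one option is to mount a host directory over that path when starting the container, for example:
$ docker run -v "$(pwd)/logs":/usr/src/app/logs mycrawler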
Further on
Since we have containerized the crawler, it is simple to run it on any platform, even in the cloud. Some scenarios where I feel this tool can help:
- As part of a regression test, it can verify that all the public pages are still accessible.
- In a CMS where page content is updated very frequently, it can help locate stale links on certain pages.
- When developers build new features or fix defects in the website, it can check whether any of the changes have broken existing pages.
Conclusion
Scrapy provides a simple and flexible way of creating a web crawler with minimal effort. Wrapping it in a Docker container makes building and deploying it straightforward. The combination is useful for discovering the annoying problems of missing pages and broken links so we can eliminate them.
Acknowledgements
Portions of this post are based on: