Run Scrapy inside of a docker container.

December 15, 2020

Scrapy is a Python library that allows you to scrape and interact with content on the web. Until you get to its more advanced usage, however, it is meant to be run from your own machine. Running scrapy inside of a docker container lets you run your spiders in the same environment on every machine and makes it easy to import your scraped data into any project.

Project setup

To begin, we will create the directory, the git repository and the base Dockerfile.

mkdir docker-scrapy && cd docker-scrapy
git init
touch .gitignore
echo "__pycache__" >> .gitignore
vim Dockerfile

Docker image creation

Scrapy needs the full python docker image rather than an alpine image. We install scrapy and then copy the contents of the current directory into the image's /app folder. The docker run commands later in this post repeat the --volume argument. These are bind mounts, which share your local files with the docker container.

The entrypoint of the image is the scrapy cli.

FROM python:latest
RUN pip install scrapy
WORKDIR /app
COPY . .
ENTRYPOINT ["scrapy"]

To build the image and generate the scrapy project, we mount the current directory at /app/scraper, which is where scrapy will generate the project. This way the generated files land directly in our working directory instead of one level down in a scraper subfolder.

docker build -t scraper .
docker run --volume $(pwd):/app/scraper scraper startproject scraper

We have now generated the scrapy project in our working directory.
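To see why mounting the working directory at /app/scraper puts the project files directly in it, the path mapping can be sketched in Python. This only illustrates how a bind mount translates container paths to host paths. The function name to_host_path and the host directory /home/tom/docker-scrapy are assumptions made for the example, not part of docker or scrapy.

```python
from pathlib import PurePosixPath


def to_host_path(container_path, mount_source, mount_target):
    """Translate a container path to the host path a bind mount maps it to.

    Returns None when the path lies outside the mount, since such files
    stay inside the container and never reach the host.
    """
    path = PurePosixPath(container_path)
    target = PurePosixPath(mount_target)
    try:
        relative = path.relative_to(target)
    except ValueError:
        return None
    return str(PurePosixPath(mount_source) / relative)


# Hypothetical host directory standing in for $(pwd).
host_dir = "/home/tom/docker-scrapy"

# `scrapy startproject scraper` runs in /app and writes below /app/scraper.
print(to_host_path("/app/scraper/scrapy.cfg", host_dir, "/app/scraper"))
print(to_host_path("/app/scraper/scraper/settings.py", host_dir, "/app/scraper"))
```

Under these assumptions scrapy.cfg ends up at the top of the working directory, next to the Dockerfile, which is the layout the later commands rely on.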

Create the spider

To create a spider, run the command below. Notice that the volume is different than in the last command, since the scrapy project now lives in the current directory rather than in a subfolder.

docker run --volume $(pwd):/app scraper genspider quotes quotes.toscrape.com

Since the files were generated by docker, you may need to fix the permissions on them.

ls -l
total 12
-rw-rw-r-- 1 tom  tom    86 Dec 26 08:49 Dockerfile
drwxr-xr-x 4 root root 4096 Dec 26 08:53 scraper
-rw-r--r-- 1 root root  257 Dec 26 08:52 scrapy.cfg

sudo chown -R $(whoami) scraper scrapy.cfg
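If you want to check which files still belong to another user before or after running chown, a short Python sketch can list them. The helper name files_not_owned_by is made up for this example, and it assumes a Unix-like system where files carry a numeric owner id.

```python
import os


def files_not_owned_by(root, uid):
    """Return the paths under root whose owner differs from uid."""
    foreign = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself rather than the file it points to.
            if os.lstat(path).st_uid != uid:
                foreign.append(path)
    return sorted(foreign)
```

Calling files_not_owned_by(".", os.getuid()) from the project directory should return an empty list once the ownership has been fixed.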

Run the scraper through docker

You should now be able to rebuild the image and run the scraper using these commands.

docker build -t scraper .
docker run -it scraper crawl quotes

You should see scrapy output similar to the following.

docker run scraper crawl quotes
2020-12-26 14:51:58 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scraper)
...
2020-12-26 14:51:58 [scrapy.core.engine] INFO: Spider opened
2020-12-26 14:51:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-26 14:51:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-26 14:51:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-12-26 14:51:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2020-12-26 14:51:58 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-26 14:51:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2557,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.34936,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 12, 26, 14, 51, 58, 690933),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 57229312,
 'memusage/startup': 57229312,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 12, 26, 14, 51, 58, 341573)}
2020-12-26 14:51:58 [scrapy.core.engine] INFO: Spider closed (finished)
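The DEBUG lines in this log are the quickest way to see which URLs were fetched and with what status. As a small sketch, assuming the log format shown above, they can be pulled out with a regular expression. The function name crawled_responses is illustrative only.

```python
import re

# Matches the request part of lines such as:
# ... DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
CRAWLED = re.compile(r"Crawled \((\d{3})\) <(\w+) (\S+)>")


def crawled_responses(log_text):
    """Return a (status, method, url) tuple for each 'Crawled' log line."""
    return [
        (int(status), method, url)
        for status, method, url in CRAWLED.findall(log_text)
    ]


sample = """\
2020-12-26 14:51:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-12-26 14:51:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
"""
for status, method, url in crawled_responses(sample):
    print(status, method, url)
```

In this run the only 404 is for robots.txt, and the start page itself came back with a 200.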

Writing a basic scraper

To scrape the quotes, let's modify the quotes spider as follows:

# scraper/spiders/quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

And now run

docker run -it -v $(pwd):/app scraper crawl quotes -O output.json

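The -O argument tells scrapy to write the scraped items to output.json, overwriting the file if it already exists (lowercase -o appends to it instead). Because of the bind mount, the file appears in your current directory.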
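If you would rather not pass -O on every run, the same export can be configured through scrapy's FEEDS setting. This is a minimal sketch, assuming Scrapy 2.4 or later (the log earlier in this post shows 2.4.1) and the settings.py file generated by startproject.

```python
# scraper/settings.py (excerpt)
# Write scraped items to output.json as JSON, replacing the file on each run.
FEEDS = {
    "output.json": {
        "format": "json",
        "overwrite": True,
    },
}
```

With this in place, a plain crawl quotes writes the file, provided the project directory is still mounted with -v so the container sees the updated settings.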

After running your command you should be able to see your new json file!

[
  {
    "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
  },
  {
    "text": "\u201cA day without sunshine is like, you know, night.\u201d",
    "author": "Steve Martin",
    "tags": ["humor", "obvious", "simile"]
  }
]
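Since the output is plain JSON, pulling it into another project only takes the standard library. The sketch below assumes the file layout shown above. The helper names load_quotes and count_tags are made up for this example.

```python
import json
from collections import Counter


def load_quotes(path="output.json"):
    """Load the list of items written by `scrapy crawl quotes -O output.json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_tags(quotes):
    """Count how often each tag appears across all quotes."""
    return Counter(tag for quote in quotes for tag in quote["tags"])
```

For example, count_tags(load_quotes()).most_common(3) would list the three most frequent tags in the file.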

In this post you have learned how to create a simple scrapy spider inside of a docker container for your next project.

You can view the full code for the project here