Scrapy Notes

Starting a project

scrapy startproject <project_name>

This creates a file structure like so:

├── scrapy.cfg  
└── <project_name>  
	├── __init__.py  
	├── items.py  
	├── middlewares.py  
	├── pipelines.py  
	├── settings.py  
	└── spiders  
		└── __init__.py
  • settings.py is where project settings live, such as activating pipelines and middlewares. You can also change delays, concurrency, and much more.
  • items.py is a model for the extracted data. You can define a custom model (like a ProductItem) that inherits from Scrapy's Item class and holds your scraped data (see the sketch after this list).
  • pipelines.py is where items yielded by the spider get passed; it is mostly used to clean the text and to write items out to files or databases (CSV, JSON, SQL, etc.).
  • middlewares.py is useful when you want to modify how requests are made and how Scrapy handles responses.
  • scrapy.cfg is a configuration file for changing some deployment settings, etc.
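
As a rough sketch of how items.py and pipelines.py work together (the ProductItem fields, the pipeline name, and the price-cleaning logic are all assumptions, not part of the generated project):

# items.py - a hypothetical model for scraped product data
import scrapy

class ProductItem(scrapy.Item):
	name = scrapy.Field()
	price = scrapy.Field()
	url = scrapy.Field()

# pipelines.py - a hypothetical pipeline that cleans each item
class PriceCleanerPipeline:
	def process_item(self, item, spider):
		# Strip the currency symbol and cast the price to a float
		item['price'] = float(item['price'].replace('$', '').strip())
		return item

To activate the pipeline, register it under ITEM_PIPELINES in settings.py, e.g. ITEM_PIPELINES = {'<project_name>.pipelines.PriceCleanerPipeline': 300}.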

Spiders

Scrapy provides several different spider types. Some of the most common ones:

  • Spider - Takes a list of start_urls and scrapes each one with a parse method.
  • CrawlSpider - Designed to crawl a full website by following any links it finds (see the sketch after this list).
  • SitemapSpider - Designed to extract URLs from a sitemap.
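
As a minimal CrawlSpider sketch (the domain, spider name, and parse_page callback are all assumptions), this follows every internal link it finds and yields each page's URL and title:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
	name = 'site'
	allowed_domains = ['example.com']  # hypothetical domain
	start_urls = ['https://example.com/']

	# Follow every link on the allowed domain; parse each page it lands on
	rules = (
		Rule(LinkExtractor(), callback='parse_page', follow=True),
	)

	def parse_page(self, response):
		yield {'url': response.url, 'title': response.css('title::text').get()}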

To create a new generic spider, run:

scrapy genspider <name_of_spider> <website>
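
For example, to scaffold a spider named quotes for the practice site quotes.toscrape.com (an assumed target, purely for illustration):

scrapy genspider quotes quotes.toscrape.com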

This adds a new spider to your spiders folder, and it should look like this:

import scrapy  
  
class NAMEOFSPIDERSpider(scrapy.Spider):  
	name = 'NAMEOFSPIDER'  
	allowed_domains = ['website']  
	start_urls = ['website']  
  
	def parse(self, response):  
		pass

This spider class contains:

  • name - an attribute that gives a name to the spider. We will use this when running our spider.
  • allowed_domains - an attribute that tells Scrapy to only ever scrape pages on the <website> domain. This stops the spider from wandering off and scraping other sites. It is optional.
  • start_urls - an attribute that tells Scrapy the first URL it should scrape.
  • parse - this function is called after a response has been received from the target website.

To start using this Spider we will have to:

  1. Change the start_urls to the URL we want to scrape
  2. Insert our parsing code into the parse function (see the sketch below)
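
Putting both steps together, a minimal sketch against the practice site quotes.toscrape.com (the site and the CSS selectors are assumptions about its markup):

import scrapy

class QuotesSpider(scrapy.Spider):
	name = 'quotes'
	allowed_domains = ['quotes.toscrape.com']
	start_urls = ['https://quotes.toscrape.com/']

	def parse(self, response):
		# Each quote on the page sits inside a div with class "quote"
		for quote in response.css('div.quote'):
			yield {
				'text': quote.css('span.text::text').get(),
				'author': quote.css('small.author::text').get(),
			}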

You run a spider with:

scrapy crawl <name_of_spider>
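
To save the scraped items as you crawl, pass an output flag; for example, assuming the quotes spider above and Scrapy 2.1+ (-O overwrites the file, -o appends):

scrapy crawl quotes -O quotes.json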

Scrapy Shell

scrapy shell

If we run

fetch(<start_url>)

we should see a 200 response in the logs. Scrapy saves the HTML of the page in an object called response.
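
For example, using the practice site quotes.toscrape.com as an assumed target:

fetch('https://quotes.toscrape.com/')

response.status  # 200 if the request succeeded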

You can get a list of elements matching a CSS selector by running

response.css("<selector>")

To just get the first matching element run

response.css("<selector>").get()

This returns the full HTML of the first matching node of the DOM tree, as a string.
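
Continuing the shell session above (these selectors assume the quotes.toscrape.com markup):

response.css('title')                      # SelectorList of matching elements
response.css('title').get()                # '<title>Quotes to Scrape</title>'
response.css('title::text').get()          # just the text: 'Quotes to Scrape'
response.css('span.text::text').getall()   # list of every quote's text on the page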

CSS Selectors

Tags: Programming