Getting Started with Scrapy: A Python-Script-Based Web Scraper

Muhammad Fahim
6 min read · Dec 15, 2022


Hello everyone, let's build a simple web scraper using the Python web-scraping framework Scrapy.

Overview of Scrapy

Scrapy is Python's framework for large-scale web scraping. Scrapy is built around the Twisted asynchronous networking engine, which means it does not use Python's standard async/await infrastructure. Twisted is an event-driven networking engine written in Python. Its event loop, called the "reactor", dispatches outgoing requests and handles responses as they arrive, which lets Scrapy process many requests concurrently and improves its performance and efficiency.

Before going into technical details and starting to code, let's first get an overview of a few things necessary for web scraping: XPath and CSS selectors.

XPath

XPath is a query language that is used to select elements from an XML document, such as an HTML page. It provides a way to navigate through the document’s hierarchy and find the elements that match a specified set of criteria, such as an element’s attributes or its position in the document.

Here are a few basic examples of XPath expressions:

  1. Select all elements in the document: //*
  2. Select all p elements: //p
  3. Select all elements that have a class attribute: //*[@class]
  4. Select the element with id="main": //*[@id="main"]
  5. Select all p elements in the document and get their text: //p/text()

Note that these are just a few examples to illustrate the syntax of XPath. There are many more powerful and complex expressions that can be used to select specific elements from an HTML document.
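
If you want to experiment with these expressions outside the Scrapy shell, the parsel library (the selector library Scrapy itself uses) is handy. Here is a minimal sketch, assuming parsel is installed (pip install parsel) and using a made-up HTML snippet:

from parsel import Selector

html = '<div id="main"><p class="intro">Hello</p><p>World</p></div>'
sel = Selector(text=html)

sel.xpath('//*').getall()          # every element in the document
sel.xpath('//p').getall()          # all p elements
sel.xpath('//*[@class]').getall()  # elements that have a class attribute
sel.xpath('//*[@id="main"]').get() # the element with id="main"
sel.xpath('//p/text()').getall()   # ['Hello', 'World']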

CSS Selectors

Just like XPath, CSS selectors are used for selecting elements within an HTML page. CSS selectors use the syntax of the Cascading Style Sheets (CSS) language to specify the elements that you want to select. For example, you can use a CSS selector to select all <h1> elements on a page, all elements with the class title, or all elements with the ID main-content.

Here are a few examples of CSS selectors that you might use in a web scraping project:

  • h1: This selector selects all <h1> elements on the page.
  • .title: This selector selects all elements with the class title.
  • #main-content: This selector selects the element with the ID main-content.
  • a[href]: This selector selects all <a> elements that have an href attribute.
  • table tr:nth-child(2) td:nth-child(3): This selector selects the third <td> element in the second <tr> element of a <table>.

These are just a few examples of the many different CSS selectors that you can use in web scraping. To learn more about CSS selectors and how to use them, you can refer to the documentation for the web scraping library or framework that you are using.
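
The same parsel Selector also accepts CSS selectors; here is a minimal sketch mirroring the examples above, again with a made-up HTML snippet:

from parsel import Selector

html = '''
<h1>Products</h1>
<p class="title">Intro</p>
<div id="main-content"><a href="/item-1">Item 1</a></div>
'''
sel = Selector(text=html)

sel.css('h1').getall()                # all h1 elements
sel.css('.title').getall()            # all elements with class "title"
sel.css('#main-content').get()        # the element with ID main-content
sel.css('a[href]::attr(href)').get()  # '/item-1'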

TIP: In web browsers, CSS selectors are generally faster than XPath expressions for selecting elements, because modern browsers such as Google Chrome and Mozilla Firefox use heavily optimized engines to evaluate them. Inside Scrapy, however, CSS selectors are translated into XPath under the hood, so the performance difference there is negligible.

Scrapy Shell

Scrapy comes with an interactive shell, a command-line tool that allows us to test our scraping code in a live environment before running a full-fledged spider.

To run the Scrapy shell, we need to enter this command in our command prompt or terminal:

scrapy shell

After entering the above command, Scrapy starts an interactive shell session in the terminal.

Now we need to give the URL of the website we want to scrape and test in the Scrapy shell. The command for that is below:

fetch("URL")

In our case, we will scrape this website: https://beutlich.com/products/, so the command will look like this:

fetch("https://beutlich.com/products/")

Note the response code is 200, which means we successfully fetched the entered URL.
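
Inside the shell, the fetched page is exposed as a response object, so we can double-check the status ourselves:

response.status   # 200 means the request succeeded
response.url      # the URL we just fetched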

After fetching the URL, we now need to scrape the available product URLs. If we inspect the page, the product URLs are found in anchor tags, each nested inside an article tag.

so we can construct an XPath like this: //article[contains(@class, 'portfolio-item count')]/a/@href

In our Scrapy shell, we will write a command to extract the URLs:

response.xpath("//article[contains(@class, 'portfolio-item count')]/a/@href").extract()

This will return a list of the URLs of the different available products.

Now that we can get the product URLs, we can finally scrape the product information.

Next we will send another request to each scraped URL and get its response. In our Python code, we will write a separate function for this and pass the response of each URL to that function one by one.

But for now, we will test our code on one URL in the Scrapy shell.

Now we want to scrape the product title, which sits inside an h1 tag. The h1 tag has no class, but we can reach it through the ID of its parent header tag. So the XPath would look like this:

//header[@id='page-heading']/h1/text()

and the Scrapy command:

response.xpath("//header[@id='page-heading']/h1/text()").extract_first()

Note: It's important to note that extract() returns a list of all matching results, whereas extract_first() returns a string with just the first result (or None if nothing matches).
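
A quick shell comparison makes the difference clear; note that newer Scrapy versions prefer the aliases get() and getall():

titles = response.xpath("//header[@id='page-heading']/h1/text()")
titles.extract()        # a list of every matching text node
titles.extract_first()  # just the first match as a string, or None
titles.getall()         # modern alias for extract()
titles.get()            # modern alias for extract_first()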

To scrape the product SKU, the Scrapy code would look like this:

response.css(".entry ul li:nth-child(8)::text").get().split(": ")[1]

To scrape the product description, the Scrapy code would look like this:

response.css(".entry ::text").getall()

Note the space between the class name and the double colon: .entry ::text selects the text of every descendant of the .entry element, not just the text directly inside it, which is what we want here.
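
To make the difference concrete, compare the two variants (the exact fragments returned depend on the page markup):

response.css(".entry::text").getall()   # only text nodes directly inside .entry
response.css(".entry ::text").getall()  # text nodes from all descendants of .entry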

To scrape the image, the Scrapy code would look like this:

response.css(".slide>a::attr(href)").get()

Now let's move to the Python code.

First we import scrapy and the CrawlerProcess class:

import scrapy 
from scrapy.crawler import CrawlerProcess

Now we will define our scraper class, and inside it we will set some class variables.

class beutlich_scraper(scrapy.Spider):
    # every Scrapy spider needs a name attribute
    name = 'beutlich_scraper'

    custom_settings = {
        'DOWNLOAD_DELAY': 0.25,
        'RETRY_TIMES': 10,
        # export as CSV format
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'Beutlich-data.csv',
        # the correct setting name for ignoring robots.txt
        'ROBOTSTXT_OBEY': False,
    }

    start_urls = ['https://beutlich.com/products/']

custom_settings: defines some custom settings applied while sending requests to the URLs.

start_urls: the list containing our starting URL.

Now we will define the first method inside our scraper class, which must be named parse() (Scrapy requires it by default and calls it with the response of each start URL).

    def parse(self, response):
        links = response.xpath("//article[contains(@class, 'portfolio-item count')]/a/@href").extract()

        for link in links:
            yield scrapy.Request(link, callback=self.parse_product)

This function scrapes the product URLs, and through the loop we pass each one to another function that scrapes the product information.
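
As a side note, scrapy.Request needs absolute URLs; if a site returns relative links, response.follow() is a convenient alternative because it resolves them against the current page. A sketch of the same loop using it:

    def parse(self, response):
        links = response.xpath("//article[contains(@class, 'portfolio-item count')]/a/@href").extract()
        for link in links:
            # response.follow resolves relative URLs automatically
            yield response.follow(link, callback=self.parse_product)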

Inside our parse_product() function, we scrape the required information and save it inside a dictionary:

    def parse_product(self, response):
        data_dict = {}
        # ::text so we get the heading text, not the raw <h1> tag
        data_dict['Product Title'] = response.css("#page-heading>h1::text").get()
        data_dict['Seller SKU'] = response.css(".entry ul li:nth-child(8)::text").get().split(": ")[1]
        data_dict['Description'] = response.css(".entry ::text").getall()
        data_dict['Image URL'] = response.css(".slide>a::attr(href)").get()
        yield data_dict
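
Since getall() returns a list of raw text fragments, it can be worth cleaning the description before yielding it. A small optional sketch:

        # strip whitespace, drop empty fragments, then join into one string
        parts = response.css(".entry ::text").getall()
        data_dict['Description'] = " ".join(p.strip() for p in parts if p.strip())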

Then we need to initialize the CrawlerProcess class and pass our scraper class as an argument.

process = CrawlerProcess()
process.crawl(beutlich_scraper)
process.start()

When you run this code, the scraper will scrape the information and save the data in a CSV file.
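
To run the scraper, save the whole script in one file and execute it with Python (the filename here is just an example):

python beutlich_scraper.py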

That's the end of the blog.

Please share this blog with your friends.

Please follow me on Twitter: https://twitter.com/faheem2920
