Secrets of Web Scraping: Leveraging Hidden APIs to Capture Dynamic Data.

Muhammad Fahim
6 min read · Mar 28, 2023


In the first article we covered the basics of web scraping and built a basic web scraper with Scrapy.

In this article we will learn how to leverage hidden APIs to capture dynamically loaded data.

I have seen a lot of people who are quite good at web scraping but don't know how to capture API calls in the Network tab and scrape dynamically loaded data just like static data.

Because they do not know how to capture these API calls, they tend to reach for libraries/frameworks such as Selenium or Playwright. These can get the work done, but they come at a heavy price: first of all, they were not designed for web scraping; they were created for website testing.

Other than this, there are other drawbacks:

  • Slower: Selenium and Playwright are slower than Scrapy because they were designed to simulate user interaction with a website, such as clicking buttons or filling out forms. This makes them more resource-intensive and time-consuming than Scrapy, which focuses on parsing HTML (see the timing sketch after this list).
  • Requires a browser: Selenium and Playwright require a web browser to be installed on the system.
  • Limited scalability: Because Selenium is browser-based, it is not ideal for large-scale web scraping. Running multiple instances on a single machine consumes a lot of resources and slows down the scraping process.
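
To make the first point concrete, here is a minimal timing sketch. It assumes Chrome and a matching driver are installed for Selenium, and https://example.com is just a stand-in URL:

import time

import requests
from selenium import webdriver

URL = "https://example.com"  # stand-in target; replace with a real page

# Plain HTTP request: one round trip, no browser process
start = time.perf_counter()
html = requests.get(URL, timeout=10).text
print(f"requests: {time.perf_counter() - start:.2f}s, {len(html)} bytes")

# Browser automation: launches a full Chrome instance just to fetch one page
start = time.perf_counter()
driver = webdriver.Chrome()  # assumes Chrome (and a driver) are installed
driver.get(URL)
html = driver.page_source
driver.quit()
print(f"selenium: {time.perf_counter() - start:.2f}s, {len(html)} bytes")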

Before learning to capture API calls, we first need to learn how to check whether data is dynamically loaded, and the different ways data can be dynamically loaded into a website.

What makes a page “dynamic”?

Pages that deliver all of their content as soon as you open them are called static pages. Pages that return some data on first load but render more of it “dynamically” (by updating the DOM) based on certain actions, like scrolling down, clicking an element, or hovering over an element, are called dynamic pages.

There are several ways to dynamically load data into a website. A few of them are:

AJAX: Asynchronous JavaScript and XML (AJAX) is a technique used to load data asynchronously without reloading the entire page. This technique is commonly used to load new content when the user interacts with the page, such as scrolling, clicking a button, or typing in a search box.
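
For a scraper, the practical upshot is that the AJAX call usually returns plain JSON that can be requested directly. A minimal sketch of the idea (the /api/products endpoint and its parameters are hypothetical):

import requests

# Hypothetical AJAX endpoint; the real URL comes from the Network tab
resp = requests.get(
    "https://example.com/api/products",
    params={"page": 2, "sort": "price"},
)
resp.raise_for_status()

# AJAX endpoints typically respond with structured JSON, not HTML
for item in resp.json().get("products", []):
    print(item.get("name"), item.get("price"))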

Dynamic IFrames: An iframe is an HTML element that allows a webpage to embed another webpage within itself. Dynamic iframes can be used to load new content or data from a different webpage without reloading the entire page.
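
For scraping purposes, an iframe's content can usually be fetched directly by pulling its src attribute out of the parent page. A rough sketch using parsel (Scrapy's selector library); the URL is a placeholder:

from urllib.parse import urljoin

import requests
from parsel import Selector

parent_url = "https://example.com/page-with-iframe"  # placeholder URL
html = requests.get(parent_url).text

# Extract every iframe src from the parent page and request it directly
for src in Selector(text=html).css("iframe::attr(src)").getall():
    frame_url = urljoin(parent_url, src)  # resolve relative URLs
    frame_html = requests.get(frame_url).text
    print(frame_url, len(frame_html))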

DOM Manipulation: JavaScript can be used to dynamically modify the Document Object Model (DOM) of a webpage, allowing new content or data to be loaded and displayed on the page without reloading the entire page.

Others include WebSockets and Server-Sent Events (SSE).

How to identify whether a page is dynamically loaded?

To identify whether a page is dynamically loaded, first open the developer tools (right-click on the page and click Inspect, or simply press F12). Then press Ctrl + Shift + P; a command popup will appear. Type "Disable JavaScript" and click the first option.

After that, hard-refresh the page by pressing Ctrl + Shift + R. Any JavaScript-loaded data will disappear from the page, which means that data was dynamically loaded.

P.S.: Either all of the data on the page or only some of it can be dynamically loaded.

To enable JavaScript again, type "Enable JavaScript" in the same popup and hard-refresh the page; the data will reappear.
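
The same check can also be done programmatically: requests never executes JavaScript, so if some text is visible in the browser but missing from the raw HTML, it is being loaded dynamically. A small sketch (the URL and the search text are placeholders):

import requests

url = "https://example.com/"  # placeholder: page to test
needle = "Product name"       # placeholder: any text you can see in the browser

html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# No JavaScript runs here, so missing text means it is rendered dynamically
if needle in html:
    print("Found in raw HTML -> likely static content")
else:
    print("Missing from raw HTML -> likely loaded dynamically")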

For example, the website https://www.sunglasshut.com/ has dynamically loaded data. If we try to scrape it with Scrapy or Beautiful Soup alone (which are used for scraping static data), we will not be able to.

But if we catch the API call in the Network tab from which this data is being loaded into the website, then we can easily scrape the data with Scrapy.

If we examine each request in the Network tab, we can see that the first API call has our data in JSON format.

We need to right-click on it, copy its link address, and make a request directly to this URL.
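
In code, that is a single GET to the copied URL. A sketch; the URL below is a placeholder for whatever you copied from the Network tab:

import requests

# Placeholder: paste the exact URL copied from the Network tab here
api_url = "https://www.sunglasshut.com/<captured-api-url>"

resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
data = resp.json()        # the response is already structured JSON
print(list(data.keys()))  # peek at the top-level fields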

The output is the same JSON we saw in the Network tab, now ready for parsing.

It is much easier for us to scrape this JSON data than to scrape the page with libraries like Selenium/Playwright.

Now here is the trick: you will not always find a “hidden API” this easily, and even when you do find one, it may not give you access without a certain “payload” or authentication.
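
When that happens, the usual fix is to replay what the browser sent: copy the request headers, cookies, and body from the Network tab into your own request. A sketch; the endpoint and every value below are placeholders:

import requests

# All values are placeholders; copy the real ones from the captured request
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Api-Key": "<value copied from the request headers>",
}
cookies = {"session": "<value copied from the browser>"}
payload = {"query": "sunglasses", "page": 1}

resp = requests.post(
    "https://example.com/api/search",  # hypothetical endpoint
    headers=headers,
    cookies=cookies,
    json=payload,  # sends a JSON body, just like the browser's XHR did
)
print(resp.status_code, resp.json())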

Now I'll share an example from Airbnb.com. This website is dynamically loaded, so at first glance it seems we would need Selenium or Playwright to scrape it.

But instead, right-click anywhere on the page and click “Inspect” to bring up the developer tools. Then, go to the “Network” tab. This tab monitors which requests are sent to which endpoints, so if there is any hidden API, it is going to be revealed here. Then, click “XHR”. If you're curious what XHR is, it's short for XMLHttpRequest, and it's a JavaScript object used to transfer data. Essentially, by clicking “XHR”, we are separating the requests that fetch data from the requests that fetch images, HTML, CSS, or JavaScript. Finally, reload the page to monitor the requests.

Clicking “XHR” will reveal all the HTTP requests that are used to transfer data. By clicking on each request name, we can view the request and response headers, the request payload, and the site's response body. A tip: evaluate the API name to see if something indicates a search or product query, then view its response body to verify it.

If we look closely, there is an API call that has our required data.

Now we need to copy this URL.

Now we simply need to send a request to that URL, changing the required query parameters.

import requests
import json

# pprint ("pretty print") just makes JSON more human-readable in the console
from pprint import pprint

options = dict(
    page_no=1,
    checkin="07/15/2016",
    checkout="07/16/2016",
    sw_lat="40.83397847641101",
    sw_lng="-74.0845568169126",
    ne_lat="40.88991628064286",
    ne_lng="-73.86380028615088",
)

json_url = (
    "https://www.airbnb.com/search/search_results"
    "?page={page_no}&source=map&airbnb_plus_only=false"
    "&sw_lat={sw_lat}&sw_lng={sw_lng}&ne_lat={ne_lat}&ne_lng={ne_lng}"
    "&search_by_map=true&location=Manhattan,+New+York,+NY,+United+States"
    "&checkin={checkin}&checkout={checkout}&guests=1"
).format(**options)

# download the raw JSON
raw = requests.get(json_url).text

# parse it into a dict
data = json.loads(raw)

# pretty-print some data about the 0th listing
pprint(data["results_json"]["search_results"][0]["listing"])
# and its price info
pprint(data["results_json"]["search_results"][0]["pricing_quote"])
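
A small design note: requests can also build the query string itself from a dict, which avoids hand-formatting a long URL. An equivalent sketch of the same call:

import requests
from pprint import pprint

params = {
    "page": 1,
    "source": "map",
    "airbnb_plus_only": "false",
    "sw_lat": "40.83397847641101",
    "sw_lng": "-74.0845568169126",
    "ne_lat": "40.88991628064286",
    "ne_lng": "-73.86380028615088",
    "search_by_map": "true",
    "location": "Manhattan, New York, NY, United States",
    "checkin": "07/15/2016",
    "checkout": "07/16/2016",
    "guests": 1,
}

# requests URL-encodes the params dict into the query string for us
resp = requests.get("https://www.airbnb.com/search/search_results", params=params)
data = resp.json()
pprint(data["results_json"]["search_results"][0]["listing"])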

Additional Considerations:

In some cases, even though we can find the API call in the Network tab, the API requires some authentication; it can be in the cookies or inside the headers. Another thing I want to share: sometimes a web page is dynamically loaded but you won't find any API call in the Network tab, because the data is inside a script tag in the page source.
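
For that last case, the data can often be pulled straight out of the script tag and parsed with json.loads. A rough sketch, assuming the page embeds its state as JSON in a script tag with a known id (the __NEXT_DATA__ id below is just a common example from Next.js sites):

import json

import requests
from parsel import Selector

html = requests.get("https://example.com/").text  # placeholder URL

# Many JavaScript frameworks embed page state as JSON inside a script tag;
# Next.js sites, for example, use <script id="__NEXT_DATA__" type="application/json">
raw = Selector(text=html).css("script#__NEXT_DATA__::text").get()
if raw:
    state = json.loads(raw)
    print(list(state.keys()))  # explore the embedded data
else:
    print("No embedded JSON found under that id")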

Ok, let's stop here. This subject is long and complex and deserves more explanation, but I have tried my best to give you a basic overview of the “reverse engineering” technique that can be used to scrape dynamically loaded data just like static data.

Conclusion:

We learned what makes a web page “dynamic”, how to check whether a page is dynamically loaded, and how to capture an API call in the Network tab and parse the data.

You can connect with me on Twitter or on LinkedIn.
