What is JSON?

JSON (JavaScript Object Notation) is a lightweight, text-based data-interchange format. It is easy for humans to read and write and easy for machines to parse and generate. As a result, many websites use it to transfer data between the server and the browser efficiently.
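As a quick illustration, here is a minimal Python sketch of JSON text being parsed into native data structures and serialized back. The payload is invented for this example and is not taken from any real website.

```python
import json

# Hypothetical payload, made up for illustration only.
raw = '{"count": 2, "positions": [{"name": "Data Analyst"}, {"name": "Backend Engineer"}]}'

# Parse the JSON text into Python dicts and lists.
data = json.loads(raw)
names = [position["name"] for position in data["positions"]]

# Serialize the parsed data back into JSON text.
round_trip = json.dumps(data)
```

The same text is readable by a person and can be loaded by a program in a single call, which is what makes the format convenient for data extraction.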

Why extract from JSON links?

Extraction from JSON links allows for faster and safer conversion of data from JSON format into a structured format. It can help us:

  1. Achieve faster data extraction without loading images and other page resources

  2. Bypass anti-scraping restrictions on many websites

  3. Deal with Load More buttons and infinite scrolling more easily

How to use JSON extraction in Octoparse?

For demonstration purposes, we will scrape data from a job listing page on the Booking.com careers site using JSON extraction. Check out the sample URL: https://jobs.booking.com/careers?location=netherlands&query=&domain=booking.com

Below are the three major steps for this demo.

  1. Inspect the webpage in a browser - to identify the URL containing the JSON file we need

  2. Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links

  3. Select the data for extraction - to get the data we need


1. Inspect the webpage in a browser - to identify the URL containing the JSON file we need

  • Open the sample URL in Chrome

  • Right-click on the webpage and select Inspect to open the DevTools

  • Select Fetch/XHR from the Network tab in the DevTools

  • Click the Clear icon to remove the requests that have already loaded

  • Scroll down the job list in the scrollable column to trigger new requests

  • Check the newly loaded requests under Fetch/XHR to see whether any of them return JSON

  • Click on the name of a URL and check its Headers info. We will see that the content type under Request Headers contains JSON.

  • Switch to the Preview tab to see how much data is available. In this demo, the total count is 363.

  • Scroll down a bit more and compare the request URLs to find a pattern

By comparing the request URLs, we find that the parameter start= in the URL increases by 10 each time.
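The comparison can also be done programmatically. Below is a minimal Python sketch that parses the query strings of several request URLs and reports which parameters change between them. The URLs are placeholders on example.com, not the real request URLs, so copy the actual ones from the Network tab; `changing_parameters` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical request URLs; copy the real ones from the Network tab in DevTools.
request_urls = [
    "https://example.com/api/jobs?domain=booking.com&start=10&num=10",
    "https://example.com/api/jobs?domain=booking.com&start=20&num=10",
    "https://example.com/api/jobs?domain=booking.com&start=30&num=10",
]

def changing_parameters(urls):
    """Return {parameter: [value per URL]} for query parameters that differ between URLs."""
    parsed = [parse_qs(urlparse(url).query) for url in urls]
    keys = set().union(*(p.keys() for p in parsed))
    changing = {}
    for key in sorted(keys):
        values = [p.get(key, [None])[0] for p in parsed]
        if len(set(values)) > 1:
            changing[key] = values
    return changing

changes = changing_parameters(request_urls)

# Differences between consecutive values of the changing parameter.
starts = [int(value) for value in changes["start"]]
steps = [b - a for a, b in zip(starts, starts[1:])]
```

With these placeholder URLs, only `start` is reported as changing, and every value in `steps` is 10.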


2. Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links

Next, we need to batch generate the JSON URL list in Octoparse.

  • Open Octoparse and start a new advanced task that batch generates input URLs

  • Paste the copied URL into the URL format box

  • Select the changing element in the URL and click Add Parameter

  • Set Initial value to 0, Every time to +10, and End value to 363, then click Confirm to save

Note: The End value changes over time as listings are added and removed. Enter the total count you actually see in Chrome.

  • Click the Go to Web Page action and tick the JSON box in the General tab

  • Click Apply to save your settings
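To make the parameter settings in this step concrete, here is a minimal Python sketch of the list of URLs they describe: one URL per value of start, beginning at 0 and increasing by 10 up to 363. The URL template is a placeholder, and treating the End value as an inclusive upper bound is an assumption for illustration, not a statement of how Octoparse works internally.

```python
# Hypothetical URL template; replace it with the request URL copied from DevTools.
URL_TEMPLATE = "https://example.com/api/jobs?domain=booking.com&start={start}"

def generate_urls(template, initial, step, end):
    """Build one URL per parameter value, from initial up to end (assumed inclusive)."""
    return [template.format(start=value) for value in range(initial, end + 1, step)]

# Same settings as in the demo: Initial value 0, Every time +10, End value 363.
urls = generate_urls(URL_TEMPLATE, initial=0, step=10, end=363)
```

With these settings the list contains 37 URLs, with start running from 0 to 360, which is enough to cover 363 records at 10 records per request.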


3. Select the data for extraction - to get the data we need

  • Toggle the structure tree and select the page elements we want in the positions node

  • Extract data fields like name, display_job id, business unit, and location

  • Save the task and run it to get the data we need

Here is the sample data output.

32.png
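For readers who want to see what this selection amounts to outside Octoparse, here is a minimal Python sketch that picks the same kind of fields out of a positions list and writes them as CSV text. The response body is made up, and the key names `display_job_id` and `business_unit` are assumptions about how the fields are spelled in the real JSON; check them against the Preview tab.

```python
import csv
import io
import json

# Hypothetical response body; the values are invented and the key names are assumed.
raw = """
{
  "count": 2,
  "positions": [
    {"name": "Data Analyst", "display_job_id": "1001",
     "business_unit": "Analytics", "location": "Amsterdam, Netherlands"},
    {"name": "Backend Engineer", "display_job_id": "1002",
     "business_unit": "Engineering", "location": "Amsterdam, Netherlands"}
  ]
}
"""

FIELDS = ["name", "display_job_id", "business_unit", "location"]

def extract_rows(payload, fields=FIELDS):
    """Keep only the chosen fields from each entry under the "positions" node."""
    return [
        {field: position.get(field) for field in fields}
        for position in payload.get("positions", [])
    ]

rows = extract_rows(json.loads(raw))

# Write the rows as CSV text, one line per position.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Each entry under positions becomes one row, and each selected field becomes one column, which mirrors the table layout of the exported data.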