What is JSON?
Why extract from JSON links?
Extracting from JSON links offers a faster and safer way to convert data from JSON into a structured format. It can help us:
Extract data faster, since images and other page resources do not need to load
Bypass anti-scraping restrictions on many websites
Deal with Load More buttons and infinite scrolling more easily
How to use JSON extraction in Octoparse?
For demonstration purposes, we will scrape data from a job listing page on the Booking.com careers site using JSON extraction. Check out the sample URL: https://jobs.booking.com/careers?location=netherlands&query=&domain=booking.com
Below are the three major steps for this demo.
1. Inspect the webpage in a browser - to identify the URL containing the JSON file we need
Open the sample URL in Chrome
Right-click on the webpage and select Inspect to open the DevTools
Select Fetch/XHR from the Network tab in the DevTools
Click the clear icon to clear all the loaded information
Scroll down the job listing in the scrollable column so that the page loads more results and sends new requests
Check the newly loaded requests under Fetch/XHR to see whether any of them returns JSON
Click on the name of a request and check its Headers info. The Content-Type under Response Headers should contain JSON (for example, application/json).
Switch to the Preview tab to see how much data the response describes. For this demo, the total count is 363.
Scroll down a bit more and compare the request URLs to find a pattern
By comparing the request URLs, we find that the parameter start= in the URL increases by 10 each time.
Copy the URL containing the JSON file (Request URL in Headers), which is https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com
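The URL comparison in the steps above can also be done programmatically. The following is a minimal Python sketch, not part of the Octoparse workflow. It uses the request URL copied above, and it assumes the next request URL is identical except for start=20, which follows from the observation that start increases by 10 each time. The helper names query_params and changed_params are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL copied from the Headers tab in this demo.
FIRST_URL = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com"
)
# Assumed next request: same URL with start increased by 10.
SECOND_URL = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start=20&num=10&location=netherlands&domain=booking.com"
)


def query_params(url):
    """Return the query-string parameters of a URL as a dict of value lists."""
    return parse_qs(urlsplit(url).query)


def changed_params(url_a, url_b):
    """Return the parameters whose values differ between two URLs."""
    a, b = query_params(url_a), query_params(url_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


diff = changed_params(FIRST_URL, SECOND_URL)
print(diff)
```

Only the start parameter shows up in the result, which confirms that it is the value to vary when generating the URL list in the next step.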
2. Batch generate a JSON URL list in Octoparse - to extract from a list of JSON file links
Next, we need to batch generate the JSON URL list in Octoparse.
Open Octoparse and start a new advanced task that batch generates input URLs
Paste the copied URL into the URL format box
Select the changing element in the URL and click Add Parameter
Set Initial value to 0, Every time to +10, and End value to 363, then click Confirm to save
Note: The total count changes over time, so use the value you actually see in the Preview tab in Chrome as the End value.
Click the Go to Web Page action and tick the JSON box in the General tab
Click Apply to save your settings
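To make the parameter settings concrete, here is a small Python sketch of what the batch generation amounts to. It illustrates the idea and does not show how Octoparse implements it. The function name generate_urls is hypothetical, and treating the End value as an inclusive upper bound is an assumption; with 363 the last generated offset is 360 either way.

```python
# URL format from this demo, with the changing element replaced by a placeholder.
URL_FORMAT = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start={start}&num=10&location=netherlands&domain=booking.com"
)

# Total count seen in the Preview tab for this demo; it changes over time.
TOTAL_COUNT = 363


def generate_urls(url_format, initial, step, end):
    """Build one URL per offset, from initial up to end (assumed inclusive)."""
    return [
        url_format.format(start=offset)
        for offset in range(initial, end + 1, step)
    ]


urls = generate_urls(URL_FORMAT, initial=0, step=10, end=TOTAL_COUNT)
print(len(urls))   # number of JSON URLs to visit
print(urls[0])     # first URL, start=0
print(urls[-1])    # last URL, start=360
```

With a step of 10 and a total of 363 records, this gives 37 URLs, covering the offsets 0, 10, and so on up to 360.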
3. Select the data for extraction - to get the data we need
Toggle the structure tree and select the page elements we want in the positions node
Extract data fields like name, display_job id, business unit, and location
Save the task and run it to get the data we need
Here is the sample data output.
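For readers who want to see what step 3 does to the JSON itself, the sketch below picks the same kind of fields out of a response with plain Python. The sample payload is invented for illustration: the key names count, positions, name, display_job_id, business_unit, and location are assumptions modeled on the node and field names mentioned above, and the values are placeholders rather than real job data. The helper extract_rows is hypothetical.

```python
import json

# Invented sample payload; key names and values are assumptions, not real data.
SAMPLE_RESPONSE = """
{
  "count": 363,
  "positions": [
    {"name": "Example Role A", "display_job_id": "0001",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands",
     "other_field": "ignored"},
    {"name": "Example Role B", "display_job_id": "0002",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands"}
  ]
}
"""

# Fields to keep, mirroring the data fields selected in step 3.
FIELDS = ("name", "display_job_id", "business_unit", "location")


def extract_rows(payload, fields=FIELDS):
    """Return one dict per entry under "positions", keeping only the given fields."""
    return [
        {field: position.get(field) for field in fields}
        for position in payload.get("positions", [])
    ]


rows = extract_rows(json.loads(SAMPLE_RESPONSE))
for row in rows:
    print(row)
```

Each entry under positions becomes one row and each selected field becomes one column, which is the same shape as a tabular data export.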