Scrape data from JSON links
What is JSON?
JSON (JavaScript Object Notation) is a lightweight text-based data-interchange format. It is not only easy for humans to read and write, but also easy for machines to parse and generate. As a result, it is widely used by websites to improve network transmission efficiency.
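To make "easy for machines to parse and generate" concrete, here is a minimal Python sketch using the standard json module. The record below is a made-up placeholder, not data from any real site.

```python
import json

# A made-up JSON text, used only for illustration.
text = '{"name": "Example job", "tags": ["a", "b"], "remote": false}'

# Parse: JSON text -> Python dict.
record = json.loads(text)
print(record["name"])
print(record["tags"])

# Generate: Python dict -> JSON text.
round_trip = json.dumps(record, sort_keys=True)
print(round_trip)
```

The same text stays readable for a person, while a program can turn it into native data structures with a single call.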
Why extract from JSON links?
Extraction from JSON links allows for faster and safer conversion of data from JSON format into a structured format. It can help us:
- Achieve faster data extraction without loading images and such
- Bypass anti-scraping restrictions on many websites
- Deal with Load More buttons and infinite scrolling more easily
How to use JSON extraction in Octoparse?
For demonstration purposes, we will scrape data from a job listing page on the Booking.com careers site using JSON extraction. Check out the sample URL: https://jobs.booking.com/careers?location=netherlands&query=&domain=booking.com
Below are the three major steps for this demo.
- Inspect the webpage in a browser - to identify the URL containing the JSON file we need
- Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links
- Select the data for extraction - to get the data we need
1. Inspect the webpage in a browser - to identify the URL containing the JSON file we need
- Open the sample URL in Chrome
- Right-click on the webpage and select Inspect to open the DevTools
- Select Fetch/XHR from the Network tab in the DevTools
- Click the clear icon to clear all the loaded information
- Scroll down the job listing in the scrollable column to make the page load more results
- Check the newly loaded URLs under Fetch/XHR to see if any of them returns JSON
- Click on the name of a URL and check its Headers info. We will see that the content-type header contains json (look under Response Headers, which describe the returned data).
- Switch to the Preview tab and see how much data we are talking about. We can see the total count is 363 for this demo.
- Scroll down a bit more and compare the request URLs to find a pattern
By comparing the request URLs, we find that the parameter start= in the URL increases by 10 each time.
- Copy the URL containing the JSON file (Request URL in Headers), which is https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com
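The URL comparison in this step can also be done programmatically. The following is a minimal sketch using only the Python standard library. The helper name changed_params is hypothetical, and the second URL is an assumption derived from the observation that start= increases by 10 on each request.

```python
from urllib.parse import urlsplit, parse_qs


def changed_params(url_a, url_b):
    """Return {name: (values_in_a, values_in_b)} for query parameters that differ."""
    qa = parse_qs(urlsplit(url_a).query)
    qb = parse_qs(urlsplit(url_b).query)
    names = sorted(set(qa) | set(qb))
    return {n: (qa.get(n), qb.get(n)) for n in names if qa.get(n) != qb.get(n)}


# Request URL copied from the Headers tab in this tutorial.
first = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com"
)
# Assumed next request: same URL with start increased by 10.
second = first.replace("start=10", "start=20")

print(changed_params(first, second))  # {'start': (['10'], ['20'])}
```

Only start differs between the two requests, which is why it is the parameter to vary when generating the URL list.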
2. Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links
Next we need to batch generate the JSON URL list in Octoparse.
- Open Octoparse and start a new advanced task that batch generates input URLs
- Paste the copied URL into the URL format box
- Select the changing element in the URL and click Add Parameter
- Set Initial value to 0, Every time to +10, and End value to 363, then click Confirm to save
- Click the Go to Web Page action and tick the JSON box in the General tab
- Click Apply to save your settings
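For readers who want to see what the batch generation produces, the sketch below builds a comparable URL list in Python. It illustrates the idea and is not how Octoparse works internally. The names URL_TEMPLATE and build_urls are hypothetical, and the sketch assumes the generated values stop at the last multiple of 10 below the end value of 363.

```python
# Template based on the Request URL from step 1; {start} marks the changing parameter.
URL_TEMPLATE = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start={start}&num=10&location=netherlands&domain=booking.com"
)


def build_urls(total=363, step=10):
    """Build one URL per page: start = 0, 10, 20, ... while start < total."""
    return [URL_TEMPLATE.format(start=start) for start in range(0, total, step)]


urls = build_urls()
print(len(urls))   # number of generated URLs
print(urls[0])     # first page (start=0)
print(urls[-1])    # last page (start=360)
```

With 363 records and 10 records per request, the list covers every record while requesting each page only once.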
3. Select the data for extraction - to get the data we need
- Toggle the structure tree and select the page elements we want in the positions node
- Extract data fields like name, display_job id, business unit, and location
- Save the task and run it to get the data we need
Here is the sample data output.
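Outside Octoparse, the field selection in step 3 can be sketched in Python as follows. The payload is a made-up placeholder, not real output from the site, and the key names (positions, name, display_job_id, business_unit, location) are assumptions based on the field names mentioned above; the actual JSON structure may differ.

```python
import json

# Placeholder payload for illustration only; values and exact key names are assumptions.
SAMPLE = """
{
  "count": 363,
  "positions": [
    {"name": "Example Role A", "display_job_id": "0001",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands",
     "extra": "not extracted"},
    {"name": "Example Role B", "display_job_id": "0002",
     "location": "Amsterdam, Netherlands"}
  ]
}
"""

FIELDS = ["name", "display_job_id", "business_unit", "location"]


def extract_rows(payload, fields=FIELDS):
    """Keep only the chosen fields from each item under 'positions'."""
    return [
        {field: item.get(field, "") for field in fields}
        for item in payload.get("positions", [])
    ]


rows = extract_rows(json.loads(SAMPLE))
for row in rows:
    print(row)
```

Missing fields are filled with an empty string so that every row has the same columns, which keeps the result easy to export as a table.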
Feel free to leave a message if you still have questions about JSON extraction. We will get back to you ASAP.
Author: Crix
Editor: Isabel