What is JSON?
JSON (JavaScript Object Notation) is a lightweight, text-based data format that many websites use to send data from the server to the page.
Why extract from JSON links?
Extracting from JSON links gives us a faster and safer way to convert JSON data into a structured format. It can help us:
- Achieve faster data extraction without loading images and other page resources
- Bypass anti-scraping restrictions on many websites
- Deal with load more buttons and infinite scrolling more easily
How to use JSON extraction in Octoparse?
For demonstration purposes, we will scrape data from a listing page on Booking.com using JSON extraction. Check out the sample URL: https://jobs.booking.com/careers?location=netherlands&query=&domain=booking.com
Below are the three major steps for this demo.
- Inspect the webpage in a browser - to identify the URL containing the JSON file we need
- Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links
- Select the data for extraction - to get the data we need
1. Inspect the webpage in a browser - to identify the URL containing the JSON file we need
- Open the sample URL in Chrome
- Right-click on the webpage and select Inspect to open the DevTools
- Select Fetch/XHR from the Network tab in the DevTools
- Click the clear icon to clear all the loaded information
- Scroll down the job listing in the scrollable column so that the page loads more results
- Check the newly loaded requests under Fetch/XHR to see if any of them return a JSON file
- Click on the name of a request and check its Headers info. We will see that the content-type under Response Headers contains json.
- Switch to the Preview tab to see how much data is available. We can see the total count is 363 for this demo.
- Scroll down a bit more and compare the request URLs to find a pattern
By comparing the request URLs, we find that the parameter start= in the URL increases by 10 each time.
- Copy the URL containing the JSON file (Request URL in Headers), which is https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com
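The URL comparison in this step can also be done programmatically. The following is a minimal Python sketch, not part of Octoparse: the helper name `changed_params` is hypothetical, and the second URL is an assumed follow-up request built by changing start=10 to start=20, as the pattern above suggests.

```python
from urllib.parse import urlsplit, parse_qsl


def changed_params(url_a, url_b):
    """Return {name: (value_a, value_b)} for query parameters that differ."""
    # parse_qsl keeps repeated keys (domain appears twice here);
    # dict() then keeps the last value seen for each key.
    params_a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    params_b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    return {
        key: (params_a.get(key), params_b.get(key))
        for key in sorted(params_a.keys() | params_b.keys())
        if params_a.get(key) != params_b.get(key)
    }


# Request URL copied from the Headers tab in this demo.
first = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com"
)
# Assumed next request: identical except for the start parameter.
second = first.replace("start=10", "start=20")

print(changed_params(first, second))  # {'start': ('10', '20')}
```

Only start differs between the two requests, which is why it is the parameter to vary in the next step.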
2. Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links
Next we need to batch generate the JSON URL list in Octoparse.
- Open Octoparse and start a new advanced task that batch generates input URLs
- Paste the copied URL into the URL format box
- Select the changing element in the URL and click Add Parameter
- Set Initial value to 0, Every time to +10, and End value to 363, then click Confirm to save
- Click the Go to Web Page action and tick the JSON box in the General tab
- Click Apply to save your settings
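To make the parameter settings concrete, here is a rough Python sketch of the URL list those values would produce. It illustrates the arithmetic only and is not how Octoparse works internally. The function name `generate_urls` is hypothetical, and the sketch assumes End value is an inclusive upper bound.

```python
# URL copied in step 1, with the changing start value replaced by a placeholder.
URL_FORMAT = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start={start}&num=10"
    "&location=netherlands&domain=booking.com"
)


def generate_urls(url_format, initial=0, step=10, end=363):
    """Build one URL per start value from initial up to end (assumed inclusive)."""
    return [url_format.format(start=s) for s in range(initial, end + 1, step)]


urls = generate_urls(URL_FORMAT)
print(len(urls))   # 37
print(urls[0])     # first URL, start=0
print(urls[-1])    # last URL, start=360
```

With 363 results at 10 per request, the start values run from 0 to 360, so 37 URLs cover every result.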
3. Select the data for extraction - to get the data we need
- Toggle the structure tree and select the page elements we want in the positions node
- Extract data fields like name, display_job id, business unit, and location
- Save the task and run it to get the data we need
Here is the sample data output.
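For readers curious about what the field selection in step 3 amounts to outside Octoparse, the sketch below picks the same kind of fields from a positions list in Python. The JSON is a made-up sample, not real Booking.com data, and the key names display_job_id and business_unit are assumptions based on the field names mentioned above.

```python
import json

# Made-up sample shaped like the response described in this demo:
# a total count plus a "positions" list. All values are placeholders.
SAMPLE = """
{
  "count": 363,
  "positions": [
    {"name": "Example Role A", "display_job_id": "0001",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands",
     "extra": "ignored"},
    {"name": "Example Role B", "display_job_id": "0002",
     "location": "Amsterdam, Netherlands"}
  ]
}
"""

# Assumed key names for the fields selected in step 3.
FIELDS = ["name", "display_job_id", "business_unit", "location"]


def extract_rows(payload, fields=FIELDS):
    """Return one dict per position, keeping only the chosen fields."""
    data = json.loads(payload)
    # .get() yields None when a position lacks one of the fields.
    return [{f: p.get(f) for f in fields} for p in data.get("positions", [])]


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Each row keeps only the selected fields, and a missing field shows up as None rather than raising an error.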
Feel free to leave a message if you still have questions about JSON extraction. We will get back to you ASAP.