What is JSON?
Why should you use JSON extraction?
- Faster data extraction, because Octoparse doesn't have to load images and other page resources
- Fewer anti-extraction restrictions on some websites
- An easier way to deal with websites that use a "Load more" button or infinite scrolling
How to use JSON Extraction in Octoparse?
In this tutorial, we will use JSON extraction in Octoparse to scrape data from a job listing page on Booking.com as a simple example. Here is the link: https://jobs.booking.com/careers?location=netherlands&query=&domain=booking.com
1. Identify the URL of the JSON file we want to scrape
- Open the web page in the Chrome browser
- Right-click the page and select "Inspect" to open the Chrome developer tools
- Click "Network" and select "XHR"
- Clear the loaded entries, then scroll down so that the page loads more results. Check whether the URLs that appear under "XHR" return JSON files. (If a URL returns a JSON file, you'll find "json" listed under "content-type" in "Headers".)
- Find the JSON file that contains the data we want to scrape. You can select "Preview" to preview the JSON data. In this case, we want to scrape job information, so we need the JSON file that contains it. Here we also notice that the total number of jobs is 219.
- Scroll down again to trigger another request. By comparing the two Request URLs, we find that the "start" parameter increases by 10 with each request.
- Copy the URL of the targeted JSON file. It is listed as "Request URL" in "Headers": https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com
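The two checks described in this step (whether a response is JSON, and which parameter changes between requests) can also be sketched in Python. This is only an illustration, not part of Octoparse: the helper names `is_json_content_type` and `changed_params` are made up for this example, and the second URL is an assumed follow-up request that differs only in "start".

```python
from urllib.parse import urlsplit, parse_qs


def is_json_content_type(content_type: str) -> bool:
    """Return True if a Content-Type header value indicates JSON."""
    # Drop parameters such as "; charset=utf-8" and compare the media type only.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


def changed_params(url_a: str, url_b: str) -> dict:
    """Return the query parameters whose values differ between two URLs."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    keys = sorted(set(params_a) | set(params_b))
    return {
        key: (params_a.get(key), params_b.get(key))
        for key in keys
        if params_a.get(key) != params_b.get(key)
    }


# Request URL copied from "Headers" in the tutorial.
first = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com"
)
# Assumption: the next request is identical except for "start".
second = first.replace("start=10", "start=20")

print(is_json_content_type("application/json; charset=utf-8"))
print(changed_params(first, second))
```

Only "start" shows up in the comparison, which matches what we see by comparing the Request URLs by hand.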
2. Open the URL containing the targeted JSON file in Octoparse
- Start a new task and batch-generate the URLs of the JSON file, with the increment set to 10 and the number of items set to 23.
- Open the settings of "Go to Web Page"
- Check the "JSON" box and click "OK"
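For readers who want to see what the batch-generated list looks like, here is a minimal Python sketch of the same idea. It is not how Octoparse works internally; `batch_urls` is a hypothetical helper, and starting at start=0 is an assumption, since the tutorial only gives the increment (10) and the number of items (23).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Request URL from step 1; its "start" value is overwritten for each item.
TEMPLATE = (
    "https://jobs.booking.com/api/apply/v2/jobs"
    "?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com"
)


def batch_urls(template: str, param: str, start: int, step: int, count: int) -> list[str]:
    """Generate `count` URLs, setting `param` to start, start + step, and so on."""
    parts = urlsplit(template)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for i in range(count):
        value = str(start + i * step)
        # Replace only the paging parameter; keep the others in their original order.
        query = urlencode([(k, value if k == param else v) for k, v in pairs])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


# Assumption: the first page is start=0.
urls = batch_urls(TEMPLATE, "start", start=0, step=10, count=23)
print(urls[0])
print(urls[-1])
```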
3. Select the data for extraction
- Find the JSON nodes that contain the information we need. In this case, we open the "positions" node.
- Select the data in the tree structure. In this example, we select "name", "display_job_id", "business_unit" and "location".
- Click "Extract data" in the Tips panel, and Octoparse will automatically generate a Loop Item that scrapes every "name", "display_job_id", "business_unit" and "location" in the tree.
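The Loop Item is roughly equivalent to iterating over the "positions" node and keeping the selected fields from each entry. The Python sketch below illustrates that under stated assumptions: the sample payload is invented for this example (the real response contains more fields and different values), and `extract_rows` is a hypothetical helper.

```python
import json

# Field names selected in the tutorial.
FIELDS = ["name", "display_job_id", "business_unit", "location"]

# Hypothetical sample payload; only the "positions" node and the four
# field names come from the tutorial, the values are placeholders.
SAMPLE = """
{
  "positions": [
    {"name": "Example Role A", "display_job_id": "0001",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands", "id": 1},
    {"name": "Example Role B", "display_job_id": "0002",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands"}
  ]
}
"""


def extract_rows(payload: dict, fields=FIELDS) -> list[dict]:
    """Return one row per entry in "positions", keeping only the chosen fields."""
    rows = []
    for position in payload.get("positions", []):
        # A missing field becomes None instead of raising an error.
        rows.append({field: position.get(field) for field in fields})
    return rows


rows = extract_rows(json.loads(SAMPLE))
for row in rows:
    print(row)
```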
4. Save the task and run it
Here is the sample output data.