Scrape data from JSON with Octoparse
In version 7.2, you can use JSON extraction for faster and more reliable data extraction. This tutorial shows exactly how to use the feature.
What is JSON?
JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format that is widely used because it is typically more compact than XML and easier to parse and read, which reduces the amount of data sent over the network.
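To make the format concrete, here is a minimal Python sketch that parses a small JSON text and reads values from it. The product record is invented for illustration and is not taken from any real site.

```python
import json

# Hypothetical product record, invented for illustration only.
raw = '{"productId": "A1", "productName": "Gel pen", "price": 2.5, "tags": ["pens", "office"]}'

record = json.loads(raw)        # parse JSON text into Python objects
print(record["productName"])    # JSON objects become dicts
print(record["tags"][0])        # JSON arrays become lists

# Serialize back to compact JSON text (no spaces after separators).
compact = json.dumps(record, separators=(",", ":"))
print(compact)
```

The same key names ("productId", "productName") appear later in this tutorial, but the values here are placeholders.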
Why should you use JSON extraction?
- Faster data extraction, because Octoparse doesn't have to load images and other page resources
- Fewer anti-extraction restrictions on some websites
How to use JSON Extraction in Octoparse?
In this tutorial, we will scrape data from a list page on Walmart using JSON extraction with Octoparse as a simple example.
1. First, we'll need to identify the correct URL containing the JSON file we would like to scrape
- Open the web page in the Chrome browser
- Right-click the page and select "Inspect"
- Click "Network" and select "XHR" (labeled "Fetch/XHR" in recent Chrome versions)
- Refresh the page and check whether any of the requests listed under "XHR" return JSON
If a request returns JSON, you'll find "json" listed in the "content-type" under "Headers".
- Find the JSON file which contains the data we want to scrape
You can select "Preview" to preview the JSON data. In this case, we would like to scrape product information, and thus we want the JSON file with product information.
- Copy the URL containing the targeted JSON file
The URL is shown as "Request URL" under "Headers". In this example, it is: https://www.walmart.com/search/api/wpa?el=sponsored-container-bottom-1&type=product&min=2&max=20&placementId=1145x345_B-C-OG_TI_2-20_HL-BOTTOM&platform=desktop&bucketId=&moduleLocation=bottom&zipCode=94066&isZipLocated=true&sMode=0&pageType=search&customerId=2DC7B9C9052A1369-60000104C00000E4&vtc=WDA-vOofwJ8UtA5N-A1r9M&uid=6ed59512-38e3-409c-ac96-cccdc745720f&rviItems=32703709%2C11070434%2C16674418&itemsAddedToCart=0&viewportHeight=507&viewportWidth=1920&userLoggedIn=false&showBrand=false&pageId=na&pageNumber=1&keyword=pens&taxonomy=6735581_4705218&persistControls=true&isTwoDayDeliveryTextEnabled=true&mloc=bottom&module=wpa
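The two checks in step 1 (does the "content-type" mention JSON, and what does the Request URL ask for) can also be sketched in Python. This is only an illustration using the standard library; the helper names `is_json_content_type` and `query_params` are hypothetical, and the URL is a shortened form of the Request URL with most parameters omitted.

```python
from urllib.parse import urlsplit, parse_qs

def is_json_content_type(content_type: str) -> bool:
    """Return True if a content-type header value mentions JSON."""
    # Drop parameters such as "; charset=utf-8" before checking.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return "json" in media_type

def query_params(url: str) -> dict:
    """Return the query-string parameters of a URL as a flat dict."""
    pairs = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Keep the first value of each parameter for readability.
    return {name: values[0] for name, values in pairs.items()}

# Shortened form of the Request URL (most parameters omitted for brevity).
url = ("https://www.walmart.com/search/api/wpa"
       "?type=product&min=2&max=20&pageNumber=1&keyword=pens&bucketId=")

params = query_params(url)
print(params["keyword"], params["pageNumber"])
print(is_json_content_type("application/json; charset=utf-8"))
```

Reading the parameters this way makes it easier to confirm that the copied URL is the one for the search you care about (here, the keyword "pens" on page 1).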
2. Open the URL containing the targeted JSON file in Octoparse
- Copy and paste the URL containing the JSON file into Advanced Mode
- Select the box for "Extract from JSON" and click "OK" to refresh the page in Octoparse
You can see the JSON data in a tree structure, which can be expanded or collapsed as needed.
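As a rough picture of what such a tree view represents, the following Python sketch prints a parsed JSON value as an indented outline. It is not how Octoparse renders the tree; the `outline` function and the sample data (including the "result" and "products" keys) are assumptions made for illustration.

```python
import json

def outline(node, name="root", depth=0):
    """Return indented lines describing a parsed JSON value as a tree."""
    indent = "  " * depth
    if isinstance(node, dict):
        lines = [f"{indent}{name} {{}}"]
        for key, value in node.items():
            lines.extend(outline(value, key, depth + 1))
    elif isinstance(node, list):
        lines = [f"{indent}{name} [{len(node)}]"]
        for index, value in enumerate(node):
            lines.extend(outline(value, f"[{index}]", depth + 1))
    else:
        # Leaf value: string, number, boolean, or null.
        lines = [f"{indent}{name}: {node!r}"]
    return lines

# Hypothetical response body, invented for illustration only.
sample = json.loads(
    '{"result": {"products": [{"productId": "A1", "productName": "Gel pen"}]}}'
)
print("\n".join(outline(sample)))
```

Each level of indentation corresponds to one level you would expand in the tree.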
3. Select the data for extraction and start extraction
- Select the data in the tree structure
"productId" and "productName" are selected as an example.
- Click "Extract data" and Octoparse will automatically generate a loop item to scrape all the "productId" and "productName" in the tree
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
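Conceptually, the loop item in step 3 visits every node in the tree and collects the selected fields. The Python sketch below shows one way to do that by hand. It is not Octoparse's implementation; the `collect_records` function and the sample data are hypothetical.

```python
def collect_records(node, fields=("productId", "productName")):
    """Walk a parsed JSON value and return one row per object that has all the fields."""
    rows = []
    if isinstance(node, dict):
        if all(field in node for field in fields):
            rows.append({field: node[field] for field in fields})
        # Keep descending, since matching objects may be nested deeper.
        for value in node.values():
            rows.extend(collect_records(value, fields))
    elif isinstance(node, list):
        for item in node:
            rows.extend(collect_records(item, fields))
    return rows

# Hypothetical parsed response, invented for illustration only.
sample = {
    "result": {
        "products": [
            {"productId": "A1", "productName": "Gel pen", "price": 2.5},
            {"productId": "B2", "productName": "Ballpoint pen"},
        ]
    }
}

rows = collect_records(sample)
for row in rows:
    print(row["productId"], row["productName"])
```

Objects that lack either field are skipped, so each row always contains both values.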
Tip: To set up a JSON request, check "Extract from JSON" first, then refer to the options at the bottom.