In this tutorial, we are going to show you how to scrape product data from Walmart.com.
You can also go to "Task Templates" on the main screen of Octoparse and start directly with the ready-to-use Walmart template to save time. With a template, there is no need to configure the scraping task yourself. For further details, see: Task Templates
If you would like to know how to build the task from scratch, continue with the tutorial below.
Suppose we want to scrape information about headphones. We can start from the search results page (https://www.walmart.com/search/?query=headphones) to create our crawler, and then use Octoparse to scrape data such as the product title, price, product ID, and reviews from each product details page.
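If you want to start from a different keyword, the start URL can be built the same way. The snippet below is a minimal sketch, assuming the `query` parameter seen in the headphones URL above works for other search terms as well; the helper name `build_search_url` is our own and is not part of Octoparse.

```python
from urllib.parse import urlencode

# Base of the search results URL used in this tutorial.
BASE_SEARCH_URL = "https://www.walmart.com/search/"


def build_search_url(keyword: str) -> str:
    """Return the search results URL for a keyword (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the keyword.
    return f"{BASE_SEARCH_URL}?{urlencode({'query': keyword})}"


print(build_search_url("headphones"))
```

The resulting string is what you would paste into the "Website" box in step 1.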
Here are the main steps in this tutorial: [Download demo task file here]
- "Go to web page" - open the targeted web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Modify XPath - extract accurately
- Start extraction - run the task and get data
1. "Go to web page" - open the targeted web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, such as Walmart.com, we strongly recommend starting your data extraction project in Advanced Mode.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down the page and click the next page button ">"
- Click "Loop click next page" on the "Action Tips" panel
- Set up AJAX Load for the "Click to paginate" step
In this case, Octoparse has automatically detected AJAX on the selected element and set up the AJAX Timeout.
- Uncheck the box for "Retry when page remains unchanged (use discreetly for AJAX loading)"
- Set "AJAX Timeout" according to your network conditions (long enough for the page to load); in this case, we set "3s" for demonstration
- Click "OK" to save
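To make the idea behind the pagination loop and the AJAX timeout more concrete, here is a conceptual sketch in Python. It is not how Octoparse is implemented. The `paginate` function and the `FakeBrowser` class are hypothetical names used only for illustration, and stopping when the page stays unchanged is our assumption about what "no retry" means.

```python
import time
from typing import Callable, Iterator, List


def paginate(
    get_content: Callable[[], str],
    click_next: Callable[[], bool],
    ajax_timeout: float = 3.0,
    poll_interval: float = 0.05,
) -> Iterator[str]:
    """Yield the content of each results page, following the next-page button."""
    while True:
        current = get_content()
        yield current
        if not click_next():
            return  # no ">" button left, so this was the last page
        # AJAX updates the page in place, so poll until the content changes
        # or the timeout (3 seconds in this tutorial) runs out.
        deadline = time.monotonic() + ajax_timeout
        while get_content() == current and time.monotonic() < deadline:
            time.sleep(poll_interval)
        if get_content() == current:
            return  # assumption: stop rather than retry when nothing changed


class FakeBrowser:
    """Hypothetical stand-in for a browser, used only to exercise the loop."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.index = 0

    def content(self) -> str:
        return self.pages[self.index]

    def click_next(self) -> bool:
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


browser = FakeBrowser(["results page 1", "results page 2", "results page 3"])
scraped_pages = list(paginate(browser.content, browser.click_next, ajax_timeout=3.0))
print(scraped_pages)
```

A timeout that is too short would cut the wait off before the new results arrive, which is why the value should match your network conditions.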
If you want to learn more about AJAX, here is a related tutorial you might need:
3. Create a "Loop Item" - loop click into each item on each list
- Click "Go To Web Page" to go to the first page
When extracting data across multiple pages, always start building the task from the first page.
- Click the title of the 1st item
Octoparse will automatically detect other similar items on the list. The selected items will be highlighted in green while other links to the detail pages will be highlighted in red.
- Click "Select All" on the "Action Tips" panel
- Select "Loop click each element" to create a loop
Octoparse will click through each link captured in the "Loop Item", and open the detail page.
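Conceptually, a "Loop Item" is a list of similar links that are visited one by one. The sketch below illustrates that idea with Python's standard library on a made-up HTML fragment. The class name `product-title-link`, the sample paths, and the `ProductLinkParser` helper are assumptions for illustration and are not taken from the real Walmart page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductLinkParser(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, base_url: str, link_class: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.link_class = link_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.link_class in classes:
            # Turn relative links into absolute detail-page URLs.
            self.links.append(urljoin(self.base_url, href))


# Made-up listing markup; the real page structure will differ.
SAMPLE_LISTING = """
<ul>
  <li><a class="product-title-link" href="/ip/demo-headphones-a/111">Demo A</a></li>
  <li><a class="product-title-link" href="/ip/demo-headphones-b/222">Demo B</a></li>
  <li><a class="footer-link" href="/help">Help</a></li>
</ul>
"""

parser = ProductLinkParser(
    "https://www.walmart.com/search/?query=headphones", "product-title-link"
)
parser.feed(SAMPLE_LISTING)
product_links = parser.links

for link in product_links:
    print("open detail page:", link)
```

Only the two links that share the same pattern end up in the loop, which mirrors how similar items are grouped after "Select All".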
4. Extract data - select the data for extraction
After you click "Loop click each element", Octoparse will open the detail page of the 1st item.
- Click on the data you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
5. Modify XPath - extract accurately
XPath is a language for precisely locating specific elements on a page based on their tags and attributes. Before you write your own XPath, you need to inspect the HTML structure of the page.
- Find the correct XPath with the Firepath/Firebug extension in the Firefox browser.
In this case, the XPath generated by default does not locate the element accurately, so we have to enter the right XPath manually. The correct XPath is //span[@class="price display-inline-block arrange-fit price price--stylized"]/span
- Rename the fields by selecting a name from the predefined list or entering your own
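To see how an XPath like the one above selects an element, you can try it on a small fragment. The sketch below uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and requires a relative path, so the expression is prefixed with `.`. The sample markup and prices are made up, and real pages are HTML that may not parse as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics the structure the XPath expects.
SAMPLE = """
<div>
  <span class="price display-inline-block arrange-fit price price--stylized">
    <span>$19.99</span>
  </span>
  <span class="price">
    <span>$5.00</span>
  </span>
</div>
"""

# Same expression as in the tutorial, with a leading "." because
# ElementTree does not accept absolute paths on an element.
PRICE_XPATH = (
    './/span[@class="price display-inline-block arrange-fit price price--stylized"]'
    "/span"
)

root = ET.fromstring(SAMPLE)
prices = [element.text for element in root.findall(PRICE_XPATH)]
print(prices)
```

Note that `@class="..."` compares the whole attribute value, so the second `span`, whose class is only `price`, is not matched. If the site changes the class list even slightly, the expression will stop matching and has to be updated.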
6. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
Here is the sample output. You may see some blank fields in the column “Walmart ID” when you start extracting data. This is because these products do not have a product ID.
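The blank cells can be pictured as missing values written out as empty strings. The following sketch shows that idea with Python's `csv` module. The sample records and the helper `to_row` are hypothetical, and the column names other than "Walmart ID" are placeholders for whatever field names you chose.

```python
import csv
import io

FIELDS = ["Title", "Price", "Walmart ID"]


def to_row(record: dict) -> dict:
    """Fill every expected column, using an empty string for missing data."""
    return {field: record.get(field) or "" for field in FIELDS}


# Made-up records; the second product has no ID on its page.
records = [
    {"Title": "Demo Headphones A", "Price": "$19.99", "Walmart ID": "111"},
    {"Title": "Demo Headphones B", "Price": "$24.50"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
for record in records:
    writer.writerow(to_row(record))

csv_text = buffer.getvalue()
print(csv_text)
```

When you post-process an export, treating an empty "Walmart ID" as "not available" rather than as an error keeps such rows usable.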
By default, if Octoparse cannot find the element of the defined pattern on the page, the field will be left blank. However, Octoparse may leave a field blank even if the element needed is shown on the website. If you encounter such a problem, here is a related tutorial for you: