In this tutorial, we are going to show you how to scrape product data from Walmart.com.
Suppose we want to scrape specific information about laptops. We can start from the search results page (https://www.walmart.com/search/?query=laptop&cat_id=0) to build our crawler. With Octoparse, we will scrape data such as the product title, price, product ID, and reviews from each product details page.
Here are the main steps in this tutorial:
1) Go to web page - to open the targeted web page
· Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, like Walmart.com, we strongly recommend using Advanced Mode for your data extraction project.
· Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2) Create a pagination loop - to scrape all the results from multiple pages
· Turn on the "Workflow Mode" by toggling the "Workflow" button in the top-right corner of Octoparse
We strongly suggest turning on the "Workflow Mode": it shows every step of your task, which makes it easier to spot and fix a step that went wrong.
· Scroll down the page and click the next page button ">"
· Click "Loop click next page" from the "Action Tips"
When you click the next page button on Walmart.com, only the product listing is updated; the page itself does not reload. Without a reload, Octoparse receives no signal that the new content has arrived and would get stuck at the pagination step. Therefore, we need to set up AJAX Load in the "Click to paginate" step.
· Set up AJAX Load for the "Click to paginate" step
· Uncheck the box for "Retry when page remains unchanged (use discreetly for AJAX loading)"
· Check the box for "Load the page with AJAX" and set the timeout (2-4 seconds usually works)
· Click "OK" to save
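To make the idea behind the AJAX setting more concrete, here is a minimal Python sketch of waiting for an in-place update when no page reload occurs. It is only an illustration of the general technique, not how Octoparse works internally. The function `wait_for_ajax_update` and the simulated `reads` sequence are hypothetical names invented for this example.

```python
import time
from typing import Callable, Optional


def wait_for_ajax_update(
    read_listing: Callable[[], str],
    previous: str,
    timeout: float = 3.0,
    poll_interval: float = 0.1,
) -> Optional[str]:
    """Poll the listing until it differs from `previous` or `timeout` elapses.

    Returns the new listing content, or None if nothing changed in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_listing()
        if current != previous:
            return current
        time.sleep(poll_interval)
    return None


# Simulated page: the listing switches to page 2 on the third read,
# standing in for an AJAX response that arrives after the click.
reads = iter(["page 1", "page 1", "page 2"])
result = wait_for_ajax_update(
    lambda: next(reads), previous="page 1", timeout=2.0, poll_interval=0.01
)
print(result)  # page 2
```

The timeout plays the same role as the 2-4 seconds above: long enough for the new listing to arrive, short enough that the task does not stall when nothing changes.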
If you want to learn more about AJAX, here is a related tutorial you might need:
3) Create a "Loop Item" - to click into each item on the list
· Click "Go To Web Page" to go to the first page
When extracting data across multiple pages, always start building your task on the first page.
· Make Octoparse identify and select all 20 links on the page
· Click the first product title and click the "A" tag in the "Action Tips"
· Click the second product title and click the "A" tag in the "Action Tips"
In HTML source code, the "A" tag defines a hyperlink, which links one page to another. By clicking the "A" tag in the "Action Tips", we help Octoparse select the link to the detail page.
Normally, you don't need to click "A" manually since Octoparse will automatically distinguish and select hyperlinks. But if Octoparse fails to distinguish hyperlinks, you’ll need to select the "A" tag on your own to help Octoparse distinguish and select the link.
The selected links will be highlighted in green, while the other links to detail pages will be highlighted in red. If some links on the list page are still missing after the first two clicks, keep clicking more links from the same list until all desired links are selected and highlighted in green.
· Click "Loop click each element" to create a "Loop Item"
Octoparse will click through each link captured in the "Loop Item", and open the detail page.
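To illustrate what selecting the "A" tag means in terms of the page source, here is a small Python sketch that collects the link target of every `<a>` element in a listing. The markup in `sample_listing` and the `LinkCollector` class are hypothetical and made up for this example; Walmart's real pages are structured differently, and Octoparse does this selection for you.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the HTML that is fed in."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)


# Hypothetical listing markup; not Walmart's actual HTML.
sample_listing = """
<div class="results">
  <div class="item"><a href="/ip/example-product-1"><span>Example Product 1</span></a></div>
  <div class="item"><a href="/ip/example-product-2"><span>Example Product 2</span></a></div>
  <div class="item"><span>Banner without a link</span></div>
</div>
"""

collector = LinkCollector()
collector.feed(sample_listing)
print(collector.links)  # ['/ip/example-product-1', '/ip/example-product-2']
```

Only the two items wrapped in an `<a>` tag yield a link, which is why choosing the "A" tag rather than the title text gives Octoparse something to click through to.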
4) Extract data - to select the data for extraction
After you click "Loop click each element", Octoparse will open the detail page of the first product.
· Click on the data you need on the page
· Select "Extract text of the selected element" from the "Action Tips"
· Rename the fields by selecting from the pre-defined list or inputting on your own
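As a rough picture of what this step produces, the Python sketch below pulls the text of a few elements into named fields and leaves a field blank when no element matches, the same default this tutorial describes for Octoparse. The markup in `sample_detail`, the paths in `FIELD_PATHS`, and the `extract_fields` helper are all assumptions made for illustration. The standard-library XML parser used here also requires well-formed markup, which real product pages generally are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed detail-page fragment; not Walmart's real markup.
sample_detail = """
<div>
  <h1 class="title">Example Laptop 15</h1>
  <span class="price">$499.00</span>
</div>
"""

# Field name -> path to the element whose text we want (all hypothetical).
FIELD_PATHS = {
    "Title": ".//h1[@class='title']",
    "Price": ".//span[@class='price']",
    "Walmart ID": ".//span[@class='product-id']",
}


def extract_fields(markup: str, field_paths: dict) -> dict:
    """Return the text for each field, or "" when no element matches."""
    root = ET.fromstring(markup)
    row = {}
    for name, path in field_paths.items():
        element = root.find(path)
        row[name] = (element.text or "").strip() if element is not None else ""
    return row


row = extract_fields(sample_detail, FIELD_PATHS)
print(row)  # {'Title': 'Example Laptop 15', 'Price': '$499.00', 'Walmart ID': ''}
```

The sample fragment has no product ID element, so the "Walmart ID" field comes back empty instead of raising an error.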
5) Start extraction - to run the task and get data
· Click “Start Extraction” on the upper left side
· Select “Local Extraction” to run the task on your computer, or select “Cloud Extraction” to run the task in the Cloud (for premium users only)
Here is the sample output. You can see some blank fields in the column “Walmart ID”. This is because these products do not have a product ID.
By default, if Octoparse cannot find an element matching the defined pattern on the page, the field is left blank. However, Octoparse may fail to find the element even when it is visible on the website. If you encounter this problem, here is a related tutorial you might need: