You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
Sam's Club, a division of Walmart Inc., is a membership warehouse club that offers high-quality products for everyday needs. It has become increasingly popular around the world in recent years.
This tutorial shows how to scrape basic product information, such as name and price, from Sam's Club.
To follow along with the tutorial, you may want to use the URL below:
Here are the main steps in this tutorial: [Download task file here]
1. Create a Go to Web Page - to open the target website
Enter the target URL into the search bar on the home screen and click Start
2. Auto-detect the webpage - to create a workflow
Octoparse's Auto-detection function can help you create a workflow quickly according to the design of the target website.
Click Auto-detect web page data in Tips and wait for the detection to complete
Check the data fields in Data preview and delete unwanted fields or rename them if needed
Uncheck Add a page scroll and uncheck Click on a “Load More” button
Click Create workflow
3. Modify the XPath of the data field(s) - to locate the data accurately
The auto-generated XPath of some fields needs to be modified to make sure that Octoparse extracts accurate data.
In this case, the data in the field Price is incomplete, so we need to modify the XPath of Price to get the right data.
Click More(...) next to the data field to change its settings
Choose Customize XPath
Input the Matching XPath for Price as: //span[contains(text(),'current price')]
Click Apply to save the change
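To see what the Matching XPath above selects, here is a minimal Python sketch that emulates it on a tiny page. The markup in SAMPLE and the function name match_price_spans are assumptions made for illustration; the real Sam's Club page is structured differently. Python's built-in ElementTree does not support contains() in XPath, so the sketch applies the same case-sensitive substring test in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical product-card markup (an assumption, not the real page).
SAMPLE = """
<div>
  <span>Example Product</span>
  <span>current price: $19.98</span>
  <span>Free shipping</span>
</div>
"""

def match_price_spans(markup: str, needle: str = "current price") -> list[str]:
    """Return the text of every <span> whose leading text contains `needle`.

    Roughly mirrors //span[contains(text(),'current price')].
    The substring test is case-sensitive, as it is in XPath.
    """
    root = ET.fromstring(markup)
    # ElementTree's XPath subset lacks contains(), so filter in Python.
    return [
        span.text
        for span in root.iter("span")
        if span.text and needle in span.text
    ]

if __name__ == "__main__":
    print(match_price_spans(SAMPLE))
```

Only the span whose text includes the label is returned; the product name and shipping spans are skipped, which is why this XPath narrows the Price field to the right element.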
NOTE: You may find that the price data contains not only numbers but also irrelevant words such as CURRENT PRICE in this case. If you would like to remove them, check here to learn more about how Octoparse can help refine the data.
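If you prefer to clean the exported data outside Octoparse, the label can also be stripped with a short script. The following is a minimal sketch, assuming the raw value looks something like "CURRENT PRICE: $19.98"; the function name clean_price and the sample strings are hypothetical.

```python
import re

def clean_price(raw: str) -> str:
    """Return only the numeric amount from a raw price string.

    Returns an empty string when no number is present.
    """
    # First run of digits, allowing thousands separators and a decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    return match.group(0).replace(",", "") if match else ""

if __name__ == "__main__":
    for raw in ["CURRENT PRICE: $19.98", "current price $1,299.98", "n/a"]:
        print(repr(raw), "->", repr(clean_price(raw)))
```

The function returns a string rather than a number so that empty or malformed cells can be passed through and inspected later.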
4. Create a Pagination - to load and extract more data
Click on the next page button at the bottom of the webpage
Click Loop click single button on the Tips panel
Set an appropriate AJAX timeout (7-10 seconds is recommended)
NOTE: If you are interested in how Octoparse handles AJAX websites, please check here.
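As a rough intuition for why a timeout matters on AJAX pages: after a click, a scraper has to give the page time to load new content before extracting it. The sketch below shows one generic way to wait with an upper bound. It is not Octoparse's internal implementation, and the names wait_for_ajax, content_changed, timeout_s and poll_s are assumptions made for illustration.

```python
import time
from typing import Callable

def wait_for_ajax(
    content_changed: Callable[[], bool],
    timeout_s: float = 8.0,
    poll_s: float = 0.25,
) -> bool:
    """Poll until `content_changed()` is true or `timeout_s` elapses.

    Returns True if new content appeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if content_changed():
            return True
        time.sleep(poll_s)  # avoid busy-waiting between checks
    return False

if __name__ == "__main__":
    # Simulated page that finishes loading 0.5 seconds after the click.
    loaded_at = time.monotonic() + 0.5
    print(wait_for_ajax(lambda: time.monotonic() >= loaded_at, timeout_s=2.0))
```

A timeout that is too short risks moving on before the next page of products has loaded, while a very long one slows down every page turn.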
Now, you will see a workflow created like the one below:
5. Run the task - to get your desired data
Click Save on the upper right to save your task
Click Run next to it and wait for a Run Task window to pop up
Select Run on your device to run the task on your local device
Wait for the task to complete
Here is sample output from a local run: