You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
A division of Walmart Inc., Sam's Club is a membership warehouse club that offers high-quality products for customers' daily needs. Demand for it has grown around the world in recent years.
This tutorial shows how to scrape basic product information, such as name and price, from Sam's Club.
To follow along with the tutorial, you may want to use the URL below:
NOTE: If you want to check whether your workflow works correctly, please download the OTD file for this case at the bottom of this page.
Here are the main steps of this tutorial:
- Create a Go to Web Page - to open the target website
- Auto-detect the webpage - to create a workflow
- Modify the XPath of the data field(s) - to locate the data accurately
- Create a Pagination - to load and extract more data
- Run the task - to get your desired data
1. Create a Go to Web Page - to open the target website
- Enter the target URL into the search bar on the home screen and click Start
2. Auto-detect the webpage - to create a workflow
Octoparse's Auto-detection function can help you create a workflow quickly according to the design of the target website.
- Click Auto-detect web page data in Tips and wait for the detection to complete
- Check the data fields in Data preview and delete unwanted fields or rename them if needed
- Uncheck Add a page scroll and uncheck Click on a “Load More” button
- Click Create workflow
3. Modify the XPath of the data field(s) - to locate the data accurately
The auto-generated XPath of some fields needs to be modified to make sure that Octoparse extracts accurate data.
In this case, the data in the field Price is incomplete, so we need to modify the XPath of Price to get the right data.
- Click More(...) next to the data field to change its settings
- Choose Customize XPath
- Input the Matching XPath for Price as: //span[contains(text(),'current price')]
- Click Apply to save the change
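To see what the Matching XPath above selects, here is a minimal Python sketch run against made-up markup. The HTML snippet, the item names, and the helper `spans_containing` are assumptions for illustration only; the real Sam's Club page structure will differ. Python's built-in ElementTree does not support `contains()`, so the sketch emulates the predicate with a plain substring check rather than evaluating the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical product-card markup; the real page will look different.
SAMPLE_HTML = """
<div>
  <div class="card">
    <span>Example item A</span>
    <span>current price: $12.98</span>
  </div>
  <div class="card">
    <span>Example item B</span>
    <span>current price: $7.48</span>
  </div>
</div>
"""

def spans_containing(root, needle):
    """Emulate //span[contains(text(), needle)] for simple markup."""
    # ElementTree's XPath subset has no contains(), so filter in Python.
    # s.text is the text before the first child element of the span.
    return [s for s in root.iter("span") if needle in (s.text or "")]

root = ET.fromstring(SAMPLE_HTML)
prices = [s.text for s in spans_containing(root, "current price")]
print(prices)
```

Only the spans whose text includes `current price` are kept, so the product-name spans are skipped. Note that the match is case-sensitive, as it is in XPath.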
NOTE: You may find that the price data contains not only numbers but also irrelevant words such as CURRENT PRICE in this case. If you would like to remove them, check here to learn more about how Octoparse can help refine the data.
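If you prefer to clean the price after exporting the data instead of inside Octoparse, a short script can strip the extra words. The sketch below is one possible approach, not an Octoparse feature; the function name `extract_price` and the sample strings are assumptions, and your exported values may be formatted differently.

```python
import re

def extract_price(raw):
    """Return the first number found in raw as a float, or None."""
    # Match digits with optional thousands separators and decimals.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))

# Hypothetical raw values, similar to what the Price field might hold.
print(extract_price("current price: $12.98"))
print(extract_price("CURRENT PRICE $1,299.00"))
```

Returning `None` when no number is found makes it easy to spot rows where the price was not captured at all.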
4. Create a Pagination - to load and extract more data
- Click on the next page button at the bottom of the webpage
- Click Loop click single button on the Tips panel
- Set an appropriate AJAX timeout (7-10 s is recommended)
NOTE: If you are interested in how Octoparse handles AJAX websites, please check out here.
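The idea behind a click loop with an AJAX timeout can be sketched in a few lines of Python. This is a conceptual illustration only and not how Octoparse is implemented: `FakeSite`, `wait_for_change`, and `scrape_all` are hypothetical names, and the fake site updates instantly, so the demo uses a very short timeout instead of the 7-10 s recommended above.

```python
import time

def wait_for_change(read_state, previous, timeout_s, poll_s=0.01):
    """Poll read_state() until it differs from previous or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = read_state()
        if current != previous:
            return current
        time.sleep(poll_s)
    return None  # nothing new appeared within the timeout

class FakeSite:
    """Stand-in for a paginated AJAX site (hypothetical)."""
    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def items(self):
        return self.pages[self.index]

    def click_next(self):
        # On the last page the button does nothing.
        if self.index + 1 < len(self.pages):
            self.index += 1

def scrape_all(site, timeout_s):
    """Extract the current page, click next, wait, and repeat."""
    rows = list(site.items())
    while True:
        before = site.items()
        site.click_next()
        after = wait_for_change(site.items, before, timeout_s)
        if after is None:  # no new content loaded in time: stop looping
            break
        rows.extend(after)
    return rows

site = FakeSite([["A", "B"], ["C", "D"], ["E"]])
print(scrape_all(site, timeout_s=0.05))
```

A timeout that is too short can end the loop before slow pages finish loading, while one that is too long only adds waiting time on the last page.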
Now, you will see a workflow created like the one below:
5. Run the task - to get your desired data
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run Task window to pop up
- Select Run on your device to run the task on your local device
- Wait for the task to complete
Here is sample output from a local run: