Scrape product information from Sam's Club
In this tutorial, we will show you how to collect product information from Sam's Club with Octoparse.
Alternatively, you can go to "Task Template" on the main screen of Octoparse and start with the ready-to-use Sam's Club template to save time. With this feature, there is no need to configure the scraping task yourself. For further details, see Task Templates.
If you would like to know how to build the task from scratch, you may continue reading the following tutorial.
We will scrape the detail-page URLs in Task 1 and then extract product details such as the title, price, and brand from each detail page in Task 2. Splitting one task into two can improve extraction speed, especially when using Octoparse Cloud Extraction. (A short code sketch after the step list below illustrates the same two-task pattern.)
To follow along, you may want to use this URL in the tutorial:
This tutorial will also cover:
- Dealing with AJAX for pagination
Here are the main steps in this tutorial: [Download demo task files here]
Task 1: Extract all the URLs of detail pages on the search result pages
- "Go to Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple search results pages
- Build a "Loop Item"- loop extract each URL on the search results pages
- Start extraction - run the task and get data
Task 2: Collect the product information from scraped URLs
- Input the batch of scraped URLs - open each detail page in a loop
- Extract data - select the data for extraction
- Start extraction - run the task and get data
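If it helps to see the same two-task pattern outside of Octoparse, below is a minimal Python sketch of the idea. The search URL, CSS selectors, and field names are illustrative assumptions only, and samsclub.com renders its results with JavaScript, so a plain HTTP request may not return the product grid; the point is just the "collect URLs first, then scrape details" structure.

```python
# Minimal sketch of the two-task pattern (assumed URL and selectors, not Octoparse internals).
import requests
from bs4 import BeautifulSoup

def task1_collect_urls(search_url):
    """Task 1: collect the detail-page URLs from one search results page."""
    soup = BeautifulSoup(requests.get(search_url, timeout=30).text, "html.parser")
    return [a["href"] for a in soup.select("a.product-link")]  # assumed selector

def task2_scrape_detail(url):
    """Task 2: open one detail page and extract the product fields."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    def text(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None
    # "h1", ".price", and ".brand" are placeholder selectors for illustration
    return {"title": text("h1"), "price": text(".price"), "brand": text(".brand")}

if __name__ == "__main__":
    urls = task1_collect_urls("https://www.samsclub.com/s/laptop")  # example search URL
    products = [task2_scrape_detail(u) for u in urls]
    print(products[:3])
```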
Task 1: Extract the detail page URLs on the search result pages
1. "Go to Web Page"- open the target web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like samsclub.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple search results pages
- Scroll down and click the ">" button on the web page
- Click "Loop click the selected element" on the "Action Tips" panel
Octoparse Version 7.2.2 can detect the AJAX technique and set up AJAX Load automatically. To ensure the web page is fully loaded in Octoparse's built-in browser, we need to set a timeout for AJAX.
- Select an appropriate time for "AJAX Timeout" in the drop-down menu
- Click "Save"
Tips! To learn more about dealing with AJAX in Octoparse, please refer to Deal with AJAX. You may also watch the video tutorial Octoparse: AJAX 101.
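If you are curious what the "AJAX Timeout" corresponds to conceptually, it is simply an upper bound on how long to wait for newly loaded content before moving on. As a rough analogy outside of Octoparse, here is a hedged Selenium sketch; the CSS selectors are assumptions for illustration, not Sam's Club's actual markup.

```python
# Rough analogy to an AJAX timeout: wait up to 10 seconds for AJAX-loaded content.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.samsclub.com/s/laptop")  # example search URL

wait = WebDriverWait(driver, 10)  # comparable to setting a 10-second AJAX Timeout
# Click the ">" pagination button once it becomes clickable (assumed selector).
next_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[aria-label='Next']")))
next_button.click()
# Then wait for the next page of results to render before extracting anything (assumed selector).
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.product-link")))
driver.quit()
```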
3. Build a "Loop Item"- loop extract each URL on the search results pages
- Click "Go To Web Page" to go back to the first page, and then click "Pagination" box
- Select one of the product items on the search result page
In this step, Octoparse is able to detect similar elements and highlight them in red.
- Click "A" tab on the bottom of the "Action Tips"
As we need to extract the URLs in a loop, make sure you select the "A" tag when extracting the URL. ("A" stands for anchor; a short code sketch after these steps shows what reading an anchor's URL looks like.)
- Click "Select All"
- Click "Extract the URLs of the selected elements"
4. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Task 2: Collect the product information from scraped URLs
1. Input the batch of scraped URLs - open each detail page in a loop
In Task 1, we have already collected a batch of URLs.
- Click "+ Task" to start a task using Advanced Mode to build Task 2
- Input batch URL
There are three ways to batch import URLs into a single task/crawler (up to one million URLs). In this case, we will batch import URLs from a local file: copy the URLs from the Task 1 extraction output file and paste them into the "Website" text box. For further details, please refer to Batch Import URLs.
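Outside of Octoparse, the equivalent of feeding Task 1's output into Task 2 might look like the snippet below, which reads the exported URLs from a CSV file; the file name and column name are assumptions about how the data was exported.

```python
import csv

# Read the detail-page URLs exported by Task 1 (file and column names are assumptions).
with open("task1_output.csv", newline="", encoding="utf-8") as f:
    urls = [row["Page_URL"] for row in csv.DictReader(f) if row.get("Page_URL")]

# Task 2 would then visit each of these URLs in a loop and extract the product fields.
for url in urls[:5]:
    print(url)
```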
2. Extract data - select the data for extraction
As we can see, we are on the detail page now.
- Click the information you need on the page
- Select "Extract data" on the "Action Tips"
In this step, we can rename the fields by selecting from the predefined list or entering our own names.
3. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For premium users, Cloud Extraction is highly recommended.
Now you have the data you want. Splitting the task helps improve extraction efficiency and minimizes problems caused by small changes to the website.
However, you can still build everything as a single task so that you extract the data in one run. You may check the similar case tutorial Scrape product data from Tokopedia to get a basic idea of the whole procedure.
Author: Erika
Editor: Fergus