In this tutorial, we will show you how to scrape product details from Wayfair, an American e-commerce company that sells home goods.
For this example, we will use the URL below in order to scrape data such as product title, description, and price from each product detail page.
Here are the main steps in this tutorial:
- "Go To Web Page" - open the targeted web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - click into each item on the list
- Extract data - select the data for extraction
- Start extraction - run the task and get data
- Click "+ Task" to start a new task with Advanced Mode
- Paste the URL into the "Extraction URL" box and click "Save URL" to move on
Extracting data from a list of URLs is recommended for large data scraping projects. This approach is considerably more efficient and manageable. When the list of URLs is large, Octoparse supports batch/bulk URL import from local files (text or spreadsheet) or from another task, and can even generate the URLs based on pre-defined patterns.
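To make the idea of pattern-based URL lists concrete, here is a minimal Python sketch. It is not how Octoparse works internally; the function names, the `{page}` placeholder, and the example.com URL are all assumptions made for illustration.

```python
def urls_from_pattern(template: str, start: int, stop: int, step: int = 1) -> list[str]:
    """Fill the {page} placeholder for every value from start to stop (inclusive)."""
    return [template.format(page=n) for n in range(start, stop + 1, step)]


def urls_from_text(text: str) -> list[str]:
    """Keep one URL per non-empty line, dropping duplicates but preserving order."""
    seen: dict[str, None] = {}
    for line in text.splitlines():
        url = line.strip()
        if url:
            seen.setdefault(url, None)
    return list(seen)


# Hypothetical listing URL; the real site's query parameters may differ.
TEMPLATE = "https://www.example.com/furniture?curpage={page}"
page_urls = urls_from_pattern(TEMPLATE, 1, 3)

for url in page_urls:
    print(url)
```

A list built this way could be saved to a text file, one URL per line, which is the kind of file a batch import expects.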
- Scroll down to the bottom of the page, click the "Next" button
- Click "Loop click next page" on the "Action Tips" panel
- Set the AJAX timeout to 5s (optional, depending on your local network conditions)
- Click "OK" to save
AJAX timeout can often double as a page-load timeout for a Click action. For example, when a page keeps loading long after the data you need has appeared, you can use AJAX timeout to tell Octoparse to move on to the next action once the set time is reached.
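The general idea behind such a timeout can be sketched in Python as a wait that is capped at a fixed number of seconds. This is only an illustration of the concept, not Octoparse's implementation; `wait_for` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have passed.

    Returns True if the condition was met and False if the time ran out.
    In both cases the caller simply continues with the next action.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    # One last check so a condition met right at the deadline is not missed.
    return condition()


# A condition that never becomes true stands in for a page that never finishes loading.
finished = wait_for(lambda: False, timeout=0.3, poll=0.05)
print("page finished loading:", finished)
```

With a cap like this, a slow page costs at most the configured time before the workflow moves on.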
- Click on any product title on the page
- Click "Select all" on the "Action Tips" panel
- Click "Loop click each element"
Octoparse detects similar items on the same page when an element is selected. The selected links are highlighted in green, while all the other similar links detected are highlighted in red. When a Loop click action is added, Octoparse clicks through each link captured in the Loop Item and opens the product detail pages one by one.
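As a rough analogy for "similar items", the sketch below collects every link that shares a CSS class with the one you picked, using only Python's standard library. How Octoparse actually matches similar elements is not described here, so treat the class-based matching, the `product-title` class, and the sample markup as assumptions.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.css_class in classes and href:
            self.links.append(href)


def similar_links(html: str, css_class: str) -> list[str]:
    """Return the links of all anchors sharing `css_class`, in page order."""
    collector = LinkCollector(css_class)
    collector.feed(html)
    return collector.links


# Hypothetical markup standing in for a product listing page.
SAMPLE = """
<a class="product-title" href="/p/1">Item one</a>
<a class="nav" href="/help">Help</a>
<a class="product-title featured" href="/p/2">Item two</a>
"""

# Visiting each collected link in turn mirrors the "Loop click each element" step.
for link in similar_links(SAMPLE, "product-title"):
    print("open detail page:", link)
```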
- Uncheck "Retry when page remains unchanged (use discreetly for AJAX loading)"
- Click "Save"
After you click "Loop click each element", Octoparse will open the detail page of the first product.
- Click on the data you need on the page
- Select "Extract text of the selected element" from the "Action Tips" panel
- Rename the fields by selecting from the pre-defined list or entering your own names
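The extraction step above (pick an element, take its text, give the field a name) can be sketched in plain Python as follows. The class names (`pdp-title` and so on), the field names, and the sample markup are hypothetical placeholders, not Wayfair's real page structure or real product data.

```python
from html.parser import HTMLParser
from typing import Optional

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose CSS class maps to a field name."""

    def __init__(self, class_to_field: dict[str, str]) -> None:
        super().__init__()
        self.class_to_field = class_to_field
        self.record: dict[str, str] = {}
        self._field: Optional[str] = None  # field currently being captured
        self._depth = 0                    # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        for css_class in (dict(attrs).get("class") or "").split():
            if css_class in self.class_to_field:
                self._field = self.class_to_field[css_class]
                self._depth = 1
                self.record.setdefault(self._field, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] += data


def extract_fields(html: str, class_to_field: dict[str, str]) -> dict[str, str]:
    """Return {field name: text}, with runs of whitespace collapsed."""
    parser = FieldExtractor(class_to_field)
    parser.feed(html)
    return {name: " ".join(text.split()) for name, text in parser.record.items()}


# Hypothetical detail-page markup with placeholder values.
SAMPLE_HTML = """
<h1 class="pdp-title">Sample Sofa</h1>
<div class="pdp-desc">A <b>placeholder</b> description.</div>
<span class="pdp-price">$0.00</span>
"""
FIELDS = {"pdp-title": "title", "pdp-desc": "description", "pdp-price": "price"}

print(extract_fields(SAMPLE_HTML, FIELDS))
```

The mapping from class to field name plays the role of renaming fields: the selector decides what is captured, and the name decides what the column is called in the output.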
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here's the data we extracted.
Happy data hunting!