In this tutorial, we will show you how to collect product information on Houzz with Octoparse.
To follow along, you may want to use this URL in the tutorial:
We will scrape data such as the product title, price, and the number of reviews from the product details page with Octoparse.
Here are the main steps in this tutorial: [Download demo task file here]
- "Go to Web Page" - to open the target web page
- Create a pagination loop - to scrape all the results from multiple pages
- Create a "Loop Item" - to click into each item in the list on every page
- Extract data - to select the data for extraction
- Save and start extraction - to run the task and get data
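Before building the task, it may help to see the logic these steps produce. The following is only a rough Python sketch of the control flow, not Octoparse code: the `PAGES` data and the `run_task` function are hypothetical stand-ins for the website and the finished workflow.

```python
# Hypothetical in-memory stand-in for a paginated product listing.
# Each page holds product records and a flag for the "Next page" button.
PAGES = [
    {"items": [{"title": "Lamp", "price": "$40", "reviews": "12"},
               {"title": "Chair", "price": "$120", "reviews": "3"}],
     "has_next": True},
    {"items": [{"title": "Rug", "price": "$80", "reviews": "7"}],
     "has_next": False},
]

def run_task(pages):
    rows = []
    page_index = 0                      # step 1: open the first page
    while True:                         # step 2: pagination loop
        page = pages[page_index]
        for item in page["items"]:      # step 3: loop over each listed item
            # step 4: extract the fields from the item's detail page
            rows.append((item["title"], item["price"], item["reviews"]))
        if not page["has_next"]:        # no "Next page" button: stop
            break
        page_index += 1                 # "click" the "Next page" button
    return rows                         # step 5: the collected data

rows = run_task(PAGES)
print(rows)
```

The item loop sits inside the pagination loop, which is the same nesting the workflow will have once the task is built.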
1. "Go to Web Page" - to open the target web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, like houzz.com, we strongly recommend starting your data extraction project in Advanced Mode.
- Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2. Create a pagination loop - to scrape all the results from multiple pages
- Scroll down and click the "Next page" button on the web page
- Click "Loop click next page" on "Action Tips"
Octoparse version 7.2.2 can detect when a page loads content with AJAX and set up AJAX Load automatically. To make sure the web page is fully loaded in Octoparse's built-in browser, we need to set an AJAX timeout.
- Select an appropriate "AJAX Timeout" in the drop-down menu
- Click "OK" to save the step
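To see why the timeout value matters, here is a minimal Python sketch of the general idea behind a load timeout. It is an illustration only and does not reflect how Octoparse implements AJAX Load internally; `wait_for_content` and `page_ready` are hypothetical names, and the simulated page is assumed to finish loading 0.3 seconds after the click.

```python
import time

def wait_for_content(is_loaded, timeout_s, poll_s=0.05):
    """Poll is_loaded() until it returns True or timeout_s seconds pass."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_loaded():
            return True
        time.sleep(poll_s)
    return is_loaded()  # one last check once the deadline has passed

# Simulated page whose results appear 0.3 seconds after the click.
start = time.monotonic()

def page_ready():
    return time.monotonic() - start >= 0.3

too_short = wait_for_content(page_ready, timeout_s=0.05)   # gives up early
long_enough = wait_for_content(page_ready, timeout_s=2.0)  # waits for the load
print(too_short, long_enough)
```

A timeout shorter than the real loading time means the next step runs against a page that has not finished updating, so it is usually safer to pick a slightly longer value.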
If you want to learn more about AJAX, here is a related tutorial you might need:
3. Create a "Loop Item" - to click into each item in the list on every page
- Select the "Loop Item" in the Workflow
- Select "Variable list" as the Loop mode
- Copy this XPath expression (without any surrounding quotes): .//div[@class="hz-br-container hz-spf-animation-container hz-br-container__products hz-br-container--unify hz-spf-animation-container--no-transition"]/div/a
- Paste it into the "Variable list" box
We need to modify the XPath so that it locates all the products on the web page correctly.
- Click "OK" to save the result
After these steps, the product items on the page are highlighted in red.
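To get a feel for what this XPath selects, we can try it outside Octoparse on a small sample. The sketch below uses Python's standard `xml.etree.ElementTree` module; the `SAMPLE` markup is a simplified, hypothetical stand-in for the Houzz results page, not the real page source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a results page.
SAMPLE = """
<html><body>
  <div class="hz-br-container hz-spf-animation-container hz-br-container__products hz-br-container--unify hz-spf-animation-container--no-transition">
    <div><a href="/product/1">Lamp</a></div>
    <div><a href="/product/2">Chair</a></div>
  </div>
  <div class="sidebar"><div><a href="/ad">Ad</a></div></div>
</body></html>
""".strip()

# Same expression as in the tutorial, with single quotes around the value.
XPATH = (".//div[@class='hz-br-container hz-spf-animation-container "
         "hz-br-container__products hz-br-container--unify "
         "hz-spf-animation-container--no-transition']/div/a")

root = ET.fromstring(SAMPLE)
links = [a.get("href") for a in root.findall(XPATH)]
print(links)  # ['/product/1', '/product/2']
```

Only the links inside the product container match; the sidebar link is skipped because its parent `div` has a different `class` value. Note that `@class="..."` is an exact string comparison, so the expression stops matching if the site changes any of those class names.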
- Drag and drop the "Click Item" step into the "Loop Item" in the Workflow
Octoparse will automatically click into the detail pages of the product items.
1. "Variable list" is one of the loop modes in Octoparse. To learn more about loop modes in Octoparse:
2. If you want to learn more about XPath and how to generate it, here is a related tutorial you might need:
4. Extract data - to select the data for extraction
- Click on the data you need on the page
- Select "Extract text of the selected element" from the "Action Tips"
- Rename the fields by selecting a name from the pre-defined list or entering your own
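Conceptually, each extracted field pairs a field name with the text of one element on the detail page. Here is a minimal Python sketch of that idea, assuming a hypothetical, simplified detail page; the `DETAIL` markup, the class names, and the field names in `FIELDS` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified product detail page.
DETAIL = """<div>
  <h1 class="title">Modern Floor Lamp</h1>
  <span class="price">$40.00</span>
  <span class="reviews">12 Reviews</span>
</div>"""

# Output field name -> XPath of the element whose text we want (assumed).
FIELDS = {
    "product_title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
    "review_count": ".//span[@class='reviews']",
}

root = ET.fromstring(DETAIL)
record = {
    # Join all text inside the element, like "Extract text of the selected element"
    name: "".join(root.find(path).itertext()).strip()
    for name, path in FIELDS.items()
}
print(record)
```

Renaming a field in Octoparse corresponds to changing a key in `FIELDS`: the element being read stays the same and only the column name in the output changes.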
5. Save and start extraction - to run the task and get data
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)