In this tutorial, we will show you how to collect product information on Houzz with Octoparse.
Alternatively, you can go to "Task Templates" on the main screen of Octoparse and start with the ready-to-use Houzz Product Template to save time. With a template, there is no need to configure a scraping task; you only need to enter the URL of the search result page. For further details, see: Task Templates
To follow along, you may want to use this URL in the tutorial:
We will scrape data such as the product title, price, and the number of reviews from the product details page with Octoparse.
Here are the main steps in this tutorial [Download demo task file here]
- "Go to Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Start extraction - run the task and get data
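The steps above amount to two nested loops: an outer loop over result pages and an inner loop over the items on each page. The sketch below is only a conceptual illustration of that structure, not how Octoparse works internally. The `PAGES` and `DETAILS` dictionaries, the product names, and the `run_task` function are hypothetical stand-ins for a real site.

```python
# Hypothetical in-memory stand-in for a paginated product listing.
# Each page lists item ids and points to the next page (None on the last page).
PAGES = {
    1: {"items": ["sofa", "lamp"], "next": 2},
    2: {"items": ["rug"], "next": None},
}

# Hypothetical stand-in for the product details pages.
DETAILS = {
    "sofa": {"title": "Sofa", "price": "$499", "reviews": 12},
    "lamp": {"title": "Lamp", "price": "$39", "reviews": 3},
    "rug": {"title": "Rug", "price": "$120", "reviews": 7},
}


def run_task(start_page):
    """Walk every page, open every item, and collect one row per product."""
    rows = []
    page = start_page
    while page is not None:                 # pagination loop
        for item in PAGES[page]["items"]:   # "Loop Item": visit each product
            rows.append(DETAILS[item])      # extract data from the details page
        page = PAGES[page]["next"]          # "click" the next page button
    return rows


rows = run_task(1)
print([row["title"] for row in rows])
```

Each step in the rest of this tutorial configures one part of this structure in the Octoparse workflow.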
1. "Go to Web Page" - to open the target web page
- Click "+ Task" to start a new task in Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, like Houzz.com, we strongly recommend using Advanced Mode for your data extraction project.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down and click the "Next page" button on the web page
- Click "Loop click next page" on the "Action Tips" panel
Octoparse can detect the AJAX technique and automatically set up AJAX Load. To ensure the web page is fully loaded in Octoparse's built-in browser, we need to set an AJAX timeout.
- Select an appropriate "AJAX Timeout" from the drop-down menu
- Click "OK" to save
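To see why the timeout value matters, here is a minimal sketch of a generic "wait until loaded or give up" loop. It assumes a polling approach and is not a description of Octoparse's actual implementation. `FakeClock`, `wait_until_loaded`, and `simulate` are hypothetical names, and the fake clock replaces real sleeping so the example runs instantly.

```python
class FakeClock:
    """Deterministic clock so the example does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def wait_until_loaded(is_loaded, timeout_s, clock, poll_s=0.5):
    """Poll is_loaded() until it returns True or timeout_s has elapsed."""
    deadline = clock.time() + timeout_s
    while True:
        if is_loaded():
            return True
        if clock.time() >= deadline:
            return False
        clock.sleep(poll_s)


def simulate(timeout_s, content_arrives_at=2.0):
    """Simulate a page whose AJAX content appears after content_arrives_at seconds."""
    clock = FakeClock()
    return wait_until_loaded(
        lambda: clock.time() >= content_arrives_at, timeout_s, clock
    )


too_short = simulate(timeout_s=1.0)    # gives up before the content arrives
long_enough = simulate(timeout_s=5.0)  # content arrives within the timeout
print(too_short, long_enough)
```

A timeout that is too short risks moving on before the new results appear, while a very long one slows down every page turn.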
3. Create a "Loop Item" - loop click into each item on each list
- Click the title of the 1st item on the list
- Click "Select all" on the "Action Tips" panel
- Select "Loop click each element" on the "Action Tips" panel
Now, you have built the loop. However, you may find that some items are not selected correctly. In this case, you need to modify the XPath for the "Loop Item".
- Select "Variable list" as the Loop mode
- Copy and paste the modified XPath expression into the text box of "Variable list"
- .//div[@class="hz-br-container hz-spf-animation-container hz-br-container__products hz-br-container--unify hz-spf-animation-container--no-transition"]/div/a
Now you should see that all the items on the list are selected correctly.
- Click "OK" to save
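To see what the XPath expression in this step selects, the sketch below evaluates it against a small, hypothetical snippet of markup using Python's standard `xml.etree.ElementTree` module, which supports this subset of XPath. The sample markup, hrefs, and the `sidebar` block are invented for illustration; the real Houzz page is much larger and may differ.

```python
import xml.etree.ElementTree as ET

# The class value from the tutorial's XPath, split only to keep lines short.
CONTAINER_CLASS = (
    "hz-br-container hz-spf-animation-container hz-br-container__products "
    "hz-br-container--unify hz-spf-animation-container--no-transition"
)

# Hypothetical, heavily simplified markup standing in for the result page.
SAMPLE = f"""
<body>
  <div class="{CONTAINER_CLASS}">
    <div><a href="/products/1">Product 1</a></div>
    <div><a href="/products/2">Product 2</a></div>
  </div>
  <div class="sidebar">
    <div><a href="/ads/1">Sponsored</a></div>
  </div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

# Same expression as in the tutorial: links that sit directly inside a child
# div of the product container.
xpath = f'.//div[@class="{CONTAINER_CLASS}"]/div/a'
links = [a.get("href") for a in root.findall(xpath)]
print(links)  # ['/products/1', '/products/2']
```

Note that `@class="..."` is an exact match on the whole attribute value, so the expression stops matching if the site adds, removes, or reorders any class name on the container.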
Note: "Variable list" is a loop mode in Octoparse.
4. Extract data - select the data for extraction
- Click on the data you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
- Rename the fields by selecting a name from the predefined list or entering your own
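Conceptually, this step pairs each field name with the element whose text should be extracted. The following sketch shows that idea on a hypothetical product details snippet. The markup, class names, and field names (`product_title`, `price`, `review_count`) are assumptions made for illustration and are not taken from the Houzz page or from Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified product details markup.
DETAIL_PAGE = """
<div class="product">
  <h1 class="title">  Mid-Century Sofa </h1>
  <span class="price">$499.00</span>
  <span class="reviews">12 Reviews</span>
</div>
""".strip()

# Field name chosen by the user -> XPath of the element to extract text from.
FIELD_XPATHS = {
    "product_title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
    "review_count": ".//span[@class='reviews']",
}

root = ET.fromstring(DETAIL_PAGE)

# "Extract text of the selected element" for each field, trimming whitespace.
row = {name: root.find(xpath).text.strip() for name, xpath in FIELD_XPATHS.items()}
print(row)
```

Descriptive field names like these make the exported data easier to read than generic default names.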
5. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.