In this tutorial, we will show you how to scrape the reviews from Trustpilot.com, a consumer review website hosting reviews of businesses worldwide.
We will use the URL below to scrape consumer reviews of Nike products:
Here are the main steps in this tutorial [Download demo task file here]:
- "Go To Web Page" - open the targeted web page
- Create a pagination loop - scrape all the reviews from multiple pages
- Create a "Loop Item" - scrape all the reviews on one page
- Extract data - select the data needed
- Run extraction - run your task and get data
- Click "+ Task" to start a new task with Advanced Mode
- Paste the URL into the "Input URL" box
- Click "Save URL" to move on
You can also start the extraction from trustpilot.com and then search with keywords. However, using the URL of the search result page directly is recommended, since it saves unnecessary steps.
- Scroll down the page and click the next page button "Next page >"
- Click "Loop click the selected link" on the "Action Tips"
- Uncheck "Retry when page remains unchanged (use discreetly for AJAX loading)"
- Check "Load the page with AJAX" and set the timeout to 10s (optional, depending on your local network conditions)
- Click "OK" to save
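For readers who want to see the idea behind the pagination loop in code, here is a minimal Python sketch. It is not how Octoparse works internally. The names `iter_pages`, `find_next_link` and `FAKE_SITE`, the `class="next"` markup and the page URLs are all made up for illustration, and a small in-memory dictionary stands in for real network requests.

```python
import re
from typing import Callable, Dict, Iterator, Optional


def iter_pages(
    start_url: str,
    fetch: Callable[[str], str],
    find_next: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield the HTML of each page, following the next-page link until it disappears."""
    url: Optional[str] = start_url
    visited = set()
    # Stop when there is no next link, when a page repeats, or at the page cap.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        html = fetch(url)
        yield html
        url = find_next(html)


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE: Dict[str, str] = {
    "/reviews?page=1": '<p>page one</p><a class="next" href="/reviews?page=2">Next page &gt;</a>',
    "/reviews?page=2": '<p>page two</p><a class="next" href="/reviews?page=3">Next page &gt;</a>',
    "/reviews?page=3": "<p>page three</p>",
}


def find_next_link(html: str) -> Optional[str]:
    """Return the href of the (assumed) next-page link, or None on the last page."""
    match = re.search(r'<a class="next" href="([^"]+)"', html)
    return match.group(1) if match else None


pages = list(iter_pages("/reviews?page=1", FAKE_SITE.__getitem__, find_next_link))
print(len(pages))
```

The `visited` set and `max_pages` cap guard against a next-page link that points back to a page already seen, which would otherwise loop forever.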
- Select any review in the built-in browser
Ad posts are often scattered through the list. If you do not want them, select only non-Ad posts so that Octoparse recognizes and highlights only the non-Ad ones.
Octoparse's built-in algorithm detects patterns based on the items you click, so it is important to make sure you select the desired data fields when creating a list.
Make sure the whole review block is selected. That is, the whole block should be highlighted in green, with all its sub-elements (title, customer name, date, content…) in red, as the following image shows:
- Click "Select all sub-elements" on the "Action Tips"
Octoparse now automatically recognizes all the similar sections on this page and highlights them in red.
- Click "Select all"
- Click "Extract data in the loop"
By default, all the selected data fields are automatically extracted. We can keep the ones we want and delete the unwanted ones in the "Customize Action" area.
- Delete the unwanted data fields
You can select/delete multiple data fields at once by pressing "Shift" or "Ctrl".
- Click any data fields that Octoparse failed to detect to extract them. Make sure the data is selected from within the current loop item
- Rename the fields by choosing from the pre-defined list or typing your own names
- Click "OK" to save
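The "Loop Item" and "Extract data" steps above amount to walking over each review block, skipping the Ad blocks, and pulling out the sub-elements as named fields. Here is a rough Python sketch of that idea. The markup, the class names (`review`, `ad`, `title`, `author`, `date`, `content`) and the sample reviews are invented for illustration and do not reflect Trustpilot's actual page structure. The snippet is deliberately well-formed so the standard-library XML parser can read it; a real page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup; NOT Trustpilot's real HTML.
SAMPLE_PAGE = """
<section>
  <article class="review">
    <h2 class="title">Great shoes</h2>
    <span class="author">Alex</span>
    <time class="date">2019-01-05</time>
    <p class="content">Comfortable and light.</p>
  </article>
  <article class="review ad">
    <h2 class="title">Sponsored listing</h2>
  </article>
  <article class="review">
    <h2 class="title">Slow delivery</h2>
    <span class="author">Sam</span>
    <p class="content">Arrived two weeks late.</p>
  </article>
</section>
"""

# Field names chosen for this sketch; rename them as needed.
FIELDS = ("title", "author", "date", "content")


def has_class(element, name):
    """True if the element's class attribute contains the given class name."""
    return name in element.get("class", "").split()


def extract_reviews(markup):
    """Return one dict per non-Ad review block, with one key per field."""
    root = ET.fromstring(markup.strip())
    rows = []
    for block in root.iter("article"):
        # Keep review blocks only, and skip the ones marked as ads.
        if not has_class(block, "review") or has_class(block, "ad"):
            continue
        row = {}
        for field in FIELDS:
            # Search only inside the current block (the "loop item").
            node = next((el for el in block.iter() if has_class(el, field)), None)
            row[field] = (node.text or "").strip() if node is not None else ""
        rows.append(row)
    return rows


reviews = extract_reviews(SAMPLE_PAGE)
for review in reviews:
    print(review)
```

Searching for each field only within the current block mirrors the advice to select data from within the selected loop item. Fields that are missing from a block come back as empty strings instead of being borrowed from a neighboring review.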
Here are the sample field names.
- Click "Save" to save the task first
- Click "Start Extraction"
- Select "Local Extraction" to run the task on your computer, or "Cloud Extraction" to run the task in the cloud (for premium users only).
Here is the output sample for your information.
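If you want to handle the extracted records in Python afterwards, the sketch below shows one possible way to write them out as CSV with the standard library. The field names and the two rows are placeholders made up for this example, not real extraction output, and `rows_to_csv` is a hypothetical helper.

```python
import csv
import io

# Placeholder field names and rows, for illustration only.
FIELD_NAMES = ["title", "author", "date", "content"]
rows = [
    {"title": "Great shoes", "author": "Alex", "date": "2019-01-05",
     "content": "Comfortable and light.", "unwanted_field": "dropped"},
    {"title": "Slow delivery", "author": "Sam", "content": "Arrived two weeks late."},
]


def rows_to_csv(records, field_names):
    """Serialize records to CSV text, keeping only the listed fields."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops keys that are not listed in field_names;
    # keys missing from a record are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=field_names, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, FIELD_NAMES)
print(csv_text)
```

To save the result to disk, the same writer can be pointed at a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer.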
Happy data hunting!