In this tutorial, we will show you how to scrape customer reviews from Trustpilot.com, which is a consumer review website hosting reviews of businesses worldwide.
We will use the link below to scrape consumers' reviews about Bank of America:
In this case, we are going to scrape all the information including username, the total number of reviews posted, location, rating, date posted, title, and review contents, as shown below.
Here are the main steps to understand before we get started: [Download demo task file here]
- "Go to Web Page" - Open the target webpage
- "Auto-detect webpage data" - Scrape all information from multiple pages
- Modify the XPath of Pagination
- "Extract data" - Select the data needed
- "Run extraction" - Run your task and collect data
1. "Go to Web Page" - Open the target webpage
- Copy the link and paste it on the Octoparse home page
2. "Auto-detect webpage data" - Scrape all information from multiple pages
- Simply click Auto-detect webpage data to conduct automatic page detection
- Click Create workflow to scrape data
- Click Click to paginate
- Set the AJAX timeout to 5s (optional; the right value depends on your local network speed, and 5-10s is recommended)
- Click Apply to save settings
Make sure the whole review block is selected: the block should be highlighted in green, with all of its sub-elements (title, username, date, etc.) highlighted in red. This ensures the fields are positioned precisely in the following section.
If the auto-detection is inaccurate, you can also drag the blocks in the workflow to adjust the order of the actions.
3. Modify the XPath of Pagination
The auto-generated XPath for the pagination does not work reliably on this page, so we replace it to make sure all the pages are scraped.
- Click on Pagination
- Replace the XPath with //a[@name="pagination-button-next"]
- Click Apply to save
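To see what the replacement XPath selects, the expression can be tried outside Octoparse. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The HTML fragment and the `find_next_link` helper are made up for illustration and are not Trustpilot's actual page source; only the `name="pagination-button-next"` attribute comes from the step above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (not Trustpilot's real HTML).
SAMPLE = """
<nav>
  <a name="pagination-button-previous" href="/review/example?page=1">Previous</a>
  <a name="pagination-button-2" href="/review/example?page=2">2</a>
  <a name="pagination-button-next" href="/review/example?page=3">Next page</a>
</nav>
"""

def find_next_link(markup):
    """Return the href of the 'next page' link, or None if there is none."""
    root = ET.fromstring(markup)
    # ElementTree needs a leading "." to search the whole tree; the predicate
    # is the same one used in //a[@name="pagination-button-next"].
    node = root.find('.//a[@name="pagination-button-next"]')
    return node.get("href") if node is not None else None

next_href = find_next_link(SAMPLE)
print(next_href)
```

Because the expression matches on the `name` attribute rather than on the link's position or text, it should keep pointing at the "next" button on every page, and it matches nothing on the last page, which is what lets a pagination loop stop.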
4. "Extract Data" - Select the data needed
- Delete or modify unwanted data fields
The data preview at the bottom shows the rough results. You can drag, rename, or delete a field by clicking on its title. Please note that Octoparse only accepts letters, numbers, and "_" in field names, so avoid other symbols when renaming a field.
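If you prepare field names ahead of time, the naming rule can be applied with a small helper. This is only a sketch: `to_field_name` is a hypothetical function, and it assumes that "letters" means ASCII letters, which the text above does not specify.

```python
import re

def to_field_name(title):
    """Keep letters, digits and underscores; collapse anything else to '_'."""
    # Assumption: only ASCII letters count as "letters" here.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", title.strip())
    # Drop underscores left over at either end, e.g. from a trailing ")".
    return cleaned.strip("_")

for raw in ["Date Posted", "Review (title)", "user-name"]:
    print(raw, "->", to_field_name(raw))
```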
Some of the extracted data shown in the picture is not in the form we want, such as "Date Posted". In this case, we want it in the "year/month/day" format. Therefore, we use Customize field and Clean data to modify the extracted content.
- Click Customize field - Extract attribute - datetime, to extract the datetime attribute from the HTML code
- Click Clean data - Add step - Reformat extracted date/time to modify content
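The two steps above amount to reading a machine-readable timestamp and printing it in a new format. The sketch below shows the same conversion in Python. The sample value and its exact layout (ISO 8601 in UTC with milliseconds) are assumptions made for illustration, and `reformat_review_date` is a hypothetical helper, not part of Octoparse.

```python
from datetime import datetime

def reformat_review_date(raw):
    """Convert a timestamp like '2021-03-15T10:22:31.000Z' to year/month/day."""
    # Assumed input layout: date, 'T', time with fractional seconds, literal 'Z'.
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y/%m/%d")

print(reformat_review_date("2021-03-15T10:22:31.000Z"))  # prints 2021/03/15
```

If the attribute on the real page uses a different layout, only the format string passed to `strptime` would need to change.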
5. "Run extraction" - Run your task and collect data
- Click Save and Run in the top right corner
- Select Run on your device to run the task locally, or Run in the Cloud to run it in the cloud (for premium users only)
Here is the output sample for your information: