You are browsing a tutorial for the latest version of Octoparse. If you are running an older version, we strongly recommend upgrading, because the latest version is faster, easier to use and more robust.
In this tutorial, we will show you how to scrape customer reviews from Trustpilot.com, which is a consumer review website hosting reviews of businesses worldwide.
We will use the link below to scrape customer reviews of Bank of America:
In this case, we are going to scrape the following information for each review: username, total number of reviews posted, location, rating, date posted, title, and review content.
Here are the main steps that you need to understand before we get started: [Download demo task file here]
1. "Go to Web Page" - Open the target webpage
Simply copy the link and paste it on the Octoparse home page
2. "Auto-detect webpage data" - Scrape all information from multiple pages
Simply click Auto-detect webpage data to conduct automatic page detection
Click Create workflow to scrape data
Click "Click to paginate"
Set the AJAX timeout to 5s (optional; the right value depends on your local network speed, and 5-10s is recommended)
Click Apply to save settings
Make sure the whole review block is selected: the entire block should be highlighted in green, and all of its sub-elements (title, username, date, etc.) should be highlighted in red. This ensures precise positioning in the following section.
If the auto-detection is inaccurate, you can also drag the blocks in the workflow to reorder the actions.
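The reason for selecting the whole review block first, and its sub-elements second, can be sketched in Python. This is only an illustration of the idea, not how Octoparse works internally; the HTML snippet, the class names and the field names are hypothetical and heavily simplified.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two review blocks with sub-elements.
PAGE = """
<div>
  <article class="review">
    <span class="username">Alice</span>
    <h2 class="title">Great service</h2>
    <p class="content">Quick and helpful.</p>
  </article>
  <article class="review">
    <span class="username">Bob</span>
    <h2 class="title">Slow response</h2>
    <p class="content">Waited two weeks.</p>
  </article>
</div>
"""


def extract_reviews(page):
    """Return one dict per review block found in the page."""
    root = ET.fromstring(page.strip())
    rows = []
    # Outer selection: one match per whole review block.
    for block in root.findall(".//article[@class='review']"):
        # Inner selections are relative to the block, so fields from
        # different reviews cannot get mixed up.
        rows.append({
            "username": block.findtext("./span[@class='username']"),
            "title": block.findtext("./h2[@class='title']"),
            "content": block.findtext("./p[@class='content']"),
        })
    return rows


reviews = extract_reviews(PAGE)
```

Because each field is looked up inside its own block, every row keeps the username, title and content of a single review together.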
3. Modify the XPath of Pagination
The auto-generated XPath does not work well here, so we modify the XPath of the Pagination step to make sure all pages are scraped.
Click on Pagination
Replace the XPath with //a[@name="pagination-button-next"]
Click Apply to save
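To see what this XPath does, here is a minimal Python sketch that keeps following the "next" link until no element matches. It assumes a set of hypothetical in-memory pages and URLs rather than the live site, and it uses the standard library's ElementTree, which needs the path written as `.//a[...]` instead of `//a[...]` when searching from an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages keyed by URL; the last page has no "next" button.
PAGES = {
    "/review/example?page=1":
        '<div><a name="pagination-button-next" href="/review/example?page=2">Next</a></div>',
    "/review/example?page=2":
        '<div><a name="pagination-button-next" href="/review/example?page=3">Next</a></div>',
    "/review/example?page=3":
        "<div><span>No more pages</span></div>",
}

# Same predicate as in the tutorial, made relative for ElementTree.
NEXT_XPATH = './/a[@name="pagination-button-next"]'


def crawl(start_url):
    """Visit pages in order, following the next-page link until it is absent."""
    visited = []
    url = start_url
    while url is not None and url not in visited:
        visited.append(url)
        root = ET.fromstring(PAGES[url])
        next_link = root.find(NEXT_XPATH)
        # Stop when the XPath matches nothing, i.e. on the last page.
        url = next_link.get("href") if next_link is not None else None
    return visited


visited_pages = crawl("/review/example?page=1")
```

Matching on the `name` attribute targets only the next-page button, so the loop ends cleanly on the last page instead of clicking some other link.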
4. "Extract Data" - Select the data needed
Delete or modify unwanted data fields
A rough data preview is shown at the bottom. You can drag, rename or delete a field by clicking on its title. Please also note that Octoparse only accepts letters, numbers and "_" in field names, so avoid typing other symbols when renaming a field.
However, some of the extracted data is not in the form we want, such as "Date Posted", which we want in the "year/month/day" format. Therefore, we need to use Customize field and Clean data to modify the extracted content.
Click Customize field - Extract attribute - datetime, to extract the designated attribute from the HTML code
Click Clean data - Add step - Reformat extracted date/time to modify the content
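If you prepare field names outside Octoparse, the naming rule can be mirrored with a small helper. This is a sketch only: `clean_field_name` is a hypothetical function, and it assumes that "letters" means ASCII letters.

```python
import re


def clean_field_name(name):
    """Keep only letters, digits and "_" in a field name."""
    # Replace each run of disallowed characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name)
    # Drop underscores left over at either end.
    return cleaned.strip("_")


field_names = [clean_field_name(n) for n in ["Date Posted", "Review #", "user_name"]]
```

For the three sample titles, this yields `Date_Posted`, `Review` and `user_name`.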
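The effect of the two steps above can be sketched in Python: take the value of the `datetime` attribute and reformat it as "year/month/day". The sample value is hypothetical and assumes the attribute holds an ISO 8601 timestamp; `reformat_date` is an illustrative helper, not part of Octoparse.

```python
from datetime import datetime


def reformat_date(value):
    """Convert an ISO 8601 timestamp to the year/month/day format."""
    # Replace a trailing "Z" with an explicit UTC offset, because older
    # Python versions do not accept "Z" in fromisoformat().
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.strftime("%Y/%m/%d")


# Hypothetical value of a datetime attribute.
date_posted = reformat_date("2021-03-15T10:22:31.000Z")
```

Here `date_posted` is `2021/03/15`.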
Tips: In this section, we have mentioned two key points that you need to understand when doing data cleaning. To learn more, please click on the following titles:
5. "Run extraction" - Run your task and collect data
Click Save and Run in the top right corner
Select Run on your device to run the task locally, or Run in the Cloud to run it in the cloud (for premium users only).
Here is the output sample for your information: