You are reading a tutorial for the latest version of Octoparse. If you are running an older version, we strongly recommend that you upgrade, because the latest version is faster, easier to use and more robust. Download and upgrade here if you haven't already done so!

In this tutorial, we will show you how to scrape customer reviews from Trustpilot.com, which is a consumer review website hosting reviews of businesses worldwide.

We will use the link below to scrape consumers' reviews about Bank of America:

https://www.trustpilot.com/review/www.bankofamerica.com

In this case, we are going to scrape all the information including username, the total number of reviews posted, location, rating, date posted, title, and review contents, as shown below.

2021-09-06_10-42-56.png

Here are the main steps that you need to understand before we get started: [Download demo task file here]

  1. "Go to Web Page" - Open the target webpage

  2. "Auto-detect webpage data" - Scrape all information from multiple pages

  3. Modify the XPath of Pagination

  4. "Extract data" - Select the data needed

  5. "Run extraction" - Run your task and collect data


1. "Go to Web Page" - Open the target webpage

  • Simply copy the link above and paste it on the Octoparse home page

12345.gif

2. "Auto-detect webpage data" - Scrape all information from multiple pages

  • Simply click Auto-detect webpage data to conduct automatic page detection

  • Click Create workflow to scrape data

ooooooooooooooooooo.gif
  • Click Click to paginate

  • Set the AJAX timeout to 5s (optional; the right value depends on your local network speed, and 5-10s is recommended)

hhhhhhhhhhh.gif
  • Click Apply to save settings

2021-09-07_15-04-40.png

We have to make sure the whole review block is selected: the entire block should be highlighted in green, with all of its sub-elements (title, username, date, etc.) highlighted in red. This ensures precise positioning in the following sections.

2021-09-03_16-52-23.png

If the auto-detection is inaccurate, you can also drag the blocks in the workflow to reorder the actions.

___________________.gif

3. Modify the XPath of Pagination

The auto-generated XPath for the Pagination does not work reliably on this site. We can modify the XPath to make sure we scrape all the pages.

  • Click on Pagination

  • Replace the XPath with //a[@name="pagination-button-next"]

  • Click Apply to save

modify_pagination.jpg
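To see what this XPath selects, here is a minimal Python sketch that runs outside Octoparse. The sample markup is hypothetical and simplified (it is not copied from Trustpilot), and the helper name find_next_link is our own. It only illustrates that the expression matches the link whose name attribute is "pagination-button-next" and ignores the other pagination links.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (an assumption for illustration).
SAMPLE_HTML = """
<nav>
  <a name="pagination-button-previous" href="/review/www.bankofamerica.com">Previous</a>
  <a name="pagination-button-2" href="/review/www.bankofamerica.com?page=2">2</a>
  <a name="pagination-button-next" href="/review/www.bankofamerica.com?page=2">Next page</a>
</nav>
"""


def find_next_link(markup):
    """Return the href of the next-page link, or None if there is none."""
    root = ET.fromstring(markup)
    # ElementTree needs a leading "." for a search through the whole tree;
    # ".//a[...]" here plays the role of "//a[...]" in Octoparse.
    link = root.find(".//a[@name='pagination-button-next']")
    return None if link is None else link.get("href")


print(find_next_link(SAMPLE_HTML))
```

When the last page is reached and no element matches, the helper returns None, which is the point at which a pagination loop would stop.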

4. "Extract data" - Select the data needed

  • Delete or modify unwanted data fields

A rough data preview is shown at the bottom. You can drag, rename or delete a field by clicking on its title. Please also note that Octoparse only accepts letters, numbers and "_" in field names, so please avoid typing other symbols when renaming a field.

2021-09-06_15-16-13.png
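If you prepare field names in advance, the naming rule above can be checked with a short Python sketch. This is not part of Octoparse; the helper name sanitize_field_name is hypothetical, and we assume that "letters" means ASCII letters.

```python
import re


def sanitize_field_name(name):
    """Replace every run of characters other than letters, digits and "_"
    with a single underscore, then trim underscores from both ends."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    return cleaned.strip("_")


print(sanitize_field_name("Date Posted"))   # Date_Posted
print(sanitize_field_name("Rating (1-5)"))  # Rating_1_5
```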

However, some of the extracted data shown in the picture is not in the form we want, such as the "Date Posted". In this case, we want it to be in the "year/month/day" format. Therefore, we need to use Customize field and Clean data to modify the extracted content.

  • Click Customize field - Extract attribute - datetime to extract the datetime attribute from the element's HTML code

_____________.gif
  • Click Clean data - Add step - Reformat extracted date/time to modify content

mmmmmmmmmmmmmmm.gif
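The two steps above (read the datetime attribute, then reformat it) can be sketched in Python as follows. The sample element and its attribute value are assumptions made for illustration, and the helper name extract_date is our own. Octoparse performs the equivalent steps inside the task, so this only shows the transformation.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical review-date element; the attribute value is made up.
SAMPLE_TIME = '<time datetime="2021-09-03T16:52:23.000Z">Sep 3, 2021</time>'


def extract_date(markup):
    """Read the datetime attribute and return it as year/month/day."""
    element = ET.fromstring(markup)
    raw = element.get("datetime")
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so it is rewritten as an explicit UTC offset first.
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return parsed.strftime("%Y/%m/%d")


print(extract_date(SAMPLE_TIME))  # 2021/09/03
```

Reading the attribute instead of the visible text is what makes the reformatting step dependable, since the attribute keeps one machine-readable format regardless of how the date is displayed on the page.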

Tips: This section uses two key features that you need to understand when doing data cleaning: extracting an attribute with Customize field, and reformatting the extracted date/time with Clean data.


5. "Run extraction" - Run your task and collect data

  • Click Save and Run in the top right corner

  • Select Run on your device to run the task locally, or Run in the Cloud to run the task in the cloud (for premium users only).

______________________.gif

Here is the output sample for your information:

2021-09-03_19-24-05.png