You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

In this tutorial, we show how to scrape customer review data from Tripadvisor. We will scrape each hotel's basic information, the reviewers' names, and the customers' comments.

To follow along, you can use the URL below:

https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html

Here are the main steps in this tutorial: [Download demo task file here]

  1. Go to Web Page - to open the target web page

  2. Auto-detect the web page - to create a workflow

  3. Click on links to scrape reviews

  4. Create a "Loop Item" - to scrape review information

  5. Create pagination - to scrape all reviews from multiple pages

  6. Customize the data field by modifying XPath - to improve the accuracy of certain data fields

  7. Data cleaning - to reformat data fields

  8. Start extraction - to run the task and get the data


1. Go to Web Page - to open the target web page

  • Enter the URL on the home page and click Start


2. Auto-detect the web page - to create a workflow

  • Click Auto-detect web page data and wait for the detection to complete

  • Go to Data preview to see if you're okay with the current data output

  • Delete or rename the fields if needed

  • Uncheck the Add a page scroll option

  • Click Create workflow


3. Click on links to scrape reviews

  • Select Click on link(s) to scrape the linked page(s)

  • Choose the Title URL and Confirm


The auto-generated action Click URLs in the list cannot always click the title URL, so we need to modify the XPath of this action. (To learn more about XPath, please check here)

  • Click Click Item and input the XPath: //a[contains(@class,"property_title prominent")]


If Octoparse does not open the first hotel page after we save the XPath, we can click on another action (for example, the Extract Data action), then click Click URLs in the list to open the hotel page.
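To see why this XPath targets only the hotel title links, note that contains(@class, "...") is a plain substring test on the whole class attribute. The Python sketch below imitates that test on a small, hypothetical piece of markup (not Tripadvisor's real HTML). The names SAMPLE_LIST and title_links are ours, for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real listing page is far larger.
SAMPLE_LIST = """
<div>
  <a class="property_title prominent" href="/Hotel_Review-1">Hotel A</a>
  <a class="review_count" href="/Hotel_Review-1#REVIEWS">1,234 reviews</a>
  <a class="property_title prominent" href="/Hotel_Review-2">Hotel B</a>
</div>
"""

def title_links(markup, needle="property_title prominent"):
    """Return (text, href) for every <a> whose class attribute contains `needle`.

    This mirrors //a[contains(@class, "property_title prominent")]:
    a substring match on the attribute value, not a per-class match.
    """
    root = ET.fromstring(markup)
    return [
        (a.text, a.get("href"))
        for a in root.iter("a")
        if needle in a.get("class", "")
    ]

print(title_links(SAMPLE_LIST))
```

Because the test is a substring match, a link whose classes appear in a different order (for example "prominent property_title") would not be selected.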


4. Create a "Loop Item" - to scrape reviews

You may want to know which hotel these reviews are for. We can scrape the hotel information along with the reviews.

  • Click each piece of hotel data you need, and click Extract the text of the selected element for each one

  • Scroll down the page, select the first 2 reviews and click Extract the text of the selected elements

  • Select each data item (Username and Comment) separately and click Extract the text of the selected link


5. Create pagination - to scrape all the reviews from multiple pages

  • Scroll down to click the Next button and choose Loop click next page

  • Set up AJAX (to learn more about AJAX, please click here)
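Conceptually, a "Loop click next page" action extracts the current page, clicks Next, and repeats until there is no Next button left. The Python sketch below shows only that control flow; it is not how Octoparse is implemented. The names scrape_all_pages, load_page and FAKE_PAGES are hypothetical, and the page data is made up.

```python
def scrape_all_pages(load_page, max_pages=100):
    """Collect rows page by page until there is no next page.

    `load_page(n)` is a hypothetical callable that returns
    (rows_on_page, next_page_number_or_None).
    """
    rows = []
    page = 1
    while page is not None and page <= max_pages:
        page_rows, next_page = load_page(page)
        rows.extend(page_rows)
        page = next_page  # None plays the role of "no Next button"
    return rows

# Made-up three-page data set standing in for the review pages.
FAKE_PAGES = {
    1: (["review 1", "review 2"], 2),
    2: (["review 3", "review 4"], 3),
    3: (["review 5"], None),
}

all_rows = scrape_all_pages(lambda n: FAKE_PAGES[n])
print(all_rows)
```

The max_pages guard is a design choice in this sketch so that a page that always reports a next page cannot loop forever.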


6. Customize the data field by modifying XPath - to improve the accuracy of data fields

As mentioned above, the auto-generated XPath does not always work, so we need to modify the XPath of some fields to make the scraping more precise. We have prepared the XPath for each field below. You can simply copy and paste them.

  • Phone number: //div[@data-blcontact="PHONE "]

  • Address: //span[contains(@class,'map')]/following-sibling::span[1]

  • Number of reviews: //a[@href="#REVIEWS"]

  • Reviewer name: //a[contains(@class, "header_link")]

  • Review title: //div[@data-test-target="review-title"]

  • Review rating: //div[@data-test-target="review-rating"]

  • Date of stay: //span[contains(text(),"Date of stay:")]/..

  • Review content: //div[@data-test-target="review-title"]/following-sibling::div[1]

  • Switch Horizontal view to Vertical View

  • Copy and paste the XPath we provided into each field
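To see what two of these expressions select, here is a minimal Python sketch run against hypothetical, simplified markup (not Tripadvisor's real HTML). It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: attribute-equality predicates such as [@data-test-target="review-title"] work, but following-sibling does not, so the sketch emulates following-sibling::div[1] by walking the parent's children. The names SAMPLE_REVIEWS and extract_reviews are ours, for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified review markup for illustration.
SAMPLE_REVIEWS = """
<div id="reviews">
  <div class="review">
    <a class="ui_header_link">alice</a>
    <div data-test-target="review-title">Great stay</div>
    <div>Clean rooms and friendly staff.</div>
  </div>
  <div class="review">
    <a class="ui_header_link">bob</a>
    <div data-test-target="review-title">Noisy at night</div>
    <div>Street noise kept us awake.</div>
  </div>
</div>
"""

def extract_reviews(markup):
    """Return one {"title", "content"} dict per review block."""
    root = ET.fromstring(markup)
    rows = []
    for review in root.findall(".//div[@class='review']"):
        # Same idea as //div[@data-test-target="review-title"]
        title = review.find("./div[@data-test-target='review-title']")
        if title is None:
            continue
        # Emulate following-sibling::div[1]: the first <div> after the title
        children = list(review)
        after_title = children[children.index(title) + 1:]
        content = next((c for c in after_title if c.tag == "div"), None)
        rows.append({
            "title": title.text,
            "content": content.text if content is not None else None,
        })
    return rows

for row in extract_reviews(SAMPLE_REVIEWS):
    print(row)
```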


7. Data cleaning - to reformat the fields

For the "Rating", "Date of Stay" and "Review time" fields, you might find that modifying the XPath alone cannot get the exact data you want. In that case, we can use Clean Data to solve the problem. To learn more about Clean Data, please click here.

  • Make sure to click on Extract the outer HTML of the selected element when extracting data (for the 3 fields we mentioned above)

  • Click ... -> Clean data -> Add Step -> Match with Regular Expression

  • Choose Try RegEx Tool

  • Input 'rating bubble_' in Start with and '"' in End with

  • Click Generate and Apply

  • Add a step of Replace with Regular Expression

  • Input the expression ([0-9]+)([0-9]{1})

  • Input $1.$2 in With
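The two cleaning steps above can be reproduced outside Octoparse to check what they do. The Python sketch below is an approximation under stated assumptions: the sample outer HTML is hypothetical, the first pattern is our own equivalent of "Start with 'rating bubble_', End with '"'" (Octoparse may generate a different expression), and Python writes the replacement as \1.\2 where the tool uses $1.$2. The name clean_rating is ours.

```python
import re

def clean_rating(outer_html):
    """Turn e.g. class="... rating bubble_45" into the string "4.5"."""
    # Step 1: keep the text between 'rating bubble_' and the next '"'.
    match = re.search(r'rating bubble_(.*?)"', outer_html)
    if match is None:
        return None
    digits = match.group(1)
    # Step 2: same expression as in the tutorial; it splits off the last
    # digit so a decimal point can be inserted before it.
    return re.sub(r"([0-9]+)([0-9]{1})", r"\1.\2", digits)

# Hypothetical outer HTML of a rating element.
sample = '<span class="ui_bubble_rating bubble_45"></span>'
print(clean_rating(sample))  # prints 4.5
```

The first group is greedy but must leave one digit for the second group, so "45" becomes "4.5" and "50" becomes "5.0".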


8. Start extraction - to run the task and get the data

  • Click Save

  • Click Run on the upper left side

  • Select Run on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)


Here is the sample output.

754ebac9352b0a422e7a8a42721f382.png