In this tutorial, we show how to scrape information from TripAdvisor.com.
Alternatively, you can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use TripAdvisor templates to save time. With this feature, there is no need to configure scraping tasks. For further details, see: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial.
To follow through, you might want to use the URL in this tutorial:
We will scrape the hotel title, location, price, and rating from the hotel page with Octoparse.
This tutorial will also cover:
- Reformatting the star rating into numerals with the RegEx tool in Octoparse
Main steps in the tutorial: [Download demo task file here ]
- "Go To Web Page" - open the targeted web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Customize the data field by modifying XPath - improve the accuracy of a certain data field (Optional)
- Customize the data field using the RegEx tool - reformat rating data (Optional)
- Start extraction - run the task and get data
1. "Go To Web Page" - open the targeted web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Airbnb.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Website" box and click "Save URL" to move on
Because of TripAdvisor's cookie settings, we need to configure the search filters in Octoparse.
- Select a "Check-in" date in the built-in browser and click "Click Element" on the "Action Tips"
- Repeat the actions to configure the "Check-out" date and "Guest Information"
Now we have the result page we need.
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down and click the "Next Page" button on the web page
- Click "Loop click next page" on the "Action Tips" panel
As TripAdvisor loads its content with AJAX, we should set up AJAX Load for the "Click to paginate" action.
- Uncheck "Auto retry when no response"
- Check "Load the page with AJAX"
- Set "AJAX Timeout"; in this case, we use 5 seconds
- Click "Save"
If you want to learn more about AJAX, here are related tutorials you might need:
3. Create a "Loop Item" - loop click into each item on each list
We are now on the second page. When creating a "Loop Item", we should always start with the first item on the first page, so we need to go back to the first page.
- Click "Go To Web Page" in the workflow
- Delete the three "Click item" actions
Octoparse sends the saved cookie to the website when loading, so we can open the result page directly. As TripAdvisor has already "remembered" us, there’s no need to keep these actions.
- Select the pagination loop in the workflow
Selecting the pagination loop first helps Octoparse decide the execution order and generate the "Loop Item" at the appropriate position in the workflow.
Now, let’s build the loop item.
- Click the title of the first item on the listing page, skipping any "Sponsored" items
- Click "Select All" on the "Action Tips" panel
- Select "Loop click each URL"
Octoparse will automatically generate the loop and open the detail page of the 1st item.
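The execution order described above, with the item loop nested inside the pagination loop, can be sketched in plain Python. This is only an illustration of the loop structure, not how Octoparse works internally; the `pages` data, its field names, and the `detail_urls` helper are hypothetical, and it assumes sponsored entries are left out of the loop.

```python
# Hypothetical listing data: two result pages, each a list of items.
pages = [
    [
        {"title": "Sponsored Inn", "url": "/hotel/0", "sponsored": True},
        {"title": "Hotel A", "url": "/hotel/1", "sponsored": False},
        {"title": "Hotel B", "url": "/hotel/2", "sponsored": False},
    ],
    [
        {"title": "Hotel C", "url": "/hotel/3", "sponsored": False},
    ],
]


def detail_urls(pages):
    """Yield detail-page URLs in the order the workflow would visit them."""
    for page in pages:              # outer loop: pagination
        for item in page:           # inner loop: the "Loop Item"
            if item["sponsored"]:   # assumption: sponsored entries are skipped
                continue
            yield item["url"]


urls = list(detail_urls(pages))
print(urls)
```

Every item on a page is visited before the workflow moves to the next page, which is why the "Loop Item" has to sit inside the pagination loop.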
4. Extract data - select the data for extraction
- Click the information you need on the page
- Select "Extract data" on the "Action Tips" panel
- Rename the fields by selecting from the predefined list or inputting on your own
When you click on the rating of the listing, choose "Extract outer HTML of the selected element". The extracted data needs to be processed further with a regular expression. See how it's done in Step 6.
5. Customize the data field by modifying XPath - improve the accuracy of a certain data field (Optional)
In this case, the "Address" element is not always located in the same place on different detail pages. To avoid missing data caused by this irregular placement, we need to modify the XPath in Octoparse so that the "Address" element is detected precisely on each page.
Let's revise the XPath of the "Address" data field:
- Click the row of "Address" field
- Click the icon of "Customize data field"
- Select "Customize XPath"
- Paste the revised XPath into the "Matching XPath" text box
- Revised XPath: //div[contains(@class,'address')]/span
- Click "OK" to save the result
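To see why the revised XPath is more robust, here is a minimal Python sketch of what `//div[contains(@class,'address')]/span` selects. The standard library's ElementTree does not support `contains()`, so the class check is done by hand; the two page fragments and the `find_address` helper are hypothetical and only stand in for real detail pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail-page fragments: the address block sits in different places.
page_a = (
    '<html><body><div class="header">'
    '<div class="hotel-address"><span>1 Main St</span></div>'
    '</div></body></html>'
)
page_b = (
    '<html><body><section>'
    '<div class="address-block"><span>22 Rue Exemple</span></div>'
    '</section></body></html>'
)


def find_address(markup):
    """Mimic //div[contains(@class,'address')]/span on well-formed markup."""
    root = ET.fromstring(markup)
    for div in root.iter("div"):                # any <div>, at any depth
        if "address" in div.get("class", ""):   # contains(@class, 'address')
            span = div.find("span")             # direct child <span>
            if span is not None:
                return (span.text or "").strip()
    return None


print(find_address(page_a))
print(find_address(page_b))
```

Because the match depends on the class name rather than on the element's position, the same expression finds the address in both layouts.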
To improve the accuracy of a certain data field, modifying XPath in Octoparse is highly recommended. Here are some related tutorials you might need:
6. Customize the data field using RegEx tool - reformat rating data (Optional)
When the data we want is not shown as readable text on the web page, we need to extract its source code (HTML) first, and then process the extracted source code into our desired format.
- Select the "Rating" field to be modified
- Click "Customize data field"
- Select "Refine extracted data", click "Add step", and then select "Match with Regular Expression"
- Select "Try RegEx Tool"
- Check the box for "Start With" and enter: alt="
- Check the box for "End With" and enter: of 5 bu (Please note the blank space)
- Click "Generate" and "Match"
- Click "Apply" and "OK"
- Click "OK" to save
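The "Start With" and "End With" settings amount to a regular expression that captures the text between the two markers. A minimal Python sketch of the same idea follows; the sample outer HTML is hypothetical (the real markup may differ), and the pattern assumes a blank space before "of".

```python
import re

# Hypothetical outer HTML of a rating element; the real markup may differ.
sample_html = '<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>'

# Capture whatever sits between alt=" and " of 5 bu" (note the leading space).
pattern = re.compile(r'alt="(.*?) of 5 bu')


def extract_rating(outer_html):
    """Return the rating as a float, or None if the markers are not found."""
    match = pattern.search(outer_html)
    return float(match.group(1)) if match else None


rating = extract_rating(sample_html)
print(rating)
```

Returning `None` when nothing matches mirrors the blank fields you may see in the output for listings that have no rating.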
7. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
Here is the sample output. You may find some blank fields; these occur when the source page has no value for that field.