Scrape hotel data from Tripadvisor
In this tutorial, we will show you how to collect hotel information on Tripadvisor.com with Octoparse.
We will demonstrate how to scrape hotel details starting from a listing URL. Please note that Octoparse can start from either keywords or URLs.
The easiest way to achieve this goal is to use the Tripadvisor pre-set template. You can find the Tripadvisor icon on the main screen of the Octoparse scraping tool. All you need to do is enter a few parameters and the task is ready to go. For further details, see: Task Templates
If you want to know how to build a task from scratch with Octoparse, please follow the steps below. We will scrape data including the hotel name, location, description, and rating on the hotel details page with Octoparse.
To follow along, you can use this URL:
https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html
Here are the main steps in this tutorial: [Download demo task file here]
- Go to Web Page - open the target web page
- Auto-detect the web page - create a workflow
- Click into each detail link to scrape more information
- Extract Data - extract data on the detail pages
- Set up wait time - slow down the scraping speed
- Modify the XPath of the "Click URLs in the list"
- Start extraction - run the task and get data
1) Go to Web Page - open the target web page
- Enter the URL on the home page and click Start
2) Auto-detect the web page - create a workflow
- Click "Auto-detect web page data" and wait for the detection to complete
- Go to "Data preview" to see if you're okay with the current data output
- You can delete unnecessary data fields directly by clicking the icon
- You can also modify the data field names here directly by clicking the icon
- Click "Create workflow"
If the data you need can all be scraped from the listing page, you can just jump to Set up wait time to slow down the scraping speed. If you want to click into each detail link to get more information, please follow the next step.
3) Click into each detail link to scrape more information
- Choose "Click on link(s) to scrape the linked page(s)" on the Tips panel
- Select "Click on an extracted data field" and select the one you want to click on from the drop-down menu (you can confirm if it's the correct link on the Data Preview)
- Click "Confirm"
Octoparse will automatically go to the first detail page.
4) Extract Data - extract data on the detail pages
- Select information on the web page
- Choose "Extract text of the selected element"
- Repeat the above steps to extract all the data you need
- Edit the name of the data field if needed
5) Set up wait time to slow down the scraping speed
Tripadvisor might block your IP if you send too many requests in a short time, so we need to control the scraping speed.
- Open the action settings of the "Extract Data1" action
- Tick "Wait before action"
- Set the wait time to between 5 and 10 seconds
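To make the idea behind "Wait before action" concrete, here is a minimal Python sketch of pausing before each page visit. It is not how Octoparse is implemented. The function names `wait_before_action` and `visit_all` are hypothetical, and picking a random delay within the 5 to 10 second range is an assumption of this sketch.

```python
import random
import time


def wait_before_action(low=5.0, high=10.0, sleep=time.sleep):
    """Pause for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def visit_all(urls, visit, low=5.0, high=10.0, sleep=time.sleep):
    """Call visit(url) for each URL, waiting before every call."""
    results = []
    for url in urls:
        # Slowing down here is what keeps the request rate low.
        wait_before_action(low, high, sleep)
        results.append(visit(url))
    return results
```

The `sleep` parameter can be swapped for a stub, so you can try the loop without actually waiting.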
6) Modify the XPath of the "Click URLs in the list"
The auto-generated action "Click URLs in the list" does not always click the title URL, so we need to modify the XPath of this action. (To learn more about XPath, please check here)
- Double-click the "Click URLs in the list" action
- Click the icon
- Enter the XPath //A[contains(@class,"property_title prominent")]
- Click "OK" to confirm
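If you are curious what this XPath selects, the sketch below imitates it in plain Python. `contains(@class, "...")` is a substring test on the whole class attribute, so it matches a link whose class string includes "property_title prominent" in that exact order. The sample markup and the class name `TitleLinkFinder` are made up for illustration; the real Tripadvisor page may look different.

```python
from html.parser import HTMLParser


class TitleLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose class attribute contains a substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # HTMLParser reports tag names in lowercase
            return
        attr_map = dict(attrs)
        # Same idea as contains(@class, needle): a plain substring check.
        if self.needle in (attr_map.get("class") or ""):
            self.hrefs.append(attr_map.get("href"))


# Hypothetical listing markup, for illustration only.
SAMPLE_HTML = """
<div>
  <a class="property_title prominent" href="/Hotel_Review-1">Hotel One</a>
  <a class="review_count" href="/Hotel_Review-1#REVIEWS">1,234 reviews</a>
  <a class="property_title prominent extra" href="/Hotel_Review-2">Hotel Two</a>
  <a class="prominent property_title" href="/Hotel_Review-3">Hotel Three</a>
</div>
"""

finder = TitleLinkFinder("property_title prominent")
finder.feed(SAMPLE_HTML)
print(finder.hrefs)  # ['/Hotel_Review-1', '/Hotel_Review-2']
```

The review link and the link with the classes in reverse order are skipped, which is why this XPath targets only the hotel title links.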
7) Start extraction - run the task and get data
- Click "Save"
- Click "Run" on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run task in the Cloud" to run the task in the Cloud (for premium users only)
Here is the sample output.
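Once you have the data, you may want to check it for duplicate or incomplete rows. The following is a minimal sketch that assumes you exported the results as CSV. The column names (`hotel_name`, `location`, `description`, `rating`) and the rows are placeholders, not real output; rename them to match your own data fields.

```python
import csv
import io

# Placeholder rows standing in for an exported file; column names are assumptions.
SAMPLE_EXPORT = """hotel_name,location,description,rating
Example Hotel A,"123 Example St, New York City",A placeholder description.,4.5
Example Hotel B,"456 Sample Ave, New York City",,4.0
Example Hotel A,"123 Example St, New York City",A placeholder description.,4.5
"""


def summarize(rows, required=("hotel_name", "location", "description", "rating")):
    """Return (unique_rows, incomplete_rows) from an iterable of dict rows."""
    seen = set()
    unique, incomplete = [], []
    for row in rows:
        key = tuple(row.get(col, "") for col in required)
        if key in seen:  # skip exact repeats of an earlier row
            continue
        seen.add(key)
        unique.append(row)
        if any(not (row.get(col) or "").strip() for col in required):
            incomplete.append(row)
    return unique, incomplete


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
unique, incomplete = summarize(rows)
print(len(rows), len(unique), len(incomplete))  # 3 2 1
```

To use it on a real file, replace `io.StringIO(SAMPLE_EXPORT)` with an opened file, for example `open("your_export.csv", newline="", encoding="utf-8")`.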
Contact us anytime if you need our help!
Writer: Yanni
Editor: Yina