In this tutorial, we are going to show you how to scrape hotel information from Booking.com.
Alternatively, you can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Booking Template directly. This saves time because you do not need to configure the scraping task yourself. For further details, you may check it out here: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial.
To follow along, you may want to use this URL in the tutorial:
We will scrape data such as the hotel name, rating, address, one photo, and room facilities with Octoparse.
1. It is recommended that you use the URL of the search result page directly whenever possible. Adding keywords/filters within Octoparse can complicate the task and lead to less efficient scraping.
2. The structure and display of Booking.com might vary depending on your IP and preferred language.
Here are the main steps in this tutorial: [Download demo task file here]
- "Go To Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Start extraction - run the task and get data
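To make the shape of this workflow easier to picture, here is a minimal Python sketch of the same nesting: a pagination loop on the outside and an item loop on the inside. It is only an illustration of the logic, not how Octoparse is implemented. The `PAGES` and `DETAILS` dictionaries, the field names, and the `run_task` function are all hypothetical stand-ins for real web pages.

```python
# Hypothetical in-memory stand-ins for search result pages and detail pages.
# None of this is real Booking.com data.
PAGES = {
    "page1": {"items": ["hotel-a", "hotel-b"], "next": "page2"},
    "page2": {"items": ["hotel-c"], "next": None},
}
DETAILS = {
    "hotel-a": {"name": "Hotel A", "rating": "8.1"},
    "hotel-b": {"name": "Hotel B", "rating": "9.0"},
    "hotel-c": {"name": "Hotel C", "rating": "7.4"},
}


def run_task(start_page):
    """Walk every result page and collect one row per listed hotel."""
    rows = []
    page_id = start_page                      # "Go To Web Page"
    while page_id is not None:                # pagination loop
        page = PAGES[page_id]
        for item_id in page["items"]:         # "Loop Item"
            detail = DETAILS[item_id]         # open the detail page
            rows.append({                     # extract data
                "name": detail["name"],
                "rating": detail["rating"],
            })
        page_id = page["next"]                # follow the next page button
    return rows


rows = run_task("page1")
for row in rows:
    print(row["name"], row["rating"])
```

The loop stops when a page has no next page, which mirrors the task ending once the ">" button can no longer be found.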
1. "Go To Web Page" - open the target web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Airbnb.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down the page and click the next page button ">"
- Click "Loop click the selected link" from "Action Tips"
3. Create a "Loop Item" - loop click into each item on each list
- Click "Go To Web Page" to go to the first page
When extracting data across multiple pages, you should always begin building your task from the first page.
- Click the title of the 1st item in the list
- Click "Select all" on "Action Tips" if Octoparse detects all the elements you want
- Otherwise, click the title of the 2nd item in the list
Octoparse will automatically select all the links to the detail pages on the current page. The selected links will be highlighted in green while other links to the detail pages will be highlighted in red.
- Click "Loop click each element" to create a "Loop Item"
Octoparse will click through each link captured in the "Loop Item", and open the detail page.
When you go to other pages to check whether the workflow is correct, you may see the message "Cannot find any element using this XPath". In this case, we suggest modifying the XPath so that the elements in the list are located more accurately.
- Click the "Loop Item" box
- Go to "Loop mode" and select "Variable list"
- Enter the modified XPath below into the text box of "Variable list":
- //a[@class='hotel_name_link url']
- Click "OK" to save
1. "Fixed list" and "Variable list" are loop modes in Octoparse. For more about loop modes in Octoparse:
2. If you want to learn more about XPath and how to generate it, here are some related tutorials you might need:
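If you want to see what the XPath above selects, the following sketch applies it to a small made-up snippet using Python's standard `xml.etree.ElementTree` module. The sample markup and the `find_hotel_links` function are assumptions for illustration only, and the real page will look different. Note that ElementTree supports only a subset of XPath and needs a relative path, so the expression is written as `.//a[...]` instead of `//a[...]`. It also requires well-formed markup, so a real HTML page would normally need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the live page is far more complex.
SAMPLE = """
<div>
  <a class="hotel_name_link url" href="/hotel/a.html">Hotel A</a>
  <a class="hotel_name_link url" href="/hotel/b.html">Hotel B</a>
  <a class="other_link" href="/help.html">Help</a>
</div>
"""


def find_hotel_links(markup):
    """Return (text, href) for every <a> whose class attribute is exactly
    'hotel_name_link url'."""
    root = ET.fromstring(markup)
    matches = root.findall(".//a[@class='hotel_name_link url']")
    return [(a.text, a.get("href")) for a in matches]


for name, href in find_hotel_links(SAMPLE):
    print(name, href)
```

The predicate compares the whole `class` attribute string, so a link with a different class value, or with the same class names in a different order, is not matched.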
4. Extract data - select the data for extraction
After you click "Loop click each element", Octoparse will open the detail page of the first hotel.
- Click on the data you need on the page
- Extract the selected data
- For extracting the text, select "Extract text of the selected element" on the "Action Tips" panel
- For extracting the image URL, select "Extract the URL of the selected link" on the "Action Tips" panel
- Rename the fields by selecting from the predefined list or inputting on your own
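The difference between extracting text and extracting a URL can be sketched in a few lines of Python. This is an illustration under assumptions, not Octoparse's own logic: the sample markup, the `extract_text` and `extract_url` helpers, and the `Hotel_Name` field name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified detail page.
DETAIL = """
<div>
  <h2 class="name">  Hotel A  </h2>
  <a class="photo" href="https://example.com/photos/hotel-a.jpg">View photo</a>
</div>
"""


def extract_text(element):
    """Return the visible text of an element, trimmed of whitespace."""
    return (element.text or "").strip()


def extract_url(element):
    """Return the link target stored in the href attribute, or '' if absent."""
    return element.get("href", "")


root = ET.fromstring(DETAIL)
record = {
    "Hotel_Name": extract_text(root.find(".//h2[@class='name']")),
    "Hotel_Image_1": extract_url(root.find(".//a[@class='photo']")),
}
print(record)
```

Text extraction reads what is displayed inside the element, while URL extraction reads an attribute of the element, which is why the two options give different results for the same link.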
5. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
Here is the sample output.
You can see some blank fields in the "Hotel_Image_1" and "All_Rooms_Include" columns. This is because some detail pages do not contain a hotel image and/or information about room facilities.
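As a rough illustration of how such blank fields end up in the exported data, the sketch below writes records to CSV and leaves any missing field empty. The `Hotel_Name` column, the sample records, and the `to_row` and `to_csv` helpers are assumptions made for this example.

```python
import csv
import io

# "Hotel_Name" is a hypothetical column; the other two appear in the sample output.
FIELDS = ["Hotel_Name", "Hotel_Image_1", "All_Rooms_Include"]


def to_row(extracted):
    """Fill every expected column, using '' when a field was not found."""
    return {field: extracted.get(field, "") for field in FIELDS}


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))
    return buffer.getvalue()


# Made-up records: the second hotel has no image and no facilities listed.
records = [
    {
        "Hotel_Name": "Hotel A",
        "Hotel_Image_1": "https://example.com/a.jpg",
        "All_Rooms_Include": "Free WiFi",
    },
    {"Hotel_Name": "Hotel B"},
]
print(to_csv(records))
```

Every row keeps the same number of columns, so a missing value shows up as an empty cell and does not shift the remaining data.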
By default, if Octoparse cannot find an element matching the defined pattern on the page, the field will be left blank. However, Octoparse may fail to find the element even when it is shown on the website. If you encounter this problem, here is a related tutorial you might need: