In this tutorial, we are going to show you how to scrape the listing details from Airbnb.com.
Alternatively, you can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with a ready-to-use Airbnb template to save time. With a template, there is no need to configure the scraping task yourself. For further details, see: Task Templates
If you would like to learn how to build the task from scratch, continue with the tutorial below.
To follow through, you may want to use this URL in the tutorial:
We will scrape data such as the title, location, price, and rating from each listing details page with Octoparse.
This tutorial will also cover:
- Handle pagination that is triggered by scrolling down in Octoparse
- Locate all the listings by modifying the loop mode and XPath in Octoparse
- Reformat the star rating into a number with the RegEx tool in Octoparse
1. We recommend using the URL of the search result page directly whenever possible. Adding keywords or filters within Octoparse can complicate the task and lead to less efficient scraping.
2. The structure and display of airbnb.com might vary depending on your IP, preferred language, display screen, and even browser.
Here are the main steps in this tutorial: [Download demo task file here ]
- "Go To Web Page" - open the target web page
- Set "Scroll Down" - load all items from one page
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Customize the data field by modifying XPath - improve the accuracy of the item list (Optional)
- Customize the data field using RegEx tool - reformat the rating of the room (Optional)
- Start extraction - run the task and get data
1. "Go To Web Page" - open the target web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Airbnb.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Set "Scroll Down" - load all items from one page
- Turn on the “Workflow Mode”
We strongly suggest turning on "Workflow Mode". It gives you a clearer picture of the task you are building and makes it easier to spot and fix a step that went wrong.
- Check the box for "Scroll down to bottom of the page when finished loading", set "Scroll times" as 1 and "Interval" as 3 seconds. For "Scroll way", select "Scroll down to the bottom of the page"
"Interval" is the time interval between every two scrolls.
For some websites like Airbnb.com, clicking the next page button to paginate is not an option for loading content. To fully load the listings, we need to scroll the page down to the bottom continuously.
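To illustrate how "Scroll times" and "Interval" work together, here is a minimal Python sketch of a scroll-and-wait loop. It does not reflect how Octoparse is implemented. The `FakeListingPage` class, its item counts, and the function name are hypothetical stand-ins for a real page that loads more listings each time the bottom is reached.

```python
import time


def scroll_to_bottom_repeatedly(scroll_once, scroll_times, interval_seconds, sleep=time.sleep):
    """Call scroll_once() scroll_times times, waiting interval_seconds after each scroll.

    Returns the number of loaded items reported after every scroll.
    """
    counts = []
    for _ in range(scroll_times):
        counts.append(scroll_once())
        # Give the page time to load new content before the next scroll.
        sleep(interval_seconds)
    return counts


class FakeListingPage:
    """Hypothetical page: each scroll to the bottom loads a few more listings."""

    def __init__(self, initial=6, per_scroll=6, total=18):
        self.loaded = initial
        self.per_scroll = per_scroll
        self.total = total

    def scroll_to_bottom(self):
        # Load another batch, but never more than the page actually has.
        self.loaded = min(self.loaded + self.per_scroll, self.total)
        return self.loaded


page = FakeListingPage()
# interval_seconds=0 keeps the demo fast; the tutorial uses 3 seconds.
counts = scroll_to_bottom_repeatedly(page.scroll_to_bottom, scroll_times=3, interval_seconds=0)
print(counts)
```

In this sketch, once the count stops growing between scrolls, further scrolls load nothing new. That is why the right "Scroll times" value depends on how many listings the page has to load.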
To learn more about how to deal with infinite scrolling in Octoparse, please refer to:
3. Create a "Loop Item" - loop click into each item on each list
- Click the title of the 1st item in the list
Octoparse will automatically select all the listings on the current page.
- Click "Select All" on the “Action Tips” panel
- Select "Loop click each element"
Octoparse will click through each listing on the current page.
To let Octoparse capture every listing that is loaded after the page has been scrolled to the bottom enough times, you will need to modify the loop mode and the XPath. See how it's done in Step 5.
4. Extract data - select the data for extraction
After you click "Loop click each element" on the "Action Tips" panel, Octoparse will automatically open the detail page of the first item.
- Click on the data you want to extract on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
- Rename the fields by selecting from the pre-defined list or inputting on your own
When you click on the star-rating data, choose "Extract button outer HTML". The extracted HTML needs further processing with a regular expression. See how it's done in Step 6.
5. Customize the data field by modifying XPath - improve the accuracy of the item list (Optional)
Once we click “Loop click each element”, Octoparse generates a loop item that uses the “Fixed list” loop mode by default. “Fixed list” is a loop mode for a fixed number of elements. However, the number of listings on Airbnb.com is not fixed; it increases as the page is scrolled down. To let Octoparse capture all the listings, including those loaded later, we need to switch the loop mode to “Variable list” and enter an XPath that locates all of the listings.
- Select "Loop Item" box
- Select "Variable list" and enter:
- Click "OK" to save
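The difference between the two loop modes can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is an illustration only: the markup, class names, and XPath expressions below are hypothetical and are not Airbnb's real structure or the XPath to enter in Octoparse. The sketch assumes a fixed list behaves like a set of per-item paths recorded when the loop was created, while a variable list uses one pattern that is re-evaluated against the current page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the page when the loop was created (2 listings)...
PAGE_AFTER_FIRST_LOAD = """
<div id="results">
  <div class="listing"><a href="/rooms/1">Loft</a></div>
  <div class="listing"><a href="/rooms/2">Cabin</a></div>
</div>
"""

# ...and the same page after scrolling has loaded 2 more listings.
PAGE_AFTER_SCROLL = """
<div id="results">
  <div class="listing"><a href="/rooms/1">Loft</a></div>
  <div class="listing"><a href="/rooms/2">Cabin</a></div>
  <div class="listing"><a href="/rooms/3">Villa</a></div>
  <div class="listing"><a href="/rooms/4">Studio</a></div>
</div>
"""

# Fixed list: one path per item, captured when only 2 listings existed.
FIXED_XPATHS = ["./div[1]/a", "./div[2]/a"]

# Variable list: a single pattern that matches every listing link.
VARIABLE_XPATH = "./div[@class='listing']/a"


def fixed_list(root):
    """Return the titles found by the pre-recorded per-item paths."""
    titles = []
    for xpath in FIXED_XPATHS:
        titles.extend(link.text for link in root.findall(xpath))
    return titles


def variable_list(root):
    """Return the titles of all elements matching the single pattern."""
    return [link.text for link in root.findall(VARIABLE_XPATH)]


scrolled = ET.fromstring(PAGE_AFTER_SCROLL)
print(fixed_list(scrolled))     # only the items known at build time
print(variable_list(scrolled))  # every listing currently on the page
```

In this toy page, the per-item paths miss the listings added by scrolling, while the single pattern picks them up.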
1. "Fixed list" and "Variable list" are loop modes in Octoparse. For more about loop modes in Octoparse:
2. If you want to learn more about XPath and how to generate it, here are some related tutorials you might need:
6. Customize the data field using RegEx tool - reformat the rating of the room (Optional)
When the data we want is not shown as readable text on the web page, we first need to extract its source code (HTML), and then process the extracted source code into the desired format.
- Select "Rating" and click "Customize data field"
- Choose "Refine extracted data"
- Click "Add step" and choose "Match with Regular Expression"
- Choose "Try RegEx Tool"
- Check the box "Start With" and enter "Rated "
- Check the box "End With" and enter " out"
- Click "Generate" and "Match"
- Click "Apply" and "OK"
- Click "OK" to save
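The "Start With" and "End With" settings above can be approximated outside Octoparse with a lookbehind/lookahead regular expression. The Python sketch below is an illustration under assumptions: the exact expression the RegEx Tool generates may differ, and the sample HTML and the `extract_rating` helper are hypothetical.

```python
import re

# Match the text that starts right after "Rated " and ends right before " out".
RATING_PATTERN = re.compile(r"(?<=Rated ).+?(?= out)")


def extract_rating(outer_html):
    """Return the rating as a float, or None if no rating text is present."""
    match = RATING_PATTERN.search(outer_html)
    if match is None:
        return None  # mirrors a blank "Rating" field for unrated listings
    try:
        return float(match.group(0))
    except ValueError:
        return None  # matched text was not a number


# Hypothetical outer HTML for a star-rating element.
sample_html = '<span role="img" aria-label="Rated 4.85 out of 5"></span>'

print(extract_rating(sample_html))
print(extract_rating('<span role="img"></span>'))
```

Returning `None` when nothing matches keeps unrated listings distinguishable from a rating of zero.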
Octoparse offers 8 data reformatting options for further processing or cleaning the extracted data into the right format. For more about how to reformat data with Regular Expression:
7. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
Here is the sample output. You can see some blank fields in the column “Rating”. This is because these listings don't have a rating.
By default, if Octoparse cannot find an element matching the defined pattern on the page, the field is left blank. However, Octoparse may fail to find a matching element even when the element is shown on the website. If you encounter this problem, here is a related tutorial you might need: