In this tutorial, I am going to show you how to scrape hotel information from booking.com step by step. Before we get started, open https://www.booking.com/ in your own browser and type in the keyword you need. In this case, we are looking for hotel information in “Hong Kong”, so type that in. After the page finishes loading, copy the URL of the results page. This is the URL we will use in this demonstration.
Note: To learn more about AJAX, click https://youtu.be/MuOC1yCKai0
To learn more about XPath, click https://youtu.be/kZwD6szlvas
Step One: Enter the URLs of the websites you would like to scrape
- Build a new task by clicking “Advanced Mode”, and enter the URL.
- Then click “Save URL” in the left corner. This opens the hotel listing page in Octoparse’s built-in browser.
Step Two: Create a pagination loop
- Scroll down to the bottom of the page and find the pagination bar. Then click the “Next Page” button. A command panel called “Action Tip” shows up whenever you click an element on the page. It shows you what you can do with the selected element.
- In this case, select "Loop click the selected link"
- Now go to the setting area, where we need to make a few adjustments. Uncheck “Auto Retry”, since we don’t need this function in this tutorial.
- Booking.com uses Ajax for pagination, so we keep the Ajax setting checked here.
- Click “OK” to save the step.
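If you are curious what a pagination loop does under the hood, here is a minimal sketch in Python. It is only an illustration of the idea ("keep clicking Next Page until there is no next page"), not how Octoparse is actually implemented. The `FAKE_SITE` data and the `paginate` function are hypothetical names made up for this example, and no real website is contacted.

```python
from typing import Dict, Iterator, Optional

# Hypothetical in-memory stand-in for a paginated hotel listing.
# Each page holds a few hotel names and the key of the next page (or None).
FAKE_SITE: Dict[str, dict] = {
    "page1": {"hotels": ["Hotel A", "Hotel B"], "next": "page2"},
    "page2": {"hotels": ["Hotel C", "Hotel D"], "next": "page3"},
    "page3": {"hotels": ["Hotel E"], "next": None},
}


def paginate(site: Dict[str, dict], start: str) -> Iterator[dict]:
    """Yield every page, following the 'next' link until it runs out."""
    current: Optional[str] = start
    while current is not None:
        page = site[current]
        yield page
        # This is the "click Next Page" step of the loop.
        current = page["next"]


all_hotels = [
    name
    for page in paginate(FAKE_SITE, "page1")
    for name in page["hotels"]
]
print(all_hotels)
```

The loop stops by itself on the last page because that page has no next link, which is the same stopping condition a "Next Page" loop relies on.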
Step Three: Create a “Loop Item”
- To create a loop item, select an element, in this case the hotel name in the listing. Click the first title and you will notice that only 2 similar elements have been found and highlighted in red. We want the loop item to include all listings, so to help Octoparse recognize the other similar listings, we need to select another hotel name on the list.
- Now, all 16 results are highlighted in green, which means they have been successfully selected.
- Then click “Loop click the selected link” to create a “Loop Item”. Octoparse will click through each hotel for detailed information.
- We still need to go back to the setting area and adjust the settings.
- Uncheck “Auto Retry” again.
- This time we don’t need to check “Ajax Load”, since the detail page doesn’t use Ajax.
- Click “Save” to save the step.
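At this point the workflow has two loops, one inside the other: the pagination loop goes through the listing pages, and the “Loop Item” clicks each hotel on the current page. The sketch below shows that nesting in plain Python. Everything in it (`LISTING_PAGES`, `DETAIL_PAGES`, `visit_all`) is hypothetical sample data and naming for illustration only, and it does not reflect Octoparse's internals.

```python
from typing import Dict, List, Tuple

# Hypothetical data: each inner list is one listing page of hotel links.
LISTING_PAGES: List[List[str]] = [
    ["/hotel/a", "/hotel/b"],
    ["/hotel/c"],
]

# Hypothetical detail pages, keyed by hotel link.
DETAIL_PAGES: Dict[str, str] = {
    "/hotel/a": "Detail page for Hotel A",
    "/hotel/b": "Detail page for Hotel B",
    "/hotel/c": "Detail page for Hotel C",
}


def visit_all(
    listing_pages: List[List[str]], detail_pages: Dict[str, str]
) -> List[Tuple[int, str]]:
    """Visit every hotel detail page, recording which listing page it came from."""
    visited = []
    # Outer loop: the pagination loop over listing pages.
    for page_number, links in enumerate(listing_pages, start=1):
        # Inner loop: the "Loop Item" that clicks each hotel on the page.
        for link in links:
            visited.append((page_number, detail_pages[link]))
    return visited


visited = visit_all(LISTING_PAGES, DETAIL_PAGES)
for page_number, detail in visited:
    print(page_number, detail)
```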
Now we need to go back and check that everything has been set up as expected. To check the workflow, click through each step: if the web page in the built-in browser responds accordingly at every step, the settings are correct. As you may notice, when I click “Loop Item”, the list in the setting area shows “Cannot find any element using the XPath expression”. This means we need to fix the XPath so that Octoparse can locate the elements on the web page and we avoid incomplete extraction. To do this, go to the setting area. I have already prepared the correct XPath for this demonstration. Copy it and paste the expression into “Variable List”. I have also included a link to a tutorial on how to write XPath. Not all web pages are written with exactly the same structure, and the robot will skip an element if the scraper can’t locate it. You can ignore this step if the web page is well organized.
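To make the XPath problem more concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The markup in `LISTING` is invented for this example and is not Booking.com's real page structure, and the expressions are not the ones used in the demonstration. It only shows why one expression can match some listings, another can match nothing, and a more general one can match them all.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup. The third hotel is nested
# one level deeper than the others, as often happens on real pages.
LISTING = """
<div id="results">
  <div class="item"><h3><a><span class="hotel-name">Hotel A</span></a></h3></div>
  <div class="item"><h3><a><span class="hotel-name">Hotel B</span></a></h3></div>
  <div class="item promo"><div><h3><a><span class="hotel-name">Hotel C</span></a></h3></div></div>
</div>
""".strip()

root = ET.fromstring(LISTING)

# A path that assumes a structure the page does not have: finds nothing.
missing = root.findall("./ul/li/span")

# A path tied to one exact nesting: finds only the listings that share it.
narrow = root.findall("./div/h3/a/span")

# A path based on the element's class, at any depth: finds every listing.
broad = root.findall(".//span[@class='hotel-name']")

print(len(missing), len(narrow), len(broad))
print([span.text for span in broad])
```

The general lesson is to anchor the expression on something every target element shares (here, a class attribute) rather than on its exact position in the page.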
Step Four: Data extraction
- To extract the data, click an element, in this case the hotel title, and select “Extract text of the selected element” from the Action Tip.
- You can preview the extraction from the data fields and edit.
- Repeat the steps above for each piece of data you need.
- Click “Save” to save the steps.
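Conceptually, each extraction rule pairs a data field name with the location of an element, and the text of that element becomes one cell in the output row. Here is a minimal sketch of that idea using Python's standard `xml.etree.ElementTree`. The `DETAIL_PAGE` markup, the `FIELDS` mapping, and the `extract_fields` function are all hypothetical and only meant as an illustration. In this sketch, a field whose element cannot be found is left empty (`None`).

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical, simplified hotel detail page.
DETAIL_PAGE = """
<div class="hotel">
  <h2 class="title">Example Hotel</h2>
  <span class="address">1 Example Road, Hong Kong</span>
</div>
""".strip()

# Hypothetical data fields: field name -> XPath of the element to read.
FIELDS: Dict[str, str] = {
    "title": ".//h2[@class='title']",
    "address": ".//span[@class='address']",
    "rating": ".//span[@class='rating']",  # not present on this page
}


def extract_fields(page_xml: str, fields: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return one row of data: the text of each located element, or None."""
    root = ET.fromstring(page_xml)
    row: Dict[str, Optional[str]] = {}
    for name, path in fields.items():
        element = root.find(path)
        if element is not None and element.text:
            row[name] = element.text.strip()
        else:
            # Element not found on this page: leave the field empty.
            row[name] = None
    return row


row = extract_fields(DETAIL_PAGE, FIELDS)
print(row)
```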
Step Five: Run the task and get data
- After setting up the rules, we can run the task by clicking “Start Extraction”.
- Then select “Local Extraction” to run the task. You can switch views to check the scraping status on the website and the data that has been extracted into the table.