Scrape real estate data on Realtor.com
In this tutorial, we will show how to scrape information from realtor.com.
We will use Octoparse to open each house detail page and extract the title, location, price, and rating.
To follow along, you can use the same URL as this tutorial:
https://www.realtor.com/realestateandhomes-search/Tallassee_AL
This tutorial will also cover:
- Dealing with AJAX for pagination
- Locating elements correctly by modifying XPath in Octoparse
Here are the main steps in this tutorial: [Download the demo task file here]
- "Go To Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Start extraction - run the task and get data
1. "Go To Web Page" - open the target web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, such as Realtor, we strongly recommend starting your data extraction project in Advanced Mode.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down and click the ">" button on the web page
- Click "Loop click single element" on the "Action Tips" panel
Because Realtor loads its content with AJAX, we need to set up AJAX Load for the "Pagination" action.
- Uncheck "Auto retry when no response"
- Check "Load the page with AJAX"
- Set "AJAX Timeout" to 5 seconds
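To make the idea of an AJAX timeout more concrete, here is a minimal Python sketch of a bounded wait after a click. It is only an analogy, not how Octoparse is implemented: the function name `wait_for_ajax` and its parameters `get_content`, `previous`, and `poll_interval` are hypothetical names made up for this illustration. The point is that, because an AJAX click does not trigger a full page reload, the scraper needs some upper limit (here 5 seconds) on how long it waits for the new content.

```python
import time
from typing import Callable


def wait_for_ajax(get_content: Callable[[], str], previous: str,
                  timeout: float = 5.0, poll_interval: float = 0.1) -> bool:
    """Poll until the content differs from `previous` or `timeout` seconds pass.

    Returns True if new content appeared, False if the wait timed out.
    All names here are hypothetical; this is not an Octoparse API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_content() != previous:
            return True  # the AJAX response has updated the page
        time.sleep(poll_interval)
    return False  # give up and let the workflow move on
```

A timeout that is too short risks moving on before the next page of results has arrived, while one that is too long slows down every click in the loop.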
3. Create a "Loop Item" - loop click into each item on each list
We are now on the second page. A "Loop Item" should always be created starting from the first item on the first page, so we need to go back to the first page.
- Click "Go To Web Page" in the workflow
- Select the pagination loop
This helps Octoparse determine the execution order and generate the Loop Item at the appropriate position in the workflow.
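The execution order this workflow aims for can be pictured as two nested loops: the pagination loop on the outside and the Loop Item inside it. The Python sketch below is only an illustration under that assumption. `PAGES` is hypothetical placeholder data standing in for the website, and `run_workflow` is a made-up name, not part of Octoparse.

```python
from typing import Dict, List

# Hypothetical stand-in for the site: each inner list is one results page,
# and each dict is the detail page behind one listing (placeholder values).
PAGES: List[List[Dict[str, str]]] = [
    [{"title": "Listing A", "price": "$100,000"},
     {"title": "Listing B", "price": "$150,000"}],
    [{"title": "Listing C", "price": "$200,000"}],
]


def run_workflow(pages: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Visit every listing on every page and collect one row per listing."""
    rows = []
    for page in pages:        # pagination loop (outer)
        for listing in page:  # Loop Item (inner): click into each listing
            # "Extract data" step: keep only the fields we want
            rows.append({"title": listing["title"], "price": listing["price"]})
    return rows
```

If the Loop Item ended up outside the pagination loop instead, only the listings of a single page would be visited.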
Now, let’s build the loop item:
- Click the first image item on the web page
- Click "Select All" on the "Action Tips" panel
- Select "Loop click each element"
We need to set up "AJAX Load" for this step as well, since clicking an item also loads its content with AJAX.
- Uncheck "Auto retry when no response"
- Uncheck "Open the link in the new tab"
- Check "Load the page with AJAX"
- Set "AJAX Timeout" to 5 seconds
4. Extract data - select the data for extraction
- Click the information you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
- Rename the fields by selecting a name from the predefined list or typing your own
Tips! If you want the data to be extracted correctly into the corresponding data fields, it is best to write a new XPath that always pinpoints the right data on all pages.
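As a rough illustration of why a hand-written XPath can be more reliable than an auto-generated one, the sketch below compares a positional XPath with one that matches on an attribute. The markup in `PAGE_A` and `PAGE_B` is hypothetical and is not Realtor's real page structure. The example uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so treat it as a demonstration of the idea rather than of Octoparse itself.

```python
import xml.etree.ElementTree as ET

# Two hypothetical detail pages: page A has an extra "badge" element,
# so the price sits at a different position than on page B.
PAGE_A = ('<div><h1 class="title">Listing A</h1>'
          '<span class="badge">New</span>'
          '<span class="price">$100,000</span></div>')
PAGE_B = ('<div><h1 class="title">Listing B</h1>'
          '<span class="price">$150,000</span></div>')

POSITIONAL = "./span[2]"                # "the second span": breaks when the layout shifts
BY_ATTRIBUTE = "./span[@class='price']"  # "the span marked as price": stable across pages


def extract(page: str, xpath: str):
    """Return the text of the first node matching `xpath`, or None if nothing matches."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None
```

Here `extract(PAGE_B, POSITIONAL)` finds nothing because page B has only one `span`, while `BY_ATTRIBUTE` returns the price on both pages.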
5. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For premium users, Cloud Extraction is highly recommended.
Here is the sample output.
Article in Spanish: Scraping real estate information from Realtor.com
You can also read web scraping articles on the official website
Related Articles:
Scraping property info from Daft.ie
Scrape real estate information from Kijiji
Author: Vanny
Editor: Fergus