Scrape property data from Realtor.com
In this tutorial, we will show you how to scrape property data from Realtor.com.
We will use Octoparse to scrape data from the house detail pages, such as the title, location, price, and rating. To follow along, you can use the example URL below:
https://www.realtor.com/realestateandhomes-search/Tallassee_AL
We will use two tasks to get the data from the detail pages.
Here are the main steps in this tutorial:
Task 1: Extract all the URLs of detail pages on the search result pages [Download the demo task file here ]
- "Go To Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - to loop extract URLs of all the listings
- Refine the data field of URL
- Start extraction - run the task and get data
Task 2: Collect the property information from scraped URLs [Download the demo task file here ]
- Input a batch of the scraped URLs - loop open the detail pages
- Extract data - select the data for extraction
- Refine the data fields
- Start extraction - run the task and get data
Task 1: Extract the detail page URLs on the search result pages
1. "Go To Web Page" - open the target web page
- Enter the example URL and click "Start"
2. Create a Pagination - scrape all the results from multiple pages
- Scroll down and click the ">" button on the web page
- Click "Loop click single URL" on the Tips panel
Octoparse auto-detects that the click action uses AJAX and sets the AJAX timeout to 3 seconds. You can adjust it to suit your Internet connection (to learn more about AJAX, see: Handling AJAX).
- Set the "AJAX timeout" to 5 seconds
- Double-click the "Pagination" step in the workflow
- Copy and paste the revised XPath for the next page button: //li[contains(@class,'pagination-next')]/a
- Expand the section of "Before action is performed"
- Check "Wait before action" and set the wait time as 2s
- Click "OK" to save
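If it helps to think of the pagination settings in code, the loop behaves roughly like the sketch below. This is only a conceptual illustration, not how Octoparse is implemented: `FAKE_SITE`, `fetch_page`, and `paginate` are hypothetical names, the in-memory dictionary stands in for a real browser, and the 2-second pause mirrors the "Wait before action" setting.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical in-memory "site": each page has listings and an optional next page.
FAKE_SITE: Dict[str, dict] = {
    "page-1": {"listings": ["A", "B"], "next": "page-2"},
    "page-2": {"listings": ["C"], "next": None},
}


def fetch_page(page_id: str) -> dict:
    """Stand-in for loading one search result page in a browser."""
    return FAKE_SITE[page_id]


def paginate(
    start: str,
    wait_before_click: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[str]]:
    """Yield the listings of each page until there is no next-page button."""
    page_id: Optional[str] = start
    while page_id is not None:
        page = fetch_page(page_id)
        yield page["listings"]
        if page["next"] is None:
            break  # no ">" button left, so the loop ends
        sleep(wait_before_click)  # mirrors "Wait before action"
        page_id = page["next"]
```

Passing a no-op `sleep` (for example `lambda seconds: None`) lets you try the loop without actually waiting.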
3. Create a "Loop Item" - to loop extract URLs of all the listings
- Click the address of the 1st item on the list
- Click the A tag on the bottom of the Tips panel (A tag defines a hyperlink, which is used to link from one page to another)
- Click "Select all" on the Tips panel
- Select "Extract the URL of the link"
We can see some items are not selected, so we need to modify the Loop Item.
- Click the settings icon of the "Loop Item"
- Change the Loop Mode from "Fixed list" to "Variable list"
- Enter XPath //ul[@data-testid='property-list-container']/li into the text box
- Click "OK" to save
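To see what the Loop Item XPath selects, here is a minimal sketch using Python's standard library. The markup in `SAMPLE` is a hypothetical, heavily simplified stand-in for the real search result page, and `listing_urls` is a name made up for this example. Note that `xml.etree.ElementTree` only accepts relative paths, so the expression starts with `.//` instead of `//`.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, simplified markup; the real page is far larger.
SAMPLE = """
<html><body>
  <ul data-testid="property-list-container">
    <li><a href="/example-listing-1">101 Example St</a></li>
    <li><a href="/example-listing-2">202 Example Ave</a></li>
  </ul>
  <ul class="footer-links"><li><a href="/about">About</a></li></ul>
</body></html>
"""

# Same expression as in the tutorial, made relative for ElementTree.
LOOP_XPATH = ".//ul[@data-testid='property-list-container']/li"


def listing_urls(markup: str) -> List[str]:
    """Return the first link URL found inside each list item matched by the XPath."""
    root = ET.fromstring(markup.strip())
    urls = []
    for item in root.findall(LOOP_XPATH):
        link = item.find(".//a")
        if link is not None and link.get("href"):
            urls.append(link.get("href"))
    return urls
```

Only the items inside the property list container are matched; the footer list is ignored because its `ul` lacks the `data-testid` attribute.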
4. Refine the data field of the URL
The scraped URL sometimes opens a page with a different page design. To avoid this, we need to refine the URL field.
- Double-click the "Extract Data" in the workflow
- Choose the field of title URL and click "..." to choose "Clean data"
- Click "Add step", and then choose "Add a suffix"
- Enter "?view=qv" into the text box, and then press "Evaluate" to get the result.
- Click "Confirm" to save it.
- Click the field name to modify it if needed
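The "Add a suffix" cleaning step is equivalent to a simple string append. The sketch below shows the idea; `add_suffix` and the sample URL are hypothetical, and the check that skips URLs already ending in the suffix is an extra safeguard added for this example, not something Octoparse is known to do.

```python
def add_suffix(url: str, suffix: str = "?view=qv") -> str:
    """Append the suffix to the URL unless it is already there."""
    if url.endswith(suffix):
        return url
    return url + suffix


# Hypothetical listing URL, used only for illustration.
example = add_suffix("https://www.realtor.com/example-listing")
```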
5. Start extraction - run the task and get data
- Click "Save"
- Click "Run" on the upper left side
- Select "Run task on your device" to run the task on your computer, or select “Run task in the Cloud” to run the task in the Cloud (for premium users only)
If you are a premium or trial user, we suggest using "Run task in the Cloud" so that you can use the associative tasks feature (see the guide "What is parent task and child task in Octoparse?" for details).
Here is the output data.
Task 2: Collect the property data from scraped property URLs
1. Input a batch of the scraped URLs - loop open the detail pages
In Task 1, we have already got a list of URLs.
- Click "+ New" to start a task using Advanced Mode to build Task 2
- Choose "Import from task" to get the URLs from Task 1
Tips! There are 4 ways to input URLs. In this tutorial, we use "Import from task" for demonstration. Please note that this option only works fully when the parent task runs in the Cloud. If we import from the results of a local run, only 100 lines of data will be imported. To learn more about importing URLs, check this guide: Batch URL input.
After clicking the "Save" button, you will see a loop item named "Loop URLs" generated in the workflow.
2. Extract data - select the data for extraction
- Click on elements you want to scrape
- Choose "Extract text/URL/image URL of the selected element" on the Tips panel
- Click the edit icon to rename the fields
3. Refine the data fields
To prevent data from being extracted into the wrong column, we need to customize the element XPath.
- Double-click the "Extract Data" step in the workflow to revise the XPath of some data fields
- Click the icon to modify the XPath
- Enter the revised XPath into the text box and click "OK" to save
Here are revised XPaths for some common data fields:
- Presented_by: //div[contains(text(),'Presented')]/following-sibling::span[2]
- Brokered_by: //li[contains(text(),'Brokered')]/following-sibling::li[1]
- Price: //span[contains(@class,'price')]
- Facilities: //ul[contains(@class,'property')]
- Address: //h1[contains(@class,'address')]
- Property_type: //span[contains(text(),'Property')]/following-sibling::span[1]
- Last_sold: //span[contains(text(),'Last Sold')]/following-sibling::span[1]
- Days_on_realtor: //span[contains(text(),'Days on')]/following-sibling::span[1]
- Parcel_number: //li[contains(text(),'Parcel')]
- Source_listing_status: //li[contains(text(),'Source Listing Status')]
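Several of the XPaths above follow the same pattern: find the element whose text contains a label, then take the first following sibling with the same tag. The sketch below mimics that pattern with Python's standard library, which does not support `contains()` or `following-sibling` directly. The markup in `SAMPLE` and its values are hypothetical, and `value_after_label` is a helper invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, simplified detail-page markup with made-up values.
SAMPLE = """
<div>
  <div><span>Property Type</span><span>Single Family</span></div>
  <div><span>Last Sold</span><span>(example value)</span></div>
</div>
"""


def value_after_label(root: ET.Element, label: str, tag: str = "span") -> Optional[str]:
    """Mimic //tag[contains(text(),label)]/following-sibling::tag[1]."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == tag and label in (child.text or ""):
                # Take the first later sibling that has the same tag.
                for sibling in children[index + 1:]:
                    if sibling.tag == tag:
                        return (sibling.text or "").strip()
    return None


root = ET.fromstring(SAMPLE.strip())
property_type = value_after_label(root, "Property")
```

Anchoring on the label text rather than on position is what keeps a value in the right column when a listing omits some fields.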
If you need data such as latitude and longitude, extract the image URL of the map and then clean the data to find the coordinates.
- Click the map image and extract the URL of the selected image
- Repeat the step above so that you have two URL fields, one for the latitude and one for the longitude
- Go to "Extract Data" and find the option of "Clean data"
- Click "Add step" and choose "Match with Regular expression"
- Use "center=" as the starting value and "%2C" as the ending value to match the latitude
- Use "%2C" as the starting value and "&channel" as the ending value to match the longitude
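The two matching rules above can be written as ordinary regular expressions. The sketch below assumes the map image URL contains `center=<latitude>%2C<longitude>&channel`, as the start and end values imply. `SAMPLE_URL` is a made-up URL with made-up coordinates, and `extract_coordinates` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Made-up map image URL; only the "center=...%2C...&channel" part matters.
SAMPLE_URL = "https://maps.example.com/staticmap?center=12.5%2C-45.25&channel=demo&zoom=15"

LATITUDE_RE = re.compile(r"center=(.*?)%2C")    # between "center=" and "%2C"
LONGITUDE_RE = re.compile(r"%2C(.*?)&channel")  # between "%2C" and "&channel"


def extract_coordinates(url: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from the URL, or None if either part is missing."""
    latitude = LATITUDE_RE.search(url)
    longitude = LONGITUDE_RE.search(url)
    if latitude is None or longitude is None:
        return None
    return float(latitude.group(1)), float(longitude.group(1))


coordinates = extract_coordinates(SAMPLE_URL)
```

`%2C` is the URL-encoded comma that separates the two values, which is why it serves as the end marker for one field and the start marker for the other.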
4. Start extraction - run the task and get data
- Click "Save" to save the task first
- Then, click "Run" on the upper left side
- Select "Run task on your device" to run the task on your computer, or select “Run task in the Cloud” to run the task in the Cloud (for premium users only)
Here is the sample output.
Author: Vanny
Editor: Yina