You are browsing a tutorial guide for Octoparse version 8.4. If you are running an older version of Octoparse, we strongly recommend you upgrade because the new version is faster, easier to use, and more robust! Download and upgrade here if you haven't already done so!

Realtor is a website where you can search real estate for sale, discover new homes, shop for mortgages, and find property records.

In this tutorial, we are going to show you how to scrape property data from Realtor.com. The website has anti-scraping techniques, so we need to make sure not to scrape the website too fast.

We will scrape data from the property detail pages, including the title, location, price, rating, etc., with Octoparse.

To follow along, you may want to use this URL in the tutorial:

https://www.realtor.com/realestateandhomes-search/Tallassee_AL

We'll use 2 tasks to get the data in the detail pages.

Here are the main steps in this tutorial:

Task 1: Extract all the URLs of detail pages on the search result pages [Download the demo task file here]

  1. "Go To Web Page" - open the target web page

  2. Create a pagination loop - scrape all the results from multiple pages

  3. Create a "Loop Item" - to loop extract URLs of all the listings

  4. Start extraction - run the task and get data

Task 2: Collect the property information from scraped URLs [Download the demo task file here]

  1. Input a batch of the scraped URLs - loop opens the detail pages

  2. Extract data - select the data for extraction

  3. Refine the data fields

  4. Set up wait time - slow down the scraping

  5. Start extraction - run the task and get data


Task 1: Extract the detail page URLs on the search result pages

1. "Go to Web Page" - open the target web page

  • Enter the example URL and click Start

1.png

2. Create a pagination loop - scrape all the results from multiple pages

  • Scroll down and click the "Next" button on the web page

  • Click Loop click next page on the Tips panel

2.gif

Octoparse automatically detects that AJAX is applied for the click action and sets a 3-second timeout. You can modify it based on your local Internet conditions (click to learn more about AJAX: Handling AJAX).

  • Set up AJAX timeout as 10 seconds

3.png
  • Click on the Pagination step in the workflow and enter the XPath: //a[@aria-label="Go to next page"][not(contains(@class, "disabled"))]

17.png
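If you want to check this XPath outside Octoparse, you can run it against a snippet of HTML with Python and lxml. The markup below is a simplified, assumed stand-in for Realtor.com's pagination bar, not the site's actual source:

```python
# Sketch: testing the pagination XPath against assumed sample markup.
# The HTML here is a simplified stand-in, not Realtor.com's real source.
from lxml import html

sample = """
<nav>
  <a aria-label="Go to previous page" class="pagination-btn disabled">Prev</a>
  <a aria-label="Go to next page" class="pagination-btn">Next</a>
</nav>
"""
tree = html.fromstring(sample)

# Same XPath as in the workflow: match the Next link only while it is enabled
xpath = '//a[@aria-label="Go to next page"][not(contains(@class, "disabled"))]'
matches = tree.xpath(xpath)
print(len(matches))  # → 1
```

The `not(contains(@class, "disabled"))` predicate is what stops the pagination loop: on the last results page the Next link typically gains a "disabled" class, the XPath stops matching, and the loop ends.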

3. Create a "Loop Item" - to loop extract URLs of all the listings

  • Click on the image of the first item on the list

  • Click the A tag at the bottom of the Tips panel (A tag defines a hyperlink, which is used to link from one page to another)

  • Click Select All on the Tips

  • Choose Extract the URLs of the selected elements

123.gif

We can see that some items are not selected, so we need to modify the XPath of the Loop Item.

  • Click on Loop Item

  • Change Loop Mode from Fixed List to Variable list

  • Enter XPath //ul[@data-testid='property-list-container']/li into the text box

  • Click Apply to save

7.png
  • Go to Extract Data and modify the URL XPath

  • Set the XPath as //a[@rel="noopener"]

11.gif
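To see how these two XPaths work together, here is a sketch in Python with lxml, run against assumed sample markup (a simplified stand-in for the real listing page). Note that inside the loop, Octoparse evaluates the URL XPath relative to each loop item, which is why the sketch prefixes it with a dot:

```python
# Sketch: how the Loop Item XPath and the URL XPath combine.
# The HTML here is assumed sample markup, not Realtor.com's real source.
from lxml import html

sample = """
<ul data-testid="property-list-container">
  <li><a rel="noopener" href="/realestateandhomes-detail/home-1">Home 1</a></li>
  <li><a rel="noopener" href="/realestateandhomes-detail/home-2">Home 2</a></li>
</ul>
"""
tree = html.fromstring(sample)

# Loop Item XPath: one match per listing card
cards = tree.xpath("//ul[@data-testid='property-list-container']/li")

# Within each card, the URL XPath picks out the detail-page link
urls = [card.xpath('.//a[@rel="noopener"]/@href')[0] for card in cards]
print(urls)
```

Each `<li>` under the property list container becomes one row, so every listing is captured even if the visual selection missed some.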

4. Start extraction - run the task and get data

  • Run the task from the upper left side

  • Select Run task on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)

Here is the sample output:

9.png

Task 2: Collect property data from scraped property URLs

1. Input a batch of the scraped URLs - loop opens the detail pages

In Task 1, we already have a list of URLs.

  • Click + New to start a task using Advanced Mode to build Task 2

10.png
  • Choose Import from the task to get the URLs from Task 1

11.png

TIP: There are 4 ways to input URLs. In this tutorial, we use Import from the task for demonstration. Please note that this option only works when the parent task runs in the Cloud. If we import from local run results, only 100 lines of data will be imported. To learn more about importing URLs, check this guide: Batch URL input.

After clicking the Save button, you will see a loop item named Loop URLs generated in the workflow.
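Conceptually, the Loop URLs step just reads the URL list produced by Task 1 and visits each detail page in turn. A minimal sketch in Python, using an assumed export file and column name for illustration (use whatever your Task 1 export actually contains):

```python
# Sketch of what "Loop URLs" does: read Task 1's exported URL list and
# visit each detail page in turn. The column name "URL" is an assumption.
import csv
import io

# Stand-in for the Task 1 export file
task1_export = io.StringIO(
    "URL\n"
    "https://www.realtor.com/realestateandhomes-detail/home-1\n"
    "https://www.realtor.com/realestateandhomes-detail/home-2\n"
)

urls = [row["URL"] for row in csv.DictReader(task1_export)]
for url in urls:
    # Task 2 opens each URL here and runs its extraction steps
    print(url)
```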

2. Extract data - select the data for extraction

  • Click on the elements you want to scrape

  • Choose Extract text/URL/image URL of the selected element on the Tips panel

12.gif
  • Double click each field to rename it

13.png

3. Refine the data fields

To avoid data being fetched into the wrong column, we need to customize the element XPath.

  • Click More(...) and select Customize XPath

  • Input the revised XPath into the text box and click Apply to save

14.png

Here are revised XPaths for some common data fields:

  • Presented_by: //div[contains(text(),'Presented')]/following-sibling::span[2]

  • Price: //div[@data-testid="list-price"]

  • Facilities: //div[@data-testid="property-meta"]

  • Address: //div[@data-testid="address"]

  • Property_type: //div[contains(text(),'Property')]/following-sibling::div[1]

  • Time_on_realtor: //div[contains(text(),'Time on realtor.com')]/following-sibling::div[1]

  • Price_per_sqft: //div[contains(text(),'Price per sqft')]/following-sibling::div[1]

  • Year_Built: //div[contains(text(),'Year Built')]/following-sibling::div[1]
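Most of these XPaths share one pattern: anchor on the label text, then take the next sibling element as the value. Here is a sketch of that pattern with Python and lxml, run against assumed sample markup (the live page's markup may differ):

```python
# Sketch: the label/following-sibling XPath pattern used above.
# The HTML here is assumed sample markup, not the live page's source.
from lxml import html

sample = """
<div>
  <div>Year Built</div>
  <div>1998</div>
</div>
"""
tree = html.fromstring(sample)

# The label <div> anchors the match; following-sibling::div[1] grabs the value
year_built = tree.xpath(
    "//div[contains(text(),'Year Built')]/following-sibling::div[1]/text()"
)
print(year_built)  # → ['1998']
```

Anchoring on the label text is more robust than positional selection: the field still lands in the right column even if the page reorders its detail rows.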

4. Set up wait time - slow down the scraping

As the website applies anti-scraping techniques, we need to set up a wait time to slow down the scraping speed so as to avoid being blocked.

  • Click on the Extract Data

  • Go to Options

  • Tick Wait before action and set it as 7s-10s

  • Click Apply to save

wait_time.jpg
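Setting a 7s-10s range (rather than a fixed delay) means each wait is a random length within that window, so requests don't arrive at a machine-regular pace. A minimal sketch of the idea, assuming a simple randomized delay:

```python
# Sketch: a randomized "wait before action" delay, as the 7s-10s
# setting implies - each pause is a random length in the range.
import random
import time

def wait_before_action(low=7.0, high=10.0):
    delay = random.uniform(low, high)  # a different delay each time
    time.sleep(delay)
    return delay

# Before extracting each detail page, pause like this:
# wait_before_action()
```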

5. Start extraction - run the task and get data

  • Click Save to save the task first

  • Click Run on the upper left side

  • Select Run task on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)

Here is the sample output:

15.png