This tutorial is written for Octoparse 8.2 Beta, the latest version, which is free to download.
Yelp is one of the largest business directory websites on the Internet. Every day, millions of people use it to search for businesses or leave reviews, and that business information and those reviews make the platform a valuable data source. In this tutorial, we will show you how to collect business information from Yelp with Octoparse 8.
The easiest way is to use our pre-built Yelp templates. There is no need to configure a scraping task: just enter keywords or URLs to get the data. For further details, see: Task Templates
If you would like to build the task from scratch, continue reading this tutorial or watch the video below.
To demonstrate, we will use the URL below as an example.
We will scrape data such as title, star rating, number of reviews, and website with Octoparse.
Here are the main steps in this tutorial: [Download demo task file here]
- Go To Web Page - open the target web page
- Set up a loop for pagination
- Modify the settings of the Pagination
- Extract data on the listing page
- Click into each detail page to get more information
- Extract Data - extract data on the detail pages
- Set up wait time to slow down the scraping speed
- Start extraction - run the task and get data
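The steps above amount to a nested loop: an outer loop over result pages, an inner loop over the business blocks on each page, and a visit to each detail page. The sketch below only illustrates that control flow. It uses hypothetical in-memory data (`FAKE_PAGES`) and a made-up `scrape` function, and it is not how Octoparse itself is implemented.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for paginated search results: each inner list is one
# results page, and each dict is one business block on that page.
FAKE_PAGES: List[List[Dict[str, str]]] = [
    [
        {"title": "Cafe A", "reviews": "120", "website": "cafe-a.example"},
        {"title": "Cafe B", "reviews": "45", "website": "cafe-b.example"},
    ],
    [
        {"title": "Cafe C", "reviews": "8", "website": "cafe-c.example"},
    ],
]


def scrape(pages: List[List[Dict[str, str]]]) -> Iterator[Dict[str, str]]:
    """Mirror the workflow order: pagination loop, item loop, detail page."""
    for page in pages:          # pagination loop (steps 2 and 3)
        for block in page:      # loop over listing blocks (step 4)
            # Fields available on the listing page.
            row = {"title": block["title"], "reviews": block["reviews"]}
            # Fields that require opening the detail page (steps 5 and 6).
            row["website"] = block["website"]
            yield row


rows = list(scrape(FAKE_PAGES))
for row in rows:
    print(row)
```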
1. Go To Web Page - open the target web page
- Enter the URL on the home page and click "Start"
2. Set up a loop for pagination
- Select the Next button
- Click "Loop click single element" on the tips panel
3. Modify the settings of the Pagination
- Open the action settings of the Pagination step
- Enter the XPath: //a[contains(@class,'next-link')]
- Click "OK" to confirm
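To see what the pagination XPath selects, note that `contains(@class,'next-link')` is a substring test on the `class` attribute of each `<a>` element. The sketch below approximates that test with Python's standard `html.parser` instead of a real XPath engine. The sample markup and the `NextLinkFinder` class are hypothetical, and the class names on the live page may differ or change.

```python
from html.parser import HTMLParser
from typing import List, Optional


class NextLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose class contains 'next-link'."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Same idea as contains(@class,'next-link'): a plain substring match.
        if tag == "a" and "next-link" in (attr_map.get("class") or ""):
            self.hrefs.append(attr_map.get("href"))


# Hypothetical markup for illustration only.
SAMPLE = """
<div>
  <a class="prev-link css-abc" href="/search?start=0">Previous</a>
  <a class="next-link css-xyz" href="/search?start=20">Next</a>
</div>
"""

finder = NextLinkFinder()
finder.feed(SAMPLE)
print(finder.hrefs)
```

Because the match is on a substring, it keeps working when extra class names are added around `next-link`, which is why a `contains` expression tends to be more robust than an exact class match.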
If you want to learn more about XPath, please check the following tutorial:
4. Extract data on the listing page
- Select the first and the second block
- Click "Extract text of the selected elements" on the tips panel to create a loop item
- Select each piece of data you need in the highlighted block (the red one) and click "Extract text/URL of the selected link". Repeat this step until all the data you need is extracted
- Delete or modify fields in the Data Preview
If all the data you need can be scraped from the listing page, you can stop here and jump to step 8, Start extraction - run the task and get data. If you want to go to each detail page to get more information, follow the steps below.
5. Click into each detail page to get more information
- Click the title in the highlighted block
- Select "Click URL" to go to the detail page
6. Extract data - extract data on the detail pages
- Select information on the web page
- Choose "Extract text of the selected element"
- Repeat the above steps to extract all the data you need
- Rename the fields if needed
Next, we need to revise the element XPath of each data field so that the information is scraped more precisely.
- Double-click the "Extract Data1" action in the workflow
- Click the icon to modify the XPath one by one
- Replace the default XPath with the revised one
No worries! We have prepared XPath expressions for the most frequently used fields, so you can simply use the ones provided below.
- Website: //p[text()='Business website']/following-sibling::p
- Phone: //p[text()='Phone number']/following-sibling::p
- Open hours: //table[contains(@class,'hours-table')]
- Address: //address
- Amenities: //h4[text()='Amenities and More']/../../following-sibling::div
- About the business: //h4[text()='About the Business']/../../following-sibling::div
- Price range: //span[contains(text(),'$')]
- Category: //a[contains(@class,'link-color--inherit')]
- Rating: (//div[contains(@aria-label,'star rating')])
The rating is stored in the value of an attribute rather than in the element's text. Click the rating area and choose to extract the text, then customize the field so that it scrapes the value of the "aria-label" attribute instead.
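Once the "aria-label" value is exported, you may want the rating as a number. The following is a minimal post-processing sketch, assuming the label looks like "4.5 star rating". That exact format is an assumption based on the XPath above, and `parse_rating` is a hypothetical helper, not part of Octoparse.

```python
import re
from typing import Optional


def parse_rating(aria_label: str) -> Optional[float]:
    """Return the leading number from a label such as '4.5 star rating'.

    The label format is assumed; returns None when it does not match.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+star rating", aria_label)
    return float(match.group(1)) if match else None


print(parse_rating("4.5 star rating"))
print(parse_rating("no rating"))
```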
To know more about how to deal with "Extract Data", check the following guides:
7. Set up wait time to slow down the scraping speed
- Double-click the "Extract Data1" action
- Tick "Wait before action"
- Select the wait time as 7s-10s
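For readers who want to reproduce the same pacing in their own scripts, the setting above can be sketched as a pause before each extraction. Treating "7s-10s" as a random delay drawn from that range is an assumption, and `wait_before_action` is a hypothetical helper name.

```python
import random
import time


def wait_before_action(min_s: float = 7.0, max_s: float = 10.0, sleep=time.sleep) -> float:
    """Pause for a random duration between min_s and max_s seconds.

    Returns the delay that was used. The sleep function is injectable so the
    chosen delay can be inspected without actually waiting.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Pass a no-op sleeper here so the example finishes immediately.
delay = wait_before_action(sleep=lambda _seconds: None)
print(f"would wait {delay:.1f}s before extracting")
```

Varying the delay rather than using a fixed value spaces requests less regularly, which is the usual reason for slowing a scraper down this way.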
8. Start extraction - run the task and get data
- Click on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run task in the Cloud" to run the task in the Cloud (for premium users only)
Here is the sample output.