Scrape business information from Yelp
In this tutorial, we will show you how to collect business information from Yelp with Octoparse 7.
To build the task from scratch, continue with the steps below, or check the newly updated tutorial on scraping Yelp with our brand-new version 8.1 beta.
To demonstrate, we will use the URL below as an example.
https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1
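To see what this URL asks Yelp for, the query string can be decoded with Python's standard library. This is only an illustrative sketch and is not part of the Octoparse task; the helper name `search_params` is made up for this example, and reading `find_loc` as the search location is an assumption based on its decoded value.

```python
from urllib.parse import urlsplit, parse_qs

URL = "https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1"

def search_params(url):
    """Return the query parameters of a search URL as a plain dict."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps the empty find_desc parameter
    parsed = parse_qs(query, keep_blank_values=True)
    # parse_qs returns a list per parameter; each one appears once here
    return {name: values[0] for name, values in parsed.items()}

params = search_params(URL)
print(params)
# {'find_desc': '', 'find_loc': 'Seattle, WA', 'ns': '1'}
```

The decoded `find_loc` value shows that the example lists businesses in Seattle, WA, with no search keyword set in `find_desc`.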
We will scrape data such as the title, star rating, number of reviews, telephone number, and website with Octoparse.
Here are the main steps in this tutorial: [Download demo task file here]
- Go To Web Page - open the target web page
- Create a pagination loop - scrape multiple list pages
- Create a "Loop Item" - loop through and extract each item on the list
- Extract data - select the data fields needed for the extraction
- Start extraction - run the task and get data
1. Go To Web Page - open the target web page
- Click "+ Task" to start a task using the Advanced Mode
- Copy and paste the target URL into the "Website" box
- Uncheck "Retry when page fails to load"
- Click "Save URL" to move on
2. Create a pagination loop - scrape multiple list pages
- Scroll down the page and RIGHT click the "next" button
- Click "Loop click next page" on the "Action Tips" panel
- Uncheck "Retry when the page remains unchanged"
- Click "Load the page with Ajax" and set the AJAX timeout to 5 seconds (optional)
Tips! The AJAX timeout can often be used as a page timeout for a Click action. For example, when a page keeps loading long after the data you need has appeared, you can use the AJAX timeout to tell Octoparse to move on to the next action once the set time is reached. To learn more about AJAX, see the AJAX video tutorial.
3. Create a "Loop Item" - loop through and extract each item on the list
- Click on the first title of the listing on the current page
- Select "Select all" on the "Action Tips" panel
- Select "Extract link text" on the "Action Tips" panel
- Click "Save" to move on
Tips! Normally, the first several items on each Yelp page are advertisements. When creating the loop, take care to select only the target items, which are the ones with a sequence number.
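If some ads still end up in the exported data, the same rule (keep only items with a sequence number) can be applied afterwards. The sketch below is a post-processing idea, not an Octoparse feature. It assumes that organic titles were exported with a leading number such as "1. ", and the sample titles are invented placeholders.

```python
import re

# A sequence number looks like "1. " or "12. " at the start of the title.
NUMBERED = re.compile(r"^\s*\d+\.\s+")

def organic_titles(titles):
    """Keep only numbered titles and strip the sequence number."""
    return [NUMBERED.sub("", t, count=1) for t in titles if NUMBERED.match(t)]

# Hypothetical exported titles: the first one stands in for an ad.
sample = ["Example Sponsored Cafe", "1. Example Diner", "2. Example Bakery"]
print(organic_titles(sample))
# ['Example Diner', 'Example Bakery']
```

The pattern requires whitespace after the dot, so a business name that merely begins with digits and a dot (for example "5.11 ...") is not mistaken for a numbered item.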
4. Extract data - select the data fields needed for the extraction
- Click on phone number
- Select "Extract text of the selected element" on the "Action Tips"
- Click on the number of reviews
- Select "Extract text of the selected element" on the "Action Tips"
- Click on the title
- Select "Extract the URL of the selected element" on the "Action Tips"
- Click on the star rating
- Select "Extract outer HTML of the selected element"
5. Reformat data - Use "Match with Regular Expression" to extract the rating
The star rating is extracted as outer HTML, so it needs to be reformatted to keep only the rating value.
- Click the "Customize data field" icon
- Click "Refine extracted data"
- Click "Add step"
- Select "Match with Regular Expression"
- Click "Try Regex Tool"
- Check "Starts with" and type in aria-label=" (including the trailing double quote)
- Check "Ends with" and type in " star" (with the leading space)
- Click "Generate", then click "Match" to check that the rating is extracted correctly
- Click "Apply" and then "OK" to save
- Rename the fields by selecting from the predefined list or inputting on your own
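To see what the "Starts with" and "Ends with" settings do to the extracted outer HTML, the same idea can be reproduced with a regular expression in Python. This is a sketch only. The pattern is an assumption about what such settings amount to, not necessarily the expression Octoparse generates, and the sample HTML and the name `extract_rating` are hypothetical.

```python
import re

# Text that follows aria-label=" and precedes " star".
RATING = re.compile(r'(?<=aria-label=").*?(?= star)')

def extract_rating(outer_html):
    """Return the rating text from a star element's outer HTML, or None."""
    match = RATING.search(outer_html)
    return match.group(0) if match else None

# Hypothetical outer HTML for a star rating element.
html = '<div class="stars" aria-label="4.5 star rating" role="img"></div>'
print(extract_rating(html))
# 4.5
```

The lookbehind and lookahead leave the delimiters out of the match, so only the value between them (here "4.5") is kept.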
6. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is a sample of the extracted data.
Japanese article: Scrape business information from Yelp
You can also read articles about web scraping on the official website.
Spanish article: Scrape business information from Yelp
You can also read web scraping articles on the official website.
Writer: Eric
Editor: Fergus
Related articles:
How to exclude "Ads" items when creating a list