Yelp is one of the largest business directory websites on the Internet. Every day, millions of people use it to search for businesses or leave reviews, and this business information and these reviews make the platform a valuable source of data. In this tutorial, we will show you how to collect business information from Yelp with Octoparse 8.1.
The easiest way is to use our pre-built Yelp templates. There is no need to configure a scraping task: just enter keywords or URLs to get the data. For further details, see: Task Templates
If you would like to know how to build the task from scratch, continue reading this tutorial or watch the video below.
To demonstrate, we will use the URL below as an example.
We will scrape data such as title, star rating, number of reviews, and website with Octoparse.
Here are the main steps in this tutorial:
- Go To Web Page - open the target web page
- Auto-detect the web page - create the workflow
- Modify the settings of the Pagination
- Click into each detail page to get more information
- Extract Data - extract data on the detail pages
- Set up wait time - slow down the scraping speed
- Start extraction - run the task and get data
1. Go To Web Page - open the target web page
- Enter the URL on the home page and click "Start"
2. Auto-detect the web page - create the workflow
- Click "Auto-detect web page data" and wait for the detection to complete
- Delete the unwanted fields or rename fields if needed on the Data preview
- Click "Edit" under "Paginate to scrape more pages"
- Select the next button on the web page
- Click "Confirm"
- Click "Create workflow"
3. Modify the settings of the Pagination
- Open the action settings of "Pagination"
- Enter the XPath: //a[contains(@class,'next-link')]
- Click "OK" to confirm
- Open the action settings of "Click to Paginate"
- Extend the AJAX timeout to 7-10s
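To see what the pagination XPath in this step selects, here is a minimal Python sketch. The HTML snippet and its class names are made up for illustration and are not Yelp's actual markup. Python's built-in ElementTree does not support the XPath contains() function, so the substring test is written out by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup for illustration, NOT copied from Yelp.
SAMPLE = """
<div>
  <a class="pagination-link prev-link" href="/search?start=0">Previous</a>
  <a class="pagination-link next-link" href="/search?start=20">Next</a>
</div>
"""


def find_next_links(markup):
    """Mimic //a[contains(@class,'next-link')]: return every <a> element
    whose class attribute contains the substring 'next-link'."""
    root = ET.fromstring(markup.strip())
    return [a for a in root.iter("a") if "next-link" in a.get("class", "")]


next_links = find_next_links(SAMPLE)
print([a.get("href") for a in next_links])
```

Because contains() is a substring match, the expression keeps working when the button carries extra classes, which is likely why it is a more stable choice than matching the full class value.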
If all the data you need can be scraped from the listing page, you can stop here and jump to step 7, "Start extraction - run the task and get data". If you want to open each detail page to get more information, follow the steps below.
4. Click into each detail page to get more information
- Choose “Click on link(s) to scrape the linked page(s)” on the Tips panel
- Select "Click on an extracted data field" and select "Title_URL" from the drop-down menu (you can confirm if it's the correct link on the Data Preview)
- Click "Confirm"
5. Extract Data - extract data on the detail pages
- Select information on the web page
- Choose "Extract text of the selected element"
- Repeat the above steps to extract all the data you need
- Rename the fields if needed
Now we need to revise the element XPath for each data field to scrape the information more precisely.
- Double-click the "Extract Data1" action in the workflow
- Click the icon to modify the XPath one by one
- Replace the default XPath with the revised one
No worries! We have prepared the XPath for the most frequently used fields, so you can simply use the element XPath provided below.
- Website: //p[text()='Business website']/following-sibling::p
- Phone: //p[text()='Phone number']/following-sibling::p
- Open hours: //table[contains(@class,'hours-table')]
- Address: //address
- Amenities: //h4[text()='Amenities and More']/../../following-sibling::div
- About the business: //h4[text()='About the Business']/../../following-sibling::div
- Price range: //span[contains(text(),'$')]
- Category: //a[contains(@class,'link-color--inherit')]
- Rating: (//div[contains(@aria-label,'star rating')])
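The Website and Phone expressions above follow the same pattern: find a label paragraph by its exact text, then take the paragraph that comes after it. The sketch below mimics that pattern in Python on a made-up fragment. The markup and the sample values are hypothetical, and the sibling lookup is written by hand because ElementTree's XPath subset has no following-sibling axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail-page fragment for illustration, NOT copied from Yelp.
SAMPLE = """
<section>
  <div>
    <p>Business website</p>
    <p>example.com</p>
  </div>
  <div>
    <p>Phone number</p>
    <p>(555) 010-0000</p>
  </div>
</section>
"""


def following_sibling_text(root, label):
    """Mimic //p[text()='<label>']/following-sibling::p and return the text
    of the first matching sibling, or None if nothing matches."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "p" and child.text == label:
                # Only elements after the label in the same parent count.
                for sibling in children[index + 1:]:
                    if sibling.tag == "p":
                        return sibling.text
    return None


root = ET.fromstring(SAMPLE.strip())
print(following_sibling_text(root, "Business website"))
print(following_sibling_text(root, "Phone number"))
```

Anchoring on the visible label text instead of a class name means the field is still found if the page's styling classes change, as long as the label wording stays the same.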
The rating is stored in the value of the "aria-label" attribute rather than in the element's text. Click the rating area and choose to extract the text, then customize the field to scrape the value of the "aria-label" attribute instead.
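Once the "aria-label" value has been exported, you may want the rating as a number. Here is a small Python sketch for that post-processing step. It assumes the value looks like "4.5 star rating"; that exact format is an assumption based on the XPath above, so check it against your own output first.

```python
import re


def parse_star_rating(aria_label):
    """Return the leading number of a label such as '4.5 star rating' as a
    float, or None if the label does not have that (assumed) format."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*star rating", aria_label)
    return float(match.group(1)) if match else None


print(parse_star_rating("4.5 star rating"))
print(parse_star_rating("5 star rating"))
print(parse_star_rating("No rating yet"))
```

Returning None for unexpected values makes it easy to spot rows where the label had a different wording.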
6. Set up wait time to slow down the scraping speed
- Double-click the "Extract Data1" action
- Tick "Wait before action"
- Set the wait time to 7-10s
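For readers who also write their own scrapers, the idea behind this setting can be sketched in a few lines of Python. This is not how Octoparse implements the option internally; the function name is hypothetical, and picking a random value inside the 7-10s range is our assumption.

```python
import random
import time


def wait_before_action(min_seconds=7.0, max_seconds=10.0, sleep=time.sleep):
    """Pause for a random duration within the given range before the next
    extraction, and return the delay that was used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demo with a no-op sleep so the example finishes instantly.
delay = wait_before_action(sleep=lambda seconds: None)
print(f"would wait {delay:.1f}s before extracting")
```

In a real run you would call wait_before_action() with the default sleep so that each detail page has time to load before data is extracted.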
7. Start extraction - run the task and get data
- Click "Run" on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run task in the Cloud" to run the task in the Cloud (for premium users only)
Here is the sample output.
Tutorial in Spanish: Scrapear información comercial de Yelp
You can also read more web scraping articles on the official website