The new tutorial about using templates in version 8 is available here.
In this tutorial, we show how to scrape Yelp review data. We will go through the detail page of each coffee shop to scrape the shop name, the reviewer's name, and the comment.
To follow along, you may want to use the URL in this tutorial:
This tutorial will also cover:
- Modifying XPath to locate the desired data accurately
Main steps in the tutorial: [Download demo task file here]
- "Go To Web Page" - open the targeted web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - loop capture review information on the list for extraction
- Customize the data field by modifying XPath – improve the accuracy of a certain data field (Optional)
- Start extraction - run the task and get data
1. "Go To Web Page" - open the targeted web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape websites with complex structures, we strongly recommend starting your data extraction project in "Advanced Mode".
- Paste the URL into the "Website" box and click "Save URL" to open the target website
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down and click the "Next Page" button on the web page
- Click "Loop click next page" on the "Action Tips" panel
Because this website uses AJAX to load new content, we need to set up "AJAX Load" to keep Octoparse from getting stuck.
- Uncheck "Auto-Retry"
- Check "AJAX Load" and set "AJAX Timeout" to 3 seconds
To learn more about AJAX, please refer to:
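The idea behind a timeout for AJAX-loaded content can be sketched in Python. This is only an illustration of a bounded wait, not how Octoparse is implemented internally. The function name `wait_for_content` and its parameters are assumptions made for this example.

```python
import time


def wait_for_content(is_loaded, timeout=3.0, poll_interval=0.1):
    """Poll is_loaded() until it returns True or `timeout` seconds pass.

    Returns True if the content appeared in time, False otherwise.
    Hypothetical helper for illustration; not part of Octoparse.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_loaded():
            return True
        time.sleep(poll_interval)
    return False


if __name__ == "__main__":
    # Simulate content that becomes available on the third check.
    checks = {"count": 0}

    def fake_page_ready():
        checks["count"] += 1
        return checks["count"] >= 3

    print(wait_for_content(fake_page_ready, timeout=3.0))
```

Without an upper bound like this, a step that waits for a full page reload could wait indefinitely on a page that only updates part of its content.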
3. Create a "Loop Item" - loop click into each item on each list
We are now on the second page. When creating a "Loop Item", we should always start with the 1st item on the 1st page.
- Click "Go To Web Page" in the workflow.
- Select the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
- Click the first cafe item
- Click "Select All" on the "Action Tips" panel
- Select "Loop click each URL"
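The execution order that results from placing the "Loop Item" inside the pagination loop can be sketched as two nested loops. The data below is a made-up stand-in for result pages, and `visit_order` is a hypothetical name used only for this illustration.

```python
# Hypothetical stand-in for the site: each inner list is one page of results.
PAGES = [
    ["cafe-a", "cafe-b"],
    ["cafe-c"],
]


def visit_order(pages):
    """Return (page number, item) pairs in the order they would be visited."""
    visited = []
    for page_number, items in enumerate(pages, start=1):  # pagination loop
        for item in items:                                # Loop Item on that page
            visited.append((page_number, item))           # click into the detail page
    return visited


if __name__ == "__main__":
    for page_number, item in visit_order(PAGES):
        print(page_number, item)
```

Every item on page 1 is handled before the outer loop moves on to page 2, which is why the Loop Item has to be created starting from the first item on the first page.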
4. Extract data - loop capture review information on the list for extraction
This tutorial only scrapes the 1st page of review information for demonstration. If you need to scrape multiple pages of reviews, you just need to create another pagination loop.
- Click the cafe name on the web page
- Click "Extract text of selected element" on the "Action Tips" panel
Now, let's build a "Loop item" to have all reviews captured.
- Click the first and second comment sections one after the other
Please make sure to select the whole comment block. Octoparse will automatically identify all the comment sections on the page based on the pattern you've just defined.
- Click "Extract text of the selected elements"
A "Loop Item" will be automatically generated and added to the workflow. By default, Octoparse automatically extracts from the item selected; however, if this is not exactly what you are looking for, you can delete the fields and add new ones you need as below.
- Delete the unwanted data fields
- Select the data you want in the comment area, like the username, location, and comment
- Rename the fields by selecting a name from the predefined list or entering your own
- Click "Extract text of the selected element" on the "Action Tips" panel
- Click "OK" to save
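As a rough sketch of what the "Loop Item" does with the fields defined above, the following Python example walks over repeated review blocks and reads the same fields from each one. The markup, class names, and the `extract_reviews` function are invented for illustration and do not reflect Yelp's actual page structure.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup; real pages are far more complex.
SAMPLE = """
<div>
  <h1>Example Cafe</h1>
  <ul>
    <li class="review"><span class="user">Ann</span><span class="location">Austin, TX</span><p>Great espresso.</p></li>
    <li class="review"><span class="user">Ben</span><span class="location">Reno, NV</span><p>Cozy spot.</p></li>
  </ul>
</div>
"""


def extract_reviews(markup):
    """Return one row per review block, each carrying the shop name."""
    root = ET.fromstring(markup.strip())
    shop = root.findtext("h1")  # extracted once per detail page
    rows = []
    for block in root.findall(".//li[@class='review']"):  # the repeated pattern
        rows.append({
            "shop": shop,
            "user": block.findtext("span[@class='user']"),
            "location": block.findtext("span[@class='location']"),
            "comment": block.findtext("p"),
        })
    return rows


if __name__ == "__main__":
    for row in extract_reviews(SAMPLE):
        print(row)
```

Each field is looked up relative to its own review block, which is why selecting the whole comment block matters when defining the pattern.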
Here is a tutorial for capturing a list of items:
5. Customize the data field by modifying XPath – improve the accuracy of a certain data field (Optional)
In this case, the cafe names are not always located in the same place on different detail pages. To avoid missing data caused by this irregular positioning, we need to modify the XPath in Octoparse so that the element is located precisely on each page.
The revised XPath of the cafe name is:
- Click "Customize data field"
- Select "Customize XPath"
- Paste the revised XPath into the "Matching XPath" text box
- Click "OK" to save
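To illustrate why an XPath tied to an element's position can miss data while one tied to an attribute does not, here is a small Python sketch. The two sample pages, the `shop-name` class, and both XPath expressions are hypothetical; they are not the revised XPath used in this tutorial, and they do not describe Yelp's real markup.

```python
import xml.etree.ElementTree as ET

# Two invented detail pages where the name sits at different positions.
PAGE_A = "<body><div><h1 class='shop-name'>Cafe One</h1></div></body>"
PAGE_B = (
    "<body><div><p>Sponsored</p></div>"
    "<div><section><h1 class='shop-name'>Cafe Two</h1></section></div></body>"
)

POSITIONAL = "./div[1]/h1"                  # depends on where the element sits
BY_ATTRIBUTE = ".//h1[@class='shop-name']"  # depends on what the element is


def shop_name(markup, xpath):
    """Return the text of the first match for `xpath`, or None if nothing matches."""
    node = ET.fromstring(markup).find(xpath)
    return node.text if node is not None else None


if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(shop_name(page, POSITIONAL), shop_name(page, BY_ATTRIBUTE))
```

The positional expression finds the name on the first page but returns nothing on the second, while the attribute-based expression matches on both.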
To improve the accuracy of a certain data field, modifying XPath in Octoparse is highly recommended. Here are some related tutorials you might need:
6. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
Here is the sample output.
Article in Spanish: Scraping Datos de Comentarios de Yelp
You can also read web scraping articles on the official website