Octoparse 7 simplifies scraping data from directory websites. In this tutorial, we will show you how to scrape data from a directory, using yelp.com as an example.
To follow along, you may want to use this URL in the tutorial:
We are going to extract restaurant data including restaurant name, restaurant website, price range, telephone, star rating and category.
1) "Go To Web Page" - open the targeted webpage in the built-in browser
· Select "Advanced Mode" to create a task. Advanced Mode supports flexible configuration and complex websites. Once you are familiar with it, you should be able to scrape most websites.
· Enter URL and click "Save URL"
· Turn on "Workflow" mode to check and edit your workflow conveniently
2) Create a pagination loop - to scrape all the results from all the pages
· Click the "Next" button, then select "Loop click the selected link" in the "Action Tips" panel
· Set up "AJAX Timeout". To learn more about AJAX, please refer to another tutorial: Deal with AJAX
3) Create a "Loop Item" - scrape all the items on the page
To create a "Loop Item", we should first go back to the first page of the website, which serves as the reference page for all other pages.
· Click "Go To Web Page", then Octoparse will open the original web page
· Select the pagination loop
Now we can create the loop to scrape the information of all the items on the page.
· Click the title of the first listed restaurant. Octoparse 7 will automatically identify the similar links on the page
· Click "Select all" in the "Action Tips"
· Select "Loop click each element"
4) Extract data - select the data to be extracted from the webpage
· Click the data you need to extract
· Select "Extract text of the selected element" in the "Action Tips"
· Edit the field name
When you click the star rating on the page, choose "Extract inner HTML of the selected element" instead. The extracted data needs further processing with a regular expression; see step 5 for how this is done.
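The workflow built in steps 1 to 4 amounts to two nested loops: an outer pagination loop and an inner "Loop Item" that visits each listing and extracts the selected fields. The following is only a conceptual sketch of that logic, not how Octoparse is implemented; `PAGES`, `run_workflow` and the sample restaurant values are hypothetical stand-ins, and no real website is contacted.

```python
# Hypothetical in-memory stand-in for the site: each inner list is one results
# page, and each dict is the detail page reached by clicking a listing title.
PAGES = [
    [
        {"name": "Restaurant A", "price_range": "$$"},
        {"name": "Restaurant B", "price_range": "$"},
    ],
    [
        {"name": "Restaurant C", "price_range": "$$$"},
    ],
]


def run_workflow(pages):
    """Mimic the workflow: go to page, paginate, loop over items, extract."""
    rows = []
    page_index = 0                        # "Go To Web Page": start on page one
    while page_index < len(pages):        # pagination loop: stop when no "Next"
        for detail in pages[page_index]:  # "Loop Item": every listing on the page
            rows.append({                 # "Extract data": copy the chosen fields
                "name": detail["name"],
                "price_range": detail["price_range"],
            })
        page_index += 1                   # click "Next"
    return rows


rows = run_workflow(PAGES)
print(len(rows))
```

The order of the loops mirrors the workflow: the item loop sits inside the pagination loop, so every listing on a page is processed before the "Next" click.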
5) Customize the data field - to reformat star-rating data (Optional)
In some cases, the data you need is buried in the HTML together with extra strings that you don't need. For example, the star rating cannot be extracted simply by clicking on it. Instead, we extract the HTML first and then reformat the extracted data to trim the unwanted strings. This example processes the extracted data in three steps.
1. Modify the "Star_Rating" field
· Select the data, and click "Customize data field"
· Select "Refine extracted data", select "Add step", and then select "Match with Regular Expression"
· Select "Try RegEx Tool"
· Start with 'alt="' and end with 'star rating', click "Generate", then "Match"; the matches field should then show only the number
· Click "Apply" and "OK"
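To see what this "start with / end with" match does, here is a minimal Python sketch of the same idea. The sample `inner_html` string and the pattern are assumptions made for illustration; the actual markup on the page and the expression generated by the RegEx Tool may differ.

```python
import re

# Hypothetical inner HTML of a star-rating element; real markup may differ.
inner_html = '<img class="offscreen" src="stars.png" alt="4.5 star rating">'

# Capture whatever sits between 'alt="' and 'star rating'.
pattern = re.compile(r'alt="(.*?)\s*star rating')

match = pattern.search(inner_html)
star_rating = match.group(1) if match else None
print(star_rating)
```

The lazy `.*?` keeps the capture as short as possible, so only the number before "star rating" is returned.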
2. Delete the unwanted spaces in the Category and Address
· Select the Category data, and click "Customize data field"
· Choose "Refine extracted data", then choose "Add step", and click "Replace with Regular Expression"
· Enter "\s+" in the Regular Expression box and a single space in Replace With, then click "Evaluate"
· Click "OK"
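The effect of replacing "\s+" with a single space can be reproduced in a few lines of Python. The `raw_category` value is a made-up example of text extracted with stray line breaks and spaces; the final `strip()` is an extra step that is not part of the tutorial.

```python
import re

# Hypothetical raw category text with line breaks and runs of spaces.
raw_category = "  Italian,\n      Pizza,\n   Wine Bars  "

# "\s+" matches one or more whitespace characters (spaces, tabs, newlines);
# each run is replaced by a single space.
collapsed = re.sub(r"\s+", " ", raw_category)

# Extra step, not in the tutorial: trim the leftover leading/trailing space.
clean_category = collapsed.strip()

print(repr(collapsed))
print(repr(clean_category))
```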
3. Improve the accuracy of the location of "Title"
Because web page structures vary, we sometimes need to modify the XPath to avoid missing data.
· Select the Title, and click "Customize data field"
· Choose "Customize Xpath"
· Enter XPath "//H1"
· Click "OK"
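The XPath "//H1" selects the page's top-level heading wherever it sits in the document. As a rough illustration, the sketch below does the equivalent lookup with Python's standard library. The sample `page` markup is hypothetical, and because `xml.etree.ElementTree` supports only a subset of XPath and is case-sensitive, the sketch uses the relative, lowercase form ".//h1" rather than the exact expression entered in Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a restaurant detail page.
page = """
<html>
  <body>
    <div class="header"><h1>Restaurant A</h1></div>
    <div class="content"><h2>Reviews</h2></div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

# ".//h1" finds the first <h1> at any depth below the root element.
title_node = root.find(".//h1")
title = title_node.text if title_node is not None else None
print(title)
```

Matching on the tag alone, rather than on a long position-based path, is what makes the selection robust when the surrounding layout differs from page to page.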
6) Save and start extraction - to run your task and get data
· Click "Save"
· Click "Start Extraction"