You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!
Google Scholar provides a simple way to search for scholarly literature broadly. As a freely accessible web search engine, it is a perfect site to scrape academic-related data.
In this tutorial, we will show you how to scrape search results from Google Scholar with Octoparse.
Before you build a crawler on your own, you may want to check out the pre-built Google Scholar template for an easier way to get data. Enter your keywords to get the data extracted within minutes!
If the template falls short of your needs and you want to build the crawler from scratch, continue with this tutorial. The sample URL is: https://scholar.google.com/ncr
We will search with multiple keywords and scrape each article's title, author, and description information from the search results pages.
Here are the main steps in this tutorial: [Download task file here]
1. Create a Go To Web Page - to open the target web page
Every workflow in Octoparse starts by telling Octoparse a web page to start from.
Enter the sample URL into the search bar at the top of the home screen and click Start
Check if a Go to Web Page action has been generated in your workflow. If you have more than one URL, check this article to see how Octoparse handles a list of URLs.
Now we have reached the target web page.
2. Create a Loop Item - to enter multiple keywords
If we want to search for multiple keywords on Google Scholar, we need to create a loop search action for our keyword list.
Hover over the down arrow and click the Add Step button to add a Loop Item to the workflow
Click Loop Item to go to its settings panel
Set its loop mode to Text List in the General tab
Click the Edit icon to enter your keyword list (e.g., data mining, big data, etc.), one keyword per line
Click Confirm and then Apply to save the settings
Click on the search box on the web page and select Enter text on the Tips panel
Make sure Enter Text is inside the Loop Item. If not, drag the Enter Text action into the Loop Item
Click on the Enter Text action and select Use text in the loop to enter the text box
Click OK, and you will see that the default name of the action has been changed to Enter loop items
We can check if the steps are set up correctly by clicking the Loop Item and then Enter Text in the workflow to see if the text would be entered into the web page.
Click the Google Scholar search button on the web page
Select Click button on the Tips panel, and you will notice the Click Item action is added to the workflow
Open the settings of the Click Item and extend the AJAX timeout
Octoparse will automatically enter every keyword in the list in the search box and click the search icon.
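For readers who think in code, the keyword loop above is roughly comparable to the following Python sketch. It does not show how Octoparse works internally. The helper name `build_search_url` is hypothetical, and the `/scholar?q=` URL pattern is an assumption used only for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL for a Google Scholar results page (illustrative only).
SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def build_search_url(keyword: str) -> str:
    """Return a results-page URL for one keyword (hypothetical helper)."""
    return f"{SCHOLAR_SEARCH}?{urlencode({'q': keyword})}"


# One keyword per entry, mirroring the Text List in the Loop Item.
keywords = ["data mining", "big data"]

# The Loop Item repeats the same actions once per keyword.
search_urls = [build_search_url(k) for k in keywords]
for url in search_urls:
    print(url)
```

The point of the sketch is the shape of the loop: the list of keywords drives the repetition, and every action placed inside the Loop Item runs once per entry.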
3. Auto-detect the search result page to scrape data
If you are on version 8 or above, Octoparse can auto-detect all web page elements and guide you through the settings on data extraction, pagination, page scroll, and so on. Use this feature to set up another loop to extract data from each result page.
Click Auto-detect the web page data and wait for it to complete
Go to the Data preview section to rename or delete the auto-captured data fields
Check the Paginate to scrape more pages option to see if Octoparse has detected the right next page button
Edit the pagination setup and then click Confirm
Uncheck Add a page scroll, as the web page doesn't need to be scrolled to load
Click Create workflow
Octoparse will go to each result page and scrape the data we want.
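The pagination loop created in this step can be pictured with a small Python sketch. It runs against an in-memory stand-in (`FAKE_PAGES`) rather than a real website, and every name and value in it is a placeholder chosen for illustration, not something taken from Octoparse or Google Scholar.

```python
from typing import Dict, List, Optional

# Stand-in for result pages: each page has rows and an optional next page.
# Titles, authors, and descriptions are placeholders, not real results.
FAKE_PAGES = {
    "page1": {
        "rows": [{"title": "Title A", "author": "Author A", "description": "Snippet A"}],
        "next": "page2",
    },
    "page2": {
        "rows": [{"title": "Title B", "author": "Author B", "description": "Snippet B"}],
        "next": None,
    },
}


def scrape_all_pages(start: str, max_pages: int = 10) -> List[Dict[str, str]]:
    """Collect rows page by page until there is no next page."""
    results: List[Dict[str, str]] = []
    current: Optional[str] = start
    visited = 0
    while current is not None and visited < max_pages:
        page = FAKE_PAGES[current]
        results.extend(page["rows"])  # corresponds to Extract Data
        current = page["next"]        # corresponds to clicking the next page button
        visited += 1
    return results


rows = scrape_all_pages("page1")
print(len(rows))
```

The `max_pages` cap is a design choice in this sketch: it keeps the loop from running indefinitely if a next page link is always present.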
4. Set up a wait time to slow down the scraping speed
This step is mandatory as Google Scholar applies anti-scraping measures and may ask us to pass a reCAPTCHA test if we scrape too fast.
Click the Extract Data action
Tick Wait before action in the Options tab and set the wait time to 3s
Click Apply to save the settings
Octoparse will wait 3 seconds every time it executes the Extract Data action.
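The effect of Wait before action can be sketched in Python as a fixed pause placed in front of each extraction. This is only an illustration under assumptions: `run_with_wait` is a hypothetical helper, and the demo swaps in a recording function for `time.sleep` so it finishes instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

WAIT_SECONDS = 3.0  # matches the 3s wait set in this step


def run_with_wait(
    action: Callable[[], T],
    wait: float = WAIT_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Pause for `wait` seconds, then run `action` and return its result."""
    sleep(wait)
    return action()


# Demo: record the requested pause instead of actually sleeping.
waits: List[float] = []
result = run_with_wait(lambda: "row extracted", sleep=waits.append)
print(result, waits)
```

Leaving out the `sleep` argument makes the helper really pause for 3 seconds before each action, which is the behavior the setting above produces.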
5. Save and run the task to get data
The last step is to save your task and run it.
Click Save on the upper right to save your task
Click Run next to it and wait for a Run task window to pop up
Select Run on your device to run the task on your local device
Wait for the task to complete
Here is the sample output from a local run.
Tip: Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.