Google Scholar provides a simple way to search broadly for scholarly literature. As a freely accessible web search engine, it is a convenient source of academic data to scrape.
In this tutorial, we are going to show you how to scrape search results from Google Scholar with Octoparse.
Before you start building a crawler on your own, you may want to check out the pre-built Google Scholar template for an easier way to get data. Enter your keywords to get the data extracted within minutes!
If the template falls short of your needs and you would like to build the crawler from scratch, continue with this tutorial. The sample URL is: https://scholar.google.com/ncr
We will search with multiple keywords and scrape the title, author, and description information for each article from the search results pages.
Here are the major steps covered in this tutorial: [Download the demo task file at the bottom of this article]
- Create a Go to Web Page action - to open the target web page
- Create a Loop Item - to enter multiple search keywords
- Auto-detect the search result page to scrape data
- Set up a wait time to slow down the scraping speed
- Save and run the task to get data
1. Create a Go to Web Page - to open the target web page
Every workflow in Octoparse starts by telling Octoparse a web page to start from.
- Enter the sample URL into the search bar at the top of the home screen and click Start
Check if a Go to Web Page action has been generated in your workflow. If you have more than one URL, check this article to see how Octoparse handles a list of URLs.
Now we have reached the target web page.
2. Create a Loop Item - to enter multiple keywords
If we want to search for multiple keywords on Google Scholar, we need to create a loop search action for our keyword list.
- Hover over the down arrow and click to add a Loop Item to the workflow
- Click Loop Item to go to its settings panel
- Set its loop mode to Text List in the General tab
- Click the icon to enter your keyword list (e.g. data mining, big data, etc.), one keyword per line
- Click Confirm and then Apply to save the settings
- Click on the search box on the web page and select Enter text on the Tips panel
- Make sure Enter Text is inside the Loop Item. If not, drag the Enter Text action into the Loop Item
- Click the Enter Text action and select Use text in the loop to enter the text box
- Click OK, and you will see that the default name of the action has changed to Enter loop items
We can check if the steps are set up correctly by clicking the Loop Item and then Enter Text in the workflow to see if the text would be entered into the web page.
- Click the Google Scholar search button on the web page
- Select Click button on the Tips panel, and you will notice the Click Item action is added to the workflow
- Open the settings of the Click Item action and extend the AJAX timeout
Now Octoparse will automatically enter every keyword in the list in the search box and click the search icon.
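For readers who want to see the idea behind the keyword loop outside of Octoparse, here is a minimal Python sketch. It is not how Octoparse works internally. It assumes that Google Scholar accepts the search term in the q parameter of https://scholar.google.com/scholar, and build_search_urls is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode

# Assumption: Google Scholar serves results at /scholar with the query in "q".
SCHOLAR_SEARCH_URL = "https://scholar.google.com/scholar"


def build_search_urls(keywords):
    """Return one search URL per non-empty keyword, in input order."""
    urls = []
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword:
            continue  # skip blank lines in the keyword list
        urls.append(f"{SCHOLAR_SEARCH_URL}?{urlencode({'q': keyword})}")
    return urls


if __name__ == "__main__":
    # Same idea as the Text List loop: one keyword per entry.
    for url in build_search_urls(["data mining", "big data"]):
        print(url)
```

Each keyword in the list plays the role of one line in the Text List, and each generated URL corresponds to one pass through the Loop Item.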
3. Auto-detect the search result page to scrape data
If you are on version 8 or above, Octoparse can auto-detect all sorts of web page elements and guide you through the settings on data extraction, pagination, page scroll, and so on. Use this feature to set up another loop to extract data from each result page.
- Click Auto-detect the web page data and wait for it to complete
- Turn to the Data preview section to either rename or delete the auto-captured data fields
- Check the Paginate to scrape more pages option to see if Octoparse has detected the right next page button
- Edit the pagination setup and then click Confirm
- Uncheck Add a page scroll, as the web page does not need to be scrolled to load
- Click Create workflow
Now Octoparse will go to each result page and scrape the data we want.
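To make the extraction step more concrete, here is a rough Python sketch of pulling the title, author, and description fields out of a result page. It only illustrates the idea and is not what Octoparse runs. The class names gs_ri, gs_rt, gs_a, and gs_rs are assumptions about Google Scholar's markup and may change at any time, and SAMPLE_HTML is a made-up snippet, not real search output.

```python
from html.parser import HTMLParser

# Assumed class names; Google Scholar's real markup may differ or change.
RESULT_CLASS = "gs_ri"  # container of one search result
FIELD_CLASSES = {"gs_rt": "title", "gs_a": "author", "gs_rs": "description"}
VOID_TAGS = {"br", "hr", "img", "input", "wbr"}  # tags without a closing tag


class ResultParser(HTMLParser):
    """Collect the text of each field element, one row per result."""

    def __init__(self):
        super().__init__()
        self.rows = []      # all rows found so far
        self._row = None    # row currently being filled
        self._field = None  # field currently being captured
        self._depth = 0     # open tags inside the captured field element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._field is not None:
                self._row[self._field] += " "  # treat a line break as a space
            return
        classes = (dict(attrs).get("class") or "").split()
        if RESULT_CLASS in classes:
            self._row = {name: "" for name in FIELD_CLASSES.values()}
            self.rows.append(self._row)
            self._field = None
            return
        if self._row is None:
            return  # not inside any result yet
        if self._field is not None:
            self._depth += 1  # nested tag such as <a> or <b>
            return
        for cls in classes:
            if cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]
                self._depth = 1
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None  # the field element is closed

    def handle_data(self, data):
        if self._field is not None:
            self._row[self._field] += data


def parse_results(html):
    """Return a list of dicts with title, author, and description."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each value is a single clean line.
    return [{k: " ".join(v.split()) for k, v in row.items()} for row in parser.rows]


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<div class="gs_ri">
  <h3 class="gs_rt"><a href="#">An <b>Example</b> Title</a></h3>
  <div class="gs_a">A Author, B Author - Example Journal, 2020</div>
  <div class="gs_rs">First line of the snippet<br>second line.</div>
</div>
"""

if __name__ == "__main__":
    for row in parse_results(SAMPLE_HTML):
        print(row)
```

The sketch keeps one row per result container, which is roughly what the auto-detected loop does when it treats each search result as one item with several data fields.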
4. Set up a wait time to slow down the scraping speed
This step is necessary because Google Scholar applies anti-scraping measures and may ask us to pass a reCAPTCHA test if we scrape too fast.
- Click the Extract Data action
- Tick Wait before action in the Options tab and set the wait time to 3s
- Click Apply to save the settings
Now Octoparse will wait 3 seconds every time it executes the Extract Data action.
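The effect of Wait before action can be pictured with a short Python sketch. This is only an analogy under our own assumptions, not Octoparse's implementation: run_with_wait is a hypothetical helper, and each "action" is any function that stands in for one Extract Data step.

```python
import time


def run_with_wait(actions, wait_seconds=3.0, sleep=time.sleep):
    """Call each action in order, pausing wait_seconds before each one."""
    results = []
    for action in actions:
        sleep(wait_seconds)  # mirrors "Wait before action"
        results.append(action())
    return results


if __name__ == "__main__":
    # Two placeholder actions; a real scraper would fetch and parse a page here.
    pages = run_with_wait([lambda: "page 1", lambda: "page 2"], wait_seconds=3)
    print(pages)
```

The sleep function is passed in as a parameter so that the pause can be replaced by a stub when trying the sketch out, without actually waiting 3 seconds per action.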
5. Save and run the task to get data
The last step is to save your task and run it.
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run task window to pop up
- Select Run on your device to run the task on your local device
- Wait for the task to complete
Here is the sample output from a local run.
If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.