You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

Google Scholar provides a simple way to search broadly for scholarly literature. As a freely accessible web search engine, it is a good source of academic data to scrape.

In this tutorial, we will show you how to scrape search results from Google Scholar with Octoparse.

Before you build a crawler on your own, you may want to check out the pre-built Google Scholar template for an easier way to get data. Enter your keywords to get the data extracted within minutes!

If the template falls short of your needs and you want to build the crawler from scratch, continue with this tutorial. The sample URL is: https://scholar.google.com/ncr

We will search with multiple keywords and scrape each article's title, author, and description information from the search results pages.

Here are the main steps in this tutorial: [Download task file here]

  1. Create a Go to Web Page - to open the target web page

  2. Create a Loop Item - to enter multiple search keywords

  3. Auto-detect the search result page to scrape data

  4. Set up a wait time to slow down the scraping speed

  5. Save and start to run the task and get data


1. Create a Go to Web Page - to open the target web page

Every workflow in Octoparse starts by telling Octoparse which web page to open first.

  • Enter the sample URL into the search bar at the top of the home screen and click Start

Check if a Go to Web Page action has been generated in your workflow. If you have more than one URL, check this article to see how Octoparse handles a list of URLs.

Now we have reached the target web page.


2. Create a Loop Item - to enter multiple keywords

If we want to search for multiple keywords on Google Scholar, we need to create a loop search action for our keyword list.

  • Hover over the down arrow and click the Add Step button to add a Loop Item to the workflow

  • Click Loop Item to go to its settings panel

  • Set its loop mode to Text List in the General tab

  • Click the Edit icon to enter your keyword list (e.g., data mining, big data, etc.), one keyword per line

  • Click Confirm and then Apply to save the settings

  • Click the search box on the web page and select Enter Text on the Tips panel

  • Make sure Enter Text is inside the Loop Item. If not, drag the Enter Text action into the Loop Item

  • Click the Enter Text action and select Use text in the loop to enter the text box

  • Click OK, and you will see that the default name of the action has been changed to Enter loop items

To check that the steps are set up correctly, click the Loop Item and then Enter Text in the workflow, and confirm that the keyword is entered into the search box on the web page.

  • Click the Google Scholar search button on the web page

  • Select Click button on the Tips panel, and you will notice the Click Item action is added to the workflow

  • Open the settings of the Click Item and extend the AJAX timeout

Octoparse will automatically enter each keyword from the list into the search box and click the search button.
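For readers who think in code, the keyword loop is roughly comparable to the Python sketch below. It only builds one search URL per keyword and sends no requests. The helper names `parse_keyword_list` and `build_search_urls` are hypothetical, and the `/scholar?q=` URL pattern is an assumption about how Google Scholar search links are formed, not something taken from Octoparse.

```python
from urllib.parse import urlencode

# Assumed base URL for a Google Scholar search; not an Octoparse setting.
SCHOLAR_SEARCH_URL = "https://scholar.google.com/scholar"


def parse_keyword_list(text):
    """Split a text list (one keyword per line) into clean keywords."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_search_urls(keywords):
    """Return one search URL per keyword, like one loop iteration each."""
    return [f"{SCHOLAR_SEARCH_URL}?{urlencode({'q': kw})}" for kw in keywords]


# Same example keywords as in the Text List above.
keyword_text = """data mining
big data
"""

for url in build_search_urls(parse_keyword_list(keyword_text)):
    print(url)
```

Each keyword is handled independently, which is also why the Enter Text action has to sit inside the Loop Item: anything outside the loop would run only once.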


3. Auto-detect the search result page to scrape data

If you are on version 8 or above, Octoparse can auto-detect all web page elements and guide you through the settings on data extraction, pagination, page scroll, and so on. Use this feature to set up another loop to extract data from each result page.

  • Click Auto-detect the web page data and wait for it to complete

  • Go to the Data preview section to rename or delete the auto-captured data fields

  • Check the Paginate to scrape more pages option to see if Octoparse has detected the right next page button

  • Edit the pagination setup and then click Confirm

  • Uncheck Add a page scroll, as the web page doesn't need to be scrolled to load its content

  • Click Create workflow

Octoparse will go to each result page and scrape the data we want.
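Conceptually, the extraction loop turns each result block on the page into one row of fields. The sketch below illustrates that idea with Python's standard `html.parser` on a made-up HTML snippet. The markup, the class names (`result`, `title`, `author`, `description`), and the `ResultParser` class are all hypothetical; they are not Google Scholar's real page structure, and this is not how Octoparse is implemented.

```python
from html.parser import HTMLParser

# Made-up markup for illustration only; real pages are structured differently.
SAMPLE_HTML = """
<div class="result">
  <h3 class="title">Example article one</h3>
  <div class="author">Author A, Author B</div>
  <div class="description">Short description one.</div>
</div>
<div class="result">
  <h3 class="title">Example article two</h3>
  <div class="author">Author C</div>
  <div class="description">Short description two.</div>
</div>
"""

FIELDS = {"title", "author", "description"}


class ResultParser(HTMLParser):
    """Collect one dict per result block, with one entry per known field."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if css_class == "result":
            self.rows.append({})  # a new result block starts a new row
        elif css_class in FIELDS and self.rows:
            self._field = css_class
            self.rows[-1][css_class] = ""

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] += data

    def handle_endtag(self, tag):
        # The sample fields contain no nested tags, so any end tag closes them.
        if self._field:
            self.rows[-1][self._field] = self.rows[-1][self._field].strip()
        self._field = None


parser = ResultParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

The outer `result` block plays the role of the loop item, and the three inner fields correspond to the title, author, and description columns in the Data preview.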


4. Set up a wait time to slow down the scraping speed

This step is mandatory as Google Scholar applies anti-scraping measures and may ask us to pass a reCAPTCHA test if we scrape too fast.

  • Click the Extract Data action

  • Tick Wait before action in the Options tab and set the wait time to 3s

  • Click Apply to save the settings

Octoparse will wait 3 seconds every time it executes the Extract Data action.
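The effect of this setting can be pictured as a fixed pause before every extraction. Here is a minimal Python sketch of that pattern, assuming a hypothetical `extract_with_wait` helper and a stand-in extractor; it shows the throttling idea only and is not Octoparse's internal logic.

```python
import time

WAIT_SECONDS = 3  # same value as the Wait before action setting above


def extract_with_wait(pages, extract, wait_seconds=WAIT_SECONDS, sleep=time.sleep):
    """Pause before every extraction, then collect the extracted rows."""
    rows = []
    for page in pages:
        sleep(wait_seconds)  # wait first, then act
        rows.extend(extract(page))
    return rows


# Demo with a stand-in extractor and a fake sleep so it finishes instantly.
waits = []
rows = extract_with_wait(
    pages=["page 1", "page 2"],
    extract=lambda page: [f"row from {page}"],
    sleep=waits.append,  # records the requested wait instead of sleeping
)
print(rows)
print(waits)
```

Passing `sleep` as a parameter keeps the demo fast; with the default `time.sleep`, two pages would take about six seconds in total.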


5. Save and start to run the task and get data

The last step is to save your task and run it.

  • Click Save on the upper right to save your task

  • Click Run next to it and wait for the Run task window to pop up

  • Select Run on your device to run the task on your local device

  • Wait for the task to complete

Here is the sample output from a local run.

[Image: sample output from a local run]

Tip: Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.
