In this tutorial, we will show you how to scrape search results from Google Scholar. A ready-to-use Google Scholar template is also included in our latest version; you can find it here: Task Templates.
If you would like to build the crawler from scratch, you might want to use the URL in this tutorial:
We will use Octoparse to scrape the title, author, description, and other related information from each search results page.
Here are the main steps in this tutorial: [Download task file here]
- "Go To Web Page" - to open the targeted web page
- Create a "Loop Item" - to loop through the search keywords
- Create a pagination loop - to scrape data from multiple listing pages
- Create a "Loop Item" - to loop through and extract each item
- Extract data - to select data you need to scrape
- Run extraction - to run your task and get data
1) "Go To Web Page" - to open the targeted web page
- Click "+Task" to start a new task with Advanced Mode
- Paste the following URL into the "Website" box
- Click "Save URL" to move on
2) Create a "Loop Item" - to loop through the search keywords
We can customize our "text list" to create a looped search action. Octoparse will automatically enter every keyword in the list into the search box, one line at a time.
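To make the idea of a text list concrete, here is a minimal Python sketch of the same pattern outside Octoparse: one keyword per line, each keyword submitted in turn. The endpoint `https://example.com/search` and the `q` parameter are placeholders we assume for illustration; they are not the URL used in this tutorial, and this is not how Octoparse works internally.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint (an assumption for illustration only).
SEARCH_URL = "https://example.com/search"


def parse_text_list(raw: str) -> list[str]:
    """Split a multi-line text list into keywords, one per line, skipping blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def build_search_urls(keywords: list[str]) -> list[str]:
    """Build one search URL per keyword, preserving the order of the list."""
    # The "q" parameter name is an assumption, not taken from the tutorial.
    return [f"{SEARCH_URL}?{urlencode({'q': kw})}" for kw in keywords]


# The same two keywords the tutorial enters into the text list.
text_list = """data mining
big data"""

keywords = parse_text_list(text_list)
for url in build_search_urls(keywords):
    print(url)
```

Each line of the text list becomes one pass through the loop, which mirrors how the "Loop Item" feeds one keyword at a time into the search box.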
- Drag a "Loop Item" action into the workflow designer
- Go to loop mode and select “Text list”
- Click "a" to enter the keywords list with one keyword per line. Here we'll enter "data mining" and "big data"
- Click "OK" twice when you finish entering the keywords. You can then see your keywords in the "Loop Item"
- Click on the search box on the page in the built-in browser and select “Enter text" on "Action Tips"
When you click on the input field in the built-in browser, Octoparse detects that you have selected a search box, and the "Enter text" action automatically appears on "Action Tips".
- Input the first keyword "data mining" on "Action Tips"
- Click "OK", then the "Enter Text" action will be generated in the workflow
- Drag the "Enter Text" action into the "Loop Item". Click on the "Enter Text" action
- Go to "Loop Text", select "Use the text in loop item to fill in the text box", and click "OK" to save
- Set up "wait before execution"
- Click the search button on the web page and select "Click button" on "Action Tips". You will notice that the "Click Item" action is added to the workflow
- Check "open the link in new tab" and click "Save"
3) Create a pagination loop - to scrape data from multiple listing pages
- Scroll down to the bottom of the page
- Click the "Next" button
- Click "Loop click next page" on "Action Tips"
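As a rough illustration of what a pagination loop does, the sketch below keeps following a "next" link until there is none left. The `PAGES` dictionary is a hypothetical in-memory stand-in for the listing pages, and the `max_pages` cap is our own addition as a safety limit; neither comes from Octoparse.

```python
from typing import Iterator, Optional

# Hypothetical in-memory "site": each page holds some items and an optional next-page key.
PAGES = {
    "page-1": {"items": ["item A", "item B"], "next": "page-2"},
    "page-2": {"items": ["item C"], "next": "page-3"},
    "page-3": {"items": ["item D"], "next": None},  # last page: no "Next" button
}


def iter_pages(start: str, max_pages: int = 100) -> Iterator[dict]:
    """Yield pages by following "next" until it is missing or max_pages is reached."""
    current: Optional[str] = start
    visited = 0
    while current is not None and visited < max_pages:
        page = PAGES[current]
        yield page
        visited += 1
        current = page["next"]  # the equivalent of clicking "Next"


# Collect the items from every listing page, in page order.
all_items = [item for page in iter_pages("page-1") for item in page["items"]]
print(all_items)
```

The loop ends on its own when the last page has no "Next" link, which is the same stopping condition a pagination loop relies on.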
4) Create a "Loop Item" - to loop through and extract each item
We are now on the second page. When creating a "Loop Item", we should always start with the first item on the first page, so we need to go back to the first page.
- Click "Go To Web Page" in the workflow.
- Click "Loop Item"
- Click “Enter Text"
- Click "Click Item"
- Select the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
- Click the first item in the built-in browser
We need to make sure the whole block of the first item is covered in blue when you hover your mouse over it. Only then will the whole item block be highlighted in green after clicking, covering all of its information, such as title, author, and date.
- Click the second item
We also need to select the whole block of the second item. Octoparse will automatically recognize the other items and highlight them in green.
- Click "Select All" on "Action Tips"
- Click "Extract text of the selected element" on the "Action Tips" panel
Normally we can just click "Select all sub-elements" on the "Action Tips" panel, but under certain circumstances (as in this case), Octoparse fails to do that. Thus, we'll create a loop first, and then manually select the data to extract from each block in the next step.
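The "loop over each block first, then pick the fields inside it" approach can be sketched in Python as follows. The markup, the class names (`result`, `title`, `author`, `description`), and the placeholder text are all hypothetical and simplified for illustration; real search result pages use different markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page structure).
SAMPLE = """
<div>
  <div class="result">
    <h3 class="title">Placeholder title one</h3>
    <div class="author">Placeholder author one</div>
    <div class="description">Placeholder description one</div>
  </div>
  <div class="result">
    <h3 class="title">Placeholder title two</h3>
    <div class="author">Placeholder author two</div>
  </div>
</div>
"""

# Fields to pull from every item block, matched by (hypothetical) class name.
FIELDS = ("title", "author", "description")


def extract_rows(markup: str) -> list[dict]:
    """Loop over each item block, then extract the chosen fields within that block."""
    root = ET.fromstring(markup.strip())
    rows = []
    for block in root.findall(".//div[@class='result']"):
        row = {}
        for field in FIELDS:
            node = block.find(f".//*[@class='{field}']")
            # A field missing from a block becomes an empty string instead of an error.
            row[field] = "".join(node.itertext()).strip() if node is not None else ""
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Searching for fields inside each block, rather than across the whole page, keeps every title paired with its own author and description even when a block is missing a field.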
5) Extract data - to select data you need to scrape
- Click on the Data field
- Click "Delete Data Field"
- Click "Yes”
- Click the data you need in the item block, which is highlighted in red.
- Click "Extract text of the selected element" and rename the "Field name" column if necessary.
Rename the fields by selecting a name from the pre-defined list or typing your own.
- Click "OK" to save the result.
Google applies strict anti-bot measures. If it detects too many requests from a single IP address within a short period, a CAPTCHA will appear and stop the whole crawl. To reduce this risk, we can set "wait before execution" to "Random" seconds for some steps to simulate human browsing behavior. This makes it less likely that Google identifies the crawler as a robot, so the scraping can run smoothly.
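For readers curious about what a random wait looks like in code, here is a minimal sketch. The function name `random_wait`, the step names, and the very short delay bounds are our own assumptions chosen so the example runs quickly; in practice the delays would be several seconds, and this is not Octoparse's implementation.

```python
import random
import time


def random_wait(min_seconds: float = 1.0, max_seconds: float = 5.0, sleep=time.sleep) -> float:
    """Pause for a random duration between min_seconds and max_seconds, and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Hypothetical workflow steps; each one waits a random amount of time before it runs.
steps = ["enter text", "click search", "click next page"]
for step in steps:
    waited = random_wait(0.01, 0.05)  # tiny bounds for the demo only
    print(f"waited {waited:.2f}s before: {step}")
```

Varying the delay between steps avoids the perfectly regular timing that makes automated traffic easy to spot, though it does not guarantee that a CAPTCHA will never appear.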
6) Run extraction - to run your task and get data
- Click "start extraction"
- Select "local extraction" to run the task on your computer
Here is the output sample: