In this tutorial, we are going to show you how to scrape search results from Google Scholar.
A ready-to-use Google Scholar template is also included in our latest version. You may want to check it out here: Task Templates. Just enter a keyword to get data extracted in minutes!
If you would like to build the crawler from scratch, you might want to use the URL in this tutorial:
We will use Octoparse to scrape data such as the title, author, description, and other related information from each search results page.
Here are the main steps in this tutorial: [Download task file here]
- Open the targeted web page
- Create a "Loop Item" - loop through the search keywords
- Auto-detect the search result page to scrape data
- Set up wait time to slow down the scraping speed
- Save and run the task to get the data
1. "Go To Web Page" - open the targeted web page
- Enter the example URL and click "Start"
2. Create a "Loop Item" - loop through the search keywords
We can customize our "text list" to create a loop search action. Octoparse will automatically enter every keyword in the list into the search box, one line at a time.
- Add a "Loop Item" in the workflow by clicking
- Double-click the "Loop Item" to go to the settings panel
- Go to loop mode and select “Text list”
- Click to enter the keyword list, one keyword per line. Here we'll enter "data mining" and "big data"
- Click "OK" to confirm
- Click on the search box on the web page
- Select "Enter text" on the "Tips" panel
- Click "Confirm"
- Drag the "Enter Text" action into the "Loop Item"
- Double-click the "Enter Text" action
- Select "Use text in the loop to enter the text box"
- Click "OK"
We can check whether the steps are set up correctly by clicking the "Loop Item" and then "Enter Text" in the workflow to see whether the text is entered into the web page.
- Click the search button of the web page
- Select "Click element" on the "Tips" panel; you will notice that a "Click Item" action is added to the workflow
- Open the settings of the "Click Item" action and select the "Open in a new tab" option
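For readers who think in code, the text-list loop above can be pictured roughly as follows. This is only a minimal sketch of the idea, not how Octoparse works internally: the function names, the `q` query parameter, and the `https://example.com/search` address are all placeholders chosen for illustration.

```python
from urllib.parse import urlencode


def parse_text_list(text):
    """Return the keywords from a text list, one keyword per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_search_urls(keywords, base_url):
    """Build one search URL per keyword.

    Both base_url and the 'q' parameter name are assumptions for this sketch.
    """
    return [f"{base_url}?{urlencode({'q': keyword})}" for keyword in keywords]


if __name__ == "__main__":
    # The same two keywords used in this tutorial, one per line.
    text_list = "data mining\nbig data\n"
    # Placeholder address; substitute the search page you are working with.
    for url in build_search_urls(parse_text_list(text_list), "https://example.com/search"):
        print(url)
```

Each keyword in the list produces exactly one search, which mirrors the "one line at a time" behavior of the "Loop Item".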
3. Auto-detect the search result page to scrape data
- Click "Auto-detect the web page data" and wait for it to complete
- Rename the fields or delete unwanted ones in the "Data preview"
- Edit "Paginate to scrape more pages" and uncheck "Add a page scroll"
- Check the pagination details and then click "Confirm"
- Click "Create workflow"
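Conceptually, the pagination set up in this step amounts to "extract the rows on the current page, then move to the next page until there is none". The sketch below illustrates that loop under stated assumptions: pages are modeled as plain dictionaries with hypothetical `rows` and `next` keys, and `fetch_page` is a stand-in for whatever loads a page. It is not Octoparse's implementation.

```python
def scrape_all_pages(first_page_id, fetch_page, max_pages=10):
    """Collect rows page by page, following 'next' links until none is left.

    max_pages is a safety cap so the loop always terminates.
    """
    rows = []
    page_id = first_page_id
    pages_seen = 0
    while page_id is not None and pages_seen < max_pages:
        page = fetch_page(page_id)
        rows.extend(page["rows"])    # extract the data on the current page
        page_id = page.get("next")   # None means there is no next page
        pages_seen += 1
    return rows


# Hypothetical in-memory pages standing in for real search result pages.
FAKE_PAGES = {
    1: {"rows": [{"title": "placeholder A"}], "next": 2},
    2: {"rows": [{"title": "placeholder B"}], "next": None},
}

if __name__ == "__main__":
    print(scrape_all_pages(1, FAKE_PAGES.__getitem__))
```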
4. Set up wait time to slow down the scraping speed
- Double-click the "Extract Data" action
- Tick "Wait before action"
- Set the wait time to 1s-3s
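As a rough illustration of what a 1s-3s wait means, the sketch below pauses for a random time in that range before an action. It assumes the range is treated as a uniformly random delay, which is our assumption rather than something this tutorial states; the function name and the injectable `sleep` parameter are hypothetical.

```python
import random
import time


def wait_before_action(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random time between min_seconds and max_seconds.

    The sleep function is a parameter so the delay can be skipped in tests.
    Returns the delay that was chosen.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    waited = wait_before_action()
    print(f"Waited {waited:.2f} seconds before extracting data")
```

Varying the delay instead of using a fixed pause spreads requests out less predictably, which is the point of slowing down the scraping speed.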
5. Save and run the task to get the data
- Click "Run" on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)
Here is the output sample:
Tutorial in Spanish: Scrapear resultados de búsqueda de Google Scholar
You can also read more web scraping articles on the official website