Scraping a search engine is a good way to collect information on a single topic. This tutorial shows you how to scrape the search results from Google Search.

You can go to "Task Templates" on the Octoparse home screen and start with the ready-to-use Google Search Template to save time. With this template, you do not need to configure a scraping task yourself. For further details, see: Task Templates

1.png

If you want to create your own task with our advanced mode, you can look through this tutorial as a reference. We will scrape data such as the title, URL, and description from the search results page with Octoparse.

To follow along, use this link:

https://www.google.com/

Here are the main steps in this tutorial: [Download the demo task file here]

  1. Open the targeted web page

  2. Auto-detect the web page to create steps to enter text

  3. Modify the settings for the "Click Item"

  4. Auto-detect the search result page to scrape data

  5. Set up wait time to slow down the scraping speed

  6. Save and start running the task and get the data


1. Open the targeted web page

  • Enter the URL on the home page and click Start

2.png

2. Auto-detect the web page to create steps to enter text

  • Click "Auto-detect web page data" and wait for the detection to complete

3.png
  • Choose "Search with keywords" on the Tips panel, and you will see instructions to help you set up steps

4.png

a. "Add a search box": click "Settings" and select the search box on the web page

10.gif

b. "Add a keyword(s)": click the Edit button and enter the keyword(s), one keyword per line

20.gif

c. "Click the search button": tick "Click the search button when finishing entering", select the search button on the web page, and click "Confirm" to save the settings

14.gif

A Loop Item with an Enter Text and a Click Item action inside it will be created in the workflow:

mceclip0.png
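
As a rough mental model only, the Loop Item created above behaves like a for-loop over the keyword list, running Enter Text and then Click Item once per keyword. The sketch below is plain Python, not Octoparse code; the function names and the step labels are hypothetical and exist only for illustration.

```python
def parse_keywords(text):
    """Split the contents of a keyword box into a list, one keyword per line.

    Blank lines and surrounding whitespace are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def run_loop(keywords_text):
    """Model the Loop Item: for each keyword, 'Enter Text' then 'Click Item'.

    Returns the ordered list of steps instead of driving a real browser.
    """
    steps = []
    for keyword in parse_keywords(keywords_text):
        steps.append(("enter_text", keyword))           # type the keyword into the search box
        steps.append(("click_item", "search button"))   # then click the search button
    return steps


if __name__ == "__main__":
    # Two example keywords, entered one per line.
    for step in run_loop("octoparse\nweb scraping\n"):
        print(step)
```

Each keyword produces one search, so a list of ten keywords yields ten passes through the steps inside the loop.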

3. Modify the settings for the "Click Item"

  • Click "Click Item" to enter the Options panel

  • Tick "Open in a new tab"

  • Extend the AJAX Load timeout

6.png

4. Auto-detect the search result page to scrape data

  • Auto-detect the page again

  • Uncheck the option of "Add a page scroll"

  • Click "Create workflow"

21.gif
  • Double-click to rename the fields or delete the fields you don't want

22.gif

Tips!

If the auto-detect function scrapes several fields you don't want, it is more convenient to switch to the vertical view and delete them in bulk.

10.png
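
If you prefer to tidy fields after exporting rather than in the task itself, the same rename-and-delete step can be sketched in a few lines of Python. This is only an illustration: the field names ("Field1", "Field2", "Field3") and the row values are hypothetical placeholders, not real Octoparse output.

```python
def tidy_fields(rows, rename=None, drop=()):
    """Rename and delete fields in a list of row dictionaries.

    rename: mapping of old field name -> new field name
    drop:   collection of field names to remove entirely
    """
    rename = rename or {}
    tidy = []
    for row in rows:
        tidy.append(
            {rename.get(name, name): value
             for name, value in row.items()
             if name not in drop}
        )
    return tidy


if __name__ == "__main__":
    # Hypothetical exported row with placeholder values.
    rows = [{"Field1": "Example title",
             "Field2": "https://example.com/",
             "Field3": "unwanted"}]
    print(tidy_fields(rows,
                      rename={"Field1": "Title", "Field2": "URL"},
                      drop={"Field3"}))
```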
  • Modify the XPath for pagination

If auto-detect cannot locate the "Next" button, you can write a precise XPath to solve the problem.

  • Click "Pagination", then enter //span[contains(text(),"Next")] under Matching XPath.

google.png
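
To see what //span[contains(text(),"Next")] selects, here is a minimal Python sketch that approximates the expression using only the standard library. Python's built-in ElementTree does not support contains(), so the check is written by hand; the sample markup is a simplified, hypothetical stand-in and not Google's real HTML.

```python
import xml.etree.ElementTree as ET


def first_text_node(element):
    """Return the first direct text node of an element, or '' if it has none.

    In XPath 1.0, contains(text(), ...) tests only the first text node.
    """
    candidates = [element.text] + [child.tail for child in element]
    return next((text for text in candidates if text), "")


def find_next_spans(root):
    """Approximate //span[contains(text(),"Next")] on a parsed tree."""
    return [el for el in root.iter("span") if "Next" in first_text_node(el)]


# Hypothetical, simplified pagination markup for illustration only.
SAMPLE = """
<div>
  <a href="/page1"><span>Previous</span></a>
  <a href="/page3"><span>Next</span></a>
</div>
"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print([el.text for el in find_next_spans(root)])
```

The match is case-sensitive, so a span whose text is "NEXT" or "next" would not be selected by this XPath.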

Tips!

Check out more details about XPath here: What is XPath and how to use it in Octoparse


5. Set up wait time to slow down the scraping speed

Google Search uses anti-scraping measures and may show a reCAPTCHA challenge when requests come too quickly. To reduce the chance of triggering it, slow down the scraping by setting a wait time.

  • Click on the Extract Data action

  • Select "Options"

  • Tick "Wait before action"

  • Set the wait time to 1s-3s and click "Apply" to confirm

7.png
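
The idea behind "Wait before action" can be sketched in Python as a pause before each extraction. This assumes the 1s-3s setting means a randomly chosen delay within that range, which this tutorial does not state; the function name and parameters are hypothetical.

```python
import random
import time


def wait_before_action(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random duration between min_seconds and max_seconds.

    Returns the chosen delay so callers can log it. The sleep function is
    injectable so the pause can be skipped in a dry run.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    # Dry run: replace the real sleep with a no-op so the demo finishes instantly.
    delay = wait_before_action(sleep=lambda seconds: None)
    print(f"would wait {delay:.2f}s before extracting data")
```

Varying the delay rather than using a fixed interval makes the request pattern less regular, which is the usual reason for choosing a range.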

6. Save and start running the task and get the data

  • Click Save

  • Click Run on the upper left side

  • Select "Run on your device" to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)

8.png

Here is the sample output.

13.png