Scraping a search engine is a good way to collect information on a single topic. This tutorial shows how to scrape search results from Google Search.
Go to "Task Templates" on the Octoparse home screen to start with the ready-to-use Google Search Template and save time. With this template, you do not need to configure a scraping task. For further details, see: Task Templates
If you want to create your own task with our advanced mode, you can look through this tutorial as a reference. We will scrape data such as the title, URL, and description from the search results page with Octoparse.
You may need this link to follow through:
Here are the main steps in this tutorial (a demo task file is also available for download):
1. Open the targeted web page
Enter the URL on the home page and click Start
2. Auto-detect the web page to create steps to enter text
Click "Auto-detect web page data" and wait for the detection to complete
Choose "Search with keywords" on the Tips panel, and you will see instructions to help you set up steps
a. "Add a search box": click "Settings" and select the search box on the web page
b. "Add a keyword(s)": click the Edit button and input the keyword(s). One keyword per line
c. "Click the search button": tick "Click the search button when finishing entering", select the search button on the web page, and click "Confirm" to save the settings
A Loop Item containing an Enter Text action and a Click Item action will be created in the workflow.
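The logic behind the keyword loop can be sketched in Python. This is only an analogy: Octoparse types each keyword into the search box and clicks the button, whereas the sketch below builds the equivalent results-page URL directly. The function names parse_keywords and search_url are hypothetical and are not part of Octoparse.

```python
from urllib.parse import urlencode


def parse_keywords(raw: str) -> list[str]:
    """Split the text of a keyword box into keywords, one per line.

    Blank lines and surrounding whitespace are dropped.
    """
    return [line.strip() for line in raw.splitlines() if line.strip()]


def search_url(keyword: str) -> str:
    """Build the Google results-page URL for one keyword (illustrative)."""
    return "https://www.google.com/search?" + urlencode({"q": keyword})


# Equivalent of the Loop Item: one pass per keyword.
keywords = parse_keywords("web scraping\n\n  data extraction  \n")
for kw in keywords:
    print(kw, "->", search_url(kw))
```

Each pass of the loop corresponds to one Enter Text plus Click Item cycle in the workflow.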
3. Modify the settings for the "Click Item"
Click "Click Item" to enter the Options panel
Tick "Open in a new tab"
Extend the AJAX Load timeout
4. Auto-detect the search result page to scrape data
Auto-detect the page again
Uncheck the "Add a page scroll" option
Click "Create workflow"
Double-click to rename the fields or delete the fields you don't want
If auto-detect picks up several fields you don't want, it is more convenient to switch to the vertical view and delete them in a batch.
Modify the XPath for pagination
If auto-detect cannot locate the "Next" button, you can write a more precise XPath to solve the problem.
Click "Pagination", then enter //span[contains(text(),"Next")] under Matching XPath.
Check out more details about XPath here: What is XPath and how to use it in Octoparse
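To see what //span[contains(text(),"Next")] selects, here is a minimal Python sketch. It uses only the standard library, whose ElementTree module does not support the XPath contains() function, so the match is approximated by hand. The sample markup is hypothetical and much simpler than Google's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (not Google's real HTML).
SAMPLE = """
<div>
  <a href="/search?start=0"><span>Previous</span></a>
  <a href="/search?start=20"><span>Next</span></a>
</div>
"""


def find_next_spans(root):
    """Approximate //span[contains(text(),"Next")].

    Returns every <span> whose leading text contains the substring "Next".
    """
    return [el for el in root.iter("span") if "Next" in (el.text or "")]


root = ET.fromstring(SAMPLE)
matches = find_next_spans(root)
print(len(matches), matches[0].text)
```

Because contains() is a substring test, the XPath would also match a span whose text is, for example, "Next page". That tolerance is usually what you want for a pagination button.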
5. Set up a wait time to slow down the scraping speed
Google Search uses anti-scraping techniques and may show a reCAPTCHA challenge when it detects automated traffic. We need to slow down the scraping by setting a wait time.
Click the "Extract Data" action
Tick "Wait before action"
Set the wait time to 1s-3s and click "Apply" to confirm
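The idea behind the wait setting can be sketched in Python, assuming the 1s-3s option means a random pause somewhere in that range before each extraction. The function wait_before_action is a hypothetical name for illustration, not an Octoparse API.

```python
import random
import time


def wait_before_action(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Pause for a random duration between min_s and max_s seconds.

    The sleep function is injectable so the pause can be skipped in a dry run.
    Returns the chosen delay.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Dry run over three result pages: pick the delays but skip the real sleep.
for page in range(1, 4):
    waited = wait_before_action(sleep=lambda seconds: None)
    print(f"page {page}: would wait {waited:.2f}s before extracting")
```

Varying the delay rather than using a fixed interval makes the request pattern less regular, which is the usual reason for choosing a range.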
6. Save and run the task to get the data
Click Run on the upper left side
Here is the sample output.