Scraping a search engine is a good way to collect information on a single topic. In this tutorial, we will show you how to scrape search result data from Google Search.
You can go to "Task Templates" on the Octoparse home screen and start with the ready-to-use Google Search Template to save time. With this template, there is no need to configure the scraping task yourself. For further details, see: Task Templates
If you want to create your own task with our advanced mode, you can look through this tutorial as a reference. We will scrape data such as the title, URL, and description from the search results page with Octoparse.
You may need this link to follow through:
Here are the main steps in this tutorial: [Download the demo task file here]
1. Open the targeted web page
- Enter the URL on the home page and click Start
2. Auto-detect the web page to create steps to enter text
- Click "Auto-detect web page data" and wait for the detection to complete
- Choose "Search with keywords" on the Tips panel and you will see instructions to help you set up steps
a. "Add a search box": click "Settings" and select the search box on the web page
b. "Add a keyword(s)": click and input the keyword(s). One keyword per line
c. "Click the search button": tick "Click the search button when finishing entering", select the search button on the web page, and click "Confirm" to save the settings
A Loop Item containing an Enter Text action and a Click Item action will be created in the workflow.
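Conceptually, the loop created in this step runs one search per keyword line. The Python sketch below is only an illustration of that idea, not how Octoparse works internally: the helper names `parse_keywords` and `search_url` and the sample keywords are hypothetical. It also builds a Google search URL with the `q` parameter, whereas Octoparse types the keyword into the search box.

```python
from urllib.parse import urlencode


def parse_keywords(raw):
    """Split keyword text into one keyword per line, dropping blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def search_url(keyword):
    """Build a Google search URL; the query goes in the "q" parameter."""
    return "https://www.google.com/search?" + urlencode({"q": keyword})


# Hypothetical keyword list, entered one keyword per line.
raw_keywords = """web scraping
octoparse tutorial

xpath examples
"""

for keyword in parse_keywords(raw_keywords):
    # Stand-in for the "Enter Text" + "Click Item" pair: one search per keyword.
    print(keyword, "->", search_url(keyword))
```

Blank lines are skipped, so a stray empty line in the keyword list does not trigger an empty search.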
3. Modify the settings for the "Click Item"
- Click "Click Item" to enter the Options panel
- Tick "Open in a new tab"
- Extend the AJAX Load timeout
4. Auto-detect the search result page to scrape data
- Auto-detect the page again
- Uncheck the "Add a page scroll" option
- Click "Create workflow"
- Double-click to rename the fields or delete the fields you don't want
If auto-detect scrapes several fields you don't want, it is more convenient to switch to the vertical view and delete them in bulk.
- Modify the XPath for pagination
If auto-detect cannot locate the "Next" button, you can write a more precise XPath to solve the problem.
Click "Pagination", then enter //span[contains(text(),"Next")] under "Matching XPath".
Check out more details about XPath here: What is XPath and how to use it in Octoparse
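To see what //span[contains(text(),"Next")] selects, here is a minimal Python sketch run against made-up markup (the sample HTML is an assumption and is not Google's real page structure). Python's built-in ElementTree does not support the XPath contains() function, so the hypothetical helper `spans_containing` emulates it by checking the leading text of every span element.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, for illustration only.
SAMPLE = """
<div>
  <a href="/search?start=0"><span>Previous</span></a>
  <a href="/search?start=10"><span>Next</span></a>
  <a href="/about"><span>About</span></a>
</div>
"""


def spans_containing(root, needle):
    """Emulate //span[contains(text(), needle)]: return every <span>,
    at any depth, whose leading text contains needle."""
    return [el for el in root.iter("span") if needle in (el.text or "")]


root = ET.fromstring(SAMPLE)
matches = spans_containing(root, "Next")
print([el.text for el in matches])  # prints ['Next']
```

The match is case-sensitive, so searching for "next" or "NEXT" would return nothing for this sample. The text in the XPath has to match the button label exactly as it appears on the page.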
5. Set up wait time to slow down the scraping speed
- Click on the "Extract Data" action
- Select "Options"
- Tick "Wait before action"
- Select the wait time as 1s-3s and click "Apply" to confirm
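The idea behind this setting can be sketched in a few lines of Python, assuming the 1s-3s option means a random pause somewhere in that range before each action. The function name `wait_before_action` and its parameters are hypothetical and are not part of Octoparse.

```python
import random
import time


def wait_before_action(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Pause for a random duration between min_s and max_s seconds.

    Returns the chosen delay so callers can log it.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Preview a few delays without actually waiting by passing a no-op sleep.
delays = [wait_before_action(sleep=lambda seconds: None) for _ in range(5)]
print([round(d, 2) for d in delays])
```

Randomizing the pause, rather than waiting a fixed interval, makes the request pattern less regular, which is one common reason for slowing a scraper down in this way.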
6. Save and start running the task and get the data
- Click on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)
Here is the sample output.
Contact us any time if you need our help!