DuckDuckGo is a search engine that provides instant answers based on users' search keywords. Its focus on privacy protection has won it hundreds of millions of users, and that number is still growing. In this case, we will show you how to scrape search results from the website in bulk with Octoparse.
To follow along with the tutorial, please use the URL below for reference:
Here are the main steps in this tutorial:
- Enter the URL on the home page - to open the target page
- Start auto-detection - to generate a workflow
- Modify XPath for pagination - to locate the Load more button accurately
- Modify XPath for fields - to get the data precisely
- Modify the workflow - to extract data after all results are loaded
- Run the task - to get the desired data
1. Enter the URL on the home page - to open the target page
Start the scrape by entering the target URL.
- Enter the Covid search URL into the search box at the center of the home screen
- Click Start to create a new task in Advanced Mode
2. Start auto-detection - to generate a workflow
Octoparse's built-in auto-detect function generates a workflow quickly. You can then modify that workflow as needed.
- Click on Auto-detect web page data and wait for the detection to complete
- Check the data fields in Data Preview and delete unwanted data or rename them if needed
- Click Create workflow
The workflow is then generated as shown below:
3. Modify XPath for pagination - to locate the Load more button accurately
To make sure the Load more button is clicked reliably, modify the XPath used by the pagination step.
- Click on Pagination
- Under the General settings, enter this XPath in the Matching XPath box: //a[@class="result--more__btn btn btn--full"]
- Click Apply
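To see why the full class string matters, here is a minimal sketch that runs the same predicate against a small piece of markup using Python's standard library. The markup is a simplified, hypothetical stand-in for the results page (the live page may differ), and ElementTree needs a leading "." on the path; Octoparse itself takes the XPath exactly as written in the step above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE_PAGE = """
<div id="links">
  <a class="result__a" href="https://example.com/1">Result one</a>
  <a class="btn" href="#">Some other button</a>
  <a class="result--more__btn btn btn--full" href="#">More results</a>
</div>
"""

root = ET.fromstring(SAMPLE_PAGE)

# Same predicate as in the tutorial. [@class="..."] compares the whole
# attribute value, so only the link carrying all three classes, in this
# exact order, is selected.
LOAD_MORE_XPATH = './/a[@class="result--more__btn btn btn--full"]'

matches = root.findall(LOAD_MORE_XPATH)
print(len(matches))      # 1
print(matches[0].text)   # More results
```

Because the comparison is on the entire attribute value, the other link with class "btn" is ignored, which is what keeps the pagination step from clicking the wrong element.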
4. Modify XPath for fields - to get the data precisely
The summary of a result can be displayed differently: some results include a post date and some do not. Thus, we need to modify the field's XPath so that it always locates the correct information.
- Click on More (...)
- Choose Customize XPath
- Input the XPath as /article/div
- Click Apply to save
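The idea behind this step can be sketched in Python: pointing the field at the whole summary container means the same path works whether or not a post date is present. The markup below is hypothetical, and the sketch assumes /article/div is evaluated relative to each loop item; in ElementTree that relative path is written article/div.

```python
import xml.etree.ElementTree as ET

# Hypothetical loop items: one summary with a post date, one without.
SAMPLE_RESULTS = """
<ol>
  <li><article><h2>Title A</h2><div><span>Jan 5, 2022</span> Summary A.</div></article></li>
  <li><article><h2>Title B</h2><div>Summary B.</div></article></li>
</ol>
"""

def extract_summary(item):
    """Return all text inside the item's article/div, with whitespace collapsed."""
    block = item.find("article/div")  # relative to the current loop item
    if block is None:
        return ""
    # itertext() walks nested elements too, so an optional date span is included.
    return " ".join("".join(block.itertext()).split())

items = ET.fromstring(SAMPLE_RESULTS).findall("li")
summaries = [extract_summary(item) for item in items]
print(summaries)  # ['Jan 5, 2022 Summary A.', 'Summary B.']
```

A path that targeted the date span or a fixed text position instead would return nothing for the second item, which is the inconsistency this step avoids.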
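5. Modify the workflow - to extract data after all results are loaded
Each click on Load more adds results to the same page, so extracting inside the pagination loop reads the earlier results again. To avoid scraping duplicate data, it is safer to move the Extract Data Loop out of pagination.
- Drag the Extract Data Loop Item out of the pagination loop and put it under pagination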
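The effect of this change can be illustrated with a small simulation. This is only a sketch under the assumption that each Load more click appends a batch to the results already visible; the batches and function names are hypothetical and are not part of Octoparse.

```python
# Hypothetical stand-in for the results page: each "Load more" click
# appends one batch to what is already displayed.
BATCHES = [["r1", "r2"], ["r3", "r4"], ["r5"]]

def extract_inside_pagination(batches):
    """Extract on every pagination pass: earlier results are read again."""
    visible, rows = [], []
    for batch in batches:
        visible.extend(batch)  # page state after this load
        rows.extend(visible)   # extraction sees everything visible so far
    return rows

def extract_after_pagination(batches):
    """Load everything first, then extract once."""
    visible = []
    for batch in batches:
        visible.extend(batch)
    return list(visible)

inside = extract_inside_pagination(BATCHES)
after = extract_after_pagination(BATCHES)
print(len(inside), len(after))  # 11 5
```

With three loads of 2, 2, and 1 results, extracting inside the loop yields 2 + 4 + 5 = 11 rows for only 5 distinct results, while extracting once at the end yields exactly 5.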
6. Run the task - to get the desired data
- Click the Save button first to save all the settings you have made
- Then click Run to run your task either locally or in the cloud
- Select Run on your device and click Run Now to run the task on your local device
- Wait for the task to complete
Below is sample data from a local run. The data can be exported in Excel, CSV, HTML, and JSON formats.
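If you export to CSV, a quick check for leftover duplicates might look like the sketch below. The column names (Title, URL, Summary) and the sample rows are assumptions made for illustration; your export will use whatever field names you kept or renamed in Data Preview.

```python
import csv
import io

# Hypothetical export content. For a real file, read it with
# open("your_export.csv", newline="", encoding="utf-8") instead.
EXPORTED_CSV = """Title,URL,Summary
Result one,https://example.com/1,First summary
Result two,https://example.com/2,Second summary
Result one,https://example.com/1,First summary
"""

def unique_rows(csv_text, key="URL"):
    """Keep the first row seen for each value of the key column."""
    seen, rows = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[key] not in seen:
            seen.add(row[key])
            rows.append(row)
    return rows

rows = unique_rows(EXPORTED_CSV)
print(len(rows))  # 2
```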