You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
DuckDuckGo is a search engine that returns instant answers for the keywords people search. Its focus on protecting privacy has won it hundreds of millions of users, and that number is still growing. In this tutorial, we will show you how to scrape its search results in bulk with Octoparse.
To follow along with this tutorial, please use the URL below for reference:
Here are the main steps in this tutorial: [Download task file here]
1. Enter the URL on the home page - to open the target page
Start by entering the target URL so that Octoparse can open the page you want to scrape.
Enter the Covid search URL into the search box at the center of the home screen
Click Start to create a new task in Advanced Mode
2. Start auto-detection - to generate a workflow
Octoparse's built-in auto-detect function can generate a workflow automatically. You can then modify that workflow as needed.
Click on Auto-detect web page data and wait for the detection to complete
Check the data fields in Data Preview, then delete unwanted fields or rename them if needed
Click Create workflow
The workflow is then generated as shown below:
3. Modify XPath for pagination - to locate the Load more button accurately
To make sure more results load correctly, the XPath for the pagination needs to be modified.
Click on Pagination
Input the following XPath in the Matching Xpath box under the General setting: //a[@class="result--more__btn btn btn--full"]
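To see why this XPath pins down a single button, it can help to test it outside Octoparse. The sketch below is only an illustration: the markup in SAMPLE is hypothetical and much simpler than the real page, and it uses Python's standard xml.etree.ElementTree module rather than anything Octoparse runs internally. It shows that @class="..." compares against the whole attribute value, so links that carry only some of the three class names are not matched.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real results page is more complex.
SAMPLE = """
<div>
  <a class="result--more__btn btn btn--full" href="#more">More results</a>
  <a class="btn btn--full" href="#other">Other button</a>
  <a class="result--more__btn btn" href="#partial">Partial match</a>
</div>
"""

root = ET.fromstring(SAMPLE)

# ElementTree only accepts relative paths, so the leading // becomes .//
matches = root.findall('.//a[@class="result--more__btn btn btn--full"]')

# Only the link whose class attribute equals the full string is returned.
hrefs = [a.get("href") for a in matches]
print(hrefs)
```

Because the comparison is an exact string match, this XPath would stop matching if the site changed the order of the class names or added a new one, in which case it would need to be updated.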
4. Modify XPath for fields - to get the data precisely
The summary of a result can be displayed differently: some results include a post date and some do not. Thus, we need to modify the field's XPath so that it always locates the correct information.
Click on More (...)
Choose Customize XPath
Input the XPath as /article/div
Click Apply to save
5. Modify the workflow - to extract data after all results loaded
To avoid scraping duplicate data, it is safer to move the Extract Data Loop out of the pagination loop, so that extraction runs only once, after all results have loaded.
Drag the Extract Data Loop Item out and put it under pagination
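The reasoning behind this step can be sketched with a small simulation. This is not Octoparse code. It is a simplified model that assumes each click on the Load more button appends new results to the same page, so an extraction that runs after every click re-reads the results that were already loaded. The function names and the sample batches are hypothetical.

```python
def extract_inside_pagination(batches):
    """Extract after every 'load more' click (extraction inside the loop)."""
    visible, rows = [], []
    for batch in batches:
        visible.extend(batch)   # the click appends new results to the page
        rows.extend(visible)    # extraction reads everything loaded so far
    return rows


def extract_after_pagination(batches):
    """Load every batch first, then extract once (extraction after the loop)."""
    visible = []
    for batch in batches:
        visible.extend(batch)
    return list(visible)


# Three hypothetical batches of results, one per 'load more' click.
batches = [["r1", "r2"], ["r3", "r4"], ["r5"]]

inside = extract_inside_pagination(batches)
after = extract_after_pagination(batches)

print(len(inside), "rows when extracting inside the loop")
print(len(after), "rows when extracting after the loop")
```

Under this model, extracting inside the loop collects the early results again on every pass, while extracting once at the end yields each result exactly one time.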
6. Run the task - to get the desired data
Click the Save button first to save all the settings you have made
Then click Run to run your task either locally or in the cloud
Select Run on your device and click Run Now to run the task on your local device
Wait for the task to complete
Below is sample data from a local run. The data can be exported in Excel, CSV, HTML, and JSON formats.