You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

DuckDuckGo is a search engine that returns instant answers for the keywords people search. Its emphasis on privacy protection has won it a large and still-growing user base. This tutorial shows how to scrape its search results in bulk with Octoparse.

target.jpg

To follow along with this tutorial, use the URL below:

https://duckduckgo.com/?q=Covid&t=h_&ia=coronavirus
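
The URL above is a plain query string, so a URL for a different keyword could be built as in the sketch below. The parameter names q, t, and ia are copied from the tutorial URL. Beyond q carrying the search keyword, what they control is an assumption, and build_search_url is a hypothetical helper name, not part of Octoparse.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://duckduckgo.com/"

def build_search_url(keyword, t="h_", ia=None):
    """Build a search URL shaped like the one used in this tutorial.

    The default for t is copied from the tutorial URL; ia is only added
    when given. What t and ia control is an assumption, not documented here.
    """
    params = {"q": keyword, "t": t}
    if ia is not None:
        params["ia"] = ia
    # urlencode escapes spaces and special characters in the keyword
    return BASE + "?" + urlencode(params)

# Rebuild the tutorial URL
url = build_search_url("Covid", ia="coronavirus")
print(url)

# Read the parameters back to see what the URL carries
params = parse_qs(urlsplit(url).query)
print(params["q"])
```

The resulting string can be pasted into the Octoparse search box in the same way as the tutorial URL.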

Here are the main steps in this tutorial: [Download task file here]

  1. Enter the URL on the home page - to open the target page

  2. Start auto-detection - to generate a workflow

  3. Modify XPath for pagination - to locate the Load more button accurately

  4. Modify XPath for fields - to get the data precisely

  5. Modify the workflow - to extract data after all results are loaded

  6. Run the task - to get the desired data


1. Enter the URL on the home page - to open the target page

Start by entering the target URL.

  • Enter the Covid search URL into the search box at the center of the home screen

  • Click Start to create a new task in Advanced Mode

url.jpg

2. Start auto-detection - to generate a workflow

Octoparse's built-in auto-detect function generates a workflow automatically. You can then modify that workflow as needed.

  • Click on Auto-detect web page data and wait for the detection to complete

detec.jpg
  • Check the data fields in Data Preview and delete unwanted data or rename them if needed

data_preview.jpg
  • Click Create workflow

create.jpg

The generated workflow looks like this:

workflow.jpg

3. Modify XPath for pagination - to locate the Load more button accurately

To make sure the Load more button is located and clicked reliably, modify the XPath used for pagination.

  • Click on Pagination

  • Under the General settings, enter the following XPath in the Matching XPath box: //a[@class="result--more__btn btn btn--full"]

  • Click Apply

pagination.jpg
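
The predicate in this XPath compares the whole class attribute string, so it matches only a link whose class list is exactly "result--more__btn btn btn--full". The sketch below illustrates that behavior with Python's standard-library ElementTree, which supports a subset of XPath. The markup is a simplified, hypothetical fragment, not the real page, and Octoparse's own XPath engine may behave differently in other respects.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more complex.
PAGE = """
<div>
  <a class="result--more__btn btn btn--full" href="#">More results</a>
  <a class="btn btn--full" href="#">Some other button</a>
  <a class="result--more__btn" href="#">Partial class list</a>
</div>
"""

root = ET.fromstring(PAGE)

# ElementTree needs a leading "." to search the whole tree;
# the predicate itself is the one used in the step above.
matches = root.findall('.//a[@class="result--more__btn btn btn--full"]')

print(len(matches))     # only the link with the exact class string matches
print(matches[0].text)
```

Because the comparison is exact, this kind of XPath stops matching if the site changes the order or number of classes on the button, which is worth checking first if pagination stops working.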

4. Modify XPath for fields - to get the data precisely

The summary of a result can be displayed in different ways: some summaries include a post date and some don't. We therefore need to modify the field's XPath so that it always locates the correct information.

difference.jpg
  • Click on More (...)

  • Choose Customize XPath

customize_XPath.jpg
  • Input the XPath as /article/div[3]

  • Click Apply to save

Input_XPath.jpg
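
A positional path such as article/div[3] selects the third div of each result as a whole, so the text is captured whether or not a post date sits inside it. The sketch below shows the idea with Python's standard-library ElementTree. The markup, the sample text, and the assumption that the path is evaluated relative to each result item are all hypothetical; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical result markup: one summary has a post date, one does not.
RESULTS = """
<section>
  <article>
    <div>example.com</div>
    <div>First title</div>
    <div><span>Jan 1, 2022</span> Summary with a post date.</div>
  </article>
  <article>
    <div>example.org</div>
    <div>Second title</div>
    <div>Summary without a post date.</div>
  </article>
</section>
"""

root = ET.fromstring(RESULTS)

summaries = []
for article in root.findall("article"):
    # div[3] is the third <div> child of the article, whatever it contains
    third_div = article.find("div[3]")
    # Join all nested text and collapse extra whitespace
    text = " ".join("".join(third_div.itertext()).split())
    summaries.append(text)

for line in summaries:
    print(line)
```

Selecting the container rather than an inner element is what keeps the field stable when the optional date is missing.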

5. Modify the workflow - to extract data after all results loaded

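Each click on Load more adds new results to the same page, so extracting inside the pagination loop would scrape the earlier results again on every pass. To avoid duplicate data, move the extraction loop out of the pagination loop so that it runs once, after all results are loaded.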

  • Drag the Loop Item (the loop that contains Extract Data) out of the Pagination loop and drop it below Pagination

modify.jpg
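
The effect of this change can be illustrated with a small simulation. It is a simplified model under stated assumptions, not Octoparse's actual execution logic: BATCHES is hypothetical data standing for the results shown on first load and after each Load more click, and both function names are made up for illustration.

```python
# Hypothetical batches: initial results, then what each Load more click adds.
BATCHES = [["r1", "r2"], ["r3", "r4"], ["r5"]]

def extract_inside_loop(batches):
    """Extract on every pass of the pagination loop."""
    visible, rows = [], []
    for batch in batches:
        visible.extend(batch)   # new results are appended to the same page
        rows.extend(visible)    # extraction re-reads everything visible
    return rows

def extract_after_loop(batches):
    """Load everything first, then extract once."""
    visible = []
    for batch in batches:
        visible.extend(batch)
    return list(visible)

inside = extract_inside_loop(BATCHES)
after = extract_after_loop(BATCHES)

print(len(inside), len(set(inside)))  # total rows vs. unique rows
print(len(after), len(set(after)))
```

In this model, extracting inside the loop returns more rows than there are unique results, while extracting after the loop returns each result once.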

6. Run the task - to get the desired data

  • Click the Save button first to save all the settings you have made

  • Then click Run and choose to run the task either locally or in the cloud

mceclip8.png
  • Select Run on your device and click Run Now to run the task on your local device

  • Wait for the task to complete

Below is sample data from a local run. The data can be exported in Excel, CSV, HTML, or JSON format.

data_overview.jpg