
Why do I get so many duplicates?

Updated over a year ago

You are browsing a tutorial guide for the latest version of Octoparse. If you are running an older version, we strongly recommend that you upgrade, because the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

Duplicates in Octoparse refer to data lines that are identical across all fields. If there are only a few duplicates, you can remove them when exporting the data.
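If you prefer to clean an exported file yourself, the same rule (a line is a duplicate only when every field matches) can be sketched in a few lines of Python. This is a minimal illustration, not how Octoparse removes duplicates internally; the sample data and the `drop_duplicate_rows` helper are hypothetical.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # a row is a duplicate only if every field matches
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical exported data: a header plus four data lines, one of them repeated.
sample = "title,price\nLamp,19.99\nDesk,120.00\nLamp,19.99\nLamp,24.99\n"
header, *data = list(csv.reader(io.StringIO(sample)))
unique_rows = drop_duplicate_rows(data)
print(unique_rows)
```

Note that the last "Lamp" line is kept because its price differs, so it is not identical across all fields.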

However, encountering many duplicates and only a small number of valid data lines can be frustrating. In this tutorial, we will guide you on how to resolve duplicate issues.


Error 1: When scraping multiple pages, Octoparse jumps back to previously scraped pages or keeps scraping the last page.

The auto-generated pagination XPath does not always work well. If you find that Octoparse scrapes the same pages repeatedly, you will need to adjust it.

Solution: Modify the XPath of the pagination to make sure it locates the next page button precisely.

  • Click on Pagination

  • Enter the new XPath and click Apply to save

31.png

Tip: Check how to write an XPath here at What is XPath and how to use it in Octoparse.
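To illustrate what "locating the next page button precisely" means, here is a small sketch outside of Octoparse using Python's built-in ElementTree, which supports only a subset of XPath. The pagination markup and class names are hypothetical; inspect your own page to find an attribute or text that is unique to the next page button.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination bar; real pages will use different markup.
PAGINATION = """
<div class="pager">
  <a class="page" href="/list?page=1">1</a>
  <a class="page current" href="/list?page=2">2</a>
  <a class="page" href="/list?page=3">3</a>
  <a class="next" href="/list?page=3">Next</a>
</div>
"""
root = ET.fromstring(PAGINATION)

# Too broad: matches every link in the pager, including earlier page numbers,
# which is the kind of XPath that can send a scraper back to scraped pages.
broad = root.findall(".//a")

# Precise: matches only the link that is marked as the next page button.
precise = root.findall(".//a[@class='next']")

print(len(broad), len(precise), precise[0].get("href"))
```

An XPath that matches exactly one element, the next page button, is what you want to enter in the pagination settings.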


Error 2: When scraping multiple pages, the AJAX timeout for pagination is too short to load the next page, and Octoparse keeps scraping the current page data.

For pages loaded with AJAX, if the timeout is not set long enough, the next page may fail to load, causing Octoparse to scrape the current page again, which can result in duplicates.

Solution: Extend the AJAX timeout to make it long enough for the page to load.

  • Click on Click to Paginate

  • Select a longer AJAX timeout

22.png
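The effect of a timeout that is too short can be shown with a toy model. This is only a simplified sketch of the behavior described above, not Octoparse's actual implementation; the `scrape_pages` function, the page contents, and the timings are all hypothetical.

```python
def scrape_pages(pages, load_seconds, ajax_timeout):
    """Toy model: after each click, wait up to ajax_timeout for the next page.

    If the page needs longer than the timeout, the content on screen is
    still the previous page, so that page's rows are extracted again.
    """
    on_screen = 0  # index of the page currently rendered
    collected = list(pages[0])
    for _ in range(len(pages) - 1):
        if load_seconds <= ajax_timeout:
            on_screen += 1  # the next page finished loading in time
        collected.extend(pages[on_screen])
    return collected


pages = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
too_short = scrape_pages(pages, load_seconds=3, ajax_timeout=1)
long_enough = scrape_pages(pages, load_seconds=3, ajax_timeout=5)
print(too_short)
print(long_enough)
```

With the short timeout the first page's rows are collected three times; with the longer timeout each page is collected once.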

Error 3: When scraping a list of items, Octoparse only scrapes the first row of data repeatedly, or one data field is the same in all lines.

When looping through a list of items to get data, Octoparse may keep scraping from the same item. Alternatively, most fields may be gathered correctly from each item while one or two fields stay the same in every line.

This happens because the Extract Data action is not associated with the Loop Item action. Two options must be selected to associate Extract Data with Loop Item:

  • Extract data in the loop in the Extract Data settings

23.png

  • Relative XPath in the setting of the data field

33.png

With both options selected, Extract Data is associated with Loop Item, and Octoparse will scrape the data from each item in the loop.
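The difference between a field that is tied to the loop item and one that is not can be sketched outside of Octoparse with Python's built-in ElementTree. The listing markup below is hypothetical, and the code only illustrates the idea of relative XPath, not Octoparse's internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page with three items.
LISTING = """
<ul>
  <li><h2>Lamp</h2><span class="price">19.99</span></li>
  <li><h2>Desk</h2><span class="price">120.00</span></li>
  <li><h2>Chair</h2><span class="price">45.50</span></li>
</ul>
"""
root = ET.fromstring(LISTING)
items = root.findall("./li")  # the elements the loop iterates over

# Not tied to the loop: the path is resolved from the page root on every
# iteration, so it always lands on the first item.
fixed = [root.find("./li/h2").text for _ in items]

# Relative to the loop item: the path is resolved from the current item,
# so each iteration yields that item's own title.
relative = [item.find("./h2").text for item in items]

print(fixed)
print(relative)
```

The first list repeats the same title for every line, which is the duplicate pattern described in this error; the second list contains one title per item.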

Solution 1: Re-create the fields

  • After checking the Extract data in the loop option, click Loop Item in the workflow, and then click Extract Data

  • The first item will be highlighted, and we can choose elements from the highlighted area to extract the text

re-create_fields.gif

Solution 2: Modify the XPath of the fields directly

  • Click on Extract Data

  • Click on More and select Customize XPath

  • Tick Relative XPath and enter the correct XPath

Method 1: Click Extract Data, then Customize XPath

11.png

Method 2: Click Extract Data, switch to the vertical view, and double-click each field to customize its XPath. This is more convenient if you need to modify several XPaths.

114.png

Error 4: The task keeps scraping the last page
