In Octoparse, duplicates are data lines that are identical in every field. If there are only a few of them, you can remove them when exporting the data.
But if a run returns only a small number of valid data lines among many duplicates, that can be really frustrating. This FAQ tutorial covers the common causes of duplicates and how to fix each one.
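For data that has already been exported, the definition above (a line is a duplicate only when every field matches) can be applied outside Octoparse as well. The following is a minimal sketch, assuming a CSV export; the sample data and the helper name drop_duplicate_rows are made up for illustration and are not part of Octoparse.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row.

    A row counts as a duplicate only if ALL of its fields match an earlier row.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical export: the second line repeats the first in every field,
# while the third differs in one field and is therefore kept.
EXPORT = """title,price
Red mug,4.00
Red mug,4.00
Red mug,5.50
Blue mug,4.00
"""

rows = list(csv.reader(io.StringIO(EXPORT)))
header, data = rows[0], rows[1:]
unique = drop_duplicate_rows(data)
```

This only hides the symptom. If most of the lines are duplicates, the task itself needs one of the fixes below.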
Error 1: When scraping multiple pages, Octoparse jumps back to previously scraped pages or keeps scraping the last page.
The auto-generated pagination XPath does not always work well. If Octoparse scrapes the same pages over and over again, you will need to adjust it.
Solution: Modify the XPath of the pagination to make sure it locates the next page button precisely.
- Click to open the Pagination settings
- Enter the new XPath and click OK to save
To learn how to write an XPath, see "What is XPath and how to use it in Octoparse".
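To illustrate why an imprecise pagination XPath leads to this error, here is a small sketch using Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The pager markup and both XPath expressions are assumptions chosen for the example; real sites and the XPath that Octoparse generates will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup. On the last page the "Next" link is gone.
PAGE_1 = (
    "<div class='pager'>"
    "<a href='/p1'>1</a><a href='/p2'>2</a><a href='/p3'>3</a>"
    "<a href='/p2'>Next</a>"
    "</div>"
)
LAST_PAGE = (
    "<div class='pager'>"
    "<a href='/p1'>1</a><a href='/p2'>2</a><a href='/p3'>3</a>"
    "</div>"
)

LOOSE = "./a[last()]"      # "whatever link comes last in the pager"
PRECISE = "./a[.='Next']"  # "the link whose text is exactly Next"


def find_link(markup, xpath):
    """Return the href of the first element matched by xpath, or None."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    return None if node is None else node.get("href")


loose_on_last = find_link(LAST_PAGE, LOOSE)      # still matches a link
precise_on_last = find_link(LAST_PAGE, PRECISE)  # matches nothing
```

On the first page both expressions find the Next link. On the last page the loose one still matches the link to page 3, so a scraper following it would load the last page again, while the precise one matches nothing and pagination can stop.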
Error 2: When scraping multiple pages, the AJAX timeout for pagination is too short for the next page to load, so Octoparse keeps scraping the current page.
For pages that are loaded with AJAX, if the timeout is not long enough, the next page cannot finish loading. Octoparse then scrapes the current page again, which produces duplicates.
Solution: Extend the AJAX timeout to make it long enough for the page to load.
- Click to open the settings of "Click to Paginate"
- Select a longer AJAX timeout
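The effect of a timeout that is too short can be pictured with a toy model. This is a deliberately simplified sketch, not how Octoparse works internally: the function name, the fixed load time, and the rule "content changes only if the timeout is at least as long as the load time" are all assumptions made for illustration.

```python
def scrape_with_timeout(pages, load_seconds, ajax_timeout):
    """Toy model of "Click to Paginate" on an AJAX page.

    pages         list of pages, each a list of data lines
    load_seconds  assumed time the site needs to render the next page
    ajax_timeout  how long the scraper waits after clicking "next"
    """
    shown = 0                 # index of the page currently rendered
    rows = list(pages[0])     # scrape the first page
    for _ in range(len(pages) - 1):
        # The new content only appears if we waited long enough.
        if ajax_timeout >= load_seconds:
            shown += 1
        # Scrape whatever is on screen when the timeout ends.
        rows.extend(pages[shown])
    return rows


PAGES = [["a", "b"], ["c", "d"], ["e", "f"]]
long_enough = scrape_with_timeout(PAGES, load_seconds=3, ajax_timeout=5)
too_short = scrape_with_timeout(PAGES, load_seconds=3, ajax_timeout=1)
```

With the longer timeout every page is scraped once. With the shorter one the first page is scraped three times, which is the duplicate pattern described in this error.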
Error 3: When scraping a list of items, Octoparse only scrapes the first row of data repeatedly or one data field is the same in all lines.
When looping through a list of items, Octoparse may keep scraping the same item. Alternatively, most fields are gathered correctly from each item, but one or two fields have the same value in every line.
This is because the Extract Data action is not associated with the Loop Item action. To associate Extract Data with Loop Item, two options need to be selected.
1. "Extract data in the loop" in the "Extract Data" settings
2. "Relative XPath" in the settings of the data field
With the two options selected, the Extract Data and Loop Item are associated and Octoparse will scrape the data from each item in the loop.
*Make sure "Extract data in the loop" is selected before making any modifications.
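The difference between a field XPath evaluated against the whole page and one evaluated relative to the current loop item can be sketched as follows. The example uses Python's standard xml.etree.ElementTree module (a subset of XPath); the listing markup, the field names, and the function are hypothetical and only mimic the behaviour described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical product listing with three loop items.
LISTING = """<ul>
  <li><span class='title'>Red mug</span><span class='price'>4.00</span></li>
  <li><span class='title'>Blue mug</span><span class='price'>4.50</span></li>
  <li><span class='title'>Green mug</span><span class='price'>5.00</span></li>
</ul>"""

root = ET.fromstring(LISTING)
items = root.findall("./li")  # the elements the loop walks through


def extract(title_relative):
    """Return (title, price) per item; price is always read relative to the item."""
    rows = []
    for item in items:
        if title_relative:
            # Evaluated from the current loop item.
            title = item.find("./span[@class='title']")
        else:
            # Evaluated from the page root: always the first match on the page.
            title = root.find(".//span[@class='title']")
        price = item.find("./span[@class='price']")
        rows.append((title.text, price.text))
    return rows


fixed_title = extract(title_relative=False)
correct = extract(title_relative=True)
```

With the non-relative title path, every line carries the title "Red mug" while the prices still vary, which matches the "one data field is the same in all lines" symptom. Once the title path is relative to the item, each line gets its own title.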
Solution 1: Re-create the fields
- After checking "Extract data in the loop", click "Loop Item" in the workflow, and then click "Extract Data"
- Select data to scrape from the first element
Solution 2: Modify the XPath of the fields directly
- Open the Extract Data settings
- Tick "Relative XPath" and enter the correct XPath
Check how to write a Relative XPath here.