Why does my task get so many duplicates?
Answer:
There are three main causes of duplicate data in a task.
1) The pagination XPath does not always locate the next page button.
When this happens, Octoparse may jump back to earlier pages and scrape them again, or it may keep scraping the last page and never stop.
Solution: Modify the pagination XPath so that it matches the next page button and nothing else.
For how to modify the XPath, see the related tutorials below:
Why does Octoparse skip some pages?
Why does Octoparse keep scraping the last page and never stop?
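To illustrate the difference between a loose and a precise pagination XPath, here is a minimal Python sketch. The pagination markup, the "next" class name and the helper next_href are hypothetical and made up for this example. Python's xml.etree.ElementTree supports only a subset of XPath, so the expressions are written relative to the list element; in Octoparse the equivalent precise expression would likely look like //a[@class='next'], depending on the actual page.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination bar on the first page: a "Next" button exists.
PAGE_1 = """<ul class="pagination">
  <li><a href="?p=1">1</a></li>
  <li><a href="?p=2">2</a></li>
  <li><a href="?p=3">3</a></li>
  <li><a class="next" href="?p=2">Next</a></li>
</ul>"""

# Hypothetical pagination bar on the last page: the "Next" button is gone.
LAST_PAGE = """<ul class="pagination">
  <li><a class="prev" href="?p=2">Prev</a></li>
  <li><a href="?p=1">1</a></li>
  <li><a href="?p=2">2</a></li>
  <li><a href="?p=3">3</a></li>
</ul>"""

# Loose: "the link in the last list item", whatever that link happens to be.
LOOSE = "./li[last()]/a"
# Precise: only a link that is explicitly marked as the next page button.
PRECISE = "./li/a[@class='next']"


def next_href(html, xpath):
    """Return the href of the element the XPath selects, or None if nothing matches."""
    root = ET.fromstring(html)
    link = root.find(xpath)
    return None if link is None else link.get("href")


for name, xpath in (("loose", LOOSE), ("precise", PRECISE)):
    print(name, next_href(PAGE_1, xpath), next_href(LAST_PAGE, xpath))
```

On the last page the loose expression still matches a link (the one for page 3), so a scraper following it would reload the last page again and again. The precise expression matches nothing there, which gives the pagination loop a clear point to stop.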
2) The AJAX timeout for the Click to Paginate action is too short.
On pages that load content with AJAX, if the timeout expires before the next page has finished loading, Octoparse may scrape the current page again.
Solution: Extend the AJAX timeout so that it is long enough for the next page to load.
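The effect of a short timeout can be shown with a toy model in Python. This is only a sketch under simplifying assumptions and does not reflect how Octoparse works internally: the function scrape, its parameters and the sample rows are all hypothetical, and every page is assumed to take the same time to load.

```python
def scrape(pages, load_seconds, ajax_timeout):
    """Simulate scraping paginated AJAX content.

    pages        -- list of pages, each a list of rows
    load_seconds -- time the next page needs to appear after the click
    ajax_timeout -- how long the scraper waits after the click before extracting
    """
    scraped = list(pages[0])  # the first page is already loaded
    for i in range(1, len(pages)):
        # After clicking "next", the new rows show up only after load_seconds.
        loaded_in_time = ajax_timeout >= load_seconds
        # If the wait was too short, the old rows are still on screen.
        visible = pages[i] if loaded_in_time else pages[i - 1]
        scraped.extend(visible)
    return scraped


pages = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
print(scrape(pages, load_seconds=3, ajax_timeout=1))  # timeout too short
print(scrape(pages, load_seconds=3, ajax_timeout=5))  # timeout long enough
```

In this model the short timeout yields the first page twice and never reaches the last page, while the longer timeout yields every row exactly once.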
3) The Extract Data action is not associated with the Loop Item action.
When extracting inside a loop, Octoparse may keep scraping the first item and output it once per iteration. This happens because the Extract Data action reads from the page as a whole instead of from the current loop item.
Solution: See the following tutorial for how to resolve this issue:
Why does Octoparse only extract the first item and duplicate?
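The same behavior can be reproduced outside Octoparse. The following Python sketch uses a hypothetical page with three items and compares a lookup made from the page root with a lookup made relative to the current item. The markup and variable names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with three items.
PAGE = """<div>
  <div class="item"><h2>Alpha</h2></div>
  <div class="item"><h2>Beta</h2></div>
  <div class="item"><h2>Gamma</h2></div>
</div>"""

root = ET.fromstring(PAGE)
items = root.findall("./div[@class='item']")

# Not tied to the loop item: every iteration searches the whole page,
# so it always finds the first <h2>.
from_page = [root.find(".//h2").text for _ in items]

# Tied to the loop item: every iteration searches inside the current item.
from_item = [item.find("./h2").text for item in items]

print(from_page)
print(from_item)
```

The first list repeats the first title for every item, which is the duplicate pattern described above. The second list contains one title per item.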
Spanish version of this article: ¿Por qué mi tarea extrae tantos duplicados?
You can also read web scraping articles on the official website.
Author: Yina
Editor: Yanni