If a task keeps extracting duplicate data, chances are the webpage loads its content with AJAX. In this tutorial, you will learn how to avoid duplicate data when scraping webpages loaded with AJAX.
What is AJAX?
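AJAX (Asynchronous JavaScript and XML) lets a webpage fetch and display new content without reloading the whole page. It is very frequently used for infinite scrolling, where you have to keep scrolling down to the bottom of the page, and for "Load More" buttons, where you have to click repeatedly to fully load the page content.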
We'll take trip.com as an example: https://bit.ly/2J2Fe0m
Trip.com applies the AJAX technique: after you click "Search More Hotels", it loads more listings into the same page. Here is the crawler for troubleshooting.
Octoparse executes each step of the workflow in a top-down, inside-out manner. For Trip.com in particular, pagination is achieved by clicking the "Search More Hotels" button repeatedly, much like a typical "Load More" button.
Imagine we build the loop item inside the pagination loop. The scraping process then goes like this: the page gets loaded, and all available listings on the page get looped and extracted. The "Search More Hotels" button is clicked and more listings get loaded. Then all available listings, including those that have already been extracted, get looped and extracted again, and so on and so forth until the page has been fully loaded. Because each click appends new listings to the same page instead of replacing the old ones, every round re-extracts everything from the earlier rounds, and we end up with a lot of duplicated data.
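To see why the nested setup duplicates rows, here is a minimal Python sketch. It does not use Octoparse or Trip.com at all: `FakePage` is a hypothetical stand-in for a page with a "Search More Hotels" button, and the hotel names and batch size are made up for illustration.

```python
# Hypothetical stand-in for an AJAX page: each click on "Search More Hotels"
# appends another batch of listings to the ones already on the page.
class FakePage:
    def __init__(self, all_listings, batch_size):
        self._all = all_listings
        self._batch = batch_size
        self._shown = batch_size  # the first batch is visible on load

    def visible_listings(self):
        return self._all[:self._shown]

    def has_more(self):
        return self._shown < len(self._all)

    def click_load_more(self):
        self._shown = min(self._shown + self._batch, len(self._all))


def extract_inside_pagination(page):
    """Loop item nested INSIDE the pagination loop (the problematic setup)."""
    rows = []
    while True:
        # Extract everything currently visible, including old listings.
        rows.extend(page.visible_listings())
        if not page.has_more():
            break
        page.click_load_more()
    return rows


hotels = ["Hotel A", "Hotel B", "Hotel C", "Hotel D", "Hotel E", "Hotel F"]
rows = extract_inside_pagination(FakePage(hotels, batch_size=2))
print(len(rows), "rows extracted,", len(set(rows)), "unique")
# prints: 12 rows extracted, 6 unique
```

With six listings loaded two at a time, the first batch is extracted three times and the second batch twice, so the number of duplicates grows with every click.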
How to make it right?
To make the workflow behave correctly in this scenario, we can finish loading the whole page before looping through and extracting the listings. Drag the "Loop Item" box out of the pagination loop and position it right under the pagination loop. This way, the crawler will keep clicking "Search More Hotels" until there are no more listings to load, then loop through all the available listings on the page once to extract whatever data is needed.
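The corrected order can be sketched the same way in plain Python. Again, this is only an illustration under assumptions: `FakePage` is a hypothetical stand-in for the page, not part of Octoparse, and the listings are invented.

```python
# Hypothetical stand-in for an AJAX page: each click on "Search More Hotels"
# appends another batch of listings to the ones already on the page.
class FakePage:
    def __init__(self, all_listings, batch_size):
        self._all = all_listings
        self._batch = batch_size
        self._shown = batch_size  # the first batch is visible on load

    def visible_listings(self):
        return self._all[:self._shown]

    def has_more(self):
        return self._shown < len(self._all)

    def click_load_more(self):
        self._shown = min(self._shown + self._batch, len(self._all))


def extract_after_pagination(page):
    """Loop item placed AFTER the pagination loop (the corrected setup)."""
    # Step 1: keep clicking until nothing more can be loaded.
    while page.has_more():
        page.click_load_more()
    # Step 2: loop through the fully loaded page exactly once.
    return list(page.visible_listings())


hotels = ["Hotel A", "Hotel B", "Hotel C", "Hotel D", "Hotel E", "Hotel F"]
rows = extract_after_pagination(FakePage(hotels, batch_size=2))
print(len(rows), "rows extracted,", len(set(rows)), "unique")
# prints: 6 rows extracted, 6 unique
```

Because extraction only starts once the page has stopped growing, each listing is visited a single time.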