When we create a list of items for a website, sometimes the list may include several unwanted “Ads” items.
Take Yelp as an example: the auto-generated loop may include all the sponsored results that we don't need.
So what should you do if you only want to scrape the non-ads items?
A convenient way to do it is to delete the unwanted data rows by clicking the trash bin icon under the Action column.
However, to exclude every unwanted item automatically, we have to modify the XPath of the loop item so that it only locates the non-ads items.
If you check the source code of the items in Chrome DevTools, you will see that there are no major differences between ads items and non-ads items.
So we need to narrow our selection using XPath. Time for an XPath quiz!
You may want to open the following link to follow along:
A simple way to narrow down your selection is to add more conditions for your XPath:
(1) The data we need is listed under the "All results" section, so that's where we begin our selection.
(2) Each page contains only ten results, so we need to end our selection after the first ten results.
So the final XPath will be:
Now Octoparse will exclude all the unwanted promotional items from our list.
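To see how the two conditions work together, here is a minimal Python sketch on made-up markup. It is not Yelp's real page structure and not the XPath used in Octoparse: the `ul`/`li`/`h3` layout and the helper names `build_toy_page` and `select_non_ads` are assumptions for illustration only. Python's built-in ElementTree supports only a subset of XPath, so the sketch applies the same two rules in plain code: start after the "All results" heading, then stop after ten items.

```python
import xml.etree.ElementTree as ET


def build_toy_page(n_ads=2, n_results=12, with_heading=True):
    """Build a hypothetical result list: ads, an 'All results' heading, then results."""
    ul = ET.Element("ul")
    for i in range(n_ads):
        ET.SubElement(ul, "li").text = f"Ad {i + 1}"
    if with_heading:
        heading = ET.SubElement(ul, "li")
        ET.SubElement(heading, "h3").text = "All results"
    for i in range(n_results):
        ET.SubElement(ul, "li").text = f"Result {i + 1}"
    return ul


def select_non_ads(ul, heading_text="All results", limit=10):
    """Return up to `limit` items that come after the heading item."""
    items = list(ul)  # direct children, in document order
    # Condition (1): find where the "All results" section begins.
    start = next(
        (i for i, li in enumerate(items) if li.findtext("h3") == heading_text),
        None,
    )
    if start is None:
        return []  # no heading found, so nothing is selected
    # Condition (2): keep only the first `limit` items after the heading.
    return items[start + 1 : start + 1 + limit]


# On this toy markup, a full XPath 1.0 engine (not ElementTree) could express
# the same selection roughly as:
#   //li[preceding-sibling::li[h3='All results']][position() <= 10]
page = build_toy_page()
selected = [li.text for li in select_non_ads(page)]
print(selected)
```

The ads placed before the heading are skipped, and the eleventh and twelfth items are dropped by the limit. On a real page the element names and the heading text will differ, so inspect the page in Chrome DevTools before adapting the idea.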