Websites such as news portals or forums typically have new content added frequently, often dynamically. To stay up to date with such websites, Octoparse’s incremental extraction lets you extract updated data much more efficiently by skipping the pages that have already been extracted; in other words, it scrapes only the new ones.
When do you want to use incremental extraction?
Consider enabling incremental extraction if both of the following conditions are met:
1. You need updated data from a single website frequently.
2. The new information shows up as new web pages with new URLs (as opposed to new information being added to or updated on existing web pages).
A perfect example is CNN.com. Imagine you need to get news feeds from CNN.com almost in real time. It is important to schedule and run the task/crawler as frequently as needed so that whatever gets added to the site can be extracted in a timely manner, so criterion (1) is met. Each news article on CNN.com has a different URL that can be easily identified, so criterion (2) is also met.
Assuming you have a task set up for the job, it doesn't make sense to re-scrape articles that have already been captured in previous runs. With incremental extraction, the URLs are checked first to make sure they have not already been extracted, and only the ones that are truly new are captured.
How does incremental extraction identify the "new" data?
Incremental extraction works only if newly added data can be identified by new URLs. During the extraction process, Octoparse checks each URL to determine whether it was crawled before. If a URL is identified as one from a previous crawl, it is skipped automatically when running with incremental extraction.
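The check described above can be pictured as a lookup against a set of previously seen URLs. The following is a minimal Python sketch of that idea under stated assumptions, not Octoparse's actual implementation: the function name `filter_new_urls`, the in-memory set, and the example.com URLs are all hypothetical.

```python
def filter_new_urls(candidate_urls, seen_urls):
    """Return the URLs not yet in seen_urls and record them as seen.

    Illustrative only: a real crawler would persist seen_urls between runs.
    """
    new_urls = []
    for url in candidate_urls:
        if url in seen_urls:
            continue  # extracted in a previous run, so skip it
        seen_urls.add(url)
        new_urls.append(url)
    return new_urls


# First run: nothing has been seen yet, so every URL counts as new.
seen = set()
first_run = filter_new_urls(
    ["https://example.com/news/1", "https://example.com/news/2"], seen
)

# Second run: only the URL that was not seen before is returned.
second_run = filter_new_urls(
    ["https://example.com/news/2", "https://example.com/news/3"], seen
)
print(second_run)  # ['https://example.com/news/3']
```

Because the decision is based purely on the URL, a page whose content changes while its URL stays the same would be skipped in this sketch, which is why new data must appear under new URLs.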
How to set up incremental extraction?
You can enable incremental extraction by following the steps below:
1. First, make sure the Extract data step in the workflow is selected, then click Setting
2. Tick Enable incremental extraction
3. Select Match the entire URL or Match by part of the URL
Match the entire URL
With this option, Octoparse compares the entire URL against the URLs that were extracted before. Even the slightest difference will cause a URL to be identified as "new".
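To see why even a small difference matters, whole-URL matching can be thought of as an exact string comparison. This is a hedged Python sketch, not Octoparse's code; `is_new_entire_url` and the example URLs are hypothetical names chosen for illustration.

```python
def is_new_entire_url(url, seen_urls):
    # Exact string comparison: any difference at all makes the URL "new".
    return url not in seen_urls


seen = {"https://example.com/item?id=42&ref=home"}

# Identical string: treated as already extracted.
same_url = is_new_entire_url("https://example.com/item?id=42&ref=home", seen)

# Same item, but a different "ref" value: treated as new.
different_ref = is_new_entire_url("https://example.com/item?id=42&ref=search", seen)

print(same_url, different_ref)  # False True
```

In this sketch, a tracking value that changes between visits is enough to make the same page look new, which is the situation that matching by part of the URL is meant to handle.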
Match by part of the URL
In many cases, URLs are composed of various attributes. For example, an eBay URL can include attributes such as "_from", "_trksid", "_nkw", and "sacat" (an attribute is usually whatever comes before an "=" sign).
When running with incremental extraction, Octoparse detects the attributes automatically and makes them available as parameters. By selecting one or more attributes as parameters for the match, you tell Octoparse to compare the current URL based on the selected attributes only. If they are the same as those of a previously extracted URL, the page is skipped; otherwise, it is scraped.
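Matching on selected attributes can be sketched in Python with the standard `urllib.parse` module. This is an illustration under assumptions, not Octoparse's implementation: the helper names `match_key` and `is_new_by_params`, the example.com URLs, and the choice of "_nkw" as the selected attribute are hypothetical, and the sketch deliberately ignores every part of the URL other than the selected attributes.

```python
from urllib.parse import parse_qs, urlsplit


def match_key(url, selected_params):
    """Build a comparison key from the selected query attributes only."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # An attribute missing from the URL contributes an empty tuple.
    return tuple((name, tuple(query.get(name, []))) for name in selected_params)


def is_new_by_params(url, seen_keys, selected_params):
    """Return True and record the key if no earlier URL produced the same key."""
    key = match_key(url, selected_params)
    if key in seen_keys:
        return False  # selected attributes match an earlier URL, so skip
    seen_keys.add(key)
    return True


seen_keys = set()
selected = ["_nkw"]  # hypothetical selection for this example

first = is_new_by_params(
    "https://example.com/sch?_nkw=laptop&_trksid=abc", seen_keys, selected
)
# Only "_trksid" differs, and it is not selected, so this URL is skipped.
tracking_changed = is_new_by_params(
    "https://example.com/sch?_nkw=laptop&_trksid=xyz", seen_keys, selected
)
# "_nkw" differs, so this URL is treated as new.
keyword_changed = is_new_by_params(
    "https://example.com/sch?_nkw=phone&_trksid=abc", seen_keys, selected
)
print(first, tracking_changed, keyword_changed)  # True False True
```

The design point is that attributes left unselected, such as a tracking value, no longer affect whether a page counts as new.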
1. Incremental extraction is only available for Cloud Extraction and for tasks with only one
If you have any questions, you are welcome to reach out to us.