Websites such as news portals and forums typically add new content frequently, often dynamically. To stay up to date with such websites, Octoparse’s incremental extraction lets you extract updated data much more efficiently by skipping the pages that have already been extracted. In other words, it scrapes only the new ones.
When do you want to use incremental extraction?
Consider enabling incremental extraction if both of the following conditions are met:
1. You need updated data from a single website quite frequently.
2. New information shows up as new web pages with new URLs (as opposed to being added to or updated on existing web pages).
A perfect example is CNN.com. Imagine you need to get news feeds from CNN.com almost in real time. It is important to schedule and run the task/crawler as frequently as needed so that whatever gets added to the site can be extracted in a timely manner, so criterion (1) is met. Each news article on CNN.com has a different URL that can be easily identified, so criterion (2) is also met.
Assume you have a task set up for the job. It doesn't make sense to re-scrape articles that were already captured in previous runs. Using incremental extraction, you can have the URLs checked first to make sure they have not been extracted already, and capture only the ones that are truly new.
How does incremental extraction identify the "new" data?
Incremental extraction works only if newly added data can be identified by new URLs. During the extraction process, Octoparse checks each URL to determine whether it has been crawled before. If a URL is identified as one from a previous crawl, it is skipped automatically when running with incremental extraction.
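The idea of skipping already-crawled URLs can be pictured with a short Python sketch. This is only an illustration of the general technique, not Octoparse's actual implementation; the function name `filter_new_urls` and the example.com URLs are made up for the example.

```python
def filter_new_urls(candidate_urls, seen_urls):
    """Return the URLs that have not been crawled before and record them as seen."""
    new_urls = []
    for url in candidate_urls:
        if url in seen_urls:
            continue  # crawled in a previous run: skip it
        seen_urls.add(url)  # remember it so the next run skips it
        new_urls.append(url)
    return new_urls


# Hypothetical data: URLs captured by earlier runs, and the URLs found in this run.
seen = {"https://example.com/news/1", "https://example.com/news/2"}
latest = ["https://example.com/news/2", "https://example.com/news/3"]

to_scrape = filter_new_urls(latest, seen)
print(to_scrape)  # ['https://example.com/news/3']
```

Only the article that was not seen before is returned, and it is added to the set so that a later run would skip it as well.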
How to set up incremental extraction?
You can easily enable incremental extraction by following the steps below:
1. Make sure the Extract data step in the workflow is selected, then click Setting
2. Tick Enable incremental extraction
3. Select Identify by the entire URL or Identify by part of the URL
Identify by the entire URL
With this option, Octoparse compares the entire current URL against the previously extracted URLs. Even the slightest difference causes it to be identified as a "new" URL.
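As a rough illustration of what matching on the entire URL implies, the Python sketch below uses a plain string comparison. It is an assumption about the general behavior described here, not Octoparse's code; `is_new_by_entire_url`, the URLs, and the `ref` parameter are hypothetical.

```python
def is_new_by_entire_url(url, seen_urls):
    # Exact string comparison: any difference at all makes the URL "new".
    return url not in seen_urls


# Hypothetical URL captured in a previous run.
seen = {"https://example.com/item?id=42&ref=home"}

same = is_new_by_entire_url("https://example.com/item?id=42&ref=home", seen)
different_ref = is_new_by_entire_url("https://example.com/item?id=42&ref=search", seen)

print(same)           # False: identical URL, so it would be skipped
print(different_ref)  # True: only "ref" changed, yet it counts as new
```

The second URL points to the same item but differs in one parameter, so an entire-URL match would scrape it again.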
Identify by part of the URL
In many cases, URLs are composed of various attributes. For example, an eBay URL can include the attributes "_from", "_trksid", "_nkw", and "sacat" (an attribute name is usually whatever comes before an "=" sign).
When running with incremental extraction, Octoparse detects the attributes automatically and makes them available as parameters. By selecting one or more attributes as parameters for the match, you tell Octoparse to compare the current URL based on the selected attributes only: if all of them are the same as in a previously extracted URL, the page is skipped; otherwise, the page is scraped.
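Matching on part of the URL can be sketched in Python with the standard `urllib.parse` module. The sketch assumes the rule stated in the notes, namely that a URL counts as "new" when any selected parameter differs. It is not Octoparse's implementation; the helper names `url_key` and `is_new_by_params`, the URLs, and the `id` and `ref` parameters are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def url_key(url, selected_params):
    """Build a comparison key from only the selected query parameters."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return tuple((name, tuple(query.get(name, []))) for name in selected_params)


def is_new_by_params(url, selected_params, seen_keys):
    """Treat a URL as new if any selected parameter differs from every seen key."""
    key = url_key(url, selected_params)
    if key in seen_keys:
        return False  # all selected parameters match a previous URL: skip
    seen_keys.add(key)
    return True


seen_keys = set()
selected = ["id"]  # hypothetical choice: match on the "id" parameter only

first = is_new_by_params("https://example.com/item?id=42&ref=home", selected, seen_keys)
second = is_new_by_params("https://example.com/item?id=42&ref=search", selected, seen_keys)
third = is_new_by_params("https://example.com/item?id=43&ref=home", selected, seen_keys)

print(first, second, third)  # True False True
```

Because only `id` is selected, the second URL is skipped even though its `ref` value changed, while the third URL is scraped because its `id` is different.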
1. Incremental extraction is only available for Cloud Extraction and for tasks with only one "Extract Data" action.
2. If no parameters are shown when you choose "Identify by part of the URL" with the "Extract data" action selected, the URL does not contain any parameters, and you can only select "Identify by the entire URL".
3. When multiple parameters are selected, Octoparse identifies the current URL as a "new" URL when any of those parameters is different.
Written by Yina Huang (Octoparse Team)