In version 7.2, Octoparse introduced a new feature, "Triggers". With a trigger, users can define one or more conditions that decide whether a line of data should be extracted. A trigger can easily be added in the Extract Data step.
When should you use Triggers?
For example, suppose you only want to scrape part of the data on a web page, say, products priced under $100. You can use a trigger to abandon the "useless" lines of data, that is, any product priced at $100 or more, and keep only the ones you need.
To achieve this, create a trigger like this: if the data field "price" is greater than or equal to "100", abandon the line of data. Octoparse then checks whether each line of data meets the defined criteria before extracting it, so the final dataset is clean and contains only the data you want.
Another useful application is extracting data associated with a specific date, say, all news articles published today (e.g. 2019-01-01). To achieve this, create a trigger: if the data field "date" is not "2019-01-01", abandon the line of data. As a result, you will only fetch the articles for 2019-01-01.
Multiple conditions can be used together. For example, if you need to extract news articles for 2019-01-01, and only those whose title contains the word "CPI", you can use the following two conditions:
Condition 1: If the data field "date" is not "2019-01-01", abandon the line of data.
Condition 2: If the data field "title" does not contain "CPI", abandon the line of data.
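Triggers are configured in the Octoparse interface rather than in code, but the effect of the two conditions can be modelled in a few lines of Python. This is only a conceptual sketch: the sample `rows` and the `keep_row` helper are hypothetical and are not part of Octoparse.

```python
# Hypothetical sample rows standing in for lines of data on a news page.
rows = [
    {"date": "2019-01-01", "title": "CPI rises in December"},
    {"date": "2019-01-01", "title": "Local team wins final"},
    {"date": "2018-12-31", "title": "CPI forecast for 2019"},
]


def keep_row(row):
    """Return False as soon as either 'abandon' condition fires."""
    # Condition 1: the date is not "2019-01-01", so abandon the line.
    if row["date"] != "2019-01-01":
        return False
    # Condition 2: the title does not contain "CPI", so abandon the line.
    if "CPI" not in row["title"]:
        return False
    return True


kept = [row for row in rows if keep_row(row)]
print(kept)
```

Only the first sample row survives, because it is the only one that matches the date and mentions "CPI" in its title.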
How to set up Triggers?
1. Create a new trigger
- Click "Add trigger" to create a new trigger
2. Name the Trigger
- Name the trigger by typing in the name directly
3. Define the Trigger
- Select the target data field. In the example below, the data field "title" is selected.
- Set the condition for the selected data field. You can set conditions based on "text", "numerals" or "time"
a. For general text
There are five options for general text: is, is not, contains, does not contain, is not blank.
For example, if you select "contains" and type the word "pen" in the text box, the condition will be: If the data field "Title" contains the word "pen".
If "is not blank" is selected, there is no need to fill in the text box, and the condition will be: If the data field "Title" is not blank.
b. For numerals
There are four options available for numerals: greater than, less than, greater than or equal to, less than or equal to.
For example, if you select data field "Price", "greater than", and fill in the value "8", the condition will be: If the data field "Price" is greater than 8.
c. For time
There are four options available for time: after, before, on or after, on or before.
For example, for the data field "Published_Time", if you select "after" and "00:00 the extraction day" and choose "Abandon this line of data", the condition will be: if the published time is after 00:00 on the extraction day, abandon the line of data. As a result, only articles published at or before 00:00 on the extraction day get fetched.
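The three kinds of condition can be pictured as lookup tables from an option name to a comparison. The Python below is a rough model only, not Octoparse's implementation: the table names and the `condition_holds` helper are hypothetical, and "less than or equal to" is assumed by symmetry with the time options.

```python
import operator
from datetime import datetime

# Option names follow the lists above; the mapping to Python is an assumption.
TEXT_OPS = {
    "is": lambda value, target: value == target,
    "is not": lambda value, target: value != target,
    "contains": lambda value, target: target in value,
    "does not contain": lambda value, target: target not in value,
    "is not blank": lambda value, target: value.strip() != "",
}
NUMBER_OPS = {
    "greater than": operator.gt,
    "less than": operator.lt,
    "greater than or equal to": operator.ge,
    "less than or equal to": operator.le,  # assumed by symmetry
}
TIME_OPS = {
    "after": operator.gt,
    "before": operator.lt,
    "on or after": operator.ge,
    "on or before": operator.le,
}
OPS_BY_KIND = {"text": TEXT_OPS, "numerals": NUMBER_OPS, "time": TIME_OPS}


def condition_holds(kind, option, value, target=None):
    """Evaluate one condition; `target` is unused for 'is not blank'."""
    return OPS_BY_KIND[kind][option](value, target)


print(condition_holds("text", "contains", "ballpoint pen", "pen"))
print(condition_holds("text", "is not blank", "   "))
print(condition_holds("numerals", "greater than", 9.5, 8))
midnight = datetime(2019, 1, 1, 0, 0)
print(condition_holds("time", "after", datetime(2019, 1, 1, 8, 30), midnight))
```

Keeping each kind in its own table mirrors the way the interface offers a different list of options depending on whether the field is treated as text, a numeral, or a time.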
4. Add more conditions using [AND] or [OR]
Multiple conditions can be added to the same trigger. Use condition [AND] or condition [OR] to define the relationships between the various conditions.
If you click "Add [AND] condition" and add a condition, the action will be executed if the data field meets both conditions.
If you click "Add [OR] condition" and add a condition, the action will be executed if the data field meets either one of the two conditions.
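The difference between [AND] and [OR] can be sketched in Python with `all` and `any`. This is a simplified model under stated assumptions: `trigger_fires` and the two lambda conditions are hypothetical names, and the sketch does not try to reproduce how Octoparse evaluates a trigger that mixes [AND] and [OR] conditions.

```python
def trigger_fires(row, conditions, mode="AND"):
    """Return True if the trigger's action should run for this row."""
    results = [condition(row) for condition in conditions]
    # [AND]: every condition must hold. [OR]: at least one must hold.
    return all(results) if mode == "AND" else any(results)


# Hypothetical row: the date matches, but the title does not mention "CPI".
row = {"date": "2019-01-01", "title": "Local team wins final"}
date_is_wrong = lambda r: r["date"] != "2019-01-01"
title_lacks_cpi = lambda r: "CPI" not in r["title"]

print(trigger_fires(row, [date_is_wrong, title_lacks_cpi], mode="AND"))  # False
print(trigger_fires(row, [date_is_wrong, title_lacks_cpi], mode="OR"))   # True
```

Note the consequence for "abandon" triggers: to keep only rows that match both the date and the keyword, the two abandon conditions have to be joined with [OR], since a row should be dropped when either check fails.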
5. Choose the action to take
Now that the conditions are defined, choose which of the following actions Octoparse will execute when the conditions are met.
a. Abandon this line of data
If "Abandon this line of data" is selected, Octoparse will abandon this line of data regardless of whether the other data of the same line has been extracted or not.
More specifically, suppose that a task has two "Extract Data" steps and only the latter one has a trigger. Even if the data for the first "Extract Data" step has already been extracted, Octoparse will abandon the whole line of data once the trigger on the latter step fires.
b. End the loop
If "End the loop" is selected, you'll need to select one of the loop items from the drop-down list. The selected loop item will be ended once the corresponding condition is satisfied.
c. Terminate the extraction
If "Terminate the extraction" is selected, the extraction will be terminated once the corresponding condition is satisfied.
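The three actions loosely correspond to `continue`, `break`, and `return` in a nested loop. The Python sketch below illustrates that analogy only: `run_extraction` and its arguments are hypothetical, the inner loop stands in for the loop item chosen under "End the loop", and keeping already-extracted rows on termination is an assumption rather than documented Octoparse behavior.

```python
def run_extraction(pages, trigger, action):
    """Walk pages (outer loop) and rows (inner loop), applying one action."""
    extracted = []
    for page in pages:
        for row in page:
            if trigger(row):
                if action == "abandon":
                    continue          # skip this line, keep going
                if action == "end_loop":
                    break             # leave the inner loop, move to the next page
                if action == "terminate":
                    return extracted  # stop the whole extraction
            extracted.append(row)
    return extracted


# Hypothetical prices on two pages; the trigger fires at 100 or more.
pages = [[1, 200, 3], [4, 5]]
price_too_high = lambda price: price >= 100

print(run_extraction(pages, price_too_high, "abandon"))    # [1, 3, 4, 5]
print(run_extraction(pages, price_too_high, "end_loop"))   # [1, 4, 5]
print(run_extraction(pages, price_too_high, "terminate"))  # [1]
```

With "abandon" only the offending line is lost, with "end_loop" the rest of that page is skipped as well, and with "terminate" nothing after the offending line is collected.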