Remove duplicates
Duplicates can appear in a dataset because the website itself contains duplicated data, or because the task was set up to capture the same data more than once. When this happens, there are two ways to remove the duplicates, depending on your data requirements.
- Remove duplicates when the entire data lines are the same (default setting)
- Remove duplicates when selected data fields are the same (manual setting, only for versions 8.1.16 and above)
1. Remove duplicates when the entire data lines are the same (default setting)
When the run is completed, Octoparse treats data lines as duplicates when the entire lines are the same (all the data fields are the same) by default. You can remove the duplicates and keep only the unique lines.
Example: Line #1 and line #4 below have the same value for each data field, so they are duplicates. After de-dup, Octoparse will only keep the 1st data line extracted, which is line #1 in this case.
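To make the default rule concrete, here is a minimal Python sketch of whole-line de-duplication that keeps the first occurrence. It is only an illustration of the rule described above, not Octoparse's actual implementation; the function name `dedup_full_lines` and the sample rows are made up for this example.

```python
def dedup_full_lines(rows):
    """Keep the first occurrence of each line; drop later lines whose
    fields ALL match an earlier line."""
    seen = set()
    unique = []
    for row in rows:
        # The comparison key covers every field in the line.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data: line #1 and line #4 are identical.
rows = [
    {"Field1": "A", "Field2": "X", "Field3": "1", "Field4": "p"},  # line #1
    {"Field1": "B", "Field2": "X", "Field3": "2", "Field4": "q"},  # line #2
    {"Field1": "C", "Field2": "Y", "Field3": "3", "Field4": "r"},  # line #3
    {"Field1": "A", "Field2": "X", "Field3": "1", "Field4": "p"},  # line #4
]

result = dedup_full_lines(rows)
print(len(result))  # 3 lines remain; line #4 was dropped
```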
2. Remove duplicates when selected data fields are the same
Note: This feature is for Octoparse 8.1.16 and above.
When you're building the task workflow, you can further customize the task to remove data lines that share the same values for one or more data fields. As long as the values of the selected data fields are the same, the data lines will be treated as duplicates. Other unselected data field(s) will not be considered.
Example 1: If we select "Field2" to compare for data deduplication, then line #1, line #2 and line #4 all have the same value for "Field2". In this case, these data lines will be considered as duplicates. After de-dup, Octoparse will only keep the 1st data line extracted, which is line #1 in this case, and get rid of line #2 and line #4.
Example 2: If we select "Field3" and "Field4" to compare for data deduplication, then line #1 and line #4 both have the same values for "Field3" and "Field4" respectively. In this case, line #1 and line #4 will be considered duplicates. After de-dup, Octoparse will only keep the 1st data line extracted, which is line #1 in this case, and get rid of line #4 automatically.
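The two examples above can be sketched in Python as follows. This is a rough model of the rule under stated assumptions, not Octoparse's own code; `dedup_by_fields` and the sample rows are hypothetical and were chosen to match the two examples.

```python
def dedup_by_fields(rows, fields):
    """Keep the first line for each combination of values in `fields`.
    Unselected fields are ignored when comparing."""
    seen = set()
    unique = []
    for row in rows:
        # Only the selected fields contribute to the comparison key.
        key = tuple(row[f] for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data (one dict per extracted line).
rows = [
    {"Field1": "A", "Field2": "X", "Field3": "1", "Field4": "p"},  # line #1
    {"Field1": "B", "Field2": "X", "Field3": "2", "Field4": "q"},  # line #2
    {"Field1": "C", "Field2": "Y", "Field3": "3", "Field4": "r"},  # line #3
    {"Field1": "D", "Field2": "X", "Field3": "1", "Field4": "p"},  # line #4
]

# Example 1: compare on Field2 only. Lines #2 and #4 are dropped.
by_field2 = dedup_by_fields(rows, ["Field2"])
print([r["Field1"] for r in by_field2])  # ['A', 'C']

# Example 2: compare on Field3 and Field4. Only line #4 is dropped.
by_field3_4 = dedup_by_fields(rows, ["Field3", "Field4"])
print([r["Field1"] for r in by_field3_4])  # ['A', 'B', 'C']
```

Note that line #4 differs from line #1 in "Field1" here, yet it is still removed in both cases because "Field1" was not selected for comparison.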
Follow the steps below to customize de-dup settings:
1. Set up the task and the data fields you need to collect
2. Click the icon in the top-right corner of Data Preview
3. Select the data field(s) you'd like to compare for de-duplication. After selection, click Apply to save the settings.
For Cloud runs, only data that has been processed with the same de-dup setting will be compared and de-duplicated on a continuous basis.
For example, let's say you set the 1st de-dup setting as A (e.g. select "Field1" to compare) and get the first batch of Cloud data.
Then you go back to your task, change the de-dup setting to B (e.g. select "Field2" to compare), and get the 2nd batch of Cloud data. This second batch of data will not be compared against the 1st batch of data for de-duplication.
After that, if you change the setting back to A (e.g. select "Field1" to compare) and get the third batch of Cloud data, this third batch of data will be compared and de-duplicated against the 1st batch of Cloud data.
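One way to picture the Cloud behavior is to imagine a separate record of already-seen values for each de-dup setting. The Python sketch below models only the behavior described in this article; it is an assumption about how one could reproduce it, not a description of how Octoparse stores Cloud data. The class `CloudDedup`, its method `add_batch`, and all sample values are hypothetical.

```python
from collections import defaultdict


class CloudDedup:
    """Toy model: each de-dup setting keeps its own history of seen keys."""

    def __init__(self):
        # Maps a setting (tuple of field names) to the keys seen under it.
        self._seen = defaultdict(set)

    def add_batch(self, rows, fields):
        setting = tuple(fields)
        seen = self._seen[setting]
        kept = []
        for row in rows:
            key = tuple(row[f] for f in fields)
            if key not in seen:
                seen.add(key)
                kept.append(row)
        return kept


cloud = CloudDedup()

# Batch 1 under setting A (compare "Field1").
batch1 = cloud.add_batch(
    [{"Field1": "a", "Field2": "x"}, {"Field1": "b", "Field2": "y"}],
    ["Field1"],
)

# Batch 2 under setting B (compare "Field2"). "x" also appeared in batch 1,
# but batch 1 was processed under setting A, so this line is kept.
batch2 = cloud.add_batch([{"Field1": "c", "Field2": "x"}], ["Field2"])

# Batch 3 under setting A again: compared against batch 1, so "a" is dropped.
batch3 = cloud.add_batch(
    [{"Field1": "a", "Field2": "w"}, {"Field1": "d", "Field2": "w"}],
    ["Field1"],
)

print(len(batch1), len(batch2), len(batch3))  # 2 1 1
```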
Feel free to leave a message if you have more questions about data deduplication in Octoparse. We will get back to you ASAP.
Author: Scarlett
Editor: Isabel