How to cleanse data extracted with Octoparse?
In this article, we introduce some common ways to cleanse data using "Reformat" and "Trigger".
We will use the following link as the starting point: https://www.walmart.com/search/?cat_id=0&query=computer
1. Reformat
1) Extract rating from HTML with "Match with Regular Expression"
When extracting product information, we usually want the "Rating" of each product. However, many ratings appear as star icons instead of numbers. In that case, you can use "Reformat" to extract the rating number from the HTML behind the icons.
- Click the whole rating icon
- Select "Extract Outer HTML of the selected element" in the "Action Panel"
- Rename the field as "Rating"
Tips: HTML is the standard markup language for creating web pages. When we extract the inner HTML of an element, we get only the markup contained within that element. Outer HTML is an element property that includes the opening and closing tags as well as the content, so it also carries the element's own attributes. If the information you need cannot be found in the inner HTML, it may still be found in the outer HTML.
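To make the difference concrete, here is a minimal Python sketch. The `span` snippet and its `aria-label` value are made up for illustration and are not the actual markup of the page; the point is only that an attribute on the element itself shows up in the outer HTML but not in the inner HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a rating icon (an assumption, not the real page source).
outer_html = '<span class="stars" aria-label="4.5 Stars"><i>star icon</i></span>'

elem = ET.fromstring(outer_html)

# Inner HTML: the element's leading text plus the markup of its children,
# without the element's own opening and closing tags.
inner_html = (elem.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in elem
)

print("outer:", outer_html)
print("inner:", inner_html)
print("aria-label in outer HTML:", "aria-label" in outer_html)
print("aria-label in inner HTML:", "aria-label" in inner_html)
```

In this sketch the rating text lives in an attribute of the outer tag, which is why the steps above extract the outer HTML.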
Then we start to reformat the data.
- Click the "Customize data field" icon
- Click "Refine extracted data"
- Click "Add step"
- Select "Match with Regular Expression"
Here we can see the outer HTML of this icon, and the rating number sits after "aria-label='" and before "Stars".
Now we can use the RegEx tool to extract the "Rating".
- Click "Try Regex Tool"
- Check "Starts with" and type in "aria-label="
- Check "Ends with" and type in " Stars"
- Click "Generate" and then click "Match" to check whether the rating is extracted correctly
- Click "Apply" and then "OK" to save
Now you have the rating number you need.
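The same "starts with / ends with" idea can be sketched outside Octoparse with Python's `re` module. This is only an illustration: the sample HTML is hypothetical, and the pattern is an assumption about what such a match looks like, not the expression that the RegEx tool actually generates.

```python
import re

# Hypothetical outer HTML of a rating icon; the real markup may differ.
outer_html = '<span class="stars" aria-label="4.5 Stars"><i></i></span>'

# Capture whatever lies between the start marker (aria-label=")
# and the end marker ( Stars). The lazy .*? stops at the first end marker.
pattern = re.compile(r'aria-label="(.*?) Stars')

match = pattern.search(outer_html)
rating = match.group(1) if match else None

print(rating)  # 4.5
```

If the pattern finds nothing, `rating` is `None`, which mirrors checking the result with "Match" before applying the expression.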
2) Reformat data with "Replace"
Sometimes we need to delete part of the data we have just extracted. "Replace" is a good choice for this. Suppose we want to delete every comma in the "Title" field because we plan to export the data as a CSV file. The comma is the delimiter in a CSV file, so an unquoted comma inside a field can split the value into separate columns.
For example, the original title is "HP 15 Laptop 15.6", Intel Core i3, 4GB SDRAM, 1TB HDD, Natural Silver, 15-bs031wm" and we want to transform it into "HP 15 Laptop 15.6" Intel Core i3 4GB SDRAM 1TB HDD Natural Silver 15-bs031wm"
- Click the "Customize data field" icon
- Click "Refine extracted data"
- Click "Add step"
- Select "Replace"
Then we start to replace all the commas.
- Type "," in the "Replace" text box
- Leave the "With" text box empty
- Click "OK" and then "OK" again to continue
Now there is no comma in the title. You can also replace any word with another word in the same way.
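For comparison, the same plain-text replacement can be sketched in Python with `str.replace`, using the example title from above. The variable names are illustrative only.

```python
# The example title from this article, with commas between the attributes.
title = 'HP 15 Laptop 15.6", Intel Core i3, 4GB SDRAM, 1TB HDD, Natural Silver, 15-bs031wm'

# Replace every "," with an empty string, as in the "Replace" step.
cleaned = title.replace(",", "")

print(cleaned)
# HP 15 Laptop 15.6" Intel Core i3 4GB SDRAM 1TB HDD Natural Silver 15-bs031wm
```

Only the commas are removed; the space that followed each comma stays, so the words remain separated.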
3) Reformat data with "Replace with Regular Expression"
When you need to replace text based on its position, "Replace with Regular Expression" is a more precise choice than "Replace". For example, suppose we want to keep only the name and the model code in the "Title" field and delete everything else. The name comes before the first comma and the model code comes after the last comma.
For example, the original data is "HP 15 Laptop 15.6", Intel Core i3, 4GB SDRAM, 1TB HDD, Natural Silver, 15-bs031wm" and we want to transform it into "HP 15 Laptop 15.6", 15-bs031wm"
- Click the "Customize data field" icon
- Click "Refine extracted data"
- Click "Add step"
- Select "Replace with Regular Expression"
Then we start to replace the information.
- Click "Try Regex Tool"
- Check "Starts with" and type in ","
- Check "Ends with" and type in ","
With these settings, the expression matches the text between two commas, such as " Intel Core i3". However, we need to keep one comma between the title and the model code, so the match must also include the leading comma. Checking "Include Start" next to the first text box makes the match ", Intel Core i3" instead.
- Check "Include Start" next to the first text box
- Click "Generate" and then click "Match" to check whether the right text is matched
- Click "Apply" and then "OK" to save
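The effect of these settings can be approximated in Python with `re.sub`. The pattern below is an assumption written for this sketch, not the expression the RegEx tool generates: it matches a comma plus the text after it ("Include Start"), but only when another comma follows, and that closing comma is left in place.

```python
import re

title = 'HP 15 Laptop 15.6", Intel Core i3, 4GB SDRAM, 1TB HDD, Natural Silver, 15-bs031wm'

# ,[^,]*   a comma and the text up to (not including) the next comma
# (?=,)    lookahead: require a following comma without consuming it
middle_parts = re.compile(r",[^,]*(?=,)")

# Replace each matched middle segment with nothing.
result = middle_parts.sub("", title)

print(result)
# HP 15 Laptop 15.6", 15-bs031wm
```

The last segment, ", 15-bs031wm", has no comma after it, so it is never matched and the model code survives together with a single separating comma.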
Tips! The difference between "Match with Regular Expression" and "Replace with Regular Expression" is that "Match with Regular Expression" keeps the text located by the regular expression, while "Replace with Regular Expression" changes the text located by the regular expression. You can find more details about the Reformat function in "Re-format data extracted".
2. Trigger
Sometimes we only need records whose numbers fall within a specific range or whose times fall within a certain time frame. In this case, a "Trigger" is the solution.
Here are the triggers' conditions.
For example, suppose we need to scrape all products whose price is greater than 200. In other words, we abandon all items whose price is equal to or less than 200.
Start by adding a trigger.
- Click "Trigger" and then click "Add Trigger"
- Select "Price", "less than" and type in "200"
Here, select the field you want to set a condition on and adjust the formula as needed. Remember that the condition describes the data you want to abandon, not the data you want to keep. Note that "less than" on its own does not match a price of exactly 200.
- Leave the "Do" dropdown list set to "Abandon this line of data"
- Click "OK" and then click "OK" again to save
Here is the sample output. As you can see, all the prices extracted are greater than 200.
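The logic of this trigger can be sketched in Python as a simple filter. The rows, field names, and the `should_abandon` helper are hypothetical and exist only to illustrate the idea that the condition selects the rows to drop.

```python
# Hypothetical extracted rows; the titles and prices are made up.
rows = [
    {"Title": "Laptop A", "Price": 149.99},
    {"Title": "Laptop B", "Price": 200.00},
    {"Title": "Laptop C", "Price": 329.00},
]

def should_abandon(row, threshold=200):
    """Mirror the trigger: abandon the row when Price is less than the threshold."""
    return row["Price"] < threshold

# Keep every row that the trigger does not abandon.
kept = [row for row in rows if not should_abandon(row)]

for row in kept:
    print(row["Title"], row["Price"])
```

With a strict "less than" comparison, the row priced at exactly 200 is kept in this sketch; using `<=` in `should_abandon` would drop it as well.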
Article in Spanish: Cómo verificar los datos extraídos en Octoparse
You can also read web scraping articles on the official website
Author: Eric
Editor: Yanni