Refine extracted data (replace content, add a prefix, ...)
During your web scraping project, you may want to clean the data fields while the data is being scraped. Octoparse offers 9 data cleaning options for turning the extracted data into the format you need.
When should I refine the extracted data?
If you need a certain field in a specific format, you can use the "Clean data" function to refine that field within Octoparse. Octoparse applies the refinement during the scraping process, so there is no need to re-format the field after exporting the data to an Excel file.
How to refine the extracted data in Octoparse?
To access these features in Octoparse, follow the 4 steps below:
1. Select the data field to refine
2. Click on the "..." icon and select "Clean data"
3. Click "Add step"
4. Select an operation to re-format your data
Tips! In programming, a "string" is a sequence of characters such as letters, numerals, symbols, and punctuation marks. For example, " " (space) is a string; "Octoparse" is a string; and "Hello 2 *% World!" is also a string. A string can also contain no characters at all, in which case it is called an empty string. Replacing a word with an empty string is the same as deleting the word. You will see the word "string" in many of the function descriptions for Octoparse's data reformat options. It means that the option can handle any kind of characters in the extracted data, such as letters, words, sentences, numbers, spaces, symbols, and punctuation marks.
9 Data reformat options
1. Replace
Function: Replace a specific string (or strings) in the extracted data with the new string that you want.
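To illustrate what a plain Replace step does to a field, here is a minimal Python sketch. It is not Octoparse's own code; the helper name replace_text and the sample price value are made up for this example, and it replaces every occurrence of the string, which may or may not match Octoparse's behavior in every case.

```python
def replace_text(value: str, old: str, new: str) -> str:
    """Replace every occurrence of `old` in `value` with `new`."""
    return value.replace(old, new)


# Hypothetical extracted field
price = "Price: USD 1,299.00"

# Swap one string for another
cleaned = replace_text(price, "USD ", "$")

# Replacing with an empty string deletes the matched text
no_label = replace_text(cleaned, "Price: ", "")

print(cleaned)
print(no_label)
```

The second call shows the point made in the tip above: replacing with an empty string is the same as deleting.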
2. Replace with regular expression
Function: Use a regular expression to find the matching string (or strings) in the extracted data and replace them with the string that you want.
You can learn more about regular expressions at W3Schools.
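As a rough illustration of regular-expression replacement, the sketch below uses Python's re module. The sample phone value is hypothetical, and Python's regex syntax is only assumed to be close enough to the syntax Octoparse accepts for these simple patterns.

```python
import re

# Hypothetical extracted field with inconsistent formatting
raw = "Tel:  (555) 010-4477  ext. 12"

# Remove every character that is not a digit
digits_only = re.sub(r"\D", "", raw)

# Collapse each run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", raw)

print(digits_only)
print(collapsed)
```

A pattern such as \D or \s+ covers many different strings at once, which is what makes this option more flexible than a plain Replace.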
3. Match with regular expression
Function: Use a regular expression to pick out the matching string (or strings) from the extracted data and keep only those.
You can learn more about regular expressions at W3Schools.
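The idea of matching (keeping only the part that fits the pattern) can be sketched in Python as follows. The sample text and the pattern are assumptions made for this example, not values taken from Octoparse.

```python
import re

# Hypothetical extracted field
raw = "Rated 4.5 out of 5 stars (1,204 reviews)"

# Pattern for an integer or a decimal number
number_pattern = r"\d+(?:\.\d+)?"

# Keep only the first match; fall back to an empty string if nothing matches
match = re.search(number_pattern, raw)
rating = match.group(0) if match else ""

# Collect every match instead of just the first one
all_numbers = re.findall(number_pattern, raw)

print(rating)
print(all_numbers)
```

Note that "1,204" is split into two matches here because the pattern does not allow a comma inside a number.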
4. Trim spaces
Function: Remove unwanted spaces from the start and/or the end of the extracted data.
If you want to delete spaces in the middle of the data, you can use Replace or Replace with regular expression.
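The difference between trimming and removing inner spaces can be seen in this small Python sketch. The sample value is hypothetical, and Python's strip functions also remove tabs and line breaks, which is an assumption that may go beyond what Octoparse trims.

```python
# Hypothetical extracted field with leading and trailing whitespace
raw = "  \n  Octoparse Data  Tool \t "

both = raw.strip()         # trim the start and the end
start_only = raw.lstrip()  # trim the start only
end_only = raw.rstrip()    # trim the end only

# Trimming leaves inner spaces alone; use a replace step to remove them
no_spaces = both.replace(" ", "")

print(repr(both))
print(repr(no_spaces))
```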
5. Add a prefix
Function: Add a string to the beginning of the extracted data.
6. Add a suffix
Function: Add a string to the end of the extracted data.
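Adding a prefix or a suffix is simple string concatenation. The Python sketch below shows both; the helper names and the example.com URL are hypothetical and only serve as an illustration.

```python
def add_prefix(value: str, prefix: str) -> str:
    """Put `prefix` in front of the extracted value."""
    return prefix + value


def add_suffix(value: str, suffix: str) -> str:
    """Put `suffix` after the extracted value."""
    return value + suffix


# A common use: turn a relative link into a full URL
full_url = add_prefix("/products/12345", "https://www.example.com")

# Another use: attach a unit to a bare number
price = add_suffix("19.99", " USD")

print(full_url)
print(price)
```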
7. Reformat extracted date/time
Function: Convert the extracted date/time into one of the built-in formats, or into your own customized format.
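Reformatting a date comes down to parsing it with one format and writing it out with another. Here is a minimal Python sketch of that idea; the helper name reformat_date, the sample date, and the format codes are Python conventions chosen for this example and are not Octoparse's own format syntax.

```python
from datetime import datetime


def reformat_date(value: str, input_format: str, output_format: str) -> str:
    """Parse `value` using `input_format`, then write it out in `output_format`."""
    parsed = datetime.strptime(value, input_format)
    return parsed.strftime(output_format)


# Hypothetical extracted value in month/day/year order
iso_style = reformat_date("03/15/2024 14:30", "%m/%d/%Y %H:%M", "%Y-%m-%d %H:%M:%S")
date_only = reformat_date("03/15/2024 14:30", "%m/%d/%Y %H:%M", "%d.%m.%Y")

print(iso_style)
print(date_only)
```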
8. Timestamp conversion
Function: Convert a Unix timestamp into your own customized format.
A Unix timestamp is a number that represents a specific date and time, counted as the seconds elapsed since 00:00:00 UTC on January 1, 1970. This function converts Unix time to a format that is easy to read.
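To show what such a conversion looks like, here is a short Python sketch. The helper name convert_timestamp and the default output format are assumptions for this example, and the result is given in UTC; Octoparse may apply a different time zone.

```python
from datetime import datetime, timezone


def convert_timestamp(seconds: int, output_format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Convert a Unix timestamp in seconds to a readable UTC date/time string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime(output_format)


epoch_start = convert_timestamp(0)
sample = convert_timestamp(1700000000)
sample_date = convert_timestamp(1700000000, "%d/%m/%Y")

print(epoch_start)
print(sample)
print(sample_date)
```

If a website gives the timestamp in milliseconds (13 digits instead of 10), it would need to be divided by 1000 before a conversion like this one.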
9. HTML transcoding
Tips! All the steps you have added can be edited or deleted here by clicking the corresponding icons.
Octoparse RegEx Tool
Octoparse also offers a RegEx Tool that auto-generates the regular expression you need. Let's have a quick look at how to use the RegEx Tool to generate and apply a regular expression. In this example, we want to pick out the star-rating number from the extracted outer HTML.
· Click "Try RegEx Tool"
· Enter the match criteria: start with "src="", end with " " "
· Click "generate" to produce the regular expression
· Click "Match" to pick up the matched strings
· Click "Apply"
· Click "Confirm" to save the settings
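The "start with / end with" idea behind the steps above can be sketched in Python as follows. This is only an illustration and not the expression the RegEx Tool actually generates; the helper names, the sample HTML, and the start and end markers are all hypothetical.

```python
import re


def build_pattern(start: str, end: str) -> str:
    """Build a pattern that captures the shortest text between `start` and `end`."""
    # re.escape makes sure characters such as quotes or dots are taken literally
    return re.escape(start) + "(.*?)" + re.escape(end)


def match_between(text: str, start: str, end: str) -> str:
    """Return the text between the two markers, or an empty string if not found."""
    found = re.search(build_pattern(start, end), text)
    return found.group(1) if found else ""


# Hypothetical outer HTML of a star-rating element
outer_html = '<i class="star-rating" title="4.5 out of 5 stars"></i>'

rating = match_between(outer_html, 'title="', " out of")
missing = match_between(outer_html, "data-score=", " ")

print(rating)
```

The lazy (.*?) stops at the first occurrence of the end marker, so the match does not run on to a later part of the HTML.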
Click the link here for more information about how to use the RegEx Tool.
If you have questions, you are welcome to submit a request here. Our support team will get back to you later.
Author: Joy
Editor: Yina