Lesson 3: Refine your data
Rename/move/duplicate/delete a field
Once the data has been extracted and appears in Data Preview, you can review the data set and start organizing it. Typical refinements include renaming fields, reordering columns, duplicating fields, and deleting fields that your project does not need.
To rename a field, click the pencil icon next to the field name, then type in the new name directly. Note that you should only use numbers, letters, and "_" for field names.
To move a field, place your cursor at the front of the field and, when the drag icon shows up, drag and drop the field to the right spot.
To duplicate a field, click on the show more icon and select "Copy". The selected field will be duplicated automatically.
To delete a field, click on the show more icon and select "Delete".
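The naming rule from the rename step (numbers, letters, and "_" only) can be checked ahead of time if you plan your field names outside Octoparse. The following is a minimal sketch, not part of Octoparse: the function names are hypothetical, and it assumes that "letters" means ASCII letters.

```python
import re

# Assumption: allowed characters are ASCII letters, digits, and underscores.
FIELD_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_field_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, and underscores."""
    return FIELD_NAME.fullmatch(name) is not None


def suggest_field_name(name: str) -> str:
    """Replace each run of disallowed characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name.strip()).strip("_")


print(is_valid_field_name("product_price_2"))   # True
print(is_valid_field_name("product price"))     # False
print(suggest_field_name("Product Price ($)"))  # Product_Price
```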
Clean data
Octoparse provides many ways to clean your data. For example, you can replace a text string, trim extra spaces, add a prefix or suffix, replace a string using a regular expression (RegEx), reformat a date/time, and more. You can apply one or more cleaning steps to any single data field until the data meets your requirements. Some of these steps require a regular expression, which the Octoparse RegEx tool can help you write.
In Data Preview, right-click the show more icon for the data field you'd like to clean and select "Clean data".
Click "Add step", and then select what you'd like to do with the data. You can keep working with the data by adding more steps until the data meets your requirements.
- Replace: replace the specific string(s) in the extracted data with the new string(s) that you want.
- Replace with Regular Expression: use a specific regular expression to replace the matched string(s) in the extracted data with the string(s) that you want.
- Match with Regular Expression: use a specific regular expression to pick up the matched string(s) from the extracted data.
- Trim spaces: remove the unwanted space(s) from the start and/or the end of the extracted data.
- Add a prefix: add a string to the front of the extracted data.
- Add a suffix: add a string to the end of the extracted data.
- Reformat extracted date/time: convert the extracted date/time into one of the 14 built-in formats, or into your own customized format.
- HTML: convert certain HTML entities into plain text automatically. For example, decode "&amp;gt;" into ">" and "&amp;nbsp;" into a space.
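To make the effect of each cleaning step concrete, here is a rough Python equivalent of the steps listed above. It is only an illustration of what each step does to a string, not how Octoparse implements them. All function names are hypothetical, the regex match is assumed to return the first match only, and the date formats use Python's `strptime`/`strftime` codes rather than Octoparse's built-in formats.

```python
import html
import re
from datetime import datetime


def replace(value: str, old: str, new: str) -> str:
    """Replace: swap every occurrence of a literal string."""
    return value.replace(old, new)


def replace_with_regex(value: str, pattern: str, new: str) -> str:
    """Replace with Regular Expression: swap every regex match."""
    return re.sub(pattern, new, value)


def match_with_regex(value: str, pattern: str) -> str:
    """Match with Regular Expression: keep the first match (assumption)."""
    found = re.search(pattern, value)
    return found.group(0) if found else ""


def trim_spaces(value: str) -> str:
    """Trim spaces: remove whitespace from both ends."""
    return value.strip()


def add_prefix(value: str, prefix: str) -> str:
    """Add a prefix: put a string in front of the data."""
    return prefix + value


def add_suffix(value: str, suffix: str) -> str:
    """Add a suffix: put a string after the data."""
    return value + suffix


def reformat_datetime(value: str, source_format: str, target_format: str) -> str:
    """Reformat date/time: parse with one format, write with another."""
    return datetime.strptime(value, source_format).strftime(target_format)


def decode_html(value: str) -> str:
    """HTML: decode entities, turning non-breaking spaces into plain spaces."""
    return html.unescape(value).replace("\u00a0", " ")


# Steps can be chained, just like adding several steps to one field.
raw = "  Price: 1,299.00 USD  "
price = trim_spaces(raw)                          # "Price: 1,299.00 USD"
price = match_with_regex(price, r"[\d,]+\.\d+")   # "1,299.00"
price = replace(price, ",", "")                   # "1299.00"
price = add_prefix(price, "$")                    # "$1299.00"
print(price)

print(reformat_datetime("05/03/2021", "%d/%m/%Y", "%Y-%m-%d"))  # 2021-03-05
print(decode_html("5 &gt; 3&nbsp;items"))                       # 5 > 3 items
```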
Capture HTML code
When auto-detect is used to capture data from a web page, Octoparse automatically extracts the text and the URL of the elements that you've selected. You can also customize a data field to tell Octoparse to extract the element's HTML code.
In Data Preview, click the show more icon and select "Customize field".
From the "Customize field" setting panel, select what you'd like to extract.
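The difference between an element's text, its URL, and its HTML code can be illustrated with a small sketch. This is not Octoparse code: the sample element, the function name, and the result keys are hypothetical, and Python's XML parser is used only because the sample snippet is well formed (real web pages often are not).

```python
import xml.etree.ElementTree as ET


def describe_element(snippet: str) -> dict:
    """Return the text, URL, inner HTML, and outer HTML of one element."""
    element = ET.fromstring(snippet)
    # Inner HTML: the element's own leading text plus each serialized child.
    inner = (element.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in element
    )
    return {
        "text": "".join(element.itertext()),  # visible text only
        "url": element.get("href"),           # link target, if any
        "inner_html": inner,
        "outer_html": ET.tostring(element, encoding="unicode"),
    }


sample = '<a href="https://example.com/item/1">Blue <b>Widget</b></a>'
fields = describe_element(sample)
print(fields["text"])        # Blue Widget
print(fields["url"])         # https://example.com/item/1
print(fields["inner_html"])  # Blue <b>Widget</b>
```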
Extract page-level data and date & time
Octoparse offers a number of pre-defined data fields that you can use to capture page-level data, the current date & time, or any fixed value conveniently.
- Current date & time: the date and time at which the data is extracted from the web page
- Page-level data: page URL, page title, meta keyword, meta description, and HTML source code
- Fixed value: any fixed value you define
Click on the + sign at the upper-right corner of Data Preview, then select the pre-defined data fields that you'd like to add to the data set.
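As a rough picture of what these pre-defined fields contain, the sketch below pulls the same kinds of values out of a page's HTML source with Python's standard library. It is an assumption-laden illustration rather than Octoparse's implementation: the class name, function name, and dictionary keys are all hypothetical.

```python
from datetime import datetime
from html.parser import HTMLParser


class PageLevelParser(HTMLParser):
    """Collect the page title and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") in ("keywords", "description"):
            self.meta[attributes["name"]] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_level_fields(page_url, html_source, fixed_value, now=None):
    """Build one row of page-level, date & time, and fixed-value fields."""
    parser = PageLevelParser()
    parser.feed(html_source)
    return {
        "current_time": (now or datetime.now()).isoformat(timespec="seconds"),
        "page_url": page_url,
        "page_title": parser.title.strip(),
        "meta_keywords": parser.meta.get("keywords", ""),
        "meta_description": parser.meta.get("description", ""),
        "html_source": html_source,
        "fixed_value": fixed_value,
    }


page = (
    "<html><head><title>Shop</title>"
    '<meta name="keywords" content="shoes, boots">'
    '<meta name="description" content="Demo page">'
    "</head><body></body></html>"
)
row = page_level_fields("https://example.com/", page, "batch-1")
print(row["page_title"], "|", row["meta_keywords"], "|", row["fixed_value"])
```

Because the fixed value and the extraction time are the same for every row of a run, fields like these are handy for tagging a data set with its batch or source.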
So far, we have gone through all the steps for building and refining the workflow. It's time to start a test run! >> Lesson 4: Test-run the task
Article in Spanish: Lección 3: Refina tus datos
You can also read web scraping articles on the official website.