What is "Extract Data"?
"Extract Data" is a required step when you set up a task: it holds all the data fields the task collects. Within this step, you can clean data, modify XPath, and reorder, copy, or delete data fields. Without this step, your task cannot be executed.
How do you add "Extract Data" to the workflow?
There are two ways to generate an "Extract Data" action.
1. Click on an element on the web page to auto-generate one (the most common way)
To capture data from the page, first click on the element. Then choose one of the "Extract..." options on the Tips panel, and an "Extract Data" action appears in the workflow.
The options can be:
- Extract text/URL of the selected element
- Extract the inner/outer HTML of the selected element
- Extract data
- Extract data in the loop
2. Add from the workflow
When you hover over the workflow, an icon appears. Click the icon to display the drop-down options, then choose "Extract Data" to add this step to the workflow.
To adjust more settings related to the data fields, click the Extract Data action in the workflow and find the settings panel at the bottom. You can see there are 3 main features.
In the "General" section, you will find "Extract data in the loop" when the Extract Data action is inside a Loop Item.
In the "Options" section, you will find "Wait before action" as well as "Trigger".
1. Extract data in the loop
This option only shows when the Extract Data action is inside a Loop Item. It is normally ticked automatically when you extract data directly from a listing page instead of clicking through to a detail page to pull out data.
Here is an example of a listing page.
To learn more about extracting data from a listing result page, check this guide: Scrape a list of data.
2. Trigger
A Trigger is used when you want to scrape data only under certain conditions.
For example, if the Username field is not blank and you want to dump this line of data, you can use Trigger to achieve it. Check out more details about Trigger.
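As a rough illustration of the idea behind a Trigger, the sketch below applies a condition to each extracted row and decides whether to keep it. This is not Octoparse's implementation: the function name `keep_row`, the sample rows, and the reading of "dump" as "discard the row" are all assumptions made for the example.

```python
# Conceptual sketch only: a Trigger is modeled as a condition plus an action.
# Assumption: "dump" means the row is discarded when the condition is met.

def keep_row(row):
    """Return True when the row should stay in the output."""
    # Hypothetical condition: the Username field is not blank.
    username_not_blank = row.get("Username", "").strip() != ""
    # Hypothetical action: discard the row when the condition holds.
    return not username_not_blank

rows = [
    {"Username": "alice", "Comment": "hello"},
    {"Username": "", "Comment": "anonymous post"},
]

kept = [row for row in rows if keep_row(row)]
print(kept)
```

In the application itself, the condition and action are configured in the Trigger settings rather than written as code.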
3. Before action is performed (add wait time)
This option adds a wait time before the action is executed. Websites take different amounts of time to load their data, so you sometimes need to add a wait time or a waiting condition to give the web page more time to load.
You can check this guide for different use cases: Wait before action.
4. Define data fields
You can find data field details in the Data Preview section. There you can rename (double-click on the field name), delete, move, or clean your data field(s), and add field(s) such as extraction time or current page URL from a predefined list.
You can also customize the XPath of a certain data field here if it is not located correctly in the output.
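To make the role of an XPath more concrete, here is a minimal sketch of how an XPath expression selects the element a data field is read from. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath, and the sample markup, class names, and field names are invented for illustration; the XPath you enter in the tool is evaluated against the real web page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup standing in for a listing page.
page = """
<ul>
  <li class="item"><span class="title">Blue mug</span><span class="price">$8</span></li>
  <li class="item"><span class="title">Red mug</span><span class="price">$9</span></li>
</ul>
"""

root = ET.fromstring(page)

# One XPath per data field: each expression locates the elements for that field.
title_xpath = ".//li[@class='item']/span[@class='title']"
price_xpath = ".//li[@class='item']/span[@class='price']"

titles = [el.text for el in root.findall(title_xpath)]
prices = [el.text for el in root.findall(price_xpath)]

print(list(zip(titles, prices)))
```

If a field comes out empty or holds the wrong value, the usual cause is that its XPath matches no element or a different element than intended, which is the situation that customizing the XPath addresses.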
To better understand what those icons stand for, check the following details:
To add custom data fields from a predefined list
To import data field(s) from a data file [Octoparse extract config (*.oec)]
To export data field(s) to a data file [Octoparse extract config (*.oec)]
Horizontal & Vertical views
You can switch to the Vertical view to modify the XPath of all the fields easily, or to act on multiple fields at once by ticking the box before each field.
Remove Duplicates from the extracted data
More options: to make more modifications to your data
- Customize field: to select what information (text, HTML, an attribute value, or URL) you need to scrape from the page element.
- Customize XPath
- Clean data: to transform the data output into the form you want (e.g. add a prefix or suffix, reformat the time, replace text).
- Merge multiple rows of data into one: to combine the same field of data from other loop items.
- Delete: to remove the current data field.
- Copy: to duplicate a certain data field.
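The sketch below shows, in plain Python, what three of the operations above amount to conceptually: cleaning a value (replace, prefix, suffix), removing duplicate rows, and merging the same field from several rows into one value. The function names, parameters, and sample data are hypothetical and do not reflect how Octoparse implements these options.

```python
def clean_field(value, prefix="", suffix="", replacements=None):
    """Apply text replacements, trim whitespace, then add a prefix and suffix."""
    for old, new in (replacements or {}).items():
        value = value.replace(old, new)
    return f"{prefix}{value.strip()}{suffix}"


def remove_duplicates(rows):
    """Drop rows identical to an earlier row, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def merge_rows(rows, field, separator=" "):
    """Combine the non-empty values of one field across rows into one string."""
    return separator.join(row[field] for row in rows if row.get(field))


# Invented sample data.
rows = [{"Tag": "ceramic"}, {"Tag": "ceramic"}, {"Tag": "dishwasher safe"}]

unique = remove_duplicates(rows)
merged = merge_rows(unique, "Tag", separator=", ")
cleaned = clean_field(" 8.00 USD ", prefix="$", replacements={"USD": ""})

print(unique)
print(merged)
print(cleaned)
```

Keeping each operation as a separate small function mirrors how the options are applied one field at a time in the settings panel.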