Content on web pages is usually organized in some kind of pattern. One of the most common patterns is a list. Here are a few examples of content laid out as a list.
This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...
Our goal is to extract the data into Excel like this:
Now, let's explore different ways to get this done in Octoparse:
To follow along, you can use this link: https://www.octoparse.com/blog
1. Extract a list with Auto-detect
Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will then detect the data on the page, and you can click "Create workflow" to generate the workflow.
2. Extract a list manually
If Auto-detect fails to detect the list, or if you are building a task without Auto-detect, you can always extract the list manually.
- Load the web page in Octoparse and hover your cursor over the first item until the entire section gets highlighted in blue
- Click on the second item, and you will find that all the similar items on the page have been selected
- Choose "Extract text of the selected elements" and Octoparse will create a Loop Item automatically
You will notice that the first item is now highlighted in red. You can select the information like title, date and keyword from the highlighted area.
- Select the title and choose "Extract the text of the element"
- Repeat the steps to get other information
- Double-click on the field name to rename it if needed
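For readers who think in code, the Loop Item and its fields can be pictured as a loop over elements that share the same structure, with the same sub-elements read from each one. The sketch below is only a conceptual illustration in Python, not how Octoparse works internally. The markup, the class names (`post`, `title`, `date`, `keyword`) and the function `extract_list` are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a list page.
# The class names are assumptions, not the real page structure.
SAMPLE_PAGE = """
<div class="posts">
  <div class="post">
    <h2 class="title">First post</h2>
    <span class="date">2021-01-05</span>
    <span class="keyword">scraping</span>
  </div>
  <div class="post">
    <h2 class="title">Second post</h2>
    <span class="date">2021-02-10</span>
    <span class="keyword">data</span>
  </div>
</div>
"""


def extract_list(markup):
    """Return one dict per list item, with the same fields for each."""
    root = ET.fromstring(markup.strip())
    rows = []
    # The "loop": visit every element that shares the item structure.
    for item in root.findall(".//div[@class='post']"):
        # The "fields": read the same sub-elements from each item.
        rows.append({
            "title": item.findtext("h2[@class='title']"),
            "date": item.findtext("span[@class='date']"),
            "keyword": item.findtext("span[@class='keyword']"),
        })
    return rows


rows = extract_list(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the extracted data, and each key corresponds to one field name in the task.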
- Hover your cursor over the first item until the entire section gets highlighted in blue
You will notice that Octoparse detects sub-elements from the section and highlights them in red.
- Choose "Select sub-elements"
- Choose "Select all"
- Select "Extract data". A loop item will be generated automatically to scrape the list of items on the page.
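The "Select sub-elements" and "Select all" steps can be pictured as reading the structure of the first item and then applying that same structure to every item in the list. The following Python sketch illustrates that idea only; it makes no claim about Octoparse's actual detection logic. The markup, the class names and the helper names (`detect_fields`, `extract_all`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the second item has no keyword on purpose.
SAMPLE_LIST = """
<ul class="articles">
  <li class="article">
    <h3 class="title">Alpha</h3>
    <span class="date">2021-03-01</span>
    <span class="keyword">lists</span>
  </li>
  <li class="article">
    <h3 class="title">Beta</h3>
    <span class="date">2021-03-02</span>
  </li>
</ul>
"""


def detect_fields(first_item):
    """Describe each direct child of the first item as (tag, class)."""
    return [(child.tag, child.get("class")) for child in first_item]


def extract_all(markup, item_class):
    """Apply the sub-elements found in the first item to every item."""
    root = ET.fromstring(markup.strip())
    items = root.findall(f".//*[@class='{item_class}']")
    if not items:
        return []
    fields = detect_fields(items[0])
    rows = []
    for item in items:
        row = {}
        for tag, cls in fields:
            # Match on tag and class when a class is present.
            path = tag if cls is None else f"{tag}[@class='{cls}']"
            # A sub-element missing from an item yields None.
            row[cls or tag] = item.findtext(path)
        rows.append(row)
    return rows


records = extract_all(SAMPLE_LIST, "article")
for record in records:
    print(record)
```

Because the fields come from the first item, an item that lacks one of those sub-elements simply gets an empty value for that field in this sketch.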
TIP: If you want to edit or delete the extracted data fields, you can click "Extract Data" and modify the fields on the Data Preview panel.