Content on web pages is usually organized in some kind of pattern, and one of the most common patterns is a list. Here are a few examples of content laid out as a list.

19.png

Scraping a list is quick and easy with Octoparse's auto-detect feature. Octoparse can detect the items in a list and generate the task workflow automatically. Let's see how it is done with an example.

This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...

20.png

Our goal is to extract the data into Excel like this:

21.png
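To make the target layout concrete, here is a minimal Python sketch of the same idea outside Octoparse: one row per list item, one column per field. The field names (title, date, keyword) come from the example page; the sample values and the helper name rows_to_csv are hypothetical, and CSV is used here as a stand-in for a file that Excel can open.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: each dict is one item from the list.
rows = [
    {"title": "First post", "date": "2021-01-05", "keyword": "scraping"},
    {"title": "Second post", "date": "2021-02-10", "keyword": "data"},
]

csv_text = rows_to_csv(rows, ["title", "date", "keyword"])
print(csv_text)
```

Writing csv_text to a file with a .csv extension would give a table that spreadsheet programs can open.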

Now, let's explore different ways to get this done in Octoparse:

  1. Extract a list with Auto-detect

  2. Extract a list manually

To follow along, use this example URL: https://www.octoparse.com/blog


1. Extract a list with Auto-detect

Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will detect the data on the page, and you can then click "Create workflow" to generate the workflow.

auto.gif

2. Extract a list manually

If Auto-detect fails to detect the list, or if you are building a task without Auto-detect, you can always extract the list manually.

Method 1:

  • Load the web page in Octoparse and hover your cursor over the first item until the entire section gets highlighted in blue

  • Click on the second item. Octoparse will then select all the similar items on the page.

  • Choose "Extract text of the selected elements" and Octoparse will create a Loop Item automatically

manually.gif
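Why do two clicks select the whole list? One plausible way to think about it, sketched below, is that two example elements are enough to infer a shared pattern. This is only an illustration of that idea, not Octoparse's actual algorithm; the function name generalize_paths and the sample XPath strings are hypothetical.

```python
import re


def generalize_paths(path_a, path_b):
    """Return one path matching both inputs by dropping the index where they differ."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depths")

    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            merged.append(a)
            continue
        # Steps differ: strip a trailing positional index such as [1] or [2].
        tag_a = re.sub(r"\[\d+\]$", "", a)
        tag_b = re.sub(r"\[\d+\]$", "", b)
        if tag_a != tag_b:
            raise ValueError("paths differ in more than an index")
        merged.append(tag_a)
    return "/" + "/".join(merged)


# Hypothetical paths for the first and second clicked items.
pattern = generalize_paths(
    "/html/body/div[2]/ul/li[1]",
    "/html/body/div[2]/ul/li[2]",
)
print(pattern)
```

The resulting path has no index on the last step, so it matches every sibling item rather than only the two that were clicked.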

You will notice that the first item is now highlighted in red. You can now select fields such as the title, date and keyword from the highlighted area.

  • Select the title and choose "Extract the text of the element"

  • Repeat the steps to get other information

  • Double click on the field name to rename it if needed

select_data.gif

Method 2:

  • Hover your cursor over the first item until the entire section gets highlighted in blue

You will notice that Octoparse detects sub-elements from the section and highlights them in red.

  • Choose "Select sub-elements"

  • Choose "Select all"

  • Select "Extract data". A loop item will be generated automatically to scrape the list of items on the page.

119.gif
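Conceptually, a loop item visits each entry in the list and pulls the same sub-elements from every one. The Python sketch below mirrors that idea with the standard library only. It is an analogy under stated assumptions, not how Octoparse is implemented: the markup, the tag names and the SUB_ELEMENTS mapping are hypothetical, and the sample must be well-formed XML, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: two items sharing one structure.
SAMPLE = """<div>
  <article><h2>First post</h2><time>2021-01-05</time><a>scraping</a></article>
  <article><h2>Second post</h2><time>2021-02-10</time></article>
</div>"""

# Hypothetical mapping from field name to a path relative to one item.
SUB_ELEMENTS = {"title": "h2", "date": "time", "keyword": "a"}


def extract_list(markup, item_path, sub_elements):
    """Loop over every item and extract the same sub-elements from each."""
    root = ET.fromstring(markup)
    records = []
    for item in root.findall(item_path):
        record = {}
        for name, path in sub_elements.items():
            node = item.find(path)
            # A missing sub-element yields an empty field instead of an error.
            record[name] = node.text.strip() if node is not None and node.text else ""
        records.append(record)
    return records


records = extract_list(SAMPLE, "article", SUB_ELEMENTS)
for record in records:
    print(record)
```

The second item has no keyword, so its keyword field is left empty while the other fields are still filled in.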

Tip: If you want to edit or delete the extracted data fields, you can click "Extract Data" and modify the fields on the Data Preview panel.

121313.png