Content on web pages is usually organized in some kind of pattern, and one of the most common patterns is a list. Here is an example of content laid out as a list.
Example URL: https://www.octoparse.com/blog
This web page consists of items that share the same structure. Each item contains a title, time, keyword, article...
Our goal is to extract this data into an Excel spreadsheet, with one row per list item and one column per field.
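To make the target shape concrete, here is a minimal Python sketch of list items flattened into rows. It is only an illustration: the column names follow the fields mentioned above, the values are hypothetical placeholders rather than real content from the blog, and the helper name to_csv is made up for this example. It writes CSV text, which Excel can open.

```python
import csv
import io

# Column names follow the fields named in the article.
FIELDS = ["title", "time", "keyword", "article"]

# Hypothetical placeholder values; the real blog's content will differ.
items = [
    {"title": "Example post A", "time": "2024-01-01",
     "keyword": "scraping", "article": "First summary"},
    {"title": "Example post B", "time": "2024-01-02",
     "keyword": "data", "article": "Second summary"},
]

def to_csv(rows, fields):
    """Serialize a list of dicts into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(items, FIELDS)
print(csv_text)
```

Each list item becomes one row and each sub-element becomes one column, which is the layout the rest of this tutorial works toward.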
Now, let's explore the different ways to get this done in Octoparse.
Extract list with Auto-detect
Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will then detect the data on the page, and you can click "Create workflow" to generate the workflow.
Extract list manually
If Auto-detect fails to detect the list for some reason, or if you are building a task without Auto-detect, you can always extract the list manually.
1) Load the web page in Octoparse, hover your cursor over one of the list items until the entire section gets highlighted in blue, then click on it.
Please make sure that all the sub-elements you want to extract are included in this highlighted section.
2) Once you've selected the item, notice that the sub-elements are highlighted in red, which means Octoparse has successfully identified them. Click "Select sub-elements".
3) Then, click "Select all" to select all similar elements detected on the page.
4) Next, select "Extract data". A loop item will be generated automatically for scraping list items on the page.
5) If you want to edit the extracted data fields, you can click the settings icon for the Extract Data action.
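Conceptually, the manual steps above amount to looping over every list item and reading the same sub-elements from each one. The Python sketch below shows that idea with the standard library only. It is a rough analogue under stated assumptions, not a description of how Octoparse works internally: the sample markup, the class names ("item", "title", "time") and the ListExtractor class are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup: tag and class names are invented for illustration.
SAMPLE_HTML = """
<ul>
  <li class="item"><h2 class="title">Post A</h2><span class="time">Jan 1</span></li>
  <li class="item"><h2 class="title">Post B</h2><span class="time">Jan 2</span></li>
</ul>
"""

class ListExtractor(HTMLParser):
    """Collects one dict per <li class="item">, keyed by sub-element class."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current = None  # dict for the list item being read, if any
        self.field = None    # name of the sub-element being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "item":
            # A new list item starts: begin a fresh row.
            self.current = {}
        elif self.current is not None and css_class in ("title", "time"):
            # Inside an item, a known sub-element starts.
            self.field = css_class

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "li" and self.current is not None:
            # The list item is complete: store its row.
            self.rows.append(self.current)
            self.current = None
        self.field = None

parser = ListExtractor()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

The outer match on the list item plays the role of the loop item, and the inner match on the sub-element classes plays the role of the extracted data fields.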
If you need any help with task configuration or data collection, submit a ticket to our support team! We'll get back to you within 24 hours.
Article in Spanish: Extraer datos de lista
You can also read web scraping articles on the official website