You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

Once you have completed the introductory lessons, you should have a solid grasp of the fundamentals of Octoparse and should be able to successfully create a few tasks.

In this article, we will delve deeper into the workings of Octoparse and explain how it can extract data from any webpage. We will also discuss how different actions can be combined in a workflow to achieve the desired results. A strong understanding of these basic principles is crucial for creating more advanced and effective scraping tasks.

1. How Octoparse works in extracting data from websites

1.1 Octoparse simulates human browsing behavior

Octoparse works by simulating the actions of a person browsing in its built-in browser, such as opening web pages, clicking page elements, clicking the next page button, and scrolling down the page. The simulated process mirrors how you would access web data in any regular browser.

1.2 Octoparse automates the process of data extraction in an established workflow

When you set up a scraping task in Octoparse, you are configuring a scraping workflow: a sequence of commands (actions) for Octoparse to execute. Octoparse generates this workflow automatically while you use the built-in browser. In some cases the automatically generated workflow needs no changes; in others you will need to set up or troubleshoot parts of it manually when it does not work as intended. Either way, a firm understanding of how the workflow operates helps you get more accurate scraping results.

2. Understanding workflow

An Octoparse workflow is essentially a set of configurations arranged in a specific order to scrape the target web data. The steps of the workflow should always be read from top to bottom, and from inside to outside for nested actions.

Let's take a look at some examples.

Example 1 - Extract from a list of elements to get data

The workflow for this example is executed in the following order:

Step 1: Go to Web Page, to open the target web page
Step 2: Pagination, to locate the next page button on the page (you are currently on Page 1)
Step 3: Loop Item, to locate the list of elements on the page
Step 4: Extract Data, to extract the needed data from the list of the elements
Step 5: Click to Paginate, to click on the next page button to go to Page 2
Step 6: Continue to extract data from the loop, and click the next page button until Octoparse gets to the last page
Step 7: No next page button is located on the last page and the workflow ends
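
The order of these steps can be sketched in Python. This is only an illustration of the control flow, not how Octoparse is implemented: the `PAGES` dictionary is a made-up in-memory stand-in for a website, and `run_list_workflow` is a hypothetical name chosen for this sketch.

```python
# Hypothetical stand-in for a paginated site: each page has a list of
# items and the number of the next page (None when there is no next button).
PAGES = {
    1: {"items": ["A", "B"], "next": 2},
    2: {"items": ["C", "D"], "next": 3},
    3: {"items": ["E"], "next": None},
}


def run_list_workflow(pages, start=1):
    """Walk the pages in the order described in Example 1."""
    rows = []
    current = start                      # Step 1: Go to Web Page
    while True:                          # Step 2: Pagination (outer loop)
        page = pages[current]
        for item in page["items"]:       # Step 3: Loop Item (inner loop)
            # Step 4: Extract Data from the current list element
            rows.append({"page": current, "title": item})
        if page["next"] is None:         # Step 7: no next page button, stop
            break
        current = page["next"]           # Step 5: Click to Paginate
    return rows


if __name__ == "__main__":
    for row in run_list_workflow(PAGES):
        print(row)
```

Note how the inner loop (Loop Item and Extract Data) finishes for the whole list before the outer loop clicks the next page button, which matches reading nested actions from inside to outside.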

Example 2 - Click a list of elements on the web page and extract data from the detail page

The workflow for this example is executed in the following order:

Step 1: Go to Web Page, to open the target web page
Step 2: Pagination, to locate the next page button on the page (you are currently on Page 1)
Step 3: Loop Item, to locate the list of elements on the page
Step 4: Click Item, to click the elements from the Loop Item and go to the detail page
Step 5: Extract Data, to extract the needed data from the detail page
Step 6: Click to Paginate, to click on the next page button to go to Page 2
Step 7: Continue to click elements from the loop, extract data from the detail page and click the next page button until Octoparse gets to the last page
Step 8: No next page button is located on the last page and the workflow ends
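
The same idea extends to detail pages. The sketch below is a simplified model under stated assumptions, not Octoparse's internals: `LIST_PAGES`, `DETAIL_PAGES`, the URLs, and the field names are all invented sample data, and `run_detail_workflow` is a hypothetical function name.

```python
# Invented sample data: list pages hold links, detail pages hold the fields.
LIST_PAGES = {
    1: {"links": ["/item/1", "/item/2"], "next": 2},
    2: {"links": ["/item/3"], "next": None},
}
DETAIL_PAGES = {
    "/item/1": {"name": "Lamp", "price": "19.99"},
    "/item/2": {"name": "Desk", "price": "149.00"},
    "/item/3": {"name": "Chair", "price": "59.50"},
}


def run_detail_workflow(list_pages, detail_pages, start=1):
    """Walk the pages in the order described in Example 2."""
    rows = []
    current = start                          # Step 1: Go to Web Page
    while current is not None:               # Step 2: Pagination
        page = list_pages[current]
        for link in page["links"]:           # Step 3: Loop Item
            detail = detail_pages[link]      # Step 4: Click Item (open detail page)
            rows.append({"url": link, **detail})  # Step 5: Extract Data
        # Step 6: Click to Paginate; None means there is no next page
        # button, so the loop condition ends the workflow (Step 8).
        current = page["next"]
    return rows


if __name__ == "__main__":
    for row in run_detail_workflow(LIST_PAGES, DETAIL_PAGES):
        print(row)
```

The only structural difference from Example 1 is the extra click inside the loop: each list element is opened first, and the data is extracted from the detail page rather than from the list itself.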

Example 3 - Load more elements by clicking the Load More button and scrape data from the list of elements

The workflow for this example is executed in the following order:

Step 1: Go to Web Page, to open the target web page
Step 2: Pagination, to locate the Load More button on the page
Step 3: Click to Paginate, to click on the Load More button to load more elements on the page
Step 4: Continue to click on the Load More button until it disappears
Step 5: Loop Item, to locate the list of elements on the page
Step 6: Extract Data, to extract the target data from the list of the elements
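
Here the order is reversed compared with the first two examples: all the loading happens first, and extraction runs once at the end. A rough Python sketch of that ordering follows. `FeedPage` is a hypothetical stand-in for a page with a Load More button, and the `max_clicks` safety limit is an assumption added for this sketch, not a setting taken from this tutorial.

```python
class FeedPage:
    """Hypothetical page that reveals one more batch of items per click."""

    def __init__(self, batches):
        self._pending = [list(batch) for batch in batches]
        # The first batch is visible as soon as the page opens.
        self.visible = self._pending.pop(0) if self._pending else []

    def has_load_more(self):
        # The button disappears once nothing is left to load.
        return bool(self._pending)

    def click_load_more(self):
        self.visible.extend(self._pending.pop(0))


def run_load_more_workflow(page, max_clicks=100):
    """Walk the page in the order described in Example 3."""
    clicks = 0
    # Steps 2-4: keep clicking Load More until the button disappears
    # (max_clicks is an assumed guard against endless feeds).
    while page.has_load_more() and clicks < max_clicks:
        page.click_load_more()
        clicks += 1
    # Steps 5-6: loop over the fully loaded list and extract the data.
    rows = [{"title": title} for title in page.visible]
    return rows, clicks


if __name__ == "__main__":
    rows, clicks = run_load_more_workflow(FeedPage([["a", "b"], ["c"], ["d", "e"]]))
    print(clicks, rows)
```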

3. Test run the workflow

It is crucial (and always recommended!) to test each step of the workflow thoroughly before running the task. When you click a step in the workflow, Octoparse simulates that action in the built-in browser so you can confirm it behaves correctly and make any necessary modifications.

For instance, when you select the "Go to Web Page" step, Octoparse automatically loads the webpage in the built-in browser. You can check more details about testing the workflow here.

Tips:

There is no single fixed way to build a workflow. You can add any actions as long as they work logically together.
You can use multiple click actions or loop items to scrape data from pages on multiple levels, for example, the list page and the product page on directory websites.
You can easily drag an action to move it to the right spot in the workflow.

