When you build a scraping task in Octoparse, you will almost certainly use a Loop Item at some point. A Loop Item is most often used to capture a list of elements or to paginate through the pages of a website. In this article, I will explain exactly how a Loop Item works in Octoparse.

  1. What is a Loop Item

  2. Loop Item settings

  3. The 6 loop modes and how to use them

  4. How to create a Loop Item

  5. Loop Item troubleshooting


1. What is a Loop Item

A loop is a programming construct that repeats a set of instructions until a certain condition is met. The Loop Item in Octoparse works the same way. A Loop Item is usually built from more than one URL or element, with one or more actions added inside it. Once a Loop Item is created, Octoparse repeats the looped actions either a designated number of times or until they can no longer be repeated, for example when there is no next page to turn to because you've reached the last page.

Let's look at an example. Suppose we have a list of URLs to extract data from. First, we'll create a Loop Item using the list of URLs, then we'll insert a Go to Web Page action and an Extract Data action inside the Loop Item. The workflow would look like this:

[Image: workflow with a Loop Item containing a Go to Web Page action and an Extract Data action]

This workflow translates to a set of instructions telling Octoparse to take the first URL from the list, load the page with the Go to Web Page action, then scrape the data with the Extract Data action. The same actions are repeated for every URL in the list, and the loop stops once the last URL has been processed.
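
In plain code, this workflow behaves roughly like a `for` loop. The sketch below is only an analogy in Python, not how Octoparse is implemented: `go_to_web_page` and `extract_data` are hypothetical stand-ins for the two actions, and the URLs are placeholders.

```python
def go_to_web_page(url: str) -> str:
    """Hypothetical stand-in for the Go to Web Page action (no real request is made)."""
    return f"<html>content of {url}</html>"


def extract_data(page: str) -> dict:
    """Hypothetical stand-in for the Extract Data action."""
    return {"page": page, "length": len(page)}


def run_url_loop(urls: list[str]) -> list[dict]:
    """Repeat the two actions once per URL, in list order."""
    rows = []
    for url in urls:                      # take the next URL from the list
        page = go_to_web_page(url)        # load the page
        rows.append(extract_data(page))   # scrape the data
    return rows                           # the loop ends after the last URL


rows = run_url_loop(["https://example.com/a", "https://example.com/b"])
print(len(rows))  # 2
```

One row comes back per URL, which mirrors the Loop Item stopping on its own once the list is exhausted.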


2. Loop Item settings

If you click on a Loop Item and select General, you'll be taken to its settings panel. Let's take a look at the options available.

  • Action name: This is where you can change the name of the specific Loop Item. Giving each Loop Item a unique name helps you keep things organized when you have more than one Loop Item in your workflow.

  • Loop Mode: For a Loop Item to work correctly, it is critical that the correct loop mode is selected. There are six loop modes, and each one is explained in the section below.

  • Exit Loop: Besides letting the loop end on its own, you can also end it early by specifying the number of times the looped actions should repeat.

  • Wait before action: You can use this feature to set a wait time between repetitions.
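
To make the Exit Loop and Wait before action settings concrete, here is a minimal Python sketch of the behavior described above. It is an analogy, not Octoparse code: `run_loop`, `max_repeats`, and `wait_seconds` are hypothetical names chosen for this illustration.

```python
import time
from typing import Callable, Iterable, Optional


def run_loop(
    items: Iterable,
    action: Callable,
    max_repeats: Optional[int] = None,  # hypothetical analogue of Exit Loop
    wait_seconds: float = 0.0,          # hypothetical analogue of Wait before action
) -> list:
    """Apply `action` to each item, optionally stopping early and pausing before each repeat."""
    results = []
    for count, item in enumerate(items):
        if max_repeats is not None and count >= max_repeats:
            break                        # end the loop early after the set number of repeats
        if wait_seconds > 0:
            time.sleep(wait_seconds)     # pause before running the looped action
        results.append(action(item))
    return results


print(run_loop(["a", "b", "c", "d"], str.upper, max_repeats=2))  # ['A', 'B']
```

Without `max_repeats`, the loop simply runs until the items are exhausted, which corresponds to letting the Loop Item end on its own.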


3. The 6 loop modes and how to use them

There are 6 loop modes: Single Element, Variable List, Fixed List, List of URLs, Text List, and Scroll Page.

  • Single Element is used to locate a specific element on the page. Octoparse performs the looped actions on the same element over and over until the element is no longer found on the page. A common use for Single Element is having Octoparse click the "Next page" button repeatedly until it reaches the last page (where there is no "Next page" button).

  • Variable List is used to locate a list of items that can all be matched with a single XPath query. Octoparse performs the looped actions on the matched elements one by one until the last element is reached. Use Variable List when the number of elements you'd like to loop through varies from page to page.

  • Fixed List, like Variable List, locates a list of items, but it is defined as a list of XPath queries, each locating one unique element on the page. Use it when the number of elements is the same on every page.

  • List of URLs is used for looping through a list of URLs, in which case Octoparse opens the URLs one by one. There are three ways to input the URLs. Check out the different ways to input the URLs here.

  • Text List is a list of text strings. When a Text List is used, Octoparse enters the strings on the page one by one.

  • Scroll Page is a new way of scrolling, designed specifically for websites that use infinite scroll to load more content. It lets Octoparse scrape data while scrolling instead of only after the scroll finishes.

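
The Single Element pagination pattern described above can be pictured as a `while` loop that ends when the looped element is no longer found. The Python sketch below is an analogy under stated assumptions: `PAGES` is a made-up in-memory site, and `find_next_button` is a hypothetical stand-in for locating the "Next page" button. None of it is Octoparse code.

```python
from typing import Optional

# Hypothetical in-memory "website": each page holds its items and the number of
# the page its "Next page" button leads to (None on the last page).
PAGES = {
    1: {"items": ["a", "b"], "next": 2},
    2: {"items": ["c"], "next": 3},
    3: {"items": ["d", "e"], "next": None},  # last page: no "Next page" button
}


def find_next_button(page: dict) -> Optional[int]:
    """Hypothetical stand-in for locating the single looped element."""
    return page["next"]


def scrape_all(start: int = 1) -> list[str]:
    collected = []
    current = start
    while True:
        page = PAGES[current]
        collected.extend(page["items"])   # extract data from the current page
        target = find_next_button(page)
        if target is None:                # element not found: the loop ends
            break
        current = target                  # "click" the Next page button
    return collected


print(scrape_all())  # ['a', 'b', 'c', 'd', 'e']
```

The loop needs no page count up front; it stops as soon as the element it depends on disappears.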

TIPS:

  • When Fixed List, List of URLs, or Text List is used, the task can be further split into subtasks that run concurrently in the Cloud for faster data capture.

  • Variable List can be changed to Fixed List for faster extraction.
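
The difference between Variable List and Fixed List, and what changing one into the other amounts to, can be sketched with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. This is an illustration only: the sample markup and the helper names (`variable_list`, `to_fixed_xpaths`, `fixed_list`) are assumptions made for this example, and Octoparse's own XPath handling may differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample page with three list items.
PAGE = ET.fromstring(
    "<html><body><ul>"
    "<li>Item 1</li><li>Item 2</li><li>Item 3</li>"
    "</ul></body></html>"
)

# Variable List: a single XPath query that matches however many items exist.
VARIABLE_XPATH = ".//ul/li"


def variable_list(page: ET.Element) -> list:
    return [el.text for el in page.findall(VARIABLE_XPATH)]


def to_fixed_xpaths(page: ET.Element, xpath: str) -> list:
    """Turn one query into a list of per-element queries (one XPath per item)."""
    count = len(page.findall(xpath))
    return [f"{xpath}[{i}]" for i in range(1, count + 1)]


def fixed_list(page: ET.Element, xpaths: list) -> list:
    """Fixed List: each XPath locates exactly one element; missing ones are skipped."""
    found = []
    for xp in xpaths:
        el = page.find(xp)
        if el is not None:
            found.append(el.text)
    return found


fixed_xpaths = to_fixed_xpaths(PAGE, VARIABLE_XPATH)
print(fixed_xpaths)                    # ['.//ul/li[1]', './/ul/li[2]', './/ul/li[3]']
print(fixed_list(PAGE, fixed_xpaths))  # ['Item 1', 'Item 2', 'Item 3']
```

Because the fixed queries are frozen at three entries, a page with five items would yield only the first three through `fixed_list`, while `variable_list` would return all five. That is why Fixed List suits pages with a consistent number of elements.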


4. How to create a Loop Item

The type of Loop Item you need depends on the data you are trying to fetch and the structure of the specific webpage. Check out the tutorials below on how to create a Loop Item for various use cases.


5. Loop Item troubleshooting

Many issues can come up with a Loop Item, such as missing elements or skipped pages. The most frequently asked questions about Loop Items are listed below:

5.1 Pagination:

5.2 Missing elements:

5.3 Others:
