When you build a scraping task in Octoparse, you will almost certainly use a "Loop Item" at some point in the process. A "Loop Item" is most often used to capture a list of elements or to paginate through the pages of a website. In this article, I will explain exactly how a "Loop Item" works in Octoparse.
1. What is a Loop Item
A "loop" is a programming construct that repeats a set of instructions until a certain condition is met. The Loop Item in Octoparse works in a similar way. A Loop Item is usually created from more than one URL or element, and one or more actions are added inside it. Once a Loop Item is created, Octoparse repeats the looped actions a designated number of times or until they can no longer be repeated, for example, when there is no next page to turn to because you've reached the last page.
Let's look at an example. Suppose we have a list of URLs to extract data from. First, we'll create a Loop Item using the list of URLs, then we'll insert a "Go to Web Page" action and an "Extract Data" action inside the Loop Item.
This workflow translates to a set of instructions telling Octoparse to take the first URL of the URL list, load the page with the "Go to Web Page" action, then scrape the data with the "Extract Data" action. The same set of actions will be repeated for all the URLs in the list until the last URL is taken, then the loop stops.
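The behavior described above can be sketched in a few lines of Python. This is only a conceptual illustration, not Octoparse's actual implementation: the function names `run_url_loop`, `fake_go_to_web_page`, and `fake_extract_data`, the `FAKE_SITE` dictionary, and the example.com URLs are all hypothetical stand-ins so the sketch runs without a network connection.

```python
import re

def run_url_loop(urls, go_to_web_page, extract_data):
    """Repeat the looped actions once per URL, in order, then stop."""
    rows = []
    for url in urls:                      # the Loop Item takes the next URL
        page = go_to_web_page(url)        # stands in for "Go to Web Page"
        rows.append(extract_data(page))   # stands in for "Extract Data"
    return rows                           # the loop ends after the last URL

# Hypothetical pages used in place of real web requests.
FAKE_SITE = {
    "https://example.com/a": "<title>Page A</title>",
    "https://example.com/b": "<title>Page B</title>",
}

def fake_go_to_web_page(url):
    """Return the stored markup for a URL instead of loading it."""
    return FAKE_SITE[url]

def fake_extract_data(page):
    """Pull the text between the <title> tags."""
    return re.search(r"<title>(.*?)</title>", page).group(1)

results = run_url_loop(list(FAKE_SITE), fake_go_to_web_page, fake_extract_data)
print(results)  # ['Page A', 'Page B']
```

The actions inside the loop body run once for each URL, which mirrors how the actions placed inside the Loop Item are repeated for each entry in the list.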
2. "Loop Item" settings
If you double-click the Loop Item or click the icon on it, you'll be taken to the settings panel. Let's take a look at the options available.
1) Action name: This is the place where you can change the name of the specific Loop Item. Assigning a unique name to a "Loop Item" can help you sort things out when you have more than one "Loop Item" in your workflow.
2) Loop Mode: For a "Loop Item" to work properly, it is critical that you select the correct loop mode. There are five loop modes, and each of them is explained in the section below.
3) Exit Loop: Besides having the loop quit automatically, you can also end the loop prematurely by designating the number of times to repeat the looped actions.
4) Wait before action: You can use this feature to set a wait time between repeats.
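To make the "Exit Loop" and "Wait before action" options more concrete, here is a minimal Python sketch of a loop with both controls. It is an illustration under assumptions, not Octoparse's code: the names `run_loop`, `max_repeats`, and `wait_seconds` are hypothetical, and the sketch assumes the wait is applied before every repeat, including the first.

```python
import time

def run_loop(items, action, max_repeats=None, wait_seconds=0.0, sleep=time.sleep):
    """Run `action` on each item, with an optional repeat cap and wait time."""
    results = []
    for item in items:
        # "Exit Loop": end early once the designated number of repeats is reached.
        if max_repeats is not None and len(results) >= max_repeats:
            break
        # "Wait before action": pause before running the looped action.
        if wait_seconds > 0:
            sleep(wait_seconds)
        results.append(action(item))
    return results

# Record the requested waits instead of actually sleeping, so the demo is instant.
waits = []
output = run_loop(["a", "b", "c", "d"], str.upper,
                  max_repeats=2, wait_seconds=1.5, sleep=waits.append)
print(output)  # ['A', 'B']
print(waits)   # [1.5, 1.5]
```

Without `max_repeats`, the loop simply runs until the list is exhausted, which corresponds to the loop quitting automatically.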
3. The 5 loop modes and how to use them
There are 5 loop modes: Single Element, Variable List, Fixed List, List of URLs, and Text List.
- Single Element is used to locate a specific element on the page. Octoparse performs the looped actions on the same element over and over again until the element is no longer found on the page. One common use for Single Element is when you want Octoparse to click the "Next page" button repeatedly until you've reached the last page (no more "Next page" button).
- Variable List is used to locate a list of items that can be matched with a single XPath query. Octoparse performs the looped actions on the matched elements one by one until the last element is reached. Variable List should be used when the number of elements you'd like to loop through is inconsistent across different pages.
- Fixed List, similar to Variable List, also locates a list of items, but Fixed List is a list of XPath queries with each XPath locating a unique element on the page. It is used when the number of elements on the page is consistent across all pages.
- List of URLs is used for looping through a list of URLs, in which case Octoparse would open the URLs one by one. There are four ways to input the URLs. Check out the different ways to input the URL here.
- Text List is a list of text strings. When a Text List is used, Octoparse inputs the strings on the page one by one.
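As a rough illustration of the Single Element mode used for pagination, the sketch below keeps following a "next" link until none is found. The `PAGES` dictionary and the `scrape_all_pages` function are hypothetical stand-ins for a real website and for the looped "click Next page, then extract" actions; this is not how Octoparse is implemented internally.

```python
# Hypothetical three-page site; "next" is None when there is no "Next page" button.
PAGES = {
    1: {"items": ["a", "b"], "next": 2},
    2: {"items": ["c"], "next": 3},
    3: {"items": ["d", "e"], "next": None},
}

def scrape_all_pages(pages, start=1):
    """Extract every page's items, moving on while a next page can be found."""
    collected = []
    current = start
    while True:
        page = pages[current]
        collected.extend(page["items"])   # extract data from the current page
        next_page = page["next"]          # look for the single looped element
        if next_page is None:             # element not found: the loop ends
            break
        current = next_page               # "click" the Next page button
    return collected

all_items = scrape_all_pages(PAGES)
print(all_items)  # ['a', 'b', 'c', 'd', 'e']
```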
Fixed List, List of URLs, and Text List can be split in the Cloud in order to speed up the extraction.
Variable List can be changed to Fixed List for faster extractions.
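The difference between Variable List and Fixed List can be sketched with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. This is a simplified illustration only: the sample markup, the XPath strings, and the function names `variable_list` and `fixed_list` are assumptions made for the example and do not reflect Octoparse's own XPath engine.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages with a different number of list items.
PAGE_ONE = "<ul><li>Apple</li><li>Banana</li><li>Cherry</li></ul>"
PAGE_TWO = "<ul><li>Date</li><li>Elderberry</li></ul>"

VARIABLE_XPATH = "./li"                          # one query matching every item
FIXED_XPATHS = ["./li[1]", "./li[2]", "./li[3]"] # one query per expected item

def variable_list(page_xml):
    """Loop over however many elements the single XPath matches."""
    root = ET.fromstring(page_xml)
    return [element.text for element in root.findall(VARIABLE_XPATH)]

def fixed_list(page_xml):
    """Loop over a fixed set of XPaths; a missing element yields None."""
    root = ET.fromstring(page_xml)
    found = []
    for xpath in FIXED_XPATHS:
        element = root.find(xpath)
        found.append(element.text if element is not None else None)
    return found

print(variable_list(PAGE_ONE))  # ['Apple', 'Banana', 'Cherry']
print(variable_list(PAGE_TWO))  # ['Date', 'Elderberry']
print(fixed_list(PAGE_ONE))     # ['Apple', 'Banana', 'Cherry']
print(fixed_list(PAGE_TWO))     # ['Date', 'Elderberry', None]
```

In this sketch the single query adapts to pages with a different number of items, while the fixed set of queries expects the same number of items on every page, which matches the guidance above on when to use each mode.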
4. How to create a Loop Item
The type of Loop Item you need depends on your scraping requirements. Check out the tutorials below on how to create a Loop Item.
- Scrape a list of elements on a page
- Click through a list of elements and scrape
- Loop through a list of URLs
- Loop through a list of text
- Scrape from multiple pages
5. "Loop Item" troubleshooting
There are many issues related to Loop Items, such as missing elements, skipped pages, and so on. The most frequently asked questions about Loop Items are listed below:
- Why does Octoparse skip pages during the scrape? (Version 8)
- Why does Octoparse keep scraping the last page and never stop?
- Infinite scroll has been set up, but no new elements are added to the list?
- Why does Octoparse scrape less data when there should be more? (Version 8)