What is a Loop in Octoparse?
Web scraping is an inherently repetitive process: it loads web pages and extracts data from each of them. As you might expect, there are many scenarios where you need to repeat an action.
Octoparse uses a Loop to tell the scraping bot when an action needs to be performed repeatedly, such as hitting the "next page" button, clicking through embedded links in a list, clicking through options on a dropdown menu, or extracting data from a list. As a rule of thumb, if you would like to repeat any actions, put them in a loop.
By nesting a "Go to Web Page" action and an "Extract Data" action within a Loop Item, you are telling Octoparse to repeat those steps in sequence until an ending condition is met.
But having the workflow set up correctly is not quite enough; you'll also need to make sure the settings are right for the bot to work as expected.
1. Loop URLs for "Go to Web Page"
If you have a list of URLs, such as product detail pages, and would like to load and scrape data from each page, you'll most likely need a workflow with a "Go to Web Page" action nested inside a loop, set up as follows:
1) Add the list of URLs to the loop
Go to the settings of the Loop action by hovering over it and clicking the settings icon. First, make sure "List of URLs" is selected as the Loop Mode. Second, click the edit icon, then copy and paste the URL list directly into the text box. Click "Save" to save the settings.
2) Associate the action with the URLs in the loop
To make sure the "Go to Web Page" action works with the URLs in the loop, go to the settings of the "Go to Web Page" action and check the box for "Load URLs in the loop". This step is essential: it tells Octoparse to use the items in the loop to complete the action.
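Octoparse is configured through its interface rather than through code, so the following Python is only a conceptual analogy of what a "List of URLs" loop does, not Octoparse's implementation. The function names and the example.com URLs are hypothetical, and the page loader is a stand-in so the sketch runs without a network connection.

```python
from typing import Callable, Dict, List


def run_url_loop(
    urls: List[str],
    go_to_web_page: Callable[[str], str],
    extract_data: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Repeat 'Go to Web Page' then 'Extract Data' for every URL in the loop."""
    rows = []
    for url in urls:                      # the Loop action in "List of URLs" mode
        page = go_to_web_page(url)        # uses the current loop item as the URL
        rows.append(extract_data(page))   # one row of data per loaded page
    return rows                           # the loop ends when the list is exhausted


# Hypothetical stand-ins for real pages (illustration only).
FAKE_PAGES = {
    "https://example.com/p/1": "Lamp",
    "https://example.com/p/2": "Desk",
}


def fake_go_to_web_page(url: str) -> str:
    return FAKE_PAGES[url]


def fake_extract_data(page: str) -> Dict[str, str]:
    return {"title": page}


url_rows = run_url_loop(list(FAKE_PAGES), fake_go_to_web_page, fake_extract_data)
print(url_rows)
```

In this analogy, checking "Load URLs in the loop" corresponds to passing the current loop item into the page-loading step instead of a single fixed URL.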
2. Enter a list of text/keywords in a loop
If you have a list of keywords to search with, you can build a loop with an "Enter Text" action nested inside.
1) Add the list of text/keywords to the loop
This time, select "Text list" as the Loop Mode, then enter the list of keywords directly in the text box.
2) Associate the action with the text/keywords in the loop
Check the option "Use text in the loop to enter the text box" to automate the text-entering process.
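As a rough analogy only, the Python sketch below shows why the association step matters. It assumes that the pasted list is split into one keyword per line and that, with the option unchecked, the action would type the same fixed text on every pass; both are assumptions made for illustration, and the function names are hypothetical rather than anything Octoparse exposes.

```python
from typing import List


def parse_text_list(pasted: str) -> List[str]:
    """Split pasted text into loop items, one per non-empty line (assumed)."""
    return [line.strip() for line in pasted.splitlines() if line.strip()]


def run_enter_text_loop(
    keywords: List[str], use_loop_text: bool, fixed_text: str = ""
) -> List[str]:
    """Return what would be typed into the text box on each pass of the loop."""
    typed = []
    for keyword in keywords:
        # Option checked: type the current loop item.
        # Option unchecked (assumed behavior): type the same fixed text each time.
        typed.append(keyword if use_loop_text else fixed_text)
    return typed


keywords = parse_text_list("laptop\n\n  desk lamp \nmonitor\n")
checked = run_enter_text_loop(keywords, use_loop_text=True)
unchecked = run_enter_text_loop(keywords, use_loop_text=False, fixed_text="laptop")
print(checked)
print(unchecked)
```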
3. Click elements in a loop
If you are looking to click on a number of elements on the web page, you can have Octoparse click the elements in a loop. Instead of entering a fixed list of URLs or keywords into the loop, you'll use XPath to locate the elements via a Variable List or a Fixed List.
In most cases, Octoparse automatically generates the steps and the XPath for you as you build the task via point-and-click. However, if you need to revise the settings manually, you will have to write and enter the new XPath yourself.
Regardless of how it got set up, it is important to make sure the bot knows that it should click the elements in the loop. To do this, simply check the "Click items in the loop" option.
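To illustrate how XPath can locate the loop items, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath. It assumes that a Variable List uses one XPath matching every element, while a Fixed List uses one XPath per element. The sample page and the XPath expressions are invented for the example, and "clicking" is represented simply by collecting each link's href.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (illustration only).
PAGE = """
<html><body>
  <ul>
    <li class="product"><a href="/p/1">Lamp</a></li>
    <li class="product"><a href="/p/2">Desk</a></li>
    <li class="ad"><a href="/promo">Sale</a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Variable List analogue (assumed): a single XPath that matches all the elements.
variable_list_xpath = ".//li[@class='product']/a"
variable_items = root.findall(variable_list_xpath)

# Fixed List analogue (assumed): one XPath per element, each matching one element.
fixed_list_xpaths = [".//ul/li[1]/a", ".//ul/li[2]/a"]
fixed_items = [root.find(xpath) for xpath in fixed_list_xpaths]

# "Click items in the loop": act on each loop item in turn.
clicked_variable = [link.get("href") for link in variable_items]
clicked_fixed = [link.get("href") for link in fixed_items]
print(clicked_variable)
print(clicked_fixed)
```

Under these assumptions, the single-XPath form would keep working if more product items appeared on the page, whereas the per-element form always targets the same two positions.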
4. Extract data in the loop
Extracting data in the loop is often used for scraping data from a listing or a table. Once the list of elements has been added to the loop, go to the settings of the "Extract Data" action and check the option "Extract data in the loop".
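Conceptually, extracting data in the loop means that each field is read relative to the current loop item rather than from the page as a whole. The Python sketch below illustrates this idea with the standard library; the table contents, class names, and field names are hypothetical, and this is an analogy rather than how Octoparse is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing table (illustration only).
LISTING = """<table>
  <tr><td class="name">Lamp</td><td class="price">19.99</td></tr>
  <tr><td class="name">Desk</td><td class="price">149.00</td></tr>
</table>"""

table = ET.fromstring(LISTING)

listing_rows = []
for item in table.findall(".//tr"):   # each table row is one loop item
    listing_rows.append({
        # Paths are relative to the current loop item, not to the whole page.
        "name": item.findtext("td[@class='name']"),
        "price": item.findtext("td[@class='price']"),
    })

print(listing_rows)
```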
There isn't just one way to make things happen in Octoparse. The Octoparse workflow is versatile enough to accommodate all kinds of browsing activities and eventually capture the data of interest. For example, you can have both Extract Data and Click Item use the loop items when you need to scrape from both the listing page and the detail page.
Edited by Isabel