With the "List of URLs" loop mode, Octoparse does not need to perform steps such as "Click to paginate" or "Click Item" to open each item page. As a result, extraction runs faster, especially with Cloud Extraction: when a task built with "List of URLs" is set to run in the Cloud, it is split into sub-tasks that run on multiple cloud servers simultaneously.
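To make the idea of splitting a URL list into sub-tasks more concrete, here is a minimal Python sketch. It is not Octoparse's implementation, and how the Cloud actually divides a task is not described here. The function names `split_into_chunks`, `process_chunk` and `run_in_parallel` are hypothetical, and local threads stand in for cloud servers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(urls, chunk_count):
    """Split urls into at most chunk_count contiguous, roughly equal chunks."""
    if chunk_count < 1:
        raise ValueError("chunk_count must be at least 1")
    size, extra = divmod(len(urls), chunk_count)
    chunks, start = [], 0
    for i in range(chunk_count):
        # The first `extra` chunks take one additional URL each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(urls[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Placeholder for one sub-task; it only reports how many URLs it received."""
    return len(chunk)


def run_in_parallel(urls, worker_count):
    """Process every chunk concurrently (threads stand in for cloud servers)."""
    chunks = split_into_chunks(urls, worker_count)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(process_chunk, chunks))
```

Because each URL in the list can be opened independently, the chunks do not depend on one another. A click-based pagination loop, by contrast, has to visit the pages in order.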
1. Speed up pagination by using a URL list
If your scraping task needs to extract data from thousands of pages, you can scrape from a URL list instead of using "Click to paginate" to move through the pages one by one. This makes the task run more efficiently.
Let's take the URLs below as an example:
This website has 849 pages in total. If you compare the URLs of the pages, you can see that they share the same structure. In this case, you can use "Batch Generate" to auto-generate the URL of each page.
Here are the steps you can follow:
Click New+ from the sidebar menu and select Custom Task
Select Batch Generate (1), enter the URL of page one into the URL Format bar (2), then select Add Parameter (3)
Parameter Type: Number
Initial value: 1
Every time: +1
Remember to remove the number "1" after the page in the URL, so that the parameter takes its place. The generated URLs then follow the same pattern, with the page number increasing by 1 each time.
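The effect of these parameter settings can be sketched in a few lines of Python. This only illustrates the kind of list that batch generation produces; it is not Octoparse code. The URL `https://example.com/products?page=[page]` and the `[page]` placeholder are made-up stand-ins, since the real site URL is not reproduced here.

```python
def batch_generate(url_format, initial, step, count, placeholder="[page]"):
    """Build `count` URLs by substituting an increasing number for the placeholder.

    `placeholder` is a hypothetical marker for the spot where the page
    number used to be in the page-one URL.
    """
    if placeholder not in url_format:
        raise ValueError("url_format must contain the placeholder")
    return [
        url_format.replace(placeholder, str(initial + i * step))
        for i in range(count)
    ]


# Initial value 1, +1 every time, 849 pages in total.
urls = batch_generate("https://example.com/products?page=[page]", 1, 1, 849)
print(urls[0])    # https://example.com/products?page=1
print(urls[-1])   # https://example.com/products?page=849
print(len(urls))  # 849
```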
Tip: There are three ways to batch import URLs to any single task/crawler (up to a million URLs):
Batch import URLs from local files
Batch import URLs from another task
For more details, see the tutorial "Batch URL input".
2. Speed up scraping detail pages by using a URL list
When you need to click through the items on a list and scrape their corresponding detail pages, clicking the items one by one takes time. In this case, it is wise to scrape the URLs of all the listed items first. Once you have the URLs of all the detail pages, you can create a new task that takes the URLs scraped by the previous task as its input.
Here is a case tutorial on how to scrape the URLs of items: Scrape product info from Sam's Club
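The first stage of this two-step approach, collecting the detail-page URLs from a list page, can be sketched in Python as follows. This is a rough illustration using only the standard library, not how Octoparse extracts links internally. The names `LinkCollector` and `collect_detail_urls` and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the list page's own URL.
            url = urljoin(self.base_url, href)
            if url not in self.links:
                self.links.append(url)


def collect_detail_urls(list_page_html, base_url):
    """Return the detail-page URLs found in the HTML of one list page."""
    collector = LinkCollector(base_url)
    collector.feed(list_page_html)
    return collector.links


# Made-up list page with two distinct items, one of them linked twice.
sample_html = (
    '<ul><li><a href="/item/1">A</a></li>'
    '<li><a href="/item/2">B</a></li>'
    '<li><a href="/item/1">A again</a></li></ul>'
)
detail_urls = collect_detail_urls(sample_html, "https://example.com/list")
print(detail_urls)
```

The resulting list plays the role of the first task's output: it is what the second task would receive as its URL list.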