Speed up scraping by using a URL list
With the "List of URLs" loop mode, Octoparse does not need to handle steps such as "Click to paginate" or "Click Item" to enter an item page. As a result, extraction is faster, especially for Cloud Extraction. When a task built with a "List of URLs" loop runs in the Cloud, it is split into sub-tasks that run on multiple cloud servers simultaneously.
1. Speed up pagination by using a URL list
If your scraping task needs to extract data from thousands of pages, you can use a URL list instead of clicking to paginate page by page. This helps the task run much more efficiently.
Let's take the URL below as an example:
This website has a total of 83,663 pages of results, and each page lists 20 items. By observing the URL of each page, you can see that they share the same structure. In this case, you can use "Batch generate" to auto-generate the URLs for all the pages (a short script sketch of the same offset logic follows the steps below).
Here are the steps you can follow:
- Select "Batch generate" under "Input URL"
- Paste the URL into "URL format"
- Select the number "0" in the URL and then click "Add parameter"
- The "Parameter settings" depend on the website. In this case, we can set:
- Start value: input 0
- Action: select "increase" and input 20 (the parameter is a page offset that grows by 20 because each page lists 20 items)
- End condition: input 83663 for the "Item" (since the website has 83,663 pages in total)
- There is no need to input an "End value"; when you click "OK", an end value is generated automatically
- Then you can see a preview of 100 rows of the auto-generated URLs. Click "Save URL"
- Now you can see a total of 83,663 URLs in the "Loop Item"
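If you prefer to generate such a URL list outside Octoparse (for example, to paste into "Batch URL input"), the same offset logic can be reproduced with a few lines of Python. This is only a minimal sketch: the base URL and the `offset` parameter name are placeholders, since the example site's exact URL format is not shown above.

```python
# Minimal sketch: generate paginated URLs by increasing an offset parameter.
# The base URL and the "offset" parameter name are placeholders; replace them
# with the actual URL structure observed on the target website.

TOTAL_PAGES = 83663      # total number of result pages on the site
ITEMS_PER_PAGE = 20      # each page lists 20 items, so the offset grows by 20

urls = [
    f"https://www.example.com/list?offset={page * ITEMS_PER_PAGE}"
    for page in range(TOTAL_PAGES)
]

# Save one URL per line, ready to import as a URL list.
with open("urls.txt", "w") as f:
    f.write("\n".join(urls))

print(len(urls))  # 83663
```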
Tips! There are three ways to batch import URLs into a single task/crawler (up to one million URLs). Please check this tutorial: Batch URL input
2. Speed up scraping detail pages by using a URL list
When you need to click through the items on a list and scrape their corresponding detail pages, clicking the items one by one takes time. In this case, it is wise to scrape the URLs of all the listed items first. Once you have the URLs of the detail pages, you can start a new task and input the scraped URLs from the previous task.
Here is a case tutorial applying this technique: Scrape product information from Sam's Club
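Outside Octoparse, the same two-stage idea looks like the sketch below: stage one collects the detail-page URLs from the listing pages, and stage two visits each collected URL separately. This is only illustrative; the listing URL pattern, the CSS selectors, and the field names are assumptions, not taken from any specific site.

```python
# Illustrative two-stage scrape: stage 1 collects detail-page URLs from listing
# pages, stage 2 scrapes each detail page from that URL list.
# The listing URLs, CSS selectors, and field names below are hypothetical.
import requests
from bs4 import BeautifulSoup

def collect_detail_urls(listing_urls):
    """Stage 1: gather the link of every listed item."""
    detail_urls = []
    for url in listing_urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.select("a.item-link"):          # hypothetical selector
            detail_urls.append(link["href"])
    return detail_urls

def scrape_detail_pages(detail_urls):
    """Stage 2: visit each detail page and extract fields."""
    results = []
    for url in detail_urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        results.append({
            "url": url,
            "title": soup.select_one("h1").get_text(strip=True),  # hypothetical field
        })
    return results

listing_urls = [f"https://www.example.com/list?offset={i * 20}" for i in range(5)]
items = scrape_detail_pages(collect_detail_urls(listing_urls))
```

Keeping each stage as a flat list of independent URLs mirrors why the URL-list approach is fast in Octoparse: the work can be split up and run in parallel, as described for Cloud Extraction above.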
Tips! In Octoparse, there are two ways to create a "List of URLs" loop.
Article in Spanish: Acelere el scraping utilizando la lista de URL
You can also read web scraping articles on the official website.
Related Articles:
Extract data from a list of URLs
Scrape product information from Amazon
What is "task split" on Cloud Extraction? (Speed up Cloud Extraction)
Author: Vanny
Editor: Yina
Contact us any time if you need our support.