There isn’t just one way to scrape a webpage. Depending on how the page is structured, there are usually several approaches you can try.
In some cases, you may already have a list of similarly structured URLs on hand (such as a batch of product URLs) and want to extract data from them directly. In this tutorial, we will introduce an easy and powerful way to extract data from multiple web pages using a list of URLs.
When should you consider scraping by using a list of URLs?
Here are some cases where starting the task with a list of URLs is a good fit.
- All the URLs should be from the same domain and share the same webpage structure (most important).
- Example: you have a list of product URLs and want to run a task against that list directly to scrape updated pricing data on a regular basis.
- Some websites use infinite scrolling or a "Load more" button to load content. If you need to click each product name (or something similar) to scrape details on a deeper layer, you'll need to split the work into two tasks: one to load the page and scrape the URLs, and another that takes the extracted URL list as input to scrape the details (see the sketch after this list).
- Example: Zara's search result page uses infinite scrolling to keep loading new items. If the data you need is on the item pages, you need to set the number of scrolls and collect enough product URLs first for the next task.
- The website uses AJAX (see Deal with AJAX) to load new content, which means that after clicking into the first product page, the task cannot automatically go back to the listing page (and click into the second product page from there). In this case, extract the detail-page URLs first, then scrape the data you want with the URL list (video tutorial).
- Some websites load pages quite slowly while paginating, which can affect the data scraping of scheduled tasks, so it's better to loop through the page URLs directly to avoid the issue.
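To make the two-task split concrete, here is a minimal sketch of the first task in Python (requests + BeautifulSoup) rather than Octoparse: it collects the detail-page URLs from a listing and saves them so that the second, "List of URLs" task can take them as input. The listing URL pattern and the CSS selector are placeholders for illustration only.

```python
# Sketch of "task 1": gather detail-page URLs and save them for the next task.
# The URL pattern and the "a.product-link" selector are made up -- adjust them
# to the actual site you are scraping.
import requests
from bs4 import BeautifulSoup

listing_urls = [f"https://example.com/search?page={n}" for n in range(1, 6)]
detail_urls = []
for url in listing_urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    detail_urls += [a["href"] for a in soup.select("a.product-link")]

# De-duplicate while preserving order, then write one URL per line.
with open("detail_urls.txt", "w") as f:
    f.write("\n".join(dict.fromkeys(detail_urls)))
```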
How do I know whether the pages share the same structure?
If you are scraping news articles from a particular website, the article pages will most likely share the same structure, like this:
Another example is Google Maps: every business page follows the same layout:
To scrape using a list of URLs, we simply set up a loop over all the URLs we want to scrape from, then add a data extraction action right after it to get the data we need. Octoparse will load the URLs one by one and scrape the data from each page.
With a "List of URLs" loop, Octoparse doesn't need to deal with extra steps like "Click to paginate" or "Click Item" to enter each item page. As a result, extraction is faster, especially with Cloud Extraction. When a task built with a "List of URLs" loop is set to run in the Cloud, it is split into sub-tasks by default, which then run on various cloud servers simultaneously.
"List of URLs" mode is very effective. You can add particular web pages to the list, and it doesn't matter whether they are consecutive pages or not, as long as they share the same page layout. Octoparse will scrape data from each URL in the list, and no page would be omitted.
1. Can I use URLs that do not share the same page layout?
Unfortunately, no. Only URLs that share the same page structure can be extracted using "List of URLs". To make sure data is extracted consistently and accurately, the pages must share the same page layout.
To learn more about the "List of URLs" mode, you can check out the following article: Loop Item
2. Is there a limit to the number of URLs that I can add at a time?
Yes. We suggest adding no more than 10,000 URLs if you copy and paste the URLs directly into Octoparse. However, with the Batch URL input feature, you can input up to 1 million URLs.
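If the pages follow a predictable URL pattern (page numbers, product IDs, and so on), you can also generate a large list programmatically before importing it. A quick sketch, with a made-up pattern:

```python
# Generate paginated URLs from a hypothetical pattern and save them to a file
# that can then be imported as the URL list.
urls = [f"https://example.com/products?page={n}" for n in range(1, 1001)]
with open("urls.txt", "w") as f:
    f.write("\n".join(urls))
```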
3. Can Octoparse automatically collect and add the URLs?
The Octoparse Advanced API lets you modify the list of URLs without opening the app.
Extracting data from a list of URLs can generally be broken down into 3 simple steps:
In Octoparse, there are two ways to create a "List of URLs" loop.
1) Start a new task with a list of URLs
1. Select "+New" and click "Advanced Mode" to create a new task, or, enter webpage URLs to get started
2. Paste the list of URLs in the textbox and click "Save URL"
After we input the URLs and click "Save", Octoparse will remove duplicates automatically and keep the valid URLs only:
After clicking "Save URL", the "Loop Item" (which loops through each URL of the list) is automatically created in the workflow.
If you hover over the "Loop Item" or click on its settings, you can see that the URLs you entered have been added to the "Loop Item".
Octoparse enters the "List of URLs" loop mode by default when more than one URL is entered to start a task.
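As mentioned above, Octoparse removes duplicates and keeps only valid URLs when you save the list. If you prepare the list outside Octoparse first, a quick cleanup like the following sketch (not Octoparse's own logic) does a similar job before you paste the URLs in:

```python
from urllib.parse import urlparse

def clean(urls):
    """Keep only well-formed http(s) URLs, drop duplicates, preserve order."""
    seen, valid = set(), []
    for u in (u.strip() for u in urls):
        parts = urlparse(u)
        if parts.scheme in ("http", "https") and parts.netloc and u not in seen:
            seen.add(u)
            valid.append(u)
    return valid
```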
3. Set up "Wait before execution"
If Octoparse works too fast, a page may not load completely before the data extraction step is executed, which can lead to missing or incomplete data. To avoid this, we can set up "Wait before execution".
Click on the settings of "Go to Web Page". Under "Advanced Options", set a wait time before the action is executed (2 seconds usually works).
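"Wait before execution" plays the same role as an explicit wait in a browser-automation script: give the page time to render before extraction starts. For comparison, here is a rough Selenium sketch; the URL and the ".price" selector are placeholders, not anything taken from Octoparse:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/item/1")
# Wait up to 10 seconds for the price element to appear before extracting it.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".price"))
)
price = driver.find_element(By.CSS_SELECTOR, ".price").text
driver.quit()
```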
2) Create a "List of URLs" loop in Workflow Designer
1. Add a "Loop Item" in the workflow
2. Go to "Loop mode" and select "List of URLs"
3. Click and enter/paste the list of URLs. Don’t forget to click "OK" to save the setting.
Notice the "Go to Web Page" action is automatically generated in the workflow. And by clicking on "Loop Item", you can find the list of URLs being added to "Loop Item"
4. Set up "Wait before execution"
Octoparse loads each URL in the list before starting to extract the data. If a page doesn't load completely, Octoparse may have trouble scraping the data or executing the next step in the workflow. To keep extraction from starting before the page has fully loaded, set up "Wait before execution" (2 seconds is recommended).
After the URLs are saved, the first page opens automatically, and you can select the data on the page to extract. See Extract element text/URL/image/HTML/attribute.
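For reference, those extraction options map roughly onto the following BeautifulSoup calls in Python; the HTML snippet is a made-up example, not anything from Octoparse:

```python
from bs4 import BeautifulSoup

html = '<div class="item"><a href="/p/1"><img src="/img/1.jpg" alt="Shoe">Shoe</a></div>'
el = BeautifulSoup(html, "html.parser").select_one(".item a")

text      = el.get_text(strip=True)  # element text         -> "Shoe"
link_url  = el["href"]               # element URL           -> "/p/1"
image_url = el.img["src"]            # image URL             -> "/img/1.jpg"
inner     = el.decode_contents()     # element HTML (inner markup)
attribute = el.img["alt"]            # any other attribute   -> "Shoe"
```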
1. If the scraping stops right after the extraction starts, try adding a longer Timeout to the "Go to Web Page" step so the system waits longer for the webpage to load fully.
2. If you want the exported data to line up with the original URL list you entered, you can add the current page URL as a data field here:
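The idea is simply to carry the source URL along with every row of output. In a standalone Python script, the equivalent (with placeholder URLs and a placeholder "h1" selector) would look like this:

```python
import csv
import requests
from bs4 import BeautifulSoup

url_list = ["https://example.com/item/1", "https://example.com/item/2"]
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["page_url", "title"])
    writer.writeheader()
    for url in url_list:
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        title = soup.select_one("h1")
        # Storing the page URL in every row keeps the export aligned with the input list.
        writer.writerow({"page_url": url,
                         "title": title.get_text(strip=True) if title else None})
```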
Should you have any questions, feel free to leave your message.