Batch URL input
The updated tutorial for the latest version, 8.1, is available here. Check it out now!
Extracting data from a list of URLs is one of the most efficient and powerful ways to achieve large-scale data scraping with Octoparse. When the list of URLs is large, Octoparse supports batch/bulk URL import from local files (text or spreadsheet) or from another task, and it can also generate the URLs from pre-defined patterns. These features are intended to reduce the tedious workload associated with large-scale data extraction.
There are three ways to batch import URLs to any single task/crawler (up to a million URLs):
1. Batch import URLs from local files
2. Batch import URLs from another task
3. Batch generate URLs based on a pre-defined pattern
Tips! Once the number of imported/generated URLs reaches the limit of 1 million, Octoparse stops importing/generating immediately.
1. Batch import URLs from files
You can now import URLs from any of the file formats below:
- CSV
- TXT
- Excel (.xlsx & .xls)
· Select "Advanced Mode" and click "+Task" to create a new task
· Select "Input from file"
· Click "Select file" then choose the file containing the URLs for the import
Octoparse automatically identifies and imports all the URLs from the file. Note that only the first 100 URLs are shown for preview purposes.
· Click "Save URL" to complete the import
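Before importing, it can help to clean the URL list so that duplicates and malformed entries do not count toward the 1 million limit. The following is a minimal sketch in Python, not part of Octoparse itself. The helper names `clean_urls` and `write_url_file` are hypothetical, and the file layout (one URL per line for TXT, one URL per row in a single column for CSV) is an assumption about what the importer will recognize.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse

MAX_URLS = 1_000_000  # import limit stated in this article


def clean_urls(candidates, limit=MAX_URLS):
    """Return unique http(s) URLs in first-seen order, capped at `limit`."""
    seen = set()
    result = []
    for raw in candidates:
        if len(result) >= limit:
            break  # stop once the cap is reached
        url = raw.strip()
        parsed = urlparse(url)
        # Keep only entries that look like absolute web URLs.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url in seen:
            continue  # skip duplicates
        seen.add(url)
        result.append(url)
    return result


def write_url_file(urls, path):
    """Write URLs to a .csv (one per row) or a text file (one per line).

    The one-URL-per-line layout is an assumption, not a documented format.
    """
    path = Path(path)
    if path.suffix.lower() == ".csv":
        with path.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([u] for u in urls)
    else:
        path.write_text("\n".join(urls) + "\n", encoding="utf-8")
```

For example, `write_url_file(clean_urls(open("raw.txt", encoding="utf-8")), "urls.txt")` would produce a cleaned text file that can then be chosen with "Select file" (here `raw.txt` is a placeholder name).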
2. Batch import URLs from another task
This feature makes it possible to integrate two tasks seamlessly when URL extraction needs to be done separately by another task. No extra export-and-import of URLs is needed.
· Select "Advanced Mode" and click "+Task" to create a new task
· Select "Input from task"
· Select the task containing the target URLs then specify the proper data field
· Click "Save URL" to complete the import
Note that the selected task (the one that contains the URLs needed for further crawling) is referred to as the parent task, and the new task being configured becomes the child task. The two tasks are associated automatically and can be executed in association with one another.
When a task is selected as the parent task, Octoparse automatically retrieves all the data extracted for the selected task (cloud and local).
Tasks that have not yet been run and do not have any URLs fetched can also be selected as the parent task: simply enter one example URL into the text box, then proceed to configure the child task.
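Conceptually, choosing a data field from the parent task amounts to taking one column of the parent's extracted rows and using its values as the child task's URL list. The snippet below is only an illustration of that idea in Python, not Octoparse's implementation. The function `urls_from_field` and the field names in the usage example are hypothetical.

```python
def urls_from_field(rows, field):
    """Collect the URL values found in one field of the parent task's rows.

    `rows` is a list of dicts, one per extracted record (an assumed shape).
    Rows where the field is missing, empty, or not a web URL are skipped.
    """
    urls = []
    for row in rows:
        value = (row.get(field) or "").strip()
        if value.startswith(("http://", "https://")):
            urls.append(value)
    return urls


# Usage: rows as a parent task might have extracted them (made-up data).
parent_rows = [
    {"title": "Item A", "detail_url": "https://example.com/items/a"},
    {"title": "Item B", "detail_url": ""},
    {"title": "Item C", "detail_url": "https://example.com/items/c"},
]
child_urls = urls_from_field(parent_rows, "detail_url")
```

Picking the wrong field (for instance `title` here) would yield an empty list, which is why the step above asks you to specify the proper data field.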
- Associated run
When a child task is set to run, you can specify the criteria for starting the extraction.
· Click "Start Extraction" on the task configuration interface or "Options" from Dashboard
· Select "Parent Task settings" / "Config with start"
There are four options to select from:
· Select "Run task as soon as its parent task starts" if you wish to run the child task as soon as any URL is fetched by the parent task.
Tips! 1. If you set up an associated run by selecting any option from Parent Task settings, both tasks will be executed in the cloud via Octoparse Cloud Service. 2. When an associated run is set up, task scheduling
3. Batch generate URLs based on a pre-defined pattern
With the URL Batch Generate feature, you can easily generate a large number of URLs that follow a specific pattern by modifying the parameters of one given URL.
This feature is especially useful for scraping a large number of pages from a single website. Use the URL generator to quickly generate all the page URLs and scrape all the pages at once, with no need to go through the pages one by one.
· Select "Advanced Mode" and click "+Task" to create a new task
· Select "Batch generate"
· Enter the URL to use as the base for batch generation
· Highlight the URL parameter you want to vary, and click "Add parameter"
· Select from the four Parameter Type options to define the pattern you need
· Click "Save URL" to save the list
- Four Parameter Type options
- Type 1: Numbers
- Type 2: Letters
- Type 3: Date
- Type 4: Custom list
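To make the four parameter types concrete, here is a minimal Python sketch of pattern-based URL generation. It is an illustration of the general idea and not Octoparse's own generator. The helper names (`numbers`, `letters`, `dates`, `generate_urls`), the `{name}` placeholder syntax, the step and date-format options, and the choice to combine several parameters as a Cartesian product are all assumptions made for this example.

```python
import string
from datetime import date, timedelta
from itertools import product


def numbers(start, end, step=1):
    """Type 1: an inclusive numeric range, e.g. page numbers."""
    return [str(n) for n in range(start, end + 1, step)]


def letters(start, end):
    """Type 2: an inclusive range of lowercase letters."""
    alphabet = string.ascii_lowercase
    return list(alphabet[alphabet.index(start): alphabet.index(end) + 1])


def dates(start, end, fmt="%Y-%m-%d"):
    """Type 3: every day from `start` to `end`, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime(fmt) for i in range(days + 1)]


def generate_urls(template, **params):
    """Fill each {name} placeholder with every combination of its values.

    A custom list (Type 4) is simply any list of strings passed as a value.
    """
    names = list(params)
    urls = []
    for combo in product(*(params[name] for name in names)):
        urls.append(template.format(**dict(zip(names, combo))))
    return urls


# Usage: a custom list of categories combined with page numbers 1 to 2.
urls = generate_urls(
    "https://example.com/{category}?page={page}",
    category=["books", "toys"],
    page=numbers(1, 2),
)
```

Because the combinations multiply, a few parameters with wide ranges can approach the 1 million URL limit quickly, so it is worth estimating the total before generating.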
Japanese article: URLの一括インポート (Batch import of URLs)
Spanish article: Batch URL input
Web scraping articles are also available on the official website.
Related articles:
Extract data from a list of URLs