What is Batch URL input?

The Batch URL input feature lets you import a large number of URLs into Octoparse at once. Octoparse supports batch (bulk) URL import from a local file (text or spreadsheet) or from another task, and it can also generate URLs from a pre-defined pattern.


How to batch input URLs?

Click +New in the sidebar menu and select Custom Mode. The URL import panel appears.

There are three ways to batch import URLs to any single task/crawler (up to a million URLs):

  1. Import URLs from a file

  2. Import URLs from another task

  3. Batch generate URLs based on a pre-defined pattern

TIP: Once the number of imported or generated URLs reaches the 1 million limit, Octoparse stops importing or generating immediately.


1. Import URLs from a file

You can import URLs from any of these file formats: CSV, TXT, or Excel (.xlsx and .xls).

  • Select "Import from file".

  • Click "Select" and choose the file containing the URLs, then select the sheet and column that contain the URLs.

  • Click "Save" to complete the import process.


NOTE:

  1. Only the first 100 URLs will be shown for preview purposes.

  2. When importing from a CSV file, make sure the file has only one column, containing the URLs. If the file has several columns, the URLs will be recognized as invalid and will not be imported.
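If your URLs sit in a multi-column CSV, one way to meet the single-column requirement is to copy the URL column into a new file before importing. The following is a minimal Python sketch, not part of Octoparse. The column name `url` and the sample data are assumptions for illustration, and the output has no header row, which may or may not match what your import expects.

```python
import csv
import io


def extract_url_column(src, dst, column="url"):
    """Copy one column of a multi-column CSV into a single-column CSV.

    `src` and `dst` are open text file objects. `column` is the header
    name of the URL column (a hypothetical name; adjust it to your file).
    Returns the number of URLs written.
    """
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    count = 0
    for row in reader:
        url = (row.get(column) or "").strip()
        if url:  # skip rows where the URL cell is blank
            writer.writerow([url])
            count += 1
    return count


if __name__ == "__main__":
    # In-memory sample standing in for a real file on disk.
    sample = io.StringIO(
        "name,url\n"
        "Home,https://example.com/\n"
        "Docs,https://example.com/docs\n"
    )
    out = io.StringIO()
    written = extract_url_column(sample, out)
    print(written, "URLs written")
    print(out.getvalue())
```

With real files, open the source with `open(path, newline="", encoding="utf-8")` and the destination with `open(path, "w", newline="", encoding="utf-8")`, then pass both to `extract_url_column`.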


2. Import URLs from another task

This feature links two tasks seamlessly when the URLs have to be extracted by a separate task first. You no longer need to export and re-import the URLs manually.

  • Select "Import from task".

  • Select the task containing the target URLs, then specify the proper data field.

  • Click "Save" to complete the import process.


Note that the selected task (the one that contains the URLs needed for further crawling) is referred to as the parent task, and the new task configured with those URLs becomes the child task. The two tasks are associated automatically and can be run in association with one another.

TIPS:

1. You can set the child task to run according to the status of the parent task in the Cloud. If you set up an associated run by selecting an option under Parent task settings, both tasks are executed in the cloud via Octoparse Cloud Service. Associated runs are not available for Local Extraction.


2. When an associated run is set up, task scheduling is not available for the child task.

3. Importing from another task supports more than 1 million URLs.


3. Batch generate URLs based on a pre-defined pattern

With the "Batch generate" feature, you can easily generate a large number of URLs following specific patterns by modifying various parameters of one given URL.

  • Select "Batch generate".

  • Enter one URL to use as the base for batch generation.

  • Highlight the part of the URL you want to use as a parameter and click "Add parameter".

  • Select from the four Parameter Type options to define the pattern you need and click "Save URL" to save the list.


Four Parameter Type options

1. Numbers

Enter the initial number, choose whether to increase (+) or decrease (-) it by a fixed amount each time, and enter either a Repeat count or an end value. For example, to generate URLs for pages 1 to 100, set the initial number to 1, the step to + 1, and Repeat to 100. The end value is then filled in automatically as 100.
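To see how the initial number, step, and Repeat count relate to the end value, here is a rough Python sketch of the same arithmetic. It is only an illustration, not Octoparse's implementation. The function names, the example.com URL, and the `[page]` placeholder are all made up for this sketch.

```python
def number_series(start, step, repeat):
    """Return `repeat` values beginning at `start`, spaced `step` apart."""
    return [start + step * i for i in range(repeat)]


def generate_urls(base, placeholder, values):
    """Substitute each value for `placeholder` in the base URL."""
    return [base.replace(placeholder, str(v)) for v in values]


# Initial number 1, step +1, Repeat 100 (the example from the text).
pages = number_series(start=1, step=1, repeat=100)
urls = generate_urls("https://www.example.com/list?page=[page]", "[page]", pages)

print(pages[-1])  # end value: 1 + 1 * (100 - 1) = 100
print(urls[0])
print(urls[-1])
```

A negative step gives a decreasing series, for example `number_series(10, -2, 3)` yields 10, 8, 6.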


2. Letters

You can enter the starting letter and the ending letter.


3. Time


4. Custom list

You can enter your own list, like a list of search keywords or product numbers.
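As an illustration of what a keyword-based list expands to, the sketch below builds one search URL per keyword in Python. The base URL, the `[keyword]` placeholder, and the function name are hypothetical. The text does not say whether Octoparse URL-encodes list entries itself, so check the generated preview before relying on pre-encoded values.

```python
from urllib.parse import quote_plus


def urls_from_keywords(base, placeholder, keywords):
    """Build one URL per non-empty keyword, URL-encoding each keyword."""
    urls = []
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword:  # ignore blank entries in the list
            urls.append(base.replace(placeholder, quote_plus(keyword)))
    return urls


keywords = ["wireless mouse", "usb-c hub", ""]
base_url = "https://www.example.com/search?q=[keyword]"
for url in urls_from_keywords(base_url, "[keyword]", keywords):
    print(url)
```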


TIP: You can set up multiple parameters in one URL, and a URL is generated for every combination of their values. For example, with the base URL www.XXX.com/[parameter1]/[parameter2] and

Parameter1={A, B}, Parameter2={1, 2}

the final URL list would be:

www.XXX.com/A/1

www.XXX.com/B/1

www.XXX.com/A/2

www.XXX.com/B/2
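The multi-parameter expansion is a Cartesian product of the parameter value lists. The Python sketch below reproduces the four URLs above. It is an illustration only: the `expand` function is hypothetical, and the ordering (first parameter varying fastest) is inferred from the example list rather than from any documented Octoparse behavior.

```python
from itertools import product


def expand(base, params):
    """Return one URL per combination of parameter values.

    `params` maps each placeholder string to its list of values.
    The first placeholder varies fastest, mirroring the example list.
    """
    # itertools.product varies its LAST argument fastest, so reverse the order.
    order = list(params)[::-1]
    urls = []
    for combo in product(*(params[name] for name in order)):
        url = base
        for name, value in zip(order, combo):
            url = url.replace(name, str(value))
        urls.append(url)
    return urls


for url in expand(
    "www.XXX.com/[parameter1]/[parameter2]",
    {"[parameter1]": ["A", "B"], "[parameter2]": [1, 2]},
):
    print(url)
```

The number of generated URLs is the product of the list lengths (here 2 × 2 = 4), so a few long lists can reach the 1 million limit quickly.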
