Splitting Tasks to Speed up Cloud Extraction
There are two ways to run an extraction: Local Extraction and Cloud Extraction. For Cloud Extraction, Octoparse offers a Cloud platform with many Cloud servers that can run your tasks 24/7 and 6 to 20 times faster than local extraction. See this tutorial to learn more about Cloud Extraction.
For a task to run efficiently in the Cloud, it must be splittable. A splittable task can be broken down into multiple sub-tasks that run on multiple servers simultaneously, which makes the extraction faster.
Cloud Extraction is currently available only to Premium subscribers (Standard/Professional plan).
What kinds of tasks are splittable?
When you create any kind of loop item in Octoparse, Octoparse automatically assigns a loop mode to it based on the items selected and how they relate to the general webpage structure.
Octoparse has 5 loop modes, but only 3 of them are splittable:
- List of URLs
- Text list
- Fixed list
1. List of URLs
A URL loop is used when you start an extraction task with more than one URL. This is especially handy if the desired data spans multiple web pages that share the same page structure. You can set up a loop of URLs to go through each of these pages. Octoparse loads the URLs one by one and executes the same set of extraction actions on each page.
A URL loop is splittable. When a task built with a list of URLs is set to run in the Cloud, Octoparse splits it into multiple sub-tasks for faster and more efficient extraction.
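To make the idea of splitting a URL loop concrete, here is a minimal Python sketch. It is not Octoparse's actual splitting logic, which this article does not describe. The function name `split_url_list`, the `max_subtasks` parameter, and the example.com URLs are all assumptions made for illustration.

```python
from typing import List


def split_url_list(urls: List[str], max_subtasks: int) -> List[List[str]]:
    """Split a URL list into at most max_subtasks contiguous chunks of near-equal size.

    Hypothetical helper: it only illustrates how one URL loop could be divided
    into sub-tasks that separate servers process in parallel.
    """
    if max_subtasks < 1:
        raise ValueError("max_subtasks must be at least 1")
    count = min(max_subtasks, len(urls))
    if count == 0:
        return []
    base, extra = divmod(len(urls), count)
    chunks = []
    start = 0
    for i in range(count):
        # The first `extra` chunks receive one additional URL.
        size = base + (1 if i < extra else 0)
        chunks.append(urls[start:start + size])
        start += size
    return chunks


# Ten placeholder URLs that share the same page structure.
urls = [f"https://example.com/products?page={i}" for i in range(1, 11)]
subtasks = split_url_list(urls, max_subtasks=4)
for number, chunk in enumerate(subtasks, start=1):
    print(f"sub-task {number}: {len(chunk)} URLs")
```

Each chunk stands for one sub-task. Since every URL is handled by the same set of extraction actions, the chunks do not depend on each other and could run on different servers at the same time.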
2. Text list loop
A Text list loop works like a URL loop, but instead of looping through a list of URLs, it loops through a list of predefined text values. A Text list loop is also splittable.
For more about Text list loop, please refer to Text/keyword input.
3. Fixed list loop
Many web pages, such as e-commerce websites, organize their content (e.g. product information) as a collection of recurring elements with a shared HTML pattern (see Use lists to extract).
When capturing such elements, for example product titles, Octoparse detects all the elements that share the same HTML pattern and generates a collection of XPaths, one for each element. Because a Fixed list holds multiple XPaths, it is splittable.
Besides these 3 splittable loop modes, there are 2 loop modes that are not splittable: Single element loop and Variable list loop. Each of these loops involves only a single XPath, so it can't be split further into sub-tasks.
1. Single element loop
It is mostly used for pagination, when the loop has to click a "Next" button repeatedly.
2. Variable list loop
Unlike a Fixed list, a Variable list captures all similar elements with one single XPath, based on the HTML pattern they share.
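The difference between a Fixed list and a Variable list can be sketched in Python with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The tiny product list and the XPath strings below are assumptions for illustration only. Octoparse generates its own XPaths, and they will look different on a real page.

```python
import xml.etree.ElementTree as ET

# A made-up product list standing in for a real web page.
page = ET.fromstring(
    "<ul>"
    "<li><span>Laptop</span></li>"
    "<li><span>Phone</span></li>"
    "<li><span>Tablet</span></li>"
    "</ul>"
)

# Variable list: one single XPath that matches every similar element.
variable_xpath = "./li/span"
variable_items = [element.text for element in page.findall(variable_xpath)]

# Fixed list: a collection of XPaths, one per element.
fixed_xpaths = [f"./li[{position}]/span" for position in range(1, 4)]
fixed_items = [page.find(xpath).text for xpath in fixed_xpaths]

# A collection of XPaths can be divided between sub-tasks...
subtask_a, subtask_b = fixed_xpaths[:2], fixed_xpaths[2:]
# ...whereas the single variable XPath has nothing to divide.

print(variable_items)
print(fixed_items)
print(subtask_a, subtask_b)
```

Both approaches locate the same elements here. Only the Fixed list gives the loop several independent entries that could be handed to different sub-tasks.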
When it is better not to split a task
By default, Octoparse splits a task if it is splittable. This keeps extraction as fast as possible when running in the Cloud. However, there are times when it is better not to split the task.
- Disable "task split" if you need to run multiple tasks concurrently in the Cloud
When a task is split into many sub-tasks, these sub-tasks take up as many servers as your account type allows. All other tasks and sub-tasks then wait in a queue until running sub-tasks are completed and servers are released.
- Disable "task split" when the target website requires a login to access the desired data, especially when multiple simultaneous logins are not allowed.
- Disable "task split" if order matters. When a task is split into sub-tasks, each sub-task is executed as soon as a server is released, or concurrently if several servers are available. For this reason, data might not be extracted in the same order as it appears on the website.
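The queuing and ordering effects described above can be illustrated with a small Python simulation. This is a simplified model, not Octoparse's scheduler. The function `simulate_cloud_run`, the sub-task durations, and the server counts are hypothetical values chosen for the example.

```python
import heapq
from typing import List, Tuple


def simulate_cloud_run(durations: List[int], servers: int) -> List[Tuple[int, int]]:
    """Return (finish_time, subtask_index) pairs sorted by completion time.

    Hypothetical model: sub-tasks are dispatched in their original order,
    and each one starts on whichever server is released first.
    """
    if servers < 1:
        raise ValueError("servers must be at least 1")
    # Each heap entry is the time at which one server becomes free.
    free_at = [0] * servers
    heapq.heapify(free_at)
    finished = []
    for index, duration in enumerate(durations):
        start = heapq.heappop(free_at)  # wait for the earliest released server
        end = start + duration
        heapq.heappush(free_at, end)
        finished.append((end, index))
    return sorted(finished)


# Assumed run times of four sub-tasks, listed in website order.
durations = [5, 1, 3, 2]

split_order = [index for _, index in simulate_cloud_run(durations, servers=2)]
unsplit_order = [index for _, index in simulate_cloud_run(durations, servers=1)]

print("two servers:", split_order)
print("one server: ", unsplit_order)
```

In this model, two servers finish the sub-tasks in an order that differs from the website order, because a long first sub-task is overtaken by shorter ones. A single server processes them strictly in sequence, which is closer to what happens when task split is disabled.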
To disable task split:
Click "Settings" > check "Disable task split" > click "Save".
Article in Spanish: ¿Qué es "división de tareas" en Cloud Extraction? (Acelerar Cloud Extraction)