Octoparse offers a Cloud platform with many Cloud servers, so you can run your tasks 24/7 and extract data 6 to 20 times faster than with local extraction. Sometimes, however, a task does not run as fast in the Cloud as expected. In this tutorial, we explain how the Cloud speeds up extraction and how to revise a task to make it run faster.
Octoparse Cloud speeds up extraction by splitting one task into multiple sub-tasks and running the sub-tasks on multiple Cloud servers. Each sub-task needs one Cloud server to run, so the speed depends on how many Cloud servers your account has and whether the task is splittable.
The Standard plan has 6 Cloud servers, while the Professional plan has 20. You can upgrade to a higher plan to speed up your tasks. If you don't want to change your plan, modifying the task so that it is splittable is essential.
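To see why the number of Cloud servers and splittability both matter, here is a minimal Python sketch of the idea. It is only a toy model, not how Octoparse actually schedules work: the function names, the round-robin distribution, and the example.com URLs are all assumptions made for illustration, and it assumes every loop item takes the same time to process.

```python
from math import ceil


def split_into_subtasks(items, servers):
    """Distribute loop items round-robin over at most `servers` sub-tasks.

    Hypothetical helper: Octoparse's real splitting strategy may differ.
    """
    if servers < 1:
        raise ValueError("servers must be at least 1")
    # There is no point in creating more sub-tasks than there are items.
    count = min(servers, len(items))
    return [items[i::count] for i in range(count)]


def ideal_speedup(item_count, servers):
    """Best-case speedup, assuming every item takes the same time."""
    if item_count == 0:
        return 1.0
    # The run finishes when the largest sub-task finishes.
    longest = ceil(item_count / min(servers, item_count))
    return item_count / longest


# Placeholder URLs standing in for a splittable List of URLs loop.
urls = [f"https://example.com/page/{n}" for n in range(1, 101)]

for plan, servers in (("Standard", 6), ("Professional", 20)):
    subtasks = split_into_subtasks(urls, servers)
    speedup = ideal_speedup(len(urls), servers)
    print(f"{plan}: {len(subtasks)} sub-tasks, ideal speedup {speedup:.2f}x")
```

In this model, a task that cannot be split behaves like a list with a single item: it stays on one server no matter how many servers the plan provides.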
What kind of tasks are splittable?
When you create any kind of loop item in Octoparse, Octoparse automatically assigns it a loop mode based on the items selected and how they interact with the general webpage structure.
Specifically, there are three types of splittable loop modes in Octoparse.
- List of URLs
- Text list
- Fixed list
1. List of URLs
A URL loop is used when you start an extraction task with more than one URL. This is especially handy if the desired data spans multiple web pages that share the same page structure. You can set up a loop of URLs to go through each of these pages. Octoparse loads the URLs one by one and executes the same set of extraction actions on each page.
A URL loop is splittable. Hence, when a task built with a list of URLs is set to run in the Cloud, Octoparse splits it into multiple sub-tasks for faster and more effective extraction.
To learn more about the List of URLs, please refer to Batch URL input.
2. Text List
A Text list loop works much like a List of URLs loop, but instead of looping through a list of URLs, it loops through a list of predefined text values.
For more about the Text list loop, please refer to Enter Text.
3. Fixed List
Many web pages, such as those on e-commerce websites, organize their content (e.g., product information) as a collection of recurring elements with a shared HTML pattern.
When capturing such elements, such as product titles, Octoparse detects all the elements that share the same HTML pattern and generates a collection of XPaths, one for each element of that kind. Because each element has its own XPath, the loop can be split into sub-tasks.
Besides these three splittable loop modes, there are two loop modes that are not splittable: Single Element and Variable List. Because each of these modes involves only a single XPath, it cannot be split into sub-tasks to speed up the run.
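The difference between one XPath and a collection of XPaths can be illustrated with a short Python sketch. This is an assumption-laden illustration, not Octoparse's internal logic: the helper name `to_fixed_list`, the sample XPath, and the element count are all hypothetical, and the sketch assumes the number of matching elements on the page is known in advance.

```python
def to_fixed_list(variable_xpath, element_count):
    """Expand one XPath that matches many elements into one XPath per element.

    Hypothetical helper: it wraps the expression in parentheses and appends
    a 1-based position, which is standard XPath 1.0 syntax.
    """
    if element_count < 0:
        raise ValueError("element_count must not be negative")
    return [f"({variable_xpath})[{i}]" for i in range(1, element_count + 1)]


# A single XPath matching every product title (Variable List style).
variable_xpath = "//div[@class='product']//h2"

# One XPath per product title (Fixed List style), here for 3 products.
fixed_list = to_fixed_list(variable_xpath, 3)
for xpath in fixed_list:
    print(xpath)
```

A list like `fixed_list` has separate entries that could each be handed to a different sub-task, whereas the single `variable_xpath` offers nothing to divide.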
How to make my task splittable?
1. For a task that uses a Variable List to click through a list of elements, we can either:
- change it to a Fixed List by listing the XPath of every element on the page, or
- scrape only the element URLs first, without clicking into the pages, and then create another task with those URLs to get the detailed data. Here is an example: Scraping product information from Target.com
2. For a task that scrapes from multiple pages, we can use the URL of each page to build the workflow as a List of URLs loop.
Should you encounter any problems with Cloud extraction, feel free to leave us a message.
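When the page URLs follow a predictable pattern, the list for a List of URLs loop can be generated instead of typed by hand. The Python sketch below shows one way to do this; the URL template with a `page` query parameter is a hypothetical example, so check how the target website actually numbers its pages before relying on it.

```python
def build_page_urls(template, first_page, last_page):
    """Return one URL per page number, from first_page to last_page inclusive.

    `template` must contain a `{page}` placeholder (an assumption of this sketch).
    """
    if first_page > last_page:
        raise ValueError("first_page must not be greater than last_page")
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Hypothetical paginated listing; replace with the real URL pattern.
page_urls = build_page_urls("https://example.com/products?page={page}", 1, 5)

# One URL per line, ready to paste into a List of URLs input.
print("\n".join(page_urls))
```

Because every page gets its own URL, the resulting task no longer depends on clicking a "next page" button and can be split across Cloud servers.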