Everything you do in Octoparse starts with building a task. A scraping task in Octoparse is also referred to as "a bot" or "an agent". Regardless of what it is called, a task is essentially a set of instructions for the program to follow.
Building a task in Octoparse is straightforward. First, load your target webpage in Octoparse and click to select the data you need to fetch. Once you've finished selecting, a workflow is auto-generated based on how you interacted with the webpage: for example, whether you clicked a certain button, hovered over the navigation menu, or clicked to select data on the page.
Octoparse simulates real browsing actions (clicking, searching, paginating, etc.) until it reaches and fetches the target data, all by following the steps in the workflow. This is how Octoparse works to extract data from any webpage.
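To make the idea of "a set of instructions" concrete, here is a minimal Python sketch of a task as an ordered list of workflow steps that a runner follows one by one. It is only an illustration of the concept, not Octoparse's internal format or API: the names Step, Task and run, the action labels, and the example.com URL are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str  # hypothetical action label, e.g. "open", "click", "extract"
    target: str  # a URL, or a label for an element on the page


@dataclass
class Task:
    name: str
    steps: list = field(default_factory=list)


def run(task, page):
    """Follow the steps in order. `page` is a stand-in for a loaded
    webpage: a dict mapping element labels to their text."""
    log, rows = [], []
    for step in task.steps:
        log.append(f"{step.action} {step.target}")
        if step.action == "extract":
            # Only "extract" steps produce data; other steps just navigate.
            rows.append(page.get(step.target, ""))
    return log, rows


# A three-step workflow: open a page, paginate, then fetch one field.
task = Task("demo", [
    Step("open", "https://example.com/products"),
    Step("click", "next-page button"),
    Step("extract", "product title"),
])
page = {"product title": "Blue Widget"}
log, rows = run(task, page)
```

In this sketch, log records every action in the order it was performed, while rows holds only the extracted data, which mirrors the distinction between the workflow and its output.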
Advanced Mode vs. Task Templates
There are two ways to create a scraping task in Octoparse. You can build a task in Advanced Mode or start from a Task Template.
With Advanced Mode, you can customize your scraping task in any way you like, such as searching with keywords, logging into your account, clicking through a dropdown, and much more. To put it simply, Advanced Mode is all you need to scrape data from any website.
Unlike Advanced Mode, Task Templates provide a large number of pre-set scraping templates for some of the most popular websites. These tasks are pre-built, so you only need to input certain variables, such as the search term or the target page URL, to fetch a pre-defined set of data from that website.
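The difference can be pictured with a short Python sketch: a template fixes the site and the data fields in advance and leaves only a few variables for you to fill in. This is an assumption-laden illustration, not how Octoparse stores templates; the TEMPLATE dictionary, the task_from_template function, the field names, and the example.com URL are hypothetical.

```python
from string import Template
from urllib.parse import quote_plus

# A hypothetical pre-built template: the URL pattern and output fields are
# fixed, and only the search term is left as an input variable.
TEMPLATE = {
    "name": "Example store search",
    "url": Template("https://example.com/search?q=$search_term"),
    "fields": ["title", "price", "rating"],
}


def task_from_template(template, **variables):
    """Fill in the template's variables and return a ready-to-run task
    description. Raises KeyError if a required variable is missing."""
    # URL-encode each value so spaces and symbols are safe in the query.
    encoded = {key: quote_plus(str(value)) for key, value in variables.items()}
    return {
        "name": template["name"],
        "start_url": template["url"].substitute(**encoded),
        "fields": list(template["fields"]),
    }


task = task_from_template(TEMPLATE, search_term="running shoes")
```

Here the caller supplies only the search term; the set of fields to fetch is decided by the template, which is the trade-off compared with a fully customized task.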
Ready to get your hands on some data? Follow the introductory lessons for step-by-step guidance on how to create your first task.
- The interfaces of version 7 and version 8 are different; the auto-detect feature is only available in version 8.
- You can use the auto-detect feature to generate a basic workflow first, then modify or optimize it to meet your own needs.
- Usually, one task/crawler is used to scrape data from one website (or URLs under one domain), because one task/crawler can only scrape pages with a similar page structure. However, you can try scraping email addresses from a list of websites with one crawler; see this tutorial for reference: Can I extract email addresses from a series of websites without similarities?
Tips for managing your tasks
1. Task information editing
The task name is created automatically when you save the URL you entered.
· To modify the task name, click the textbox above the workflow panel and enter a new name.
· Or click to edit the name of a saved task
2. More actions of task management
Here are more task management actions you might use.
Options for task management in "More Actions"
· "Edit" – Edit task (Or double-click the task name on the dashboard to edit.)
· "Delete" – Delete task
· "Rename" – Rename task
· "Settings" – Basic settings (including task group and description) and extraction settings
· "Duplicate" – Replicate task
· "Export" – Export task
To batch manage tasks:
· Select multiple tasks (this also works with a single task).
· Choose one of the available options to apply it to all selected tasks.
· To clear the selection, click "Unselected".