Now that you have downloaded Octoparse and learned the basics, it is time to start your own web scraping project!
Many websites (directories, e-commerce, real estate sites, etc.) share a similar layout, e.g. a page containing many items arranged in a list. Let's look at a few examples.
Octoparse's brand-new auto-detect algorithm is specially designed to scrape pages of this kind. It detects listing data (including text elements and links), "Next page" buttons, "Load more" buttons and pages that load more content on scroll, and then generates the scraping task automatically.
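To get a feel for what "detecting listing data" can mean, here is a minimal Python sketch of one possible heuristic: find the element whose direct children most often share the same tag and class. This is only an illustration and not Octoparse's actual algorithm, which is not documented here. The sample HTML, the class name `RepeatedChildFinder` and the function `guess_list_item` are all made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

# Hypothetical listing page, loosely shaped like a product category page.
SAMPLE_HTML = """
<div class="page">
  <h1>Category</h1>
  <ul class="products">
    <li class="product"><a href="/p/1">Item 1</a><span class="price">$10</span></li>
    <li class="product"><a href="/p/2">Item 2</a><span class="price">$12</span></li>
    <li class="product"><a href="/p/3">Item 3</a><span class="price">$15</span></li>
  </ul>
  <a class="next" href="?page=2">Next</a>
</div>
"""


class RepeatedChildFinder(HTMLParser):
    """Counts, for every element, how often each (tag, class) child repeats."""

    def __init__(self):
        super().__init__()
        self.stack = [(0, None)]        # open elements as (node id, tag); 0 is a virtual root
        self.children = {0: Counter()}  # node id -> Counter of child signatures
        self.next_id = 1

    def handle_starttag(self, tag, attrs):
        signature = (tag, dict(attrs).get("class") or "")
        parent_id = self.stack[-1][0]
        self.children[parent_id][signature] += 1
        if tag not in VOID_TAGS:
            self.children[self.next_id] = Counter()
            self.stack.append((self.next_id, tag))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; the virtual root never matches.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][1] == tag:
                del self.stack[i:]
                break


def guess_list_item(html):
    """Return ((tag, class), count) for the most repeated sibling signature."""
    finder = RepeatedChildFinder()
    finder.feed(html)
    finder.close()
    candidates = (
        (signature, count)
        for counter in finder.children.values()
        for signature, count in counter.items()
    )
    return max(candidates, key=lambda pair: pair[1])


print(guess_list_item(SAMPLE_HTML))
```

On the sample page the three `li` elements with class `product` win, which matches what a person would call "the list". Real pages are messier, so a production detector would need to weigh many more signals than sibling repetition.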
In this lesson, we will go through how to scrape webpage data by using the auto-detect algorithm.
Octoparse Hello World provides a number of test sites to help you practice scraping data from different kinds of web pages.
1. Create a new task
Enter the sample URL "http://test-sites.octoparse.com/?product_cat=e-commerce-category-1" into the search box at the top of the home screen. Click "Start" to create a new task with Advanced Mode.
2. Get data via auto-detect
Octoparse will load the webpage in the built-in browser and start the auto-detect process automatically. Please wait until the process is completed and more information appears on "Tips".
If the data you need is not accessible upon page loading, check out this tutorial about how you can interact with the web page before getting data auto-detected.
3. Check the data
Once the auto-detection is completed, follow the instructions provided on "Tips" and check your data in the preview section. You can rename the data fields or remove those that are not needed. The detected data will also be highlighted on the webpage for you.
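If you later post-process exported data in a script, the same clean-up (renaming fields and removing unneeded ones) can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names `Field1` to `Field4`, the sample rows and the helper `tidy_fields` are hypothetical and not part of Octoparse.

```python
def tidy_fields(rows, rename=None, drop=()):
    """Return new rows with some fields renamed and others removed."""
    rename = rename or {}
    return [
        {rename.get(name, name): value for name, value in row.items() if name not in drop}
        for row in rows
    ]


# Hypothetical rows with placeholder field names.
rows = [
    {"Field1": "Item 1", "Field2": "$10", "Field3": "/p/1", "Field4": ""},
    {"Field1": "Item 2", "Field2": "$12", "Field3": "/p/2", "Field4": ""},
]

tidy = tidy_fields(
    rows,
    rename={"Field1": "title", "Field2": "price", "Field3": "url"},
    drop={"Field4"},
)
print(tidy[0])
```

The helper builds new dictionaries instead of editing the rows in place, so the original data stays available if you change your mind about a field.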
4. Confirm your options
Now, go to "Tips" and check your options. Based on the type of data detected, a number of options are provided for you to choose from. For this example, list data is detected so you are provided with the options to:
- Extract the data in the list - This option is selected by default because it is almost always what you need.
- Paginate to scrape more pages - Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button and extract data from more pages.
To find out if the detected button is the correct one, click "Check" and watch it get highlighted on the webpage. If you need to re-select the "Next" button, click "Edit" and follow the instructions on "Tips".
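The pagination option boils down to a simple loop: process a page, look for a "Next" link, follow it, and stop when there is none. The Python sketch below shows that idea using only the standard library. It is not how Octoparse is implemented; the `example.test` URLs, the in-memory `FAKE_SITE` and the names `NextLinkFinder`, `find_next_url` and `crawl_pages` are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first <a> whose visible text is 'Next'."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the <a> currently open, if any
        self.current_text = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip().lower()
            if text == "next" and self.next_href is None:
                self.next_href = self.current_href
            self.current_href = None


def find_next_url(page_url, html):
    """Return the absolute URL of the 'Next' link, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    return urljoin(page_url, finder.next_href) if finder.next_href else None


def crawl_pages(start_url, fetch, max_pages=50):
    """Follow 'Next' links from start_url; fetch(url) must return the page HTML."""
    visited = []
    url = start_url
    # Stop on a missing link, a repeated page, or the safety limit.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = find_next_url(url, fetch(url))
    return visited


# A tiny in-memory "website" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/list": '<a href="?page=2">Next</a>',
    "http://example.test/list?page=2": '<a href="?page=3">Next</a>',
    "http://example.test/list?page=3": '<a href="/about">About</a>',
}

pages = crawl_pages("http://example.test/list", FAKE_SITE.__getitem__)
print(pages)
```

Matching on the link text is the fragile part, which is why it helps to confirm the detected button before running a task: a wrong "Next" target would send the whole loop to the wrong pages.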
5. Create workflow
After confirming the settings, click "Create workflow".
Octoparse will generate a workflow automatically based on the data detected and the saved settings. You can choose to run the task now or edit the workflow manually.
To learn how to optimize the task workflow, continue to >> Lesson 2: Optimize your task