Now that you’ve downloaded Octoparse on your device and learned about the basics, it’s time to start your own web scraping project!
Most websites (directories, e-commerce sites, real estate sites, etc.) share a similar layout: a page containing many items arranged in a list.
Octoparse's brand-new auto-detect algorithm is designed specifically for pages of this kind. It detects list data (including text elements and links), "Next page" buttons, "Load more" buttons, and scroll-to-load behavior, and then generates the scraping task automatically.
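To make the idea of "list data" more concrete, here is a minimal Python sketch of one common heuristic: look for the element whose direct children repeat the same tag and class most often. This is only an illustration of the general idea and is not Octoparse's actual algorithm, which is not documented here. The sample HTML, the class names, and the function name `guess_list_container` are all assumptions made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are counted but not kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RepeatedChildCounter(HTMLParser):
    """Records, for every element, the (tag, class) signatures of its direct children."""

    def __init__(self):
        super().__init__()
        self.stack = []   # indexes (into self.nodes) of the currently open elements
        self.nodes = []   # one (signature, Counter of child signatures) per element

    def handle_starttag(self, tag, attrs):
        signature = (tag, dict(attrs).get("class") or "")
        if self.stack:
            # Count this element as a child of the innermost open element.
            self.nodes[self.stack[-1]][1][signature] += 1
        if tag in VOID_TAGS:
            return
        self.nodes.append((signature, Counter()))
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]][0][0] == tag:
                del self.stack[i:]
                break


def guess_list_container(html):
    """Return (container signature, item signature, item count) for the most repetitive element."""
    parser = RepeatedChildCounter()
    parser.feed(html)
    parser.close()
    best = None
    for signature, children in parser.nodes:
        if not children:
            continue
        child_signature, count = children.most_common(1)[0]
        if best is None or count > best[2]:
            best = (signature, child_signature, count)
    return best


# Hypothetical product listing, loosely modeled on an e-commerce category page.
SAMPLE_HTML = """
<div class="page">
  <h1>Category 1</h1>
  <ul class="products">
    <li class="product"><a href="/item/1">Item 1</a><span class="price">$10</span></li>
    <li class="product"><a href="/item/2">Item 2</a><span class="price">$12</span></li>
    <li class="product"><a href="/item/3">Item 3</a><span class="price">$15</span></li>
  </ul>
  <a class="next" href="/page/2">Next</a>
</div>
"""

container, item, count = guess_list_container(SAMPLE_HTML)
print(container, item, count)
```

On a real page the markup is much noisier, so a production detector would need more signals than a simple repetition count.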
In this lesson, we will go through how to scrape webpage data using the auto-detect algorithm.
Octoparse Hello World provides a number of test sites to help you practice scraping data from different kinds of webpages.
1. Create a new task
Enter the example URL "http://test-sites.octoparse.com/?product_cat=e-commerce-category-1" into the search box at the center of the home screen, then click "Start" to create a new task in Advanced Mode.
2. Get data via auto-detect
Octoparse will load the URL in its built-in browser and start the auto-detect process automatically. Wait until the process completes and further information appears in "Tips".
If the data you need is not available as soon as the page loads, check out this tutorial on how to interact with the web page before running auto-detect.
3. Check the data
When auto-detection completes, follow the instructions in "Tips" and check your data in the preview section. You can rename data fields or remove any that you don't need. The detected data is also highlighted on the webpage.
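If you later post-process exported data in a script, the same rename-and-remove cleanup can be sketched in a few lines of Python. The field names (`Field1`, `Field2`, `Field3`), the sample rows, and the helper `tidy_rows` are hypothetical; they only stand in for whatever columns your own task produces.

```python
def tidy_rows(rows, rename=None, drop=None):
    """Return copies of the rows with some fields renamed and others removed."""
    rename = rename or {}
    drop = set(drop or ())
    return [
        {rename.get(key, key): value for key, value in row.items() if key not in drop}
        for row in rows
    ]


# Hypothetical rows with auto-generated field names.
rows = [
    {"Field1": "Item 1", "Field2": "$10", "Field3": "/item/1"},
    {"Field1": "Item 2", "Field2": "$12", "Field3": "/item/2"},
]

cleaned = tidy_rows(rows, rename={"Field1": "title", "Field2": "price"}, drop=["Field3"])
print(cleaned)
```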
4. Confirm your options
Now, go to "Tips" and review your options. Octoparse offers different options depending on the type of data detected. In this example, list data is detected, so the options are:
1. Extract the data in the list - This option is selected by default because Octoparse assumes it is what you need.
2. Click the "Next" button to capture multiple pages - Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button and extract data from more pages.
Hint: To check whether the detected button is the correct one, click "Check" and watch it get highlighted on the webpage. If you need to re-select the "Next" button, click "Edit" and follow the instructions in "Tips".
3. Click the "links" to capture data on the page that follows - Octoparse is asking whether you want to click the detected links and extract more information from the detail pages. Check this option if that is what you need.
Hint: To confirm that the links are the ones you'd like to click through, click "Check" to have them highlighted on the web page.
In this example, we only want to scrape the list information across all pages, so we'll check the first and second options.
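Conceptually, combining the first two options amounts to a loop: extract the list on the current page, follow the "Next" link, and repeat until there is no "Next" link left. The Python sketch below shows that loop under stated assumptions. The three in-memory pages, the `product` and `next` class names, and the function `scrape_all_pages` are hypothetical stand-ins so the example runs without network access, and none of this reflects how Octoparse builds its workflow internally.

```python
from html.parser import HTMLParser

# Hypothetical three-page site kept in memory.
PAGES = {
    "/page/1": '<ul><li class="product">Item 1</li><li class="product">Item 2</li></ul>'
               '<a class="next" href="/page/2">Next</a>',
    "/page/2": '<ul><li class="product">Item 3</li><li class="product">Item 4</li></ul>'
               '<a class="next" href="/page/3">Next</a>',
    "/page/3": '<ul><li class="product">Item 5</li></ul>',
}


class ListPageParser(HTMLParser):
    """Collects the text of each list item and the href of the "Next" link, if any."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_url = None
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._in_item = True
            self.items.append("")
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Extract list items page by page, following "Next" links until none remain."""
    rows, url, seen = [], start_url, set()
    # The `seen` set and `max_pages` cap guard against pagination loops.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = ListPageParser()
        parser.feed(fetch(url))
        parser.close()
        rows.extend(item.strip() for item in parser.items)
        url = parser.next_url
    return rows


all_items = scrape_all_pages("/page/1", PAGES.__getitem__)
print(all_items)
```

Passing `fetch` as a parameter keeps the loop separate from how pages are loaded, so the same function could be pointed at a real HTTP client.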
5. Save task settings
After confirming the settings, click "Save Settings".
Octoparse generates a workflow automatically based on the detected data and the saved settings. You can run the task right away or edit the workflow manually.
To learn how to optimize the task workflow, continue to >> Lesson 2: Optimize your task
Article in Spanish: Lección 1: Extraer datos con el nuevo algoritmo de Auto-detect
You can also read web scraping articles on the official website