Now that you have downloaded Octoparse and learned the basics, it is time to start your own web scraping project!
Most websites (directories, e-commerce sites, real estate sites, etc.) share a similar layout, e.g. a page containing many items arranged in a list. Let's look at a few examples.
Octoparse's brand-new auto-detect algorithm is specially designed to scrape pages of this kind. It detects listing data (including text elements and links), "Next page" buttons, "Load more" buttons, and scroll-to-load behavior on a page, and then generates the scraping task automatically.
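To give a feel for what "detecting listing data" can mean, here is a minimal Python sketch of one common heuristic: find the element whose direct children repeat the same tag and class most often. This is only an illustration under our own assumptions, not Octoparse's actual algorithm. The sample HTML, the class name RepeatedChildFinder and the function detect_listing are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}

# Hypothetical listing page: three similar items plus a "Next" link.
SAMPLE = """
<div class="page">
  <ul class="products">
    <li class="product"><a href="/p/1">Item 1</a><span>$10</span></li>
    <li class="product"><a href="/p/2">Item 2</a><span>$12</span></li>
    <li class="product"><a href="/p/3">Item 3</a><span>$15</span></li>
  </ul>
  <a class="next" href="?page=2">Next</a>
</div>
"""


class RepeatedChildFinder(HTMLParser):
    """For every element, counts how often each (tag, class) child repeats."""

    def __init__(self):
        super().__init__()
        self.stack = [(0, ("root", ""))]   # open elements as (id, signature)
        self.signatures = {0: ("root", "")}
        self.child_counts = {}             # parent id -> Counter of signatures
        self.next_id = 1

    def handle_starttag(self, tag, attrs):
        signature = (tag, dict(attrs).get("class") or "")
        parent_id = self.stack[-1][0]
        self.child_counts.setdefault(parent_id, Counter())[signature] += 1
        if tag not in VOID_TAGS:
            self.signatures[self.next_id] = signature
            self.stack.append((self.next_id, signature))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag (never the root).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][1][0] == tag:
                del self.stack[i:]
                break


def detect_listing(html):
    """Return (count, parent signature, child signature) of the best list."""
    finder = RepeatedChildFinder()
    finder.feed(html)
    best = None
    for parent_id, counts in finder.child_counts.items():
        signature, count = counts.most_common(1)[0]
        if best is None or count > best[0]:
            best = (count, finder.signatures[parent_id], signature)
    return best


print(detect_listing(SAMPLE))
```

On the sample above, the sketch picks the ul element with class "products" as the list container, because its three li children share the same tag and class. A real detector has to cope with much messier pages, which is why it is worth checking the result in the preview.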
In this lesson, we will go through how to scrape webpage data by using the auto-detect algorithm.
STEP 1. Create a new task
Enter the sample URL (http://test-sites.octoparse.com/?product_cat=e-commerce-category-1) into the search box at the top of the home screen. Click Start to create a new task with Custom Mode.
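If you want to double-check a URL before pasting it in, the short Python sketch below splits the sample URL into its parts with the standard library. It is not part of Octoparse; the helper name describe_url is our own.

```python
from urllib.parse import parse_qs, urlsplit

SAMPLE_URL = "http://test-sites.octoparse.com/?product_cat=e-commerce-category-1"


def describe_url(url):
    """Split a URL into host, path and a flat dict of query parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, "query": query}


print(describe_url(SAMPLE_URL))
```

For the sample URL, the only query parameter is product_cat, whose value e-commerce-category-1 selects the category page that is scraped in this lesson.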
STEP 2. Get data via auto-detect
Octoparse will load the URL in its built-in browser and start the auto-detect process automatically. Wait until the process is complete and further information appears on the Tips panel.
TIP: If the data you need is not accessible upon page loading, check out this tutorial about how you can interact with the web page before getting data auto-detected.
STEP 3. Check the data
Once the auto-detection is completed, follow the instructions provided on the Tips panel and check your data in the preview section. You can rename the data fields or remove those that are not needed. The detected data will also be highlighted on the webpage for you.
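Renaming and removing fields is done in the preview section, but the same clean-up can also be applied to exported data. The Python sketch below shows the idea on a list of rows. The field names (Field1, Field2, Field3) and the helper tidy_fields are hypothetical and only stand in for whatever names appear in your own export.

```python
# Hypothetical rows, as they might look before the fields are renamed.
ROWS = [
    {"Field1": "Item 1", "Field2": "$10", "Field3": "/p/1"},
    {"Field1": "Item 2", "Field2": "$12", "Field3": "/p/2"},
]


def tidy_fields(rows, rename=None, drop=None):
    """Return new rows with some fields renamed and others removed."""
    rename = rename or {}
    drop = set(drop or [])
    return [
        {rename.get(key, key): value for key, value in row.items() if key not in drop}
        for row in rows
    ]


clean = tidy_fields(ROWS, rename={"Field1": "title", "Field2": "price"}, drop=["Field3"])
print(clean)
```

The original rows are left untouched, so you can compare the result with the raw data before deciding which fields to keep.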
STEP 4. Confirm your options
Now, go to the Tips panel and check your options. Based on the type of data detected, a number of options are provided for you to choose from. For this example, list data is detected so you are provided with the options to:
- Extract the data in the list - This option is selected by default, since it is most likely what you want.
- Paginate to scrape more pages - Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button and extract data from more pages.
TIP: To confirm that the detected button is the correct one, click Check and see whether it is highlighted on the webpage. If you need to re-select the "Next" button, click Edit and follow the instructions on the Tips panel.
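Conceptually, the two options above amount to a simple loop: extract the list on the current page, then follow the "Next" button until there is none. The Python sketch below illustrates that loop on a small in-memory stand-in for a website. It is not how Octoparse is implemented, and PAGES, scrape_all_pages and the page structure are assumptions made for the example.

```python
# Hypothetical site: each page holds its list items and the target of "Next".
PAGES = {
    "/list?page=1": {"items": ["Item 1", "Item 2"], "next": "/list?page=2"},
    "/list?page=2": {"items": ["Item 3", "Item 4"], "next": "/list?page=3"},
    "/list?page=3": {"items": ["Item 5"], "next": None},
}


def scrape_all_pages(start, fetch, max_pages=100):
    """Collect items page by page, following "next" until it runs out."""
    items = []
    seen = set()
    url = start
    # Stop on the last page, on a repeated URL, or at the page limit.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        items.extend(page["items"])   # "Extract the data in the list"
        url = page["next"]            # "Paginate to scrape more pages"
    return items


print(scrape_all_pages("/list?page=1", PAGES.__getitem__))
```

The seen set and the max_pages limit guard against a wrongly detected "Next" button that points back to an earlier page, which is the same problem the Check button helps you catch.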
STEP 5. Create workflow
After confirming the settings, click Create workflow.
Octoparse will generate a workflow automatically based on the detected data and the saved settings. You can choose to run the task now or edit the workflow manually.
Continue to >> Lesson 2: Optimize your task