You are browsing a tutorial for the latest version of Octoparse. If you are running an older version, we strongly recommend upgrading, as the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

Bol is a leading e-commerce platform in the Netherlands. As the strongest retail brand in the Netherlands, the site serves large numbers of both shoppers and merchants.

In this tutorial, we will scrape product information from Bol with Octoparse; the data can be useful to both buyers and sellers. We use the search results page for AirPods as the example.

info.jpg

To follow along with the tutorial, use the following URL:

https://www.bol.com/nl/nl/s/?searchtext=airpod
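If you want to scrape results for a different keyword, the search URL above can be rebuilt by changing its searchtext parameter. The following is a minimal Python sketch of that idea; the helper name build_search_url is hypothetical, and it assumes the site keeps accepting the same base path and query parameter shown in the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base path taken from the tutorial URL; assumed to stay stable.
BASE = "https://www.bol.com/nl/nl/s/"


def build_search_url(keyword):
    """Return a search URL for the given keyword (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the keyword.
    return f"{BASE}?{urlencode({'searchtext': keyword})}"


url = build_search_url("airpod")
print(url)
# Parse the query string back to confirm which keyword the URL carries.
print(parse_qs(urlsplit(url).query))
```

The resulting URL can be pasted into the Octoparse search box in the same way as the one above.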

Here are the main steps in this tutorial: [Download task file here]

  1. Enter the URL on the home page - to open the target website

  2. Start auto-detection - to generate a workflow

  3. Add pagination - to get more results for similar products

  4. Modify XPath for Pagination - to locate the next page button accurately

  5. Clean data - to get the correct format of the number

  6. Run the task - to get the desired data


1. Enter the URL on the home page - to open the target website

To start scraping, first enter the URL of the target website.

  • Enter the search URL into the search box at the center of the home screen. Click Start to create a new task with Advanced Mode

bol.jpg
  • Click Accepteren to set cookies

cookie.jpg
  • Click Click element in the Tips box to finish the cookie settings

click_element.jpg

2. Start auto-detection - to generate a workflow

Octoparse's built-in auto-detect function can generate a workflow automatically. You can then modify the workflow as needed.

  • Click Auto-detect web page data in the Tips box and wait for the detection to complete

detec.jpg
  • Check the data fields in Data Preview and delete unwanted data or rename them if needed

data_preview.jpg
  • Untick Click on a "Load More" button because there is no load more button on this page

UNTICK.jpg
  • Click Create workflow

CREATE.jpg

The workflow is then generated as shown below:

WORKFLOW.jpg

3. Add pagination - to get more results for similar products

If the auto-detect function fails to generate pagination automatically, we need to add pagination manually.

  • Click > at the bottom of the page

  • Then click Loop click single URL to set the pagination

4.jpg

4. Modify XPath for Pagination - to locate the next page button accurately

To make sure pagination works correctly, an accurate XPath for the next page button is essential.

  • Click Pagination in the workflow

  • Choose General settings

  • Input //ul[@class="pagination"]/li[3]/a in the blank box

  • Click Apply to apply the setting

PAGINATION.jpg
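To see what this XPath selects, here is a minimal Python sketch that evaluates it against a small sample. The markup in SAMPLE is hypothetical and simplified, not Bol's actual page source, and the function name find_next_link is made up for illustration. Because the standard library's ElementTree supports only a subset of XPath, the expression is written with a leading ".//" instead of "//".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (not the real page source).
SAMPLE = """
<div>
  <ul class="pagination">
    <li><a href="?page=1">1</a></li>
    <li><a href="?page=2">2</a></li>
    <li><a href="?page=2">next</a></li>
  </ul>
</div>
"""


def find_next_link(markup):
    """Return the <a> inside the third <li> of the pagination list, or None."""
    root = ET.fromstring(markup.strip())
    # Same selection as //ul[@class="pagination"]/li[3]/a in the tutorial.
    matches = root.findall(".//ul[@class='pagination']/li[3]/a")
    return matches[0] if matches else None


link = find_next_link(SAMPLE)
print(link.text, link.get("href"))
```

Note that li[3] is positional: it selects whichever list item comes third, so the XPath only hits the next page button while the page keeps that layout.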

5. Clean data - to get the correct format of the number

As shown in the data preview, the price extracted from the page is missing its decimal point ("."). We can add a clean data step to fix this.

  • Click on More (...) of the Price column

  • Click Clean Data

clean_data.jpg
  • Click Add Step

  • Click Trim spaces

add_steps.jpg
  • Click Trim Both to trim the spaces both before and after the number

  • Click Confirm

TRIM.jpg
  • Click Add Step again

  • Click Replace with Regular Expression this time

2.jpg
  • Enter \n in the Regular Expression column

  • Enter . in the With column

  • Click Confirm to save the settings

3_.jpg
  • Click Apply to apply the formula

SAVE.jpg

NOTE: The regular expression entered here replaces each line break (\n) with ".". For more tutorials on RegEx, check here.
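The two cleaning steps above (trim, then replace the line break) can be sketched in Python as follows. This only illustrates the logic and is not how Octoparse implements it; the function name clean_price and the sample value are hypothetical.

```python
import re


def clean_price(raw):
    """Trim surrounding whitespace, then replace each line break with '.'."""
    # Step 1: equivalent of Trim Both. Note that strip() also removes
    # leading and trailing line breaks, not only spaces.
    trimmed = raw.strip()
    # Step 2: equivalent of Replace with Regular Expression (\n -> .).
    return re.sub(r"\n", ".", trimmed)


# Hypothetical raw value: euros and cents separated by a line break.
print(clean_price("  129\n99 "))  # prints 129.99
```

Trimming first matters: a line break at the very start or end of the raw value would otherwise be turned into a stray ".".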


6. Run the task - to get the desired data

  • Click the Save button first to save all the settings you have made

  • Then click Run to run your task either locally or in the cloud

mceclip8.png
  • Select Run on your device and click Run Now to run the task on your local device

  • Wait for the task to complete

mceclip9.png

Below is sample data from the local run. Excel, CSV, HTML, and JSON formats are available for export.

DATA_OUTPUT.jpg
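Once the data is exported, it can be loaded for further analysis. The sketch below reads a CSV export with Python's standard library and converts the cleaned price to a number. The column names Title and Price and the rows in SAMPLE_EXPORT are made up for illustration; your export will use the field names you set in Data Preview.

```python
import csv
import io

# Made-up sample standing in for an exported CSV file.
SAMPLE_EXPORT = "Title,Price\nExample product A,129.99\nExample product B,24.50\n"


def load_prices(csv_text):
    """Return (title, price) pairs, with price converted to float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # float() only succeeds once the clean data step has restored the ".".
    return [(row["Title"], float(row["Price"])) for row in reader]


for title, price in load_prices(SAMPLE_EXPORT):
    print(title, price)
```

For a real file, replace io.StringIO(csv_text) with an opened file object, for example open("export.csv", newline="", encoding="utf-8"), where the file name is again an assumption.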

TIP: Local runs are great for quick runs and small amounts of data. If you are dealing with more complicated tasks or large amounts of data, Run in the Cloud is recommended for higher speed. You are welcome to try this premium feature by signing up for the 14-day free trial here. Tasks can be scheduled hourly, daily, or weekly, with data delivered regularly.
