Scrape product information from bukalapak
In this tutorial, we will show you how to collect product details on bukalapak.com with Octoparse.
You can also use the ready-made "Task Template" on the main screen of Octoparse: enter a few parameters and the task is ready to run. For further details, see: Task Templates
Task 1 scrapes the product page URL of each Huawei product, and Task 2 extracts the product title, price, and seller information from each product page. Splitting one task into two can improve the extraction speed to a certain degree, especially with Octoparse Cloud Extraction.
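To make the two-task idea concrete, here is a minimal Python sketch of the same split outside Octoparse. It is an illustration under stated assumptions, not how Octoparse works internally: the function names, the example.com URLs, and the fake fetch function are all hypothetical. It shows one plausible reason the split helps: once the URL list exists, each detail page can be handled independently, so the second stage can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_detail_urls(result_pages: List[List[str]]) -> List[str]:
    """Task 1: gather the detail page URLs found on every search result page."""
    return [url for page in result_pages for url in page]


def collect_details(
    urls: List[str],
    fetch_detail: Callable[[str], Dict[str, str]],
    workers: int = 4,
) -> List[Dict[str, str]]:
    """Task 2: open each detail URL on its own; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_detail, urls))


def fake_fetch_detail(url: str) -> Dict[str, str]:
    """Hypothetical stand-in for opening a product page (no network access)."""
    return {"url": url, "title": "Product at " + url}


# Hypothetical search results: two pages with two product links each.
pages = [
    ["https://example.com/p/1", "https://example.com/p/2"],
    ["https://example.com/p/3", "https://example.com/p/4"],
]
urls = collect_detail_urls(pages)
records = collect_details(urls, fake_fetch_detail)
print(len(urls), "URLs ->", len(records), "records")
```

In a real scraper, `fake_fetch_detail` would be replaced by code that downloads and parses a page; the two-stage shape stays the same.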
To follow through, you may want to use this URL in the tutorial:
Here are the main steps in this tutorial [Download demo task file here]
Task 1: Extract all the URLs of detail pages on the search result pages
- "Go to Web Page" - to open the target web page
- Create a pagination loop - to scrape all the results from multiple pages
- Loop extract detail page URL on each list - to select all the product URLs
- Save and start extraction - to run the task and get data
Task 2: Collect the product information from scraped URLs
- Input a batch of the scraped URLs - to loop open the detail pages
- Extract data - to select the data for extraction
- Save and start extraction - to run the task and get data
Task 1: Extract all the URLs of detail pages on the search result pages
1. "Go to Web Page" - to open the target web page
- Click "+ Task" to start a new task with Advanced Mode
- Paste the URL into the "Website" box
- Click "Save URL" to move on
2. Create a pagination loop - to scrape all the results from multiple pages
- Scroll down and click the "Next page" button on the web page
- Click "Loop click next page" on "Action Tips"
- Uncheck the "Auto Retry" option
- Set "AJAX Timeout" as 5s
- Click "OK" to have the step saved
Tips! The AJAX timeout can often serve as a page-load timeout for a Click action. For example, when a page keeps loading long after the data you need has appeared, the AJAX timeout tells Octoparse to move on to the next action once the set time is reached. If you want to learn more about AJAX, you can watch the video tutorial here.
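The idea behind such a timeout can be sketched in a few lines of Python. This is a generic "wait, but not forever" helper written for illustration; it is not Octoparse's implementation, and the names `wait_until`, `is_ready`, and `poll_s` are hypothetical. The 5-second default mirrors the value set in the step above.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_s: float = 0.1,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s seconds have passed.

    Returns True if the condition was met in time, False if the wait gave up.
    Either way the caller moves on to its next action.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    return False


# A page that never reports "fully loaded": the wait gives up after 0.3 s.
finished = wait_until(lambda: False, timeout_s=0.3, poll_s=0.05)
print("page finished loading:", finished)
```

The caller does not fail when the wait gives up; it simply continues, which matches the behavior described in the tip.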
3. Loop extract detail page URL on each list - to select all the product URLs
- Click any product title on the first page
The first 10 products are “promoted products”, which are not the products we want. Skip them and click the 11th product.
- Click "Select all" on the "Action Tips"
- Click "Extract URLs from the selected link"
- Rename the fields by selecting from the pre-defined list or typing your own names
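For readers who want to see what this step amounts to in code, here is a rough Python sketch using only the standard library. It rests on assumptions: the `product-title` class name and the sample markup are hypothetical and do not reflect the real page structure of bukalapak.com. The only detail taken from the tutorial is that the first 10 links are promoted products and should be skipped.

```python
from html.parser import HTMLParser
from typing import List


class ProductLinkParser(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, link_class: str) -> None:
        super().__init__()
        self.link_class = link_class
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.link_class in classes and attr.get("href"):
            self.links.append(attr["href"])


def extract_product_urls(
    html: str,
    link_class: str = "product-title",  # hypothetical class name
    promoted_count: int = 10,  # the tutorial skips the first 10 products
) -> List[str]:
    """Return product links in page order, without the promoted ones."""
    parser = ProductLinkParser(link_class)
    parser.feed(html)
    return parser.links[promoted_count:]


# Hypothetical result page with 12 product links.
sample_html = "".join(
    f'<a class="product-title" href="/p/{i}">Item {i}</a>' for i in range(1, 13)
)
product_urls = extract_product_urls(sample_html)
print(product_urls)
```

Skipping by position is fragile if the number of promoted products changes, so a real scraper would more likely filter on a marker in the markup, if one exists.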
4. Save and start extraction - to run the task and get data
- Click “Start Extraction” on the upper left side
- Select “Local Extraction” to run the task on your computer, or select “Cloud Extraction” to run the task in the Cloud (for premium users only)
Export the result after the data extraction has completed. The list of URLs will be used in Task 2.
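Before pasting the exported list into Task 2, it can be worth cleaning it up. The sketch below assumes the result was exported as CSV and that the URL field was renamed to `product_url`; both are assumptions for illustration, since the column name is whatever you chose when renaming the fields.

```python
import csv
import io
from typing import List


def load_urls(csv_text: str, column: str = "product_url") -> List[str]:
    """Read one column of URLs, dropping blanks and duplicates, keeping order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    urls: List[str] = []
    for row in reader:
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical export with a whitespace-only row and a repeated URL.
exported = "\n".join(
    [
        "product_url",
        "https://example.com/p/11",
        " ",
        "https://example.com/p/12",
        "https://example.com/p/11",
    ]
)
clean_urls = load_urls(exported)
print(clean_urls)
```

To read a real export, pass the file contents instead of the sample string, for example `load_urls(open("task1.csv", encoding="utf-8").read())`, where `task1.csv` is a placeholder file name.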
Task 2: Collect the product information from scraped URLs
1. Input a batch of the scraped URLs - to loop open the detail pages
- Click "+ Task" to start a task using Advanced Mode to build Task 2
- Input batch URLs
There are four ways to input URLs into a task/crawler. In this case, we will paste the URLs in directly. With this method, you should input fewer than 10,000 URLs.
To input more URLs, please refer to Batch URLs input for the other three ways, which accept up to one million URLs.
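If the URL list from Task 1 is longer than the paste limit, one option is to split it into several batches and run one task per batch. The following is a minimal sketch; the helper name `split_into_batches` is hypothetical, and the default of 9,999 is simply the largest size that stays under 10,000.

```python
from typing import List


def split_into_batches(urls: List[str], max_per_batch: int = 9_999) -> List[List[str]]:
    """Split a URL list into consecutive batches of at most max_per_batch items."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [urls[i:i + max_per_batch] for i in range(0, len(urls), max_per_batch)]


# Hypothetical list of 25,000 URLs.
all_urls = [f"https://example.com/p/{i}" for i in range(25_000)]
batches = split_into_batches(all_urls)
print([len(batch) for batch in batches])
```

Each batch can then be pasted into its own copy of Task 2.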
2. Extract data - to select the data for extraction
- Click on the data you need on the page
- Select "Extract text of the selected element" on the "Action Tips"
- Rename the fields
3. Save and start extraction - to run the task and get data
- Click “Start Extraction” on the upper left side
- Select “Local Extraction” to run the task on your computer, or select “Cloud Extraction” to run the task in the Cloud (for premium users only)
Related articles:
Scraping product details from Wayfair
Scrape product information from BestBuy
Article in Spanish: Scrape la información del producto de bukalapak
You can also read web scraping articles on the official website
Writer: Eric
Editor: Yina