The new tutorial about using templates in version 8 is available here.
In this tutorial, we will show you how to collect product information from Tokopedia (an Indonesian e-commerce site) with Octoparse.
You can also go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Tokopedia Template directly to save time. With this feature, there is no need to configure scraping tasks. For further details, see: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial.
We will scrape data such as the product title, price, image URL and more from the product details page with Octoparse.
To follow along, you may want to use the URL in this tutorial:
This tutorial will also cover:
- Modify XPath to locate the desired price data accurately
Here are the main steps in this tutorial [Download demo task file here]
- "Go To Web Page" - open the target web page
- Create a pagination loop - scrape all data from multiple pages
- Build a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Customize the data field by modifying XPath - improve the accuracy of a certain data field (Optional)
- Start extraction - run the task and get the data
1. "Go To Web Page" - open the target web page
- Create the task with "Advanced Mode"
- Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2. Create a pagination loop - scrape all data from multiple pages
- Scroll down and click the ">" button on the web page
- Click "Loop click next page" on "Action Tips"
Tokopedia uses AJAX for its pagination button, so clicking it loads new items without reloading the whole page. Therefore, we need to set up AJAX Load in the "Click to paginate" step.
- Uncheck "Auto Retry"
- Check "Load the page with AJAX"
- Set up "AJAX Timeout" (for demonstration, we set it to "3s")
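To see why an AJAX timeout matters, here is a minimal conceptual sketch in Python. It is not how Octoparse is implemented; `FakeListing`, `wait_for_change` and `scrape_all` are hypothetical names made up for illustration. The idea is that, because the page does not reload, the scraper has to wait for the content to change after each click and give up after the timeout.

```python
import time


def wait_for_change(read_content, previous, timeout=3.0, poll=0.05):
    """Poll until the content differs from `previous` or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_content()
        if current != previous:
            return current
        time.sleep(poll)
    return None  # nothing new arrived within the timeout


class FakeListing:
    """Hypothetical stand-in for a listing page that updates in place."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def click_next(self):
        # On the last page the click does nothing, as with a disabled ">" button.
        if self.index + 1 < len(self.pages):
            self.index += 1

    def content(self):
        return self.pages[self.index]


def scrape_all(listing, timeout=3.0):
    """Collect every page, stopping when a click loads nothing new in time."""
    results = [listing.content()]
    while True:
        before = listing.content()
        listing.click_next()
        after = wait_for_change(listing.content, before, timeout=timeout)
        if after is None:
            break
        results.append(after)
    return results


pages = scrape_all(FakeListing(["page 1", "page 2", "page 3"]), timeout=0.3)
print(pages)
```

A timeout that is too short risks moving on before the new items have loaded, while one that is too long slows down every page, which is why the value is left for you to set.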
If you want to learn more about AJAX, here are related tutorials you might need:
3. Build a "Loop Item" - loop click into each item on each list
- Click "Go To Web Page" in the workflow
We are now on the second page. When creating a "Loop Item", we should always start with the 1st item on the 1st page, so we need to go back to the 1st page.
- Select the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
When you build a list of items to scrape from a website, the list may sometimes include several "Ads" items. To exclude these promoted products, we can start building the Loop Item from the 3rd row on this page.
- Click the title of the 1st item on the 3rd row
- Click "Select All" on the "Action Tips" panel
- Select "Loop click each element"
In this case, we exclude the "Ads" items by skipping the first two rows. However, when "Ads" items appear within the product list itself, there is another way to exclude them.
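The alternative is not spelled out here, so the following Python sketch shows only one possible idea: filtering items by a visible "Ad" label instead of by row position. The markup, the `label` class and the `organic_titles` function are all assumptions made for illustration and do not reflect Tokopedia's real page structure or Octoparse's internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: the first two items carry an "Ad" label.
LISTING = """
<ul>
  <li><span class="label">Ad</span><a>Promoted phone case</a></li>
  <li><span class="label">Ad</span><a>Promoted charger</a></li>
  <li><a>Phone case</a></li>
  <li><a>Charger</a></li>
</ul>
"""


def organic_titles(listing):
    """Return the titles of items that do not carry an "Ad" label."""
    root = ET.fromstring(listing.strip())
    titles = []
    for item in root.findall("./li"):
        label = item.find("./span[@class='label']")
        if label is not None and label.text == "Ad":
            continue  # skip promoted items wherever they appear in the list
        titles.append(item.find("./a").text)
    return titles


print(organic_titles(LISTING))
```

Filtering on the label keeps working even when the promoted items are not confined to the first two rows.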
4. Extract data - select the data for extraction
- Click the information you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
- Rename the fields by selecting from the predefined list or entering your own names
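Conceptually, this step turns each product details page into one row of named fields. The sketch below illustrates that idea in Python with a made-up details page; the markup, the class names and the field names `title`, `price` and `image_url` are assumptions for illustration, not Tokopedia's actual markup or Octoparse's export format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical product details page.
DETAIL_PAGE = """
<div>
  <h1 class="title">Sample product</h1>
  <span class="price">Rp150.000</span>
  <img class="photo" src="https://example.com/sample.jpg" />
</div>
"""

FIELDS = ["title", "price", "image_url"]


def extract_fields(page):
    """Pick the text or attribute of each selected element and name the field."""
    root = ET.fromstring(page.strip())
    return {
        "title": root.find(".//h1[@class='title']").text,
        "price": root.find(".//span[@class='price']").text,
        "image_url": root.find(".//img[@class='photo']").get("src"),
    }


def to_csv(rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_fields(DETAIL_PAGE)
print(to_csv([row]))
```

Note that the image URL comes from an attribute rather than from the element's text, which is why it is handled differently from the title and price.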
5. Customize the data field by modifying XPath - improve the accuracy of a certain data field (Optional)
In this case, the price element is not always located in the same place on different detail pages. To avoid missing data caused by this inconsistent location, we need to modify the XPath in Octoparse so that the price element is detected precisely on every page.
- Click "Customize data field"
- Select "Customize XPath"
- Paste the revised XPath into the "Matching XPath" text box
- Click "OK" to save
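To illustrate why a revised XPath helps, here is a small Python sketch using two made-up detail pages. The markup, the `price` class and both XPath expressions are assumptions for illustration; they are not the actual XPath used for Tokopedia. A position-based XPath picks up the wrong element when the layout shifts, while an attribute-based XPath finds the price on both pages.

```python
import xml.etree.ElementTree as ET

# Two hypothetical detail pages: the price sits in a different position on each.
PAGE_A = (
    "<div><h1>Item A</h1>"
    "<div><span class='price'>Rp100.000</span></div></div>"
)
PAGE_B = (
    "<div><h1>Item B</h1>"
    "<div><span class='badge'>Promo</span></div>"
    "<div><span class='price'>Rp250.000</span></div></div>"
)

POSITIONAL_XPATH = "./div[1]/span"           # depends on where the element sits
ATTRIBUTE_XPATH = ".//span[@class='price']"  # depends on what the element is


def extract(page, xpath):
    """Return the text of the first element matching `xpath`, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


for name, page in (("A", PAGE_A), ("B", PAGE_B)):
    print(name, extract(page, POSITIONAL_XPATH), extract(page, ATTRIBUTE_XPATH))
```

On page B the positional XPath returns the promo badge instead of the price, which is the kind of mismatch a revised XPath is meant to prevent.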
To improve the accuracy of a certain data field, modifying XPath in Octoparse is highly recommended. Here are some related tutorials you might need:
6. Start extraction - run the task and get the data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
So now we have gone through all the steps to scrape data from Tokopedia. Here is the sample output.
Article in Spanish: Scrape los datos del producto de Tokopedia
You can also read web scraping articles on the official website