Scrape product information from Tokopedia (Version 8)
In this tutorial, we will show you how to collect product information, such as product title, price, and rating, from Tokopedia (an Indonesian e-commerce site). There are two ways to get it done with Octoparse.
Option 1) Using Task Template
Octoparse provides pre-built templates for you to quickly extract product URLs and product details from Tokopedia. Simply enter the parameters as instructed and start getting data right away. There is no need to build the task on your own when you use the templates.
To access the Tokopedia templates, click on "More" on the homepage, then select the tab "Products". As you can see, there are two templates available for Tokopedia.
Generally speaking, the "Product URL" template captures the URLs of the product detail pages from search results, and the "Product data" template captures detailed product information based on the list of URLs previously captured. Select the "Product URL" template to get started. Check out this post for instructions on how to use a template.
Option 2) Build your own Tokopedia crawler
If you've tried the templates and they don't quite give you the information you need, or if the templates no longer work well, you can always set up your own crawler.
With Octoparse's auto-detect feature, building your own crawler is quite straightforward. Let's see how it is done step-by-step.
For this example, we'll build two scraping tasks, one to capture the product URLs from search results and a second one to fetch product details from each product page.
As a rule of thumb, if you need to extract a relatively large amount of data, especially from an eCommerce website, it is usually best to split the job into two tasks. Scraping from a URL list is more efficient when running in the Cloud. In addition, with the URL list at hand, you can check whether any products were left out.
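To illustrate the "left out" check, here is a minimal Python sketch that runs outside Octoparse. It compares the URL list from the first task with the rows produced by the second task. The field name `Page_URL` and the sample URLs are hypothetical; use whatever column in your own export holds the page address.

```python
def find_missing_urls(url_list, scraped_rows, url_field="Page_URL"):
    """Return URLs from the input list that have no matching scraped row.

    url_field is a hypothetical column name; adjust it to your export.
    """
    scraped = {
        row[url_field].strip()
        for row in scraped_rows
        if row.get(url_field)
    }
    return [url for url in url_list if url.strip() not in scraped]


# Hypothetical sample data, for illustration only.
urls = [
    "https://www.tokopedia.com/shop-a/usb-1",
    "https://www.tokopedia.com/shop-b/usb-2",
    "https://www.tokopedia.com/shop-c/usb-3",
]
rows = [
    {"Page_URL": "https://www.tokopedia.com/shop-a/usb-1", "Title": "USB 1"},
    {"Page_URL": "https://www.tokopedia.com/shop-c/usb-3", "Title": "USB 3"},
]

missing = find_missing_urls(urls, rows)
print(missing)
```

Any URL in the returned list can be fed back into the second task for a re-run.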
We'll use the search result URL below for the example.
https://www.tokopedia.com/search?st=product&q=usb
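If you want to run the same task for other keywords, the search URL can be generated rather than typed by hand. The sketch below rebuilds the example URL from its two query parameters, `st` and `q`, as they appear above. That other keywords follow the same pattern is an assumption, so check the generated URL in a browser first.

```python
from urllib.parse import urlencode

# Base address and parameter names are taken from the example URL above.
BASE_URL = "https://www.tokopedia.com/search"


def build_search_url(keyword):
    """Build a Tokopedia product search URL for the given keyword."""
    query = urlencode({"st": "product", "q": keyword})
    return f"{BASE_URL}?{query}"


print(build_search_url("usb"))
# prints https://www.tokopedia.com/search?st=product&q=usb
```

`urlencode` also escapes spaces and special characters, so a keyword such as "usb cable" becomes `q=usb+cable`.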
Task 1: Build a task to scrape the product URLs from the search result page
- "Go To Web Page" - open the target web page
- Build a "Loop Item" by using auto-detect web page data
- Create a pagination loop - scrape all data from multiple pages
- Drag the "Loop Item" into the "Pagination" if it is not in the right place
- Run the task on your device and wait for the task to finish
1. "Go To Web Page" - open the target web page
- Paste the URL into the box and click "Start" to move on
2. Build a "Loop Item" by using auto-detect web page data
- Click the "Auto-detect web page data" and wait for the detection to finish
- Modify the page scroll settings and click "Create workflow"
3. Create a pagination loop - scrape all data from multiple pages
- Scroll down to the bottom and click the ">" button on the web page
- Click "Loop click single button" on "Tips"
Tokopedia loads the next page through AJAX when the pagination button is clicked. Therefore, we need to set up an AJAX timeout.
- Set up "AJAX Timeout" (for demonstration, we set it to "10s")
- Set up "scroll down" for the "Click to Paginate" action
Tips! To learn more about AJAX, see the related Octoparse tutorials on the topic.
4. Drag the "Loop Item" into the "Pagination" if it is not in the right place
5. Run the task on your device and wait for the task to finish
6. Export the data into an Excel file
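The exported file becomes the input for the second task, so it can help to clean the URL column before pasting it in. Below is a minimal Python sketch, assuming the data was exported as CSV instead of Excel and that the links sit in a column named `Product_URL`. Both the file format and the column name are assumptions; adjust them to match your own export.

```python
import csv


def load_unique_urls(csv_path, url_column="Product_URL"):
    """Read one column of URLs from a CSV export, dropping blanks and duplicates.

    url_column is a hypothetical field name; use the one in your export.
    The original order of the URLs is preserved.
    """
    seen = set()
    urls = []
    # utf-8-sig tolerates a byte order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            url = (row.get(url_column) or "").strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Calling `load_unique_urls("tokopedia_urls.csv")` (a hypothetical file name) returns a list that can be joined with newlines and pasted into the URL box of the next task.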
Task 2: Scrape product details from each product page
- "Go To Web Page" - using the advanced mode
- Extract data from the web page
- Rename the data fields and click the OK button to save all the changes
- Run task on your device
1. "Go To Web Page" - using the advanced mode
- Start a new task with the "New+" button
- Copy the URLs from the data file we just exported, paste the URL list into the website box, and click "Save"
2. Extract data from the web page
- Click any text from the page and choose "Extract the text of the selected element"
Tips: To scrape other types of data from the page, see the related Octoparse tutorials.
3. Rename the data fields and click the OK button to save all the changes
4. Run task on your device
Here is the sample output.
Author: Lesley
Editor: Yina