In this tutorial, we will show you how to scrape product data from Flipkart.com.
To follow along, you may want to use this URL in the tutorial:
We will scrape data such as the product title, rating, reviews, and price from the product details page with Octoparse.
This tutorial will also cover:
- Dealing with AJAX for pagination
- Modifying XPath
Here are the main steps in this tutorial: [Download task file here]
- Go to Web page - open the targeted web page
- Create a pagination loop - scrape all the results from multiple pages
- Create a "Loop Item" - loop click into each item on each list
- Extract data - select the data for extraction
- Customize the data field by modifying XPath - improve the accuracy of a certain data field (Optional)
- Start extraction - run the task and get data
1. Go to Web page - open the targeted web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, like flipkart.com, we strongly recommend starting your data extraction project in Advanced Mode.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Click the "Next" button
- Click "Loop click next page" on the "Action Tips" panel
- Set up AJAX Load for the "Click to paginate" action
Flipkart.com uses AJAX for its pagination button, so the results update without a full page reload. Therefore, we need to set up AJAX Load in the "Click to paginate" step.
- Uncheck the box for "Retry when page remains unchanged (use discreetly for AJAX loading)"
- Check the box for "Load the page with AJAX" and set the AJAX Timeout. 2-4 seconds usually works; in this case, we set it to "3" seconds
- Click "OK" to save
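To illustrate why a timeout is needed here: with AJAX pagination there is no full page load to wait for, so a scraper waits a bounded amount of time for the content to change. The Python sketch below shows only that general idea, not how Octoparse works internally. The names `wait_for_new_content` and `FakeAjaxPage` are hypothetical and made up for this example.

```python
import time
from typing import Callable


def wait_for_new_content(
    read_content: Callable[[], str],
    previous: str,
    timeout: float = 3.0,
    poll_interval: float = 0.1,
) -> bool:
    """Poll until the content differs from `previous` or `timeout` seconds pass.

    Returns True if new content appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_content() != previous:
            return True
        time.sleep(poll_interval)
    # One last check once the deadline has passed.
    return read_content() != previous


class FakeAjaxPage:
    """Hypothetical stand-in for a page whose list is swapped in after a delay."""

    def __init__(self, delay: float) -> None:
        self._ready_at = time.monotonic() + delay

    def read(self) -> str:
        if time.monotonic() >= self._ready_at:
            return "page 2 items"
        return "page 1 items"


# The new content arrives after 0.2 s, well inside a 3 s timeout.
page = FakeAjaxPage(delay=0.2)
loaded = wait_for_new_content(page.read, previous="page 1 items", timeout=3.0)
```

If the timeout is shorter than the time the content takes to arrive, the wait gives up and reports that nothing changed, which is why a slow connection may call for a longer AJAX Timeout.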
For more about dealing with AJAX in Octoparse:
3. Create a "Loop Item" - loop click into each item on each list
- Select the 1st and 2nd items in the built-in browser, and keep selecting items until the "Action Tips" panel suggests "Loop click each element"
- Click "Loop click each element" to create a "Loop Item"
Octoparse will click through each link captured in the "Loop Item", and open the detail page of the 1st item.
- Uncheck "Retry when page remains unchanged"
- Set "Wait before execution" to "3" seconds
- Check the box for "Load the page with AJAX" and set "AJAX Timeout" according to your network conditions (long enough for the page to load); in this case, we set "7s" for demonstration (optional)
- Click "save"
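The workflow built so far is a loop inside a loop: the pagination loop moves through the result pages, and the "Loop Item" opens each product on the current page. A minimal Python sketch of that structure follows. It uses made-up in-memory data instead of a real website, and `RESULT_PAGES`, `DETAIL_PAGES`, and `scrape_all` are hypothetical names for illustration only.

```python
from typing import Dict, List, Optional

# Made-up stand-in data: each result page lists product links and
# points to the next page (None means there is no "Next" button).
RESULT_PAGES: List[Dict] = [
    {"links": ["/item/a", "/item/b"], "next": 1},
    {"links": ["/item/c"], "next": None},
]

# Made-up detail pages, keyed by product link.
DETAIL_PAGES: Dict[str, Dict[str, str]] = {
    "/item/a": {"title": "Sample product A"},
    "/item/b": {"title": "Sample product B"},
    "/item/c": {"title": "Sample product C"},
}


def scrape_all(
    pages: List[Dict], details: Dict[str, Dict[str, str]]
) -> List[Dict[str, str]]:
    """Visit every result page, then every item on that page."""
    rows: List[Dict[str, str]] = []
    page_index: Optional[int] = 0
    while page_index is not None:           # pagination loop
        page = pages[page_index]
        for link in page["links"]:          # "Loop Item"
            rows.append(details[link])      # open the detail page and extract
        page_index = page["next"]           # click "Next" if it exists
    return rows


rows = scrape_all(RESULT_PAGES, DETAIL_PAGES)
titles = [row["title"] for row in rows]
```

The order of nesting matters: the item loop sits inside the pagination loop, so every item on a page is visited before moving to the next page.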
4. Extract data - select the data for extraction
After you click "Loop click each element", Octoparse will open the detail page of the first item.
- Click on the data you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
- Rename the data fields by selecting a name from the predefined list or typing your own
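Conceptually, this step maps each named data field to the text of one element on the detail page. The Python sketch below illustrates that mapping with the standard library's `xml.etree.ElementTree`. The markup, the `id` attributes, and the values are all made up for this example and do not reflect Flipkart's real page structure; `FIELD_PATHS` and `extract_row` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Made-up, well-formed fragment standing in for a product detail page.
SAMPLE_DETAIL_PAGE = """
<div>
  <h1 id="title">Sample <b>Phone</b> X</h1>
  <span id="rating">4.4</span>
  <span id="reviews">1,234 Reviews</span>
  <div id="price">12,999</div>
</div>
"""

# Field name -> path of the element holding that field (hypothetical paths).
FIELD_PATHS: Dict[str, str] = {
    "Title": ".//h1[@id='title']",
    "Rating": ".//span[@id='rating']",
    "Reviews": ".//span[@id='reviews']",
    "Price": ".//div[@id='price']",
}


def element_text(element: ET.Element) -> str:
    """Return the element's text, including nested tags, with tidy spacing."""
    return " ".join("".join(element.itertext()).split())


def extract_row(page_source: str, field_paths: Dict[str, str]) -> Dict[str, str]:
    """Build one row of data: one value per named field."""
    root = ET.fromstring(page_source)
    row: Dict[str, str] = {}
    for name, path in field_paths.items():
        element = root.find(path)
        # A field whose element is missing is left empty rather than failing.
        row[name] = element_text(element) if element is not None else ""
    return row


row = extract_row(SAMPLE_DETAIL_PAGE, FIELD_PATHS)
```

Each detail page visited by the loop produces one such row, with the field names acting as the column headers.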
5. Customize the data field by modifying XPath - improve the accuracy of a certain data field (Optional)
- Select the data field for "Price"
- Click the small icon of "Customize data field" and select "Customize XPath"
- Paste "//div[@class="_1vC4OE _3qQ9m1"]" into the "Matching XPath" box
- Click "OK" to save
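To see what this XPath selects, the Python sketch below runs the same predicate against a small made-up fragment using the standard library's `xml.etree.ElementTree`. The sample markup and values are invented for illustration. ElementTree supports only a subset of XPath and requires a relative path, so `.//div` stands in for `//div` here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for a product detail page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="_1vC4OE">999</div>
    <div class="_1vC4OE _3qQ9m1">12,999</div>
    <div class="other">Free delivery</div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# @class='...' compares the whole attribute value, so only the div whose
# class is exactly "_1vC4OE _3qQ9m1" matches; the first div does not.
matches = root.findall(".//div[@class='_1vC4OE _3qQ9m1']")
prices = [div.text for div in matches]
```

Because the predicate matches the full class string, it stops working if the site changes its class names, in which case the XPath needs to be updated.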
If you want to learn more about XPath and how to generate it, here are the related tutorials you might need:
6. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
Here is a sample of the extracted data for your reference.