In this tutorial, we will show you how to scrape product data from Flipkart.com.
To follow along, you may want to use this URL in the tutorial:
We will scrape data such as the product title, rating, reviews, and price from the product details page with Octoparse.
This tutorial will also cover:
- Dealing with AJAX for pagination
- Modifying XPath
Here are the main steps in this tutorial:
- Go to Web page - to open the targeted web page
- Create a pagination loop - to scrape all the results from multiple pages
- Create a "Loop Item" - to loop click into each item on each list
- Extract data - to select the data for extraction
- Customize the data field by modifying XPath - to improve the accuracy of a certain data field (Optional)
- Start extraction - to run the task and get data
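The steps above amount to a nested loop: an outer loop over result pages and an inner loop over the items on each page. The following is a minimal Python sketch of that control flow only. It uses made-up in-memory data rather than real Flipkart pages, and the names `PAGES` and `scrape_all` are hypothetical; they are not part of Octoparse.

```python
# Hypothetical stand-in for paginated search results: each inner list is one
# results page, each dict is the detail page behind one product link.
# All values are placeholders, not real product data.
PAGES = [
    [{"title": "Item A", "price": "100"}, {"title": "Item B", "price": "200"}],
    [{"title": "Item C", "price": "300"}],
]


def scrape_all(pages):
    """Visit every page (pagination loop), then every item on it (Loop Item)."""
    rows = []
    for page_number, page in enumerate(pages, start=1):  # pagination loop
        for item in page:                                 # Loop Item
            # "Extract data": copy the fields we care about into one output row.
            rows.append({
                "page": page_number,
                "title": item["title"],
                "price": item["price"],
            })
    return rows


rows = scrape_all(PAGES)
for row in rows:
    print(row)
```

In Octoparse the same structure appears in the workflow as a "Loop Item" nested inside the pagination loop, with the extraction step inside the inner loop.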
1. Go to Web page - to open the targeted web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, like flipkart.com, we strongly recommend starting your data extraction project in Advanced Mode.
- Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2. Create a pagination loop - to scrape all the results from multiple pages
- Click the "Next" button
- Click "Loop click next page" in the "Action Tips" panel
- Set up AJAX Load for the "Click to paginate" action
Flipkart.com uses AJAX for its pagination button, so clicking it updates the results without a full page reload. Therefore, we need to set up AJAX Load in the "Click to paginate" step.
- Uncheck the box for "Retry when page remains unchanged (use discreetly for AJAX loading)"
- Check the box for "Load the page with AJAX" and set the AJAX Timeout (2-4 seconds usually works)
- Click "OK" to save
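To see why an AJAX-loaded page needs a timeout, consider that no full page load happens, so a scraper has no "page finished loading" event to rely on and must decide for itself how long to wait for new content. The sketch below illustrates that idea in plain Python. It is not how Octoparse is implemented internally; `wait_for_change` and `FakePage` are hypothetical names invented for this illustration.

```python
import time


def wait_for_change(read_content, old_content, timeout=3.0, poll_interval=0.1):
    """Poll read_content() until it differs from old_content.

    Returns the new content, or None if nothing changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_content()
        if current != old_content:
            return current
        time.sleep(poll_interval)
    return None


class FakePage:
    """Hypothetical page whose content switches after a few reads,
    imitating results that arrive asynchronously after a click."""

    def __init__(self):
        self.reads = 0

    def read(self):
        self.reads += 1
        return "page 2" if self.reads >= 3 else "page 1"


page = FakePage()
result = wait_for_change(page.read, "page 1", timeout=3.0)

# A page that never changes runs into the timeout instead.
timed_out = wait_for_change(lambda: "same", "same", timeout=0.3)
```

A timeout that is too short risks moving on before the new results arrive, while one that is too long slows down every page turn, which is why a few seconds is a reasonable starting point.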
For more about dealing with AJAX in Octoparse:
3. Create a "Loop Item" - to loop click into each item on each list
- Select the first and second items in the built-in browser; the action panel will then suggest "Loop click each element"
- Click "Loop click each element" to create a "Loop Item"
Octoparse will click through each link captured in the "Loop Item", and open the detail page.
- Uncheck the box for "Retry when page remains unchanged"
- Check the box for "Load the page with AJAX" and set the "AJAX Timeout" according to your network conditions
- Click "save"
4. Extract data - to select the data for extraction
After you click "Loop click each element", Octoparse will open the detail page of the first product.
- Click on the data you need on the page
- Select "Extract text of the selected element" from the "Action Tips"
- Rename each data field by selecting a name from the pre-defined list or typing your own
5. Customize the data field by modifying XPath - to improve the accuracy of a certain data field (Optional)
- Select the data field for "Price"
- Click "Customize XPath" to set the XPath
- Paste //div[@class="_1vC4OE _3qQ9m1"] into the "Matching Xpath" box
- Click "OK" to save
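To illustrate what this XPath selects, here is a small Python sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The HTML fragment is a hypothetical, simplified stand-in with placeholder values; the real Flipkart markup is much larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Only the class names come from the
# tutorial's XPath; the structure and the values are placeholders.
FRAGMENT = """
<div class="product">
  <div class="_1vC4OE">999</div>
  <div class="_1vC4OE _3qQ9m1">1,499</div>
</div>
"""

root = ET.fromstring(FRAGMENT)

# ElementTree needs a relative path, so "//div" becomes ".//div".
# The predicate compares the whole class attribute string exactly.
matches = root.findall(".//div[@class='_1vC4OE _3qQ9m1']")
prices = [element.text for element in matches]
print(prices)
```

Because the predicate compares the full class string, the first `div`, which carries only `_1vC4OE`, is not matched. The same strictness means the XPath stops matching if the site renames or reorders these classes, in which case the expression has to be updated.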
6. Start extraction - to run the task and get data
- Click “Start Extraction” on the upper left side
- Select “Local Extraction” to run the task on your computer, or select “Cloud Extraction” to run the task in the Cloud (for premium users only)
Here is a sample of the extracted data for your reference.