Scrape data on Instagram
There are two ways to scrape Instagram with Octoparse. You can build a scraping task using Advanced Mode or use our pre-built template for Instagram. The template helps you fetch data in no time, while building a fresh task provides the flexibility to extract any data needed from the web page.
You can quickly access the various pre-built templates by going to Task Templates on the main screen of the Octoparse App. This tutorial, however, will focus on how to build a new task to scrape the data needed from Instagram with Advanced Mode.
[Download demo task file here]
For this example, we are going to scrape the post content, date, image URL, number of likes, and location from Instagram.
The main steps are:
- "Go To Web Page" - load the target web page
- Create a pagination loop - scrape data from multiple posts
- Extract data - select the data for extraction
- Reformat data using RegEx (Optional)
- Use XPath to select data (Optional)
- Run the task and get data
1. "Go To Web Page" - load the target web page
- Open the Octoparse App and create a new task with Advanced Mode.
- Paste the URL https://www.instagram.com/izkiz/ (or the URL of any other Instagram page that lists posts) into the Input URL box and click Save URL to proceed.
- Click Setting and change the default built-in browser to Firefox 45.0.
Octoparse offers several versions of the built-in browser to support web pages of all kinds. Whenever a web page does not display correctly in the built-in browser (which is the case with Instagram), try the other browsers to make it work. If you use Octoparse 7.2.2, save the task before modifying the settings.
- Click Save to apply the new setting.
2. Create a pagination loop to scrape data from multiple posts
Because of the way Instagram is structured, the ">" button, which works as the "Next page" button and takes us to the next post, only appears after we click one of the posts.
- Click the first post and click the "A" tag at the bottom of "Action Tips"
When you select an item that has a URL, the selected tag is "A". Normally there is no need to modify it, as Octoparse automatically identifies the tags of selected items. In this case, however, we need to revise the tag at the bottom of "Action Tips".
- Select "Click the link"
- Uncheck "Auto retry when no response"
- Check "Load the page with AJAX" and set "AJAX Timeout" to 5 seconds
The first post is now open. Because Instagram loads its content with AJAX, we need to set up AJAX Load for the "Click Item" action.
Now, we can create “Pagination”.
- Click the ">" button
- Click "Loop click next page" on the "Action Tips" panel
- Check "Load the page with AJAX" and set "AJAX Timeout" to 5 seconds
Instagram uses AJAX on the ">" button, so we need to set up AJAX Load for the "Click to Paginate" action as well.
Tips! To learn more about dealing with AJAX in Octoparse, please refer to Deal with AJAX. You can also watch the video tutorial Octoparse: AJAX 101.
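The order of actions this section sets up (open the first post, then keep clicking ">" until there is no next post) can be pictured as a plain loop. The sketch below is only a conceptual illustration in Python, not how Octoparse is implemented. The `POSTS` dictionary, the post ids, and the functions `extract`, `click_next`, and `run_loop` are all hypothetical stand-ins for a real page.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a profile: each post points to the next one,
# the way the ">" button leads from one opened post to the following post.
POSTS: Dict[str, dict] = {
    "p1": {"content": "first post", "likes": 10, "next": "p2"},
    "p2": {"content": "second post", "likes": 25, "next": "p3"},
    "p3": {"content": "third post", "likes": 7, "next": None},
}


def extract(post_id: str) -> dict:
    """Stand-in for the data extraction step: read fields from the opened post."""
    post = POSTS[post_id]
    return {"content": post["content"], "likes": post["likes"]}


def click_next(post_id: str) -> Optional[str]:
    """Stand-in for clicking ">": return the next post, or None on the last post."""
    return POSTS[post_id]["next"]


def run_loop(first_post_id: str, max_posts: Optional[int] = None) -> List[dict]:
    """Start from the first post, extract, then paginate until no next post exists."""
    rows: List[dict] = []
    current: Optional[str] = first_post_id
    while current is not None:
        rows.append(extract(current))
        if max_posts is not None and len(rows) >= max_posts:
            break
        current = click_next(current)
    return rows


rows = run_loop("p1")
print(len(rows))
```

The loop starts from the first post, so every post is visited exactly once. The next section returns to the first post for the same reason before adding the extraction step.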
3. Extract data - select the data for extraction
We are now on the second post. When creating a "Loop Item", we should always start with the first item on the first page. In this case, we should go back to the first post.
- Click "Go To Web Page" in the workflow
- Click "Click Item"
Octoparse will open the first post.
- Click the "Pagination" loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the "Extract data" step at the appropriate position in the workflow.
Now, let’s start to extract data.
- Select the data you want
- Click "Extract text of the selected element" on the "Action Tips" panel
- For the URL of the image, click "Extract URL of the image"
- For the post date, select "Extract outer HTML of the selected element"
- Rename the fields by selecting a name from the predefined list or entering your own
- Click "OK" to save
Tips: To learn more about how to adjust the workflow, please refer to Getting to know Octoparse.
4. Reformat data using RegEx (Optional)
When extracting the post date, we may find that some values are shown in different formats, such as "3 days ago". To unify the format, we can use "Refine extracted data" to pull a consistent date out of the outer HTML extracted for this field.
- Select the "post_date" field to be modified
- Click the icon of "Customize data field"
- Select "Refine extracted data", click "Add step", and then select "Match with Regular Expression"
- Select "Try RegEx Tool"
- Check the box for "Start With" and enter: title="
- Check the box for "End With" and enter: " datetime
- Click "Generate" and "Match"
- Click "Apply" and "OK"
- Click "OK" to save
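To see what the "Start With" and "End With" settings do, here is a minimal Python sketch of the same idea: keep only the text between `title="` and `" datetime`. The sample HTML string and the helper name `between` are assumptions made for illustration. The real outer HTML on Instagram may look different, and the expression generated by the RegEx Tool may not be identical to this one.

```python
import re
from typing import Optional


def between(text: str, start: str, end: str) -> Optional[str]:
    """Return the shortest text found between `start` and `end`, or None if absent."""
    # re.escape makes the quotes and any special characters match literally.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Hypothetical outer HTML of a post date element (assumed shape, not copied from Instagram).
sample_html = (
    '<time class="example" title="Mar 1, 2019" '
    'datetime="2019-03-01T08:30:00.000Z">3 days ago</time>'
)

post_date = between(sample_html, 'title="', '" datetime')
print(post_date)  # prints: Mar 1, 2019
```

The visible text of the element is "3 days ago", while the value taken from the `title` attribute is a full date. This is why the tutorial extracts the outer HTML for this field instead of the text.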
5. Use XPath to select data (Optional)
If we check the extracted data in the workflow manually, we may find that the "Location" and "URL" fields are blank, which means the data is missing. To fix this, we need to go back to the source code of the Instagram page and find the XPath expressions for these elements.
- Click the row of "URL" to modify the data field
- Click the icon of "Customize data field"
- Click "Customize XPath"
- Paste the revised XPath into the "Matching XPath" text box and click "OK"
- Revised XPath: //img[@class="FFVAD"]
Follow the above steps to revise the XPath of the data field of "Location".
- Paste the revised XPath into the "Matching XPath" text box and click "OK"
- Revised XPath: //a[@class="O4GlU"]
Tips! To improve the accuracy of a certain data field, modifying XPath in Octoparse is highly recommended.
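The two revised XPath expressions select elements by their class attribute. The sketch below shows the same kind of selection with Python's standard `xml.etree.ElementTree` module. The sample markup, the URLs, and the location name are invented for illustration. Only the class names `FFVAD` and `O4GlU` come from this tutorial. ElementTree supports a limited subset of XPath and does not accept an absolute path on an element, so the expressions are written as `.//` here and not as `//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one opened post (not real Instagram source).
sample = """
<article>
  <a class="O4GlU" href="/explore/locations/0/example/">Example City</a>
  <img class="FFVAD" src="https://example.com/photo.jpg" />
  <img class="other" src="https://example.com/avatar.jpg" />
</article>
"""

root = ET.fromstring(sample)

# Equivalent of //img[@class="FFVAD"]: the first <img> whose class is FFVAD.
image = root.find(".//img[@class='FFVAD']")
# Equivalent of //a[@class="O4GlU"]: the first <a> whose class is O4GlU.
location = root.find(".//a[@class='O4GlU']")

image_url = image.get("src") if image is not None else None
location_name = location.text if location is not None else None

print(image_url)
print(location_name)
```

Matching on the class attribute skips the second `<img>`, which has a different class. Selectors like these depend on the site's current markup, so they may need updating if the class names change.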
6. Run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only).
For a premium user, Cloud Extraction is highly recommended.
Now, we have all the extracted data.
Author: Vanny
Editor: Fergus