Scrape posts from Facebook public pages (V8.4)
As one of the earliest social networking websites, Facebook stands apart from its competitors in its overwhelming popularity. Its wealth of user-generated data in the form of posts, comments, and replies lets scrapers identify popular topics and assess public sentiment.
Due to popular demand, the Octoparse team has prepared a series of tutorials for users who need Facebook data. This tutorial is the fourth in that series.
In this article, we are going to scrape from a Facebook public profile page.
You may need to use the URL below to follow through:
Here are the major steps of the task workflow. You can download the task file at the bottom of the page.
- Create a Go to Web Page - to open the target website
- Log into Facebook - to save the cookies for authentication
- Auto-detect web page - to create the basic workflow
- Modify the XPath for the Loop Item
- Click on "See more" - to reveal the complete post content
- Run your task - to get the data you want
1. Create a Go to Web Page - to open the target website
- Enter the URL on the home page and click Start
Octoparse will automatically load the page in the built-in browser and you will find the login page.
2. Log into Facebook - to save the cookies for authentication
- Switch to Browse mode by clicking on
- Fill in the login form with your username and password and click "Log In"
- Turn off Browse mode
- Go to the settings of the Go to Web Page and save cookies
Tip: If you want to log in to see more information, or you find that the login steps need to be included in the workflow for the task to run successfully, follow this tutorial on logging in to a website in Octoparse: Scrape data behind a login
3. Auto-detect the web page - create the basic workflow
- Click Auto-detect web page data on the Tips panel
- Click Edit under Add a page scroll
- Set the scroll to go to the bottom of the page, repeat it 20 times, and use a wait time of 5s
- Rename or delete fields in the Data preview if needed
- Click on Create workflow
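For readers who want to see what the scroll settings amount to, here is a minimal Python sketch of the same idea: scroll to the bottom, wait, and repeat a fixed number of times so that more posts can load. It is not Octoparse code. The `scroll_page` function and the `FakeFeed` class are hypothetical names made up for this illustration, and the "3 posts per scroll" figure is an assumption, not a measured Facebook behavior.

```python
import time


def scroll_page(scroll_to_bottom, repeats=20, wait_seconds=5.0, sleep=time.sleep):
    """Call scroll_to_bottom `repeats` times, pausing after each call.

    The pause gives lazily loaded content time to appear before the next scroll.
    `sleep` is injectable so the loop can be exercised without really waiting.
    """
    for _ in range(repeats):
        scroll_to_bottom()
        sleep(wait_seconds)


class FakeFeed:
    """Hypothetical stand-in for a page that loads 3 more posts per scroll."""

    def __init__(self):
        self.loaded_posts = 3

    def scroll_to_bottom(self):
        self.loaded_posts += 3


feed = FakeFeed()
waits = []  # records each requested wait instead of sleeping
scroll_page(feed.scroll_to_bottom, repeats=20, wait_seconds=5.0, sleep=waits.append)
print(feed.loaded_posts, sum(waits))
```

With 20 repeats and a 5s wait, a real run spends at least 100 seconds scrolling, so raising the repeat count to collect more posts also lengthens the task.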
4. Modify the XPath for the Loop Item
- Click on Loop Item action
- Make sure the loop mode is Variable List
- Enter the XPath //div[@role="article"][not(contains(@aria-label,"Comment"))]/../..
- Click Apply to save the settings
Tip: XPath plays an important role in locating the correct elements in Octoparse. You can check the tutorial below to learn more about it: What is XPath and how to use it in Octoparse
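To make the Loop Item XPath easier to read, the sketch below reproduces its logic in plain Python on a tiny piece of sample markup: find every div with role="article", skip those whose aria-label contains "Comment", then step up two levels to the enclosing container. The markup and the `select_post_containers` helper are hypothetical and heavily simplified; real Facebook pages are far more deeply nested. The standard library's ElementTree does not support the full XPath expression, so the filter is written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup used only for illustration.
SAMPLE = """
<feed>
  <div id="post-1">
    <div>
      <div role="article" aria-label="Post by Page">First post</div>
    </div>
  </div>
  <div id="post-2">
    <div>
      <div role="article" aria-label="Comment by Someone">A comment</div>
    </div>
  </div>
  <div id="post-3">
    <div>
      <div role="article">Second post</div>
    </div>
  </div>
</feed>
"""


def select_post_containers(root):
    """Emulate //div[@role="article"][not(contains(@aria-label,"Comment"))]/../.."""
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    containers = []
    for article in root.iter("div"):
        if article.get("role") != "article":
            continue
        # A missing aria-label cannot contain "Comment", so such articles are kept.
        if "Comment" in article.get("aria-label", ""):
            continue
        # The two "/.." steps: parent, then grandparent.
        grandparent = parent_of.get(parent_of.get(article))
        if grandparent is not None and grandparent not in containers:
            containers.append(grandparent)
    return containers


root = ET.fromstring(SAMPLE)
post_ids = [el.get("id") for el in select_post_containers(root)]
print(post_ids)  # ['post-1', 'post-3']
```

The comment article is filtered out, so only the two post containers remain as loop items.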
5. Click on "See more" - to reveal the complete post content
Many long posts are truncated behind a "See more" button, which you need to click to see the complete content. Add a branch to the workflow that checks whether the post contains a "See more" button. If it does, click the button and then scrape the data; if it does not, scrape the data directly.
- Add Branch Conditions inside the Loop Item
- Set the condition to check whether the current loop item contains a specific element
- Input the XPath /DIV[3]/DIV[1]/DIV[1]/DIV[1]/DIV[1]/SPAN[1]/DIV[1]/DIV[1]//div[contains(text(),'See')]
- Add a Click Item inside the left branch
- Go to the settings of the Click Item and select Relative XPath
- Input the XPath /DIV[3]/DIV[1]/DIV[1]/DIV[1]/DIV[1]/SPAN[1]/DIV[1]/DIV[1]//div[contains(text(),'See')]
- Go to the Optional tab and set up the AJAX Load for the Click Item
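The branch logic in this step can be sketched in Python as follows. This is an illustration of the decision only, not of how Octoparse implements it: the sample markup, the `FULL_TEXT` lookup, and the `click_see_more` function are all hypothetical stand-ins, and the check on each div's leading text only approximates `contains(text(),'See')`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified posts: one short, one truncated with a "See more" div.
FEED = ET.fromstring(
    "<feed>"
    '<div id="short"><div>Hello world</div></div>'
    '<div id="long"><div>Long text... <div>See more</div></div></div>'
    "</feed>"
)

# Hypothetical full content that a real click would load via AJAX.
FULL_TEXT = {"long": "Long text with the rest revealed"}


def has_see_more(post):
    """Branch condition: does any descendant div's leading text contain 'See'?"""
    return any(
        "See" in (div.text or "")
        for div in post.iter("div")
        if div is not post
    )


def click_see_more(post):
    """Stand-in for the Click Item: swap the truncated body for the full text."""
    body = post[0]
    for child in list(body):
        body.remove(child)
    body.text = FULL_TEXT[post.get("id")]


def scrape_post(post):
    if has_see_more(post):
        click_see_more(post)  # left branch: expand first
    # Both branches end by extracting the text of the post.
    return "".join(post.itertext()).strip()


results = {post.get("id"): scrape_post(post) for post in FEED}
print(results)
```

The short post goes straight to extraction, while the long one is expanded first, which is why the Click Item belongs in one branch only.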
6. Run your task - to get the data you want
- Click Save to save the task first
- Click Run on the upper left side
- Select Run task on your device to run the task on your computer
Here is the sample output.
If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.
Author: Joy
Editor: Yina