As one of the earliest social networking websites, Facebook sets itself apart from its competitors with its overwhelming popularity. Its wealth of user-generated data, in the form of posts, comments, and replies, lets scrapers identify popular topics and assess public sentiment.
By popular demand, this tutorial is the second in a series that the Octoparse team has prepared for users who need Facebook data.
In this tutorial, we are going to show you how to scrape comments from Facebook posts.
Before you start building a crawler of your own, you may want to check out the pre-built Facebook Comment template first. Task templates provide an easy way to get the data you want: just enter the post URLs and the comments will be extracted within minutes. Note that task templates are a premium feature. You can apply for a free 14-day premium trial here!
If the template falls short of your needs and you would like to build a crawler from scratch, continue with the tutorial below.
Last time, we successfully scraped the post URLs from the official Nintendo Facebook page. Check out that article here if you haven't already! We will use one of the URLs as our starting point and, as a demo, extract each comment and its first reply.
Here are the major steps for this task:
- Start a new task with a list of URLs - to import post URLs
- Log into Facebook in the browser mode - to save cookies for authentication
- Add a Pagination action to loop click "View X More Comments" for the main post - to load all the comments for the main post
- Create a Loop Item for the comments - to define the area for data extraction
- Create a branch condition - to improve crawler efficiency by condition-based scraping
- Select the data for extraction - to collect each comment and its first reply
- Modify the XPath for the data fields
- Run the task - to get your desired data
1. Start a new task with a list of URLs - to import post URLs extracted earlier
Last time, we extracted post URLs from the official Nintendo Facebook page, now we can use the list of URLs as a starting point.
- Find the +New button on the sidebar. Click it and then select Advanced Mode.
- Choose Import from file and Select File to locate a local file with the list of URLs
- Choose the right option from the dropdown menu to locate the list of URLs in the file
- Check the URL Preview section to see if we have the right URLs
- Click Save to start
We will find a Loop URLs action with a Go to Web Page action generated in the workflow.
- Set a longer timeout for the action in case the web page takes some time to load.
2. Log into Facebook in the browser mode - to save cookies for authentication
Facebook hides its data behind authentication, so we need to log in first.
- Toggle on browse mode
- Log into Facebook as you do in a normal browser
- Click the Go to Web Page action to open its settings panel in the bottom right corner
- Click Use cookie from the current page
- Click Apply to save the settings
Now we have saved our login information in the task workflow.
3. Add a Pagination action to loop click "View X More Comments" for the main post - to load all the comments for the main post
Facebook's default view shows only a few comments, so we might need to click a View X More Comments button multiple times to load all the comments.
- Click on the Loop URLs and select one post URL that contains more than 50 comments (here is an example URL: https://www.facebook.com/Nintendo/videos/209774058010837/)
- Click Go to Web Page and the post will be opened
- Click on View more comments on the web page and select Loop click single element
- Set AJAX timeout to 3s
- Click the Pagination action and change its XPath to //span[contains(text(),"more comments")]
Now Octoparse will keep clicking to load all the comments for as long as it can locate a View more comments button on the page.
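If you want to sanity-check the pagination XPath outside Octoparse, you can evaluate it against a saved snippet of the page with Python's lxml library. The markup below is a deliberately simplified, hypothetical stand-in; Facebook's real HTML is nested far more deeply, and only the XPath itself comes from this tutorial.

```python
# Sketch: testing the Pagination action's XPath against simplified, made-up markup.
from lxml import etree

page = etree.fromstring("""
<div>
  <span>View 12 more comments</span>
  <span>Share</span>
</div>
""")

# The same XPath we set on the Pagination action above
buttons = page.xpath('//span[contains(text(),"more comments")]')
print([b.text for b in buttons])  # only the "View 12 more comments" span matches
```

Because `contains()` does a substring match on the span's text, the expression keeps matching "View 5 more comments", "View 50 more comments", and so on, regardless of the number.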
4. Create a Loop Item for the comments - to define the area for data extraction
Now we need to create a loop item for each comment and its replies in order to define the area for data extraction. Since it is difficult to select a comment and its replies as one section by point-and-click, we need to set up the loop manually and write an XPath to locate them.
- Hover over the down arrow under the Pagination action
- Click Add a Step to add a loop
- Go to the General tab for the Loop Item and set Loop Mode to Variable List
- Enter Matching XPath as (//div[contains(@aria-label,"Comment by")]/../../..)/li
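To see why this XPath lands on whole comment blocks, here is a rough sketch using Python's lxml. The `ul > li > div > div` structure below is a drastically simplified assumption for illustration; only the XPath and the "Comment by" aria-label pattern come from the tutorial.

```python
# Sketch: the loop-item XPath climbs from each "Comment by ..." div up to the
# <ul>, then selects every <li>, i.e. one loop item per comment.
from lxml import etree

# Hypothetical, stripped-down comment list; the real page nests much deeper.
page = etree.fromstring("""
<ul>
  <li><div><div aria-label="Comment by Alice">Nice game!</div></div></li>
  <li><div><div aria-label="Comment by Bob">Day one buy.</div></div></li>
</ul>
""")

items = page.xpath('(//div[contains(@aria-label,"Comment by")]/../../..)/li')
print(len(items))  # one <li> per comment
```

The three `..` steps must match how deep the aria-label div actually sits inside each list item, which is why this XPath may need adjusting if Facebook changes its markup.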
5. Create a branch condition - to improve crawler efficiency by condition-based scraping
Not every comment has a reply. Those that do show a "1 Reply" or "X Replies" button below their content. We need to create a branch condition here to tell them apart, so that our crawler only loads replies when a comment actually has one.
- Hover over the down arrow above Drop an action here
- Click Add a Step to add a branch condition in the loop
- Click on the left branch of the branch condition and choose Execute if the current loop contains specific element
- Input the XPath that covers both the "1 Reply" and "X Replies" buttons: //span[@class="j83agx80 fv0vnmcu hpfvmrgz"]/span[contains(text(),"Repl")] (the substring "Repl" matches both "Reply" and "Replies")
- Click on X Replies under any comment and select Click element
- Choose Relative XPath to the Loop Item and input the XPath //span[@class="j83agx80 fv0vnmcu hpfvmrgz"]/span[contains(text(),"Repl")]
- Click Apply to save the settings
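As a quick illustration of how the condition separates the two kinds of comments, the sketch below evaluates the reply-button XPath relative to each loop item (note the quotes around "Repl" - XPath string literals must be quoted). The loop items are simplified, hypothetical markup; only the class names come from the tutorial.

```python
# Sketch: only the first hypothetical comment has a replies button,
# so only it should take the left branch (click to load replies).
from lxml import etree

page = etree.fromstring("""
<ul>
  <li><div aria-label="Comment by Alice">Nice!</div>
      <span class="j83agx80 fv0vnmcu hpfvmrgz"><span>3 Replies</span></span></li>
  <li><div aria-label="Comment by Bob">Hi!</div></li>
</ul>
""")

# "Relative to the loop item" translates to a leading "." in lxml.
reply_xpath = './/span[@class="j83agx80 fv0vnmcu hpfvmrgz"]/span[contains(text(),"Repl")]'
flags = [bool(item.xpath(reply_xpath)) for item in page.xpath('//li')]
print(flags)  # True for the comment with replies, False for the one without
```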
6. Select the data for extraction - to collect each comment and their first reply
- Click + to add an Extract Data step under the Branch condition box
- Click on a comment and select Extract the text of the element on the Tips
- Click on the commenter's name and select Extract the text of the element
- Click on a reply and select Extract the text of the element
- Click on a replier's name and select Extract the text of the element
- Click on Extract Data action and tick Extract data in the loop
- Go to the Data Preview section and click + > Page-level data > Page URL to add the post URL as a data field
Now we have roughly located the data fields we need.
7. Modify the XPath for the data fields
- Turn to the Data Preview section and hover over the three dots to the right of the first data field for more options
- Select Customize XPath to open a window to edit the element XPath
- Change Field Name to Commenter
- Tick Relative XPath to the Loop Item under Matching XPath and input //div[contains(@aria-label,"Comment by ")]//span[@class="pq6dq46d"]/span
- Click Apply to save the settings
- Repeat the above process for the other three data fields, changing the field names and XPaths accordingly:
- Comment: //div[contains(@aria-label,"Comment by ")]//div[@class="kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql"]
- First Replier: /div/following-sibling::div/div/ul/li/div/div/div/div/div/div/div/div/div/span
- First Reply: /div/following-sibling::div/div/ul/li/div/div/div/div/div/div/div/div/div//div[@class="kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql"]
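If you want to check the first two field XPaths outside Octoparse, lxml can evaluate them relative to a single loop item. The HTML below is a heavily simplified assumption; only the class names and the "Comment by" aria-label pattern come from the tutorial.

```python
# Sketch: evaluating the Commenter and Comment field XPaths inside one <li>.
from lxml import etree

# Hypothetical loop item containing one comment.
page = etree.fromstring("""
<ul><li>
  <div aria-label="Comment by Alice">
    <span class="pq6dq46d"><span>Alice</span></span>
    <div class="kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql">Nice game!</div>
  </div>
</li></ul>
""")
item = page.xpath('//li')[0]

# "Relative XPath to the Loop Item" means the expression is evaluated
# inside each <li>; in lxml that translates to a leading ".".
commenter = item.xpath('.//div[contains(@aria-label,"Comment by ")]'
                       '//span[@class="pq6dq46d"]/span/text()')
comment = item.xpath('.//div[contains(@aria-label,"Comment by ")]'
                     '//div[@class="kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql"]/text()')
print(commenter, comment)  # ['Alice'] ['Nice game!']
```

Keep in mind that these auto-generated class strings change whenever Facebook ships a redesign, so expect to re-derive them periodically.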
Now Octoparse can locate the element more precisely.
8. Run the task - to get your desired data
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run Task window to pop up
- Select Run on your device to run the task on your local device
- Wait for the task to complete
Note: Facebook tends to block datacenter (non-residential) IPs. As a result, Facebook crawlers are NOT compatible with the cloud extraction feature.
Here is a sample output from a local run.
Facebook has devoted substantial resources to limiting scraping activities. If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.