Scrape post URLs from a Facebook Group (V8.4)
As one of the earliest social networking websites, Facebook sets itself apart from its competitors by its overwhelming popularity. Its richness in user-generated data in the form of posts, comments, and replies allows scrapers to identify popular topics and assess public sentiment.
Due to popular demand, the Octoparse team has prepared a series of tutorials for users who need Facebook data, and this is the third in that series. Kindly note that Facebook hides post URLs from all scrapers: you cannot read them directly from the web page, or even from most of the source code.
In this tutorial, we are going to show you how to scrape post URLs from a Facebook group with Octoparse. Check our first tutorial on Facebook if you want to scrape the post URL from an account page instead.
We will scrape the post URLs from the Elden Ring Facebook community. The sample URL is below:
https://www.facebook.com/groups/337943427851536
Here are the major steps of the tutorial:
- Create a Go to Web Page - to open the target web page
- Log into Facebook in browse mode - to save cookies for authentication
- Add a Go to Web Page - to open the target page
- Add a Scroll Page loop - to load more posts from the infinite scrolling page
- Create a loop with the timestamps of the first comment of each post - to extract the hidden URLs
- Run the task - to get the post URLs
1. Create a Go to Web Page - to open the target web page
Every workflow in Octoparse starts by telling Octoparse a web page to start with.
- Enter the link to the Facebook home page, https://www.facebook.com/, into the search bar at the top of the home screen and click Start.
You can also enter the URL by creating the task in advanced mode.
- Find the +New button on the sidebar. Click on it and then select Advanced Mode.
- Input the URL into the website box and click Save to start.
Either way, check if a Go to Web Page action has been generated in your workflow and the target page has been opened in the built-in browser.
2. Log into Facebook in browse mode - to save cookies for authentication
Facebook hides its data behind authentication, so we need to log in first.
- Toggle on browse mode and log into Facebook as you do in a normal browser
- Click the Go to Web Page action to open its settings panel (located at the bottom right)
- Go to the Options tab and tick Use cookies
- Tick Use cookie from the current page
- Click Apply to save the settings
- Turn off browse mode
We have now saved the login information in the task workflow, so whenever the task runs, our Facebook account is already logged in.
3. Add a Go to Web Page - to open the target page
After logging in, we need to go to the target web page, which is https://www.facebook.com/groups/337943427851536 in this case. We need a new Go to Web Page action in the workflow.
- Hover over the down arrow under the Go to Web Page action
- Click the + icon that appears to add an Open Page action
- Click on the Open Page action and input the URL into the newly added action
- Click Apply to save
4. Add a Scroll Page loop - to load more posts from the infinite scrolling page
Facebook uses infinite-scroll pagination: new posts are loaded only as you scroll down. We need to add a scroll page loop to the workflow to load more posts from the page.
- Hover over the down arrow under the Go to Web Page action
- Click the + icon that appears to add a loop
- Change its loop mode to Scroll Page in the General tab
- Set the Repeats according to how many posts you want to scrape (100, for example)
- Set the Wait time to 3s
- Click Apply to save the settings
Now Octoparse will scroll down the page the way we tell it to.
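To make the Repeats and Wait time settings concrete, here is a minimal Python sketch of what a scroll page loop does conceptually. It is not how Octoparse is implemented: the function name, the `scroll_once` callback, and the injectable `sleep` parameter are all hypothetical, and pausing after every scroll is an assumption about how the wait time is applied.

```python
import time
from typing import Callable


def scroll_to_load(
    scroll_once: Callable[[], None],
    repeats: int = 100,
    wait_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll `repeats` times, pausing after each scroll.

    `scroll_once` is a hypothetical callback that scrolls the page one step.
    The pause gives the page time to load the next batch of posts.
    Returns the number of scrolls performed.
    """
    performed = 0
    for _ in range(repeats):
        scroll_once()        # trigger loading of more posts
        sleep(wait_seconds)  # wait for the new posts to render
        performed += 1
    return performed


if __name__ == "__main__":
    # Demo with a fake scroller and a no-op sleep so it finishes instantly.
    count = scroll_to_load(lambda: None, repeats=3, sleep=lambda s: None)
    print(count)  # 3
```

A larger Repeats value loads more posts but lengthens the run: 100 repeats at 3 seconds each spend at least 300 seconds waiting.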
5. Create a loop with the timestamps of the first comment of each post - to extract the hidden URLs
After checking the source code of the web page, we found that Facebook hides the post ID inside the timestamp link of each post. However, the main post timestamp is rendered dynamically (its href attribute is "#" until you hover over it), so we have to collect the post IDs from the timestamp of the first comment on each post instead. It takes some time to find the correct XPath, but don't worry, we have it here for you.
The XPath for the timestamp is (//ul[li[div[div[contains(@aria-label,"Comment by")]]]])/li[1]//ul/li[4]
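To show what this XPath selects, here is a rough Python sketch that hand-translates it into standard-library ElementTree calls (ElementTree does not support `contains()`, so the predicate is written out as a loop). The sample markup, the names in it, and the post and comment IDs are invented for illustration; real Facebook markup is far more complex and is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup: one comment list with two comments.
SAMPLE_HTML = """
<div>
  <ul>
    <li>
      <div><div aria-label="Comment by Alice">
        <span>Nice build!</span>
        <ul>
          <li>Like</li>
          <li>Reply</li>
          <li>Share</li>
          <li><a href="https://www.facebook.com/groups/337943427851536/posts/111/?comment_id=222">2h</a></li>
        </ul>
      </div></div>
    </li>
    <li>
      <div><div aria-label="Comment by Bob">
        <ul>
          <li>Like</li>
          <li>Reply</li>
          <li>Share</li>
          <li><a href="https://www.facebook.com/groups/337943427851536/posts/111/?comment_id=333">1h</a></li>
        </ul>
      </div></div>
    </li>
  </ul>
</div>
"""


def is_comment_list(ul: ET.Element) -> bool:
    """Mirror of ul[li[div[div[contains(@aria-label,"Comment by")]]]]."""
    return any(
        "Comment by" in inner.get("aria-label", "")
        for li in ul.findall("li")
        for outer in li.findall("div")
        for inner in outer.findall("div")
    )


def first_comment_timestamps(root: ET.Element) -> list:
    """Mirror of (//ul[...])/li[1]//ul/li[4]: the timestamp item of the first comment."""
    found, seen = [], set()
    for ul in root.iter("ul"):
        if not is_comment_list(ul):
            continue
        first_li = ul.find("li")  # li[1]
        if first_li is None:
            continue
        for item in first_li.findall(".//ul/li[4]"):
            if id(item) not in seen:  # an XPath node-set has no duplicates
                seen.add(id(item))
                found.append(item)
    return found


def extract_hrefs(root: ET.Element) -> list:
    """Apply the relative /a step and read the href attribute."""
    links = (item.find("a") for item in first_comment_timestamps(root))
    return [a.get("href") for a in links if a is not None]


if __name__ == "__main__":
    print(extract_hrefs(ET.fromstring(SAMPLE_HTML)))
```

Only the first comment of each list is visited, which is why the sketch returns one link per post even though Bob's comment carries the same post ID.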
- Hover over the down arrow inside the Loop Item we created in the previous step
- Click the + icon that appears to add another loop
- Change Loop Mode to Variable List
- Input the Matching XPath for the timestamp
- Click Apply to save the settings
- Click the + icon inside the loop to add an Extract Data action
- Click the add icon on the data preview section and select Capture data on the page to add a new data field
- Rename the Field to PostURL and change its Matching XPath to /a
- Click Apply to save
- Click the three dots on the Data Preview section and select Customize field
- Select URL (href attribute) to get the raw URL
- Click the three dots again and select Clean data to trim the raw value down to the clean URL
- Click + Add Step and choose Match with Regular Expression
- Use the Try RegEx tool if you don't know how to write the RegEx yourself. Since we need the part of the URL before the "?", generate a RegEx that ends with "?"
- Remember to click Apply to save the settings
Now we can get the standard post URLs from the task.
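The cleaning step above keeps everything before the "?". As a minimal sketch of the same idea in Python, the pattern below is one possible expression and not necessarily what the Try RegEx tool generates; the function name and the post and comment IDs in the sample URL are made up.

```python
import re

# Match one or more characters from the start of the string, stopping before the first "?".
BEFORE_QUERY = re.compile(r"^[^?]+")


def strip_query(raw_url: str) -> str:
    """Return the part of `raw_url` before the first "?", or the input unchanged if nothing matches."""
    match = BEFORE_QUERY.match(raw_url)
    return match.group(0) if match else raw_url


if __name__ == "__main__":
    raw = "https://www.facebook.com/groups/337943427851536/posts/1234567890/?comment_id=987654321"
    print(strip_query(raw))
    # https://www.facebook.com/groups/337943427851536/posts/1234567890/
```

Because several comments on one post clean down to the same URL, removing duplicates from the results afterwards is a sensible follow-up.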
6. Run the task - to get the post URLs
Your workflow should look like this.
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run Task window to pop up
- Select Run on your device to run the task on your local device
Here is a sample output from a local run.
With these post URLs, we can move on to building a second task to scrape the comments of each post as well as their replies.
If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.
Author: Crix
Editor: Yina