Scrape post URLs from Facebook public pages
As one of the earliest social networking websites, Facebook sets itself apart from its competitors through its overwhelming popularity. Its wealth of user-generated data, in the form of posts and comments, lets scrapers identify popular topics and assess public sentiment.
By popular demand, the Octoparse team has prepared a series of tutorials for users who need Facebook data. This tutorial is the first in that series.
In this tutorial, we are going to show you how to scrape post URLs from a Facebook public page with Octoparse.
We will scrape the post URLs from the official Nintendo Facebook page. The sample URL is: https://www.facebook.com/Nintendo/
Here are the major steps for this task:
- Create a Go to Web Page - to open the target web page
- Log into Facebook in browse mode - to save cookies for authentication
- Add a Go to Web Page - to open the target page
- Add a Scroll Page loop - to load more posts from the infinite scrolling page
- Create a Loop Item - to loop click a sequence of buttons to locate the hidden post URL
- Modify the settings for actions in the loop - to debug our workflow
- Use Clean data to get standard post URLs
- Run the task - to get the post URLs
1. Create a Go to Web Page - to open the target web page
Every workflow in Octoparse starts with the web page that Octoparse should open first.
- Enter the link to the Facebook home page, https://www.facebook.com/, into the search bar at the top of the home screen and click Start.
You can also enter the URL by creating the task in advanced mode.
- Find the +New button on the sidebar. Click on it and then select Advanced Mode.
- Input the URL into the website box and click Save to start.
Either way, check if a Go to Web Page action has been generated in your workflow and the target page has been opened in the built-in browser.
2. Log into Facebook in browse mode - to save cookies for authentication
Facebook hides its data behind authentication, so we need to log in first.
- Toggle on browse mode and log into Facebook as you would in a normal browser
- Click the Go to Web Page action to open its settings panel (located at the bottom right)
- Go to the Options tab and tick Use cookies
- Click Use cookie from the current page
- Click Apply to save the settings
We have now saved the login information in the task workflow, so our Facebook account will already be logged in when we run the task.
3. Add a Go to Web Page - to open the target page
After logging in, we need to go to the target web page, which is https://www.facebook.com/Nintendo/ in this case. We need a new Go to Web Page action in the workflow.
- Hover over the down arrow under the Go to Web Page action
- Click the add icon to add a new Go to Web Page action
- Click on the newly added Go to Web Page action and input the URL
- Click Apply to save
4. Add a Scroll Page loop - to load more posts from the infinite scrolling page
Facebook uses infinite-scroll pagination, so more posts only load as you keep scrolling. We need to add a Scroll Page loop to the workflow to load more posts from the page.
- Hover over the down arrow under the Go to Web Page action
- Click the add icon to add a loop
- Change its loop mode to Scroll Page in the General tab
- Modify the settings for the page scroll
- Click Apply to save the settings
Now Octoparse will scroll down the page the way we tell it to.
5. Create a Loop Item - to loop click a sequence of buttons to locate the hidden post URL
Because Facebook hides post URLs from crawlers, no element on the page exposes the post URL directly. We need to click through a sequence of buttons to reach the hidden URL for each post.
- Click on the three dots in the upper right of any post
- Choose Select All on the Tips panel
- Select Loop click each element and set AJAX timeout to 3s
- Click Embed from the pop-up and select Click element from the Tips panel
- Set AJAX timeout to 3s
- Click on the URL text box in this window and select Extract text box value from the Tips panel
- Set AJAX timeout to 3s
- Click the cross icon in the top right of the pop-up window and select Click element
- Set AJAX timeout to 3s
Now we have set up a basic workflow in which Octoparse opens the Embed Post window of a post, extracts the post URL from it, closes the window, and moves on to the next post. However, we are still only halfway through.
6. Modify the settings for actions in the loop - to debug our workflow
To make sure our workflow runs smoothly, we still have to make a few adjustments to the actions in the loop.
- Click on the Loop Item and change its loop mode to Variable List in the General tab
- Set the Matching XPath as //div[@aria-label="Actions for this post"]
- Click Apply to save the settings
- Click on the second Click Item in the loop (should be named Click Item1 by default) and change its absolute XPath to //span[contains(text(),"Embed")]
- Click on the Extract Data action
- Go to Data Preview
- Click on the three dots for more options for the data field and choose Customize XPath
- Change the field name to posturl (The name change is optional)
- Change its absolute XPath to //iframe[contains(@src,"https://www.facebook.com/plugins")]
- Click Apply to save
- Click Customize field and select src attribute (image URL)
- Click on the last Click Item in the loop (should be named Click Item2 by default) and change its absolute XPath to //div[@aria-label="Close"]
- Click Apply to save
The workflow is now complete, and Octoparse can find and extract the hidden post URLs.
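To see what these XPath expressions select, here is a minimal Python sketch that checks them against a tiny stand-in for the page. The sample markup is hypothetical and heavily simplified; it is not Facebook's real HTML. Python's standard-library ElementTree supports only a subset of XPath, so the two contains() conditions are emulated with plain Python checks rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup: two posts plus one open Embed dialog.
SAMPLE_PAGE = """
<div>
  <div role="article"><div aria-label="Actions for this post">...</div></div>
  <div role="article"><div aria-label="Actions for this post">...</div></div>
  <div role="dialog">
    <span>Embed</span>
    <iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FNintendo%2Fposts%2F1"></iframe>
    <div aria-label="Close"></div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE_PAGE)

# //div[@aria-label="Actions for this post"]: one three-dot menu per post.
menu_buttons = root.findall(".//div[@aria-label='Actions for this post']")

# //span[contains(text(),"Embed")]: emulated, since ElementTree has no contains().
embed_items = [s for s in root.iter("span") if "Embed" in (s.text or "")]

# //iframe[contains(@src,"https://www.facebook.com/plugins")]: emulated the same way.
embed_frames = [
    f for f in root.iter("iframe")
    if "https://www.facebook.com/plugins" in f.get("src", "")
]

# //div[@aria-label="Close"]: the cross icon of the Embed dialog.
close_buttons = root.findall(".//div[@aria-label='Close']")

print(len(menu_buttons), len(embed_items), len(embed_frames), len(close_buttons))
print(embed_frames[0].get("src"))
```

The Loop Item XPath matches every post's menu button, which is why it works as a Variable List, while the other three expressions each target a single element inside the Embed dialog.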
7. Use Clean data to get standard post URLs
The last step extracts three types of URLs (posts, video posts, and photo posts). They are not yet in a standard format, so we need to use the Clean data function to normalize them. If you observe their patterns closely, you will see that we need to replace each %2F with a slash and then match the standard post URL out of the result.
- Click on the three dots for more options for the data field
- Select Clean data
- Click +Add Step and select Replace
- Replace %2F with / and click Confirm
- Click +Add Step again and select Match with Regular Expression
- Use the RegEx tool if you are not sure about RegEx
- Start with 3A// and end with &show
- Click Generate and check the Results
- Tick Match all and click Apply and then Confirm
- Click +Add Step a third time and select Add a prefix
- Enter https:// as the prefix and click Confirm
Check if you have the settings below.
- Click Apply to save your settings
Now we can get the standard post URLs from the task.
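The three Clean data steps can be sketched in Python to show what they do to a single extracted value. This is an illustration under assumptions, not Octoparse's own implementation: the function name clean_post_url and the sample src values are hypothetical, and the regular expression is one plausible reading of "start with 3A// and end with &show" in which both markers are excluded from the match.

```python
import re
from typing import Optional


def clean_post_url(embed_src: str) -> Optional[str]:
    """Turn an Embed iframe src into a standard post URL (hypothetical helper)."""
    # Step 1 (Replace): turn every %2F into a slash.
    replaced = embed_src.replace("%2F", "/")

    # Step 2 (Match with Regular Expression): keep the text between
    # "3A//" and "&show", excluding both markers (assumed pattern).
    match = re.search(r"(?<=3A//).*?(?=&show)", replaced)
    if match is None:
        return None

    # Step 3 (Add a prefix): restore the scheme removed by the match.
    return "https://" + match.group(0)


# Hypothetical sample values shaped like the extracted src attribute.
samples = [
    "https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com"
    "%2FNintendo%2Fposts%2F1234567890&show_text=true&width=500",
    "https://www.facebook.com/plugins/video.php?href=https%3A%2F%2Fwww.facebook.com"
    "%2FNintendo%2Fvideos%2F987654321%2F&show_text=false&width=500",
]

for src in samples:
    print(clean_post_url(src))
```

The prefix step is needed because the match begins after 3A//, which drops the https part of the original address.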
8. Run the task - to get the post URLs
Your workflow should look like this.
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run Task window to pop up
- Select Run on your device to run the task on your local device
Here is the sample output from a local run.
With these post URLs, we can move on to build a second task to scrape the comments of each post as well as their replies.
If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.
Author: Crix
Editor: Yina