Scrape tweets from a public Twitter account
With a reported 211 million daily active users, Twitter has proven its worth in social media marketing. Users post an average of 6,000 tweets every second, which adds up to over 500 million tweets each day. All of this chatter and noise is a treasure chest of valuable information for marketers, brands, researchers, and analysts. Marketers and brands often scrape Twitter data from specific accounts (influencers, competitors) to analyze engagement and plan effective strategies.
By popular demand, this is the second in a series of tutorials that the Octoparse team has prepared for users who need Twitter data.
In this post, we are going to teach you how to scrape tweets from a public account.
If you don't want to bother creating a custom crawler on your own, you can search for a ready-to-use Twitter Task Template from the main screen to save some time.
If you would like to know how to build the task from scratch, you may continue reading the following tutorial or check the video below.
You can use the following sample link to follow along:
https://twitter.com/search?q=Latest%20News&src=tyah
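To see how the sample link is put together, here is a minimal Python sketch that rebuilds it from a search phrase. The helper name build_search_url is hypothetical, and the src=tyah parameter is simply copied from the sample link above; this tutorial does not explain what it does.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_search_url(query: str, src: str = "tyah") -> str:
    """Build a search URL shaped like the sample link (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    params = urlencode({"q": query, "src": src}, quote_via=quote)
    return f"https://twitter.com/search?{params}"


url = build_search_url("Latest News")
print(url)

# Decode the query string again to check what the page will search for.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Swapping in a different phrase gives a start URL for another search, which you can then paste into Octoparse in the first step.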
Here are the main steps in this tutorial: [Download the demo task here]
- Create a Go to Web Page - to open the target web page
- Create a Loop Item - to loop through each tweet
- Create another Loop Item - to scroll down the web page
- Rewrite some of the XPath - to locate the web elements more accurately
- Create an Extract Data - to scrape the desired data fields
- Run the task - to get your desired data
1. Create a Go to Web Page - to open the target Twitter link
Every workflow in Octoparse starts by telling Octoparse a web page to start with.
- Enter the sample URL into the search bar at the top of the home screen and click Start.
2. Create a Loop Item - to loop through each tweet
Next, we need to create a loop for all the tweets.
- Select the first tweet on the web page (be sure to select the whole tweet block; it turns green when the whole tweet is selected)
- Continue to select the second tweet
- Choose Extract text of the selected elements from the Tips Panel
3. Create another Loop Item - to scroll down the web page
The infinite scroll pattern of Twitter is designed to load content dynamically, which requires a few necessary tweaks in the task workflow to minimize the loss of data.
- Add a new Loop Item in the workflow
- Drag the original loop inside the new loop (Loop Item inside Loop Item1)
- Click the Loop Item and set its Loop Mode to Scroll Page in the General tab
- Set the scroll pattern to scroll for one screen, with a wait time of 1s, and repeat 100 times (or more)
- Tick Capture data as page scrolls dynamically (possibly duplicates) (Important!)
- Click Apply to confirm
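Why does the capture option warn about possible duplicates? The Python sketch below is a simplified model of the idea, not Octoparse's actual implementation: it assumes only a window of tweets is present on the page at any moment and that each scroll moves the window by less than its full height, so neighbouring captures overlap. The function name and the window sizes are made up for illustration.

```python
def scroll_and_capture(timeline, visible=5, step=3, repeats=100):
    """Capture the visible window after each simulated scroll (toy model)."""
    captured = []
    top = 0
    for _ in range(repeats):
        window = timeline[top:top + visible]
        if not window:
            break  # scrolled past the end of the timeline
        captured.extend(window)
        top += step  # step < visible, so consecutive windows overlap
    return captured


# A stand-in timeline of 20 tweets.
timeline = [f"tweet-{i}" for i in range(1, 21)]

raw = scroll_and_capture(timeline)
# dict.fromkeys drops repeats while keeping first-seen order.
unique = list(dict.fromkeys(raw))

print(len(raw), "captured,", len(unique), "unique")
```

In this model every tweet is captured at least once, at the cost of repeated rows, which matches the trade-off described in the option's name.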
4. Rewrite some of the XPath - to locate the web elements more accurately
The auto-generated XPath may not be accurate enough, so we need to rewrite the XPath for some data fields.
- Click Loop Item (not Loop Item 1!) and enter the XPath //article[@role="article"]/../../..
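To get a feel for what this XPath selects, the sketch below runs an equivalent expression with Python's standard-library ElementTree. The markup is a simplified, made-up stand-in for Twitter's real page structure (which is more deeply nested and changes over time), and ElementTree requires a relative path, so `.//` is used in place of the leading `//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup: each tweet's <article> sits
# three levels below the container that wraps the whole tweet block.
SAMPLE = """
<section>
  <div class="cell">
    <div>
      <div>
        <article role="article"><span>first tweet</span></article>
      </div>
    </div>
  </div>
  <div class="cell">
    <div>
      <div>
        <article role="article"><span>second tweet</span></article>
      </div>
    </div>
  </div>
</section>
"""

root = ET.fromstring(SAMPLE)

# Find every article with role="article", then climb three parents (/../../..)
# so each match is the outer block rather than the article itself.
blocks = root.findall('.//article[@role="article"]/../../..')

# Collapse whitespace so each block yields one clean line of text.
texts = [" ".join("".join(block.itertext()).split()) for block in blocks]
print(texts)
```

Each loop item corresponds to one outer block, so fields extracted inside the loop stay tied to a single tweet.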
5. Create an Extract Data - to scrape the desired data fields
- Click Extract Data and you will see a tweet being highlighted in red
- Choose Extract the text of selected element from the Tips Panel
- Repeat the action to get the name, time, text, replies, retweets, and likes
Tip: Double-click a field name at the bottom of the page to rename it.
You may notice that the tweet post time is shown in relative form, such as "20m", so we need to clean the data field to show the exact post date/time.
- Click the More button of the field
- Choose Customize field
- Choose to extract the datetime attribute
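The difference between the displayed text and the datetime attribute can be sketched in Python as follows. The element and its timestamp are invented sample values, and the ISO 8601 format with a trailing Z is an assumption about what the attribute contains; check your own export before relying on it.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical time element: the visible text is relative,
# while the datetime attribute holds the full timestamp.
SAMPLE_TIME = '<time datetime="2022-01-15T09:30:00.000Z">20m</time>'

node = ET.fromstring(SAMPLE_TIME)
displayed = node.text          # what the default text extraction returns
raw = node.get("datetime")     # what the customized field returns

# Older Python versions do not accept a trailing "Z" in fromisoformat,
# so replace it with an explicit UTC offset first.
posted_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))

print(displayed)
print(posted_at.isoformat())
```

A parsed timestamp like this can be sorted or filtered by date later, which the relative "20m" text cannot.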
6. Run the task - to get your desired data
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run Task window to pop up
- Select Run on your device to run the task on your local device
- Wait for the task to complete
Here is the sample output from a local run.
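Because the scroll loop in step 3 can capture the same tweet more than once, you may want to drop repeated rows after exporting. Below is a minimal Python sketch of that cleanup. The column names and rows are made-up examples based on the fields extracted in step 5, and the export is read from an in-memory string instead of a real file; your exported file will use whatever field names you chose.

```python
import csv
import io

# Made-up export with one repeated row, standing in for a real CSV file.
EXPORT = """name,time,text,reply,retweet,likes
Example News,2022-01-15T09:30:00.000Z,Morning headlines,3,10,25
Example News,2022-01-15T09:30:00.000Z,Morning headlines,3,10,25
Sample Daily,2022-01-15T09:45:00.000Z,Market update,1,4,12
"""

rows = list(csv.DictReader(io.StringIO(EXPORT)))

seen = set()
unique_rows = []
for row in rows:
    # Treat rows with the same author, timestamp, and text as one tweet.
    key = (row["name"], row["time"], row["text"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(len(rows), "rows exported,", len(unique_rows), "after removing duplicates")
```

The counts (reply, retweet, likes) are left out of the key on purpose, since they can change between two captures of the same tweet.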
If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.
Author: Crix
Editor: Yina