You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
With a reported 211 million daily active users, Twitter has proven its worth in social media marketing. Users on Twitter post an average of 6,000 tweets every second, which adds up to over 500 million tweets a day. All of this chatter and noise is a treasure chest full of valuable information for marketers, brands, researchers, and analysts. Marketers and brands often scrape the Twitter data of specific accounts (influencers, competitors) to analyze engagement and plan effective strategies.
Due to popular demand, this tutorial is the second in a series of tutorials that the Octoparse team has prepared for users with a need for Twitter data.
In this post, we are going to teach you how to scrape tweets from a public account.
If you don't want to bother creating a custom crawler on your own, you can search for a ready-to-use Twitter Task Template from the main screen to save some time.
If you want to know how to build the task from scratch, you may continue reading the following tutorial or watching the video below.
You can use the following sample link to follow through:
Here are the main steps in this tutorial: [Download the demo task here]
1. Create a Go to Web Page - to open the target Twitter link
Every workflow in Octoparse starts by telling Octoparse a web page to start with.
Enter the sample URL into the search bar at the top of the home screen and click Start.
Tip: Some accounts cannot be accessed until you've logged into Twitter. To extract data behind a login, check this article on how to extract data behind login.
2. Create a Loop Item - to loop through each tweet
Next, we need to create a loop for all the tweets.
Select the first tweet on the web page (make sure to select the whole tweet block; it turns green once the whole tweet is selected)
Continue by selecting the second tweet
Choose Extract text of the selected elements from the Tips Panel
3. Create another Loop Item - to scroll down the web page
The infinite scroll pattern of Twitter is designed to load content dynamically, requiring a few necessary tweaks in the task workflow to minimize data loss.
Add a new Loop Item in the workflow
Drag the original loop inside the new loop (Loop Item inside Loop Item 1)
Click the new, outer loop (Loop Item 1) and set its Loop Mode to Scroll Page in the General tab
Set the scroll pattern to scroll for one screen at a time, with a wait time of 1s, and repeat 100 times (or more)
Tick Capture data as page scrolls dynamically (possibly duplicates) (Important!)
Click Apply to confirm
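To see why the tweet loop has to sit inside the scroll loop, here is a minimal Python sketch of the idea. It is a toy model, not Octoparse's internals: the function names, the window of 3 visible tweets, and the step of 1 new tweet per scroll are all assumptions made for illustration.

```python
def visible_tweets(feed, top, window=3):
    """Return the slice of the feed that is rendered on screen right now."""
    return feed[top:top + window]


def scroll_and_capture(feed, repeats, step=1, window=3):
    """Capture everything on screen after each scroll, like the nested loops."""
    captured = []
    top = 0
    for _ in range(repeats):          # outer loop: scroll the page
        for tweet in visible_tweets(feed, top, window):
            captured.append(tweet)    # inner loop: extract each tweet on screen
        top += step                   # each scroll reveals `step` new tweets
        if top >= len(feed):
            break
    return captured


# Hypothetical feed of six tweets, scrolled four times.
feed = [f"tweet {i}" for i in range(1, 7)]
rows = scroll_and_capture(feed, repeats=4)
print(len(rows), "rows captured,", len(set(rows)), "unique")
```

In this model every tweet is eventually captured, but tweets that stay on screen across several scrolls are captured more than once. That is the trade-off behind the "possibly duplicates" option.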
4. Rewrite some of the XPath - to locate the web elements more accurately
The auto-generated XPath may not be accurate enough. So we need to rewrite the XPath for some data fields.
Click the inner Loop Item (not Loop Item 1!) and enter the XPath //article[@role="article"]/../../..
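The XPath above finds every article element whose role attribute is "article" and then climbs three levels up to the surrounding block. The short Python sketch below shows the same expression at work. It uses the standard library's ElementTree as a stand-in for Octoparse's XPath engine, and the sample markup and its ids are invented for illustration; they are not Twitter's actual page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each tweet's <article> sits three levels
# below the block we want the loop to iterate over.
SAMPLE = """
<section>
  <div id="cell-1"><div><div>
    <article role="article"><span>first tweet</span></article>
  </div></div></div>
  <div id="cell-2"><div><div>
    <article role="article"><span>second tweet</span></article>
  </div></div></div>
  <div id="cell-3"><div><div>
    <article role="presentation"><span>not a tweet</span></article>
  </div></div></div>
</section>
"""

root = ET.fromstring(SAMPLE)

# ElementTree requires a leading "." for a relative search; the rest of the
# expression is the one from the tutorial.
blocks = root.findall('.//article[@role="article"]/../../..')
block_ids = [block.get("id") for block in blocks]
print(block_ids)
```

Only the two blocks containing an article with role="article" are matched; the third block is skipped because its role attribute differs.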
5. Create an Extract Data - to scrape the desired data fields
Click Extract Data, and you will see a tweet being highlighted in red
Choose Extract the text of selected element from the Tips Panel
Repeat the action and get the name, time, text, reply, retweet, likes
Tip: Double-click a field name in the data preview below the page to rename it.
You may notice that the tweet's post time is shown in relative form, such as "20m". To show the exact post date and time, we need to customize that data field.
Click the More button on the field
Choose Customize field
Choose to extract the DateTime attribute
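Once the exact timestamp is exported, you may want to turn it into a date you can sort or filter on. Here is a minimal Python sketch, assuming the attribute holds an ISO 8601 string in UTC. The sample value and the function name parse_tweet_datetime are made up for illustration, so check the format of your own export first.

```python
from datetime import datetime, timezone


def parse_tweet_datetime(value):
    """Parse an ISO 8601 timestamp such as '2023-01-15T10:30:00.000Z'."""
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # Normalize to UTC so that all rows are comparable.
    return datetime.fromisoformat(value).astimezone(timezone.utc)


# Hypothetical sample value, not taken from a real tweet.
posted = parse_tweet_datetime("2023-01-15T10:30:00.000Z")
print(posted.strftime("%Y-%m-%d %H:%M"))
```

Unlike a relative label such as "20m", a value parsed this way does not depend on when the task was run.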
6. Run the task - to get your desired data
Click Save on the upper right to save your task
Click Run next to it and wait for a Run Task window to pop up
Select Run on your device to run the task on your local device
Wait for the task to complete
Here is the sample output from a local run.
Tip: It is normal if you get duplicates since every time the page scrolls, it loads only one or two new tweets.
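If you post-process the exported data yourself, the duplicates are easy to drop. The Python sketch below is one way to do it under stated assumptions: the field names (name, time, text, likes) are the ones you chose when renaming the fields, the sample rows are invented, and drop_duplicates is a hypothetical helper rather than an Octoparse feature.

```python
def drop_duplicates(rows, key_fields=("name", "time", "text")):
    """Keep the first occurrence of each tweet, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        # Build the identity of a tweet from fields that do not change.
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample rows; the second capture of the same tweet has a new like count.
rows = [
    {"name": "Octoparse", "time": "2023-01-15T10:30:00.000Z", "text": "Hello", "likes": "3"},
    {"name": "Octoparse", "time": "2023-01-15T10:30:00.000Z", "text": "Hello", "likes": "4"},
    {"name": "Octoparse", "time": "2023-01-15T11:00:00.000Z", "text": "World", "likes": "1"},
]
unique_rows = drop_duplicates(rows)
print(len(unique_rows))
```

The key deliberately leaves out the reply, retweet, and like counts, since those can change between two captures of the same tweet during a long scroll.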
Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.