You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

With a reported 211 million daily active users, Twitter has proven its worth in social media marketing. Users post an average of 6,000 tweets every second, which adds up to over 500 million tweets a day. All of this chatter is a treasure chest of valuable information for marketers, brands, researchers, and analysts. Marketers and brands often scrape Twitter data from specific accounts (influencers, competitors) to analyze engagement and plan effective strategies.

By popular demand, the Octoparse team has prepared a series of tutorials for users who need Twitter data. This tutorial is the second in that series.

In this post, we are going to teach you how to scrape tweets from a public account.

If you don't want to bother creating a custom crawler on your own, you can search for a ready-to-use Twitter Task Template from the main screen to save some time.

1.png

If you want to know how to build the task from scratch, continue reading this tutorial or watch the video below.

You can use the following sample link to follow through:

https://twitter.com/search?q=Latest%20News&src=tyah
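
If you want to follow along with a different search phrase, the sample link can be rebuilt for any query. The snippet below is a minimal sketch using Python's standard library; the helper name build_search_url is hypothetical, and the src=tyah parameter is simply copied from the sample link above.

```python
from urllib.parse import quote, urlencode


def build_search_url(query: str, src: str = "tyah") -> str:
    """Build a Twitter search URL shaped like the sample link.

    quote_via=quote encodes spaces as %20 (as in the sample link)
    instead of the default "+".
    """
    params = urlencode({"q": query, "src": src}, quote_via=quote)
    return f"https://twitter.com/search?{params}"


url = build_search_url("Latest News")
print(url)
```

Paste the resulting URL into Octoparse in the next step, just as you would with the sample link.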

Here are the main steps in this tutorial: [Download the demo task here]

  1. Create a Go to Web Page - to open the target web page

  2. Create a Loop Item - to loop through each tweet

  3. Create another Loop Item - to scroll down the web page

  4. Rewrite some of the XPath - to locate the web elements more accurately

  5. Create an Extract Data - to scrape the desired data fields

  6. Run the task - to get your desired data


1. Create a Go to Web Page - to open the target web page

Every workflow in Octoparse starts by telling Octoparse which web page to open.

  • Enter the sample URL into the search bar at the top of the home screen and click Start.

2.png

Tip: Some accounts cannot be accessed until you've logged into Twitter. To extract data behind a login, check this article on how to extract data behind a login.


2. Create a Loop Item - to loop through each tweet

Next, we need to create a loop for all the tweets.

  • Select the first tweet on the web page (make sure to select the whole tweet block; it turns green when the whole tweet is selected)

  • Continue to select the second tweet

  • Choose Extract text of the selected elements from the Tips Panel

3.gif

3. Create another Loop Item - to scroll down the web page

Twitter uses infinite scroll to load content dynamically, so the task workflow needs a few tweaks to minimize data loss.

  • Add a new Loop Item in the workflow

  • Drag the original loop inside the new loop (Loop Item inside Loop Item 1)

__t.gif

  • Click the outer loop (Loop Item 1) and set its Loop Mode to Scroll Page in the General tab

77.png

  • Set the scroll pattern to scroll for one screen, with a wait time of 1s, and repeat 100 times (or more)

  • Tick Capture data as page scrolls dynamically (possibly duplicates) (Important!)

  • Click Apply to confirm

1.png
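
To see why the tweet loop sits inside the scroll loop, and why duplicates are expected, here is a toy model in Python. It is only a sketch under stated assumptions: the feed, the number of visible tweets, and the number of new tweets per scroll are made-up values, and the function is not how Octoparse works internally.

```python
def capture_while_scrolling(feed, visible=3, new_per_scroll=1, repeats=5):
    """Simulate capturing every visible tweet after each scroll.

    All parameters are hypothetical values chosen for illustration.
    """
    captured = []
    top = 0
    for _ in range(repeats):                      # outer loop: scroll the page
        for tweet in feed[top:top + visible]:     # inner loop: each visible tweet
            captured.append(tweet)                # extract data from the tweet
        top += new_per_scroll                     # scrolling reveals new tweets
    return captured


feed = [f"tweet-{i}" for i in range(1, 8)]
rows = capture_while_scrolling(feed)
print(len(rows), "rows captured,", len(set(rows)), "unique tweets")
```

In this model every scroll re-captures the tweets that are still on screen, so the raw output contains many more rows than unique tweets. Without the outer loop, only the first screen of tweets would be captured.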

4. Rewrite some of the XPath - to locate the web elements more accurately

The auto-generated XPath may not be accurate enough, so we need to rewrite some of it by hand.

  • Click Loop Item (not Loop Item 1!) and enter the XPath //article[@role="article"]/../../..

8.png
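
To illustrate what the trailing /../../.. does, here is a minimal sketch using Python's standard xml.etree.ElementTree on a made-up snippet. The markup is an assumption for illustration only; Twitter's real page structure is more complex and can change at any time. The XPath first matches each article element and then climbs three levels up to the outer block that wraps the whole tweet.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (not Twitter's real HTML).
SAMPLE = """
<section>
  <div data-cell="1"><div><div>
    <article role="article"><span>first tweet</span></article>
  </div></div></div>
  <div data-cell="2"><div><div>
    <article role="article"><span>second tweet</span></article>
  </div></div></div>
  <div data-cell="3"><div><div>
    <aside>who to follow</aside>
  </div></div></div>
</section>
"""

root = ET.fromstring(SAMPLE)

# ElementTree needs a relative path, so the expression starts with "." here.
cells = root.findall('.//article[@role="article"]/../../..')

for cell in cells:
    print(cell.get("data-cell"), cell.find(".//span").text)
```

Only the two blocks that contain a tweet are returned; the third block has no article element, so it is skipped.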

5. Create an Extract Data - to scrape the desired data fields

  • Click Extract Data, and you will see a tweet being highlighted in red

  • Choose Extract the text of selected element from the Tips Panel

  • Repeat the action to extract the name, time, text, replies, retweets, and likes

9.gif

Tip: To rename a field, double-click it in the list at the bottom of the page.

11.png

You may notice that the tweet's post time is shown in relative form, such as "20m", so we need to clean the data field to show the exact post date and time.

  • Click the More button on the field

  • Choose Customize field

  • Select the option to extract the DateTime attribute

__t1.gif
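
The difference between the displayed text and the attribute can be sketched in Python. The time element and its timestamp below are made up for illustration, and the ISO 8601 format with a trailing Z is an assumption about what the attribute contains.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical element: the visible text is relative, the attribute is exact.
SAMPLE_TIME = '<time datetime="2022-03-01T12:40:00.000Z">20m</time>'

elem = ET.fromstring(SAMPLE_TIME)

displayed = elem.text          # what "Extract text" would return
raw = elem.get("datetime")     # what extracting the attribute returns

# Parse the assumed timestamp format into a timezone-aware datetime (UTC).
posted_at = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
    tzinfo=timezone.utc
)

print(displayed)
print(posted_at.isoformat())
```

The relative text depends on when the page was loaded, while the attribute value stays the same on every run, which makes it the better field to store.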

6. Run the task - to get your desired data

  • Click Save on the upper right to save your task

  • Click Run next to it and wait for a Run Task window to pop up

  • Select Run on your device to run the task on your local device

  • Wait for the task to complete


Here is the sample output from a local run.

81.png

Tip: Duplicates are normal, because each scroll loads only one or two new tweets while the tweets already on screen are captured again.
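
If you prefer to clean up duplicates after exporting, a few lines of Python are enough. This is a minimal sketch that assumes a CSV export with columns named name, time, and text; those column names and the sample rows are hypothetical, so adjust them to match your own field names.

```python
import csv
import io

# Hypothetical export; in practice you would open your exported CSV file.
EXPORT = """name,time,text
Alice,2022-03-01T12:40:00.000Z,Hello world
Bob,2022-03-01T12:41:00.000Z,Breaking news
Alice,2022-03-01T12:40:00.000Z,Hello world
"""


def drop_duplicates(rows, key_fields=("name", "time", "text")):
    """Keep the first occurrence of each row, judged by key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = list(csv.DictReader(io.StringIO(EXPORT)))
unique_rows = drop_duplicates(rows)
print(len(rows), "rows exported,", len(unique_rows), "after removing duplicates")
```

The original order of the tweets is preserved, and only exact repeats of the chosen fields are dropped.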

Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, we recommend selecting Run in the Cloud to run the task on Octoparse's cloud platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.
