With Octoparse, you can easily scrape data such as top news, hot topics, and worldwide trends from a variety of social media websites, including Twitter.
By scraping data from Twitter, you can:
- Stay up to date with the latest trends worldwide
- Find potential customers for your business
- Analyze the marketing value of hot topics
You can go to "Task Templates" on the main screen of the Octoparse scraping tool and start directly with the ready-to-use Twitter Template to save time. With this template, there is no need to configure the scraping task yourself. For further details, see: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial or check the video below.
To illustrate, we will scrape news from Twitter, using this search page as an example: https://twitter.com/search?q=Latest%20News&src=tyah
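If you want to scrape a different search term, the example URL can be rebuilt from its parts. The following is a minimal sketch that assumes only the two query parameters visible in the URL above (`q` and `src`); the helper name `build_search_url` is hypothetical and is not part of Octoparse.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://twitter.com/search"


def build_search_url(search_term, source="tyah"):
    """Build a search URL shaped like the example in this tutorial.

    The parameter names (q, src) are copied from the example URL;
    nothing else about the site's URL scheme is assumed.
    """
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    query = urlencode({"q": search_term, "src": source}, quote_via=quote)
    return f"{BASE_URL}?{query}"


url = build_search_url("Latest News")
print(url)
```

Paste the resulting URL into the Octoparse home page in step 1.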
Here are the main steps in this tutorial. [Download demo task file here]
- Go to Web Page - Open the target web page
- Create a "Loop Item" - loop through and extract each tweet
- Create a "Pagination" to scroll down the web page
- Modify the Loop Item XPath and select text to scrape
- Start extraction - run the task and get data
1. Go to Web Page - Open the target web page
- Enter the URL on the home page and click Start
Please note that this is a Twitter news search page that can be viewed without logging in. If you want to extract data behind a login, please refer to the following tutorial:
2. Create a "Loop Item" and extract data - loop through and extract each tweet
- Select the first tweet on the web page (make sure to select the whole tweet block)
- Continue to select the second tweet
- Choose "Extract text of the selected elements"
3. Create a "Pagination" to scroll down the web page
- Choose "Paginate to scrape more pages"
- Select a blank area on the webpage
- Click "Confirm"
- Click the gear icon of Pagination
- Modify the XPath of the Pagination to //main and set an appropriate "Repeats" value so that the loop exits
- Click the gear icon of "Click to Paginate" action
- Tick "Scroll down the page after it is loaded"
- Set the scroll method to "Scroll for one screen", "Repeats" to 1, and "Wait" to 5s
The "Pagination" action here does not click a "Next" button; it scrolls down the page to load more tweets. Twitter only loads the tweets that are on the current screen, so we need to scrape the tweets on screen after each scroll rather than scraping once after all the scrolling has finished.
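To see why the order matters, here is a small simulation in plain Python. It is only a sketch: the feed is a made-up list, and `screen_size` and `step` are hypothetical stand-ins for how many tweets fit on one screen and how far one scroll moves. It does not call Octoparse or Twitter.

```python
def visible_tweets(feed, top, screen_size):
    """Return only the tweets currently on screen (the rest are not loaded)."""
    return feed[top:top + screen_size]


def scrape_on_each_scroll(feed, screen_size, step, repeats):
    """Scrape the current screen, then scroll, and repeat."""
    rows = []
    top = 0
    for _ in range(repeats):
        rows.extend(visible_tweets(feed, top, screen_size))
        top += step  # one scroll moves the window down by `step` tweets
    return rows


def scrape_after_scrolling(feed, screen_size, step, repeats):
    """Scroll first and scrape once at the end: only the last screen is left."""
    top = step * (repeats - 1)
    return visible_tweets(feed, top, screen_size)


feed = [f"tweet {i}" for i in range(10)]  # made-up data
per_scroll = scrape_on_each_scroll(feed, screen_size=3, step=2, repeats=4)
at_end = scrape_after_scrolling(feed, screen_size=3, step=2, repeats=4)

print(len(set(per_scroll)), "distinct tweets when scraping on each scroll")
print(len(at_end), "tweets when scraping only after scrolling")
```

Scraping on each scroll reaches far more of the feed, at the cost of some rows being collected twice where consecutive screens overlap.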
4. Modify the Loop Item XPath and select text to scrape
- Click the gear icon of the "Loop Item" and input the XPath //article[@role="article"]/../../..
- Click the "Extract Data" action and you will see a tweet being highlighted in red
- Select the text you want from the red area and choose "Extract the text"
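The Loop Item XPath in this step reads as follows: `//article[@role="article"]` matches each tweet's article element, and each `/..` climbs one level to its parent, so the expression selects the container three levels above every tweet. The sketch below tries the same idea with Python's standard library on invented markup. The sample structure is an assumption made for illustration and is not Twitter's real page source, and because ElementTree supports only a subset of XPath, the path is written relative to the root (`.//`).

```python
import xml.etree.ElementTree as ET

# Invented markup: each tweet's <article> sits three levels inside a cell.
SAMPLE = """
<main>
  <div class="cell"><div><div>
    <article role="article"><span>first tweet</span></article>
  </div></div></div>
  <div class="cell"><div><div>
    <article role="article"><span>second tweet</span></article>
  </div></div></div>
  <div class="cell"><div><div>
    <article role="note"><span>not a tweet</span></article>
  </div></div></div>
</main>
""".strip()

root = ET.fromstring(SAMPLE)

# Match the tweet articles, then step up three parents to the outer cell.
blocks = root.findall(".//article[@role='article']/../../..")

# Collect all the text inside each selected block.
texts = ["".join(block.itertext()).strip() for block in blocks]
print(texts)
```

Selecting the outer container rather than the article itself means the loop item covers the whole tweet block, which is what the red highlight in Octoparse shows.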
5. Start data extraction - run your task and get data
- Click "Save"
- Click "Run" on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)
You can export the results in formats such as Excel, CSV, or JSON, or to your database.
Here is the sample output.
It is normal to get duplicates, since each scroll loads only one or two new tweets.
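If you prefer to clean up duplicates after exporting, a few lines of Python are enough. This is a minimal sketch that assumes a CSV export with a column named `tweet_text`; that column name and the sample rows are hypothetical, so adjust them to match the fields in your own task.

```python
import csv
import io


def drop_duplicate_rows(rows, key):
    """Keep the first occurrence of each value in column `key`, in order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Stand-in for an exported file; replace with open("your_export.csv", newline="").
exported = io.StringIO(
    "tweet_text\n"
    "first tweet\n"
    "second tweet\n"
    "second tweet\n"
    "third tweet\n"
)

rows = list(csv.DictReader(exported))
unique_rows = drop_duplicate_rows(rows, key="tweet_text")
print(len(rows), "rows exported,", len(unique_rows), "after removing duplicates")
```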
Contact us at any time if you need our help!