With Octoparse, you can scrape data such as top news, hot topics, and worldwide trends from a variety of social media websites, including Twitter.
By scraping data from Twitter, you can:
- Keep updated with the latest trends worldwide
- Find potential customers for your business
- Analyze the marketing value of hot topics
You can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Twitter template to save time. With this template, there is no need to configure a scraping task. For further details, see: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial or check the video below.
You may need this link to follow along:
Here are the main steps in this tutorial: [Download the demo task here]
1. Go to Web Page - Open the target web page
- Enter the URL on the home page and click Start
Please note that this is a Twitter news page that can be viewed without logging in. To extract data behind a login, please refer to the following tutorial:
2. Create a "Loop Item" and extract data - loop extract each tweet
- Select the first tweet on the web page (make sure to select the whole tweet block; it turns green when the whole tweet is selected)
- Continue to select the second tweet
- Choose "Extract text of the selected elements"
3. Create a "Loop Item" - to scroll down the web page
- Add a new Loop Item in the workflow
- Drag the original loop inside the new loop (Loop Item inside Loop Item1)
- Click the Loop Item1
- Select the Loop Mode as Scroll Page
- Set the scroll way to "scroll for one screen", the wait time to 1s, and the repeat count to 100 or more
- Make sure to tick "Capture data as page scrolls dynamically (possibly duplicates)"
- Click "Apply" to confirm
4. Modify the Loop Item XPath
- Click the "Loop Item" (not Loop Item1!) and enter the XPath //article[@role="article"]/../../..
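To see which element the XPath above selects, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The markup is a simplified, hypothetical stand-in for the tweet list (the real page structure is more complex and may change). The point is only that each /.. step moves one level up from the article element, so the loop lands on the outer container of each tweet.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each tweet's <article> sits three
# levels below its outer container (class="cell" is an invented name).
SAMPLE = """
<section>
  <div class="cell"><div><div>
    <article role="article"><span>first tweet</span></article>
  </div></div></div>
  <div class="cell"><div><div>
    <article role="article"><span>second tweet</span></article>
  </div></div></div>
</section>
"""

root = ET.fromstring(SAMPLE)

# Same expression as in the tutorial, made relative (".//") because
# ElementTree does not accept absolute paths on an element.
blocks = root.findall('.//article[@role="article"]/../../..')

for block in blocks:
    # itertext() walks all nested text, similar to extracting a whole tweet block.
    text = " ".join(t.strip() for t in block.itertext() if t.strip())
    print(block.get("class"), "->", text)
```

Each match is the outer container of one tweet rather than the article element itself, which is why the loop in the workflow covers the whole tweet block.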
5. Extract Data - select text to scrape
- Click the "Extract Data" action and you will see a tweet being highlighted in red
- Select the text within the red area (name, time, text, reply, retweet, like) and choose "Extract the text of selected element"
Double-click a field at the bottom of the page to rename it.
You may have noticed that the tweet's post time is shown as "20m", which makes it hard to tell the exact post date and time. You can modify this field to get the detailed time.
- Click the "More" button of the field
- Choose "Customize field"
- Choose to extract the datetime attribute
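Once the field holds the datetime attribute instead of the relative "20m" label, the value can be parsed into a real timestamp after export. The sketch below assumes the attribute is an ISO 8601 string in UTC with a trailing Z; the sample value and the function name parse_tweet_time are made up for illustration, so check your own export before relying on that format.

```python
from datetime import datetime

def parse_tweet_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2020-03-05T08:15:30.000Z'."""
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)

# Made-up sample value, not taken from a real tweet.
posted = parse_tweet_time("2020-03-05T08:15:30.000Z")
print(posted.strftime("%Y-%m-%d %H:%M:%S"))
```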
6. Start data extraction - run your task and get data
- Click "Save"
- Click "Run" on the upper right side
- Select "Run on your device" to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)
You can export the results in formats such as Excel, CSV, or JSON, or to your database.
Here is the sample output.
It is normal to get duplicates, since each scroll loads only one or two new tweets.
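Because duplicates are expected, you may want to drop them after exporting. A minimal sketch using Python's standard csv module follows. The column names (name, time, text), the sample rows, and the helper drop_duplicates are hypothetical, so adjust them to the field names in your own export.

```python
import csv
import io

# Made-up stand-in for an exported CSV file; a real export would be
# opened with open("your_export.csv", newline="", encoding="utf-8").
EXPORT = """name,time,text
alice,2020-03-05T08:15:30.000Z,Hello world
bob,2020-03-05T08:16:10.000Z,Good morning
alice,2020-03-05T08:15:30.000Z,Hello world
"""

def drop_duplicates(rows, key_fields):
    """Keep the first occurrence of each row, compared on key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = list(csv.DictReader(io.StringIO(EXPORT)))
unique_rows = drop_duplicates(rows, key_fields=("name", "time", "text"))
print(f"{len(rows)} rows exported, {len(unique_rows)} after removing duplicates")
```

Comparing on name, time, and text together treats two rows as the same tweet only when all three match; counts such as replies or likes are left out of the key because they can change between scrolls.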