With Octoparse, you can easily scrape data such as top news, hot topics, and worldwide trends from a variety of social media websites, such as Twitter.
In this tutorial, we will show you how to scrape data from Twitter. Any data visible on the web page can be scraped without coding. If you are interested in scraping data from social media websites like Twitter, this tutorial can help you get started.
Alternatively, you can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Twitter Template to save time. With this template, there is no need to configure the scraping task yourself. For further details, see: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial.
To illustrate, we will scrape news from Twitter, using this search page as an example: https://twitter.com/search?q=Latest%20News&src=tyah
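The example address is an ordinary search URL with a percent-encoded query. As a minimal sketch, the same address can be rebuilt for another search term with Python's standard library. The helper name `build_search_url` is hypothetical, and the `src=tyah` value is simply copied from the URL above, not a documented parameter.

```python
from urllib.parse import quote, urlencode


def build_search_url(query: str, src: str = "tyah") -> str:
    """Rebuild the search URL used in this tutorial for a given query.

    The base address and the `src` value are copied from the example URL;
    the helper itself is only an illustration.
    """
    # quote_via=quote encodes spaces as %20 (the default would use "+")
    params = urlencode({"q": query, "src": src}, quote_via=quote)
    return f"https://twitter.com/search?{params}"


print(build_search_url("Latest News"))
# prints https://twitter.com/search?q=Latest%20News&src=tyah
```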
By scraping data from Twitter, you can:
- Keep updated with the latest trends worldwide
- Find potential customers for your business
- Analyze the marketing value of hot topics
Let's get started. The main steps in this tutorial are listed below. [Download demo task file here]
- "Go To Web Page" - open the target website
- Set "Scroll down" - load more data from the listed page
- Create a "Loop Item" and extract data - loop extract each tweet
- Set Regular Expression - clean and reformat data if needed (Optional)
- Start extraction - run the task and get data
1. "Go to Web Page" - open the targeted web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode.
- Paste the URL into the "Website" box and click "Save URL" to move on
Please note that this page is Twitter's news search page, viewed without logging in. If you want to extract data behind a login, please refer to the following tutorial:
2. Set "Scroll down" - load more data from the listed page
- Check the box for "Scroll down to bottom of the page when finished loading"
- Set "Scroll times" to "20" and "Interval" to "3" seconds (these values are for demonstration; adjust them to your needs)
- Select "Scroll down to the bottom of the page" as the "Scroll way" and click the "OK" button
"Interval" is the waiting time between two consecutive scrolls.
Some websites, like Twitter, have no next page button for loading more content. To fully load the listings, we need to keep scrolling down to the bottom of the page. In principle, the higher the number we enter for "Scroll times", the more data we can extract.
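The scroll settings can be pictured as a simple loop. The sketch below is not Octoparse's implementation; it only illustrates how "Scroll times" and "Interval" interact. `scroll_once` is a hypothetical callback, and `FakePage` is a stand-in that simulates a page loading 10 items per scroll up to a total of 150.

```python
import time
from typing import Callable


def scroll_to_load(scroll_once: Callable[[], int],
                   scroll_times: int = 20,
                   interval: float = 3.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Scroll `scroll_times` times, pausing `interval` seconds between scrolls.

    `scroll_once` is a hypothetical callback that scrolls to the bottom of
    the page and returns the number of items loaded so far.
    """
    loaded = 0
    for i in range(scroll_times):
        loaded = scroll_once()
        if i < scroll_times - 1:
            sleep(interval)  # wait for new content before the next scroll
    return loaded


class FakePage:
    """Simulated page: each scroll loads more items until none are left."""

    def __init__(self, per_scroll: int = 10, total: int = 150):
        self.per_scroll, self.total, self.loaded = per_scroll, total, 0

    def scroll(self) -> int:
        self.loaded = min(self.loaded + self.per_scroll, self.total)
        return self.loaded


page = FakePage()
# The sleep is replaced with a no-op so the demonstration runs instantly.
print(scroll_to_load(page.scroll, scroll_times=20, interval=3,
                     sleep=lambda seconds: None))
# prints 150
```

With 20 scrolls and a 3-second interval, this sketch waits 19 × 3 = 57 seconds in total. The simulated page runs out of content after 15 scrolls, which is why a higher "Scroll times" value only helps while the page still has more to load.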
3. Create a "Loop Item" and extract data - loop extract each tweet
- Select the 1st tweet block in the list, then select the 2nd block; make sure both selections cover the same kind of block
- Click "Extract text of the selected elements" on the "Action Tips" panel
Octoparse will automatically select other similar items and create a "Loop item" list. Remember to select the whole block when you extract data from the list page.
- If some items are not detected, enter this modified XPath for the "Variable List" to locate elements more accurately: //li[@data-item-type='tweet']
- Delete the auto-generated data field
- Click on the data you want to extract on the 1st block
- Click "Extract text of the selected element" on the "Action Tips" panel
- Rename the field, either by choosing one of the predefined names or by typing your own
- Click "OK" to save
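To see what the XPath above selects, here is a minimal sketch using Python's standard library. The markup is hypothetical and heavily simplified (the real page is far more complex), and the `name` and `text` class names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the tweet list.
SAMPLE = """
<ol>
  <li data-item-type="tweet"><span class="name">Alice</span><p class="text">First tweet</p></li>
  <li data-item-type="promo"><p class="text">Not a tweet</p></li>
  <li data-item-type="tweet"><span class="name">Bob</span><p class="text">Second tweet</p></li>
</ol>
"""

root = ET.fromstring(SAMPLE.strip())

rows = []
# Loop item: every <li> whose data-item-type attribute equals 'tweet'
for block in root.findall(".//li[@data-item-type='tweet']"):
    # Fields are located relative to the current block, not the whole page
    rows.append({
        "name": block.findtext("span[@class='name']"),
        "text": block.findtext("p[@class='text']"),
    })

print(rows)
# prints [{'name': 'Alice', 'text': 'First tweet'}, {'name': 'Bob', 'text': 'Second tweet'}]
```

Because each field is looked up inside the current block, every row gets its own values. A lookup that searched the whole page instead would return the first match for every row.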
If Octoparse extracts only the first item and keeps producing duplicates, you may need to revise the "Loop Item" in your task. For example, if you create the loop from the name only and then select data outside the name area, Octoparse may extract the 1st item repeatedly. For more details, you can read the following article:
4. Use Regular Expression - clean and reformat data if needed
Regular Expression is used to reformat data after extraction in Octoparse. For example, to remove words like "Reply", "Retweet" and "Like" in this case, you can use Regular Expression to trim the strings and keep only the numeric value.
If the result already satisfies your needs, you can just skip this step.
- Select the "Reply" row, click the "Customize data field" icon, select the "Refine extracted data" option and click "Add step"
- Click "Replace" and paste the text to remove, "Reply ***", into the "Replace" box, copying it with all of its spaces from the extracted data (for example, from "Reply 856")
- Click "OK" to save
- For the rows of "Retweet" and "Like", you can just repeat the steps above
The value you enter into the "Replace" box must be copied with all of its original spaces. In this step, *** simply stands for those spaces.
You can reformat the values in the "Retweet" and "Like" rows in the same way if needed.
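The same cleanup can be expressed as a regular expression outside Octoparse. The sketch below assumes the counts appear as plain digits, optionally with thousands separators; abbreviated values such as "1.2K" are not handled. The function name `extract_count` is hypothetical.

```python
import re
from typing import Optional


def extract_count(raw: str) -> Optional[int]:
    """Return the first number found in a string such as 'Reply   856'."""
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None  # only the label is present, with no number
    return int(match.group().replace(",", ""))


for raw in ["Reply   856", "Retweet 1,204", "Like"]:
    print(repr(raw), "->", extract_count(raw))
# prints:
# 'Reply   856' -> 856
# 'Retweet 1,204' -> 1204
# 'Like' -> None
```

Matching the digits directly, instead of replacing the label and its spaces, avoids having to copy the exact whitespace.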
5. Start data extraction - run your task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For a premium user, Cloud Extraction is highly recommended.
You can export the results in formats such as Excel, CSV, and JSON, or to your database.
Here is the sample output.
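Once exported, the data can be processed outside Octoparse. The sketch below reads a CSV export with Python's standard library and picks the most-liked tweet. The column names and the two rows are hypothetical; a real export uses whatever field names you defined in the task.

```python
import csv
import io
import json

# Hypothetical export; real column names depend on the fields you defined.
EXPORTED_CSV = """Name,Tweet,Reply,Retweet,Like
Alice,First tweet,856,120,3400
Bob,Second tweet,12,3,45
"""


def load_rows(csv_text: str) -> list:
    """Parse CSV text into dictionaries, converting count columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in ("Reply", "Retweet", "Like"):
            row[column] = int(row[column])
        rows.append(row)
    return rows


rows = load_rows(EXPORTED_CSV)
most_liked = max(rows, key=lambda r: r["Like"])
print(json.dumps(most_liked, indent=2))
```

For a file on disk, the same `csv.DictReader` call works on an opened file object instead of `io.StringIO`.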