With Octoparse, you can scrape data such as top news, hot topics, and worldwide trends from a variety of social media websites. In this tutorial, we will show you how to extract data from Twitter. Any data visible on the web page can be scraped without coding. If you are interested in scraping data from social media websites like Twitter, this tutorial can help you get started.
After running your task, export the resulting data in one of the provided formats, such as Excel, CSV or JSON, or export it to your database.
To illustrate, we will scrape news information from Twitter as an example: https://twitter.com/search?q=news&src=typd&lang=en
By scraping data from Twitter, you can:
· Learn about the newest trends worldwide
· Identify potential customers for your business
· Analyze the marketing value of hot topics
Let's walk through the main steps for setting up your task. [Download example task file]
1) "Go To Web Page" - to open the target website
· Paste the target URL into the "Extraction URL" box and save.
2) Scroll down - to get more data from the list page
· Select the "Scroll Down" option under "Advanced Options".
· Set the "Scroll times" and "Interval" values you need.
· Select "Scroll down for one screen" as the "Scroll way" and click the "OK" button.
3) Create a "Loop Item" - to extract each tweet in a loop
· Click the data you want on the web page; the selected area will be highlighted in green.
· Click "Select all", then select "Extract text from the selected elements" in the "Action Tips" panel.
· Rename the "Field name" column if necessary.
4) Use Regular Expression - to clean and reformat data if needed
Regular Expression is used to reformat data after it has been extracted in Octoparse. For example, if you want to delete labels such as "Reply", "Retweet" and "Like" in this case, you can use it to trim the strings so that only the numeric value remains. If the result already satisfies your needs, you can skip this step.
· Select the "Reply" row, click the "Customize data field" icon, select the "Refine extracted data" option and click the "Add step" button.
· Click "Replace", then copy "Reply ***", including all of its whitespace, from the extracted value "Reply 856" and paste it into the "Replace" box.
· Click "OK" button.
The value you enter into the "Replace" box must be copied together with all of its original whitespace. In this step, *** stands for that whitespace.
You can also reformat the values in the "Retweet" and "Like" rows in the same way if needed.
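If you prefer to clean the values after export instead, the same trimming can be sketched in Python. This is only an illustration, not how Octoparse implements its "Replace" step: it assumes each value is a label such as "Reply" followed by whitespace and a count, and the function name `strip_label` and the "Retweet" sample value are hypothetical.

```python
import re

def strip_label(value, label):
    """Remove a leading label such as "Reply" plus the whitespace around it."""
    # \s* absorbs any spaces, tabs, or line breaks before and after the label,
    # so the whitespace does not have to be copied exactly.
    pattern = r"^\s*" + re.escape(label) + r"\s*"
    return re.sub(pattern, "", value).strip()

print(strip_label("Reply 856", "Reply"))          # value taken from the example above
print(strip_label("Retweet\n  1200", "Retweet"))  # hypothetical value with a line break
```

Because the pattern matches any run of whitespace, it avoids the exact-copy requirement described above.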
5) Start data extraction - to run your task and get data
· Select "Start Extraction" and then "Local Extraction".
· Select "Export" to get all the data you want.
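Once the data is exported as CSV, it can be loaded for further analysis. The sketch below is a minimal example using Python's standard `csv` module. The column names "Tweet", "Reply", "Retweet" and "Like" and the sample rows are assumptions standing in for a real export; your file will contain whatever field names you chose in step 3.

```python
import csv
import io

# Hypothetical sample standing in for an exported file.
SAMPLE_CSV = (
    "Tweet,Reply,Retweet,Like\n"
    "Example tweet text,856,1200,3400\n"
    "Another example,12,,7\n"
)

def load_rows(handle):
    """Read exported rows and convert the count fields to integers."""
    rows = []
    for row in csv.DictReader(handle):
        for field in ("Reply", "Retweet", "Like"):
            text = (row.get(field) or "").strip()
            # Blank or non-numeric counts are treated as 0.
            row[field] = int(text) if text.isdigit() else 0
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["Reply"], rows[1]["Retweet"])
```

To read a real export, pass a file handle instead, for example `open("export.csv", newline="", encoding="utf-8")`, where the file name is a placeholder and the encoding may need adjusting for your export.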