Twitter Data Mining


Categories: Data_Science

Twitter is a warehouse of data: a trove of people’s opinions and emotions. It holds a large volume of posts from real people, which is very useful for observing trends and building models. What makes Twitter especially attractive is its well-designed collection of APIs. These APIs let us harness its data and build useful systems, such as real-time sentiment analysis and reputation analysis.

We will dive deep into Twitter’s API collection. As data mining enthusiasts, we are interested in data collection rather than application development, so we will concentrate on the APIs that return tweets: the Search API and the Streaming API.

We should be careful, though, because the REST APIs are rate limited. After retrieving data from these APIs we need to store it efficiently, so we will use a database to make later analysis easy.
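One simple way to stay under a rate limit is to space requests evenly across the limit window. The sketch below is a minimal illustration of that idea; `min_interval` and `Throttle` are hypothetical helper names, and the figure of 180 requests per 15-minute window is an assumption that should be checked against Twitter’s current documentation.

```python
import time


def min_interval(limit, window_seconds):
    """Seconds to wait between calls so that `limit` calls fit in one window."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return window_seconds / limit


class Throttle:
    """Blocks just long enough to keep calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed example figures: 180 requests per 15-minute (900 s) window.
throttle = Throttle(min_interval(180, 900))
```

Calling `throttle.wait()` before each API request keeps the calls at least `interval` seconds apart, whatever limit and window are passed in.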

There are many options when choosing a database: SQL or NoSQL? SQL is fairly easy and doesn’t require much effort, so we will use MySQL for this example. All right, let’s dig in.

First of all, to use the Twitter API you need a Twitter developer account. Sign up for a Twitter account, go to this link, and create a Twitter app. Type any app name and description to get the consumer key and consumer secret. You also need an access token in order to send requests to the Twitter API; click the button that creates an access token to generate one. In total, we need four keys to access the Twitter API: the consumer key, consumer secret, access token, and access secret.
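Since four separate keys are needed, it can help to load them together and fail early if any is missing. Below is a minimal sketch that reads them from environment variables instead of hardcoding them in a script; the function name `load_twitter_keys` and the variable names (`TWITTER_CONSUMER_KEY` and so on) are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical environment variable names, one per key.
KEY_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_keys(env=None):
    """Return the four keys as a tuple, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise KeyError("missing Twitter keys: " + ", ".join(missing))
    return tuple(env[name] for name in KEY_NAMES)
```

The returned tuple is in the order consumer key, consumer secret, access token, access secret, so it can be unpacked into four variables in one line.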

R Implementation

R is the choice of many statisticians and has a really good ecosystem of libraries for data analysis. For Twitter data collection we have the “twitteR” library, and for data storage “RMySQL”. Here we use twitteR to find the trending topics in a particular place and then use the Search API to store tweets about those topics in a database.

Loading the required libraries

library(twitteR)
library(RMySQL)
# Pass the four keys to setup_twitter_oauth():
# consumer key, consumer secret, access token, access secret.

Now we need geographic data in order to choose the locations for which we will fetch Twitter trends.

download.file("", destfile = "lat.csv")
geo <- read.csv("lat.csv", stringsAsFactors = FALSE)
# Clean the data received: replace "?" with "." and
# strip the trailing ".N" / ".E" hemisphere markers
Latitude <- as.character(geo$Latitude)
Longitude <- as.character(geo$Longitude)
Latitude <- gsub("?", ".", Latitude, fixed = TRUE)
Longitude <- gsub("?", ".", Longitude, fixed = TRUE)
Latitude <- gsub(".N", "", Latitude, fixed = TRUE)
Longitude <- gsub(".E", "", Longitude, fixed = TRUE)

The code above could be simplified with the pipe operator that “dplyr” provides, but to keep things clean and simple we do it the old way. Next we iterate over the coordinates above, fetch the tweets, and store them in a database.

Be careful: Twitter limits queries to the Search API, so we need to send fewer requests to avoid being rate limited. This script therefore caps the number of locations, the number of trends per location, and the number of tweets per search.

# Connection and retrieval of tweets

# Connect to the MySQL database
connection <- register_mysql_backend("data_base_name", "host_server_address", "User_name", "password")
for (i in 1:33) {
  # Find the trend location closest to this coordinate pair
  Woeid <- closestTrendLocations(Latitude[i], Longitude[i])
  trends <- getTrends(Woeid$woeid)
  # Search the top 10 trends, retrying up to 20 times when rate limited
  for (current_trend in trends$name[1:10]) {
    search_term <- searchTwitteR(current_trend, n = 1000, retryOnRateLimit = 20)
    store_tweets_db(search_term, table_name = "Trends")
  }
}

Python Implementation

Python is another mature and flexible tool for data analysis. It has a very good library for Twitter: “Tweepy”. Tweepy has many functions, but we mainly concentrate on its data collection features.

First, let’s import the needed libraries and set the access keys.

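# Importing libraries
# (StreamListener belongs to Tweepy 3.x; Tweepy 4 merged it into Stream)
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import mysql.connector
import json
# consumer key, consumer secret,
# access token, access secret (placeholders: fill in your own).
ckey = "CONSUMER_KEY"
csecret = "CONSUMER_SECRET"
atoken = "ACCESS_TOKEN"
asecret = "ACCESS_SECRET"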

Next we need the credentials of the MySQL database in which we will store our tweet data.

# Database credentials

# No host is given here, so the connector uses the local machine;
# pass host='...' if your MySQL server runs elsewhere.
cnx = mysql.connector.connect(user='username', password='password',
                              database='database name')

We need a listener class to listen to the incoming stream of Twitter data. Below we implement the listener class.

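class listener(StreamListener):
    def on_data(self, data):
        all_data = json.loads(data)
        # check to ensure there is text in
        # the json data
        if 'text' in all_data:
            tweet = all_data["text"]
            username = all_data["user"]["screen_name"]
            timestamp = all_data["timestamp_ms"]
            tweetid = all_data["id"]
            location_of_tweet = all_data["user"]["location"]
            if location_of_tweet is None:
                # assumed fallback: store an empty string when no location is set
                location_of_tweet = ""
            # insert the row through the cnx connection opened earlier
            cursor = cnx.cursor()
            cursor.execute(
                "INSERT INTO StreamTweetTable (time, username, tweet, location, tweetid) VALUES (%s,%s,%s,%s,%s)",
                (timestamp, username, tweet, location_of_tweet, tweetid))
            cnx.commit()
            cursor.close()
        # keep the stream running, including after non-tweet messages
        return True

    def on_error(self, status):
        # print the HTTP status code sent by Twitter
        print(status)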

In the listener class above we read the incoming data, extract the tweet text, username, timestamp, user location, and tweet ID, and write them to the database table.
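The field extraction can also be pulled out into a small standalone function, which makes it easy to try on a sample message without opening a stream. This is only a sketch: `extract_fields` is a hypothetical helper, the sample JSON is made up for illustration, and the empty-string fallback for a missing location is an assumption.

```python
import json


def extract_fields(raw):
    """Parse one raw stream message; return a row tuple, or None if it is not a tweet."""
    message = json.loads(raw)
    if "text" not in message:
        return None  # e.g. a delete or limit notice
    user = message["user"]
    location = user.get("location") or ""  # assumed fallback for a missing location
    return (
        int(message["timestamp_ms"]),  # milliseconds since the epoch
        user["screen_name"],
        message["text"],
        location,
        str(message["id"]),
    )


# Made-up sample message with the same keys the listener reads.
sample = json.dumps({
    "text": "Batman returns",
    "id": 123,
    "timestamp_ms": "1500000000000",
    "user": {"screen_name": "bruce", "location": None},
})
row = extract_fields(sample)
```

The tuple follows the column order of the INSERT statement (time, username, tweet, location, tweetid), so it could be passed directly as the parameters of `cursor.execute`.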

Unlike in the previous R example, we need to create the table beforehand on the database server, with the columns used above. However, we can do this through an SQL statement in Python: just execute the statement below after opening the connection.

  -- column names match the INSERT above; time holds milliseconds, which overflow INT
  CREATE TABLE StreamTweetTable (
  time BIGINT,
  username varchar(255),
  tweet varchar(500),
  location varchar(255),
  tweetid varchar(255));
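
One way to run such a statement from Python is to execute it on a cursor right after connecting. The sketch below assumes a DB-API connection such as the `cnx` object returned by `mysql.connector.connect`; the helper name `create_stream_table` and the `IF NOT EXISTS` clause are additions for illustration, and the column names follow the INSERT statement used by the listener.

```python
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS StreamTweetTable (
    time BIGINT,
    username VARCHAR(255),
    tweet VARCHAR(500),
    location VARCHAR(255),
    tweetid VARCHAR(255)
)
"""


def create_stream_table(connection):
    """Create the tweet table if needed, using any DB-API 2.0 connection."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
        connection.commit()
    finally:
        cursor.close()
```

With the connection from earlier this would be called as `create_stream_table(cnx)`. The `time` column is declared as `BIGINT` because `timestamp_ms` is a millisecond timestamp, which does not fit in a 32-bit `INT`.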

Then we can use the credentials to collect the stream for a tracking word of interest, let’s say “Batman” 😛

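auther = OAuthHandler(ckey, csecret)
auther.set_access_token(atoken, asecret)
twitterStream = Stream(auther, listener())
# track English-language tweets that mention the word of interest
twitterStream.filter(track=["Batman"],
                    languages=["en"], stall_warnings=True)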

Now we have a database of tweets that we can use for Twitter sentiment analysis.

Feel free to leave suggestions in the comments below or send them by email.