Brendan Tierney - Oralytics Blog: June 2018

Thursday, June 28, 2018

My book on Oracle R Enterprise translated into Chinese

A couple of days ago the post man knocked on my door with a package. I hadn't ordered anything, so it was a puzzling what it might be.

When I opened the package I found 3 copies of a book in Chinese.

It was one of my books !

One of my books was translated into Chinese !

What a surprise, as I wasn't aware this was happening.

At this time I'm not sure where you can purchase the book, but I'll update this blog post when I find out.

Monday, June 18, 2018

Twitter Analytics using Python - Part 3

This is my third (of five) post on using Python to process Twitter data.

Check out my all the posts in the series.

In this post I'll have a quick look at how to save the tweets you have download. By doing this allows you to access them at a later point and to perform more analysis. You have a few instances of saving the tweets. The first of these is to save them to files and the second option is to save them to a table in a database.

Saving Tweets to files

In the previous blog post (in this series) I had converged the tweets to Pandas and then used the panda structure to perform some analysis on the data and create some charts. We have a very simple command to save to CSV.

# save tweets to a file
tweets_pd.to_csv('/Users/brendan.tierney/Dropbox/tweets.csv', sep=',')

We can inspect this file using a spreadsheet or some other app that can read CSV files and get the following.

Twitter app8

When you want to read these tweets back into your Python environment, all you need to do is the following.

# and if we want to reuse these tweets at a later time we can reload them
old_tweets = pd.read_csv('/Users/brendan.tierney/Dropbox/tweets.csv')

old_tweets

Tweet app9

That's all very easy!

Saving Tweets to a Database

There are two ways to add tweets to table in the database. There is the slow way (row-by-row) or the fast way doing a bulk insert.

Before we get started with inserting data, lets get our database connection setup and the table to store the tweets for our date. To do this we need to use the cx_oracle python library. The following codes shows the setting up of the connections details (without my actual login details), establishes the connects and then retrieves some basic connection details to prove we are connected.

# import the Oracle Python library
import cx_Oracle

# define the login details
p_username = "..."
p_password = "..."
p_host = "..."
p_service = "..."
p_port = "1521"

# create the connection
con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+"/"+p_service+":"+p_port)
cur = con.cursor()

# print some details about the connection and the library
print("Database version:", con.version)
print("Oracle Python version:", cx_Oracle.version)


Database version: 12.1.0.1.0
Oracle Python version: 6.3.1

Now we can create a table based on the current date.

# drop the table if it already exists
#drop_table = "DROP TABLE TWEETS_" + cur_date
#cur.execute(drop_table)

cre_table = "CREATE TABLE TWEETS_" + cur_date + " (tweet_id number, screen_name varchar2(100), place varchar2(2000), lang varchar2(20), date_created varchar2(40), fav_count number, retweet_count number, tweet_text varchar2(200))"

cur.execute(cre_table)

Now lets first start with the slow (row-by-row) approach. To do this we need to take our Panda data frame and convert it to lists that can be indexed individually.

lst_tweet_id = [item[0] for item in rows3]
lst_screen_name = [item[1] for item in rows3]
lst_lang =[item[3] for item in rows3]
lst_date_created = [item[4] for item in rows3]
lst_fav_count = [item[5] for item in rows3]
lst_retweet_count = [item[6] for item in rows3]
lst_tweet_text = [item[7] for item in rows3]

#define a cursor to use for the the inserts
cur = con.cursor()
for i in range(len(rows3)):
    #do the insert using the index. This can be very slow and should not be used on big data
    cur3.execute("insert into TWEETS_2018_06_12 (tweet_id, screen_name, lang, date_created, fav_count, retweet_count, tweet_text) values (:arg_1, :arg_2, :arg_3, :arg_4, :arg_5, :arg_6, :arg_7)",
                 {'arg_1':lst_tweet_id[i], 'arg_2':lst_screen_name[i], 'arg_3':lst_lang[i], 'arg_4':lst_date_created[i],
                  'arg_5':lst_fav_count[i], 'arg_6':lst_retweet_count[i], 'arg_7':lst_tweet_text[i]})

#commit the records to the database and close the cursor
con.commit()
cur.close()

Tweet app10

Now let us look a quicker way of doing this.

WARNING: It depends on the version of the cx_oracle library you are using. You may encounter some errors relating to the use of floats, etc. You might need to play around with the different versions of the library until you get the one that works for you. Or these issues might be fixed in the most recent versions.

The first step is to convert the panda data frame into a list.

rows = [tuple(x) for x in tweets_pd.values]
rows

Tweet app11

Now we can do some cursor setup like setting the array size. This determines how many records are sent to the database in each batch. Better to have a larger number than a single digit number.

cur = con.cursor()

cur.bindarraysize = 100

cur2.executemany("insert into TWEETS_2018_06_12 (tweet_id, screen_name, place, lang, date_created, fav_count, retweet_count, tweet_text) values (:1, :2, :3, :4, :5, :6, :7, :8)", rows)

Check out the other blog posts in this series of Twitter Analytics using Python.

Monday, June 4, 2018

Twitter Analytics using Python - Part 2

This is my second (of five) post on using Python to process Twitter data.

Check out my all the posts in the series.

In this post I was going to look at two particular aspects. The first is the converting of Tweets to Pandas. This will allow you to do additional analysis of tweets. The second part of this post looks at how to setup and process streaming of tweets. The first part was longer than expected so I'm going to hold the second part for a later post.

Step 6 - Convert Tweets to Pandas

In my previous blog post I show you how to connect and download tweets. Sometimes you may want to convert these tweets into a structured format to allow you to do further analysis. A very popular way of analysing data is to us Pandas. Using Pandas to store your data is like having data stored in a spreadsheet, with columns and rows. There are also lots of analytic functions available to use with Pandas.

In my previous blog post I showed how you could extract tweets using the Twitter API and to do selective pulls using the Tweepy Python library. Now that we have these tweet how do I go about converting them into Pandas for additional analysis? But before we do that we need to understand a bit more a bout the structure of the Tweet object that is returned by the Twitter API. We can examine the structure of the User object and the Tweet object using the following commands.

dir(user)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_api',
 '_json',
 'contributors_enabled',
 'created_at',
 'default_profile',
 'default_profile_image',
 'description',
 'entities',
 'favourites_count',
 'follow',
 'follow_request_sent',
 'followers',
 'followers_count',
 'followers_ids',
 'following',
 'friends',
 'friends_count',
 'geo_enabled',
 'has_extended_profile',
 'id',
 'id_str',
 'is_translation_enabled',
 'is_translator',
 'lang',
 'listed_count',
 'lists',
 'lists_memberships',
 'lists_subscriptions',
 'location',
 'name',
 'needs_phone_verification',
 'notifications',
 'parse',
 'parse_list',
 'profile_background_color',
 'profile_background_image_url',
 'profile_background_image_url_https',
 'profile_background_tile',
 'profile_banner_url',
 'profile_image_url',
 'profile_image_url_https',
 'profile_link_color',
 'profile_location',
 'profile_sidebar_border_color',
 'profile_sidebar_fill_color',
 'profile_text_color',
 'profile_use_background_image',
 'protected',
 'screen_name',
 'status',
 'statuses_count',
 'suspended',
 'time_zone',
 'timeline',
 'translator_type',
 'unfollow',
 'url',
 'utc_offset',
 'verified']

dir(tweets)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_api',
 '_json',
 'author',
 'contributors',
 'coordinates',
 'created_at',
 'destroy',
 'entities',
 'favorite',
 'favorite_count',
 'favorited',
 'geo',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'is_quote_status',
 'lang',
 'parse',
 'parse_list',
 'place',
 'retweet',
 'retweet_count',
 'retweeted',
 'retweets',
 'source',
 'source_url',
 'text',
 'truncated',
 'user']

We can see all this additional information to construct what data we really want to extract.

The following example illustrates the searching for tweets containing a certain word and then extracting a subset of the metadata associated with those tweets.

oracleace_tweets = tweepy.Cursor(api.search,q="oracleace").items()
tweets_data = []
for t in oracleace_tweets:
   tweets_data.append((t.author.screen_name,
                       t.place,
                       t.lang,
                       t.created_at,
                       t.favorite_count,
                       t.retweet_count,
                       t.text.encode('utf8')))

We print the contents of the tweet_data object.

print(tweets_data)

[('jpraulji', None, 'en', datetime.datetime(2018, 5, 28, 13, 41, 59), 0, 5, 'RT @tanwanichandan: Hello Friends,\n\nODevC Yatra is schedule now for all seven location.\nThis time we have four parallel tracks i.e. Databas…'), ('opal_EPM', None, 'en', datetime.datetime(2018, 5, 28, 13, 15, 30), 0, 6, "RT @odtug: Oracle #ACE Director @CaryMillsap is presenting 2 #Kscope18 sessions you don't want to miss! \n- Hands-On Lab: How to Write Bette…"), ('msjsr', None, 'en', datetime.datetime(2018, 5, 28, 12, 32, 8), 0, 5, 'RT @tanwanichandan: Hello Friends,\n\nODevC Yatra is schedule now for all seven location.\nThis time we have four parallel tracks i.e. Databas…'), ('cmvithlani', None, 'en', datetime.datetime(2018, 5, 28, 12, 24, 10), 0, 5, 'RT @tanwanichandan: Hel ......

I've only shown a subset of the tweets_data above.

Now we want to convert the tweets_data object to a panda object. This is a relative trivial task but an important steps is to define the columns names otherwise you will end up with columns with labels 0,1,2,3...

import pandas as pd

tweets_pd = pd.DataFrame(tweets_data,
                         columns=['screen_name', 'place', 'lang', 'created_at', 'fav_count', 'retweet_count', 'text'])

Now we have a panda structure that we can use for additional analysis. This can be easily examined as follows.

tweets_pd

 	screen_name 	place 	lang 	created_at 	fav_count 	retweet_count 	text
0 	jpraulji 	None 	en 	2018-05-28 13:41:59 	0 	5 	RT @tanwanichandan: Hello Friends,\n\nODevC Ya...
1 	opal_EPM 	None 	en 	2018-05-28 13:15:30 	0 	6 	RT @odtug: Oracle #ACE Director @CaryMillsap i...
2 	msjsr 	None 	en 	2018-05-28 12:32:08 	0 	5 	RT @tanwanichandan: Hello Friends,\n\nODevC Ya...

Now we can use all the analytic features of pandas to do some analytics. For example, in the following we do a could of the number of times a language has been used in our tweets data set/panda, and then plot it.

import matplotlib.pyplot as plt

tweets_by_lang = tweets_pd['lang'].value_counts()
print(tweets_by_lang)

lang_plot = tweets_by_lang.plot(kind='bar')
lang_plot.set_xlabel("Languages")
lang_plot.set_ylabel("Num. Tweets")
lang_plot.set_title("Language Frequency")

en    182
fr      7
es      2
ca      2
et      1
in      1

Pandas1

Similarly we can analyse the number of times a twitter screen name has been used, and limited to the 20 most commonly occurring screen names.

tweets_by_screen_name = tweets_pd['screen_name'].value_counts()
#print(tweets_by_screen_name)

top_twitter_screen_name = tweets_by_screen_name[:20]
print(top_twitter_screen_name)

name_plot = top_twitter_screen_name.plot(kind='bar')
name_plot.set_xlabel("Users")
name_plot.set_ylabel("Num. Tweets")
name_plot.set_title("Frequency Twitter users using oracleace")

oraesque           7
DBoriented         5
Addidici           5
odtug              5
RonEkins           5
opal_EPM           5
fritshoogland      4
svilmune           4
FranckPachot       4
hariprasathdba     3
oraclemagazine     3
ritan2000          3
yvrk1973           3
...

Pandas2

There you go, this post has shown you how to take twitter objects, convert them in pandas and then use the analytics features of pandas to aggregate the data and create some plots.