This is my third (of five) post on using Python to process Twitter data.
Check out all the posts in the series.
In this post I'll take a quick look at how to save the tweets you have downloaded. Doing this allows you to access them at a later point and to perform more analysis on them. You have a couple of options for saving the tweets. The first of these is to save them to a file and the second option is to save them to a table in a database.
Saving Tweets to Files
In the previous blog post (in this series) I converted the tweets to a pandas DataFrame and then used that DataFrame to perform some analysis on the data and create some charts. Saving the DataFrame to a CSV file takes a very simple command.
# save tweets to a file
tweets_pd.to_csv('/Users/brendan.tierney/Dropbox/tweets.csv', sep=',')
We can inspect this file using a spreadsheet or any other app that can read CSV files.
When you want to read these tweets back into your Python environment, all you need to do is the following.
# and if we want to reuse these tweets at a later time we can reload them
old_tweets = pd.read_csv('/Users/brendan.tierney/Dropbox/tweets.csv')
old_tweets
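One thing to watch out for: by default to_csv also writes the DataFrame index as an unnamed first column, so it comes back as an extra column on reload. A minimal sketch of reading the file back cleanly, assuming the default to_csv settings used above:

# treat the first (unnamed) column as the DataFrame index rather than data
old_tweets = pd.read_csv('/Users/brendan.tierney/Dropbox/tweets.csv', index_col=0)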
That's all very easy!
Saving Tweets to a Database
There are two ways to add tweets to a table in the database: the slow way (row-by-row) and the fast way (a bulk insert).
Before we get started with inserting data, let's get our database connection set up, along with the table that will store the tweets. To do this we need to use the cx_Oracle Python library. The following code sets up the connection details (without my actual login details), establishes the connection and then retrieves some basic connection details to prove we are connected.
# import the Oracle Python library
import cx_Oracle

# define the login details
p_username = "..."
p_password = "..."
p_host = "..."
p_service = "..."
p_port = "1521"

# create the connection using the Easy Connect syntax: host:port/service
con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+":"+p_port+"/"+p_service)
cur = con.cursor()

# print some details about the connection and the library
print("Database version:", con.version)
print("Oracle Python version:", cx_Oracle.version)

Database version: 12.1.0.1.0
Oracle Python version: 6.3.1
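As an extra sanity check (my addition, not part of the original flow), you can run a trivial query through the new cursor to confirm the session really works:

# run a simple query against the dual table to confirm the session works
cur.execute("select user, sysdate from dual")
print(cur.fetchone())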
Now we can create a table based on the current date.
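The post doesn't show how cur_date is built, but from the table names used later (e.g. TWEETS_2018_06_12) it is a date string with underscores. A minimal sketch using the standard datetime library:

import datetime

# e.g. '2018_06_12' - matches the TWEETS_2018_06_12 table name used below
cur_date = datetime.datetime.now().strftime("%Y_%m_%d")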
# drop the table if it already exists
#drop_table = "DROP TABLE TWEETS_" + cur_date
#cur.execute(drop_table)

cre_table = "CREATE TABLE TWEETS_" + cur_date + " (tweet_id number, screen_name varchar2(100), place varchar2(2000), lang varchar2(20), date_created varchar2(40), fav_count number, retweet_count number, tweet_text varchar2(200))"
cur.execute(cre_table)
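If you want to confirm the table was created, a quick check (my sketch, not from the original post) against the Oracle data dictionary, using a bind variable for the table name:

# user_tables stores table names in upper case
cur.execute("select table_name from user_tables where table_name = :tname",
            tname="TWEETS_" + cur_date)
print(cur.fetchone())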
Let's first start with the slow (row-by-row) approach. To do this we need to take our pandas DataFrame, convert it to a list of tuples, and then to lists that can be indexed individually. Each insert is then a separate round trip to the database, which is why this approach gets slow as the number of tweets grows.
# convert the DataFrame to a list of tuples, then to one list per column
rows3 = [tuple(x) for x in tweets_pd.values]

lst_tweet_id = [item[0] for item in rows3]
lst_screen_name = [item[1] for item in rows3]
lst_lang = [item[3] for item in rows3]
lst_date_created = [item[4] for item in rows3]
lst_fav_count = [item[5] for item in rows3]
lst_retweet_count = [item[6] for item in rows3]
lst_tweet_text = [item[7] for item in rows3]

# define a cursor to use for the inserts
cur = con.cursor()

for i in range(len(rows3)):
    # do the insert using the index. This can be very slow and should not be used on big data
    cur.execute("insert into TWEETS_2018_06_12 (tweet_id, screen_name, lang, date_created, fav_count, retweet_count, tweet_text) values (:arg_1, :arg_2, :arg_3, :arg_4, :arg_5, :arg_6, :arg_7)",
                {'arg_1': lst_tweet_id[i], 'arg_2': lst_screen_name[i], 'arg_3': lst_lang[i],
                 'arg_4': lst_date_created[i], 'arg_5': lst_fav_count[i],
                 'arg_6': lst_retweet_count[i], 'arg_7': lst_tweet_text[i]})

# commit the records to the database and close the cursor
con.commit()
cur.close()
Now let's look at a quicker way of doing this.
WARNING: This depends on the version of the cx_Oracle library you are using. You may encounter some errors relating to the use of floats, etc. You might need to try a few different versions of the library until you find one that works for you, or these issues might already be fixed in the most recent versions.
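If you do hit those errors, one workaround I've found helpful (an assumption on my part, not from the original post) is to cast the numpy scalars in each row to native Python types before binding them:

# convert each numpy scalar to its native Python equivalent via .item(),
# since some cx_Oracle versions reject numpy.int64/float64 bind values;
# plain Python values (e.g. str) pass through unchanged
rows = [tuple(v.item() if hasattr(v, "item") else v for v in row)
        for row in tweets_pd.itertuples(index=False, name=None)]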
The first step is to convert the pandas DataFrame into a list of tuples.
rows = [tuple(x) for x in tweets_pd.values]
rows
Now we can do some cursor setup, like setting the array size. This determines how many records are sent to the database in each batch; it is better to use a reasonably large number here than a single-digit one.
cur = con.cursor()
cur.bindarraysize = 100
cur.executemany("insert into TWEETS_2018_06_12 (tweet_id, screen_name, place, lang, date_created, fav_count, retweet_count, tweet_text) values (:1, :2, :3, :4, :5, :6, :7, :8)", rows)

# commit the records to the database
con.commit()
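A quick way to confirm everything arrived (a short sketch, assuming the same connection and table as above):

# count the rows now in the table
cur.execute("select count(*) from TWEETS_2018_06_12")
print(cur.fetchone()[0])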
Check out the other blog posts in this series on Twitter Analytics using Python.