
An introduction to text analysis with Python, Part 3


Posted on April 10, 2012 by Neal Caren

Two earlier tutorials looked at the basics of using Python to analyze text data. This post explains how to expand
the code written earlier so that you can use it to explore the positive and negative sentiment of any set of texts.
Specifically, we'll look at looping over more than one tweet, incorporating a more complete dictionary, and
exporting the results. [If you just want the final Python script, you can just download it.]

Earlier, we used a fairly haphazard list of words to measure positive sentiment. While the study in Science used
the commercial LIWC dictionary, an alternate sentiment dictionary is produced by Theresa Wilson, Janyce Wiebe,
and Paul Hoffmann at the University of Pittsburgh and is freely available. In both cases, the sentiment
dictionaries are used in a fairly straightforward way: the more positive words in the text, the higher the text scores
on the positive sentiment scale. While this has some drawbacks, the method is quite popular: the LIWC database
has over 1,000 cites in Google Scholar, and the Wilson et al. database has more than 600.

Downloading

Since the Wilson et al. list combines negative and positive polarity words in one list, and includes both words and
word stems, I will clean it up a little bit. You can download the positive list and the negative list using your
browser, but you don't have to. Python can do that.

First, you need to import one of the modules that Python uses to communicate with the Internet:

>>> import urllib


Like many commands, Python won't return anything unless something went wrong. In this case, it should just
respond with >>>, which means that the module was successfully brought into memory. Next, store the web
address that you want to access in a string. You don't have to do this, but it's the type of thing that makes your
code easier to read and allows you to scale up quickly when you want to download thousands of URLs.

>>> url='http://www.unc.edu/~ncaren/haphazard/negative.txt'
You can also create a string with the name you want the file to have on your hard drive:

>>> file_name='negative.txt'
To download and save the file:

>>> urllib.urlretrieve(url,file_name)
This will download the file into your current directory. If you want it to go somewhere else, you can put the full
path in the file_name string. You didn't have to enter the URL and the file name in the prior lines. Something like the
following would have worked exactly the same:

>>> urllib.urlretrieve('http://www.unc.edu/~ncaren/haphazard/negative.txt','negative.txt')
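Going back to the earlier point about saving the file somewhere else, a full path in the file name works the same way. The sketch below shows the idea; the directory here is hypothetical, so substitute a folder that actually exists on your machine:

>>> urllib.urlretrieve('http://www.unc.edu/~ncaren/haphazard/negative.txt','/Users/yourname/text-analysis/negative.txt')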


Note that the location and filename are both surrounded by quotation marks because you want Python to use this
information literally; they aren't referring to a string object, like in our previous code. This line of code is actually
quite readable, and in most circumstances this would be the most efficient thing to do. But there are actually three
files that we want to get: the negative list, the positive list, and the list of tweets. And we can download the three
using a pretty simple loop:

>>> files=['negative.txt','positive.txt','obama_tweets.txt']
>>> path='http://www.unc.edu/~ncaren/haphazard/'
>>> for file_name in files:
...     urllib.urlretrieve(path+file_name,file_name)
...
The first line creates a new list with three items, the names of the three files to be downloaded. The second line
creates a string object that stores the url path that they all share. The third line starts a loop over each of the items
in the files list using file_name to reference each item in turn. The fourth line is indented, because it happens once
for each item in the list as a result of the loop, and downloads the file. This is the same as the original download
line, except the URL is now the combination of two strings, path and file_name. As noted previously, Python can
combine strings with a plus sign, so the result from the first pass through the loop will be
http://www.unc.edu/~ncaren/haphazard/negative.txt, which is where the file can be found. Note that this takes advantage of
the fact that we don't mind reusing the original file name. If we wanted to change it, or if there were different
paths to each of the files, things would get slightly trickier.
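If you want to double-check what the loop will fetch before running it, you can print the combined URL for each file. This is just a quick verification sketch using the same files and path variables:

>>> for file_name in files:
...     print path+file_name
...
http://www.unc.edu/~ncaren/haphazard/negative.txt
http://www.unc.edu/~ncaren/haphazard/positive.txt
http://www.unc.edu/~ncaren/haphazard/obama_tweets.txt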

More fun with lists

Lets take a look at the list of Tweets that we just downloaded. First, open the file:

>>> tweets = open("obama_tweets.txt").read()


As you might have guessed, this line is actually doing double duty. It opens the file and reads it into memory
before it is stored in tweets. Since the file has one Tweet on each line, we can turn it into a list of tweets by splitting
it at the end of line character. The file was originally created on a Mac, so the end of line character is a \n (think n
for new line). On a Windows computer, the end of line character is a \r\n (think r for return and n for new line). So
if the file was created on a Windows computer, you might need to strip out the extra character with something like
windows_file=windows_file.replace('\r','') before you split the lines, but you don't need to worry about that here, no
matter what operating system you are using. The end of line character comes from the computer that made the
file, not the computer you are currently using. To split the tweets into a list:

>>> tweets_list = tweets.split('\n')
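If you ever do work with a file that was saved on a Windows machine, one defensive approach is to strip the carriage returns before splitting. This is just a sketch of that idea with a hypothetical file name; it isn't needed for the tweet file used here:

>>> windows_text = open('some_windows_file.txt').read()
>>> windows_text = windows_text.replace('\r','')
>>> windows_lines = windows_text.split('\n')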


As always, you can check how many items are in the list:

>>> len(tweets_list)
1365
You can print the entire list by typing print tweets_list, but it will scroll by very fast. A more useful way to look at it
is to print just some of the items. Since it's a list, we can loop through the first few items so they each print on
their own line.


>>> for tweet in tweets_list[0:5]:
...     print tweet
...
Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.
In his teen years, Obama has been known to use marijuana and cocaine.
IPA Congratulates President Barack Obama for Leadership Regarding JOBS Act: WASHINGTON, Apr 05, 2012 (BU
RT @Professor_Why: #WhatsRomneyHiding - his connection to supporters of Critical Race Theory.... Oh wait, tha
RT @wardollarshome: Obama has approved more targeted assassinations than any modern US prez; READ & RT:

Note the new [0:5] after the tweets_list but before the : that begins the loop. The first number tells Python where to
make the first cut in the list. The potentially counterintuitive part is that this number doesn't reference an actual
item in the list, but rather a position between each item in the list: think about where the comma goes when lists
are created or printed. Adding to the confusion, the position at the start of the list is 0. So, in this case, we are
telling Python we want to slice our list starting at the beginning and continuing until the fifth comma, which is
after the fifth item in the list.
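One quick way to convince yourself of how the slice works is to check its length; since we are cutting from position 0 to position 5, we should get five tweets back:

>>> len(tweets_list[0:5])
5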

So, if you wanted to just print the second item in the list, you could type:

>>> print tweets_list[1:2]


['Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.']
This slices the list from the first comma to the second comma, so the result is the second item in the list. Unless
you have a computer science background, this may be confusing as it's not the common way to think of items in
lists.

As a shorthand, you can leave out the first number in the pair if you want to start at the very beginning or leave
out the last number if you want to go until the end. So, if you want to print out the first five tweets, you could just
type print tweets_list[:5]. There are several other shortcuts along these lines that are available. We will cover some of
them in other tutorials.
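To make the shorthand concrete, here are both halves side by side; the lengths follow from the 1,365 tweets we counted above:

>>> len(tweets_list[:5])
5
>>> len(tweets_list[5:])
1360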

Now that we have our Tweet list expanded, let's load up the positive sentiment list and print out the first few
entries:

>>> pos_sent = open("positive.txt").read()


>>> positive_words=pos_sent.split('\n')
>>> print positive_words[:10]
['abidance', 'abidance', 'abilities', 'ability', 'able', 'above', 'above-average', 'abundant', 'abundance', 'acceptance']
Like the tweet list, this file contained each entry on its own line, so it loads exactly the same way. If you typed
len(positive_words) you would find out that this list has 2,230 entries.
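That check looks like this:

>>> len(positive_words)
2230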

Preprocessing

In the earlier post, we explored how to preprocess the tweets: remove the punctuation, convert to lower case, and
examine whether or not each word was in the positive sentiment list. We can use this exact same code here with
our long list. The one alteration is that instead of having just one tweet, we now have a list of 1,365 tweets, so we
have to loop over that list.


>>> for tweet in tweets_list:
...     positive_counter=0
...     tweet_processed=tweet.lower()
...     for p in list(punctuation):
...         tweet_processed=tweet_processed.replace(p,'')
...     words=tweet_processed.split(' ')
...     for word in words:
...         if word in positive_words:
...             print word
...             positive_counter=positive_counter+1
...     print positive_counter/len(words)
...
If you saw a string of numbers roll past you, it worked! To review, we start by looping over each item of the list.
We set up a counter to hold the running total of the number of positive words found in the tweet. Then we make
everything lower case and store it in tweet_processed. To strip out the punctuation, we loop over every item of
punctuation, swapping out the punctuation mark with nothing. If you haven't already typed from string import
punctuation in your current Python session, you might get an error with this line, so make sure to include that
import. The cleaned tweet is then converted to a list of words, split at the white spaces. Finally, we loop through
each word in the tweet, and if the word is in our new and expanded list of positive words, we increase the counter
by one. After cycling through each of the tweet words, the proportion of positive words is computed and printed. If
you just get zeros, you might need to type from __future__ import division again, so that the result isn't rounded down.
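For reference, the two imports mentioned above look like this; if you are starting a fresh Python session, run them before the loop:

>>> from string import punctuation
>>> from __future__ import division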

The major problem with this script is that it is currently useless. It prints the positive sentiment results, but then
doesn't do anything with it. A more practical solution would be to store the results somehow. In a standard
statistical package, we would generate a new variable that held our results. We can do something similar here by
storing the results in a new list. Before we start the tweet loop, we add the line:

>>> positive_counts=[]
Then, instead of printing the proportion, we can append it to the list:

>>> positive_counts.append(positive_counter/len(words))
The next time we run through the loop, it won't produce any output, but it will create a list of the proportions. This
still isn't that useful, although you can use Python to do most of your statistical analysis and plotting, but at this
point you are probably ready to get your data out of Python and back into your statistical package.
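Putting those two changes together, the storage version of the loop looks something like this sketch; it assumes tweets_list, positive_words, and the two imports from above are already in place:

>>> positive_counts=[]
>>> for tweet in tweets_list:
...     positive_counter=0
...     tweet_processed=tweet.lower()
...     for p in list(punctuation):
...         tweet_processed=tweet_processed.replace(p,'')
...     words=tweet_processed.split(' ')
...     for word in words:
...         if word in positive_words:
...             positive_counter=positive_counter+1
...     positive_counts.append(positive_counter/len(words))
...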

The most convenient way to store data for use in multiple packages is as a plain text file where each case is its own
row and variables are separated by commas. This file type commonly has a csv extension, and Python can read
and write these files quite easily.

First, import the csv module:

>>> import csv


To write to a csv file, you first open the file with the csv writer:

>>> writer = csv.writer(open('tweet_sentiment.csv', 'wb'))


In the 'open' part of the command, the first item is the name of the file you want to create, and the 'wb' tells Python
that this is a file you want to write. Be careful with your file name, because if there is already a file with this name,
Python will write over it. If you wanted to read a csv file, you would just swap reader for writer and 'rb' for 'wb',
which creates a nice symmetry.
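For example, reading the file we are about to create back into Python would look something like this sketch:

>>> reader = csv.reader(open('tweet_sentiment.csv', 'rb'))
>>> for row in reader:
...     print row
...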

Sending your list of positive sentiment values to the file requires just one more line:

>>> writer.writerows([[score] for score in positive_counts])
Each score is wrapped in its own one-item list so that the csv writer treats it as a row; writerows expects a sequence
of rows rather than a sequence of bare numbers. You can now import this file into your statistical software package
or just take a peek at it in Excel. Of course,
having just one variable is not the most useful thing. Usually, you will have more than one that you want to export,
but for now we just have the one. At a minimum, you might also want to export the text of the original tweets. To
combine more than one list together, you can zip them into one list. This is different from appending one list to the
other, which would just make the one list twice as long.

>>> output=zip(tweets_list,positive_counts)
In this case, zip creates a new list output that is the same length as our tweets_list, but each entry has two items: the
tweet and the positive count. You can use zip to combine as many lists as you like, although they all need to be the
same length. Technically, each item in the list is a tuple, or an ordered element list, which is a data format quite
similar to a list but generally less useful for textual analysis.
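If the behavior of zip is unclear, a toy example with two short, made-up lists shows what the pairing looks like:

>>> letters=['a','b','c']
>>> numbers=[1,2,3]
>>> zip(letters,numbers)
[('a', 1), ('b', 2), ('c', 3)]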

To write our final version of the output, we need to repeat the line that created our writer and then write the output
list:

>>> writer = csv.writer(open('tweet_sentiment.csv', 'wb'))


>>> writer.writerows(output)
That's it. If you searched every day for tweets mentioning President Obama and ran this script, my guess is that
your data would tell a pretty interesting story about trends over time. Or, if you had your own text data arranged
so that each text was on its own line, you could just update the file name and compute the sentiment scores.

In case you were wondering, the top two most negative tweets were "Hatch Makes Startling Accusation Against
Obama http://t.co/HVQfUzgr ..shocking headlineNOT" and "We need to tag Obama & define him for Nov
battle. #Obama #failedleader #incompetent #wasteful #divisive #desperate #flexible #arrogant #lazy", which
gives our little study some face validity.

You have probably noticed that our code for this project has swelled to about 40 lines. Not horrible, but not that
easy to copy and paste. And if you mess up in a loop, you have to start all over again. While typing in commands
this way in Python is useful for playing around with new code and commands, most of the time it's not the most
efficient way to do things. Just like Stata has .do files, you can similarly save a series of Python commands as a text
file and then run them all together. These sorts of Python files use a .py extension.

I've compiled all the code for our sentiment analysis into one file, and you can download it as sentiment.py using
your browser. At this point, you might want to make a directory for yourself where you can store all your Python
files.
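In case you would rather type it up yourself, below is a condensed sketch of what such a script might contain. It follows the steps from this tutorial; the downloadable sentiment.py may differ in its details.

from __future__ import division
from string import punctuation
import urllib
import csv

# Download the word lists and the tweets.
files = ['negative.txt', 'positive.txt', 'obama_tweets.txt']
path = 'http://www.unc.edu/~ncaren/haphazard/'
for file_name in files:
    urllib.urlretrieve(path + file_name, file_name)

# Load the tweets and the positive-word list, one item per line.
tweets_list = open('obama_tweets.txt').read().split('\n')
positive_words = open('positive.txt').read().split('\n')

# Score each tweet: the share of its words that appear in the positive list.
positive_counts = []
for tweet in tweets_list:
    positive_counter = 0
    tweet_processed = tweet.lower()
    for p in list(punctuation):
        tweet_processed = tweet_processed.replace(p, '')
    words = tweet_processed.split(' ')
    for word in words:
        if word in positive_words:
            positive_counter = positive_counter + 1
    positive_counts.append(positive_counter / len(words))

# Write the tweets and their scores out as a two-column csv file.
output = zip(tweets_list, positive_counts)
writer = csv.writer(open('tweet_sentiment.csv', 'wb'))
writer.writerows(output)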


You can quit Python by typing exit(), which should bring you back to your operating system's prompt. Now,
assuming you are in the directory where you downloaded sentiment.py, you can run the entire program by typing:

$ python sentiment.py
Remember the $ sign means that we are out of Python. This command tells your computer that you want Python
to run the program sentiment.py. If all works according to plan, your computer should think for a couple of seconds,
and then display the operating system prompt. Python displays fewer things when run this way: only things with a
print statement in front of them are displayed, so don't expect your output to be as verbose as when you typed in
each command. Actually, you probably would want to add some print statements along the way so that you knew
everything was working.

Assuming you didn't get an error message, there should be a new file called tweet_sentiment.csv in your current
directory. You can confirm this by typing ls -l on a Mac or dir in Windows. This should display the contents of the
current directory, and you should see tweet_sentiment.csv listed along with the current time, which means that the file
was just created. Perfect.

There are easier ways to run your .py files, which I'll discuss at a later point, and ways to improve the script, such as
adding comments as notes to ourselves, speeding it up, and allowing different types of input files. But if you made
it this far, you can proudly call yourself a beginning Python programmer. Congratulations.
