Neal Caren
Two earlier tutorials looked at the basics of using Python to analyze text data. This post explains how to expand
the code written earlier so that you can use it to explore the positive and negative sentiment of any set of texts.
Specifically, we'll look at looping over more than one tweet, incorporating a more complete dictionary, and
exporting the results. [If you just want the final Python script, you can just download it.]
Earlier, we used a pretty haphazard list of words to measure positive sentiment. While the study in Science used
the commercial LIWC dictionary, an alternate sentiment dictionary is produced by Theresa Wilson, Janyce Wiebe,
and Paul Hoffmann at the University of Pittsburgh and is freely available. In both cases, the sentiment
dictionaries are used in a fairly straightforward way: the more positive words in the text, the higher the text scores
on the positive sentiment scale. While this has some drawbacks, the method is quite popular: the LIWC database
has over 1,000 cites in Google Scholar, and the Wilson et al. database has more than 600.
Downloading
Since the Wilson et al. list combines negative and positive polarity words in one list, and includes both words and
word stems, I will clean it up a little bit. You can download the positive list and the negative list using your
browser, but you don't have to. Python can do that.
First, you need to import one of the modules that Python uses to communicate with the Internet:
>>> import urllib
Next, store the address of the file you want to download in a string:
>>> url='http://www.unc.edu/~ncaren/haphazard/negative.txt'
You can also create a string with the name you want the file to have on your hard drive:
>>> file_name='negative.txt'
To download and save the file:
>>> urllib.urlretrieve(url,file_name)
This will download the file into your current directory. If you want it to go somewhere else, you can put the full
path in the file_name string. You didn't have to enter the URL and the file name in the prior lines. Something like the
following would have worked exactly the same:
>>> urllib.urlretrieve('http://www.unc.edu/~ncaren/haphazard/negative.txt','negative.txt')
http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-3/ 1/6
04/11/2017 An introduction to text analysis with Python, Part 3 | Neal Caren
Note that the location and filename are both surrounded by quotation marks because you want Python to use this
information literally; they aren't referring to a string object, like in our previous code. This line of code is actually
quite readable, and in most circumstances this would be the most efficient thing to do. But there are actually three
files that we want to get: the negative list, the positive list, and the list of tweets. And we can download the three
using a pretty simple loop:
>>> files=['negative.txt','positive.txt','obama_tweets.txt']
>>> path='http://www.unc.edu/~ncaren/haphazard/'
>>> for file_name in files:
...     urllib.urlretrieve(path+file_name,file_name)
...
The first line creates a new list with three items, the names of the three files to be downloaded. The second line
creates a string object that stores the url path that they all share. The third line starts a loop over each of the items
in the files list using file_name to reference each item in turn. The fourth line is indented, because it happens once
for each item in the list as a result of the loop, and downloads the file. This is the same as the original download
line, except the URL is now the combination of two strings, path and file_name. As noted previously, Python can
combine strings with a plus sign, so the result from the first pass through the loop will be
http://www.unc.edu/~ncaren/haphazard/negative.txt, which is where the file can be found. Note that this takes advantage of
the fact that we don't mind reusing the original file name. If we wanted to change it, or if there were different
paths to each of the files, things would get slightly trickier.
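The commands above are Python 2, where urlretrieve lives directly in urllib. As a hedged sketch, assuming you are on Python 3 instead, the same loop can be written with urllib.request.urlretrieve. The helper names build_urls and download_files are hypothetical, and the download call itself is left commented out so that nothing is fetched until you want it.

```python
import urllib.request

files = ['negative.txt', 'positive.txt', 'obama_tweets.txt']
path = 'http://www.unc.edu/~ncaren/haphazard/'


def build_urls(path, files):
    """Pair each file name with the full URL it should be fetched from."""
    return [(path + file_name, file_name) for file_name in files]


def download_files(path, files):
    """Download every file into the current directory, keeping its name."""
    for url, file_name in build_urls(path, files):
        urllib.request.urlretrieve(url, file_name)


# Uncomment to fetch the three files (requires a network connection):
# download_files(path, files)
```

Separating the URL building from the download makes it easy to print and check the addresses before anything is written to disk.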
Let's take a look at the list of Tweets that we just downloaded. First, open the file:
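A minimal sketch of this step, assuming the file holds one tweet per line separated by \n, is UTF-8 encoded, and that you are running Python 3. The helper name read_lines and the intermediate name tweets are illustrative choices, not names fixed by this tutorial.

```python
import os


def read_lines(file_name):
    """Read a whole text file and split it into a list, one item per line."""
    # Assumption: the file is UTF-8; undecodable bytes are replaced.
    with open(file_name, encoding='utf-8', errors='replace') as text_file:
        tweets = text_file.read()
    return tweets.split('\n')


# Only runs if the file from the download step is in the current directory.
if os.path.exists('obama_tweets.txt'):
    tweets_list = read_lines('obama_tweets.txt')
```

Once tweets_list exists, you can check how many items it holds: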
>>> len(tweets_list)
1365
You can print the entire list by typing print tweets_list, but it will scroll by very fast. A more useful way to look at it
is to print just some of the items. Since it's a list, we can loop through the first few items so that each one prints on
its own line.
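A sketch of such a loop, written as Python 3 (where print is a function). The short tweets_list here is placeholder data so that the example runs on its own; with the real list you would skip that first assignment.

```python
# Placeholder stand-in for the real list of 1,365 tweets.
tweets_list = ['tweet one', 'tweet two', 'tweet three',
               'tweet four', 'tweet five', 'tweet six']

# Slice off the first five items and print each one in turn.
for tweet in tweets_list[0:5]:
    print(tweet)
```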
Note the new [0:5] after tweets_list but before the : that begins the loop. The first number tells Python where to
make the first cut in the list. The potentially counterintuitive part is that this number doesn't reference an actual
item in the list, but rather a position between items in the list; think about where the commas go when lists
are created or printed. Adding to the confusion, the position at the start of the list is 0. So, in this case, we are
telling Python we want to slice our list starting at the beginning and continuing until the fifth comma, which is
after the fifth item in the list.
So, if you wanted to just print the second item in the list, you could type:
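One way to write this, following the cut positions described above: the second item sits between positions 1 and 2. The list below is placeholder data, and the print calls use Python 3 syntax.

```python
tweets_list = ['tweet one', 'tweet two', 'tweet three']  # placeholder data

# A slice from position 1 to position 2 returns a one-item list.
second_as_list = tweets_list[1:2]
print(second_as_list)

# A single index returns the item itself rather than a list.
second_item = tweets_list[1]
print(second_item)
```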
As a shorthand, you can leave out the first number in the pair if you want to start at the very beginning or leave
out the last number if you want to go until the end. So, if you want to print out the first five tweets, you could just
type print tweets_list[:5]. There are several other shortcuts along these lines that are available. We will cover some of
them in other tutorials.
Now that we have our Tweet list expanded, let's load up the positive sentiment list and print out the first few
entries:
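A hedged sketch of that step, assuming positive.txt has one word per line just like the tweet file and that you are running Python 3. The helper name load_word_list is made up for illustration.

```python
import os


def load_word_list(file_name):
    """Return the entries of a one-word-per-line file as a list of strings."""
    # Assumption: the file is UTF-8; undecodable bytes are replaced.
    with open(file_name, encoding='utf-8', errors='replace') as word_file:
        return word_file.read().split('\n')


# Only runs if positive.txt from the download step is in the current directory.
if os.path.exists('positive.txt'):
    positive_words = load_word_list('positive.txt')
    for word in positive_words[:5]:
        print(word)
```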
Preprocessing
In the earlier post, we explored how to preprocess the tweets: remove the punctuation, convert to lower case, and
examine whether or not each word was in the positive sentiment list. We can use this exact same code here with
our long list. The one alteration is that instead of having just one tweet, we now have a list of 1,365 tweets, so we
have to loop over that list.
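The loop can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact script: it uses Python 3 (so division is not rounded down and print is a function), and the two short lists are placeholder data standing in for the real files.

```python
from string import punctuation

# Placeholder data standing in for the real tweets and positive word list.
tweets_list = ['I love this great day!', 'Nothing to see here.']
positive_words = ['love', 'great', 'good']

for tweet in tweets_list:
    # Start the count of positive words at zero for every tweet.
    positive_counter = 0

    # Lower-case the tweet and strip every punctuation character.
    tweet_processed = tweet.lower()
    for p in list(punctuation):
        tweet_processed = tweet_processed.replace(p, '')

    # Split at white space and count the words found in the positive list.
    words = tweet_processed.split(' ')
    word_count = len(words)
    for word in words:
        if word in positive_words:
            positive_counter = positive_counter + 1

    # Proportion of positive words in this tweet.
    print(positive_counter / word_count)
```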
If you have not imported punctuation in your current Python session, you might get an error with this line, so make sure to include that
import. The cleaned tweet is then converted to a list of words, split at the white spaces. Finally, we loop through
each word in the tweet, and if the word is in our new and expanded list of positive words, we increase the counter
by one. After cycling through each of the tweet words, the proportion of positive words is computed and printed. If
you just get zeros, you might need to type from __future__ import division again, so that the result isn't rounded down.
The major problem with this script is that it is currently useless. It prints the positive sentiment results, but then
doesn't do anything with them. A more practical solution would be to store the results somehow. In a standard
statistical package, we would generate a new variable that held our results. We can do something similar here by
storing the results in a new list. Before we start the tweet loop, we add the line:
>>> positive_counts=[]
Then, instead of printing the proportion, we can append it to the list:
>>> positive_counts.append(positive_counter/word_count)
The next time we run through the loop, it won't produce any output, but it will create a list of the proportions. This
still isn't that useful on its own. You can use Python to do most of your statistical analysis and plotting, but at this
point you are probably ready to get your data out of Python and back into your statistical package.
The most convenient way to store data for use in multiple packages is as a plain text file where each case is its own
row and variables are separated by commas. This file type commonly has a csv extension, and Python can read
and write these files quite easily.
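A sketch of creating the writer and sending it a single column of values. The Python 2 command for this step opens the file with 'wb'; on Python 3, which this sketch assumes, the csv module instead expects a text-mode file opened with newline=''. The file name tweet_sentiment.csv is the one the finished script produces, and the values are placeholders.

```python
import csv

positive_counts = [0.4, 0.0, 0.25]  # placeholder proportions

# Python 3: open in text mode with newline='' instead of Python 2's 'wb'.
with open('tweet_sentiment.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # Each row must be a sequence, so wrap every value in a one-item list.
    writer.writerows([[count] for count in positive_counts])
```

Using a with block closes the file as soon as the rows are written, so the data is on disk before you try to open it elsewhere.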
In the 'open' part of the command, the first item is the name of the file you want to create, and the 'wb' tells Python
that this is a file you want to write. Be careful with your file name, because if there is already a file with this name,
Python will write over it. If you wanted to read a csv file, you would just swap reader for writer and 'rb' for 'wb',
which creates a nice symmetry.
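Sending your list of positive sentiment values to the file requires just one more line. Because writerows expects each row to be a list, every value is wrapped in its own one-item list:
>>> writer.writerows([[count] for count in positive_counts])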
You can now import this file into your statistical software package or just take a peek at it in Excel. Of course,
having just one variable is not the most useful thing. Usually, you will have more than one that you want to export,
but for now we just have the one. At a minimum, you might also want to export the text of the original tweets. To
combine more than one list together, you can zip them into one list. This is different from appending one list to the
other, which would just make the one list twice as long.
>>> output=zip(tweets_list,positive_counts)
In this case, zip creates a new list output that is the same length as our tweets_list, but each entry has two items: the
tweet and the positive count. You can use zip to combine as many lists as you like, although they all need to be the
same length. Technically, each item in the list is a tuple, or an ordered element list, which is a data format quite
similar to a list but generally less useful for textual analysis.
To write our final version of the output, we need to repeat the line that created our writer and then write the output
list:
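A sketch of this final step, again assuming Python 3. The lists are placeholder data, and output is wrapped in list() because zip returns an iterator rather than a list on Python 3.

```python
import csv

# Placeholder data standing in for the real tweets and their scores.
tweets_list = ['I love this great day!', 'Nothing to see here.']
positive_counts = [0.4, 0.0]

# Pair each tweet with its score; each entry is a (tweet, proportion) tuple.
output = list(zip(tweets_list, positive_counts))

# Python 3: text mode with newline='' replaces Python 2's 'wb'.
with open('tweet_sentiment.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(output)
```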
In case you were wondering, the two most negative tweets were "Hatch Makes Startling Accusation Against
Obama http://t.co/HVQfUzgr ..shocking headlineNOT" and "We need to tag Obama & define him for Nov
battle. #Obama #failedleader #incompetent #wasteful #divisive #desperate #flexible #arrogant #lazy", which
gives our little study some face validity.
You have probably noticed that our code for this project has swelled to about 40 lines. Not horrible, but not that
easy to copy and paste. And if you mess up in a loop, you have to start all over again. While typing in commands
this way in Python is useful for playing around with new code and commands, most of the time it's not the most
efficient way to do things. Just like Stata has .do files, you can similarly save a series of Python commands as a text
file and then run them all together. These sorts of Python files use a .py extension.
I've compiled all the code for our sentiment analysis into one file, sentiment.py, which you can download using
your browser. At this point, you might want to make a directory for yourself where you can store all your Python
files.
You can quit Python by typing exit(), which should bring you back to your operating system's prompt. Now,
assuming you are in the directory where you downloaded sentiment.py, you can run the entire program by typing:
$ python sentiment.py
Remember the $ sign means that we are out of Python. This command tells your computer that you want Python
to run the program sentiment.py. If all works according to plan, your computer should think for a couple of seconds,
and then display the operating system prompt. Python displays fewer things when run this way: only things with a
print statement in front of them are displayed, so don't expect your output to be as verbose as when you typed in
each command. Actually, you would probably want to add some print statements along the way so that you know
everything was working.
Assuming you didn't get an error message, there should be a new file called tweet_sentiment.csv in your current
directory. You can confirm this by typing ls -l on a Mac or dir in Windows. This should display the contents of the
current directory, and you should see tweet_sentiment.csv listed along with the current time, which means that the file
was just created. Perfect.
There are easier ways to run your .py files, which I'll discuss at a later point, and ways to improve the script, such as
adding comments as notes to ourselves, speeding it up, and allowing different types of input files. But if you made
it this far, you can proudly call yourself a beginning Python programmer. Congratulations.