Sentiments

tl;dr

  1. Implement a program that categorizes a word as positive or negative.

    $ ./smile love
    :)
    $ ./smile hate
    :(
    $ ./smile Stanford
    :|
  2. Implement a program that categorizes a user’s tweets as positive or negative.

    $ ./tweets @cs50
     0 hello, @world
     1 I love you, @world
    -1 I hate you, @world
    ...
  3. Implement a website that generates a pie chart categorizing a user’s tweets.

Background

"Sentiment analysis," otherwise known as "opinion mining," involves inference of sentiment (i.e., opinion) from text. For instance, movie reviews on Rotten Tomatoes are often positive or negative. So are product reviews on Amazon. Similarly do opinions underlie many tweets on Twitter.

Some words tend to have positive connotations (e.g., "love"), while some words tend to have negative connotations (e.g., "hate"). And so, if someone were to tweet "I love you", you might infer positive sentiment. And if someone were to tweet "I hate you", you might infer negative sentiment. Of course, individual words alone aren’t always reliable, as "I do not love you" probably isn’t a positive sentiment, but let’s not worry about those cases. Some words, meanwhile, have neither positive nor negative connotations (e.g., "the").

A few years back, Dr. Minqing Hu and Prof. Bing Liu of the University of Illinois at Chicago kindly put together lists of 2006 positive words and 4783 negative words. We’ll use those to classify tweets! But first a tour.

Distribution

Downloading

$ wget http://cdn.cs50.net/2016/fall/problems/sentiments/sentiments.zip
$ unzip sentiments.zip
$ rm sentiments.zip
$ mv problems-sentiments sentiments
$ cd sentiments
$ chmod a+x smile tweets
$ ls
analyzer.py     helpers.py          positive-words.txt  smile*     tweets*
application.py  negative-words.txt  requirements.txt    templates/

Understanding

smile

Open up smile in sentiments/. Suffice it to say that file’s name doesn’t end in .py, even though the file contains a program written in Python. But that’s okay! Notice the "shebang" atop the file:

#!/usr/bin/env python3

That line tells a computer to interpret (i.e., run) the program using python3 (aka python on CS50 IDE), an interpreter that understands Python 3.

Notice next that the program imports a class called Analyzer from a module called analyzer as well as a function called colored from a module called termcolor. The former you’ll actually soon implement in a file called analyzer.py. (Recall that a class in Python is like a struct in C except that a class can also contain functions, otherwise known as "methods" when they’re inside a class.) The latter colorizes output in terminal windows, as we’ll soon see.

This program defines only one function, main, which gets called per the file’s last line. Within main, we first make sure that sys.argv contains the expected number of command-line arguments. We then "instantiate" (i.e., allocate) an Analyzer object. We then pass to that object’s analyze method the word that a user has provided in sys.argv[1]. As we’ll soon see, that method will return a positive int if its input is positive, a negative int if its input is negative, and 0 if its input is neither positive nor negative. The program ultimately prints a colored smiley accordingly.
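
The control flow described above can be sketched as follows. This is only an illustration, not the distribution’s actual code: smiley_for and the analyze parameter are hypothetical names, with analyze standing in for an Analyzer object’s analyze method, and the function returns its output instead of printing it in color.

```python
def smiley_for(score):
    """Map an integer sentiment score to a text smiley."""
    if score > 0:
        return ":)"
    if score < 0:
        return ":("
    return ":|"


def main(argv, analyze):
    """Return the line that a smile-like program would print.

    `analyze` is a hypothetical stand-in for Analyzer.analyze.
    """
    # Expect the program's name plus exactly one word.
    if len(argv) != 2:
        return "Usage: ./smile word"
    return smiley_for(analyze(argv[1]))
```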

analyzer.py

Open up analyzer.py in sentiments/. Not much going on in there (yet)! Notice, though, that it imports the Natural Language Toolkit, among whose features is a tokenizer that you can use to split a tweet (which is maximally a 140-character str object) into a list of words (i.e., shorter str objects).

In there is our definition of that Analyzer class, which has two methods: __init__, which is called whenever Analyzer is instantiated; and analyze, which can be called to analyze some text. That first method takes two arguments in addition to self: positives, whose value is the path to a text file containing positive words; and negatives, whose value is the path to a text file containing negative words. Meanwhile, analyze takes one argument in addition to self: a str to be analyzed for sentiment. For now, though, that method is (temporarily) hardcoded to return 0 no matter what.

Recall that methods are automatically passed that first reference to self so that they have a way of referring to objects' "instance variables."
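
To see what self buys you, here’s a tiny sketch with a made-up class, Counter, that has nothing to do with the distribution code. A value stored on self in __init__ remains available in any other method of the same object.

```python
class Counter:
    """A toy class, unrelated to Analyzer, that keeps one instance variable."""

    def __init__(self, start):
        # Stored on self, so other methods of this object can see it.
        self.count = start

    def increment(self):
        # Python passes the object itself as self automatically.
        self.count += 1
        return self.count
```

Each object gets its own copy of count, so incrementing one Counter doesn’t affect another.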

positive-words.txt, negative-words.txt

Open up positive-words.txt and negative-words.txt (without changing them). Notice that atop each file is a bunch of comments, each of which starts with a ;. (Those are just text files, though, so the authors' choice of ; is arbitrary.) The lists of positive and negative words, respectively, begin below those comments, after a blank line.
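
One way to read a file in that format is sketched below. The function name load_words is hypothetical, and it accepts any iterable of lines (e.g., an open file) so that it can be tried without the actual word lists.

```python
def load_words(lines):
    """Collect words from lines, skipping blank lines and ; comments."""
    words = set()
    for line in lines:
        word = line.strip()
        # Skip blank lines and comment lines, which start with ;
        if not word or word.startswith(";"):
            continue
        words.add(word)
    return words
```

With a real file, you might call it as load_words(file) inside of a with open(...) as file block.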

tweets

Open up tweets. Ah, another shebang. But nothing else besides a TODO! More on that soon.

helpers.py

Open up helpers.py. You should see two functions: chart and get_user_timeline. Given three values (positive, negative, and neutral, each an int or a float), chart generates HTML (as a str) for a pie chart depicting those values. Given a screen name, meanwhile, get_user_timeline returns a list of tweets (each as a str). That function uses Twython (har har), a library for Python, to retrieve those tweets via Twitter’s API (application programming interface), a free service that can be queried programmatically for tweets. Notice how the function expects two "environment variables" to exist. Environment variables are key/value pairs that exist within your terminal window and that programs (like tweets) can access programmatically. We’ll soon use two, API_KEY and API_SECRET, to store credentials for Twitter.
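
Here’s a minimal sketch of how a Python program can read environment variables like those. It isn’t how helpers.py necessarily does it; get_credentials is a hypothetical name, and the env parameter exists only so that the function can be tried with an ordinary dict.

```python
import os


def get_credentials(env=os.environ):
    """Return (API_KEY, API_SECRET) from env, or raise if either is unset."""
    key = env.get("API_KEY")
    secret = env.get("API_SECRET")
    # .get returns None for a missing variable instead of raising KeyError.
    if not key or not secret:
        raise RuntimeError("API_KEY and API_SECRET must both be set")
    return key, secret
```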

application.py

Open up application.py. In this file is a "controller" for a Flask-based web app with two endpoints: / and /search. The first displays the simplest of forms via which you can search for a user on Twitter by screen name. The second displays one of those pie charts categorizing that user’s tweets. Notice, though, how 100% of those tweets are (temporarily) assumed to be neutral.

templates/index.html

Open up templates/index.html. In there is that simplest of forms. Notice how it figures out via url_for, a function that comes with Flask, to what URL the form should be submitted.

templates/search.html

Open up templates/search.html. Notice how this template renders a user’s screen name as well as that pie chart.

templates/layout.html

Open up templates/layout.html. In here is a layout on which index.html and search.html depend. It leverages Bootstrap to override browsers' default aesthetics.

requirements.txt

Open up requirements.txt (without changing it, though you can later if you’d like). This file specifies the libraries, one per line, on which all of this functionality depends.

Getting Started

  1. In a terminal window execute

    cd ~/workspace/pset6/sentiments/
    pip3 install --user -r requirements.txt

    to install these programs' dependencies.

  2. Sign up for Twitter at twitter.com/signup if you don’t already have an account.

  3. Visit apps.twitter.com, logging in if prompted, and click Create New App.

    • Any (available) Name suffices.

    • Any (sufficiently long) Description suffices.

    • For Website, input https://cs50.harvard.edu/ (or any other URL).

    • Leave Callback URL blank.

  4. Click Create your Twitter application. You should see "Your application has been created."

  5. Click Keys and Access Tokens.

  6. Click modify app permissions.

  7. Select Read only, then click Update Settings.

  8. Click Keys and Access Tokens again.

  9. Highlight and copy the value to the right of Consumer Key (API Key).

  10. In a terminal window, execute

    export API_KEY=value

    where value is that (pasted) value, without any space immediately before or after the =.

  11. Highlight and copy the value to the right of Consumer Secret (API Secret).

  12. In a terminal window, execute

    export API_SECRET=value

    where value is that (pasted) value, without any space immediately before or after the =.

If you close that terminal window and/or open another, you’ll need to repeat those last five steps.

Next, try running

./smile

to see how it works. Keep in mind that all words will be classified (for now!) as neutral because of that hardcoded 0 in analyzer.py.

Next, try running

flask run

and then select CS50 IDE > Web Server in CS50 IDE’s top-left corner. Search for some user’s screen name, and you should see a chart! Of course, it’s all yellow for now because of that 100.0 in application.py. Quit Flask with control-c.

Specification

analyzer.py

Complete the implementation of analyzer.py in such a way that

  • __init__ loads positive and negative words into memory in such a way that analyze can access them, and

  • analyze analyzes the sentiment of text, returning a positive score if text is more positive than negative, a negative score if text is more negative than positive, and 0 otherwise, whereby that score is computed as follows:

    • assign each word in text a value: 1 if the word is in positives, -1 if the word is in negatives, and 0 otherwise

    • consider the sum of those values to be the entire text’s score

For instance, if text were "I love you" (and Analyzer were instantiated with default values for its named parameters), then its score would be 0 + 1 + 0 = 1, since

  • "I" is in neither positive-words.txt nor negative-words.txt,

  • "love" is in positive-words.txt, and

  • "you" is in neither positive-words.txt nor negative-words.txt.

Suffice it to say, more sophisticated algorithms exist, but we’ll keep things simple!
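
The scoring rule above can be sketched in a few lines. This sketch makes some assumptions: score is a hypothetical standalone function rather than a method of Analyzer, positives and negatives are passed in as sets, and str.split stands in for a proper tokenizer.

```python
def score(text, positives, negatives):
    """Sum 1 for each positive word and -1 for each negative word in text."""
    total = 0
    # Assumption: whitespace splitting stands in for a real tokenizer.
    for token in text.lower().split():
        if token in positives:
            total += 1
        elif token in negatives:
            total -= 1
    return total
```

Lowercasing each token means that "Love" and "love" are treated alike.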

tweets

Complete the implementation of main in tweets in such a way that the program

  • accepts one and only one command-line argument, the screen name for a user on Twitter,

  • queries Twitter’s API for a user’s most recent 50 tweets,

  • analyzes the sentiment of each of those tweets, and

  • outputs each tweet’s score and text, colored in green if positive, red if negative, and yellow otherwise.
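
The choice of color in that last step might be sketched as below. The name color_for is hypothetical; the function only returns the name of a color as a str, leaving the actual colorizing to whatever library you use.

```python
def color_for(score):
    """Return the color name for a tweet's score."""
    if score > 0:
        return "green"
    if score < 0:
        return "red"
    return "yellow"
```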

application.py

Complete the implementation of search in application.py in such a way that the function

  • queries Twitter’s API for a user’s most recent 100 tweets,

  • classifies each tweet as positive, negative, or neutral, and

  • generates a chart that accurately depicts those sentiments as percentages.

If a user has tweeted fewer than 100 times, classify as many tweets as exist.
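
Converting a list of scores into percentages might look like the sketch below. The name percentages is hypothetical, and returning all zeroes for an empty list is just one possible choice for a user with no tweets.

```python
def percentages(scores):
    """Return (positive, negative, neutral) percentages for a list of scores."""
    total = len(scores)
    # Assumption: with no tweets at all, report 0% for every category.
    if total == 0:
        return 0.0, 0.0, 0.0
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    neutral = total - positive - negative
    return (
        100.0 * positive / total,
        100.0 * negative / total,
        100.0 * neutral / total,
    )
```

Dividing by the number of tweets actually retrieved, rather than by 100, keeps the percentages accurate for users who have tweeted fewer than 100 times.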

Walkthroughs

Usage

Your programs should behave per the examples below. Assume that the underlined text is what some user has typed.

$ ./smile
Usage: ./smile word
$ ./smile foo bar
Usage: ./smile word
$ ./smile love
:)
$ ./smile hate
:(
$ ./smile Stanford
:|
$ ./tweets
Usage: ./tweets @screen_name
$ ./tweets @foo @bar
Usage: ./tweets @screen_name
$ ./tweets @cs50
 0 hello, @world
 1 I love you, @world
-1 I hate you, @world
...

Testing

No check50 for these! But here are some actual screen names on Twitter that might have some positive or negative sentiments!

Staff’s Solution

smile

~cs50/pset6/smile

tweets

~cs50/pset6/tweets

Hints

analyzer.py

  • Odds are you’ll find nltk.tokenize.casual.TweetTokenizer of interest, which can be used to tokenize a tweet (i.e., split it up into a list of words) with code like:

    tokenizer = nltk.tokenize.TweetTokenizer()
    tokens = tokenizer.tokenize(tweet)

    For instance, if tweet is I love you, then tokens will be ["I", "love", "you"]. The tokenizer treats some punctuation as separate tokens, so not to worry if it splits words like a+ (which is in positive-words.txt) into two tokens.

  • Be sure to ignore any comments or blank lines inside of positives and negatives.

  • If you would like a variable to be accessible from both __init__ and analyze, be sure to define it as an "instance variable" inside of self. For instance, if you were to define

    self.n = 42

    inside of __init__, then self.n would also be accessible inside of analyze.

  • Odds are you’ll find str.lower of interest.

  • Note that get_user_timeline returns None in cases of error, as might happen if a screen name doesn’t exist or a screen name’s tweets are private.

  • And here’s the time-complexity (aka "Big O" or "Big Oh") of various operations in current CPython, the implementation of Python we’re using (which is an interpreter called python, or really python3, which itself is actually written in C).
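
As an illustration of why those costs matter here: in CPython, the in operator scans a list element by element (linear time), whereas on a set it performs a hash lookup (constant time on average). The snippet below uses made-up sample words; both containers give the same answers, just at different costs as the lists grow.

```python
words_list = ["love", "hate", "a+"]
words_set = set(words_list)


def is_known(word, words):
    """Return True if word is in words, which may be a list or a set."""
    return word in words
```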

tweets

  • Look at smile for inspiration!

  • Because tweets doesn’t end in .py, CS50 IDE won’t know it’s Python code, so syntax highlighting won’t be enabled by default. With the file open in a tab, change Text to Python in the tab’s bottom-right corner to enable.

application.py

  • Look (back) at tweets for inspiration!

FAQs

Could not build url for endpoint '/'

If, when you try to search in your Flask app without typing anything into the text field, you get the error Could not build url for endpoint '/', change the line in application.py that reads redirect(url_for("/")) to return redirect(url_for("index")).

ImportError: No module named 'sqlalchemy'

If seeing this error, execute

pip install --user sqlalchemy

to resolve!

twython.exceptions.TwythonAuthError: Twitter API returned a 401 (Unauthorized), An error occurred processing your request

If seeing this error, odds are you’re trying to get tweets for a screen name that’s protected (i.e., private)! Not to worry, though. You can assume we’ll only test your code with screen names that aren’t protected.

twython.exceptions.TwythonError: Twitter API returned a 404 (Not Found), Sorry, that page does not exist

If seeing this error, odds are you’re trying to get tweets for a screen name that doesn’t exist! Not to worry, though. You can assume we’ll only test your code with screen names that exist.

TypeError: 'NoneType' object is not iterable

If seeing this error in a for loop, be sure you’re indeed iterating over a list and not, e.g., None. In particular, be sure you’re checking the return value of get_user_timeline, which, per its implementation, can return None in cases of error.
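
Such a check might be sketched as follows. Here, fetch is a hypothetical stand-in for get_user_timeline and analyze for an Analyzer object’s analyze method, both passed in as parameters so that the sketch runs without Twitter credentials.

```python
def analyze_timeline(fetch, analyze, screen_name):
    """Return a list of scores, or None if the tweets couldn't be fetched."""
    tweets = fetch(screen_name)
    # Check for None before looping; iterating over None raises TypeError.
    if tweets is None:
        return None
    return [analyze(tweet) for tweet in tweets]
```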

CHANGELOG

  • 2018-11-14

    • Updated distribution code location

  • 2016-10-27

    • Clarified that search should classify <= 100 tweets.

  • 2016-10-26

    • Added hint and FAQ about how get_user_timeline can return None.

    • Clarified that analyze takes a (potentially multi-word) str as an argument, not just a word.

  • 2016-10-21

    • Initial release.

Acknowledgements

Special thanks to Aditi Muralidharan and John DeNero of UC Berkeley and to Minqing Hu and Bing Liu of the University of Illinois at Chicago!