Problem: Sentiments

Questions? Feel free to head to CS50 on Reddit, CS50 on StackExchange, the #cs50ap channel on CS50x Slack (after signing up), or the CS50 Facebook group.

tl;dr

Implement a program that categorizes a word as positive or negative.
```
$ ./smile love
:)
$ ./smile hate
:(
$ ./smile Stanford
:|
```

Implement a program that categorizes a user’s tweets as positive or negative.

$ ./tweets @cs50
 0 hello, @world
 1 I love you, @world
-1 I hate you, @world
...

Implement a website that generates a pie chart categorizing a user’s tweets.

This course’s philosophy on academic honesty is best stated as "be reasonable." The course recognizes that interactions with classmates and others can facilitate mastery of the course’s material. However, there remains a line between enlisting the help of another and submitting the work of another. This policy characterizes both sides of that line.

The essence of all work that you submit to this course must be your own. Collaboration on problems is not permitted (unless explicitly stated otherwise) except to the extent that you may ask classmates and others for help so long as that help does not reduce to another doing your work for you. Generally speaking, when asking for help, you may show your code or writing to others, but you may not view theirs, so long as you and they respect this policy’s other constraints. Collaboration on quizzes and tests is not permitted at all. Collaboration on the final project is permitted to the extent prescribed by its specification.

Below are rules of thumb that (inexhaustively) characterize acts that the course considers reasonable and not reasonable. If in doubt as to whether some act is reasonable, do not commit it until you solicit and receive approval in writing from your instructor. If a violation of this policy is suspected and confirmed, your instructor reserves the right to impose local sanctions on top of any disciplinary outcome that may include an unsatisfactory or failing grade for work submitted or for the course itself.

Reasonable

Communicating with classmates about problems in English (or some other spoken language).
Discussing the course’s material with others in order to understand it better.
Helping a classmate identify a bug in his or her code, such as by viewing, compiling, or running his or her code, even on your own computer.
Incorporating snippets of code that you find online or elsewhere into your own code, provided that those snippets are not themselves solutions to assigned problems and that you cite the snippets' origins.
Reviewing past years' quizzes, tests, and solutions thereto.
Sending or showing code that you’ve written to someone, possibly a classmate, so that he or she might help you identify and fix a bug.
Sharing snippets of your own solutions to problems online so that others might help you identify and fix a bug or other issue.
Turning to the web or elsewhere for instruction beyond the course’s own, for references, and for solutions to technical difficulties, but not for outright solutions to problems or your own final project.
Whiteboarding solutions to problems with others using diagrams or pseudocode but not actual code.
Working with (and even paying) a tutor to help you with the course, provided the tutor does not do your work for you.

Not Reasonable

Accessing a solution to some problem prior to (re-)submitting your own.
Asking a classmate to see his or her solution to a problem before (re-)submitting your own.
Decompiling, deobfuscating, or disassembling the staff’s solutions to problems.
Failing to cite (as with comments) the origins of code, writing, or techniques that you discover outside of the course’s own lessons and integrate into your own work, even while respecting this policy’s other constraints.
Giving or showing to a classmate a solution to a problem when it is he or she, and not you, who is struggling to solve it.
Looking at another individual’s work during a quiz or test.
Paying or offering to pay an individual for work that you may submit as (part of) your own.
Providing or making available solutions to problems to individuals who might take this course in the future.
Searching for, soliciting, or viewing a quiz’s questions or answers prior to taking the quiz.
Searching for or soliciting outright solutions to problems online or elsewhere.
Splitting a problem’s workload with another individual and combining your work (unless explicitly authorized by the problem itself).
Submitting (after possibly modifying) the work of another individual beyond allowed snippets.
Submitting the same or similar work to this course that you have submitted or will submit to another.
Using resources during a quiz beyond those explicitly allowed in the quiz’s instructions.
Viewing another’s solution to a problem and basing your own solution on it.

Assessment

Your work on this problem set will be evaluated along four axes primarily.

Scope: To what extent does your code implement the features required by our specification?
Correctness: To what extent is your code consistent with our specifications and free of bugs?
Design: To what extent is your code written well (i.e., clearly, efficiently, elegantly, and/or logically)?
Style: To what extent is your code readable (i.e., commented and indented with variables aptly named)?

To obtain a passing grade in this course, all students must ordinarily submit all assigned problems unless granted an exception in writing by the instructor.

Getting Started

As always, we’ll begin the setup process for this problem by logging into cs50.io and executing:

update50

From there, move into your chapter6 directory. Then, download the distro for this assignment by executing:

wget http://cdn.cs50.net/2016/fall/problems/sentiments/sentiments.zip

Then run the following commands to set up some names and permissions:

$ unzip sentiments.zip
$ rm sentiments.zip
$ mv problems-sentiments sentiments
$ cd sentiments
$ chmod a+x smile tweets
$ ls
analyzer.py     helpers.py          positive-words.txt  smile*     tweets*
application.py  negative-words.txt  requirements.txt    templates/

Back to the code in a bit!

Background

"Sentiment analysis," otherwise known as "opinion mining," involves inference of sentiment (i.e., opinion) from text. For instance, movie reviews on Rotten Tomatoes are often positive or negative. So are product reviews on Amazon. Opinions similarly underlie many tweets on Twitter.

Some words tend to have positive connotations (e.g., "love"), while some words tend to have negative connotations (e.g., "hate"). And so, if someone were to tweet "I love you", you might infer positive sentiment. And if someone were to tweet "I hate you", you might infer negative sentiment. Of course, individual words alone aren’t always reliable, as "I do not love you" probably isn’t a positive sentiment, but let’s not worry about those cases. Some words, meanwhile, have neither positive nor negative connotations (e.g., "the").

A few years back, Dr. Minqing Hu and Prof. Bing Liu of the University of Illinois at Chicago kindly put together lists of 2006 positive words and 4783 negative words. We’ll use those to classify tweets! But first a tour.

Distribution

Understanding

`smile`

Open up smile in sentiments/. Suffice it to say that file’s name doesn’t end in .py, even though the file contains a program written in Python. But that’s okay! Notice the "shebang" atop the file:

#!/usr/bin/env python3

That line tells a computer to interpret (i.e., run) the program using python3 (aka python on CS50 IDE), an interpreter that understands Python 3.

Notice next that the program imports a class called Analyzer from a module called analyzer as well as a function called colored from a module called termcolor. The former you’ll actually soon implement in a file called analyzer.py. (Recall that a class in Python is like a struct in C except that a class can also contain functions, otherwise known as "methods" when they’re inside a class.) The latter colorizes output in terminal windows, as we’ll soon see.

This program defines only one function, main, which gets called per the file’s last line. Within main, we first make sure that sys.argv contains the expected number of command-line arguments. We then "instantiate" (i.e., allocate) an Analyzer object. We then pass to that object’s analyze method the word that a user has provided in sys.argv[1]. As we’ll soon see, that method will return a positive int if its input is positive, a negative int if its input is negative, and 0 if its input is neither positive nor negative. The program ultimately prints a colored smiley accordingly.
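The flow just described can be sketched roughly as follows. This is a minimal sketch, not the distribution's actual code: the helper names `parse_word` and `smiley` are hypothetical, and the coloring done via termcolor is left out so that the example stands on its own.

```python
import sys


def parse_word(argv):
    """Return the single word in argv, or exit with a usage message."""
    # argv[0] is the program's name, so exactly one word means len(argv) == 2
    if len(argv) != 2:
        sys.exit("Usage: ./smile word")
    return argv[1]


def smiley(score):
    """Map a sentiment score to the smiley that smile would print."""
    if score > 0:
        return ":)"
    elif score < 0:
        return ":("
    else:
        return ":|"
```

In the real program, the score would come from the analyze method rather than being passed in by hand.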

`analyzer.py`

Open up analyzer.py in sentiments/. Not much going on in there (yet)! Notice, though, that it imports the Natural Language Toolkit, among whose features is a tokenizer that you can use to split a tweet (which is maximally a 140-character str object) into a list of words (i.e., shorter str objects).

In there is our definition of that Analyzer class, which has two methods: __init__, which is called whenever Analyzer is instantiated; and analyze, which can be called to analyze some text. That first method takes two arguments in addition to self: positives, whose value is the path to a text file containing positive words; and negatives, whose value is the path to a text file containing negative words. Meanwhile, analyze takes one argument in addition to self: a str to be analyzed for sentiment. Though that function is (temporarily) hardcoded to return 0 no matter what.

Recall that methods are automatically passed that first reference to self so that they have a way of referring to objects' "instance variables."

`positive-words.txt`, `negative-words.txt`

Open up positive-words.txt and negative-words.txt (without changing them). Notice that atop each file is a bunch of comments, each of which starts with a ;. (Those are just text files, though, so the authors' choice of ; is arbitrary.) The lists of positive and negative words, respectively, begin below those comments, after a blank line.
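Given that layout, one way to read such a file is to skip any line that is blank or starts with a ;. The sketch below assumes exactly that; the function name `load_words` and the sample lines are hypothetical, and the real files contain far more comments and words.

```python
def load_words(lines):
    """Return the set of words in lines, skipping comments and blank lines."""
    words = set()
    for line in lines:
        word = line.strip()
        # skip blank lines and comment lines, which start with ";"
        if word and not word.startswith(";"):
            words.add(word)
    return words


# hypothetical excerpt laid out like positive-words.txt
sample = [
    "; a comment line\n",
    ";\n",
    "\n",
    "a+\n",
    "love\n",
]
print(sorted(load_words(sample)))  # ['a+', 'love']
```

An open file can be passed in place of the list, since iterating over a file yields its lines.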

`tweets`

Open up tweets. Ah, another shebang. But nothing else besides a TODO! More on that soon.

`helpers.py`

Open up helpers.py. You should see two functions: chart and get_user_timeline. Given three values (positive, negative, and neutral, each an int or a float), chart generates HTML (as a str) for a pie chart depicting those values. Given a screen name, meanwhile, get_user_timeline returns a list of tweets (each as a str). That function uses Twython (har har), a library for Python, to retrieve those tweets via Twitter’s API (application programming interface), a free service that can be queried programmatically for tweets. Notice how the function expects two "environment variables" to exist. Environment variables are key/value pairs that exist within your terminal window and that programs (like tweets) can access programmatically. We’ll soon use two, API_KEY and API_SECRET, to store credentials for Twitter.
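For a sense of how a Python program can read environment variables, here is a minimal sketch using os.environ from the standard library. The function name `get_credentials` is hypothetical, and helpers.py may well check for these variables differently.

```python
import os


def get_credentials(environ=os.environ):
    """Return (API_KEY, API_SECRET), or None if either is missing."""
    # .get returns None, rather than raising KeyError, for a missing key
    key = environ.get("API_KEY")
    secret = environ.get("API_SECRET")
    if not key or not secret:
        return None
    return key, secret
```

Accepting the mapping as a parameter is only a convenience for trying the function out with a plain dict.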

`application.py`

Open up application.py. In this file is a "controller" for a Flask-based web app with two endpoints: / and /search. The first displays the simplest of forms via which you can search for a user on Twitter by screen name. The second displays one of those pie charts categorizing that user’s tweets. Notice, though, how 100% of those tweets are (temporarily) assumed to be neutral.

`templates/index.html`

Open up templates/index.html. In there is that simplest of forms. Notice how it figures out via url_for, a function that comes with Flask, to what URL the form should be submitted.

`templates/search.html`

Open up templates/search.html. Notice how this template renders a user’s screen name as well as that pie chart.

`templates/layout.html`

Open up templates/layout.html. In here is a layout on which index.html and search.html depend. It leverages Bootstrap to override browsers' default aesthetics.

`requirements.txt`

Open up requirements.txt (without changing it, though you can later if you’d like). This file specifies the libraries, one per line, on which all of this functionality depends.

Getting Started

In a terminal window execute

cd ~/workspace/chapter6/sentiments/
pip3 install --user -r requirements.txt

to install these programs' dependencies.

Sign up for Twitter at twitter.com/signup if you don’t already have an account.
Visit apps.twitter.com, logging in if prompted, and click Create New App.
- Any (available) Name suffices.
- Any (sufficiently long) Description suffices.
- For Website, input https://cs50.harvard.edu/ (or any other URL).
- Leave Callback URL blank.
Click Create your Twitter application. You should see "Your application has been created."
Click Keys and Access Tokens.
Click modify app permissions.
Select Read only, then click Update Settings.
Click Keys and Access Tokens again.
Highlight and copy the value to the right of Consumer Key (API Key).
In a terminal window, execute
```
export API_KEY=value
```
where value is that (pasted) value, without any space immediately before or after the =.
Highlight and copy the value to the right of Consumer Secret (API Secret).
In a terminal window, execute
```
export API_SECRET=value
```
where value is that (pasted) value, without any space immediately before or after the =.

If you close that terminal window and/or open another, you’ll need to repeat those last five steps.

Next, try running

./smile

to see how it works. Keep in mind that all words will be classified (for now!) as neutral because of that hardcoded 0 in analyzer.py.

Next, try running

flask run

and then select CS50 IDE > Web Server in CS50 IDE’s top-left corner. Search for some user’s screen name, and you should see a chart! Of course, it’s all yellow for now because of that 100.0 in application.py. Quit Flask with control-c.

Specification

`analyzer.py`

Complete the implementation of analyzer.py in such a way that

__init__ loads positive and negative words into memory in such a way that analyze can access them, and
analyze analyzes the sentiment of text, returning a positive score if text is more positive than negative, a negative score if text is more negative than positive, and 0 otherwise, whereby that score is computed as follows:
- assign each word in text a value: 1 if the word is in positives, -1 if the word is in negatives, and 0 otherwise
- consider the sum of those values to be the entire text’s score

For instance, if text were "I love you" (and Analyzer were instantiated with default values for its named parameters), then its score would be 0 + 1 + 0 = 1, since

"I" is in neither positive-words.txt nor negative-words.txt,
"love" is in positive-words.txt, and
"you" is in neither positive-words.txt nor negative-words.txt.
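The arithmetic above can be sketched as a small function. This is only an illustration under stated assumptions: the name `score` is hypothetical, str.split stands in for a real tokenizer, lowercasing each token is an assumption, and the two tiny sets stand in for the real word lists.

```python
def score(tokens, positives, negatives):
    """Sum 1 for each positive token, -1 for each negative token, 0 otherwise."""
    total = 0
    for token in tokens:
        word = token.lower()
        if word in positives:
            total += 1
        elif word in negatives:
            total -= 1
    return total


# tiny stand-ins for the real word lists
positives = {"love"}
negatives = {"hate"}

print(score("I love you".split(), positives, negatives))  # 0 + 1 + 0 = 1
print(score("I hate you".split(), positives, negatives))  # 0 - 1 + 0 = -1
```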

Suffice it to say, more sophisticated algorithms exist, but we’ll keep things simple!

`tweets`

Complete the implementation of main in tweets in such a way that the program

accepts one and only one command-line argument, the screen name for a user on Twitter,
queries Twitter’s API for a user’s most recent 50 tweets,
analyzes the sentiment of each of those tweets, and
outputs each tweet’s score and text, colored in green if positive, red if negative, and yellow otherwise.
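The output step might be sketched as below. The helper names `color_for` and `format_tweet` are hypothetical, and the two-column width for the score is an assumption based on the sample output; the resulting string and color name could then be handed to colored, much as smile does.

```python
def color_for(score):
    """Return the color name for a tweet's score."""
    if score > 0:
        return "green"
    elif score < 0:
        return "red"
    return "yellow"


def format_tweet(score, text):
    """Right-align the score in two columns, followed by the tweet's text."""
    return "{:2d} {}".format(score, text)


print(format_tweet(-1, "I hate you, @world"))  # -1 I hate you, @world
```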

`application.py`

Complete the implementation of search in application.py in such a way that the function

queries Twitter’s API for a user’s most recent 100 tweets,
classifies each tweet as positive, negative, or neutral, and
generates a chart that accurately depicts those sentiments as percentages.

If a user has tweeted fewer than 100 times, classify as many tweets as exist.
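One way to turn per-tweet scores into the three percentages is sketched below. The function name `percentages` is hypothetical, and returning all zeros for an empty list is an assumption, not something the specification prescribes.

```python
def percentages(scores):
    """Return (positive, negative, neutral) percentages for a list of scores."""
    total = len(scores)
    if total == 0:
        # no tweets to classify; returning zeros here is an assumption
        return 0.0, 0.0, 0.0
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    neutral = total - positive - negative
    return (100.0 * positive / total,
            100.0 * negative / total,
            100.0 * neutral / total)


print(percentages([1, -1, 0, 0]))  # (25.0, 25.0, 50.0)
```

Dividing by the number of tweets actually retrieved, rather than by 100, keeps the chart accurate for users with fewer than 100 tweets.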

Walkthroughs

Usage

Your programs should behave per the examples below. Assume that the text after each $ prompt is what some user has typed.

$ ./smile
Usage: ./smile word
$ ./smile foo bar
Usage: ./smile word
$ ./smile love
:)
$ ./smile hate
:(
$ ./smile Stanford
:|

$ ./tweets
Usage: ./tweets @screen_name
$ ./tweets @foo @bar
Usage: ./tweets @screen_name
$ ./tweets @cs50
 0 hello, @world
 1 I love you, @world
-1 I hate you, @world
...

Testing

No check50 for these! But here are some actual screen names on Twitter that might have some positive or negative sentiments!

Staff Solution

`smile`

~cs50/chapter6/smile

`tweets`

~cs50/chapter6/tweets

Hints

`analyzer.py`

Odds are you’ll find nltk.tokenize.casual.TweetTokenizer of interest, which can be used to tokenize a tweet (i.e., split it up into a list of words) with code like:
```
import nltk

tokenizer = nltk.tokenize.TweetTokenizer()
tokens = tokenizer.tokenize(tweet)
```
For instance, if tweet is I love you, then tokens will be ["I", "love", "you"]. The tokenizer treats some punctuation as separate tokens, so not to worry if it splits words like a+ (which is in positive-words.txt) into two tokens.
Be sure to ignore any comments or blank lines inside of positives and negatives.
If you would like a variable to be accessible from both __init__ and analyze, be sure to define it as an "instance variable" inside of self. For instance, if you were to define
```
self.n = 42
```
inside of __init__, then self.n would also be accessible inside of analyze.
Odds are you’ll find str.lower of interest.
Note that get_user_timeline returns None in cases of error, as might happen if a screen name doesn’t exist or a screen name’s tweets are private.
And here’s the time-complexity (aka "Big O" or "Big Oh") of various operations in current CPython, the implementation of Python we’re using (which is an interpreter called python, or really python3, which itself is actually written in C).
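To tie a few of these hints together, here is a minimal sketch of an instance variable that holds lowercased words in a set. The class name `WordList` and its method `contains` are hypothetical. In CPython, a membership test on a set takes constant time on average, whereas the same test on a list takes time proportional to the list's length.

```python
class WordList:
    """Hypothetical class illustrating instance variables and set lookups."""

    def __init__(self, words):
        # self.words is an instance variable, so contains can use it later
        self.words = {word.lower() for word in words}

    def contains(self, word):
        # average-case O(1) for a set; a list would need an O(n) scan
        return word.lower() in self.words


words = WordList(["Love", "a+"])
print(words.contains("LOVE"))  # True
print(words.contains("hate"))  # False
```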

`tweets`

Look at smile for inspiration!
Because tweets doesn’t end in .py, CS50 IDE won’t know it’s Python code, so syntax highlighting won’t be enabled by default. With the file open in a tab, change Text to Python in the tab’s bottom-right corner to enable.

`application.py`

Look (back) at tweets for inspiration!

FAQs

Could not build url for endpoint '/'

If, when you search in your Flask app without typing anything into the text field, you get a Could not build url for endpoint '/' error, change the line in application.py that calls redirect(url_for("/")) so that it instead reads return redirect(url_for("index")).

ImportError: No module named 'sqlalchemy'

If seeing this error, execute

pip install --user sqlalchemy

to resolve!

twython.exceptions.TwythonAuthError: Twitter API returned a 401 (Unauthorized), An error occurred processing your request

If seeing this error, odds are you’re trying to get tweets for a screen name that’s protected (i.e., private)! Not to worry, though. You can assume we’ll only test your code with screen names that aren’t protected.

twython.exceptions.TwythonError: Twitter API returned a 404 (Not Found), Sorry, that page does not exist

If seeing this error, odds are you’re trying to get tweets for a screen name that doesn’t exist! Not to worry, though. You can assume we’ll only test your code with screen names that exist.

TypeError: 'NoneType' object is not iterable

If seeing this error in a for loop, be sure you’re indeed iterating over a list and not, e.g., None. In particular, be sure you’re checking the return value of get_user_timeline, which, per its implementation, can return None in cases of error.
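A guard along the following lines avoids that error. It is a sketch only: `scores_for` is a hypothetical helper, its tweets argument stands in for whatever get_user_timeline returned, and its analyze argument stands in for any function that scores a str.

```python
def scores_for(tweets, analyze):
    """Return a score per tweet, or None if the timeline could not be fetched."""
    # iterating over None raises TypeError, so check before looping
    if tweets is None:
        return None
    return [analyze(tweet) for tweet in tweets]


# len stands in for a real scoring function here
print(scores_for(None, len))         # None
print(scores_for(["a", "bb"], len))  # [1, 2]
```

What to do upon seeing None (e.g., exit with an error message or redirect to the form) is left to the caller.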

How to Submit

Step 1 of 2

Recall that you were asked to modify the files below:

analyzer.py
application.py
tweets

Be sure that each of your files is in ~/workspace/chapter6/sentiments/, as with:

cd ~/workspace/chapter6/sentiments/
ls

If any file is not in ~/workspace/chapter6/sentiments/, move it into that directory, as via mv (or via CS50 IDE’s lefthand file browser).

Step 2 of 2

To submit sentiments, execute

cd ~/workspace/chapter6/sentiments/
submit50 cs50/2017/ap/sentiments/

If you run into any trouble, email sysadmins@cs50.harvard.edu!

You may resubmit any problem as many times as you’d like before the deadline.

Your submission should be graded for correctness within 2 minutes, at which point your score will appear at cs50.me!

Acknowledgements

Special thanks to Aditi Muralidharan and John DeNero of UC Berkeley and to Minqing Hu and Bing Liu of the University of Illinois at Chicago!

This was Sentiments.