Problem: Sentiments
Questions? Feel free to head to CS50 on Reddit, CS50 on StackExchange, the #cs50ap
channel on CS50x Slack (after signing up), or the CS50 Facebook group.
tl;dr
-
Implement a program that categorizes a word as positive or negative.
$ ./smile love
:)
$ ./smile hate
:(
$ ./smile Stanford
:|
-
Implement a program that categorizes a user’s tweets as positive or negative.
$ ./tweets @cs50
0 hello, @world
1 I love you, @world
-1 I hate you, @world
...
-
Implement a website that generates a pie chart categorizing a user’s tweets.
Academic Honesty
This course’s philosophy on academic honesty is best stated as "be reasonable." The course recognizes that interactions with classmates and others can facilitate mastery of the course’s material. However, there remains a line between enlisting the help of another and submitting the work of another. This policy characterizes both sides of that line.
The essence of all work that you submit to this course must be your own. Collaboration on problems is not permitted (unless explicitly stated otherwise) except to the extent that you may ask classmates and others for help so long as that help does not reduce to another doing your work for you. Generally speaking, when asking for help, you may show your code or writing to others, but you may not view theirs, so long as you and they respect this policy’s other constraints. Collaboration on quizzes and tests is not permitted at all. Collaboration on the final project is permitted to the extent prescribed by its specification.
Below are rules of thumb that (inexhaustively) characterize acts that the course considers reasonable and not reasonable. If in doubt as to whether some act is reasonable, do not commit it until you solicit and receive approval in writing from your instructor. If a violation of this policy is suspected and confirmed, your instructor reserves the right to impose local sanctions on top of any disciplinary outcome that may include an unsatisfactory or failing grade for work submitted or for the course itself.
Reasonable
-
Communicating with classmates about problems in English (or some other spoken language).
-
Discussing the course’s material with others in order to understand it better.
-
Helping a classmate identify a bug in his or her code, such as by viewing, compiling, or running his or her code, even on your own computer.
-
Incorporating snippets of code that you find online or elsewhere into your own code, provided that those snippets are not themselves solutions to assigned problems and that you cite the snippets' origins.
-
Reviewing past years' quizzes, tests, and solutions thereto.
-
Sending or showing code that you’ve written to someone, possibly a classmate, so that he or she might help you identify and fix a bug.
-
Sharing snippets of your own solutions to problems online so that others might help you identify and fix a bug or other issue.
-
Turning to the web or elsewhere for instruction beyond the course’s own, for references, and for solutions to technical difficulties, but not for outright solutions to problems or your own final project.
-
Whiteboarding solutions to problems with others using diagrams or pseudocode but not actual code.
-
Working with (and even paying) a tutor to help you with the course, provided the tutor does not do your work for you.
Not Reasonable
-
Accessing a solution to some problem prior to (re-)submitting your own.
-
Asking a classmate to see his or her solution to a problem before (re-)submitting your own.
-
Decompiling, deobfuscating, or disassembling the staff’s solutions to problems.
-
Failing to cite (as with comments) the origins of code, writing, or techniques that you discover outside of the course’s own lessons and integrate into your own work, even while respecting this policy’s other constraints.
-
Giving or showing to a classmate a solution to a problem when it is he or she, and not you, who is struggling to solve it.
-
Looking at another individual’s work during a quiz or test.
-
Paying or offering to pay an individual for work that you may submit as (part of) your own.
-
Providing or making available solutions to problems to individuals who might take this course in the future.
-
Searching for, soliciting, or viewing a quiz’s questions or answers prior to taking the quiz.
-
Searching for or soliciting outright solutions to problems online or elsewhere.
-
Splitting a problem’s workload with another individual and combining your work (unless explicitly authorized by the problem itself).
-
Submitting (after possibly modifying) the work of another individual beyond allowed snippets.
-
Submitting the same or similar work to this course that you have submitted or will submit to another.
-
Using resources during a quiz beyond those explicitly allowed in the quiz’s instructions.
-
Viewing another’s solution to a problem and basing your own solution on it.
Assessment
Your work on this problem set will be evaluated along four axes primarily.
- Scope
To what extent does your code implement the features required by our specification?
- Correctness
To what extent is your code consistent with our specifications and free of bugs?
- Design
To what extent is your code written well (i.e., clearly, efficiently, elegantly, and/or logically)?
- Style
To what extent is your code readable (i.e., commented and indented with variables aptly named)?
To obtain a passing grade in this course, all students must ordinarily submit all assigned problems unless granted an exception in writing by the instructor.
Getting Started
As always, we’ll begin the setup process for this problem by logging into cs50.io and executing:
update50
From there, move into your chapter6 directory. Then, download the distro for this assignment by executing:
wget http://cdn.cs50.net/2016/fall/problems/sentiments/sentiments.zip
Then run the following commands to set up some names and permissions:
$ unzip sentiments.zip
$ rm sentiments.zip
$ mv problems-sentiments sentiments
$ cd sentiments
$ chmod a+x smile tweets
$ ls
analyzer.py helpers.py positive-words.txt smile* tweets*
application.py negative-words.txt requirements.txt templates/
Back to the code in a bit!
Background
"Sentiment analysis," otherwise known as "opinion mining," involves inference of sentiment (i.e., opinion) from text. For instance, movie reviews on Rotten Tomatoes are often positive or negative. So are product reviews on Amazon. Similarly do opinions underlie many tweets on Twitter.
Some words tend to have positive connotations (e.g., "love"), while some words tend to have negative connotations (e.g., "hate"). And so, if someone were to tweet "I love you", you might infer positive sentiment. And if someone were to tweet "I hate you", you might infer negative sentiment. Of course, individual words alone aren’t always reliable, as "I do not love you" probably isn’t a positive sentiment, but let’s not worry about those cases. Some words, meanwhile, have neither positive nor negative connotations (e.g., "the").
A few years back, Dr. Minqing Hu and Prof. Bing Liu of the University of Illinois at Chicago kindly put together lists of 2006 positive words and 4783 negative words. We’ll use those to classify tweets! But first a tour.
Distribution
Understanding
smile
Open up smile in sentiments/. Suffice it to say that file’s name doesn’t end in .py, even though the file contains a program written in Python. But that’s okay! Notice the "shebang" atop the file:
#!/usr/bin/env python3
That line tells a computer to interpret (i.e., run) the program using python3 (aka python on CS50 IDE), an interpreter that understands Python 3.
Notice next that the program imports a class called Analyzer from a module called analyzer as well as a function called colored from a module called termcolor. The former you’ll actually soon implement in a file called analyzer.py. (Recall that a class in Python is like a struct in C except that a class can also contain functions, otherwise known as "methods" when they’re inside a class.) The latter colorizes output in terminal windows, as we’ll soon see.
This program defines only one function, main, which gets called per the file’s last line. Within main, we first make sure that sys.argv contains the expected number of command-line arguments. We then "instantiate" (i.e., allocate) an Analyzer object. We then pass to that object’s analyze method the word that a user has provided in sys.argv[1]. As we’ll soon see, that method will return a positive int if its input is positive, a negative int if its input is negative, and 0 if its input is neither positive nor negative. The program ultimately prints a colored smiley accordingly.
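The control flow just described can be sketched roughly as below. This is only an illustration, not the distro's actual code: smiley_for and the analyze parameter are hypothetical names, the stand-in analyzer always returns 0 (as the real one does for now), and coloring via termcolor is left out so that the sketch runs without any extra libraries.

```python
def smiley_for(score):
    """Map a score to the smiley shown in the usage examples."""
    if score > 0:
        return ":)"
    if score < 0:
        return ":("
    return ":|"


def main(argv, analyze=lambda word: 0):
    """Return what the program would print for the given argument list.

    analyze is a hypothetical stand-in for Analyzer().analyze.
    """
    # Expect exactly one word after the program's own name
    if len(argv) != 2:
        return "Usage: ./smile word"

    # Classify the word and pick the matching smiley
    return smiley_for(analyze(argv[1]))


print(main(["./smile"]))          # Usage: ./smile word
print(main(["./smile", "love"]))  # :| (the stand-in analyzer always returns 0)
```

Passing the argument list in explicitly, rather than reading sys.argv inside the function, is just a convenience that makes the sketch easy to try with different inputs.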
analyzer.py
Open up analyzer.py in sentiments/. Not much going on in there (yet)! Notice, though, that it imports the Natural Language Toolkit, among whose features is a tokenizer that you can use to split a tweet (which is maximally a 140-character str object) into a list of words (i.e., shorter str objects).
In there is our definition of that Analyzer class, which has two methods: __init__, which is called whenever Analyzer is instantiated; and analyze, which can be called to analyze some text. That first method takes two arguments in addition to self: positives, whose value is the path to a text file containing positive words; and negatives, whose value is the path to a text file containing negative words. Meanwhile, analyze takes one argument in addition to self: a str to be analyzed for sentiment. That method is (temporarily) hardcoded to return 0 no matter what, though.
Recall that methods are automatically passed that first reference to self so that they have a way of referring to objects' "instance variables."
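As a reminder of how self and instance variables fit together, here is a minimal sketch using a made-up class. Greeter, greeting, and greet are hypothetical names chosen for illustration and have nothing to do with the distro.

```python
class Greeter:
    def __init__(self, greeting="hello"):
        # Store the argument on self so that other methods can reach it later
        self.greeting = greeting

    def greet(self, name):
        # self is passed automatically; the caller supplies only name
        return f"{self.greeting}, {name}"


greeter = Greeter()            # __init__ runs here, with its default value
print(greeter.greet("world"))  # hello, world
```

Anything assigned to self inside __init__ stays attached to the object, which is why greet can read self.greeting even though greeting was never passed to it.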
positive-words.txt, negative-words.txt
Open up positive-words.txt and negative-words.txt (without changing them). Notice that atop each file is a bunch of comments, each of which starts with a ;. (Those are just text files, though, so the authors' choice of ; is arbitrary.) The lists of positive and negative words, respectively, begin below those comments, after a blank line.
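One way to read a file laid out like that is sketched below. It is an assumption-laden illustration rather than a prescribed approach: SAMPLE is a tiny made-up stand-in for one of the word files, and load_words is a hypothetical helper name.

```python
import io

# Hypothetical stand-in for a file laid out like positive-words.txt
SAMPLE = """; a comment line
; another comment line

a+
love
nice
"""


def load_words(lines):
    """Return the set of words in lines, skipping comments and blank lines."""
    words = set()
    for line in lines:
        word = line.strip()
        # Skip blank lines and lines that start with the comment character
        if not word or word.startswith(";"):
            continue
        words.add(word)
    return words


words = load_words(io.StringIO(SAMPLE))
print(sorted(words))  # ['a+', 'love', 'nice']
```

Because load_words only iterates over lines, it would work just as well on a file object opened with open as on the in-memory text used here.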
tweets
Open up tweets. Ah, another shebang. But nothing else besides a TODO! More on that soon.
helpers.py
Open up helpers.py. You should see two functions: chart and get_user_timeline. Given three values (positive, negative, and neutral, each an int or a float), chart generates HTML (as a str) for a pie chart depicting those values. Given a screen name, meanwhile, get_user_timeline returns a list of tweets (each as a str). That function uses Twython (har har), a library for Python, to retrieve those tweets via Twitter’s API (application programming interface), a free service that can be queried programmatically for tweets. Notice how the function expects two "environment variables" to exist. Environment variables are key/value pairs that exist within your terminal window and that programs (like tweets) can access programmatically. We’ll soon use two, API_KEY and API_SECRET, to store credentials for Twitter.
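For a sense of how a Python program can read environment variables, consider the sketch below. It is not how helpers.py is necessarily written: get_credentials is a hypothetical name, and raising RuntimeError when a variable is missing is an assumption made for illustration.

```python
import os


def get_credentials(environ=os.environ):
    """Return (API_KEY, API_SECRET) from environ, a mapping of str to str."""
    # .get returns None rather than raising KeyError if a key is absent
    api_key = environ.get("API_KEY")
    api_secret = environ.get("API_SECRET")
    if not api_key or not api_secret:
        # Assumption: treat missing or empty credentials as an error
        raise RuntimeError("API_KEY and/or API_SECRET not set")
    return api_key, api_secret


# A plain dict stands in for the real environment here
print(get_credentials({"API_KEY": "abc", "API_SECRET": "xyz"}))  # ('abc', 'xyz')
```

Called with no arguments, get_credentials would consult os.environ, which reflects whatever was exported in the terminal window that launched the program.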
application.py
Open up application.py. In this file is a "controller" for a Flask-based web app with two endpoints: / and /search. The first displays the simplest of forms via which you can search for a user on Twitter by screen name. The second displays one of those pie charts categorizing that user’s tweets. Notice, though, how 100% of those tweets are (temporarily) assumed to be neutral.
templates/index.html
Open up templates/index.html. In there is that simplest of forms. Notice how it figures out via url_for, a function that comes with Flask, to what URL the form should be submitted.
templates/search.html
Open up templates/search.html. Notice how this template renders a user’s screen name as well as that pie chart.
templates/layout.html
Open up templates/layout.html. In here is a layout on which index.html and search.html depend. It leverages Bootstrap to override browsers' default aesthetics.
requirements.txt
Open up requirements.txt (without changing it, though you can later if you’d like). This file specifies the libraries, one per line, on which all of this functionality depends.
Getting Started
-
In a terminal window execute
cd ~/workspace/chapter6/sentiments/
pip3 install --user -r requirements.txt
to install these programs' dependencies.
-
Sign up for Twitter at twitter.com/signup if you don’t already have an account.
-
Visit apps.twitter.com, logging in if prompted, and click Create New App.
-
Any (available) Name suffices.
-
Any (sufficiently long) Description suffices.
-
For Website, input https://cs50.harvard.edu/ (or any other URL).
-
Leave Callback URL blank.
-
Click Create your Twitter application. You should see "Your application has been created."
-
Click Keys and Access Tokens.
-
Click modify app permissions.
-
Select Read only, then click Update Settings.
-
Click Keys and Access Tokens again.
-
Highlight and copy the value to the right of Consumer Key (API Key).
-
In a terminal window, execute
export API_KEY=value
where value is that (pasted) value, without any space immediately before or after the =.
-
Highlight and copy the value to the right of Consumer Secret (API Secret).
-
In a terminal window, execute
export API_SECRET=value
where value is that (pasted) value, without any space immediately before or after the =.
If you close that terminal window and/or open another, you’ll need to repeat those last five steps.
Next, try running
./smile
to see how it works. Keep in mind that all words will be classified (for now!) as neutral because of that hardcoded 0 in analyzer.py.
Next, try running
flask run
and then select CS50 IDE > Web Server in CS50 IDE’s top-left corner. Search for some user’s screen name, and you should see a chart! Of course, it’s all yellow for now because of that 100.0 in application.py. Quit Flask with control-c.
Specification
analyzer.py
Complete the implementation of analyzer.py in such a way that
-
__init__ loads positive and negative words into memory in such a way that analyze can access them, and
-
analyze analyzes the sentiment of text, returning a positive score if text is more positive than negative, a negative score if text is more negative than positive, and 0 otherwise, whereby that score is computed as follows:
-
assign each word in text a value: 1 if the word is in positives, -1 if the word is in negatives, and 0 otherwise
-
consider the sum of those values to be the entire text’s score
For instance, if text were "I love you" (and Analyzer were instantiated with default values for its named parameters), then its score would be 0 + 1 + 0 = 1, since
-
"I" is in neither positive-words.txt nor negative-words.txt,
-
"love" is in positive-words.txt, and
-
"you" is in neither positive-words.txt nor negative-words.txt.
Suffice it to say, more sophisticated algorithms exist, but we’ll keep things simple!
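The arithmetic in that example can be checked with a few lines of Python. The sketch below deliberately uses two tiny made-up sets (POSITIVES and NEGATIVES, both hypothetical) and a list of words that has already been split up, so it shows only the summing rule, not how words get loaded or tokenized.

```python
# Tiny hypothetical word lists; the real ones come from the two text files
POSITIVES = {"love", "nice"}
NEGATIVES = {"hate", "awful"}


def score(tokens):
    """Sum 1 for each positive word, -1 for each negative word, 0 otherwise."""
    total = 0
    for token in tokens:
        # Compare case-insensitively
        word = token.lower()
        if word in POSITIVES:
            total += 1
        elif word in NEGATIVES:
            total -= 1
    return total


print(score(["I", "love", "you"]))  # 1
print(score(["I", "hate", "you"]))  # -1
print(score(["hello", "world"]))    # 0
```

Sets are used here because checking whether a word is in a set takes roughly constant time on average, whereas checking a list means scanning it.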
tweets
Complete the implementation of main in tweets in such a way that the program
-
accepts one and only one command-line argument, the screen name for a user on Twitter,
-
queries Twitter’s API for a user’s most recent 50 tweets,
-
analyzes the sentiment of each of those tweets, and
-
outputs each tweet’s score and text, colored in green if positive, red if negative, and yellow otherwise.
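The coloring rule in that last requirement might be sketched as follows. The names color_for and format_line are hypothetical, and the sketch stops short of actually printing in color (for which smile's use of colored from termcolor is a good model); it only decides which color name goes with which line.

```python
def color_for(score):
    """Pick a color name for a score: green, red, or yellow."""
    if score > 0:
        return "green"
    if score < 0:
        return "red"
    return "yellow"


def format_line(score, text):
    """Return the line to output for a tweet along with its color name."""
    return f"{score} {text}", color_for(score)


for score, text in [(0, "hello, @world"), (1, "I love you, @world"), (-1, "I hate you, @world")]:
    line, color = format_line(score, text)
    print(color, "->", line)
```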
application.py
Complete the implementation of search in application.py in such a way that the function
-
queries Twitter’s API for a user’s most recent 100 tweets,
-
classifies each tweet as positive, negative, or neutral,
-
generates a chart that accurately depicts those sentiments as percentages.
If a user has tweeted fewer than 100 times, classify as many tweets as exist.
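Turning per-tweet scores into percentages is a matter of counting and dividing by however many tweets actually came back. The sketch below is one hedged way to do that: percentages is a hypothetical helper, and reporting 100% neutral when there are no tweets at all is an assumption, not something the specification dictates.

```python
def percentages(scores):
    """Return (positive, negative, neutral) percentages for a list of scores."""
    total = len(scores)
    if total == 0:
        # Assumption: with no tweets to classify, call everything neutral
        return 0.0, 0.0, 100.0

    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    neutral = total - positive - negative

    # Divide by the number of tweets retrieved, which may be fewer than 100
    return 100 * positive / total, 100 * negative / total, 100 * neutral / total


print(percentages([1, 2, -1, 0]))  # (50.0, 25.0, 25.0)
```

The three values come back in the order positive, negative, neutral, matching the order in which chart's parameters are described above.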
Usage
Your programs should behave per the examples below. Assume that the text following each $ is what some user has typed.
$ ./smile
Usage: ./smile word
$ ./smile foo bar
Usage: ./smile word
$ ./smile love
:)
$ ./smile hate
:(
$ ./smile Stanford
:|
$ ./tweets
Usage: ./tweets @screen_name
$ ./tweets @foo @bar
Usage: ./tweets @screen_name
$ ./tweets @cs50
0 hello, @world
1 I love you, @world
-1 I hate you, @world
...
Testing
No check50 for these! But here are some actual screen names on Twitter that might have some positive or negative sentiments!
Hints
analyzer.py
-
Odds are you’ll find nltk.tokenize.casual.TweetTokenizer of interest, which can be used to tokenize a tweet (i.e., split it up into a list of words) with code like:
tokenizer = nltk.tokenize.TweetTokenizer()
tokens = tokenizer.tokenize(tweet)
For instance, if tweet is "I love you", then tokens will be ["I", "love", "you"]. The tokenizer treats some punctuation as separate tokens, so not to worry if it splits words like a+ (which is in positive-words.txt) into two tokens.
-
Be sure to ignore any comments or blank lines inside of positives and negatives.
-
If you would like a variable to be accessible from both __init__ and analyze, be sure to define it as an "instance variable" inside of self. For instance, if you were to define self.n = 42 inside of __init__, then self.n would also be accessible inside of analyze.
-
Odds are you’ll find str.lower of interest.
-
Note that get_user_timeline returns None in cases of error, as might happen if a screen name doesn’t exist or a screen name’s tweets are private.
-
And here’s the time-complexity (aka "Big O" or "Big Oh") of various operations in current CPython, the implementation of Python we’re using (which is an interpreter called python, or really python3, which itself is actually written in C).
tweets
-
Look at smile for inspiration!
-
Because tweets doesn’t end in .py, CS50 IDE won’t know it’s Python code, so syntax highlighting won’t be enabled by default. With the file open in a tab, change Text to Python in the tab’s bottom-right corner to enable.
application.py
-
Look (back) at tweets for inspiration!
FAQs
Could not build url for endpoint '/'
If you find that, when you try to search in your Flask app without typing anything into the text field, you get a Could not build url for endpoint '/' error, change redirect(url_for("/")) in application.py so that the line reads return redirect(url_for("index")).
ImportError: No module named 'sqlalchemy'
If seeing this error, execute
pip install --user sqlalchemy
to resolve!
twython.exceptions.TwythonAuthError: Twitter API returned a 401 (Unauthorized), An error occurred processing your request
If seeing this error, odds are you’re trying to get tweets for a screen name that’s protected (i.e., private)! Not to worry, though. You can assume we’ll only test your code with screen names that aren’t protected.
twython.exceptions.TwythonError: Twitter API returned a 404 (Not Found), Sorry, that page does not exist
If seeing this error, odds are you’re trying to get tweets for a screen name that doesn’t exist! Not to worry, though. You can assume we’ll only test your code with screen names that exist.
TypeError: 'NoneType' object is not iterable
If seeing this error in a for loop, be sure you’re indeed iterating over a list and not, e.g., None. In particular, be sure you’re checking the return value of get_user_timeline, which, per its implementation, can return None in cases of error.
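The pattern amounts to testing for None before looping. In the sketch below, fake_get_user_timeline is a hypothetical stand-in that mimics only the "list or None" behavior described above (it does not contact Twitter), and numbered_tweets is likewise a made-up name.

```python
def fake_get_user_timeline(screen_name):
    """Hypothetical stand-in: return a list of str, or None on error."""
    known = {"cs50": ["hello, @world", "I love you, @world"]}
    return known.get(screen_name)


def numbered_tweets(screen_name):
    """Return numbered tweets, or an error message if none could be fetched."""
    tweets = fake_get_user_timeline(screen_name)

    # Check for None before iterating; looping over None raises TypeError
    if tweets is None:
        return f"could not get tweets for @{screen_name}"

    return [f"{i}: {tweet}" for i, tweet in enumerate(tweets, start=1)]


print(numbered_tweets("cs50"))
print(numbered_tweets("nobody"))
```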
How to Submit
Step 1 of 2
Recall that you were asked to modify the files below:
-
analyzer.py
-
application.py
-
tweets
Be sure that each of your files is in ~/workspace/chapter6/sentiments/, as with:
cd ~/workspace/chapter6/sentiments/
ls
If any file is not in ~/workspace/chapter6/sentiments/, move it into that directory, as via mv (or via CS50 IDE’s lefthand file browser).
Step 2 of 2
-
To submit sentiments, execute
cd ~/workspace/chapter6/sentiments/
submit50 cs50/2017/ap/sentiments/
If you run into any trouble, email sysadmins@cs50.harvard.edu!
You may resubmit any problem as many times as you’d like before the deadline.
Your submission should be graded for correctness within 2 minutes, at which point your score will appear at cs50.me!
Acknowledgements
Special thanks to Aditi Muralidharan and John DeNero of UC Berkeley and to Minqing Hu and Bing Liu of the University of Illinois at Chicago!
This was Sentiments.