Problem: Similarities

tl;dr

  1. Implement a program that measures the edit distance between two strings.

  2. Implement a web app that depicts the costs of transforming one string into another.


Academic Honesty

This course’s philosophy on academic honesty is best stated as "be reasonable." The course recognizes that interactions with classmates and others can facilitate mastery of the course’s material. However, there remains a line between enlisting the help of another and submitting the work of another. This policy characterizes both sides of that line.

The essence of all work that you submit to this course must be your own. Collaboration on problems is not permitted (unless explicitly stated otherwise) except to the extent that you may ask classmates and others for help so long as that help does not reduce to another doing your work for you. Generally speaking, when asking for help, you may show your code or writing to others, but you may not view theirs, so long as you and they respect this policy’s other constraints. Collaboration on quizzes and tests is not permitted at all. Collaboration on the final project is permitted to the extent prescribed by its specification.

Below are rules of thumb that (inexhaustively) characterize acts that the course considers reasonable and not reasonable. If in doubt as to whether some act is reasonable, do not commit it until you solicit and receive approval in writing from your instructor. If a violation of this policy is suspected and confirmed, your instructor reserves the right to impose local sanctions on top of any disciplinary outcome that may include an unsatisfactory or failing grade for work submitted or for the course itself.

Reasonable

  • Communicating with classmates about problems in English (or some other spoken language).

  • Discussing the course’s material with others in order to understand it better.

  • Helping a classmate identify a bug in his or her code, such as by viewing, compiling, or running his or her code, even on your own computer.

  • Incorporating snippets of code that you find online or elsewhere into your own code, provided that those snippets are not themselves solutions to assigned problems and that you cite the snippets' origins.

  • Reviewing past years' quizzes, tests, and solutions thereto.

  • Sending or showing code that you’ve written to someone, possibly a classmate, so that he or she might help you identify and fix a bug.

  • Sharing snippets of your own solutions to problems online so that others might help you identify and fix a bug or other issue.

  • Turning to the web or elsewhere for instruction beyond the course’s own, for references, and for solutions to technical difficulties, but not for outright solutions to problems or your own final project.

  • Whiteboarding solutions to problems with others using diagrams or pseudocode but not actual code.

  • Working with (and even paying) a tutor to help you with the course, provided the tutor does not do your work for you.

Not Reasonable

  • Accessing a solution to some problem prior to (re-)submitting your own.

  • Asking a classmate to see his or her solution to a problem before (re-)submitting your own.

  • Decompiling, deobfuscating, or disassembling the staff’s solutions to problems.

  • Failing to cite (as with comments) the origins of code, writing, or techniques that you discover outside of the course’s own lessons and integrate into your own work, even while respecting this policy’s other constraints.

  • Giving or showing to a classmate a solution to a problem when it is he or she, and not you, who is struggling to solve it.

  • Looking at another individual’s work during a quiz or test.

  • Paying or offering to pay an individual for work that you may submit as (part of) your own.

  • Providing or making available solutions to problems to individuals who might take this course in the future.

  • Searching for, soliciting, or viewing a quiz’s questions or answers prior to taking the quiz.

  • Searching for or soliciting outright solutions to problems online or elsewhere.

  • Splitting a problem’s workload with another individual and combining your work (unless explicitly authorized by the problem itself).

  • Submitting (after possibly modifying) the work of another individual beyond allowed snippets.

  • Submitting the same or similar work to this course that you have submitted or will submit to another.

  • Using resources during a quiz beyond those explicitly allowed in the quiz’s instructions.

  • Viewing another’s solution to a problem and basing your own solution on it.

Assessment

Your work on this problem set will be evaluated along three axes primarily.

Correctness

To what extent is your code consistent with our specifications and free of bugs?

Design

To what extent is your code written well (i.e., clearly, efficiently, elegantly, and/or logically)?

Style

To what extent is your code readable (i.e., commented and indented with variables aptly named)?

To obtain a passing grade in this course, all students must ordinarily submit all assigned problems unless granted an exception in writing by the instructor.

Background

Determining whether two strings are identical is (relatively!) trivial: iterate over the characters in each, checking whether each and every one is identical. But it’s non-trivial to quantify just how dissimilar two (non-identical) strings are. And it can be time-consuming, as there are multiple (and often many!) ways to transform one string into the other.
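
For illustration, the character-by-character check described above might look like the following minimal sketch. The function name identical is hypothetical and is not part of the distribution code; in practice, Python's == operator already does this for you.

```python
def identical(a, b):
    """Return True if strings a and b contain exactly the same characters."""
    # Strings of different lengths can never be identical
    if len(a) != len(b):
        return False

    # Compare the strings one pair of characters at a time
    for c, d in zip(a, b):
        if c != d:
            return False
    return True


print(identical("cat", "cat"))  # same characters in the same order
print(identical("cat", "cut"))  # differ at index 1
```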

The challenge ahead is to measure the "edit distance" between two strings: the minimal number of single-character insertions, deletions, and/or substitutions necessary to transform one string into the other.
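
To make those operations concrete, here is a small sketch that applies them one at a time. The helper names delete, insert, and substitute are hypothetical, chosen only for this example. The sketch shows that three operations suffice to turn "kitten" into "sitting"; it does not itself prove that three is the minimum, which is what your own implementation will need to compute.

```python
def delete(s, i):
    """Remove the character at index i."""
    return s[:i] + s[i + 1:]


def insert(s, i, c):
    """Insert character c so that it ends up at index i."""
    return s[:i] + c + s[i:]


def substitute(s, i, c):
    """Replace the character at index i with c."""
    return s[:i] + c + s[i + 1:]


# One possible sequence of three operations
step1 = substitute("kitten", 0, "s")  # kitten -> sitten
step2 = substitute(step1, 4, "i")     # sitten -> sittin
step3 = insert(step2, 6, "g")         # sittin -> sitting
print(step1, step2, step3)
```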

Distribution

Downloading

$ wget https://cdn.cs50.net/ap/2018/problems/similarities/more/similarities.zip
$ unzip similarities.zip
$ rm similarities.zip
$ cd similarities
$ chmod a+x score
$ ls
application.py  helpers.py  requirements.txt  score*  static/  templates/

Understanding

score

Open up score. Suffice it to say that file’s name doesn’t end in .py, even though the file contains a program written in Python. But that’s okay! Notice the "shebang" atop the file:

#!/usr/bin/env python3

That line tells a computer to interpret (i.e., run) the program using python3 (aka python on CS50 IDE), an interpreter that understands Python 3.

Notice how the file defines a function called main and calls that function toward the bottom of the file. Defining main isn’t strictly necessary in Python, but it’s not uncommon.
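
As a rough sketch of that convention (not the contents of score itself), a Python program with a main function might be organized as below. The greeting function is a made-up example. Because main isn't called until the bottom of the file, it can use functions that are defined after it.

```python
def main():
    # By the time main is called, greeting has already been defined below
    print(greeting("world"))


def greeting(name):
    """Return a greeting for name."""
    return f"hello, {name}"


# Call main only when this file is run as a program, not when it is imported
if __name__ == "__main__":
    main()
```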

Notice how score uses Python’s argparse module in order to parse two command-line arguments, FILE1 and FILE2, the files to compare. The program then tries to read the contents of those files into strings, file1 and file2. If something goes wrong, as indicated by an IOError, the "exception" is caught. See https://docs.python.org/3/tutorial/errors.html for more on exceptions.
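
The pattern just described can be sketched as follows. This is an approximation under stated assumptions rather than a copy of score: the function names parse_arguments and read_file are hypothetical, and the exact help text and error message in the distribution code may differ.

```python
import argparse
import sys


def parse_arguments(argv):
    """Parse two positional command-line arguments, FILE1 and FILE2."""
    parser = argparse.ArgumentParser()
    parser.add_argument("FILE1", help="file to compare")
    parser.add_argument("FILE2", help="file to compare")
    return parser.parse_args(argv)


def read_file(path):
    """Return the contents of the file at path as a string, or exit on error."""
    try:
        with open(path, "r") as file:
            return file.read()
    except IOError:
        # Catch the exception and exit with a message instead of a traceback
        sys.exit(f"Could not read {path}")


# Parsing a list of arguments directly, as though they were typed at the prompt
args = parse_arguments(["foo.txt", "bar.txt"])
print(args.FILE1, args.FILE2)
```

In a real program you would pass sys.argv[1:] to parse_arguments (or call parse_args with no arguments, which does the same by default) and then call read_file on args.FILE1 and args.FILE2.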

Finally, the program passes those strings to distances, a function we’ll soon see, and ultimately prints the edit distance between the two files!

helpers.py

Open up helpers.py. Ah, the familiar TODO. Declared in this file is a function called distances that takes two strings as arguments, a and b, and is supposed to return (via a matrix of costs) the edit distance between one and the other. At the moment, though, it simply returns an empty two-dimensional list!

This file also defines an "enumeration" (i.e., Enum) that essentially defines three constants, each of which represents an operation via which a string might be transformed into another: Operation.DELETED, Operation.INSERTED, and Operation.SUBSTITUTED.
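
If enumerations are new to you, the sketch below shows roughly how one can be declared with Python's enum module. The member names come from helpers.py, but the numeric values here are an assumption made for illustration; rely on the names, not the values.

```python
from enum import Enum


class Operation(Enum):
    """Operations via which one string might be transformed into another."""
    # The specific values are assumed; only the names matter in practice
    DELETED = 1
    INSERTED = 2
    SUBSTITUTED = 3


# Members are compared by identity and can be looped over
operation = Operation.INSERTED
if operation is Operation.INSERTED:
    print(operation.name)

for member in Operation:
    print(member.name)
```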

application.py

Open up application.py. This file implements a web application that, ultimately, will allow you to visualize the edit distance between two strings as well as the operations necessary to transform one into the other at minimal cost. No need to understand the entirety of this file, but notice how score infers from the matrix returned by distances the sequence of operations that yield that minimal cost.

templates/layout.html

Open up templates/layout.html. In this file is a template for the web application’s overall layout. Odds are you’ll recognize a few of the HTML tags therein and notice a few new ones. Notice, in particular, how the template uses Bootstrap, a popular library. In fact, we based this template on their own starter template.

templates/index.html

Open up templates/index.html. Ah, another TODO. Notice how this template "extends" layout.html, which is to say that layout.html is the "mold" from which index.html itself will be made. The block defined in index.html will effectively get plugged into the placeholder for block in layout.html.

Ultimately, this file will contain the form via which users will be able to submit two strings to your web application for comparison.

templates/score.html

Open up templates/score.html. We took the liberty of implementing this file for you. Thanks to its use of some CSS (particularly a class called row), it ensures that matrix.html will fill the top half of a browser’s viewport and that log.html will fill the bottom half of the same.

templates/matrix.html

Open up templates/matrix.html. Ah, another TODO. It’s via this file that you’ll need to generate an HTML table that depicts the costs via which one string can be transformed into another.

templates/log.html

Open up templates/log.html. Phew, looks like we implemented this file for you. Indeed, it’s via this file that your web app will generate an HTML table that summarizes the operations via which one string can be transformed into another.

templates/error.html

Open up templates/error.html. In this file is a template with which any HTTP errors will be displayed. It happens to use Bootstrap’s Jumbotron feature.

static/styles.css

Open up static/styles.css. In this file are some CSS properties that collectively implement your web application’s user interface. Essentially, they modify some of Bootstrap’s own defaults.

requirements.txt

Open up requirements.txt (without changing it, though you can later if you’d like). This file specifies the libraries, one per line, on which all of this functionality depends.

Specification

helpers.py

distances

Implement distances in such a way that, given two strings, a and b, it calculates the edit distance from a to b, returning (as a list of lists) the matrix of operational costs incurred along the way. Treat the matrix’s top-left corner as [0][0] and the matrix’s bottom-right corner as [len(a)][len(b)]. Stored in each element of the matrix should be a tuple, (cost, operation), where cost is an int and operation is an Operation.
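
To visualize the shape of that return value only, the sketch below builds a matrix with len(a) + 1 rows and len(b) + 1 columns. The function name empty_matrix is hypothetical, and the (0, None) tuples are placeholders: computing the actual costs and operations is the part left to you.

```python
def empty_matrix(a, b):
    """Return a (len(a) + 1) x (len(b) + 1) list of lists of placeholder tuples."""
    # Build each row separately so that rows do not share the same list object
    return [[(0, None) for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]


a, b = "cat", "hello"
matrix = empty_matrix(a, b)

# The top-left corner is [0][0]; the bottom-right corner is [len(a)][len(b)]
cost, operation = matrix[len(a)][len(b)]
print(len(matrix), len(matrix[0]))
```

The extra row and column (hence the + 1) leave room for the empty prefix of each string, which is why the bottom-right corner can be indexed with len(a) and len(b) directly.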

templates/index.html

Implement templates/index.html in such a way that it contains an HTML form via which a user can submit:

  • a string called string1

  • a string called string2

You’re welcome to look at the HTML of the staff’s solution as needed, but do try to figure out the right syntax on your own first, as via https://www.google.com/search?q=html+forms!

templates/matrix.html

Implement templates/matrix.html in such a way that it generates, using Jinja2, a visualization of a matrix returned by distances (given some a and b) via an HTML table. In each cell of the table should be only a cost, not an operation. Along the leftmost column should be the characters from a, each in its own cell (and row); along the topmost row should be the characters from b, each in its own cell (and column).

Testing

To test your implementation of distances via the command line, execute score as follows, where FILE1 and FILE2 are any two text files:

./score FILE1 FILE2

To test your implementations via a web app, execute

flask run

and then visit the outputted URL.

See http://cdn.cs50.net/2017/fall/psets/6/similarities/inputs/ for sample inputs, though be sure to test with some of your own!

Correctness

check50 cs50/problems/2018/ap/similarities/more

Style

style50 helpers.py

Staff’s Solution

CLI

~cs50/pset6/more/score

How to Submit

Step 1 of 3

Execute update50 again to ensure that your IDE is up-to-date.

Step 2 of 3

  • Recall that you were asked to implement Similarities.

    • Be sure that helpers.py is in ~/workspace/unit6/similarities/, as with:

      cd ~/workspace/unit6/similarities/
      ls

Step 3 of 3

  • To submit similarities, execute:

    cd ~/workspace/unit6/similarities/
    submit50 cs50/2018/ap/similarities/more

Your submission should be graded for correctness within 2 minutes, at which point your score will appear at cs50.me!

This was Similarities.