Problem: Server (Part 1)

Questions? Feel free to head to CS50 on Reddit, CS50 on StackExchange, the #cs50ap channel on CS50x Slack (after signing up), or the CS50 Facebook group.

Objectives

Become familiar with HTTP.
Apply familiar techniques in unfamiliar contexts.
Transition from C to web programming.

This course’s philosophy on academic honesty is best stated as "be reasonable." The course recognizes that interactions with classmates and others can facilitate mastery of the course’s material. However, there remains a line between enlisting the help of another and submitting the work of another. This policy characterizes both sides of that line.

The essence of all work that you submit to this course must be your own. Collaboration on problems is not permitted (unless explicitly stated otherwise) except to the extent that you may ask classmates and others for help so long as that help does not reduce to another doing your work for you. Generally speaking, when asking for help, you may show your code or writing to others, but you may not view theirs, so long as you and they respect this policy’s other constraints. Collaboration on quizzes and tests is not permitted at all. Collaboration on the final project is permitted to the extent prescribed by its specification.

Below are rules of thumb that (inexhaustively) characterize acts that the course considers reasonable and not reasonable. If in doubt as to whether some act is reasonable, do not commit it until you solicit and receive approval in writing from your instructor. If a violation of this policy is suspected and confirmed, your instructor reserves the right to impose local sanctions on top of any disciplinary outcome that may include an unsatisfactory or failing grade for work submitted or for the course itself.

Reasonable

Communicating with classmates about problems in English (or some other spoken language).
Discussing the course’s material with others in order to understand it better.
Helping a classmate identify a bug in his or her code, such as by viewing, compiling, or running his or her code, even on your own computer.
Incorporating snippets of code that you find online or elsewhere into your own code, provided that those snippets are not themselves solutions to assigned problems and that you cite the snippets' origins.
Reviewing past years' quizzes, tests, and solutions thereto.
Sending or showing code that you’ve written to someone, possibly a classmate, so that he or she might help you identify and fix a bug.
Sharing snippets of your own solutions to problems online so that others might help you identify and fix a bug or other issue.
Turning to the web or elsewhere for instruction beyond the course’s own, for references, and for solutions to technical difficulties, but not for outright solutions to problems or your own final project.
Whiteboarding solutions to problems with others using diagrams or pseudocode but not actual code.
Working with (and even paying) a tutor to help you with the course, provided the tutor does not do your work for you.

Not Reasonable

Accessing a solution to some problem prior to (re-)submitting your own.
Asking a classmate to see his or her solution to a problem before (re-)submitting your own.
Decompiling, deobfuscating, or disassembling the staff’s solutions to problems.
Failing to cite (as with comments) the origins of code, writing, or techniques that you discover outside of the course’s own lessons and integrate into your own work, even while respecting this policy’s other constraints.
Giving or showing to a classmate a solution to a problem when it is he or she, and not you, who is struggling to solve it.
Looking at another individual’s work during a quiz or test.
Paying or offering to pay an individual for work that you may submit as (part of) your own.
Providing or making available solutions to problems to individuals who might take this course in the future.
Searching for, soliciting, or viewing a quiz’s questions or answers prior to taking the quiz.
Searching for or soliciting outright solutions to problems online or elsewhere.
Splitting a problem’s workload with another individual and combining your work (unless explicitly authorized by the problem itself).
Submitting (after possibly modifying) the work of another individual beyond allowed snippets.
Submitting the same or similar work to this course that you have submitted or will submit to another.
Using resources during a quiz beyond those explicitly allowed in the quiz’s instructions.
Viewing another’s solution to a problem and basing your own solution on it.

Assessment

Your work on this problem set will be evaluated along four axes primarily.

Scope: To what extent does your code implement the features required by our specification?
Correctness: To what extent is your code consistent with our specifications and free of bugs?
Design: To what extent is your code written well (i.e., clearly, efficiently, elegantly, and/or logically)?
Style: To what extent is your code readable (i.e., commented and indented with variables aptly named)?

To obtain a passing grade in this course, all students must ordinarily submit all assigned problems unless granted an exception in writing by the instructor.

Getting Ready

First, join David for a tour of HTTP, the “protocol” via which web browsers and web servers communicate.

Next, consider reviewing some of these examples from Week 7, via which we introduced HTML, the language in which web pages are written.

And also some of these examples, via which we introduced CSS, the language with which web pages can be stylized.

Next, consider reviewing some of these examples, via which we introduced HTML forms, which we used to submit GET queries to Google.

For another perspective altogether, join Daven for a tour of HTML too. Don’t miss the bloopers at the end!

Finally, join Joseph (and Rob) for a closer look at CSS.

Getting Started

First, log into cs50.io and execute

update50

within a terminal window to make sure your workspace is up-to-date.

Then execute

cd ~/workspace/chapterB

at your prompt to ensure that you’re inside of chapterB (which is inside of workspace which is inside of your home directory). Then execute

wget http://docs.cs50.net/2016/ap/problems/server/server.zip

to download this problem’s distro. Unzip the ZIP file (remember how?) and then delete the ZIP file from your chapterB directory. Navigate into your newly-created server directory (remember how?) and type:

ls

You should see that your directory contains five files and a folder.

Makefile	parser.c	parser.h	pointers.c	public/		server.o

Now execute

tree

(which is a hierarchical, recursive variant of ls), and you should see that the directory contains the below.

.
├── Makefile
├── parser.c
├── parser.h
├── pointers.c
├── public
│   ├── cat.html
│   ├── cat.jpg
│   ├── favicon.ico
│   ├── hello.html
│   ├── hello.php
│   └── test
│       └── index.html
└── server.o

Dang it, still C. But some other stuff too!

Go ahead and take a look at cat.html. Pretty simple, right? Looks like it has an img tag, the value of whose src attribute is cat.jpg.

Next, take a look at hello.html. Notice how it has a form that’s configured to submit via GET a text field called name to hello.php. Make sense? If not, try taking another look at the walkthrough for search-0.html.

Now take a look at hello.php. Notice how it’s mostly HTML but inside its body is a bit of PHP code:

<?= htmlspecialchars($_GET["name"]) ?>

The <?= notation just means “echo the following value here”. htmlspecialchars, meanwhile, is just an atrociously named function whose purpose in life is to ensure that special (even dangerous!) characters like < are properly “escaped” as HTML “entities”. See http://php.net/manual/en/function.htmlspecialchars.php for more details if curious. Anyhow $_GET is a “superglobal” variable inside of which are any HTTP parameters that were passed via GET to hello.php. More specifically, it’s an “associative array” (i.e., hash table) with keys and values. Per that HTML form in hello.html, one such key should be name! But more on all that in a bit.

Now the fun part. Open up parser.h and parser.c.

The challenge ahead is to implement the parsing part of a web server that knows how to serve static content (i.e., files ending in .html, .jpg, et al.) and dynamic content (i.e., files ending in .php).

Want to try out the staff’s solution before we dive into distribution code? Execute the below to download the latest version of the staff’s solution, as the version in CS50 IDE by default is outdated. Note that the O in -O is a capitalized letter O, not a zero.

sudo wget -O ~cs50/chapterB/server http://docs.cs50.net/2016/ap/problems/server/server
sudo chmod a+x ~cs50/chapterB/server

Then execute the below to run the staff’s implementation of server.

~cs50/chapterB/server

You should see these instructions:

Usage: server [-p port] /path/to/root

Looks a bit complex, but that’s just a conventional way of saying:

This program’s name is server.
To specify a (TCP) port number on which server should listen for HTTP requests, include -p as a command-line argument, followed by (presumably) a number. The brackets imply that specifying a port is optional. (If you don’t specify, the program will default to port 8080, which is required by CS50 IDE.)
The last command-line argument to server should be the path to your server’s “root” (the directory from which files will be served).

Let’s try it out. Execute the below within your own ~/workspace/chapterB directory so that the staff’s solution uses your own copy of public as its root.

~cs50/chapterB/server public

You should see output like the below.

Using /home/ubuntu/workspace/chapterB/public for server’s root
Listening on port 8080

Toward the top-right corner of CS50 IDE, meanwhile, you should see your workspace’s “fully qualified domain name,” an address of the form ide50-username.cs50.io, where username is your own username. Visit https://ide50-username.cs50.io/ (where username is your own username) in another tab. You should see a “directory listing” (i.e. an unordered list) of everything that’s in public, yes? And if you click cat.jpg, you should see a happy cat. If not, do just reach out to classmates or staff for a hand!

Incidentally, even though server is running on port 8080, CS50 IDE is “port-forwarding” port 80 (which, recall, is browsers’ default) to 8080 for you. That’s why you don’t need to specify 8080 in the URL you just visited.

Anyhow, assuming you indeed saw a happy cat in that tab, you should also see

GET /cat.jpg HTTP/1.1

in your terminal window, which is the “request line” that your browser sent to the server (which is being outputted by server via printf for diagnostics’ sake). Below that you should see all of the headers that your browser sent to server followed by

HTTP/1.1 200 OK

which is the server’s response to the browser (which is also being outputted by server via printf for diagnostics’ sake).

Next, just like I did in that short on HTTP, open up Chrome’s developer tools, per the instructions at https://developer.chrome.com/devtools. Then, once open, click the tools’ Network tab, and then, while holding down Shift, reload the page.

Not only should you see Happy Cat again, you should also see the below in your terminal window.

GET /cat.jpg HTTP/1.1
HTTP/1.1 200 OK

You might also see the below.

GET /favicon.ico HTTP/1.1
HTTP/1.1 200 OK

What’s going on if so? Well, by convention, a lot of websites have in their root directory a favicon.ico file, which is just a tiny icon that’s meant to be displayed in a browser’s address bar or tab. If you do see those lines in your terminal window, that just means Chrome is guessing that your server, too, might have a favicon.ico file, which it does!

Here’s a quick walkthrough if a demo may help.

Alright, now try visiting https://ide50-username.cs50.io/cat.html. (Note the .html instead of .jpg this time.) You should see Happy Cat again, possibly with a bit of margin around him (simply because of Chrome’s default CSS properties). If you look at the developer tools’ Network tab (possibly after reloading, if they weren’t still open), you should see that Chrome first requested cat.html followed by cat.jpg, since the latter, really, was specified as the value of the img element’s src attribute that we saw earlier in cat.html. To confirm as much, take a look at the developer tools’ Elements tab, wherein you’ll see a pretty-printed version of the HTML in cat.html. You can even change the HTML, but only Chrome’s in-memory copy thereof. To change the actual file, you’d need to do so in the usual way within CS50 IDE. Incidentally, you might find it interesting to tinker with the developer tools’ Styles tab, too. Even though this page doesn’t have any CSS of its own, you can see and change (temporarily) Chrome’s default CSS properties via that tab.

Okay, one last test. Try visiting https://ide50-username.cs50.io/hello.html. Go ahead and input your name into the form and then submit it, as by clicking the button or hitting Enter. You should find yourself at a URL like https://ide50-username.cs50.io/hello.php?name=Alice (albeit with your name, not Alice’s, unless your name is also Alice), where a personalized hello awaits! That’s what we mean by “dynamic” content. By submitting that form, you provided input (i.e., your name) to the server, which then generated output just for you. (That input was in the form of an “HTTP parameter” called name, the value of which was your name.) Indeed, if you look at the page’s source code (as via the developer tools’ Elements tab), you’ll see your name embedded within the HTML! By contrast, files like cat.jpg and cat.html (and even hello.html) are “static” content, since they’re not dynamically generated.

Neat, eh? Though odds are you’ll find it easier to test your own code via a command line than with a browser. So let’s show you one other technique.

Open up a second terminal window and position it alongside your first. In the first terminal window, execute

~cs50/chapterB/server public

from within your ~/workspace/chapterB directory, if the server isn’t already running. Then, in the second terminal window, execute the below. (Note the http:// this time instead of https://.)

curl -i http://localhost:8080/

If you haven’t used curl before, it’s a command-line program with which you can send HTTP requests (and more) to a server in order to see its responses. The -i flag tells curl to include responses’ HTTP headers in the output. Odds are, whilst debugging your server, you’ll find it more convenient (and revealing!) to see all of that via curl than by poking around Chrome’s developer tools.

Incidentally, take care not to request cat.jpg (or any binary file) via curl, else you’ll see quite a mess! (You’re about to try, aren’t you.)

Unfortunately, your server right now doesn’t have functionality… yet! If you open up parser.h, you’ll see declarations and descriptions for six functions: error, extract_request, extract_headers, parse, extract_query, and respond. While implementing your server, do take care to note the following.

You may alter Makefile.
You may alter parser.h, but may not alter the declarations of any of the functions therein. Odds are, you won’t need to change parser.h.
You may alter parser.c, and in fact, must in order to complete the implementation of your server.
Do not delete the server.o file as that’s where most of the server’s implementation is held!

Alright, ready to go?

OmgLikeWhut

Recall that HTTP messages adhere to a “grammar,” which is to say they’re formatted according to a set of rules, patterns that web servers and web browsers can parse. Consider a (simplified) grammar below, wherein any symbol that also appears to the left of an = is further defined by that rule. Know that CRLF represents \r\n, that SP represents a single space, that * means “zero or more” (of whatever’s in parentheses), that / means “or” and that square brackets mean something’s optional.

HTTP-message    =    start-line
                     *( header-field CRLF )
                     CRLF
                     [ message-body ]
start-line      =    request-line / status-line
request-line    =    method SP request-target SP HTTP-version CRLF
status-line     =    HTTP-version SP status-code SP reason-phrase CRLF
header-field    =    field-name ":" field-value

One important task of web servers is to parse the HTTP message to ensure the messages are “grammatically correct”.
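To make the grammar concrete, here is a minimal sketch in C of one message that fits it, along with a helper that counts its header-fields. The sample request, its header values, and the name count_header_fields are assumptions made up for illustration; none of them come from the distro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// An illustrative message: a request-line, two header-fields (each followed
// by CRLF), then the empty line (CRLF alone) that ends the headers.
static const char SAMPLE_REQUEST[] =
    "GET /cat.jpg HTTP/1.1\r\n"
    "Host: localhost:8080\r\n"
    "Accept: */*\r\n"
    "\r\n";

// Hypothetical helper: counts the header-fields between the start-line and
// the empty line. Assumes message is a '\0'-terminated string.
size_t count_header_fields(const char *message)
{
    assert(message != NULL);

    // Skip past the start-line, which ends at the first CRLF
    const char *line = strstr(message, "\r\n");
    if (line == NULL)
    {
        return 0;
    }
    line += 2;

    size_t count = 0;

    // A line consisting of CRLF alone marks the end of the headers
    while (strncmp(line, "\r\n", 2) != 0)
    {
        const char *end = strstr(line, "\r\n");
        if (end == NULL)
        {
            // Malformed message: a header-field without its CRLF
            break;
        }
        count++;
        line = end + 2;
    }
    return count;
}
```

Calling count_header_fields(SAMPLE_REQUEST) walks the Host and Accept lines and then stops at the empty line.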

Before we jump into coding, let’s review an old friend from our past: pointers (you’ll thank us later)! Recall that a string is just a char*, which is a pointer to a single character. The program will then continue reading 1 byte (the size of a char) at a time until a \0 is reached. So if we have the following string:

char* word = "hello, world";

Then the char* pointer, word, would point to the first letter in the string, which in this case, is h. Let’s arbitrarily say that h is located in memory at location 0x05 (remember hex?). Then the e would be at 0x06, the l at 0x07 and so on and so forth until the NULL terminator is read at 0x11. So if we had a pointer that pointed to the NULL terminator, and the pointer, word, we could get the length of the string by subtracting the pointer, word, from the pointer that points to the NULL terminator (0x11 - 0x05 = 0x0C, which is 12, the length of the string word). Remember, all strings end with a NULL terminator, even an empty string such as ""!
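The subtraction described above can be sketched in a few lines of C. The function name length_by_pointers is hypothetical; in practice strlen does the same job.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: measures a string by walking a second pointer to the
// terminating '\0' and subtracting the starting pointer from it.
size_t length_by_pointers(const char *word)
{
    assert(word != NULL);

    // Start at the first character and advance until the terminator
    const char *end = word;
    while (*end != '\0')
    {
        end++;
    }

    // Both pointers refer to the same string, so their difference is the
    // number of chars that lie between them
    return (size_t) (end - word);
}
```

For "hello, world" the two pointers end up 12 chars apart, matching the arithmetic above, and for "" they are equal, so the result is 0.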

Still confused? Open up pointers.c and take a look at the code inside. Here, we parse a sentence, "hello, world", to find the first and second words of the sentence. We isolate them twice, once in a char array and once using dynamically allocated memory. Note that strcpy always copies in the '\0', whereas strncpy copies it only if it reaches the end of the source string before hitting its limit of n characters; otherwise, you must add the '\0' yourself. You’re more than welcome, and in fact encouraged, to use pointers.c as a reference throughout this problem.
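Since a missing terminator is an easy bug to introduce, here is a small sketch of copying just the start of a string with strncpy. The helper name copy_prefix is an assumption for illustration and is not necessarily how pointers.c does it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copies the first n chars of src into dst and
// terminates the result. dst must have room for at least n + 1 chars.
void copy_prefix(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    // strncpy writes at most n chars; if src holds n or more chars,
    // it stops without ever writing a '\0'
    strncpy(dst, src, n);

    // So terminate the copy ourselves
    dst[n] = '\0';
}
```

With a buffer declared as char first[6], calling copy_prefix(first, "hello, world", 5) leaves "hello" in first.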

Divide and Conquer

Feel free to divide the work in whichever manner you believe to be the most efficient, but our recommendation is for one person to take on extract_request, extract_headers, and extract_query, while the other person tackles parse. Be sure to communicate throughout the whole coding and problem-solving process to make the problem much easier! After all, two heads are better than one.

Such Parsing, Much Wow

If you try to run your version of server, you’ll see that it doesn’t do much at all. Now that we had that little review of pointers, your job, as a team, is to implement the parsing functionality. Let’s dive in.

extract_request

Complete the implementation of extract_request in such a way that the function takes the parameter message, which is the HTTP message, and returns the request-line.

The request-line of the message is the part of the HTTP message up to and including the first CRLF (or \r\n). If no CRLF is found, respond to the browser with 500 Internal Server Error by calling

error(500);

and returning NULL. Else, if a CRLF is found, “extract” the request-line by dynamically allocating (remember how?) a char* and taking care to copy only the request-line into the new char*. Do take care to ensure all your strings end in a NULL terminator and not to worry about freeing your dynamically allocated memory! We take care of that for you in server.o.

Odds are you’ll find functions like strchr, strstr, strcpy, strncpy, strncmp, strcmp, and/or strcasecmp of help!
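As a rough illustration of the "find, allocate, copy, terminate" pattern described above, here is a sketch that is deliberately not extract_request itself. The helper name copy_first_line is hypothetical, and it returns NULL instead of calling error, which lives in the distro.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not part of the distro): returns a newly allocated
// copy of everything in message up to and including the first CRLF, or NULL
// if there is no CRLF or if malloc fails. The caller must free the result.
char *copy_first_line(const char *message)
{
    assert(message != NULL);

    // Locate the first CRLF
    const char *crlf = strstr(message, "\r\n");
    if (crlf == NULL)
    {
        return NULL;
    }

    // Number of chars in the line, counting the two chars of the CRLF
    size_t n = (size_t) (crlf - message) + 2;

    // One extra byte for the terminator
    char *line = malloc(n + 1);
    if (line == NULL)
    {
        return NULL;
    }

    memcpy(line, message, n);
    line[n] = '\0';
    return line;
}
```

Using memcpy followed by an explicit '\0' sidesteps the question of whether strncpy will terminate the copy.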

extract_headers

Complete the implementation of extract_headers in such a way that the function takes the parameter content and extracts the headers, responding with the interpreter’s content.

Similarly to extract_request, the headers are the part up to and including the first CRLF CRLF in content. If no CRLF CRLF is found, free content, respond to the browser with 500 Internal Server Error, and stop the function by simply returning.

Else, copy in the header information to a new variable, though this time the variable need not be in dynamically allocated memory since it is not being returned by the function.

Finally, call respond with a 200. If you look at parser.h, you’ll see that respond takes four arguments, code, headers, body, and length, where code is your response code to the browser, headers are the headers you extracted in the function, body is the part of the HTTP message that comes after the headers (after the CRLF CRLF), and length is the length of the body. Odds are your call to respond will look something like

respond(200, headers, needle + 4, length - size of headers);

where headers is the extracted headers and needle (we use the variable name needle to represent finding a needle in a haystack, wherein content is the haystack here) is a pointer to CRLF CRLF. To calculate the length of the body, calculate the size of the headers and subtract that from length, which was passed to extract_headers as an argument.

Odds are you’ll find functions like strchr, strstr, strcpy, strncpy, strncmp, strcmp, and/or strcasecmp of help!
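The length arithmetic above can be sketched on its own, without calling respond. The helper name find_body and its out-parameter are assumptions for illustration, and the sketch assumes content is '\0'-terminated so that strstr is safe to call on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: given content of total size length, finds the
// CRLF CRLF that ends the headers and stores the size of what follows in
// *body_length. Returns a pointer to the body, or NULL if no CRLF CRLF is
// found. Assumes content is '\0'-terminated.
const char *find_body(const char *content, size_t length, size_t *body_length)
{
    assert(content != NULL && body_length != NULL);

    // needle points at the first char of CRLF CRLF
    const char *needle = strstr(content, "\r\n\r\n");
    if (needle == NULL)
    {
        return NULL;
    }

    // The headers run through the end of CRLF CRLF, which is 4 chars long
    size_t header_size = (size_t) (needle - content) + 4;
    if (header_size > length)
    {
        return NULL;
    }

    // Whatever remains of length belongs to the body
    *body_length = length - header_size;
    return needle + 4;
}
```

Tracking the body by pointer and length, rather than by its terminator, matters because a body need not be text.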

extract_query

Complete the implementation of extract_query in such a way that the function takes in target, the request-target of the HTTP message, and stores the absolute path and query in abs_path and query respectively.

Per 5.3 of http://tools.ietf.org/html/rfc7230, a request-target can take several forms, the only one of which your server needs to support is

absolute-path [ "?" query ]

whereby absolute-path (which will not contain ?) might optionally be followed by a ? followed by a query. For example, a request-target may look like

/hello.php
/hello.php?
/hello.php?q=Alice

where /hello.php is the absolute-path for all three above, but the only query present in any of the three request-targets above is q=Alice. Do take note that the ? is part of neither the absolute-path nor the query. The presence of a ? simply indicates that a query may follow.

Parse target, the request-target passed into extract_query as an argument, so that the absolute-path is stored at the address in abs_path and the query is stored at the address in query. If the query is absent (even if a ? is present), then query should be "", thereby consuming one byte, whereby query[0] is '\0'.

For instance, if request-target is /hello.php or /hello.php? then query should have a value of "". And if request-target is /hello.php?q=Alice, then query should have a value of q=Alice.

Odds are you’ll find functions like strchr, strstr, strcpy, strncpy, strncmp, strcmp, and/or strcasecmp of help!
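One way to picture the split at ? is the sketch below. The helper name split_target is hypothetical, and it assumes the caller supplies two buffers that are each at least strlen(target) + 1 bytes; the real extract_query may be handed its buffers differently, so check parser.h.

```c
#include <assert.h>
#include <string.h>

// Hypothetical helper: copies the part of target before its first '?' into
// abs_path and the part after it into query. Each buffer must hold at least
// strlen(target) + 1 bytes.
void split_target(const char *target, char *abs_path, char *query)
{
    assert(target != NULL && abs_path != NULL && query != NULL);

    const char *mark = strchr(target, '?');
    if (mark == NULL)
    {
        // No '?' at all: the whole target is the path and the query is ""
        strcpy(abs_path, target);
        query[0] = '\0';
        return;
    }

    // Copy everything before the '?' and terminate it
    size_t n = (size_t) (mark - target);
    memcpy(abs_path, target, n);
    abs_path[n] = '\0';

    // Copy everything after the '?', which may be nothing but the '\0'
    strcpy(query, mark + 1);
}
```

Because mark + 1 points at the terminator when nothing follows the ?, the "/hello.php?" case yields an empty query without any special handling.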

parse

Complete the implementation of parse in such a way that the function parses (i.e. iterates over) line, ensuring that the request-line is “grammatically correct”.

Per 3.1.1 of http://tools.ietf.org/html/rfc7230, a request-line is defined as

method SP request-target SP HTTP-version CRLF

wherein SP represents a single space ( ) and CRLF represents \r\n. None of method, request-target, and HTTP-version, meanwhile, may contain SP.

Parse line by isolating the method, request-target, HTTP-version and CRLF and storing them into separate char array variables, so that a variable named target would only hold the request-target.

Ensure that request-line (which is passed into parse as line) is consistent with the following rules. If it is not, respond to the browser with the appropriate response code by calling error with that response code, as shown below.

error(400);

If method is not GET, respond to the browser with 405 Method Not Allowed and return false;
If request-target does not begin with /, respond to the browser with 501 Not Implemented and return false;
If request-target contains a ", respond to the browser with 400 Bad Request and return false;
If HTTP-version is not HTTP/1.1, respond to the browser with 505 HTTP Version Not Supported and return false;
If there is no \r\n, respond to the browser with 414 Request-URI Too Long and return false;
If there are more or fewer than two SP, respond to the browser with 400 Bad Request and return false;
If anything else in the request-line is incorrectly formatted, respond to the browser with 400 Bad Request and return false;
Else, if everything is properly formatted, call extract_query and return true, as per the below.

extract_query(target, abs_path, query);
return true;

wherein target is the request-target you extracted earlier, and abs_path and query are the arguments passed into parse.

Odds are you’ll find functions like strchr, strstr, strcpy, strncpy, strncmp, strcmp, and/or strcasecmp of help!
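Before checking the individual rules, it can help to confirm that a line has the overall shape the grammar demands. The sketch below covers only that shape check; it does not call error or choose status codes, and the name has_three_parts is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: returns true if line looks like
// "part SP part SP part CRLF", that is, it ends with CRLF, contains exactly
// two spaces before the CRLF, and none of its three parts is empty.
bool has_three_parts(const char *line)
{
    assert(line != NULL);

    // The line must contain a CRLF with nothing after it
    const char *crlf = strstr(line, "\r\n");
    if (crlf == NULL || crlf[2] != '\0')
    {
        return false;
    }

    int spaces = 0;
    const char *start = line; // where the current part begins

    for (const char *p = line; p < crlf; p++)
    {
        if (*p == ' ')
        {
            // A space right where a part should begin means an empty part
            if (p == start)
            {
                return false;
            }
            spaces++;
            start = p + 1;
        }
    }

    // Exactly two separators, and the last part must not be empty
    return spaces == 2 && start < crlf;
}
```

Once the shape is known to be sound, each part can be isolated with the same pointer arithmetic used elsewhere in this problem and then tested against the rules above.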

This was Server (Part 1).