Week 6

Last Time

  • We learned about more complex data structures, such as:

    • linked lists, nodes connected by pointers

    • stacks and queues, lists that maintain last-in-first-out (stacks) or first-in-first-out (queues) ordering

    • binary search trees, where each parent node links to up to two child nodes

    • hash tables, essentially arrays of linked lists, where we can quickly find elements

    • tries, where we can look up elements one character at a time

The Internet

  • Now we leave behind the world of C to learn about the internet and the web.

  • Let’s consider how we might connect to the internet at home. We have an internet service provider (ISP), such as Comcast or Verizon, which runs wires into our home that connect us to their network of wires.

  • And the internet is just the interconnection of all these networks. Applications that we use every day run on top of this physical connection.

  • These days we typically connect to a router (a box that the wire from the outside world plugs into) wirelessly. Once we choose the wireless network that our router is broadcasting and connect to it, a technology called DHCP (Dynamic Host Configuration Protocol) assigns some IP (Internet Protocol) address to our computer, that uniquely identifies it. And this address is how computers across the internet talk to each other.

  • IPv4 (IP version 4) is the most common today, with four numbers of the format #.#.#.#.

  • Just like how buildings in the real world have an address to identify them, so do computers on the internet.

  • And there is a system for allocating these addresses, by provider or organization. For example, Harvard’s IPs include the ones in the range of 140.247.#.# or 128.103.#.#.

  • Each of the # symbols can be a number in the range 0 to 255, which is exactly the range of values 8 bits can hold. So an IP address with four of these numbers is exactly a 32-bit value.
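  • As a sketch (in Python, not from the lecture), we can pack the four octets ourselves and see that the whole address really fits in one 32-bit value:

```python
# Sketch (not from the lecture): pack the four 0-255 octets of an IPv4
# address into one integer to see that it fits in exactly 32 bits.

def ip_to_int(address):
    """Convert a dotted-quad string like "140.247.0.1" to a 32-bit integer."""
    octets = [int(part) for part in address.split(".")]
    assert len(octets) == 4 and all(0 <= o <= 255 for o in octets)
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # shift 8 bits left, then add the next octet
    return value

def int_to_ip(value):
    """Convert a 32-bit integer back to dotted-quad form."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(ip_to_int("255.255.255.255") == 2**32 - 1)  # True: the largest address
print(int_to_ip(ip_to_int("140.247.0.1")))        # 140.247.0.1
```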

  • There are also reserved IPs, known as private addresses, with the ranges 10.#.#.#, 172.16.#.# - 172.31.#.#, and 192.168.#.#, that are used within a particular network, but not with the outside world.
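  • As a quick check (Python’s standard ipaddress module, not from the lecture), these private ranges are among the reserved blocks the module already recognizes:

```python
# Sketch (not from the lecture): Python's standard ipaddress module
# recognizes the private ranges listed above, among other reserved blocks.
from ipaddress import ip_address

for addr in ["10.0.0.1", "172.16.5.4", "192.168.1.1", "8.8.8.8"]:
    print(addr, "private" if ip_address(addr).is_private else "public")
```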

  • But we rarely, if ever, type some numbers into our browser to visit websites. There is another technology called DNS (Domain Name System) that maps domain names to IP addresses, and vice versa. So a domain name like www.google.com is translated to an IP address behind the scenes.

  • And now that we have IP addresses to send to and receive from, we can create and send packets of information with those addresses in them.

  • We send those packets to routers, special-purpose servers in datacenters around the world that forward information based only on the destination IP. By passing our packets from router to router, we can get them to our destination.

  • We can open the CS50 IDE, and run a command like:

    $ nslookup www.google.com
    Server:         140.247.233.195
    Address:        140.247.233.195#53
    
    Non-authoritative answer:
    Name:   www.google.com
    Address: 172.217.4.36
    • The first line is the DNS server we asked to look up the domain name for us, and it returned a non-authoritative answer with the address, since it doesn’t own that domain name.

  • So we can imagine packets as envelopes with information inside, and To and From addresses on the outside.

  • We can even run a command like this:

    (screenshot: output of traceroute for www.google.com)
    • We see the routers that our packets would go through if we wanted to reach www.google.com.

    • The first two, with the letters sc in their name and ending in .harvard.edu are Harvard’s routers in the Science Center.

    • The next one, bdrgw2, is a "border gateway", that then connects to nox1, "northern crossroads," a place where a lot of internet providers connect their cabling and technology.

    • Then we have lots of anonymous routers with no domain names attached, until we finally reach the last one, which must be one of Google’s servers.

  • Now let’s try a website far away:

    (screenshot: output of traceroute for the Japanese version of CNN’s website)
    • So it looks like the Japanese version of CNN’s website takes a lot longer to reach.

    • It seems that routers 8 and 9 have the biggest gap, so there might be a (literal) ocean of distance between them.

  • We watch a video on underwater cables.

  • So once someone, say Google, receives the packet we sent them, they might want to reply. But if they want to send more data than can fit in a single packet, there exists a technology called TCP (Transmission Control Protocol) that splits data into pieces, and sends multiple packets. And those packets are labeled with something like 1 of 4 or 2 of 4, so we can order them and know we got them all.
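  • The numbering idea can be sketched in a few lines (Python, not from the lecture; real TCP uses byte-offset sequence numbers rather than "1 of 4" labels):

```python
# Sketch (not from the lecture) of the TCP idea: split a message into
# numbered pieces, deliver them out of order, then use the numbers to
# reassemble and to notice if any piece is missing.
import random

def split_into_packets(data, size):
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    total = len(chunks)
    # label each packet like "2 of 4": (number, total, payload)
    return [(i + 1, total, chunk) for i, chunk in enumerate(chunks)]

def reassemble(packets):
    total = packets[0][1]
    if len(packets) != total:
        raise ValueError("missing packets")
    ordered = sorted(packets)  # put the pieces back in numeric order
    return "".join(payload for _, _, payload in ordered)

packets = split_into_packets("HELLO, WORLD", 4)
random.shuffle(packets)       # the network may deliver packets out of order
print(reassemble(packets))    # HELLO, WORLD
```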

  • A computer might be running many network services at once, so to differentiate which application a packet is meant for, packets are also labeled with an additional number called a port.

  • For example, standard ports and protocols include:

    • 21 FTP, for file transfers

    • 22 SSH, secure shell, to run commands on another computer

    • 25 SMTP, for sending email

    • 53 DNS

    • 80 HTTP, for visiting websites

    • 443 HTTPS, for visiting secure websites

  • Firewalls keep packets out, so they might be used to block certain websites, or keep packets in, to prevent sensitive information from leaving. This is implemented by a local router that looks at all the packets and simply doesn’t forward ones with certain addresses. It could also block all traffic on a certain port.
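  • The filtering logic can be sketched like this (Python, not from the lecture; the blocked address and port are made-up examples):

```python
# Sketch (not from the lecture) of the filtering a firewall performs;
# the blocked address and port here are hypothetical examples.
BLOCKED_ADDRESSES = {"203.0.113.7"}  # hypothetical blocked destination
BLOCKED_PORTS = {25}                 # e.g. block outgoing email (SMTP)

def allow(packet):
    """Return True if a packet (a dict with 'dst' and 'port') may pass."""
    if packet["dst"] in BLOCKED_ADDRESSES:
        return False
    if packet["port"] in BLOCKED_PORTS:
        return False
    return True

print(allow({"dst": "172.217.4.36", "port": 80}))  # True
print(allow({"dst": "172.217.4.36", "port": 25}))  # False
print(allow({"dst": "203.0.113.7", "port": 80}))   # False
```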

  • There are services called VPNs (Virtual Private Networks) that you can use to connect to your company or school’s network. An encrypted tunnel is created to route all your traffic through the VPN first, before being sent out to the internet. But the cost of this is that it now takes more time to send our packets there first.

  • Other pieces of hardware include switches, with lots of ports to plug ethernet cables into, to connect many machines, and access points, which create wireless networks for computers to connect to.

  • We watch another video summarizing how the internet works.

HTTP

  • Now that we have an idea of how data is transmitted between computers on the internet, we can talk about what is being sent.

  • HTTP (HyperText Transfer Protocol) is one of the most common ways that messages are formatted for communication.

  • For example, in the real world we might introduce ourselves by saying "Hi, I’m David" and extending our hand, and the other person says their name and shakes our hand back.

  • With HTTP, we have similar conventions for how we start communicating and respond to communications.

  • The simplest request in HTTP is a method called GET, where we send a message that literally reads:

    GET / HTTP/1.1
    Host: www.harvard.edu
    ...
    • The / refers to the default page in the root directory of the site, HTTP/1.1 indicates the version of HTTP we want to use, and Host: www.harvard.edu indicates the website we want the server to return to us.

  • And a response would start with this:

    HTTP/1.1 200 OK
    Content-Type: text/html
    ...
    • And after those first lines will be the actual webpage or information we requested.

    • HTML is the language that webpages are written in, and is likely what the content will be.
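  • We can sketch both sides of this exchange (Python, not from the lecture): building the raw request bytes a browser would send, and parsing a status line like HTTP/1.1 200 OK:

```python
# Sketch (not from the lecture): the raw bytes of the GET request above,
# and a tiny parser for a response status line.

def build_get_request(host, path="/"):
    # each header line ends with \r\n; a blank line ends the request
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"\r\n").encode()

def parse_status_line(line):
    # split "HTTP/1.1 200 OK" into (version, code, reason)
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason

print(build_get_request("www.harvard.edu"))
print(parse_status_line("HTTP/1.1 200 OK"))  # ('HTTP/1.1', 200, 'OK')
```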

  • Common status codes include:

    • 200 OK

    • 301 Moved Permanently

    • 302 Found

    • 304 Not Modified

    • 401 Unauthorized

    • 403 Forbidden

    • 404 Not Found

    • 500 Internal Server Error
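  • Python’s standard library knows these codes too; a quick way to double-check the phrases (not from the lecture):

```python
# Quick check (not from the lecture): Python's http.HTTPStatus already
# maps these numeric codes to their standard phrases.
from http import HTTPStatus

for code in (200, 301, 302, 304, 401, 403, 404, 500):
    print(code, HTTPStatus(code).phrase)
```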

  • We can see this with commands in our terminal too. We can run:

    $ telnet www.harvard.edu 80
    Trying 104.16.151.6...
    Connected to www.harvard.edu.cdn.cloudflare.net
    Escape character is '^]'.
    • We use port 80 since that’s used for HTTP, and we see that Harvard uses a service called CloudFlare, which is a content delivery network (that helps serve websites more quickly).

  • Then we can type:

    GET / HTTP/1.1
    Host: www.harvard.edu
  • And if we send that, and then scroll up (or redirect the output to a file), we’ll first see the HTTP response:

    (screenshot: HTTP response headers from www.harvard.edu)
    • We see HTTP/1.1 200 OK and a lot of other headers, that indicate when this page expires or what type of content it is.

  • We can use an alternative command called curl to see just the headers:

    $ curl -I http://www.harvard.edu/
  • We can do:

    $ curl -I http://reference.cs50.net/
    HTTP/1.1 301 Moved Permanently
    Cache-control: no-cache="set-cookie"
    Content-Length: 178
    Content-Type: text/html
    Date: Mon, 03 Oct 2016 17:17:39 GMT
    Location: https://reference.cs50.net/
    Server: nginx/1.8.1
    Set-Cookie: AWSELB=7D03E3C11C9564D4EBA91026CCAAA8EEDCD5DC34657AEDEBBAB0856E24F9ACB5BE65C5B4443B7EF06C9BBEAC5F36BF556A51333C0377A6BC471E810D021D4033A06AC36B27;PATH=/
    Connection: keep-alive
    • We see a Location: header to redirect us to a new URL.

    • If we type that URL into our browser, we’ll see that the location changes to start with https:// automatically.

  • With HTTPS, our traffic between the server and ourselves will be encrypted, so anyone else on the network won’t be able to read it.

  • If we now visit Google and search for something like "cats", we might end up at some long URL. But we can shorten it to just the parts we understand: http://www.google.com/search?q=cats. And if we visit just that URL, we still see our results.
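  • The q=cats part is called a query string; as a sketch (Python’s standard urllib.parse, not from the lecture), we can build and take apart such strings:

```python
# Sketch (not from the lecture): building and parsing the q=cats query
# string with Python's standard urllib.parse.
from urllib.parse import urlencode, parse_qs, urlparse

query = urlencode({"q": "cats"})          # "q=cats"
url = "http://www.google.com/search?" + query
print(url)

parsed = urlparse(url)
print(parse_qs(parsed.query))             # {'q': ['cats']}
```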

  • So it seems like our browser is sending out input (what we typed into the search page) to the server with the URL.

  • If we right-click a website in Chrome, we can click Inspect and see formatted HTML with a nested structure and perhaps patterns of words: