Problem: Recover

Questions? Feel free to head to CS50 on Reddit, CS50 on StackExchange, the #cs50ap channel on CS50x Slack (after signing up), or the CS50 Facebook group.

Objectives

  • Acquaint you with file I/O.

  • Get you more comfortable with data structures and hexadecimal.

  • Gently introduce pointers.

  • Save the day.

* The Wikipedia articles are a bit dense; feel free to skim or skip!

This course’s philosophy on academic honesty is best stated as "be reasonable." The course recognizes that interactions with classmates and others can facilitate mastery of the course’s material. However, there remains a line between enlisting the help of another and submitting the work of another. This policy characterizes both sides of that line.

The essence of all work that you submit to this course must be your own. Collaboration on problems is not permitted (unless explicitly stated otherwise) except to the extent that you may ask classmates and others for help so long as that help does not reduce to another doing your work for you. Generally speaking, when asking for help, you may show your code or writing to others, but you may not view theirs, so long as you and they respect this policy’s other constraints. Collaboration on quizzes and tests is not permitted at all. Collaboration on the final project is permitted to the extent prescribed by its specification.

Below are rules of thumb that (inexhaustively) characterize acts that the course considers reasonable and not reasonable. If in doubt as to whether some act is reasonable, do not commit it until you solicit and receive approval in writing from your instructor. If a violation of this policy is suspected and confirmed, your instructor reserves the right to impose local sanctions on top of any disciplinary outcome that may include an unsatisfactory or failing grade for work submitted or for the course itself.

Reasonable

  • Communicating with classmates about problems in English (or some other spoken language).

  • Discussing the course’s material with others in order to understand it better.

  • Helping a classmate identify a bug in his or her code, such as by viewing, compiling, or running his or her code, even on your own computer.

  • Incorporating snippets of code that you find online or elsewhere into your own code, provided that those snippets are not themselves solutions to assigned problems and that you cite the snippets' origins.

  • Reviewing past years' quizzes, tests, and solutions thereto.

  • Sending or showing code that you’ve written to someone, possibly a classmate, so that he or she might help you identify and fix a bug.

  • Sharing snippets of your own solutions to problems online so that others might help you identify and fix a bug or other issue.

  • Turning to the web or elsewhere for instruction beyond the course’s own, for references, and for solutions to technical difficulties, but not for outright solutions to problems or your own final project.

  • Whiteboarding solutions to problems with others using diagrams or pseudocode but not actual code.

  • Working with (and even paying) a tutor to help you with the course, provided the tutor does not do your work for you.

Not Reasonable

  • Accessing a solution to some problem prior to (re-)submitting your own.

  • Asking a classmate to see his or her solution to a problem before (re-)submitting your own.

  • Decompiling, deobfuscating, or disassembling the staff’s solutions to problems.

  • Failing to cite (as with comments) the origins of code, writing, or techniques that you discover outside of the course’s own lessons and integrate into your own work, even while respecting this policy’s other constraints.

  • Giving or showing to a classmate a solution to a problem when it is he or she, and not you, who is struggling to solve it.

  • Looking at another individual’s work during a quiz or test.

  • Paying or offering to pay an individual for work that you may submit as (part of) your own.

  • Providing or making available solutions to problems to individuals who might take this course in the future.

  • Searching for, soliciting, or viewing a quiz’s questions or answers prior to taking the quiz.

  • Searching for or soliciting outright solutions to problems online or elsewhere.

  • Splitting a problem’s workload with another individual and combining your work (unless explicitly authorized by the problem itself).

  • Submitting (after possibly modifying) the work of another individual beyond allowed snippets.

  • Submitting the same or similar work to this course that you have submitted or will submit to another.

  • Using resources during a quiz beyond those explicitly allowed in the quiz’s instructions.

  • Viewing another’s solution to a problem and basing your own solution on it.

Assessment

Your work on this problem set will be evaluated along four axes primarily.

Scope

To what extent does your code implement the features required by our specification?

Correctness

To what extent is your code consistent with our specifications and free of bugs?

Design

To what extent is your code written well (i.e., clearly, efficiently, elegantly, and/or logically)?

Style

To what extent is your code readable (i.e., commented and indented with variables aptly named)?

To obtain a passing grade in this course, all students must ordinarily submit all assigned problems unless granted an exception in writing by the instructor.

Getting Started

To start, open a terminal window and execute

update50

to make sure your workspace is up-to-date.

Next, navigate to your ~/workspace/chapter4 directory. For the first time in a while, no big code distro this time, but there is one important file to download. Execute

wget http://docs.cs50.net/2016/ap/problems/recover/recover.zip

This file’s somewhat large (about 16 MB), so it may require some time to download depending on the speed of your internet connection! Then unzip that ZIP file, delete it, and the navigate into your newly-created recover directory. If you list the contents of your directory, you should find two files therein:

camera.raw  recover.c

You’ll be doing all your work for this problem inside of this recover.c file. Open it up and…​ darn, it’s totally empty!

Genius Bar

Alright, now let’s put all your new skills to the test.

In anticipation of this problem, David spent several days snapping photos of people he knows, all of which were saved by his digital camera as JPEGs on a 1GB CompactFlash (CF) card. Unfortunately, he’s not very good with computers, and somehow deleted them all! Thankfully, in the computer world, "deleted" tends not to mean "deleted" so much as "forgotten." His computer insists that the CF card is now blank, but he’s pretty sure it’s lying to him.

Write in ~/workspace/chapter4/recover/recover.c a program that recovers these photos.

Ummm.

Adele

Okay, here’s the thing. Even though JPEGs are more complicated than BMPs, JPEGs have "signatures," patterns of bytes that can distinguish them from other file formats. Specifically, the first three bytes of JPEGs are

0xff 0xd8 0xff

from first byte to third byte, left to right. The fourth byte, meanwhile, is either 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7, 0xe8, 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, of 0xef. Put another way, the fourth byte’s first four bits are 1110.

Odds are, if you find this pattern of four bytes on a disk known to store photos (e.g., David’s CF card), they demark the start of a JPEG. (To be sure, you might encounter these patterns on some disk purely by chance, so data recovery isn’t an exact science.)

Fortunately, digital cameras tend to store photographs contiguously on CF cards, whereby each photo is stored immediately after the previously taken photo. Accordingly, the start of a JPEG usually demarks the end of another. However, digital cameras generally initialize CF cards with a FAT file system whose "block size" is 512 bytes (B). The implication is that these cameras only write to those cards in units of 512 B. A photo that’s 1 MB (i.e., 1,048,576 B) thus takes up 1048576 รท 512 = 2048 "blocks" on a CF card. But so does a photo that’s, say, one byte smaller (i.e., 1,048,575 B)! The wasted space on disk is called "slack space." Forensic investigators often look at slack space for remnants of suspicious data.

The implication of all these details is that you, the investigator, can probably write a program that iterates over a copy of David’s CF card, looking for JPEGs' signatures. Each time you find a signature, you can open a new file for writing and start filling that file with bytes from his CF card, closing that file only once you encounter another signature. Moreover, rather than read his CF card’s bytes one at a time, you can read 512 of them at a time into a buffer for efficiency’s sake. Thanks to FAT, you can trust that JPEGs' signatures will be "block-aligned." That is, you need only look for those signatures in a block’s first four bytes.

Realize, of course, that JPEGs can span contiguous blocks. Otherwise, no JPEG could be larger than 512 B. But the last byte of a JPEG might not fall at the very end of a block. Recall the possibility of slack space. But not to worry. Because this CF card was brand-new when he started snapping photos, odds are it’d been "zeroed" (i.e., filled with 0s) by the manufacturer, in which case any slack space will be filled with 0s. It’s okay if those trailing 0s end up in the JPEGs you recover; they should still be viewable.

David, of course, only has one CF card, so he’s gone ahead and created a "forensic image" of the card, storing its contents, byte after byte, in the file camera.raw. which lives inside of ~cs50/chapter4. So that you don’t waste time iterating over millions of 0s unnecessarily, he’s only imaged the first few megabytes of the CF card. But you should ultimately find that the image contains 50 JPEGs (but don’t hardcode that number in, lest we also need to use your program to recover a different set of images off of a different CF card). You can, then, open the file programmatically with fopen, as in the below.

FILE* file = fopen("camera.raw", "r");

But let’s make the program a little bit more flexible. recover should accept either zero command line arguments (in which case, you should open the hard-coded camera.raw file, as indicated above), or a single command line argument (in which case, you should open the user-provided file, which you may assume is formatted in exactly the same manner as camera.raw, albeit with different JPEGs).

When executed, though, your program should recover every one of the JPEGs from the file being examined, storing each as a separate file in your current working directory. Your program should number the files it outputs by naming each ###.jpg, where ### is three-digit decimal number from 000 on up. (Befriend sprintf.) You need not try to recover the JPEGs' original names. To check whether the JPEGs your program spit out are correct, simply double-click and take a look! If each photo appears intact, your operation was likely a success!

Odds are, though, the JPEGs that the first draft of your code spits out won’t be correct. (If you open them up and don’t see anything, they’re probably not correct!) Execute the command below to delete all JPEGs in your current working directory.

rm *.jpg

If you’d rather not be prompted to confirm each deletion, execute the command below instead.

rm -f *.jpg

Just remember to be careful with that -f switch, as it "forces" deletion without prompting you.

If you’d like to check the correctness of your program with check50, you may execute the below.

check50 1617.chapter4.recover recover.c

Lest it spoil your (forensic) fun, the staff’s solution to recover is not available.

As before, if you happen to use malloc, be sure to use free so as not to leak memory. Try using valgrind to check for any leaks!

Here’s Zamyla!

Sanity Checks

Before you consider this problem set done, best to ask yourself these questions and then go back and improve your code as needed! Do not consider the below an exhaustive list of expectations, though, just some helpful reminders. To be clear, consider the questions below rhetorical. No need to answer them in writing, since all of your answers should be "yes!"

  • Does recover accept either 0 or 1 command-line arguments? If 0, is recover opening camera.raw? If 1, is recover opening whatever file was passed in as the argument?

  • Does recover output 50 JPEGs when not passed in any command-line arguments? Are all 50 viewable?

  • Does recover name the JPEGs ###.jpg, where ### is a three-digit number from 000 through 049?

  • Are you sure that recover doesn’t have any memory leaks?

This was Recover.