Database Design
Last Time
-
We talked about cloud computing, or just servers and computing resources we rent from someone else over the internet. There are benefits to scale, where many customers can share the high cost of a data center and staff to maintain some amount of hardware.
-
There are also various levels of service:
-
IAAS, or infrastructure as a service, provides the virtual hardware for engineers to use
-
PAAS, or platform as a service, provides more abstraction over the layer of hardware, such that an engineer can run an application or server without knowing about networking, load balancing, or other details that IAAS exposes
-
SAAS, or software as a service, is any application that’s built on top of that, such as Gmail or Facebook.
-
-
We’ve also taken a look at mid-quarter feedback, and we’ll try to adjust pacing and topics as we can (though we can’t make everyone happy!)
Databases
Concepts
-
We’ve almost certainly all used Microsoft Excel before, a program used to store data in rows and columns.
-
A sheet of rows and columns is a table, and a collection of tables is a database.
-
With many rows, Excel might become very slow or even crash. It’s trying to load everything in the file into memory, and a particularly large file would use up too much and cause problems.
-
A relational database program allows for the storage, updating, and accessing of much more data efficiently.
-
SQL, Structured Query Language, is a limited programming language that is used to interact with databases.
-
CRUD is an acronym for the primary operations with a database:
-
create
-
read
-
update
-
delete
-
-
In SQL, those commands are
INSERT
,SELECT
,UPDATE
, andDELETE
.
Tables
-
We might have a spreadsheet with a table that looks like the following:
-
We have various tables that might represent someone’s music collection, and this table has various rows for albums.
-
-
We can import this spreadsheet into a program we already have, called SQLite (as well as a program that provides a GUI for it), and run commands like
SELECT Title FROM Album;
, which will print out all the titles from ourAlbum
table. -
We can even run something like
SELECT Artist.Name, Album.Title FROM Album JOIN Artist ON Album.ArtistID = Artist.ArtistID;
. Now, even though the artist’s name for each album is not a column in ourAlbum
table, we somehow were able to link together the tables and print out the correct artist’s name: -
We can also run a query that counts how many rows are selected, that match some
WHERE
condition:
Types
-
So to get to this point where we have a database that we can query, we need to design the tables. Each column has a fixed type, for storing data efficiently and in the correct format:
-
CHAR
,VARCHAR
-
Stores characters, or strings.
CHAR
is a fixed number of characters, whereasVARCHAR
sets some upper bound for a variable number of characters.
-
-
SMALLINT
,INTEGER
,BIGINT
-
These types store numbers with different numbers of bits, so the largest number in each increase, but the amount of space used is also higher.
-
-
FLOAT
,DOUBLE PRECISION
,DECIMAL
-
These types store real numbers with various numbers of bits.
DECIMAL
in particular prevents floating point imprecision, by specifying the number of digits before and after the decimal point.
-
-
DATE
,TIME
,TIMESTAMP
-
…
-
-
To design a table, we’ll need to name each of the columns, and specify their types.
Customers
-
If we wanted to build a database of customers, we might want to design something like this:
-
Id
, used to identify customers, might be aBIGINT
, in case we need more than 4 billion rows for customers. -
Name
might be aVARCHAR
, and we might limit the upper bound to 64 characters, but some people might have even longer names that don’t fit. If we usedCHAR
, we would be using a fixed number of bytes for each name, when we don’t need to, but we might get higher performance for searching the column, since the database can use a data structure like an array. -
Email
can beVARCHAR
too. -
And
telephone
might be aVARCHAR
with a smaller length, but it shouldn’t be a number type since a leading 0 would automatically be lost. -
What they purchased
could be aCHAR
if our products have a fixed identifier. -
Last date of purchase
can be aTIMESTAMP
. -
Age
is better stored as aDATE
representing the birthday of the customer, so our data won’t be out of date. -
For
Gender
, we could use a type not mentioned before,ENUM
, which is an enumerated, fixed list of choices that we can choose from for each value. -
Geographic location
could be latitude or longitude, or an enumerated list of states or countries, or aCHAR
with 5 digits for a US zip code. -
Purchase price
would beDECIMAL
, with 4 digits to the left and 2 digits to the right of the decimal.
-
-
Now, if the same customer made another purchase, we would have duplicate data:
-
So we might want a
customers
table, and apurchases
table, with the rows there linking back to some row in thecustomers
table. -
For
Geographic location
, too, we might want acities
table with additional information about a zip code. -
Returning to our music table, we might need to add multiple artists to an album. We could have an additional column for
ArtistId1
orArtistId2
, or we could have an additional table mapping an album’s ID to any number of artist IDs.
Indexing
-
We can also tell our database ahead of time to index our columns, which means that it will build a data structure, like a tree, as we add rows, so we can quickly search for values in the future. There are various types of attributes we can apply to our columns:
-
PRIMARY
indicates that this column will be unique and used to identify each row. Conventionally, this is the ID of the row in a table. -
UNIQUE
is a constraint that this column will have only unique values among rows. A phone number or email might be in this category. -
INDEX
allows for any column to be indexed, for quicker searches. -
FULLTEXT
indexes the strings in the column, so we can search for keywords within those strings.
-
-
The tradeoff to keep these indices are that inserts and updates will require these indices to be updated as well. Storage space will also be a consideration if our database becomes very large.
More Concepts
-
Another concept that a database offers is atomicity, where two or more operations can occur together or not at all.
-
One example is having no milk in the fridge, and two roommates each coming home, seeing the state of the fridge, and going out separately to purchase additional milk.
-
With a database lock, for example, your program can temporarily prevent another program from updating some table or rows while you’re reading and changing that data.
-
Another example is with ATM transactions, two people might use two cards linked to the same account, at two different ATMs at the exact same time. Then, each machine might check the balance at the same time, and allow money to be withdrawn before the balance is updated by the other machine.
-
On a login page, too, if the web server’s code was not written properly, someone could maliciously access or change data:
-
This is called a SQL injection attack, whereby the quotes and equal signs could change a command being sent to a database.
-
The original command might look like:
db.execute(f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'")
. But if the password isn’t substituted correctly, the command might look likedb.execute("SELECT * FROM users WHERE username = 'me@examplemailprovider.com' AND password = '' OR '1' = '1'")
. TheOR
will select everything from theusers
table! Fortunately, we can sanitize, or escape, characters like these single quotes that change the meaning of commands. -
Our code might escape characters such that the final command looks like:
db.execute("SELECT * FROM users WHERE username = 'me@examplemailprovider.com' AND password = '\' OR \'1\' = \'1'")
. Then, our database will interpret the quote as a character, not as ending the phrase in the command.
-
-
Remember that last week, we talked about scaling our web servers. And now we might need to scale our databases, to have more capacity. Sharding is a way to divide our data between servers. For example, customers with names between A-M might be on the first server, with customers between N-Z on the second. But now we need multiple copies of each server, so we don’t lose any pieces of data. We might have multiple primary databases which sync to each other, and also databases that are read-only, which will make syncing faster as well as provide additional capacity for reads.
-
http://sqlitebrowser.org/ is what we use as a GUI to work with a local database file with SQL.
-
We’ll try out a few more SQL commands with our assignment for this week!