If you've ever wanted to know what a database is, how a good one works, and how it hooks up with Rails, you are a nerd. And you've come to the right place. I gave a beginner talk about the subject at our recent hackfest and thought a blog post might be helpful for any beginners out there struggling with these concepts. Let's dive in.
If you've ever used a spreadsheet, you've used a database. Consider this table that tracks a bookstore's inventory:
A database is simply an organized way to store data. If this were a database for a web app, each row would get added by a user entering some information, maybe through a web form. So when a new installment of Harry Potter arrives, we'd enter the information for that book.
Let's see what happens when we start tracking more detailed information for each book:
Because we're storing publisher details on the same row as the book itself, the user can enter slightly different information each time. We now have a data consistency problem. Is Scholastic's correct city "NYC" or "New York?" Is the author of Harry Potter spelled "Rowling" or "Rolling?" There's no way to know for sure by looking at this data. From a practical standpoint, it's now impossible to see all books by Scholastic, since that company name could be stored in dozens of different ways.
What we need is database normalization, a process that eliminates redundancy and organizes data efficiently. In our non-normalized spreadsheet above, we're storing J.K. Rowling's birth place in every single book record. This leads to inconsistent data because if the information is updated in only one of two locations, we can't be sure where the correct version is stored.
If we have our books table laid out like this, how do we normalize it so that data is only stored in one place?
The key to avoiding duplicate information across tables is to make sure each table only describes one thing. If the books table includes detailed information about an author, we're storing that author's information redundantly, in every row where his or her books appear. We can fix this by breaking out the table into two:
We now have the tables cleanly separated, with no information duplicated in any cell. Now we need to somehow relate each Harry Potter row in the books table to J.K. Rowling in the authors table.
Before we do this, we need a way identify each row in the books table. We can't do this with the book's title, for example, because many different books may have the same title. Same goes for the year a book was published.
The solution is to add an id number to each row. This is called a primary key, an attribute that is guaranteed to be unique among other records in the table. We use this new id column as a primary key since other attributes (title, year published) might be the same in two rows.
Now, we can add an author id column to the books table. Author id points to a row in the authors table, so now we can make any updates to the author J.K. Rowling in once place.
The author information is stored in only one place, which is a requirement of a normalized database.
This structure works for all types of relationships:
Sometimes, though, it's not enough to add an author_id to a table. Consider this example, where we want to store what language a book is written in, as well as which languages an author uses:
Normalization means there can be no duplicate rows and every cell must contain exactly one value. The authors table is invalid because we're storing more than one value in a cell (English AND French, for example). Furthermore, we're tying the spelling of a language to each row in the books and authors tables. If someone misspelled a language , he'd need to update tons of rows to make the correction. We should only be storing information like that in one place.
A good first step is to break out languages into its own table:
The question now is how to relate them all. We can't add language id to the authors table since an author may speak more than one language (and just like we can't have two first name fields for a record, we can't have two language id fields). Same goes for the books table.
Since there's no good way to store a language_id on a book or author record directly, we use an intermediary table, called a join table or a junction table.
Whereas most tables describe an object (like a book or an author), join tables describe a relationship. Each record contains id fields from two tables, thereby acting as a junction between the two. This way, we don't have to pollute our books or authors table with another foreign key, and we can store the relationships non-redundantly.
And that's it! You now have everything you need to know to design a database. Next week we'll hook these tables up to Ruby classes so we can start using them with Rails.
UPDATE: Part II is available here.