MongoDB: a love story
I’ll just say it right up front: NoSQL is a bit of a fad. As much as developers don’t like to consider themselves victims of fashion trends, there is an unspoken desire to be working in the vogue languages and technologies, and it’s not at all uncommon for developers working in these technologies to turn up their noses at what were probably perfectly serviceable choices just a few years ago (and still are today). Oh, you’re writing your software in Java? You’re using SVN still? Omigosh, why don’t you just use Git? Jeez, you’re using SQL? How are you ever going to scale with that? Rails developers had a reputation around here for being totally insufferable toward anyone using a non-Rails stack, and even they have fallen victim to obsolescence as node.js becomes the new kid on the block.
That’s the sort of attitude that caused me to approach Mongo with skepticism, and you should too. It’s not right for everyone or every data model. If you have a pretty good SQL-based skill set and you know how to tweak an SQL database for maximum performance, or you can easily model your data relationally to solve a problem, you may not stand to gain much of anything at all with Mongo.
I’m not an SQL whiz, though. The first DBMS I learned how to use was an old-school hierarchical database used for electronic medical records. Knowing the limitations and annoyances of that type of DBMS, I knew Mongo wouldn’t make life a cakewalk, but having now spent a few months with Mongo-backed applications, I really love it. I don’t have to be a DBA to work with a powerful, fast database, and knowing a few simple concepts I can do just about anything I need to in Mongo.
First off, let me explain the anatomy of a Mongo database. At the most general level, you have a database, which contains zero or more named collections. Each collection contains zero or more documents, each of which is stored in a binary form of JSON called BSON. Documents can hold all the data types you expect in JSON and can be nested arbitrarily.
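To make that anatomy concrete, here is a minimal sketch in Python that uses plain dicts and lists as stand-ins for a database, a collection, and a document. The names (`blog_db`, `posts`, `users`, and every field name) are invented for illustration, and nothing here talks to a real Mongo server or uses a Mongo driver.

```python
import json

# A "database" is a set of named collections; a "collection" is a list of documents.
blog_db = {
    "posts": [
        {
            "_id": 1,
            "title": "MongoDB: a love story",
            "tags": ["mongodb", "nosql"],            # arrays are fine
            "author": {"name": "Sam", "karma": 42},  # so are nested documents
            "published": True,
            "rating": None,
        }
    ],
    "users": [],  # a collection can hold zero documents
}

post = blog_db["posts"][0]

# Every value above is a JSON type, so the document survives a JSON round trip.
round_tripped = json.loads(json.dumps(post))
```

The round trip only shows that the document is valid JSON. BSON itself is a binary encoding and adds a few types JSON lacks, such as dates and object IDs.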
Although most of the documents in a collection tend to be pretty homogeneous, there are no hard and fast rules about the keys each document in a collection must contain. Mongo also has no rules about what data type a particular key should have. If you’re not familiar, this is referred to as schemaless in database parlance.
I think of schemaless databases as being roughly analogous to dynamically typed languages like Ruby. The flexibility that offers is incredible, especially if you want to quickly get a prototype out, or if you want to quickly start storing your data in a different way. Of course, those schemas were also keeping you from doing stupid things, like deleting a record that was still referenced by another record. Mongo will happily let you shoot yourself in the foot, leaving greater potential for errors at runtime (the worst time for errors). Using an ODM like Mongoid will go a long way toward saving you from yourself. You could implement callbacks that, on destroying a record, go and delete the records that point to it.
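Here is a rough sketch of both halves of that trade-off, again with plain Python lists standing in for collections. `destroy_post` is a hypothetical helper that plays the role of a destroy callback. It is not Mongoid’s API, just the same idea written by hand.

```python
# In-memory stand-ins for two collections (made-up data).
posts = [
    {"_id": 1, "title": "Hello"},
    {"_id": 2, "title": "World", "draft": True},  # extra key, nobody complains
]
comments = [
    {"_id": 10, "post_id": 1, "body": "First!"},
    {"_id": 11, "post_id": 1, "body": 12345},     # nothing enforces a type for "body"
    {"_id": 12, "post_id": 2, "body": "Nice"},
]


def destroy_post(post_id):
    """Delete a post and, like a destroy callback, the comments pointing at it."""
    posts[:] = [p for p in posts if p["_id"] != post_id]
    comments[:] = [c for c in comments if c["post_id"] != post_id]


destroy_post(1)
```

Without the second line of `destroy_post`, comments 10 and 11 would be left pointing at a post that no longer exists, and nothing in the database would object.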
If you learned databases in college, you probably learned primarily about relational databases. What you learned there has inevitably led you to reshape your data to fit a relational mold. You might be tempted to use some collections to join other collections, as you would with a join table. It’s a trap!
If you’re designing your schemas like this, are you doing it out of habit? If so, stop! If you’re designing the schema that way because you know you’re going to need to report on the data in a lot of different ways later, there’s a decent chance you might be better off with an SQL database.
So, always embed everything, right? Well, not so fast. Although you want to ultimately keep yourself from extra queries just for handling data with relationships, always embedding everything can have bad side effects, too.
Say, for instance, you’re making a database schema for the proverbial blog that has posts and comments on the posts. If you embedded everything, you would have a collection of posts, and each post would contain zero or more comments. It makes opening a post a snap. Just one query and you’ve got all the data you need! However, there’s a catch. Mongo documents are limited to 16 megabytes. If your blog gets on Reddit, you could hit that limit. Or, if you ever wanted users to see a list of all the comments they’ve written, you’d have to cycle through every post and pick out just the comments you want. That’s expensive and it’s more work for you as a developer.
You could instead embed an array of IDs of comment records kept in a comments collection. Or keep them in the same collection; after all, Mongo doesn’t care! Or you could just use a relation: each post has many comments, and each comment belongs to a single post and contains the object ID of that post.
The more I use MongoDB for production stuff, the more I’m learning to have many classes in my Rails apps (which map to collections), give each class a single role that it can do well, and define the relationships between them. Yes, that sounds an awful lot like a relational database, but it’s not quite that way. Instead it closely couples my model classes with the way they’re actually modeled in the database, and I think that’s great. But it is important to use moderation in how you do that, because you don’t want to have to send the database two dozen queries just to get the information for a single page load.
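The trade-off between the two layouts is easier to see side by side. This is only a sketch: the lists below are in-memory stand-ins for collections, and the function names and sample data are made up for the example.

```python
# Option 1: comments embedded in each post.
embedded_posts = [
    {"_id": 1, "title": "A", "comments": [{"user": "ann", "body": "hi"},
                                          {"user": "bob", "body": "yo"}]},
    {"_id": 2, "title": "B", "comments": [{"user": "ann", "body": "again"}]},
]


def comments_by_user_embedded(user):
    # Has to walk every post and filter the comments inside it.
    return [c["body"] for p in embedded_posts
            for c in p["comments"] if c["user"] == user]


# Option 2: comments in their own collection, each holding its post's ID.
posts = [{"_id": 1, "title": "A"}, {"_id": 2, "title": "B"}]
comments = [
    {"_id": 10, "post_id": 1, "user": "ann", "body": "hi"},
    {"_id": 11, "post_id": 1, "user": "bob", "body": "yo"},
    {"_id": 12, "post_id": 2, "user": "ann", "body": "again"},
]


def comments_by_user_referenced(user):
    # One pass over a single collection.
    return [c["body"] for c in comments if c["user"] == user]


def load_post_referenced(post_id):
    # The price: opening a post now takes two lookups instead of one.
    post = next(p for p in posts if p["_id"] == post_id)
    return post, [c for c in comments if c["post_id"] == post_id]
```

Both layouts answer both questions. What changes is which question is cheap: embedding favors opening a post, while referencing favors listing comments across posts.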
If you’re using Mongo as the primary DB for your web app, you’re inevitably going to end up needing relations, so don’t shy away from them just because you’re not using a relational DB. Just be smart about them!
Another concept that’s going to feel like a sin when you start doing it in a Mongo database is denormalization. Let’s say you’ve got a collection of posts by different users. What would you store to identify the user? In a relational DB you’d store a foreign key to a Users table. Then, when you got the blog post info, you’d look up that record in the users table and get the user’s name and such. In the Mongo world that’s a huge waste of time, and you just store the name and any other user info you need right there in the post. You should also include a reference to the user record so the post still points unambiguously at its author (because you never know when you might be running some data conversion that would benefit from just having an ID), but the info you need to display the post is stored right in the post.
“Omg,” you say, “but what if that user changes their name someday? What will we do then?”
The people who designed Mongo presumably thought long and hard about this, and decided that although it’s nice and elegant to only have to change a name in one place when a name is changed, just looking at a post happens many orders of magnitude more often than name changes. Therefore, it makes no sense whatsoever to add an unnecessary query every time just to accommodate this. What if a user changes their name? No big deal, just run a batch job that will find documents referencing the document with the name change, and update the fields. It’s expensive, but it won’t happen often. If you’re dealing with the kind of data where that does happen often, maybe SQL is for you. It’s better optimized to jump through hoops to do that sort of crap.
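That batch job can be sketched in a few lines. As before, this is an illustration using in-memory lists rather than a real Mongo collection, and `rename_user`, `author_id`, and `author_name` are names invented for the example.

```python
users = [{"_id": 7, "name": "Ann Smith"}]
posts = [
    # Each post carries the author's ID plus a denormalized copy of the name.
    {"_id": 1, "title": "A", "author_id": 7, "author_name": "Ann Smith"},
    {"_id": 2, "title": "B", "author_id": 8, "author_name": "Bob Jones"},
    {"_id": 3, "title": "C", "author_id": 7, "author_name": "Ann Smith"},
]


def rename_user(user_id, new_name):
    """Update the user record, then fix every denormalized copy of the name."""
    for user in users:
        if user["_id"] == user_id:
            user["name"] = new_name
    updated = 0
    for post in posts:
        if post["author_id"] == user_id:
            post["author_name"] = new_name
            updated += 1
    return updated  # number of posts that had to be rewritten


touched = rename_user(7, "Ann Lee")
```

Displaying a post never touches `users` at all. The rename is the only operation that pays, and it pays once per post the user ever wrote, which is why the stored ID is worth keeping.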
Mongo is also, as they say, web scale. Because it lacks things like joins, which are difficult to scale across many shards, it’s not at all complex to scale your Mongo database across many nodes.
But it’s not the web-scale-ness or the general performance that makes me so enamored with Mongo. The beautiful thing about Mongo is that it’s such a simple database. Relational databases are a very unnatural way to think about structuring your data, and their advantages only show up in some scenarios. With Mongo, you can store the data the way you think about it. Got a record that’s going to contain an array of values as one of its properties? In Mongo that’s trivial. In SQL, not so much (yeah, yeah, Postgres can do it). Want to include some key-value pairs in your document? Easy. Oh, you want to embed a document in your document? Well, yo dawg, you’re in luck, because Mongo lets you do that.
(I couldn’t resist)
Mongo lacks a lot of the complexities inherent in the RDBMSes that have been the standard for the past couple of decades. Of course, sometimes the complexities are desirable. Transaction support might be a complexity, but if, for instance, you’re Evernote, and saving a note involves a complex series of all-or-nothing changes to your account, you want transaction support (after all, it would suck if a failed note upload still counted against your upload quota).
This lack of all the bells and whistles leaves us right now with a very simple database, a sort of iPhone-like re-imagining of what a database can be and how simple it can be. Some features are going to get added back in, in a re-imagined way (I’m pretty excited about the aggregation framework myself) but many won’t.
Mongo has some caveats you should be aware of:
- Mongo is a memory-mapped database and it stores its data with the wishful thinking that it’s always completely in RAM. You can’t tell Mongo what to keep in RAM (like, say, just your indexes) if your data set gets bigger than how much RAM you have; you just have to rely on the OS to be smart about caching. SQL databases are a little wiser about this.
- There’s a global write lock that will probably get you bummed out if you do a lot of writes.
- The database doesn’t have built-in support for transactions that span multiple documents. If you want to do a set of actions on a database that should be performed in an all-or-nothing way, you’ll need to implement your own logic to undo the previous changes if some later step should fail.
- If you use long, descriptive keys in your database, Mongo doesn’t do anything to save space. If you have a million records with the key “somethingreallylongthatIshouldhaveshortened” you’re going to have a million instances of that string in your DB. There’s talk that Mongo might improve upon this, and Mongoid lets you name fields so that keys are stored as something smaller in the DB.
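To give a feel for the do-it-yourself undo logic mentioned in the transaction caveat, here is a minimal sketch built around the note-saving scenario from earlier. Everything in it is hypothetical: `accounts`, `notes`, `save_note`, and the `fail_upload` flag are made up to simulate a failure, and plain Python structures stand in for the database.

```python
accounts = {"ann": {"quota_used": 0}}
notes = []


def save_note(user, body, size, fail_upload=False):
    """Apply two changes as all-or-nothing by undoing the first if the second fails."""
    account = accounts[user]
    account["quota_used"] += size          # step 1: charge the quota
    try:
        if fail_upload:                    # step 2: store the note (may fail)
            raise IOError("upload failed")
        notes.append({"user": user, "body": body})
    except IOError:
        account["quota_used"] -= size      # compensate: undo step 1
        return False
    return True


ok_first = save_note("ann", "draft", 10, fail_upload=True)
ok_second = save_note("ann", "final", 10)
```

This only covers a failure your code gets a chance to handle. If the process dies between step 1 and the compensation, the quota stays wrong, which is exactly the gap real transaction support closes.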