The longer I develop software professionally, the more I start to look sideways at what I can only call irrational or semi-irrational behavior by other developers that I work with directly or indirectly. For example, people sticking to a dead, broken, or even just sub-optimal language/platform simply because of the IDE or tooling that they are comfortable with. “What? You mean I can program games for WebOS in COBOL now? SIGN ME UP!!!”
Today, I’d like to talk about a different behavior that seems irrational to me. There is a view that when you write software it should be designed around your database, simply because that is where you are storing your data. It sounds harmless, but the effects are quite negative over time.
The database is the only place you can store data right?
The first offense that always happens is your code starts to be structured the same as your database. Databases use a relational model, object oriented code uses an object-oriented model. These are not the same thing at all. Objects aren’t stored in tables, they don’t have foreign keys to other objects. They don’t have many to many relationships. In fact, we invent Object Relational Mappers specifically to try and make database programming feel more object oriented. There is a reason this is called the Vietnam of Computer Science.
After our code starts to look like the database – models named after database tables, fields named after database columns, objects having “relationships” instead of containing actual data, another thing happens. Your software design decisions will be structured around the database. You will start to look at problems through the lens of “how do I store this in the database?”, “how will we query this in the database?”, etc.
This leads to truly bizarre behavior like storing files in the database, instead of the filesystem. Now, when have you ever in your normal computing experience thought to yourself, “I just finished writing this document, I better dump it in MySQL so I can query it later.”? That would be insane, but for some reason, once you start putting your system data in a database, it starts to seem like a good idea.
Also, this kind of behavior rarely leads to the question of “should I put this data in the database?”. In many cases it is totally reasonable to ask that question, but in my experience that question is never asked simply because the norm is to put EVERYTHING in the database, no matter if it makes sense to or not.
The third problem I think becomes a recurring theme amongst highly database oriented systems is that system logic creeps its way into the database itself. For example, maybe your system uses stored procedures or triggers to run queries when you save an entity to update other things. In some cases, this is harmless certainly, but in many cases you are updating data summaries and other things throughout the database, which is to say you are doing business logic in the database.
So, I ask you, how do you unit test a database trigger or stored procedure? Also, why would you want part of your business logic in a database? Now your DBA if you have one owns part of the codebase, but without the accountability of things like source control and code review. Those stored procedures and triggers could change any time and you wouldn’t know until something broke. If you rely on them too much you might end up doing ridiculous things like Etsy’s insane stored procedure router.
Note that Etsy has ended up rewriting their system to rely less on stored procedures and is using an ORM. At least their logic should stay in code hopefully.
The fourth problem, and I would argue the biggest problem with databases is that people get locked into them with their code. Whole codebases are written to spit out SQL code, and if you want to switch to say MongoDB or CouchDB, you will end up rewriting your system to do it. Actually, you won’t, because management will make the reasonable decision that rewriting your system probably isn’t worth it just to switch databases. So, I hope you like your current database, because you’ll be stuck with it for the life of your system probably.
What is funny is when tech people talk about various cloud computing services, they really get flustered when you talk about vendor lock-in, but when it comes to their favorite database, they’ll sing its praises and gladly tie their code directly to it. They don’t worry about switching cost because their favorite database is perfect!!!!
So, lest you think I hate databases because of SQL or something, I would just like to say that all of these problems are equally true for both SQL and noSQL databases. A lot of people who use Mongo use a ORM like Mongoid (in Ruby) and are basically out of luck if they decide to switch to Postgres, Cassandra, MySQL, CouchDB, Neo4j, Riak, SQLServer, Oracle, HBase, DynamoDB, etc. There is no way around it, a huge chunk of the codebase would need to be rewritten. Thus, the likelyhood of using a different data source is pretty low simply because of how people think about databases.
The Filesystem Test
It really doesn’t have to be this way. One simple question can change your perception of what is a reasonable use of a database. I call it the filesystem test. Every time you are about to do something with your database, ask yourself this:
Would I do this on the filesystem?
That’s it. Too simple right?
Well, let’s walk through the problems outlined above.
1. Your code becomes more relational and less object oriented
2. You make database oriented design decisions
3. Your program logic creeps into the database
4. Your code is highly coupled to the database and the switching cost is too high
If you were saving your program data to the filesystem, would you use an ORM? Well, I don’t think so because the filesystem is not relational at all. It’s just files, so you would probably serialize your object data and save it to a file or a set of files. So far so good.
If you were saving your program data to the filesystem, would you design your objects and code around the filesystem? Again, I don’t think so because it’s just files. You might store each object as a file, or maybe collections as a single file. In either case, you are much more likely to keep your entities as just normal objects and just serialize them to files than you are to worry too much about the filesystem deeply influencing the design. Instead, you might just load all of your data into memory and worry about efficient ways to find things in lists and graphs and trees of data, using normal algorithms and data structures.
If you were saving your program data to the filesystem, would your program logic creep into the filesystem? Well, most filesystems aren’t terribly smart and don’t try and do too much beyond store and retrieve files, my guess is no. The only wacky scenario I can think of is maybe if a filesytem were to index the contents of files for something more like OS X’s Spotlight Search, you might be inclined to use that for faster data lookup. But, that is not a direction 80% of programmers would think to head.
If you were saving your program data to the filesystem, would it be too expensive to switch data systems? Well, by default you are going to be able to move between filesystems most of the time. You could easily throw your data on Amazon S3. Beyond that, since nobody in their right mind would write mind would write code that is as highly coupled to the filesystem as we do for the database, I think it would be much cheaper to swap data systems.
If you look at your database through the lens of the filesystem, I think it becomes clearer what the responsibilities of the database should be. Databases are great for storing and retrieving data, just like the filesystem. So, next time you are tempted to get too fancy with your database or are coupling your code to the db, ask yourself “Would I do this on the filesystem?”. If the answer is no, then you need to come up with a better solution.
P.S. If people wrote more code against the filesystem, the same behavior would likely emerge where everything is filesystem centric.