I’m not a data architect or DBA, but in my current and past positions I, like most software developers, have been responsible for designing schemas for both simple and complex databases. One thing I always waffle about is whether to compose keys from meaningful data or generate surrogate keys for my tables.
I have always been a proponent of using surrogate, or blind, keys. The reason is that I have faced scenarios where what is said to be an iron-clad rule about the business meaning associated with a primary key suddenly changes requiring the need to change that primary key – most DBs don’t handle this well. It’s also cumbersome constructing joins on natural keys when they are composed of several columns. With that said, I do see advantages to having real meaning attached to the fields used as a composite key.
What do you think? Is this a matter of preference of the database designer or is there some rule or advantage that I don’t know of that would be a definite answer one way or another?
Here are a couple of articles to refer to for arguments.
Always use blind keys as primary keys. Always. Never break this rule. Never use natural, meaningful or compound keys. This is an iron clad rule. Whatever advantages you see in using natural keys are outweighed massively by the disadvantages. These disadvantages only magnify as the application gets larger.
The advantages of using blind keys are as you mention. That they are immutable is the most important one. If your keys can change, you suddenly have to start worrying about every foreign key that points to them. If you ever archive your data, point to them from a third party system, cache them, etc. when you do change them, a great unhappiness will extend over your life.
No discussion of natural vs. surrogates is complete without Joe Celko, make sure to look him up to see his rants. Short answer: he, ah, *disagrees* with fletch’s comment.
My thought is that “data model” is actually an abbreviation for “data model of reality,” and that means that if your reality changes, your data model changes. Further, even if you have put in some non-reality concession to performance like having a single meaningless number that some vendor’s DB can use quick instead of a compound natural key that some vendor’s DB isn’t optimized for, your *application* is surely using the natural keys? So if some iron-clad business rule suddenly becomes soft, you have development anyway?
If you step back you’ll probably find that a surrogate key, while perhaps serving as the PK in your DB of choice, is pretty tightly coupled with a (perhaps compound) (perhaps enforced via unique constraint) natural key. Even in a data warehouse, the home of surrogate keys by design!
Having been a DBA for over 20 years and a developer for even longer, I have always been a proponent of data driven design. With that view in mind, a sound data model is paramount to effecient use of a relational database.
One of the first things one should ask when modeling is, what are my entities, what are the characteristics or attributes of those entities, and what attributes uniquely define a particular entity. If an attribute (data column) or group of attributes can uniquely define a single row, then you have a good “candidate” for a primary key. If no attibutes are good candidates, then a surrogate key is required.
That said, one must also look at how data is stored in a database. In particular consider indexing. Although databases can handle composite keys and can handle keys that are based on wide columns (for example varchar(255)), you will not get many index entries per page and hence may suffer worse I/O. I tend to add a surrogate key when the candidate key gets too wide (what is too wide? It depend on whether the data is variable).
So why not always use a surrogate key? Anyone who has ever tried to research a data problem with a database designed by a surrogate-happy developer, would know how flat-out annoying it is when every primary and foreign key is an auto-incremented number. More importantly though, consider how a relational database uses indexes.
A smart DBA and developer will design indexes and write queries that can retreive business data directly from the index and thus avoid having to access the data pages directly. Obviously, if every key is a meaningless number, then your queries will have to retrieve the data page containing the base row so you can display the real data a human wants to see.
So, in summary, should you use intelligent or surrogate keys? It depends. Sometimes you have to load a bunch of data, and optimize queries before you’ll settle on one or the other. A general rule of thumb – which follows good data modeling practices – is to look for candidate keys based on pieces of business data (i.e. intelligent keys)
I’m with Fletch – avoid the headaches, especially if you plan to stay and support what you’ve created. If you always move on to another new job, then most people care less. To which I say, shame on you – leave the next person a nice starting point, NOT a mess to clean up.