Tuesday 13 December 2016

Notes on database versioning

This post contains links to external resources that discuss database versioning or related concepts. I've also added a description for some of the links (which may not be a summary, but contains relevant points discussed in the resource).

By database versioning, I am referring to versioning of data stored in the database, and not schema versioning or migration. An application of database versioning is audit logging.

Basics


A good starting point is this Wikipedia article: Slowly Changing Dimensions (SCD)

Various techniques (Type 0, Type 2, Type 6..) of versioning (referred to as SCD management methodologies) are specified. They are discussed against a particular example, but may be applied to other use cases (such as audit logging) as well.

… For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular sales person had been assigned to a particular regional office at an earlier date, whereas that sales person is now assigned to a different regional office.

Log Trigger is a technique that deals with SCD. The MS-SQL server example is most concise, describing an implementation using database triggers. (One can use functions (in Postgres) or equivalent routines for better organization and structure within triggers).

If the use case doesn't require a start_date and an end_date (audit logging). One can do away with the end_date, and rename start_date to created_at. To get the latest version, one would look at the row in the history table with the largest value of created_at.

When it comes to foreign keys, one may or may not want the referential integrity in the history table as in the base table.

Foreign keys may be kept if the referenced table rows are not deleted, or use ON DELETE SET NULL instead of  ON DELETE CASCADE (for example) to not lose the entire row. The choice would depend on your use case, and when the use case is audit logging, neither method provides an accurate audit log.

If the referenced table also makes use of a history table to store versions, one can use Foreign Keys to the history table (instead of Foreign Keys to the base reference table) for better comprehensiveness.

Better History Tables


This post from database-programmer suggests copying all values from the base table into the history table is a naive approach. It recommends storing only the fields that you care about, and storing "deltas" in cases where the value is numeric. The advantage of storing deltas is highlighted by a specific query (computing balance at a point in time) for a specific use-case (ordering system).

Security of history tables can be dealt with by restricting operations (disallow UPDATES, limit INSERTS to only database triggers, and allow DELETES only by privileged users for purging the tables to reduce size on disk).

While it makes sense to only save columns you need, storing deltas will probably depend on your use case. Good points on security and access control being implemented at the lowest level (the database) rather than application code.

Branching Model


Another technique I came across was self-referencing history or version tables that can be used to maintain multiple versions (branches) of a record. Useful when you need to keep track of the source or hierarchy of changes, or need to store incremental changes (or work on such changes individually).

Vendor-specific solutions


Change Data Capture (CDC) in SQL Server and Oracle.



There are several implementations, and several more combinations of various implementations that one can go with. Whatever the choice, one needs to maintain the versioning implementation, be it in code or in the database (as triggers or stored procedures).

Schema changes may require changes to the history tables. If schema changes frequently or involves major structural updates, making relevant changes to the history tables incurs a proportionately higher overhead (which might not be worth it).