Tuesday 13 December 2016

Notes on database versioning

This post contains links to external resources that discuss database versioning or related concepts. I've also added a description for some of the links (which may not be a summary, but contains relevant points discussed in the resource).

By database versioning, I am referring to versioning of data stored in the database, and not schema versioning or migration. An application of database versioning is audit logging.

Basics


A good starting point is this Wikipedia article: Slowly Changing Dimensions (SCD)

Various techniques (Type 0, Type 2, Type 6..) of versioning (referred to as SCD management methodologies) are specified. They are discussed against a particular example, but may be applied to other use cases (such as audit logging) as well.

… For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular sales person had been assigned to a particular regional office at an earlier date, whereas that sales person is now assigned to a different regional office.

Log Trigger is a technique that deals with SCD. The MS-SQL server example is most concise, describing an implementation using database triggers. (One can use functions (in Postgres) or equivalent routines for better organization and structure within triggers).

If the use case doesn't require a start_date and an end_date (audit logging). One can do away with the end_date, and rename start_date to created_at. To get the latest version, one would look at the row in the history table with the largest value of created_at.

When it comes to foreign keys, one may or may not want the referential integrity in the history table as in the base table.

Foreign keys may be kept if the referenced table rows are not deleted, or use ON DELETE SET NULL instead of  ON DELETE CASCADE (for example) to not lose the entire row. The choice would depend on your use case, and when the use case is audit logging, neither method provides an accurate audit log.

If the referenced table also makes use of a history table to store versions, one can use Foreign Keys to the history table (instead of Foreign Keys to the base reference table) for better comprehensiveness.

Better History Tables


This post from database-programmer suggests copying all values from the base table into the history table is a naive approach. It recommends storing only the fields that you care about, and storing "deltas" in cases where the value is numeric. The advantage of storing deltas is highlighted by a specific query (computing balance at a point in time) for a specific use-case (ordering system).

Security of history tables can be dealt with by restricting operations (disallow UPDATES, limit INSERTS to only database triggers, and allow DELETES only by privileged users for purging the tables to reduce size on disk).

While it makes sense to only save columns you need, storing deltas will probably depend on your use case. Good points on security and access control being implemented at the lowest level (the database) rather than application code.

Branching Model


Another technique I came across was self-referencing history or version tables that can be used to maintain multiple versions (branches) of a record. Useful when you need to keep track of the source or hierarchy of changes, or need to store incremental changes (or work on such changes individually).

Vendor-specific solutions


Change Data Capture (CDC) in SQL Server and Oracle.



There are several implementations, and several more combinations of various implementations that one can go with. Whatever the choice, one needs to maintain the versioning implementation, be it in code or in the database (as triggers or stored procedures).

Schema changes may require changes to the history tables. If schema changes frequently or involves major structural updates, making relevant changes to the history tables incurs a proportionately higher overhead (which might not be worth it).

Tuesday 5 July 2016

unittest quirks - Python 3.5 vs 3.4

This is something I discovered when a CI server wouldn't find any of the tests in a Django project.

Consider the following directory structure. (app here is a Django app).
.
├── app
│   ├── __init__.py
│   ├── admin.py
│   ├── migrations
│   │   └── __init__.py
│   ├── models.py
│   ├── tests
│   │   ├── __init__.py
│   │   └── stuff.py
│   └── views.py
├── manage.py
stuff.py looks like this.
from django.test import TestCase


class SimpleTest(TestCase):
    def test_thing(self):
        self.assertEqual('foo', 'foo')
And here's __init__.py
from .stuff import SimpleTest  # noqa
By default, a Django app has an empty tests.py.
Replacing it with a module, and importing all tests in tests/__init__.py should work, right?
Creating test database for alias 'default'...

----------------------------------------------------------------------
Ran 1 tests in 0.000s

OK
Destroying test database for alias 'default'...
That's using Python 3.5.

What happens when you're using 3.4?
Creating test database for alias 'default'...

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK
Destroying test database for alias 'default'...
Wait, what?

The same version of Django is used in both cases, on the same codebase. What happened here?

Let's take a look at the Python 3.5 changelog
unittest
The TestLoader.loadTestsFromModule() method now accepts a keyword-only argument pattern which is passed toload_tests as the third argument. Found packages are now checked for load_tests regardless of whether their path matches pattern, because it is impossible for a package name to match the default pattern. (Contributed by Robert Collins and Barry A. Warsaw in issue 16662.)

Since the pattern Django uses is “test*.py”, a module wouldn't match, and the tests won't be found.

This can be made to work in 3.4 by changing stuff.py inside tests/ to test_stuff.py, (and removing the import in __init__.py, which isn't needed).

Lessons learnt:
  • Use the same version of Python in dev and production. Even minor versions matter.
  • Test against multiple versions of Python (using something like tox).

Friday 1 April 2016

Better Django choices

A typical Django model looks something like this:

from django.db import models


class MyModel(models.Model):
    STATUS_CHOICES = (
        (1, "Active"),
        (2, "In-Progress"),
        (-1, "Inactive"),
    )
    status = models.SmallIntegerField(choices=STATUS_CHOICES, default=1)

    def __str__(self):
        return '{}'.format(self.pk)

To filter on "status", one would do this

MyModel.objects.filter(status=1)

Not only is hardcoding non-informative in this case, anyone reading or writing a filter has to refer to the Model every time. In the two fairly large Django codebases that I have worked on, there was a mapping system that would let you do this.

MyModel.objects.filter(status=STATUS_MAP["Active"])

This gives better context to someone reading the code. Authors, on the other hand, still need to hardcode a string.

"There must be a better way"

Sure there is, a quick search will lead you to this.

One can pip install django-choices and call it a day. To me, this problem seems too trivial to require a dependency.

More searching will lead you to this. What I suggest is a mix of both, with some restructuring.

Create a directory named "choices" in the app's folder. For each model in your app, add a file in this directory (for "MyModel", you can add "mymodel.py").

For each field with choices, you define a class (with the same name capitalised).

class Status:
    ACTIVE = 1
    IN_PROGRESS = 2
    INACTIVE = -1

    choices = (
        (ACTIVE, "Active"),
        (IN_PROGRESS, "In-Progress"),
        (INACTIVE, "Inactive"),
    )

MyModel will be updated likewise. An extra property of the form <model_name><capitalised_field_name> is added. You can, and probably should, experiment with the naming conventions here.


from django.db import models

from .choices import mymodel


class MyModel(models.Model):
    MyModelStatus = mymodel.Status
    status = models.SmallIntegerField(choices=MyModelStatus.choices,
                                      default=MyModelStatus.ACTIVE)

    def __str__(self):
        return '{}'.format(self.pk)

Filtering will be thus:

MyModel.objects.filter(status=MyModel.MyModelStatus.ACTIVE)

There's several enhancements you can make to the "Status" class. Such as creating a base that auto-defines "choices", gives default values, etc. This does the job without adding much overhead. It is slightly verbose, but I'd take that any day for improved readability. Authors will also benefit from autocomplete in supporting editors and IDEs.

For those who might consider refactoring an existing codebase, you can run "makemigrations" to test it out. If you did everything correctly, no changes will be detected when you run "python manage.py makemigrations".

Making changes to production systems however, based on the latest blog posts that you read, is a risk that is yours to take :)

Wednesday 30 March 2016

Stateful Modules in Python

TIL that random.seed persists across imports, aka modules with state. (aka unintentional MonkeyPatching?)

Let's say we have "moduleA.py" (WARNING:  Do not rely on "random" for cryptographic purposes, use "os.urandom" instead)

import random


def get_random_numbers():
    s = random.sample(range(1000), 10)
    return s

And another file named "main.py"

from moduleA import get_random_numbers

import random


if __name__ == '__main__':
    for i in range(3):
        random.seed(1234)
        # do things with random
        
        # code from another module that uses random
        print(get_random_numbers())

The side effect may be obvious here (prints the same random numbers on each iteration).

Perhaps .seed should be treated as a runtime context? (instead of associating state with the module)

with random.seed(1337):
    _ = random.random()

To avoid this, one can seed random with os.urandom before each use.

import random
import os


def get_random_numbers():
    random.seed(os.urandom(256))
    s = random.sample(range(1000), 10)
    return s