Saturday, 12 August 2017

Django tips #3 - Subquery expressions

This post describes a common quirk observed when performing joins against multiple tables in an SQL query, and a solution to it using Subquery expressions, a new feature of the Django ORM.

A common problem

Consider the following models in Django.

from django.db import models

class Account(models.Model):
    name = models.CharField(max_length=32, unique=True)

class Credit(models.Model):
    account = models.ForeignKey(Account)
    amount = models.DecimalField(max_digits=12, decimal_places=2)

class Debit(models.Model):
    account = models.ForeignKey(Account)
    amount = models.DecimalField(max_digits=12, decimal_places=2)

Let's add some accounts, and also a few credit and debit entries against them. This is what the tables look like now:

# Accounts
|   id | name   |
|    1 | FOX    |
|    2 | SNAKE  |
|    3 | DOG    |

# Credits
| account_name   |   credit_amount |
| FOX            |           100.0 |
| SNAKE          |            50.0 |
| SNAKE          |            20.0 |
| DOG            |           300.0 |

# Debits
| account_name   |   debit_amount |
| FOX            |           40.0 |
| SNAKE          |           30.0 |
| DOG            |           12.0 |
| DOG            |           23.0 |

Now, we want to answer a simple question: "What is the current balance of each account?". The balance of an account is defined as the sum of all credits minus the sum of all debits.

Seems pretty straightforward, right?

This is how one would do it using the Django ORM (a simple group-by and aggregate, which I described in one of my previous posts):

from django.db.models import F, Sum

Account.objects.annotate(
    credit_sum=Sum('credit__amount'),
    debit_sum=Sum('debit__amount'),
).annotate(
    balance=F('credit_sum') - F('debit_sum')
).values_list('name', 'balance')

[('FOX', Decimal('60.00')),
 ('SNAKE', Decimal('10.00')),
 ('DOG', Decimal('565.00'))]

The balance for FOX looks okay (100 - 40). But what's happening with SNAKE and DOG? Their balances aren't correct.
Printing the individual sums of credits and debits might help us figure out what's happening.

| name   |   credit_sum |   debit_sum |   balance |
| FOX    |        100.0 |        40.0 |      60.0 |
| SNAKE  |         70.0 |        60.0 |      10.0 |
| DOG    |        600.0 |        35.0 |     565.0 |

The sum of debits is double what it should be for SNAKE, and the sum of credits is twice what it should be for DOG.
The "twice" comes from the fact that there are two entries of credit for SNAKE, and two entries of debit for DOG.

An intermediate representation might help explain this better. This is the data on which the aggregates (SUM) are applied.

| name   |   credit_id |   credit_amount |   debit_id |   debit_amount |
| FOX    |           1 |           100.0 |          1 |           40.0 |
| SNAKE  |           4 |            20.0 |          2 |           30.0 |
| SNAKE  |           3 |            50.0 |          2 |           30.0 |
| DOG    |           5 |           300.0 |          3 |           12.0 |
| DOG    |           5 |           300.0 |          4 |           23.0 |

The INNER JOIN is applied before the aggregate (SUM in this case). We need it the other way round.

The solution

In Postgres, this can be done using a WITH query (Common Table Expression):

WITH credits AS (
    SELECT account_id, sum(amount) AS credit_sum
        FROM core_credit
    GROUP BY account_id
), debits AS (
    SELECT account_id, sum(amount) AS debit_sum
        FROM core_debit
    GROUP BY account_id
)
SELECT a.name, credit_sum - debit_sum AS balance
    FROM core_account a
        INNER JOIN credits c
            ON c.account_id = a.id
        INNER JOIN debits d
            ON d.account_id = a.id;

Which gives us the correct balance of each account:

| name   |   balance |
| FOX    |      60.0 |
| SNAKE  |      40.0 |
| DOG    |     265.0 |

Common Table Expressions don't seem to be supported in Django, but subqueries, which can be used to achieve the same output, were recently added in Django 1.11.

This is how we can use Subquery expressions to get the same result:

from django.db.models.expressions import Subquery, OuterRef
from django.db.models import F, Sum

from core.models import Account, Credit, Debit

credits = Credit.objects.filter(
    account=OuterRef('pk')
).values('account').annotate(
    total=Sum('amount')
).values('total')

debits = Debit.objects.filter(
    account=OuterRef('pk')
).values('account').annotate(
    total=Sum('amount')
).values('total')

Account.objects.annotate(
    credit_sum=Subquery(credits),
    debit_sum=Subquery(debits),
).annotate(
    balance=F('credit_sum') - F('debit_sum')
).values_list('name', 'balance')

[('FOX', Decimal('60.00')),
 ('SNAKE', Decimal('40.00')),
 ('DOG', Decimal('265.00'))]

It's not the most intuitive code, but to someone who spends a lot of time using the ORM, it's just a matter of getting used to it.

You can read more about Subquery expressions in the Django documentation.

The Django ORM has come a long way in terms of things it makes possible, reducing the need to resort to raw SQL queries even for some not-so-straightforward use cases.

Tuesday, 13 December 2016

Notes on database versioning

This post contains links to external resources that discuss database versioning or related concepts. I've also added descriptions for some of the links (not necessarily summaries, but relevant points discussed in the resource).

By database versioning, I am referring to versioning of data stored in the database, and not schema versioning or migration. An application of database versioning is audit logging.


A good starting point is this Wikipedia article: Slowly Changing Dimensions (SCD)

Various techniques of versioning (Type 0, Type 2, Type 6, and so on), referred to as SCD management methodologies, are specified. They are discussed against a particular example, but may be applied to other use cases (such as audit logging) as well.

… For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular sales person had been assigned to a particular regional office at an earlier date, whereas that sales person is now assigned to a different regional office.

Log Trigger is a technique that deals with SCD. The MS-SQL Server example is the most concise, describing an implementation using database triggers. (One can use Postgres functions or equivalent routines for better organization and structure within triggers.)

If the use case doesn't require both a start_date and an end_date (as in audit logging), one can do away with the end_date and rename start_date to created_at. To get the latest version, one would look at the row in the history table with the largest value of created_at.
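As a toy illustration of the created_at-only scheme, here's a minimal in-memory sketch (the row layout and names are mine, not from the linked resources):

```python
from datetime import datetime

# Each entry is one version of a record in the "history table";
# the latest version is simply the row with the greatest created_at.
history = [
    {"id": 1, "name": "FOX", "created_at": datetime(2016, 1, 1)},
    {"id": 1, "name": "FOXTROT", "created_at": datetime(2016, 3, 1)},
    {"id": 2, "name": "DOG", "created_at": datetime(2016, 2, 1)},
]

def latest_version(rows, record_id):
    versions = [r for r in rows if r["id"] == record_id]
    return max(versions, key=lambda r: r["created_at"])
```

In SQL this corresponds to ordering the record's history rows by created_at descending and taking the first one.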

When it comes to foreign keys, one may or may not want the referential integrity in the history table as in the base table.

Foreign keys may be kept if the referenced table's rows are never deleted, or one can use ON DELETE SET NULL instead of ON DELETE CASCADE (for example) so the entire row isn't lost. The choice depends on your use case, and when the use case is audit logging, neither method provides an accurate audit log.

If the referenced table also uses a history table to store versions, one can point foreign keys at the history table (instead of the base reference table) for a more complete record.

Better History Tables

This post from database-programmer suggests that copying all values from the base table into the history table is a naive approach. It recommends storing only the fields that you care about, and storing "deltas" in cases where the value is numeric. The advantage of storing deltas is highlighted by a specific query (computing the balance at a point in time) for a specific use case (an ordering system).
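To make the delta idea concrete, here's a small sketch (the data and names are invented): the balance at any point in time falls out of a single sum over the deltas recorded up to that time, with no need to replay full row copies.

```python
from datetime import datetime

# (timestamp, change) pairs instead of full row copies
deltas = [
    (datetime(2016, 1, 1), 100),
    (datetime(2016, 2, 1), -40),
    (datetime(2016, 3, 1), 25),
]

def balance_at(deltas, when):
    # The balance at `when` is the sum of all deltas up to and including it
    return sum(change for ts, change in deltas if ts <= when)
```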

Security of history tables can be dealt with by restricting operations: disallow UPDATEs, limit INSERTs to database triggers only, and allow DELETEs only by privileged users (for purging the tables to reduce size on disk).

While it makes sense to only save the columns you need, whether to store deltas will depend on your use case. The post also makes good points about implementing security and access control at the lowest level (the database) rather than in application code.

Branching Model

Another technique I came across is self-referencing history (or version) tables, which can be used to maintain multiple versions (branches) of a record. This is useful when you need to keep track of the source or hierarchy of changes, or need to store incremental changes (and work on such changes individually).
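A minimal sketch of such a self-referencing table (the schema and names here are mine): each version stores a pointer to its parent version, so branches form a tree, and walking the pointers recovers a branch's full lineage.

```python
# version_id -> {"parent": parent version_id or None, "content": ...}
versions = {
    1: {"parent": None, "content": "draft"},
    2: {"parent": 1, "content": "edit A"},
    3: {"parent": 1, "content": "edit B"},  # a second branch off version 1
}

def lineage(versions, version_id):
    # Walk parent pointers back to the root, then reverse into oldest-first order
    chain = []
    while version_id is not None:
        chain.append(version_id)
        version_id = versions[version_id]["parent"]
    return chain[::-1]
```

In a relational schema, "parent" would be a nullable foreign key from the version table to itself.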

Vendor-specific solutions

Change Data Capture (CDC) in SQL Server and Oracle.

There are several implementations, and several more combinations of various implementations that one can go with. Whatever the choice, one needs to maintain the versioning implementation, be it in code or in the database (as triggers or stored procedures).

Schema changes may require changes to the history tables. If the schema changes frequently or involves major structural updates, making the corresponding changes to the history tables incurs a proportionately higher overhead (which might not be worth it).

Tuesday, 5 July 2016

unittest quirks - Python 3.5 vs 3.4

This is something I discovered when a CI server wouldn't find any of the tests in a Django project.

Consider the following directory structure (app here is a Django app):
├── app
│   ├── __init__.py
│   ├── admin.py
│   ├── migrations
│   │   └── __init__.py
│   ├── models.py
│   ├── tests
│   │   ├── __init__.py
│   │   └── stuff.py
│   └── views.py
├── manage.py

tests/stuff.py looks like this:
from django.test import TestCase

class SimpleTest(TestCase):
    def test_thing(self):
        self.assertEqual('foo', 'foo')
And here's tests/__init__.py:
from .stuff import SimpleTest  # noqa
By default, a Django app has an empty tests.py.
Replacing it with a tests module, and importing all the tests in tests/__init__.py, should work, right?
Creating test database for alias 'default'...

Ran 1 test in 0.000s

Destroying test database for alias 'default'...
That's using Python 3.5.

What happens when you're using 3.4?
Creating test database for alias 'default'...

Ran 0 tests in 0.000s

Destroying test database for alias 'default'...
Wait, what?

The same version of Django is used in both cases, on the same codebase. What happened here?

Let's take a look at the Python 3.5 changelog:
The TestLoader.loadTestsFromModule() method now accepts a keyword-only argument pattern which is passed to load_tests as the third argument. Found packages are now checked for load_tests regardless of whether their path matches pattern, because it is impossible for a package name to match the default pattern. (Contributed by Robert Collins and Barry A. Warsaw in issue 16662.)

Since the pattern Django uses is "test*.py", the tests module wouldn't match, and the tests won't be found.

This can be made to work in 3.4 by renaming stuff.py inside tests/ to test_stuff.py (and removing the import in tests/__init__.py, which isn't needed).

Lessons learnt:
  • Use the same version of Python in dev and production. Even minor versions matter.
  • Test against multiple versions of Python (using something like tox).

Thursday, 31 March 2016

Better Django choices

A typical Django model looks something like this:

from django.db import models

class MyModel(models.Model):
    STATUS_CHOICES = (
        (1, "Active"),
        (2, "In-Progress"),
        (-1, "Inactive"),
    )
    status = models.SmallIntegerField(choices=STATUS_CHOICES, default=1)

    def __str__(self):
        return '{}'.format(self.status)

To filter on "status", one would do this:

MyModel.objects.filter(status=1)
Not only is hardcoding non-informative in this case, anyone reading or writing a filter has to refer to the model every time. In the two fairly large Django codebases that I have worked on, there was a mapping system that would let you do this:

MyModel.objects.filter(status=STATUS_MAP['active'])
This gives better context to someone reading the code. Authors, on the other hand, still need to hardcode a string.

"There must be a better way"

Sure there is, a quick search will lead you to django-choices.

One can pip install django-choices and call it a day. To me, this problem seems too trivial to require a dependency.

More searching will lead you to other hand-rolled approaches. What I suggest is a mix of both, with some restructuring.

Create a directory named "choices" in the app's folder. For each model in your app, add a file in this directory (for "MyModel", you can add "mymodel.py").

For each field with choices, you define a class (with the same name, capitalised):

class Status:
    ACTIVE = 1
    IN_PROGRESS = 2
    INACTIVE = -1

    choices = (
        (ACTIVE, "Active"),
        (IN_PROGRESS, "In-Progress"),
        (INACTIVE, "Inactive"),
    )
MyModel will be updated likewise. An extra attribute of the form <ModelName><CapitalisedFieldName> is added. You can, and probably should, experiment with the naming conventions here.

from django.db import models

from .choices import mymodel

class MyModel(models.Model):
    MyModelStatus = mymodel.Status
    status = models.SmallIntegerField(choices=MyModelStatus.choices,
                                      default=MyModelStatus.ACTIVE)

    def __str__(self):
        return '{}'.format(self.status)

Filtering will be thus:

MyModel.objects.filter(status=MyModel.MyModelStatus.ACTIVE)
There are several enhancements you can make to the "Status" class, such as creating a base that auto-defines "choices", provides default values, etc. This does the job without adding much overhead. It is slightly verbose, but I'd take that any day for improved readability. Authors will also benefit from autocomplete in supporting editors and IDEs.
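As a sketch of that first enhancement, here is one hypothetical base class (the explicit labels mapping is a convention I'm inventing here, not something from the posts linked above):

```python
class Choices:
    # Subclasses define UPPER_CASE value attributes plus a labels mapping
    # from attribute name to human-readable label.
    labels = {}

    @classmethod
    def as_choices(cls):
        # Build the tuple-of-tuples form that Django's `choices` expects
        return tuple(
            (getattr(cls, attr), label) for attr, label in cls.labels.items()
        )

class Status(Choices):
    ACTIVE = 1
    IN_PROGRESS = 2
    INACTIVE = -1
    labels = {
        "ACTIVE": "Active",
        "IN_PROGRESS": "In-Progress",
        "INACTIVE": "Inactive",
    }
```

A model field would then use choices=Status.as_choices() instead of a hand-written tuple.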

For those who might consider refactoring an existing codebase, you can run "makemigrations" to test it out. If you did everything correctly, no changes will be detected when you run "python manage.py makemigrations".

Making changes to production systems however, based on the latest blog posts that you read, is a risk that is yours to take :)

Tuesday, 29 March 2016

Stateful Modules in Python

TIL that random.seed persists across imports, aka modules with state (aka unintentional monkey-patching?).

Let's say we have "moduleA.py". (WARNING: do not rely on "random" for cryptographic purposes; use "os.urandom" instead.)

import random

def get_random_numbers():
    s = random.sample(range(1000), 10)
    return s

And another file named "moduleB.py":

from moduleA import get_random_numbers

import random

if __name__ == '__main__':
    for i in range(3):
        # do things with random
        random.seed(1337)
        # code from another module that uses random
        print(get_random_numbers())

The side effect may be obvious here: it prints the same random numbers on each iteration.

Perhaps .seed should be treated as a runtime context (instead of associating state with the module)?

with random.seed(1337):
    _ = random.random()

To avoid this, one can seed random with os.urandom before each use:

import random
import os

def get_random_numbers():
    random.seed(os.urandom(16))
    s = random.sample(range(1000), 10)
    return s
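Another way to sidestep the shared state entirely (a sketch of my own, not from the original post) is to give the module a private random.Random instance. Its state is independent of the module-level functions, so other code calling random.seed can no longer affect it:

```python
import random

# A private generator with its own state, seeded from the OS at creation time
_rng = random.Random()

def get_random_numbers():
    return _rng.sample(range(1000), 10)
```

This keeps get_random_numbers deterministic-proof against callers who reseed the global random module for their own purposes.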

Friday, 11 December 2015

MediaWiki, VisualEditor & Parsoid

If you're trying to get rid of the (horrid) default editor that ships with the latest version of MediaWiki (in 2015) and use shiny stuff like VisualEditor, here's a tip that might save you two days of your time.

VisualEditor needs to communicate with a backend that does the actual parsing, which in this case is Parsoid (a Node.js package). But for reasons unknown, the latest version of Parsoid (v0.4.x) does not work with the latest version of VisualEditor (for MediaWiki 1.25.3, which was released in October 2015 and is already deemed legacy... just PHP things).

One must check out v0.3.0 of Parsoid (using git, for example) for VisualEditor to be able to communicate with it when it comes to the actual parsing of pages.

Tuesday, 31 March 2015

Nginx forbidden

Today I learned that to serve a file with nginx, you need to satisfy (at least) two conditions.

Nginx must have read access to the file you want to serve.

Even if nginx workers are running as root, if the file is marked 000, then nginx cannot serve the file!

In most cases this can be as easy as doing:

chmod o+r filename

Or finer group-level permissions depending on access control restrictions for that file.

And the second condition which had me searching the web for hours is

Every directory in the path of the file must be set as executable.

If you want to serve files in /var/www/static/css

location /css/ {
    root /var/www/static;
}
Then the var, www, static and css directories must all be executable by the nginx process.
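To see where traversal breaks, you can inspect each directory on the path. For illustration, here's a small Python sketch (the function name is my own) that reports whether each component carries the execute bit for "others". Note this is only an approximation: the bit that actually matters is the one for the user the nginx workers run as.

```python
import os
import stat

def traversable_by_others(path):
    # Report (directory, others-can-traverse?) for every component of `path`
    checks = []
    cur = os.sep
    for comp in os.path.normpath(path).strip(os.sep).split(os.sep):
        cur = os.path.join(cur, comp)
        mode = os.stat(cur).st_mode
        checks.append((cur, bool(mode & stat.S_IXOTH)))
    return checks
```

Any entry that comes back False marks a directory that will make nginx return 403 for files beneath it.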

Most web servers' master processes run as root, spawning worker processes as www-data or whichever user you specify. That user must have appropriate access to all the files and directories you want to serve.