Saturday 12 August 2017

Django tips #3 - Subquery expressions

This post describes a common quirk observed when performing joins against multiple tables in an SQL query, and a solution to it using a new feature (Subquery expressions) of the Django ORM.

A common problem


Consider the following models in Django.

from django.db import models


class Account(models.Model):
    name = models.CharField(max_length=32, unique=True)


class Credit(models.Model):
    account = models.ForeignKey(Account)
    amount = models.DecimalField(max_digits=12, decimal_places=2)


class Debit(models.Model):
    account = models.ForeignKey(Account)
    amount = models.DecimalField(max_digits=12, decimal_places=2)


Let's add some accounts, and also a few credit and debit entries against them. This is what the tables look like now:

# Accounts
+------+--------+
|   id | name   |
|------+--------|
|    1 | FOX    |
|    2 | SNAKE  |
|    3 | DOG    |
+------+--------+


# Credits
+----------------+-----------------+
| account_name   |   credit_amount |
|----------------+-----------------|
| FOX            |           100.0 |
| SNAKE          |            50.0 |
| SNAKE          |            20.0 |
| DOG            |           300.0 |
+----------------+-----------------+

# Debits
+----------------+----------------+
| account_name   |   dedit_amount |
|----------------+----------------|
| FOX            |           40.0 |
| SNAKE          |           30.0 |
| DOG            |           12.0 |
| DOG            |           23.0 |
+----------------+----------------+


Now, we want to answer a simple question: "What is the current balance of each account?". The balance of an account is defined as the sum of all credits minus the sum of all debits.

Seems pretty straightforward, right?

This is how one would do it using the Django ORM (A simple group-by and aggregate, which I described in one of the previous posts: https://mixedquantum.blogspot.in/2015/01/python-django-tips.html)

Account.objects.values('id').annotate(
    credit_sum=Sum('credit__amount'),
    debit_sum=Sum('debit__amount'),
    balance=F('credit_sum') - F('debit_sum')
).values_list('name', 'balance')

"""
[('FOX', Decimal('60.00')),
 ('SNAKE', Decimal('10.00')),
 ('DOG', Decimal('565.00'))]
"""


The balance for FOX looks okay (100 - 40). But what's happening with SNAKE, and DOG? Their balance amount isn't correct.
Printing the individual sums of credits and debits might help us figure out what's happening.

+--------+--------------+-------------+-----------+
| name   |   credit_sum |   debit_sum |   balance |
|--------+--------------+-------------+-----------|
| FOX    |        100.0 |        40.0 |      60.0 |
| SNAKE  |         70.0 |        60.0 |      10.0 |
| DOG    |        600.0 |        35.0 |     565.0 |
+--------+--------------+-------------+-----------+

The sum of debits is double what is should be for SNAKE, and the sum of credits is twice of what it should be for DOG.
The "twice" comes from the fact that there are two entries of credit for SNAKE, and two entries of debit for DOG.

An intermediate representation might help explain this better. This is the data on which the aggregates (SUM) are applied.

+--------+-------------+-----------------+------------+----------------+
| name   |   credit_id |   credit_amount |   debit_id |   debit_amount |
|--------+-------------+-----------------+------------+----------------|
| FOX    |           1 |           100.0 |          1 |           40.0 |
| SNAKE  |           4 |            20.0 |          2 |           30.0 |
| SNAKE  |           3 |            50.0 |          2 |           30.0 |
| DOG    |           5 |           300.0 |          3 |           12.0 |
| DOG    |           5 |           300.0 |          4 |           23.0 |
+--------+-------------+-----------------+------------+----------------+

The INNER JOIN is applied before the aggregate (SUM in this case). We need it the other way round.

The solution


In Postgres, this can be done using a WITH query (Common Table Expression):

WITH credits AS (
    SELECT account_id, sum(amount) AS credit_sum
        FROM core_credit
    GROUP BY account_id
), debits AS (
    SELECT account_id, sum(amount) AS debit_sum
        FROM core_debit
    GROUP BY account_id
)

SELECT a.name, credit_sum - debit_sum AS balance
    FROM core_account a
        INNER JOIN credits c
            ON c.account_id = a.id
        INNER JOIN debits d
            ON d.account_id = a.id;

Which gives us the correct balance of each account:

+--------+-----------+
| name   |   balance |
|--------+-----------|
| FOX    |      60.0 |
| SNAKE  |      40.0 |
| DOG    |     265.0 |
+--------+-----------+


Common Table Expressions don't seem to be supported in Django, but subqueries, which can be used to achieve the same output, were recently added in Django 1.11.

This is how we can use Subquery expressions to get the same result:

from django.db.models.expressions import Subquery, OuterRef
from django.db.models import F, Sum

from core.models import Account, Credit, Debit

credits = Credit.objects.filter(
    account=OuterRef('pk')).values('account_id').annotate(sum_credits=Sum('amount'))
debits = Debit.objects.filter(
    account=OuterRef('pk')).values('account_id').annotate(sum_debits=Sum('amount'))

Account.objects.annotate(
    credit_sum=Subquery(credits.values('sum_credits')),
    debit_sum=Subquery(debits.values('sum_debits')),
    balance=F('credit_sum') - F('debit_sum')
).values_list('name', 'balance')

"""
[('FOX', Decimal('60.00')),
 ('SNAKE', Decimal('40.00')),
 ('DOG', Decimal('265.00'))]
"""


It's not the most intuitive of code, but to someone who spends a lot of time using the ORM, it's just a matter of getting used to.

You can read more about Subquery expressions here: https://docs.djangoproject.com/en/1.11/ref/models/expressions/#subquery-expressions

The Django ORM has come a long way in terms of things it makes possible, reducing the need to resort to raw SQL queries even for some not-so-straightforward use cases.

1 comment: