Mixed Quantum: 2017

I'm Kedar, and I work as an Engineer at a startup. This post is inspired by patio11's tweet and his posts on the same topic. It's a summary of thing's I've learnt over the past year working on PaySense's platform.

Shipping on time and quick iterations matter. Time now is more valuable then money later. Below are a few (opinionated) engineering practises I've found to be useful.

Keep complexity low

Whether this be the tooling or deployment, the simpler it is to understand, the better it is to manage. Unless you've got a large team, utilise resources where they matter (shipping product).

Don't solve all the problems

The best problems are the ones you don't have to solve, those that you can avoid altogether. Buy or outsource your way out of ones you think are not important (but may be urgent). If it can be put off until later, do so.
For those that you're not so sure about, spend some time figuring out. If you need an hour to figure out if something (that will take +10 man-hours) really needs solving, do it! Even if you're wrong 8 times out of 10, you will be better off. At times you may be also be able to find other solutions, from your investigation, that don't take as long.
The better you know the ins and outs of critical systems, where data is stored and how it flows through your systems, the more effective you can get at problem-solving.

Make it work, make it right, make it fast

Make it work - Handle 90% of the use cases, show value, serve the customer first. This is your "MVP".
Make it right - Keep refactoring code, adding tests, ensure and improve the base level of quality.
Make it fast (if you really have to) - This one's highly opinionated, but I believe you should solve performance issues when they become a problem. This is difficult to get right, since you must keep an eye on run-times (does it increase day by day) and side-effects (does it subtly impact other systems), but will save you time and resources that you can deploy elsewhere.

Bugs and prioritisation

Not all bugs are created equal. Define them by how many customers they impact, and at what stage in the funnel. Prioritise those that appear later in your on-boarding funnel first. Customers at the end of the funnel are more likely to put off (and may also end up complaining, making noise on social media) than those at the beginning.
If you're spending too much time fixing bugs and/or they keep cropping up, it's probably a sign of bigger issues.

Monoliths and Microservices

Start with a single system and database. This keeps operational complexity low.
Microservices arise for several reasons, identify poor ones: .e.g hype-driven (BIGCO is using it), for sake of using another language, engineer(s) wanting free reign, "solutions seeking problems", over optimising (primary intention is of "make it fast") right from the start.
Develop in "modules" instead that you can pull out (these won't exist from day one, but you can factor them in when "making it right").

Team and Hiring

Use languages, tools, and frameworks that you and your team are comfortable and experienced with.
Hire people who are comfortable with your team. Make use of references, trust but verify.

Ship early, ship often.

Ship as early as you can. Validate ideas before it gets to the customer. If you can experiment without building it out completely, do it!
Better to "make it work" in a month, than ship with a "right & fast" product in 3. Because when you launch after 3 months, it's likely you'll scrap or end up redoing most of what you did. More time it takes to ship, the more factors you need to consider: e.g. market change, competition.

All of these might seem obvious at first, but I've come to learn and appreciate as they've allowed me to become better at what I do.

This post describes a common quirk observed when performing joins against multiple tables in an SQL query, and a solution to it using a new feature (Subquery expressions) of the Django ORM.

A common problem

Consider the following models in Django.

from django.db import models


class Account(models.Model):
    name = models.CharField(max_length=32, unique=True)


class Credit(models.Model):
    account = models.ForeignKey(Account)
    amount = models.DecimalField(max_digits=12, decimal_places=2)


class Debit(models.Model):
    account = models.ForeignKey(Account)
    amount = models.DecimalField(max_digits=12, decimal_places=2)

Let's add some accounts, and also a few credit and debit entries against them. This is what the tables look like now:

# Accounts
+------+--------+
|   id | name   |
|------+--------|
|    1 | FOX    |
|    2 | SNAKE  |
|    3 | DOG    |
+------+--------+


# Credits
+----------------+-----------------+
| account_name   |   credit_amount |
|----------------+-----------------|
| FOX            |           100.0 |
| SNAKE          |            50.0 |
| SNAKE          |            20.0 |
| DOG            |           300.0 |
+----------------+-----------------+

# Debits
+----------------+----------------+
| account_name   |   dedit_amount |
|----------------+----------------|
| FOX            |           40.0 |
| SNAKE          |           30.0 |
| DOG            |           12.0 |
| DOG            |           23.0 |
+----------------+----------------+

Now, we want to answer a simple question: "What is the current balance of each account?". The balance of an account is defined as the sum of all credits minus the sum of all debits.

Seems pretty straightforward, right?

This is how one would do it using the Django ORM (A simple group-by and aggregate, which I described in one of the previous posts: https://mixedquantum.blogspot.in/2015/01/python-django-tips.html)

Account.objects.values('id').annotate(
    credit_sum=Sum('credit__amount'),
    debit_sum=Sum('debit__amount'),
    balance=F('credit_sum') - F('debit_sum')
).values_list('name', 'balance')

"""
[('FOX', Decimal('60.00')),
 ('SNAKE', Decimal('10.00')),
 ('DOG', Decimal('565.00'))]
"""

The balance for FOX looks okay (100 - 40). But what's happening with SNAKE, and DOG? Their balance amount isn't correct.
Printing the individual sums of credits and debits might help us figure out what's happening.

+--------+--------------+-------------+-----------+
| name   |   credit_sum |   debit_sum |   balance |
|--------+--------------+-------------+-----------|
| FOX    |        100.0 |        40.0 |      60.0 |
| SNAKE  |         70.0 |        60.0 |      10.0 |
| DOG    |        600.0 |        35.0 |     565.0 |
+--------+--------------+-------------+-----------+

The sum of debits is double what is should be for SNAKE, and the sum of credits is twice of what it should be for DOG.
The "twice" comes from the fact that there are two entries of credit for SNAKE, and two entries of debit for DOG.

An intermediate representation might help explain this better. This is the data on which the aggregates (SUM) are applied.

+--------+-------------+-----------------+------------+----------------+
| name   |   credit_id |   credit_amount |   debit_id |   debit_amount |
|--------+-------------+-----------------+------------+----------------|
| FOX    |           1 |           100.0 |          1 |           40.0 |
| SNAKE  |           4 |            20.0 |          2 |           30.0 |
| SNAKE  |           3 |            50.0 |          2 |           30.0 |
| DOG    |           5 |           300.0 |          3 |           12.0 |
| DOG    |           5 |           300.0 |          4 |           23.0 |
+--------+-------------+-----------------+------------+----------------+

The INNER JOIN is applied before the aggregate (SUM in this case). We need it the other way round.

The solution

In Postgres, this can be done using a WITH query (Common Table Expression):

WITH credits AS (
    SELECT account_id, sum(amount) AS credit_sum
        FROM core_credit
    GROUP BY account_id
), debits AS (
    SELECT account_id, sum(amount) AS debit_sum
        FROM core_debit
    GROUP BY account_id
)

SELECT a.name, credit_sum - debit_sum AS balance
    FROM core_account a
        INNER JOIN credits c
            ON c.account_id = a.id
        INNER JOIN debits d
            ON d.account_id = a.id;

Which gives us the correct balance of each account:

+--------+-----------+
| name   |   balance |
|--------+-----------|
| FOX    |      60.0 |
| SNAKE  |      40.0 |
| DOG    |     265.0 |
+--------+-----------+

Common Table Expressions don't seem to be supported in Django, but subqueries, which can be used to achieve the same output, were recently added in Django 1.11.

This is how we can use Subquery expressions to get the same result:

from django.db.models.expressions import Subquery, OuterRef
from django.db.models import F, Sum

from core.models import Account, Credit, Debit

credits = Credit.objects.filter(
    account=OuterRef('pk')).values('account_id').annotate(sum_credits=Sum('amount'))
debits = Debit.objects.filter(
    account=OuterRef('pk')).values('account_id').annotate(sum_debits=Sum('amount'))

Account.objects.annotate(
    credit_sum=Subquery(credits.values('sum_credits')),
    debit_sum=Subquery(debits.values('sum_debits')),
    balance=F('credit_sum') - F('debit_sum')
).values_list('name', 'balance')

"""
[('FOX', Decimal('60.00')),
 ('SNAKE', Decimal('40.00')),
 ('DOG', Decimal('265.00'))]
"""

It's not the most intuitive of code, but to someone who spends a lot of time using the ORM, it's just a matter of getting used to.

You can read more about Subquery expressions here: https://docs.djangoproject.com/en/1.11/ref/models/expressions/#subquery-expressions

The Django ORM has come a long way in terms of things it makes possible, reducing the need to resort to raw SQL queries even for some not-so-straightforward use cases.

Mixed Quantum

Sunday, 31 December 2017

Year in Review 2017 - Startup Engineering