The 10th circle of ORM join hell

Disclaimer: I’m not a big fan of ORM’s but I do appreciate the simplicity and ease of use that they bring to projects.

For 99% of workloads, the Django ORM will work as required. Unfortunately the ORM, being a simpleton, does not know what Common Table Expressions (CTE’s) nor is the ORM smart enough to generate the SQL you think it is.

Let me give you an example. Recently, we had a reporting endpoint of our site that was unusable to large customers. It worked perfectly for small customers but when it rose above a few hundred thousand rows it become practically unusable and would timeout when the request passed our Nginx 40 second time-out rule.

After searching for ~~a few minutes~~ half a morning, I came across the SQL query from hell in the AWS RDS logs. In the meantime, I managed to fix a few missing indexes that gave it a further performance boost but did not solve the underlying issue.

Side-note: I’ve grown fond of these logs since it helps me track down queries that fail or take far too long to complete.

For the example below, I will give you a brief description of the schema that we use.

Basically, we have Companies.

Every Company has a Warehouse.

We then have Users authorized to act on a Warehouse.

Therefore we need to ensure that the user is authorized to see or edit the resultant Shipments.

Psuedo-code below of the Shipment query that was not evaluating in an appropriate time:

Shipments.objects.all().for_company(company_id).for_user(user_id)

The problem here is that it was generating SQL that looked like this (I’ve renamed the tables and fields so forgive me if it’s not valid SQL):

SELECT * FROM shipment_table 
WHERE ("shipment_table"."_company_id" = COMPANY_ID_HIDDEN AND
"shipment_table"."origin_id" IN 
  (SELECT U0."id" FROM "warehouse_table" U0
     INNER JOIN "user_warehouse_table" U1 ON
     ( U0."id" = U1."warehouse" ) WHERE U1."user_id" = USER_ID )
  AND "shipment_table"."status_action" IN STATUS_ACTIONS) LIMIT 100

For a customer of ours, this was taking over 90 seconds to complete. The query would never actually complete since anything over 40 seconds gets dropped by Nginx. This is obviously not performant enough for what we require.

Further investigation found the problem was occuring in the for_user function. This function looked as follows:

def for_user(self, user):
        user_warehouses = user.appuser.warehouses.values_list(
            'pk', flat=True
        ).order_by()
        if user_warehouses:
            return self.filter(origin_id__in=user_warehouses)
        return self

Since user_warehouses is a Django ORM query object, it evaluates it lazily. This causes that join from hell.

To resolve this nightmare join, I had to pre-evaluate the filter. Using list(query), the function then evaluates immediately.

    def for_user(self, user):
        # Resolve the user warehouses immediately rather than
        # let the ORM mangle it into a super complicated join
        user_warehouses = list(user.appuser.warehouses.values_list(
            'pk', flat=True
        ).order_by())
        if len(user_warehouses) > 0:
            return self.filter(origin_id__in=user_warehouses)
        return self

Even though I am doing another whole request, this creates the much more sane and predicatable query:

SELECT * FROM shipment_table
WHERE ("shipment_table"."_company_id" = COMPANY_ID_HIDDEN
AND "shipment_table"."origin_id" IN WAREHOUSE_IDS
AND "shipment_table"."status_action" IN STATUS_ACTIONS) LIMIT 100

This simple pre-evaluation brought the time required for the query to complete down from 90 seconds to 0.3ms, a significant improvement.

Basically instead of evaluating every row dynamically, we want to first generate a the static list of available warehouses and only return those that match.

So when using the ORM, I recommend generating and evaluating the SQL generated to ensure that you are not creating the 10th circle of hell.