Django long running report on Django

I recently came across an interesting problem at work with an equally interesting solution / hack.

First off, I have a love - hate relationship with Heroku. Heroku has some great features, but was not built to handle ~~large~~ small datasets in memory.

Heroku firewall and my query

Just look at the cost for RAM progression:

For $50, you get a GB, which only cuts you off when you hit 2 GB;
At $200, you get 2GB, which I assume cuts you off at 4GB!

That is absolutely crazy considering I can send almost a gig a second between modern machines. Furthermore, if your Heroku container hits 300MB, it will fail to deploy (apparently, I haven’t hit that limit but using Pandas in a small project has pushed my dyno size to 200MB).

Back to the problem at hand. We were running a SQL query for a dataset of around 16 million rows. Not a lot but we were computing on the fly and joining a few tables. Our client is perfectly happy with a minute execution time considering that before we started on the project the execution time was around the 5 minute mark.

What really hurt us though was the arbitrary request limit that Heroku imposed on our paid container, the dreaded code=H12 desc="Request timeout".

The rule is basically as follows, if you haven’t sent any data between client and server within a 30 second rolling window, it terminates the connection. This is not just on the free container, it’s on all your containers.

We needed a solution and we needed it fast since we was already behind on the project.

We knew that a streaming HTTP response would work after searching the internets. Unfortunately, how do we yield and complete the query simultaneously?

I’m sad to say it took me a few minutes to realise that’s exactly what a thread does. We had found our solution!

Let me show you the solution and then take you through the code a bit.

The thread part

data = {}
t = Thread(target=_report_builder, args=(start, end, selected_products))
t.start()

while (t.is_alive()):
    time.sleep(1)
    t.join(.1)
    yield(" ")

The CSV yielding part

psuedo_buffer = Echo()
writer = csv.writer(psuedo_buffer)

response = chain(writer.writerow(_get_columns(selected_products)),
                 self.return_rows(writer))

return StreamingHttpResponse(response,
                             content_type="application/csv")

Basically _report_builder is a function that is invoked in a seperate thread with the args passed in.

Then every second, we check if the thread has completed, and yield a space. If the thread has completed it moves on and yields a chain. This chain is used to join the header to the row iterator.

The Echo class used for the pseudo_buffer is derived from the Django documentation, Streaming Large CSV Files in Django.

In this way, you can run a long query while keeping Heroku happy and keeping your client happy by delivering those quick iterations which Heroku enables for small teams.