Long running Django report on Heroku
I recently came across an interesting problem at work with an equally interesting solution/hack.
First off, I have a love-hate relationship with Heroku. Heroku has some
great features, but it was not built to handle large
datasets in memory.
[Image: Heroku firewall and my query]
Just look at the cost progression for RAM:
- For $50, you get 1 GB, and you're only cut off when you hit 2 GB;
- At $200, you get 2 GB, which I assume cuts you off at 4 GB!
That is absolutely crazy considering I can send almost a gig a second between modern machines. Furthermore, if your Heroku slug hits 300 MB, it will fail to deploy (apparently; I haven't hit that limit, but using Pandas in a small project has pushed my slug size to 200 MB).
Back to the problem at hand. We were running a SQL query over a dataset of around 16 million rows. Not a lot, but we were computing on the fly and joining a few tables. Our client is perfectly happy with a one-minute execution time, considering that before we started on the project the execution time was around the five-minute mark.
What really hurt us, though, was the arbitrary request limit that Heroku imposes even on paid containers: the dreaded code=H12 desc="Request timeout".
The rule is basically as follows: if you haven't sent any data between client and server within a 30-second rolling window, Heroku terminates the connection. This is not just on the free container; it applies to all your containers.
We needed a solution and we needed it fast, since we were already behind on the project.
After searching the internets, we knew that a streaming HTTP response would work. The question was: how do we yield data and complete the query at the same time?
I’m sad to say it took me a few minutes to realise that’s exactly what a thread does. We had found our solution!
Let me show you the solution and then take you through the code a bit.
The thread part
import time
from threading import Thread

data = {}
t = Thread(target=_report_builder, args=(start, end, selected_products))
t.start()
while t.is_alive():
    time.sleep(1)
    t.join(0.1)  # give the worker a brief chance to finish
    yield " "    # keep-alive byte: resets Heroku's 30-second window
The CSV yielding part
import csv
from itertools import chain

from django.http import StreamingHttpResponse

pseudo_buffer = Echo()
writer = csv.writer(pseudo_buffer)
response = chain(writer.writerow(_get_columns(selected_products)),
                 self.return_rows(writer))
return StreamingHttpResponse(response,
                             content_type="application/csv")
Basically, _report_builder is a function that is invoked in a separate thread with the args passed in.
Then, every second, we check whether the thread has completed and, while it is still running, yield a space to keep the connection alive. Once the thread has completed, the generator moves on and yields a chain, which joins the CSV header to the row iterator.
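Putting it together, here is a minimal sketch of how the two snippets above could be wired into a single view. This is my reconstruction, not the project's verbatim code: _report_builder, _get_columns and Echo are the helpers mentioned in the post, while report_view, the request parameters, and the module-level data dict that return_rows reads from are hypothetical glue I'm assuming for illustration.

import csv
import time
from itertools import chain
from threading import Thread

from django.http import StreamingHttpResponse

data = {}  # assumed shared state: _report_builder writes its rows here


def return_rows(writer):
    # Hypothetical helper: stream each computed row through the csv writer.
    for row in data.get("rows", []):
        yield writer.writerow(row)


def report_view(request):
    # Hypothetical parameters; the post doesn't show the view's inputs.
    start = request.GET.get("start")
    end = request.GET.get("end")
    selected_products = request.GET.getlist("products")

    def stream():
        # Kick off the slow query in a background thread.
        t = Thread(target=_report_builder,
                   args=(start, end, selected_products))
        t.start()
        # While it runs, trickle out spaces so Heroku sees traffic
        # inside every 30-second window.
        while t.is_alive():
            time.sleep(1)
            t.join(0.1)
            yield " "
        # Query finished: emit the header row, then the data rows.
        pseudo_buffer = Echo()
        writer = csv.writer(pseudo_buffer)
        yield from chain(writer.writerow(_get_columns(selected_products)),
                         return_rows(writer))

    return StreamingHttpResponse(stream(), content_type="application/csv")

One side effect worth knowing: the keep-alive spaces land at the very top of the downloaded file, ahead of the CSV header row.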
The Echo class used for the pseudo_buffer is derived from the Django documentation, Streaming Large CSV Files in Django.
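For reference, the Echo class in those docs is just a stand-in file object whose write() method hands back whatever it is given instead of buffering it:

class Echo:
    """An object that implements just the write method of the
    file-like interface."""
    def write(self, value):
        # Return the value instead of storing it in a buffer.
        return value

Because write() returns the formatted row, each writer.writerow(...) call evaluates to a ready-to-stream string.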
In this way, you can run a long query while keeping Heroku happy, and keep your client happy by delivering the quick iterations that Heroku enables for small teams.