I recently came across an interesting problem at work with an equally interesting solution / hack.
First off, I have a love - hate relationship with Heroku. Heroku has some
great features, but was not built to handle
small datasets in memory.
Heroku firewall and my query
Just look at the cost for RAM progression:
- For $50, you get a GB, which only cuts you off when you hit 2 GB;
- At $200, you get 2GB, which I assume cuts you off at 4GB!
That is absolutely crazy considering I can send almost a gig a second between modern machines. Furthermore, if your Heroku container hits 300MB, it will fail to deploy (apparently, I haven’t hit that limit but using Pandas in a small project has pushed my dyno size to 200MB).
Back to the problem at hand. We were running a SQL query for a dataset of around 16 million rows. Not a lot but we were computing on the fly and joining a few tables. Our client is perfectly happy with a minute execution time considering that before we started on the project the execution time was around the 5 minute mark.
What really hurt us though was the arbitrary request limit that Heroku imposed
on our paid container, the dreaded
code=H12 desc="Request timeout".
The rule is basically as follows, if you haven’t sent any data between client and server within a 30 second rolling window, it terminates the connection. This is not just on the free container, it’s on all your containers.
We needed a solution and we needed it fast since we was already behind on the project.
We knew that a streaming HTTP response would work after searching the internets. Unfortunately, how do we yield and complete the query simultaneously?
I’m sad to say it took me a few minutes to realise that’s exactly what a thread does. We had found our solution!
Let me show you the solution and then take you through the code a bit.
The thread part
The CSV yielding part
_report_builder is a function that is invoked in a seperate thread
with the args passed in.
Then every second, we check if the thread has completed, and yield a space. If the thread has completed it moves on and yields a chain. This chain is used to join the header to the row iterator.
Echo class used for the
pseudo_buffer is derived from the
Streaming Large CSV Files in Django.
In this way, you can run a long query while keeping Heroku happy and keeping your client happy by delivering those quick iterations which Heroku enables for small teams.