Thursday, August 11, 2011

Pleased as Punch - Streaming CSV

I have to admit, there's a little extra spring in my step this morning, and it isn't just because it finally stopped raining (Boston's been wet for a week and a half) - I just published my first gem: csv_streamer.

I'm working on a project that analyzes a very large data set - historical trading data for US markets. The analysis spits out a similarly large amount of data, and I let users download the final analysis as a csv file (but not the original trading data, mr. exchange operator).

I have been using the very nifty csv_builder project, which provides template support for csv. To generate a csv file, you just pop arrays onto a csv object provided to your template, like this:

csv << ["this","is","an","example","line"]

Like all templates in Rails, you get a nice clean separation between your view and your controller.
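The mechanics are easy to sketch outside Rails with Ruby's standard CSV library, which is what csv_builder builds on - the csv object handed to your template accepts rows exactly like this (the rows here are made up for illustration):

```ruby
require 'csv'

# Plain-Ruby sketch of what the template's `csv << [...]` calls produce;
# csv_builder hands your template a CSV object that behaves like this one.
output = CSV.generate do |csv|
  csv << ["this", "is", "an", "example", "line"]
  csv << ["another", "row", "of", "made-up", "data"]
end
```

Each appended array becomes one comma-separated line in the generated file.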

The problem is, web servers limit the resources available to a single request. You can't spend 30 minutes and gobs of RAM generating a csv file and then send it back to the browser in one shot. Most production web servers will time out the request after 30 or 60 seconds. The implementation of the timeout varies, but increasingly web servers base the timeout on the time between data writes, rather than on the close of the stream. In other words, you have 30 seconds to send your first byte, and then 30 seconds to make each additional write. The technique of starting the stream quickly and then dribbling out data over a longer period is called streaming. But Rails 3.0 doesn't have native support for streaming templates.

Rails 3.1 includes template streaming as a marquee feature, but there the feature is more general: it's designed to let the browser issue parallel requests (most browsers can handle 4 per domain) so that pages render more quickly. Rails 3.1 will help with the standard templates, but for an extension like csv_builder, the template handler itself needs to be modified.

Luckily, Rails 3.0 does have support for streaming data. The key is to set your controller's response_body to an object that implements "each", as described in this stackoverflow discussion, and in numerous screencasts and howtos.
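The pattern is simple enough to sketch without Rails: any object that responds to each, yielding one chunk of the response per call, can be assigned to response_body, and Rack will write each yielded chunk to the socket as it arrives (the class name and rows below are illustrative, not csv_streamer's actual code):

```ruby
require 'csv'

# Minimal sketch of the response_body pattern: Rails/Rack calls #each
# on the body object and writes every yielded string to the client.
class CsvBody
  def initialize(rows)
    @rows = rows
  end

  def each
    @rows.each do |row|
      yield CSV.generate_line(row) # one CSV-encoded chunk per row
    end
  end
end

# In a controller action you would write something like:
#   self.response_body = CsvBody.new(rows)
```

Because each row is yielded as soon as it's encoded, the server can flush data to the client long before the full file exists.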

From what I found googling my face off for three days, most people who need to stream data just do so directly from their controller methods. That works, unless you're a wonky architect with a penchant for strict separation between views and controllers who has also invested a lot of effort in creating csv_builder templates. So, I wanted csv_builder's templates, but I also wanted streaming support. Ergo, csv_streamer!

csv_streamer is a fork of csv_builder (hopefully my pull request will be accepted and csv_builder will just have streaming). The project was pretty fun, because it involved reading a lot of code and learning all about streaming, ruby blocks/procs and yield. As it turns out, csv is just ideal for streaming, because files are generated a line at a time. My implementation takes advantage of this, streaming each line as it is generated for maximum smoothness (I meant that in terms of data chunks being small and frequent, but it could be taken brogrammatically). The problem of streaming html is more complicated, because of the dependencies between document parts. In csv, the header is the only dependency, and it is always served first, so streaming (and stream-oriented processing on the client) is simple.
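That header-first property can be sketched with a plain Enumerator - the header row goes out once, then each data row streams as it is produced (the names here are illustrative, not csv_streamer's actual internals):

```ruby
require 'csv'

# Sketch of csv's one-way dependency: emit the header first,
# then stream each data row as it becomes available.
def csv_stream(header, rows)
  Enumerator.new do |out|
    out << CSV.generate_line(header)
    rows.each { |row| out << CSV.generate_line(row) }
  end
end
```

Nothing later in the stream can change what was already sent, which is exactly why csv needs no buffering, unlike html where a late template fragment can affect the layout of earlier parts.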

Another interesting aspect of streaming is the dependency on your Rack server. Even if you code a streaming response in your controller or template handler, it will only stream to your client browser if the underlying web server supports streaming. Rails uses Rack, which allows you to swap out your web server quite easily. The default in development mode is the very antiquated WEBrick, which, among other deficiencies, does not support streaming. Both Mongrel and the absolutely hilariously named Unicorn do support streaming. I was able to find more examples of configuring Unicorn - github uses it, for example. Initially, I went with Unicorn for development. I use Heroku for production, and it turns out the default configuration does not provide streaming. Luckily, Heroku's Cedar stack allows you to use Unicorn, and there is a fantastic howto from Michael van Rooijen. In addition to streaming, you can pack multiple Unicorn processes onto a single Heroku dyno, to optimize your utilization. Michael's post provides some nice benchmarking and analysis to find the optimal number of processes per dyno.
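For reference, the Unicorn setup along those lines is just a few lines of config (the numbers below are illustrative - benchmark your own app, as Michael's post describes, before settling on a worker count):

```ruby
# config/unicorn.rb - illustrative values, not a recommendation
worker_processes 3   # several workers can share one Heroku dyno
timeout 30           # kill a worker stuck longer than 30 seconds
preload_app true     # load the app once, before forking workers
```

You then point your Heroku Procfile's web process at unicorn with this config file instead of the default server.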

If you need streaming csv support in your Rails app, add csv_streamer to your gemfile and have at it. You can get all the details from the readme on github.
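That is, one line in your Gemfile:

```ruby
# Gemfile
gem 'csv_streamer'
```

then bundle install, and your .csv templates can stream.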

To get you started even quicker, I created a test application to deploy onto heroku and verify everything worked as expected. Again, there's more detail in the readme on github.

I am, however, still stuck on automated testing. csv_builder uses rspec, and while I can invoke the streaming code in the template handler, the implementation of TestResponse doesn't have a timeout, and it buffers all writes until the stream is closed. So, it is a good test for functionality - I can prove the data streamed is correct. However, I'd love to have two tests: one that requests very large data in a non-streaming way and verifies that a timeout exception is raised, and a second that streams the same template and verifies success. Any hints are very welcome - I posted this quandary to stackoverflow as well.

I'll let you know if I figure out a test, in the meantime: Happy Streaming!
