Friday, August 26, 2011

private alpha update

Due to a larger-than-expected pool, I need to split the alpha invites into two waves. So, some of you have been invited this week (welcome!), and some of you will be getting your invite next week (hang tight!).

Friday, August 19, 2011

Private Alpha Launch

My new thing is ready for private alpha - let me know if you want to be invited!

Thursday, August 18, 2011

browser innards

I'm hoping to find a nice long stretch to read this article on browser architecture. Looks as fascinating as it is useful.
Thanks to Rich for the pointer.

Tuesday, August 16, 2011

a big deal: stanford ai course

The NYTimes covered the viral spread of the Stanford AI course, and I am over the moon about it. In addition to thinking Sebastian Thrun is one of the coolest people of all time (self-driving cars would easily be the biggest quality-of-life improvement possible for first-world countries), I think the idea of free instruction is the biggest opportunity for human advancement available today. I'm not using hyperbole when I say that the Stanford AI course is a watershed comparable to the Gutenberg Bible. The AI course is the proverbial butterfly's wing.

For the last several centuries, the cost to access and distribute information has fallen due to major disruptions: printing presses, cheaper physical delivery, the internet.

But the AI course isn't about content distribution or access. Stanford is offering instruction, based on the gold standard of peer comparison and competition, for free. As of this writing, they have 74,000 registered students. That instant community will collaborate to choose which questions to pose to the professors, and all students will be ranked against one another. It is an unprecedented increase in the number of people trained in a specific field, and if you believe in humanity, you have to be excited.

I remember learning about the diffusion of solids in a solvent. There are two kinds of dynamics that drive the absorption of a solid into a fluid: chemical and thermodynamic. In the chemically driven process, molecules of the solid have a tendency to separate into the fluid. The lower the saturation of the solid in the fluid, the more likely a molecule is to drift off without being replaced by another molecule bumping into the solid. What's fascinating to me is that this effect is entirely local - if the saturation is higher around the solid mass, molecules are more likely to be replaced, and the net change is zero. If you had to rely on the chemical process alone to dissolve sugar in your coffee, it would take days to sweeten your favorite beverage.

The thermodynamic effect is convection. Differences in temperature drive fluids to circulate, because the changes in temperature cause changes in density, which in turn cause cooler fluids to fall through warmer fluids. The result is pretty dramatic mixing. The convection-driven mixing of your coffee guarantees that the local saturation is always pretty low around your sugar cubes. Fresh, unsweetened coffee is always swirling by your cubical simple carbohydrate. The mixing drives the time to dissolve down to a manageable minute or so, which is good if you want hot coffee.

Up until now, the world has relied on the diffusion of information in almost all fields. Advanced topics like AI need to be explored, then understood, then standardized, and then instructed before they can become truly common human knowledge. The Stanford AI course is information convection. The incredibly broad distribution guarantees that many people who have never been exposed to AI will be taught. By teaching, rather than passively informing, the AI course could enable those new students to teach others. It is hard to imagine a more effective means of advancing AI.

It is exactly this information convection that I want to harness in my next venture - I want to give away tools, processes, and instruction to anyone interested in my kind of problem. I want to teach people how to explore a specific field, and I want them to apply their findings directly and immediately. I hope it has a fraction of the impact that the AI course will have.

Thursday, August 11, 2011

Pleased as Punch - Streaming CSV

I have to admit, there's a little extra spring in my step this morning, and it isn't just because it finally stopped raining (Boston's been wet for a week and a half) - I just published my first gem to rubygems.org: csv_streamer.

I'm working on a project that analyzes a very large data set - historical trading data for US markets. The analysis spits out a similarly large amount of data, and I let users download the final analysis as a csv file (but not the original trading data, mr. exchange operator).

I have been using the very nifty csv_builder project, which provides template support for csv. To generate a csv file, you just pop arrays onto a csv object provided to your template, like this:

csv << ["this","is","an","example","line"]

Like all templates in Rails, you get a nice clean separation between your view and your controller.
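In plain Ruby terms, each `csv <<` in the template boils down to encoding one row as a csv line and appending it to the output. Here's a rough sketch of that idea using the standard library's CSV class (the variable names are illustrative, not csv_builder internals):

```ruby
require 'csv'

# Rough sketch of what a csv_builder-style template does under the
# hood: each `csv << row` call appends one encoded line to the output.
rows = [
  ["this", "is", "an", "example", "line"],
  ["a", "second", "row"]
]

# CSV.generate_line handles quoting/escaping and appends a newline.
output = rows.map { |row| CSV.generate_line(row) }.join
```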

The problem is, web servers limit the resources available to a single request. You can't spend 30 minutes and gobs of RAM generating a csv file, and then send it back to the browser in one shot. Most production web servers will time out the request after 30 or 60 seconds. The implementation of the timeout varies, but increasingly web servers base the timeout on the time between data writes, rather than the close of the stream. In other words, you have 30 seconds to send your first byte, and then 30 seconds to make each additional write. The technique of starting the stream quickly, and then dribbling out data over longer periods, is called streaming. But Rails 3.0 doesn't have native support for streaming templates.

Rails 3.1 includes template streaming as a marquee feature; there, however, the feature is more general. They want to facilitate parallel requests from the browser (most browsers can handle 4 per domain), so that pages render more quickly. Rails 3.1 will help with the standard templates, but for an extension like csv_builder, the template handler itself needs to be modified.

Luckily, Rails 3.0 does have support for streaming data. The key is to set your controller's response_body to an object that implements "each", as described in this stackoverflow discussion, and in numerous screencasts and howtos.
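The "object that implements each" contract is simple enough to sketch in plain Ruby. The class below is an illustrative example, not code from csv_streamer itself: Rack calls #each and writes every yielded chunk to the socket, so the client starts receiving data before the last row is even built.

```ruby
require 'csv'

# A minimal streaming body: any object responding to #each can be
# assigned to response_body in Rails 3.0. (Names are hypothetical.)
class CsvStream
  def initialize(rows)
    @rows = rows
  end

  # Yield one encoded csv line at a time; the server flushes each
  # yielded chunk to the client as a separate write.
  def each
    @rows.each { |row| yield CSV.generate_line(row) }
  end
end

# In a controller action you would write something like:
#   self.response_body = CsvStream.new(huge_dataset)
```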

From what I found googling my face off for three days, most people who need to stream data just do so directly from their controller methods. That works, unless you're a wonky architect with a penchant for strict separation between views and controllers who has also invested a lot of effort in creating csv_builder templates. So, I wanted csv_builder's templates, but I also wanted streaming support. Ergo, csv_streamer!

csv_streamer is a fork of csv_builder (hopefully my pull request will be accepted and csv_builder will just have streaming). The project was pretty fun, because it involved reading a lot of code and learning all about streaming, ruby blocks/procs, and yield. As it turns out, csv is just ideal for streaming, because files are generated a line at a time. My implementation takes advantage of this, streaming each line as it is generated for maximum smoothness (I meant that in terms of data chunks being small and frequent, but it could be taken brogrammatically). The problem of streaming html is more complicated, because of the dependencies between document parts. In csv, the header is the only dependency, and it is always served first, so streaming (and stream-oriented processing on the client) is simple.

Another interesting aspect of streaming is the dependency on your Rack server. Even if you code a streaming response in your controller or template handler, it will only stream to your client browser if the underlying web server supports streaming. Rails uses Rack, which allows you to swap out your web server quite easily. The default in development mode is the very antiquated WEBrick, which, among other deficiencies, does not support streaming. Both mongrel and the absolutely hilariously named Unicorn do support streaming. I was able to find more examples of configuring Unicorn - GitHub uses it, for example. Initially, I went with Unicorn for development. I use Heroku for production, and it turns out the default configuration does not provide streaming. Luckily, the Heroku Cedar stack allows you to use Unicorn, and there is a fantastic howto from Michael van Rooijen. In addition to streaming, you can pack multiple Unicorn processes onto a single Heroku dyno, to optimize your utilization. Michael's post provides some nice benchmarking and analysis to find the optimal number of dynos.
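For reference, a minimal Unicorn setup looks something like the sketch below. The worker count and timeout are assumptions to illustrate the knobs - tune them for your app, and see the howto above for the Heroku-specific details:

```ruby
# config/unicorn.rb - a minimal illustrative configuration.
# Several workers can share one dyno/machine to improve utilization.
worker_processes 3

# Per-request timeout, in seconds; streaming responses must keep
# writing chunks faster than this to avoid being killed.
timeout 30

# Load the app before forking workers, so they share memory via
# copy-on-write.
preload_app true
```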

If you need streaming csv support in your Rails app, add csv_streamer to your Gemfile and have at it. You can get all the details from the README on GitHub.

To get you started even quicker, I created a test application to deploy onto Heroku and verify everything worked as expected. Again, there's more detail in the README on GitHub.

I am, however, still stuck on automated testing. csv_builder uses rspec, and while I can invoke the streaming code in the template handler, the implementation of TestResponse doesn't have a timeout and it buffers all writes until the stream is closed. So, it is a good test for functionality - I can prove the data streamed is correct. However, I'd love to have two tests - one that requests very large data in a non-streaming way and verifies that a timeout exception is raised, and a second that streams the same template and verifies success. Any hints are very welcome - I posted this quandary to stackoverflow as well.

I'll let you know if I figure out a test, in the meantime: Happy Streaming!

Thursday, August 04, 2011

designing with rails routes

I have been starting any new work in Rails by creating or modifying the models in the application, and then adding tests to the models. Up until today, I had found the transition from coding models to coding controllers to be very confusing. Models and models' tests seem very intuitive to me, but I was having trouble putting my finger on what made controllers so tough. I think I finally learned the missing piece - routes!

One of the cooler elements of Rails is the way it inherently supports ReST. The routes.rb file's "resources" keyword allows you to quickly express the structure of your ReSTful API. The buckblog has a nice brief on one of the more useful bits: nested resources, and links to the more canonical tutorials from the Rails community. Here are the most useful things I've learned:

  1. When you are ready to modify your controllers, especially if you want to add a new resource or change relationships between them, start with the routes.rb file, and think of 'rake routes' like a compiler. Make changes and 'rake routes' to make sure they reflect your intent before you start writing tests or views.
  2. Rails does quite a lot to automate and abstract you from the details of routing. You do need to understand a few key concepts though:
    • A controller has many actions. 
    • Routes and Actions are one to one (but...)
    • URLs and Routes are not one to one. ReST uses the http verb to distinguish between reading and deleting at the same URL. In other words, a Route is just a URL plus a verb. 
    • Resources are a collection of routes, usually pointing to one controller. Resources provide nice ReSTful semantics, making the intent of your routing more clear in routes.rb.
  3. Nesting resources is a strong spice, best used sparingly. In my mind, nesting is ideal for certain actions in a one-to-many (1:N) relationship. I found it very important to understand that you can pick and choose the routes you want to nest. A single resource can have both nested and un-nested routes. Imagine you have the classic "post has many comments" relationship. 
    • I'd suggest using nested resources routes for 
      • :index = since you will almost always want to filter comments down to comments on a single post, build it into your route
      • :create = you will always need to specify a post for your comment, so build it into the route
    • But I'd avoid nesting the resource routes for
      • :destroy, because you'll be most commonly deleting a single comment
      • :show, because :index is for listing, :show is for a single comment. Why require the caller to specify both the post id and the comment id?
    • I think :new is very debatable, and ultimately depends on the structure of your pages. To set the action of the form properly, you need to have the parent's ID. If you are rendering the form in a context that has the parent set already, you may not need to pass it along to the controller via the request. But if you find yourself putting the parent's ID into a parameter, you should nest the :new action.
    • Rails creates convenience functions that will generate the URL path for a particular route. There are decent conventions for the naming, but to be honest, I've found the patterns difficult to remember or apply by hand. Mostly I run into trouble with controllers/models that have multi-word names, but often also with the "natural" way Rails deals with singular/plural names. Now I don't even try to remember, because you can always get the function name from the leftmost column of 'rake routes'.
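Putting the pick-and-choose advice above into routes.rb, a sketch for the "post has many comments" example might look like this (Rails 3.0 syntax, using the :only option to select routes):

```ruby
# config/routes.rb - nest only the routes that need the parent post.
resources :posts do
  # nested: listing comments is scoped to a post, and creating a
  # comment always needs its post, so build the post into the route
  resources :comments, :only => [:index, :create]
end

# un-nested: a comment's own id is enough to show or delete it
resources :comments, :only => [:show, :destroy]
```

Running 'rake routes' after a change like this shows exactly which URL + verb combinations you've created, and the leftmost column gives you the path helper names.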