Friday, July 29, 2011

System Administration Nirvana

I don't consider myself a true sysadmin, but inevitably you need to dabble a bit as an admin if you want to build anything fun. Part of my current project is a custom Python daemon. My web application posts jobs to a database (if you will allow me to call Mongo a db...), which the Python daemon monitors. When a job is posted to the db, the daemon picks it up and does the processing.
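For the curious, the post-a-job, pick-it-up loop looks roughly like the sketch below. This is not my actual daemon: `InMemoryJobStore`, `run_once`, `run_forever`, and the `status` values are all hypothetical names, and a plain Python list stands in for the real database so the example runs on its own.

```python
import time


class InMemoryJobStore:
    """Hypothetical stand-in for the database the web app posts jobs to."""

    def __init__(self):
        self.jobs = []

    def post(self, payload):
        # The web application side: record a job and mark it pending.
        self.jobs.append({"payload": payload, "status": "pending"})

    def claim_next(self):
        # The daemon side: take the oldest pending job, if there is one.
        for job in self.jobs:
            if job["status"] == "pending":
                job["status"] = "running"
                return job
        return None


def run_once(store, handler):
    """Process every pending job and return how many were handled."""
    handled = 0
    while True:
        job = store.claim_next()
        if job is None:
            return handled
        try:
            job["result"] = handler(job["payload"])
            job["status"] = "done"
        except Exception as exc:
            # A failing job is recorded instead of taking the daemon down.
            job["status"] = "failed"
            job["error"] = str(exc)
        handled += 1


def run_forever(store, handler, poll_seconds=5.0):
    """Poll the store until the process is stopped."""
    while True:
        run_once(store, handler)
        time.sleep(poll_seconds)
```

Catching the exception per job is a design choice in this sketch: one bad job gets marked as failed and the loop keeps going.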

Of course, I have somewhere between a few and several bugs in my monitoring process. So, from time to time, I need to restart the process. I just invited my first alpha user to test out the site, so while my audience is minuscule, I'm still very worried about the site being dysfunctional.

Enter RightScale. Today, in about 4 hours, I learned all about monitoring at RightScale (they use collectd) and I enabled it for my job monitoring servers. It was easy to add a plugin to monitor my custom application -- I just configured the standard processes plugin to track my daemon. Immediately, I was able to see count, CPU usage, memory usage, and disk I/O for my process. Very useful. I added an escalation to email me when the process crashed*. That was neat... but then I had this vision of myself fishing with my son, getting an urgent email, making him quit fishing early (tears), and then speeding home, all just to type "kill -9 ". So, I made a custom alert escalation on RightScale to restart the daemon if it crashes. Pretty simple, but something that would have taken days in the past. I would have spent a week just comparing all the options for monitoring systems and figuring out how to install one on all my servers.
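To make the restart-on-crash idea concrete, here is a minimal local sketch. It is not how RightScale or collectd implement escalations; it is just a small supervisor built on the standard library, and `supervise`, `max_restarts`, and `delay_seconds` are names I made up for the illustration. It treats any non-zero exit code as a crash.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, delay_seconds=1.0):
    """Run command, restarting it after each non-zero exit.

    Returns the number of restarts once the command exits cleanly.
    Raises RuntimeError if it is still crashing after max_restarts.
    """
    restarts = 0
    while True:
        # Blocks until the child process exits, then gives its exit code.
        exit_code = subprocess.call(command)
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                "giving up after %d restarts (last exit code %d)"
                % (restarts, exit_code)
            )
        restarts += 1
        time.sleep(delay_seconds)


if __name__ == "__main__":
    # Hypothetical usage: supervise a child that exits cleanly right away.
    supervise([sys.executable, "-c", "print('daemon ran')"])
```

The restart cap is there so a daemon that crashes on startup does not get relaunched forever.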

Another nifty trick - when I invited my testers to the site, I wanted to have separate staging and production environments. So, I clicked the "clone" button, and presto, my whole environment was replicated. heroku_san made it even easier for the web application.

Anyway, wish me luck as the first user tries out my new project!

* Yeah, some of my bugs are still crashing bugs. Sorry Joel Spolsky, I don't have a QA team for this either. I do have 200+ unit tests though!
