2011/05/25

Lessons learned during a DR implementation


During the last few weeks I've been engaged in the implementation of a disaster recovery solution for a payment system. The task is nearing completion; usually that's the period where the ground shifts under your feet because of some unforeseen factor or a detail that was somehow missed. Thankfully, that doesn't appear to be the case in this particular project.

It's always a nice feeling to get such an important and complicated project under your belt. During the course of this project I haven't really picked up any particular technology skills worth mentioning (except breaking my personal record for the time needed to install the payment system from scratch). However, I gained some experience in practical DR planning and learned a few important lessons. Here are the most significant ones.
  • Closely examine dependencies. A payment system finds it very difficult to operate entirely on its own; it usually interfaces with a lot of other entities. These can be the bank's host, external card management systems, Visa and MasterCard facilities, the help desk system, middleware servers and others. Some of these are more important than others, but what constitutes an acceptable disaster recovery setup should be defined early on. For example, it doesn't make much sense to plan disaster recovery for the payment system but not for the bank's host. These external systems must be part of the disaster recovery planning as well.
  • Make sure that there are well-defined, written procedures for switching to the DR site. This sounds deceptively trivial, right? Well, it's not as easy as it sounds. It really depends on the organizational skills of each institution, but it can be a tricky exercise.
  • Don't be afraid to break the norm and use the infrastructure that's already available. In my case, keeping the disaster recovery site in sync with the production site is officially achieved using replication software. However, the bank already had a SAN in place, and it was already replicated to a disaster recovery site. By not following the standard approach, we managed to leverage what appears to be a vastly superior solution for the DR implementation.
  • Test every possible scenario you can think of. By far the easiest DR scenario is the one where everything goes smoothly and you can perform an orderly shutdown of all servers, then switch to the DR site. But what about a power outage in mid-transaction? What about a software failure severe enough to render the production site unworkable? Would the plan to switch to DR work under such circumstances? Test it and find out.
  • Test exceptional scenarios under load. Everything runs smoothly in the test environment, where you have a load of exactly 0.01 transactions per second. How about loading up the test system with 40 TPS and seeing how a switch to the DR site goes (see the sketch after this list)? That's much closer to what will happen in a real-world scenario.
  • Bundle as few infrastructure upgrades as possible into a DR project. It's tempting to say that since we'll be testing everything anyway, how about also upgrading our database server and installing those 25 patches to the payment system as well? Well, that's a thought, and it can save you some time... if everything goes smoothly. But the last thing you want is to end up chasing down problems caused by upgrades when you should be testing the DR plan.
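
To make the load-testing lesson a bit more concrete, here is a minimal sketch of a test-harness driver that keeps a steady transaction rate flowing while the switch to the DR site is exercised. Everything in it is an assumption for illustration: the 40 TPS target, the five-minute window and the SendTransaction() call are stand-ins for whatever the actual test environment provides.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class DrLoadDriver
    {
        const int TargetTps = 40;   // assumed target rate from the lesson above
        static int _sent, _failed;

        static void Main()
        {
            TimeSpan interval = TimeSpan.FromSeconds(1.0 / TargetTps);
            Stopwatch clock = Stopwatch.StartNew();
            TimeSpan next = TimeSpan.Zero;

            // Run for five minutes; trigger the switch to the DR site somewhere
            // inside this window and watch what happens to in-flight transactions.
            while (clock.Elapsed < TimeSpan.FromMinutes(5))
            {
                if (clock.Elapsed >= next)
                {
                    next += interval;
                    try
                    {
                        SendTransaction();   // hypothetical call into the test harness
                        _sent++;
                    }
                    catch (Exception)
                    {
                        _failed++;
                    }
                }
                Thread.Sleep(1);             // coarse pacing is enough for a smoke test
            }
            Console.WriteLine("Sent {0} transactions, {1} failed during the window.", _sent, _failed);
        }

        static void SendTransaction()
        {
            // Placeholder: a real harness would build a test transaction and post it
            // to the payment system's test endpoint.
        }
    }

The interesting outcome is not the counters themselves but whether the failures are confined to the few seconds of the actual switchover and whether the system reconciles them afterwards.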

2011/05/08

Documentation

I visit the CodePlex site quite often. As a developer, I try to get a glimpse of other developers' efforts for various reasons. Some of the projects found on CodePlex are short-lived affairs, possibly a junior developer's attempt at fame without serious motivation behind it. Most of the projects tend to revolve around topics that appeal to younger developers, especially ASP.Net and MVC, HTML, content management and other related popular technologies.

Occasionally, I find an open source project that stands out from the others. Some time ago I got interested in CommonLibrary.Net. This little project is a library of reusable code, but it differs in several ways from other similar endeavors. A very encouraging point is that the project started out in 2009 and the main author is very active, committing regular check-ins, continuously evolving the library and posting releases with new features and fixes. Another feature that catches the visitor's attention is the large namespace hierarchy and the multitude of developer helper classes packed into this project. Finally, visitors browsing the code can quickly see that it is of unusually high quality; the main author clearly knows the tricks of the trade.

There is one sore point, and CommonLibrary.Net is by no means the only project on CodePlex that is problematic in this area: the quality of the documentation is poor at best. There are examples of using the library, and one project deliverable is a help file generated from the code's XML comments. But it's not enough. The examples focus on showing how to use specific namespaces of the library but do not explain the general idea behind them. Browsing the code, one can quickly see that there are several problems with the XML comments from which the help file is generated: parameters are not documented, comments are copy-pasted between overloads and end up incorrect, class and method comments are frequently vague, and a lot of protected or public members remain undocumented.
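
For what it's worth, the kind of comment block that makes a generated help file useful is not hard to write. The class below is a made-up example, not something from CommonLibrary.Net; the point is simply that every parameter is documented and each overload gets its own summary instead of a copy-pasted one.

    using System;

    public class AccountService
    {
        /// <summary>
        /// Transfers <paramref name="amount"/> from one account to another as a single
        /// atomic operation. If either account does not exist, no money moves.
        /// </summary>
        /// <param name="fromAccountId">Identifier of the account being debited.</param>
        /// <param name="toAccountId">Identifier of the account being credited.</param>
        /// <param name="amount">Amount to transfer; must be greater than zero.</param>
        /// <returns>True if the transfer completed; false if it was rejected.</returns>
        /// <exception cref="ArgumentOutOfRangeException">Thrown when amount is zero or negative.</exception>
        public bool Transfer(string fromAccountId, string toAccountId, decimal amount)
        {
            if (amount <= 0) throw new ArgumentOutOfRangeException("amount");
            // Body omitted; the comment block above is the point of the example.
            return true;
        }

        /// <summary>
        /// Same as <see cref="Transfer(string,string,decimal)"/>, but also records a free-text
        /// reference on the resulting ledger entry. Note that this summary is written for this
        /// overload rather than copy-pasted from the one above.
        /// </summary>
        /// <param name="fromAccountId">Identifier of the account being debited.</param>
        /// <param name="toAccountId">Identifier of the account being credited.</param>
        /// <param name="amount">Amount to transfer; must be greater than zero.</param>
        /// <param name="reference">Free-text reference stored with the transaction.</param>
        /// <returns>True if the transfer completed; false if it was rejected.</returns>
        public bool Transfer(string fromAccountId, string toAccountId, decimal amount, string reference)
        {
            return Transfer(fromAccountId, toAccountId, amount);
        }
    }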

I think there are several cases where an open source project can get away with little or no documentation at all, but a developer's library simply isn't one of them. I would go as far as saying that the documentation may very well be the most important deliverable of such a project. Users of code libraries are themselves developers, but that doesn't mean they want, by default, to read the source code to understand what's being done under the hood. And the notion that a well laid-out namespace and consistent naming conventions are their own documentation is correct, but way overblown if taken to the extreme at the expense of code comments. True, one can easily gather that ComLib.Scheduling.Scheduler has something to do with scheduling tasks, but how are the tasks scheduled exactly? What is a task? What triggers a task? Is there something a task should not do? How can callers know the status of their tasks? Can they stop them, reschedule them, pause them? Can they add a task dynamically at runtime? At start-up time? How can they gracefully stop all tasks when the program is shutting down?
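
To make that concrete, here is the sort of commented surface that would answer most of those questions in one place. The interface below is hypothetical, invented purely for illustration; it is not the actual ComLib.Scheduling API.

    using System;

    /// <summary>
    /// Runs registered tasks on background threads. A "task" here is simply a named
    /// delegate; it is triggered by the interval supplied when it is scheduled, and it
    /// should avoid blocking for long periods because tasks share a small thread pool.
    /// Tasks can be added at start-up or dynamically at runtime, paused, resumed or
    /// stopped individually, and StopAll() waits for running tasks to finish so the
    /// host program can shut down gracefully.
    /// </summary>
    public interface ITaskScheduler
    {
        /// <summary>Registers <paramref name="work"/> to run every <paramref name="interval"/>.</summary>
        void Schedule(string name, TimeSpan interval, Action work);

        /// <summary>Returns the current state (Running, Paused or Stopped) of a named task.</summary>
        ScheduledTaskState GetState(string name);

        /// <summary>Pauses a task without removing it; <see cref="Resume"/> restarts it.</summary>
        void Pause(string name);

        /// <summary>Resumes a previously paused task.</summary>
        void Resume(string name);

        /// <summary>Stops all tasks and waits for in-flight runs to complete.</summary>
        void StopAll();
    }

    /// <summary>Possible states of a scheduled task.</summary>
    public enum ScheduledTaskState { Running, Paused, Stopped }

With comments like these, a caller never needs to open the implementation to answer the questions listed above.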

Fortunately, work is being done at CommonLibrary.Net to slowly fix this. It looks like it might take a while, but it is happening. Documentation for code libraries is indispensable.