During the last few weeks I've been busy implementing a disaster recovery solution for a payment system. The task is nearing completion; that's usually the period when the ground shifts under your feet thanks to some unforeseen factor or a detail somehow missed. Thankfully, that doesn't appear to be the case in this particular project.
It's always a nice feeling to get such an important and complicated project under your belt. During the course of this project I haven't picked up any particular technology skills worth mentioning (other than breaking my personal record for the time needed to install the payment system from scratch). However, I gained some experience in practical DR planning and learned a few important lessons. Here are the most significant ones.
- Closely examine dependencies. A payment system rarely operates in isolation; it usually interfaces with a lot of other entities: the bank's host, external card management systems, Visa and MasterCard facilities, the help desk system, middleware servers and others. Some of these are more important than others, but what constitutes an acceptable disaster recovery setup should be defined early on. For example, it doesn't make much sense to plan disaster recovery for the payment system but not for the bank's host. These external systems must be part of the disaster recovery plan as well.
- Make sure there are well-defined, written procedures for switching to the DR site. This sounds deceptively trivial, right? Well, it's not as easy as it sounds. It depends on the organizational skills of each institution, but it can be a tricky exercise.
- Don't be afraid to break the norm and use the infrastructure available. In my case, keeping the disaster recovery site in sync with production is officially achieved using replication software. However, the bank already had a SAN in place, and it was already replicated to the disaster recovery site. By not following the standard approach, we were able to leverage what appears to be a vastly superior solution in the DR implementation.
- Test every possible scenario you can think of. By far the easiest DR scenario is the one where everything goes smoothly: you perform an orderly shutdown of all servers and then switch to the DR site. But what about a power outage mid-transaction? What about a software failure severe enough to render the production site unworkable? Would the plan to switch to DR work under such circumstances? Test it and find out.
- Test exceptional scenarios under load. Everything runs smoothly in a test environment where the load is exactly 0.01 transactions per second. How about loading the test system with 40 TPS and seeing how a switch to the DR site goes? That's what will happen in a real-world scenario.
- Bundle as few infrastructure upgrades as possible into a DR project. It's tempting to say that since we'll be testing everything anyway, why not also upgrade the database server and install those 25 patches to the payment system? That's a thought, and it can save you some time... if everything goes smoothly. But the last thing you want is to end up chasing down problems caused by upgrades when you should be testing the DR plan.
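To make the load-testing point above a bit more concrete, here is a minimal sketch of a rate-controlled load harness. Everything in it is hypothetical: `send_transaction` is a stub standing in for whatever client actually drives the payment system's test interface, and the 40 TPS figure simply mirrors the example in the text.

```python
import time

def send_transaction(txn_id):
    """Stub for a real transaction submission; a real harness would
    call the payment system's test interface here."""
    return True  # pretend the transaction was authorized

def run_load(target_tps, duration_s, send=send_transaction):
    """Fire transactions at roughly target_tps for duration_s seconds,
    returning (sent, succeeded, achieved_tps)."""
    interval = 1.0 / target_tps
    sent = ok = 0
    start = time.monotonic()
    next_shot = start
    while time.monotonic() - start < duration_s:
        now = time.monotonic()
        if now >= next_shot:
            if send(sent):
                ok += 1
            sent += 1
            next_shot += interval  # schedule the next shot on a fixed grid
        else:
            time.sleep(min(interval, next_shot - now))
    elapsed = time.monotonic() - start
    return sent, ok, sent / elapsed

if __name__ == "__main__":
    # Keep the load running while the switch to the DR site is performed,
    # then check how many transactions actually made it through.
    sent, ok, tps = run_load(target_tps=40, duration_s=10)
    print(f"sent={sent} ok={ok} achieved_tps={tps:.1f}")
```

The interesting part isn't the harness itself but what you do with the numbers: trigger the DR switch while this is running and compare `sent` against `ok` to see how many in-flight transactions were lost or left in an ambiguous state.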