DAT401: SQL 2008 HA/DR case studies

This session covered several proven methods of high availability and disaster recovery for SQL 2008. It focused on case studies of real-world companies, their HA/DR approaches, and metrics from their environments. It didn’t cover any new whiz-bang technology or third-party products. With the proper design, processes, procedures, and highly skilled people, it’s really mind boggling what companies have done.

There are a few common HA/DR architectures:

– Failover clustering for HA and database mirroring for DR
– Synchronous database mirroring for HA/DR and log shipping for additional DR
– Geo-clustering for HA/DR and log shipping for additional DR
– Failover clustering for HA and SAN-based replication for DR
– Peer-to-Peer replication for HA and DR

Each architecture has its own pros and cons. Your business requirements will determine which solutions you will want to employ. The remainder of the session discussed various case studies.

One case study, bWin, is an online gambling company in Europe. They process 1 million bets a day on over 90 sports. Their SLA is zero data loss and 99.99% availability 24×7, and they have an unlimited IT budget (no kidding). Their design had to take into account a full datacenter failure and complete data loss within that datacenter. The environment holds in excess of 100TB of data across 100 SQL instances and processes over 450K SQL statements per second.
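To put the 99.99% figure in perspective, here is a rough back-of-the-envelope sketch of what that SLA allows in downtime per year, along with the average bet rate implied by 1 million bets a day. The function and variable names are my own and were not part of the session, and the year is assumed to be 365 days.

```python
# Assumption: a 365-day year, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Maximum downtime (in minutes) that still meets the availability target."""
    return period_minutes * (1 - availability_pct / 100)


# 99.99% availability leaves roughly 52.56 minutes of downtime per year.
budget = downtime_budget_minutes(99.99)

# 1 million bets a day averages out to about 11.6 bets per second
# (real traffic is surely much burstier than the average).
avg_bets_per_second = 1_000_000 / (24 * 60 * 60)

print(f"Yearly downtime budget at 99.99%: {budget:.2f} minutes")
print(f"Average bet rate: {avg_bets_per_second:.1f} bets/second")
```

In other words, the SLA leaves less than an hour of total downtime a year, planned or unplanned.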

Their solution, which is highly complex with extreme levels of redundancy, has given them ZERO downtime in three years, zero data loss, and near 100% verified data availability. Their backup and storage architectures are really mind blowing. It is well worth reading the case study, here. If you want to read about their backup architecture, you can find the case study here. They can back up a 2TB database in 36 minutes. People were starting to laugh in the session at the extreme lengths this company went to in order to ensure verified zero data loss.
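For a sense of scale, the 2TB-in-36-minutes backup figure can be turned into a sustained throughput number. This is only a rough sketch: it assumes decimal units (1TB = 10^12 bytes) and a constant transfer rate, and the function name is made up for illustration.

```python
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6


def backup_throughput_mb_per_s(size_tb: float, minutes: float) -> float:
    """Average throughput in MB/s needed to move size_tb in the given time."""
    total_bytes = size_tb * BYTES_PER_TB
    seconds = minutes * 60
    return total_bytes / seconds / BYTES_PER_MB


# 2 TB in 36 minutes works out to roughly 926 MB/s sustained.
rate = backup_throughput_mb_per_s(2, 36)
print(f"Sustained backup throughput: {rate:.0f} MB/s")
```

That is close to a gigabyte per second, sustained for the whole backup window.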

The key takeaway from this case study is that you need to document everything, have processes and procedures in place for every scenario, and have extremely highly skilled people. The technology is just one small piece of the entire design. It’s really the processes and people that enable these extreme levels of uptime and data availability. You can have all the technology in the world, but if your documentation is poor and you don’t have extremely highly skilled people, you will end up in a world of hurt and miss your SLA targets.

Another case study was ServiceU. In summary, they were able to upgrade from SQL 2005 to SQL 2008 and from Windows Server 2003 to 2008, and move to a new SAN and new server hardware, with less than 16 minutes of total downtime. This was accomplished without any virtualization product, through careful planning and orchestration of the upgrades.

Other case studies included QR Limited, Progressive Insurance, and an Asian travel company. The bottom line is that SQL can provide highly robust HA/DR if you have the right architecture, documentation, processes, and highly skilled people.