Sunday, November 23, 2008

Design Considerations for Oracle RAC High Availability

I have recently been working with clients to implement and review Oracle 10g RAC environments. One item that has
been of importance is how to design a truly redundant and highly available (HA) infrastructure.

The goal is to avoid any single point of failure (SPOF) while maximizing performance and scalability.

Here are my notes:

Hardware Considerations
Implement 3-4 nodes for the RAC design in the initial phase. With at least 3-4 nodes in a new RAC cluster, you are
better protected against the dreaded "Split Brain" condition: when the cluster partitions, a clear majority of nodes can survive, with the voting disks arbitrating which side stays up. If you lose a single node, you will still have 2-3 nodes
to use for failover and servicing current mission-critical applications.
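
To put a rough number on the node-count advice, here is a minimal sketch of the extra load each surviving node picks up after a failure. It assumes the workload is spread evenly across identical nodes, which real clusters only approximate, and the function name is just for illustration.

```python
def load_increase_after_failures(nodes: int, failed: int = 1) -> float:
    """Fractional increase in per-node load when `failed` of `nodes` nodes are lost.

    Assumes identical nodes and an evenly balanced workload (a simplification).
    """
    if nodes <= 0 or failed < 0 or failed >= nodes:
        raise ValueError("need at least one surviving node")
    survivors = nodes - failed
    # Each survivor goes from 1/nodes of the work to 1/survivors of the work.
    return nodes / survivors - 1.0


for n in (2, 3, 4):
    extra = load_increase_after_failures(n)
    print(f"{n} nodes, 1 lost: +{extra:.0%} load on each surviving node")
```

Under these assumptions, a 2-node cluster doubles the load on the lone survivor, while a 4-node cluster adds only about a third to each remaining node.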

Network Considerations for RAC
1. Implement multiple switches for both the interconnect and the storage network to avoid loss of communication between nodes in the RAC cluster.
The danger of using only a single switch is that if it fails, the entire RAC cluster will crash and the result will be
downtime. I see a lot of clients skimp on this item. By using 2 switches at both the interconnect (private network) and storage level (i.e., the SAN fabric or iSCSI layer), you protect yourself against a network failure.

2. Use a fat pipe for the interconnect. Go with at least Gigabit Ethernet, and preferably 10 Gigabit Ethernet, for best throughput.
Even better, InfiniBand has robust performance for heavy-duty applications.

3. Implement multiple dual-homed NICs so that the loss of a single network adapter does not take a server off the network. Some platforms
have this redundancy built in at the operating system level, such as Sun's IPMP (IP Network Multipathing) on Solaris.
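
One way to keep yourself honest about the network tips above is to write down how many independent components you have at each layer and flag anything with fewer than two. The sketch below is only an illustration: the `Layer` class, the layer names, and the counts are hypothetical, not taken from any real environment or Oracle tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Layer:
    name: str
    independent_paths: int  # number of redundant components at this layer


def single_points_of_failure(layers: Iterable[Layer]) -> List[str]:
    """Return the names of layers that have fewer than two independent components."""
    return [layer.name for layer in layers if layer.independent_paths < 2]


# Hypothetical inventory for one cluster design (example values only).
inventory = [
    Layer("interconnect switches", 1),
    Layer("storage switches", 2),
    Layer("interconnect NICs per node", 2),
]

print(single_points_of_failure(inventory))
```

In this made-up inventory, only the interconnect switch layer would be flagged, which is exactly the item I see skipped most often.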

Storage Considerations for Oracle 10g/11g RAC
Invest in the best SAN possible with Fibre Channel (FC/FC-AL) for best performance and support.

Fibre Channel is the best overall performance and HA solution for enterprise storage. Another suitable option is iSCSI, which provides many similar benefits.

Other tips for Oracle 10g/11g RAC Design

1. Mirror and protect multiple copies of the Oracle 10g/11g Clusterware files: have several copies of the OCR (Oracle Cluster Registry) and Voting Disks. These files are small, and if you only have a single copy, guess what will happen to your entire RAC cluster if you lose it? The entire RAC cluster will fail, because Clusterware can no longer read its configuration or determine cluster membership. I have seen clients that have 1 copy of the OCR and voting disk, and they put their RAC clustered environments at great risk. Use an odd number of voting disks (3 is a common choice), since a node must be able to access more than half of them to stay in the cluster. Even if the Unix or storage administrator says the files are mirrored at the storage level, one cannot be too cautious: keep multiple copies.

2. Use multiple ASM Disk Groups.

At least 4-5 ASM disk groups are recommended to split up the various Oracle 10g/11g database files for performance and availability reasons. For example, we can have the following sample ASM configuration:

+FLASHDG for flash recovery area within ASM to store backups and archivelogs
+DATADG for Oracle 10g/11g data files
+INDEXDG for Oracle 10g/11g indexes
+DATADG2 for additional Oracle 10g/11g application database files
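
Going back to tip 1, the case for multiple voting disks comes down to simple majority arithmetic: a node has to see more than half of the configured voting disks to remain in the cluster. The sketch below just works through that arithmetic; it is not Oracle code, and the function name is my own.

```python
def tolerated_voting_disk_failures(voting_disks: int) -> int:
    """How many voting disks can be lost while a strict majority remains reachable."""
    if voting_disks < 1:
        raise ValueError("at least one voting disk is required")
    majority = voting_disks // 2 + 1  # strictly more than half
    return voting_disks - majority


for count in (1, 2, 3, 4, 5):
    print(f"{count} voting disk(s): survives loss of "
          f"{tolerated_voting_disk_failures(count)}")
```

Note that going from 1 to 2 disks, or from 3 to 4, buys no extra tolerance, which is why an odd number of voting disks is the usual choice.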

Implement Oracle Data Guard for RAC
While Oracle RAC provides performance, scalability, and protection against the failure of a single node or instance in the cluster, it does not protect against data loss if the RAC database itself suffers a media failure. This is because the RAC cluster nodes all share the same database. Many folks incorrectly assume that RAC is a total HA solution. It is not. I recommend that a physical standby database be implemented with RAC environments to protect against data loss and to remove the shared database as a single point of failure (SPOF), which is the Achilles heel of RAC. By using Oracle Data Guard with RAC, you gain failover and switchover features to protect against data loss. Downtime is bad enough for an already stressed DBA to worry about, but data loss will get a DBA fired and can potentially put a company out of business. Thus, Data Guard is the perfect complement to RAC for a comprehensive HA solution as part of the Oracle Maximum Availability Architecture (MAA).


Implement and Test RMAN Backup and Recovery with Oracle 10g/11g RAC
Oracle provides the ultimate backup and recovery tool, Recovery Manager (RMAN), for free out of the box to provide essential backup and recovery for complex RAC environments. User-managed hot backups were fine years ago before the RMAN age but, sorry folks, they really do not cut the mustard in modern times. RMAN provides a ton of features, such as block media recovery and point-in-time recovery, that are not available with the old user-managed backups. Plus, RMAN can be used to clone RAC databases, to build standby databases for Data Guard, and to back up and recover database files stored in ASM disk groups with Oracle 10g and 11g RAC. Whatever you implement, test the restore, not just the backup.
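
As a starting point for the "test" part, here is a small sketch that assembles a basic RMAN command script: a full backup with archived logs, followed by a restore validation that reads the backups without actually restoring anything. The Python helper and its parameter name are hypothetical; `BACKUP DATABASE PLUS ARCHIVELOG` and `RESTORE DATABASE VALIDATE` are standard RMAN commands, but a real job would also need channels, retention, and destinations configured for your environment.

```python
def build_rman_script(validate_restore: bool = True) -> str:
    """Return the text of a minimal RMAN command script (illustrative only)."""
    commands = [
        # Full database backup, including the archived redo logs.
        "BACKUP DATABASE PLUS ARCHIVELOG;",
    ]
    if validate_restore:
        # Check that the backups needed for a restore are present and readable,
        # without writing any datafiles.
        commands.append("RESTORE DATABASE VALIDATE;")
    return "\n".join(commands) + "\n"


print(build_rman_script())
```

The generated text could be saved to a command file and run through the RMAN client on a schedule, so that every backup is followed by a validation pass.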





Hope these tips and tricks help you with building a reliable and stable RAC environment!

Cheers,
Ben
