A few weeks back I was trying to configure a Windows 2000 cluster for a client. Nothing too unusual about that, but we still came across a couple of issues:
Firstly, the shared disk was on a SAN, with Veritas Volume Manager providing dynamic multi-path support. Veritas’ document describing how Volume Manager works with Microsoft Cluster Server suggests that once the Microsoft Cluster service is installed, it is possible to modify an existing Volume Manager installation to install the Volume Manager DLLs. We found that it is not enough for the Cluster service to be installed (i.e. present but without the Cluster Service Installation Wizard having been run): the cluster has to be fully configured before the Volume Manager MSCS Support will install correctly.
The second issue was far more difficult to diagnose. The first cluster node would install without issue; however, when we attempted to join the second node to the cluster, it would fail and report the following message in %systemroot%\cluster\cluster.log:
[JOIN] Unable to get join version data from sponsor xxx.xxx.xxx.xxx using NTLM package, status 5.
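If you need to check a log for this failure, something like the following might help. It is only a rough sketch: the pattern is built from the single log line quoted above, so other variants of the message are not handled, and the function name `find_join_failures` is my own invention. The one hard fact in it is that Win32 status 5 is ERROR_ACCESS_DENIED, which is the clue that this is an authentication problem.

```python
import re

# Win32 status codes this sketch recognises; 5 is ERROR_ACCESS_DENIED.
WIN32_STATUS = {5: "ERROR_ACCESS_DENIED (access is denied)"}

# Pattern derived from the one log line quoted in this post (an assumption:
# other wordings of the message are not covered).
JOIN_FAILURE = re.compile(
    r"\[JOIN\] Unable to get join version data from sponsor (?P<sponsor>\S+) "
    r"using (?P<package>\S+) package, status (?P<status>\d+)"
)


def find_join_failures(lines):
    """Return (sponsor, package, status, meaning) for each matching log line."""
    failures = []
    for line in lines:
        match = JOIN_FAILURE.search(line)
        if match:
            status = int(match.group("status"))
            meaning = WIN32_STATUS.get(status, "unknown status")
            failures.append(
                (match.group("sponsor"), match.group("package"), status, meaning)
            )
    return failures


# The sample is the message exactly as quoted above, address masked.
sample = [
    "[JOIN] Unable to get join version data from sponsor xxx.xxx.xxx.xxx "
    "using NTLM package, status 5.",
]
for failure in find_join_failures(sample):
    print(failure)
```

In practice you would pass in the lines read from the cluster log on the joining node rather than a hard-coded sample.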
The problem turned out to be related to the password length for the Cluster service account. When you set or change a password, Windows generates both a LAN Manager Hash (LM hash) and a Microsoft Windows NT hash (NT hash) of the password. These hashes are stored in the local Security Accounts Manager (SAM) database or in Active Directory; however LM hashes are relatively weak and are often disabled for security reasons, as described in Microsoft Knowledge Base article 299656.
Our Active Directory was running under Windows Server 2003 (albeit in Windows 2000 mixed mode), and because Windows Server 2003 is more secure by default, it does not store LM hash values for passwords.
According to Microsoft Knowledge Base article 272129, all cluster authentication is handled internally to the Cluster service. The only time the Cluster service contacts a domain controller for authentication is to validate the Cluster service account when the cluster is first formed. Every node that requests to join a cluster is validated using RPC communication over the private network by the node that owns the quorum resource. Only LM or NTLM authentication is used for this (i.e. not NTLM v2 or Kerberos).
According to Microsoft Knowledge Base article 828861, if a password of fewer than 15 characters is used for the Cluster service account, the setup process generates an LM hash to build a session key for authentication when joining the second node to the cluster. Because no LM hash is stored in Active Directory, the domain controller cannot build a matching session key and access is denied. When a password of 15 or more characters is used instead, the setup process cannot generate an LM hash, so the Windows NT password hash is used to derive the session key. The domain controller is able to generate a matching session key and authentication succeeds.
The result of all this is that the Cluster service account must use a password of 15 or more characters.
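The rule is simple enough to check before you set the account password. Here is a minimal sketch of that check, assuming only the 15-character threshold described above; the function names are hypothetical and nothing here talks to Active Directory or the Cluster service.

```python
# LM hashes can only be generated for passwords of up to 14 characters.
LM_HASH_MAX_LENGTH = 14


def lm_hash_can_be_generated(password):
    """True if the password is short enough for an LM hash to be produced."""
    return len(password) <= LM_HASH_MAX_LENGTH


def is_suitable_cluster_password(password):
    """True if the password forces the NT hash to be used (15+ characters).

    Hypothetical helper: it checks length only, not complexity or policy.
    """
    return not lm_hash_can_be_generated(password)


# Illustrative values only; never hard-code a real service account password.
for candidate in ("Short-Pass1", "A-much-longer-passphrase-2004"):
    verdict = "ok" if is_suitable_cluster_password(candidate) else "too short"
    print(len(candidate), verdict)
```

A length check like this says nothing about password strength; it only tells you whether the join is likely to hit the LM hash problem described here.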
Incidentally, there is a really useful best practice guide for installing the Microsoft Cluster service on the Microsoft website. Microsoft Knowledge Base article 278007 describes some of the new features in Windows Server 2003 clusters in comparison with Microsoft Windows 2000 Advanced Server and Microsoft Windows 2000 Datacenter Server.
I just wanted to add a few points I neglected to mention in the original post.
A common mistake is to view clustering as the answer to all downtime problems. This is simply not the case – clustering is not a panacea and all it really does is eliminate many of the single points of failure (SPoFs) in a server.
Remember that Microsoft clusters rely on a shared disk and this may represent a SPoF in itself (although usually implemented using a resilient fibre network).
Where clustering can help is with planned downtime – a node can be taken offline for maintenance whilst the others remain online and servicing clients.
One thing that clustering is not, is a replacement for sound disaster recovery practices.
It is also important to provide resilience for supporting network services (e.g. WINS, DNS and Active Directory).