I recently found myself working with a customer to troubleshoot issues with their Exchange Servers. The servers were losing contact with each other periodically and then doing what Exchange is designed to do in this circumstance: failing databases over to other servers. I'm not going to go into the exact details here for reasons of confidentiality, but the experience did underline my opinion (backed up by Microsoft's Preferred Architecture for Exchange) that virtualised Exchange solutions using SAN storage are overly complex and will cause issues at some point…
After going through the basics (are all the servers patched up to date, including consistent BIOS and firmware for all disks, controllers, NICs, etc.; has anti-virus software been disabled on the servers, especially host intrusion protection features; is the virtual infrastructure correctly configured, especially virtual networks; are there any tell-tale issues in the event logs?), I contacted my former colleague, Fujitsu Distinguished Engineer and Exchange Master, Nick Parlow (who fixes people’s broken Exchange solutions for a living). Nick gave me some very useful advice for troubleshooting Exchange that I’m going to share here…
- Experfwiz (the Exchange Performance Data Collection tool) can be run to collect information on how the servers are performing. There should be a lot of information in the Daily Performance Logs too (as described by Tony Redmond), as long as no-one has deleted the scheduled task to stop the logs from being created and taking up disk space!
- Take the output from Experfwiz (or the Daily Performance Logs) and run it through the Performance Analysis of Logs (PAL) Tool, which will produce an HTML report that can be used to identify issues based on recognised thresholds (and version 2.7.3 includes Exchange 2013 thresholds).
A few more points that might be useful:
- The Exchange Best Practices Analyzer (ExBPA) is back. Sort of. It’s now positioned as a tool to check Exchange before integrating with Office 365 and needs an Office 365 logon (as MVP Damian Scoles explains).
- Exchange DAG database failover is logged in Microsoft/Exchange/HighAvailability/Debug (look for Event IDs 326, 327 and 328). These could be used as a trigger to stop the Microsoft Network Monitor (NetMon) from running… at which point you should have a network trace showing what was happening when the databases failed over. (You could try something similar with Wireshark too.)
- MVP Nuno Mota has lots of advice on monitoring DAG failover on MSExchange.org.
- There’s a useful Exchange 2013 Performance Health Check Script in the TechNet Gallery. This helped to flag some minor issues that were worth fixing, such as Hyperthreading being enabled on the servers, an out-of-date version of the .NET Framework being installed and the power plan being set to balanced rather than high performance (this will kill performance on a virtual machine, and I can’t see why any server would use anything other than high performance).
- Test-ReplicationHealth is a useful cmdlet to know (MVP Paul Cunningham has a post about it too).
- Test-ServiceHealth is another one (and here’s Paul’s post on that cmdlet).
- http://servername/virtualdirectory/healthcheck.htm can be used to test the response from a given server for a given protocol (and test any load balancing).
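
If you have several servers and protocols to check, the healthcheck.htm test is easy to script. What follows is only a rough sketch in Python, not a tested monitoring tool. The server names and the list of virtual directories are placeholders I've made up for illustration, and the function names are my own. It uses plain http as in the URL pattern above; if you switch to https, the servers' certificates will need to be trusted by the machine running the script.

```python
"""Probe Exchange healthcheck.htm pages for a set of servers and protocols."""
import urllib.error
import urllib.request

# Hypothetical server names: replace with your own.
SERVERS = ["exch01.example.com", "exch02.example.com"]
# Example virtual directories (an assumption: adjust to the protocols you publish).
VIRTUAL_DIRECTORIES = ["owa", "ews", "Microsoft-Server-ActiveSync"]


def healthcheck_url(server, virtual_directory, scheme="http"):
    """Build scheme://server/virtualdirectory/healthcheck.htm."""
    return f"{scheme}://{server}/{virtual_directory}/healthcheck.htm"


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (HTTP status or None, detail text) for one health check URL.

    The opener parameter exists only so the function can be exercised
    without a real server; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            # Read only the start of the body; the page is expected to be tiny.
            body = response.read(200).decode("utf-8", errors="replace").strip()
            return response.status, body
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 404 or 503).
        return err.code, str(err.reason)
    except (urllib.error.URLError, OSError) as err:
        # No HTTP answer at all: DNS failure, refused connection, timeout...
        return None, str(err)


def report(servers=SERVERS, virtual_directories=VIRTUAL_DIRECTORIES):
    """Print one line per server/protocol combination."""
    for server in servers:
        for vdir in virtual_directories:
            url = healthcheck_url(server, vdir)
            status, detail = probe(url)
            label = "OK" if status == 200 else "PROBLEM"
            print(f"{label:8} {status!s:5} {url}  {detail}")
```

Calling `report()` prints a line for every combination, so a server that has stopped answering for one protocol stands out. Pointing `healthcheck_url()` at the load balancer's name rather than an individual server gives a quick check of the load-balanced path too.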