Today’s issue: extremely slow performance of a SQL Server guest running on Hyper-V (Windows Server 2008 R2 SP1).
This is a very common customer issue, and the first step is understanding why it is happening.
Could it be anti-virus software running file-level scans on the guest?
Could it be a runaway process consuming all of the I/O on the SAN?
Let’s take a quick look at the customer’s configuration of Hyper-V as a starting point:
3 x HP DL380 G7 Hyper-V cluster nodes (2 processors, 128 GB RAM, 8 GbE NICs)
4 x GbE NIC ports teamed for production
2 x GbE NIC ports teamed for heartbeat / Live Migration
2 x GbE NIC ports running the HP DSM for MPIO iSCSI
2 x HP 2910 switches with VLANs configured for iSCSI, heartbeat / Live Migration, and production servers
2 x HP P4300 SAN shelves sustaining approximately 2,500 IOPS maximum
As you can see above, we have a redundant pair of HP 2910 switches. The front-facing adapters are teamed, and the iSCSI connections use the HP DSM for Multi-Path I/O. The SAN this production infrastructure connects to is an HP P4300.
They currently have four CSV volumes and one quorum drive connected in this three-node Hyper-V failover cluster.
So where do we start with all of this? Users are complaining about slow performance of their databases and the applications that connect to them.
Here is where I always like to start: have a look at your SAN. (You must know some important facts about your hardware in order to troubleshoot effectively. For example, I know that this P4300 SAN will let me sustain approximately 2,500 IOPS.)
Where are we now?
I log into the P4300 Central Management Console and click on the P4300-Cluster and click on the Performance Node:
I can see right away that we are averaging approximately 1,054 IOPS, so there is some activity happening on the SAN. The problem is that this is January 3, 2013, and there are only 20 users in the office. Performance baselines I have run tell me that in a relatively idle state the guest workloads consume only around 400 IOPS. So I know that something is up.
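That sanity check is just a comparison against a known baseline. Here is a minimal sketch in Python; the baseline and tolerance figures come from this environment and are not universal, and the function name is my own:

```python
def iops_anomalous(observed_iops, idle_baseline, tolerance=1.5):
    """Flag SAN load that exceeds the expected idle baseline.

    observed_iops: current IOPS reading from the SAN console
    idle_baseline: IOPS measured during a known-idle period
    tolerance:     multiplier above baseline still considered normal
    """
    return observed_iops > idle_baseline * tolerance

# With this environment's numbers: ~1,054 IOPS observed vs ~400 idle
print(iops_anomalous(1054, 400))  # True -> something is up
```

The point is simply that 1,054 IOPS is well beyond any reasonable margin over a 400 IOPS idle baseline.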
Now to dig a bit deeper. As seen in the screenshot above, I have four CSV volumes created, and we need more detailed information about their individual performance. This will allow me to start isolating which guests are running on which volume. So we have to click on the Performance Monitor node and add some additional counter objects on the SAN.
As you can see in the screenshot below, the P4300 SAN allows quick analysis of the active iSCSI volumes presented to the servers. I simply click on Connected Volumes and Snapshots, select one of the Hyper-V nodes, and find the volumes in question. We really need to determine which volume on the SAN is consuming all of our IOPS.
Note: The counters I am interested in are IO Latency Read, IO Latency Write, IO Latency Total, IOPS Reads, IOPS Writes, IOPS Total, and Queue Depth. If the queue depth is too high, it tells me that we are starting to run out of overall IOPS on the SAN. The total IOPS can be correlated to a Performance Monitor counter that we will look at a bit later.
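The way I read those two counters together can be sketched as a rough rule of thumb. The thresholds below are illustrative assumptions for this ~2,500 IOPS array, not vendor figures; tune them to your own baselines:

```python
def san_pressure(queue_depth, iops_total, iops_max=2500, queue_limit=10):
    """Rough classification of SAN pressure from two counters.

    A sustained queue depth means requests are waiting faster than the
    array can service them; high total IOPS means we are near the
    array's ceiling. Both thresholds here are assumed, not measured.
    """
    if queue_depth > queue_limit or iops_total > 0.9 * iops_max:
        return "saturated"
    if iops_total > 0.5 * iops_max:
        return "busy"
    return "normal"

print(san_pressure(queue_depth=15, iops_total=1054))  # saturated
```

A deep queue at only moderate IOPS still counts as saturation: the array is falling behind even though the raw throughput number looks survivable.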
Now we can re-evaluate the results from the Performance Monitor node in the P4300 Central Management Console. Make sure you scroll down and see if we get lucky and find one volume that is consuming a majority of the IOPS on the SAN. Right away I can see that one volume, CSV1, is consuming almost all of the IOPS the SAN is serving up. Also of interest is that the I/O is reads, not writes, which normally indicates some type of active file-level scan occurring at the time.
Now, since the customer has indicated that the performance issue is on the production SQL Server, let’s start there:
As seen in the above screenshot, this workload is running on CSV volume 1, the volume that is consuming about 50% of the available I/O right now.
Let’s take a slightly deeper look. What I do next is perform a very simple test on the affected server: open Performance Monitor and add all of the physical disk counters.
What I am most interested in is Disk Transfers/sec, otherwise known as IOPS. This will tell me whether this server is the one consuming all of the I/O on that CSV volume. I checked this value before taking a screenshot, and this guest alone was consuming over 1,000 IOPS, so we know that something was happening on this guest. Unfortunately, by the time I got to this point, the customer advised that the problem had ceased and the issues were gone.
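For anyone unfamiliar with why Disk Transfers/sec equals IOPS: Perfmon derives it from the delta of a cumulative transfer counter over the sample interval. A minimal sketch of that calculation (counter values are made up for illustration):

```python
def disk_transfers_per_sec(count_start, count_end, t_start, t_end):
    """Derive IOPS the way a Disk Transfers/sec style counter does:
    the change in the cumulative transfer count divided by the
    length of the sample interval in seconds."""
    return (count_end - count_start) / (t_end - t_start)

# e.g. 15,000 transfers observed over a 15-second window ~= 1,000 IOPS
print(disk_transfers_per_sec(100_000, 115_000, 0.0, 15.0))  # 1000.0
```

This is also why a single guest showing ~1,000 on this counter lines up directly with the IOPS Total figure reported by the SAN.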
How could this happen? What might have been running and then suddenly stopped? Well, now it is time to do the dirty work: checking services and log files for indications of running file-level scans. Best to note the time right now (3:40 PM MST); we will need it as we gather information for root cause analysis.
So where to start? How about the beginning. Let’s start again with the P4300 Central Management Console to confirm that the IOPS have been reduced on the SAN. As seen in the screenshot below, the issue is clearly no longer occurring. (Note that the IOPS load on the SAN is now at 145. This normally averages about 300-400 on a busy day and spikes as high as 1,800 during peak load, distributed amongst 15 guests.)
So now that business is back to normal, we still need to figure out root cause. In my professional experience this type of issue is almost always caused by some type of file-level scan. Now you need to ask yourself a series of questions to move forward:
What kinds of things perform scheduled file-level scans on a server?
Is there an SCCM Agent Deployed?
What type of workload is this? SQL, Exchange, AD, Oracle, Citrix, etc.
Has this issue occurred before or is it new? If it is new what has changed?
So let’s start off with some basic troubleshooting and analysis:
Here are my notes from the analysis that was performed on this production SQL Server:
1. Log on to the server with admin rights
2. Check services (see if anything pops out that is out of the ordinary)
3. Check the Event Viewer (see if anything was triggered or running: a backup, a scan, etc.)
a. We know that it was read IOPS and not write IOPS. So was there a large job running on the SQL Server that the SAN could not keep up with?
4. I did notice a few things from the event log. I cannot take screenshots due to privileged information, but I can tell you the following:
a. There are errors related to the Windows Search service.
b. There was a Windows Installer installation for some HP hardware that shouldn’t have been running.
c. OfficeScan did not appear to be running, but we may want to consider disabling it. In a lot of my calls with Microsoft Support, conflicting filter drivers like this seem to be a common cause.
5. Since it is very hard to find out what is happening after the fact, I figured I would check the P4300 Central Management Console again. Luckily, we have CSV1 pinning the SAN again.
6. Quickly, let’s check Perfmon on the CGYSQL02 server.
a. Sure enough, there it is again.
7. Now we need to quickly download Process Monitor from Microsoft Sysinternals
a. Downloaded from http://technet.microsoft.com/en-ca/sysinternals/bb896645.aspx
b. This is really weird. The server is going nuts right now: Process Monitor has been running for two minutes and I already have over 400K operations captured. What is going on?
c. Time to dig in!
d. Well, as expected, Trend Micro OfficeScan is running, querying for files with the real-time scanner (pccntmon.exe):
e. I am seeing a TON of reads from LV_Engine.exe. This is LiveVault from Iron Mountain; it is scanning the file system for changed files and uploading them to Iron Mountain for offsite backups. We can also see below that these are indeed read operations, which would explain the high volume of read IOPS on the SAN. It also looks like the service is using Volume Shadow Copies to snag SQL backups.
f. There is a cool feature in Process Monitor where you can view Process Activity Summaries. As you can see below, we are definitely looking at two potential causes for our massive read IOPS issue on this server: LiveVault or Trend OfficeScan. What is interesting to me is that the combination of LiveVault and OfficeScan initiated over 500K reads in less than three minutes on this server.
g. At this point I am calling the client back to see if there is a possibility of us stopping either OfficeScan or LiveVault to see if it makes a difference.
h. I stopped the LiveVault backup service, and the IOPS consumption on the server stopped.
i. Next steps: have the client modify the LiveVault backup schedule so it does not run during peak business hours.
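The Process Activity Summary is essentially a group-by over the captured events. Here is a hypothetical sketch of the same rollup over exported Procmon events; the process names are the ones from this case, but the events themselves are made up for illustration:

```python
from collections import Counter

def top_readers(events):
    """Count read operations per process, highest first.

    events: iterable of (process_name, operation) tuples, e.g. parsed
    from a Process Monitor CSV export.
    """
    reads = Counter(proc for proc, op in events if op == "ReadFile")
    return reads.most_common()

events = [
    ("LV_Engine.exe", "ReadFile"),
    ("LV_Engine.exe", "ReadFile"),
    ("pccntmon.exe", "ReadFile"),
    ("sqlservr.exe", "WriteFile"),
]
print(top_readers(events))  # [('LV_Engine.exe', 2), ('pccntmon.exe', 1)]
```

Sorting by read count is exactly what surfaces LV_Engine.exe as the dominant reader rather than the SQL Server process itself.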
This case was an interesting one for me. It takes a lot of combined knowledge and understanding to figure out what the root cause of an issue really is. As it turns out, the slowdown in the Hyper-V cluster was caused by this one machine running backups during peak business hours. When you move to a converged-fabric “cloud” style environment for your virtualization, you must pay particular attention to what types of workloads are running, and you must have deep insight into the applications running on those workloads.
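The rescheduling fix is worth making mechanical. A small sketch for validating a proposed backup start time against the business window; the 08:00-17:00 window is an assumption, so adjust it to the site's actual hours:

```python
from datetime import time

def outside_business_hours(start, open_t=time(8, 0), close_t=time(17, 0)):
    """Return True if a backup start time avoids the business window.
    The 08:00-17:00 window is an assumed default, not the client's
    actual schedule."""
    return start < open_t or start >= close_t

print(outside_business_hours(time(15, 40)))  # False -> 3:40 PM is peak
print(outside_business_hours(time(22, 0)))   # True  -> safe to run
```

Note that 3:40 PM, the time we logged when the saturation was observed, fails the check, which is exactly why the LiveVault schedule needed to move.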
At the beginning of this case I suspected a file-level scan from anti-virus software or an agent. As it turns out, it was indeed an agent, one running backups.
What I have shown you today will hopefully empower you with more knowledge to actively troubleshoot issues at your site.