The current project I’m on has a requirement for being able to determine a server’s overall performance before and after a migration, mostly to make sure that it still functions the same or better once its on the new platform. Whilst it’s easy enough to get raw statistics from perfmon getting an at-a-glance view of how a server is performing before and after a migration is a far more nuanced concept, one that’s not easily accomplished with some Excel wizardry. With that in mind I thought I’d share with you my idea for creating such a view as well as outlining the challenges I’ve hit when attempting to collate the data.
At a high level I’ve focused on the 4 core resources that all operating systems consume: CPU, RAM, disk and network. For the most part these metrics are easily captured by the counters that perfmon has however I wanted to go a bit further to make sure that the final comparisons represented a more “true” picture of before and after performance. To do this I included some additional qualifying metrics which would show if increased resource usage was negatively impacting on performance or if it was just the server consuming more resources because it could since the new platform had much more capacity. With that in mind these are the metrics I settled on using:
- Average of CPU usage (24 hours), Percentage, Quantitative
- CPU idle time on virtual host of VM (24 hours), Percentage, Qualifying
- Top 5 services by CPU usage, List, Qualitative
- Average of Memory usage (24 hours), Percentage, Quantitative
- Average balloon driver memory usage (24 hours), MB consumed, Qualifying
- Top 5 services by Memory usage, List, Qualitative
- Average of Network usage (24 hours), Percentage, Quantitative
- Average TCP retransmissions (24 hours), Total, Qualifying
- Top 5 services by Network bandwidth utilized, List, Qualitative
- Average of Disk usage (24 hours), Percentage, Quantitative
- Average queue depth (24 hours), Total, Qualifying
- Top 5 services by Storage IOPS/Bandwidth utilized, List, Qualitative
Essentially these metrics can be broken down into 3 categories: quantitative, qualitative and qualifying. Quantitative metrics are the base metrics which will form the main part of the before and after analysis. Qualitative metrics are mostly just informational (being the Top 5 consumers of X resource) however they’ll provide some useful insight into what might be causing an issue. For example if an SQL box isn’t showing the SQL process as a top consumer then it’s likely something is causing that process to take a dive before it can actually use any resources. Finally the qualifying metrics are used to indicate whether or not increased usage of a certain metric signals an impact to the server’s performance like say if the memory usage is high and the memory balloon size is high it’s quite likely the system isn’t performing very well.
The vast majority of these metrics are provided in perfmon however there were a couple that I couldn’t seem to get through the counters, even though I could see them in Resource Monitor. As it turns out Resource Monitor makes use of the Event Tracing for Windows (ETW) framework which gives you an incredibly granular view of all events that are happening on your machine. What I was looking for was a breakdown of disk and network usage per process (in order to generate the Top 5 users list) which is unfortunately bundled up in the IO counters available in perfmon. In order to split these out you have to run a Kernel Trace through ETW and then parse the resulting file to get the metrics you want. It’s a little messy but unfortunately there’s no good way to get those metrics separated. The resulting perfmon profile I created can be downloaded here.
The next issue I’ve run into is getting the data into a readily digestible format. You see not all servers are built the same and not all of them run the same amount of software. This means that when you open up the resulting CSV file from different servers the column headers won’t line up so you’ve got to either do some tricky Excel work (which is often prone to failure) or get freaky with some PowerShell (which is messy and complicated). I decided to go for the latter as at least I could maintain and extend the script somewhat easily whereas an Excel spreadsheet has a tendency to get out of control faster than anyone expects. That part is still a work in progress however but I’ll endeavour to update this post with the completed script once I’ve got it working.
After that point it’s a relatively simple task of displaying everything in a nicely formatted Excel spreadsheet and doing comparisons based on the metrics you’ve generated. If I had more time on my hands I probably would’ve tried to integrate it into something like a SharePoint BI site so we could do some groovy tracking and intelligence on it but due to tight time constraints I probably won’t get that far. Still a well laid out spreadsheet isn’t a bad format for presenting such information, especially when you can colour everything green when things are going right.
I’d be keen to hear other people’s thoughts on how you’d approach a problem like this as trying to quantify the nebulous idea of “server performance” has proven to be far more challenging than I first thought it would be. Part of this is due to the data manipulation required but it was also ensuring that all aspects of a server’s performance were covered and converted down to readily digestible metrics. I think I’ve gotten close to a workable solution with this but I’m always looking for ways to improve it or if there’s a magical tool out there that will do this all for me 😉