Friday, November 12, 2010

Performance monitors

Imagine you just arrived at the office on a Monday morning and you’re greeted by an eager user who is complaining that his server is running too slowly. How do you even begin to help him? Performance Monitor, a handy tool built into Windows®, can assist you in diagnosing the problem.
You can access Performance Monitor by typing perfmon at the command prompt or by selecting Performance (or, in Windows Vista® and Windows Server® 2008, Reliability and Performance Monitor) from the Administrative Tools menu. To add performance counters and objects to be monitored, you simply click the plus sign and select from a host of possible choices.
So how do you measure the pulse of a server? There are more than 60 basic performance objects, and each object contains multiple counters. In this article, I will discuss the counters that reveal the vital signs of a server, and I will describe the typical sampling intervals that Microsoft® Service Support engineers use most often to troubleshoot performance-related issues.
Of course, a baseline provides a critical reference point when troubleshooting. Since the server load depends on the business requirements and also varies from time to time depending on the business cycle, it is important to establish a baseline determined by the normal workload over a specified period of time. That allows you to observe changes and identify trends.

Making the Results More Readable
Before I dive into an analysis of the counters that represent the vital signs of servers, I'll tell you about two tricks that make the Performance Monitor output much easier to read. Note that these tricks are not needed in Windows Vista and Windows Server 2008, but if you are running Performance Monitor on earlier versions of Windows, these two tweaks can come in very handy.
First, you can remove all the distracting sample noise that obscures the graphical view of trend lines. In Windows Vista and Windows Server 2008, Performance Monitor can display up to 1000 data points in graphical view. In previous versions of Windows, the limit is only 100 data points. When there are more than 100 points, Performance Monitor "buckets" the data points. A bucket is represented by a vertical line, indicating the minimum, average, and maximum of the sample points included in the bucket.
As you can tell by looking at the graph in Figure 1, it is difficult to spot the trend line when so much data is displayed at the same time. The Figure 2 graph shows how much easier it is to grasp the data quickly when all the extraneous visual information has been turned off. For details on how you can turn off these vertical lines, see the Knowledge Base article that is available at support.microsoft.com/kb/283110.

Figure 1 Performance data shown with distracting buckets and no commas

Figure 2 A cleaner view of data with comma separators
The second trick is to add comma separators in the numbers, making it much easier to read the values shown in the counters. Windows Vista and Windows Server 2008 have comma separators enabled by default. In previous versions of Windows, however, Performance Monitor does not enable commas by default.
This may not sound like it would make a huge difference, but take a look at Figure 1, which shows the performance counters without commas, and then look at Figure 2, which shows the counters with commas. I find the latter much more readable. For some simple instructions on adding comma separators to your performance counters in Windows XP, take a look at the Knowledge Base article at support.microsoft.com/kb/300884.

What and When to Measure
Bottlenecks occur when a resource reaches its capacity, causing the performance of the entire system to slow down. Bottlenecks are typically caused by insufficient or misconfigured resources, malfunctioning components, and incorrect requests for resources by a program.
There are five major resource areas that can cause bottlenecks and affect server performance: physical disk, memory, process, CPU, and network. If any of these resources are overutilized, your server or application can become noticeably slow or can even crash. I will go through each of these five areas, giving guidance on the counters you should be using and offering suggested thresholds to measure the pulse of your servers.
Since the sampling interval has a significant impact on the size of the log file and the server load, you should set the sample interval based on the average elapsed time for the issue to occur so you can establish a baseline before the issue occurs again. This will allow you to spot any trend leading to the issue.
Fifteen minutes will provide a good window for establishing a baseline during normal operations. Set the sample interval to 15 seconds if the average elapsed time for the issue to occur is about four hours. If the time for the issue to occur is eight hours or more, set the sampling interval to no less than five minutes; otherwise, you will end up with a very large log file, making it more difficult to analyze the data.
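As a rough sketch, the relationship between sampling interval, observation window, and log size can be worked out as follows. The bytes-per-sample figure is a hypothetical placeholder; the actual size depends on how many counters you log.

```python
# Estimate how many samples a Performance Monitor log will contain for a
# given sampling interval and observation window, and roughly how large
# the log file will grow. bytes_per_sample is an illustrative guess.

def estimate_log(interval_seconds, window_hours, bytes_per_sample=2048):
    samples = int(window_hours * 3600 / interval_seconds)
    size_mb = samples * bytes_per_sample / (1024 * 1024)
    return samples, size_mb

# 15-second interval over a 4-hour window: 960 samples
print(estimate_log(15, 4))
# 5-minute interval over an 8-hour window: only 96 samples
print(estimate_log(300, 8))
```

The same arithmetic shows why a 15-second interval over an 8-hour window would be excessive: 1,920 samples, double the log size, with little additional insight.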

Hard Disk Bottleneck
Since the disk system stores and handles programs and data on the server, a bottleneck affecting disk usage and speed will have a big impact on the server's overall performance.
Please note that if the disk objects have not been enabled on your server, you need to use the command-line tool Diskperf to enable them. Also, note that % Disk Time can exceed 100 percent and, therefore, I prefer to use % Idle Time, Avg. Disk Sec/Read, and Avg. Disk Sec/Write to give me a more accurate picture of how busy the hard disk is. You can find more on % Disk Time in the Knowledge Base article available at support.microsoft.com/kb/310067.
Following are the counters the Microsoft Service Support engineers rely on for disk monitoring.
LogicalDisk\% Free Space This measures the percentage of free space on the selected logical disk drive. Take note if this falls below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious solution here is to add more disk space.
PhysicalDisk\% Idle Time This measures the percentage of time the disk was idle during the sample interval. If this counter falls below 20 percent, the disk system is saturated. You may consider replacing the current disk system with a faster disk system.
PhysicalDisk\Avg. Disk Sec/Read This measures the average time, in seconds, to read data from the disk. If the number is larger than 25 milliseconds (ms), that means the disk system is experiencing latency when reading from the disk. For mission-critical servers hosting SQL Server® and Exchange Server, the acceptable threshold is much lower, approximately 10 ms. The most logical solution here is to replace the current disk system with a faster disk system.
PhysicalDisk\Avg. Disk Sec/Write This measures the average time, in seconds, it takes to write data to the disk. If the number is larger than 25 ms, the disk system experiences latency when writing to the disk. For mission-critical servers hosting SQL Server and Exchange Server, the acceptable threshold is much lower, approximately 10 ms. The likely solution here is to replace the disk system with a faster disk system.
PhysicalDisk\Avg. Disk Queue Length This indicates how many I/O operations are waiting for the hard drive to become available. If the value here is larger than two times the number of spindles, the disk itself may be the bottleneck.
Memory\Cache Bytes This indicates the amount of memory being used for the file system cache. There may be a disk bottleneck if this value is greater than 300MB.
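The disk thresholds above can be sketched as a simple check. The counter names match Performance Monitor's, but the sample values and spindle count below are hypothetical; on a real server, the values would come from a Perfmon log.

```python
# Evaluate sampled disk counter values against the thresholds described
# above. Latency thresholds are in seconds (25 ms = 0.025), matching the
# units of the Avg. Disk Sec/Read and Avg. Disk Sec/Write counters.

DISK_THRESHOLDS = {
    "LogicalDisk\\% Free Space": lambda v, spindles: v < 15,
    "PhysicalDisk\\% Idle Time": lambda v, spindles: v < 20,
    "PhysicalDisk\\Avg. Disk Sec/Read": lambda v, spindles: v > 0.025,
    "PhysicalDisk\\Avg. Disk Sec/Write": lambda v, spindles: v > 0.025,
    "PhysicalDisk\\Avg. Disk Queue Length": lambda v, spindles: v > 2 * spindles,
}

def disk_alerts(samples, spindles):
    """Return the names of counters whose sampled values cross their thresholds."""
    return [name for name, crossed in DISK_THRESHOLDS.items()
            if name in samples and crossed(samples[name], spindles)]

# Hypothetical sampled values for illustration
samples = {
    "LogicalDisk\\% Free Space": 9.0,           # below the 15 percent floor
    "PhysicalDisk\\% Idle Time": 55.0,          # healthy
    "PhysicalDisk\\Avg. Disk Sec/Read": 0.031,  # 31 ms, above the 25 ms limit
    "PhysicalDisk\\Avg. Disk Queue Length": 3.0,
}
print(disk_alerts(samples, spindles=4))
```

With four spindles, the queue-length threshold is 8, so a queue of 3 raises no alert; only the free-space and read-latency counters cross their limits here.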

Memory Bottleneck
A memory shortage is typically due to insufficient RAM, a memory leak, or a memory switch placed inside the boot.ini. Before I get into memory counters, I should discuss the /3GB switch.
More memory reduces disk I/O activity and, in turn, improves application performance. The /3GB switch was introduced in Windows NT® as a way to provide more memory for the user-mode programs.
Windows uses a virtual address space of 4GB (independent of how much physical RAM the system has). By default, the lower 2GB are reserved for user-mode programs and the upper 2GB are reserved for kernel-mode programs. With the /3GB switch, 3GB are given to user-mode processes. This, of course, comes at the expense of the kernel memory, which will have only 1GB of virtual address space. This can cause problems because Pool Non-Paged Bytes, Pool Paged Bytes, Free System Page Tables Entries, and desktop heap are all squeezed together within this 1GB space. Therefore, the /3GB switch should only be used after thorough testing has been done in your environment.
This is a consideration if you suspect you are experiencing a memory-related bottleneck. If the /3GB switch is not the cause of the problems, you can use these counters for diagnosing a potential memory bottleneck.
Memory\% Committed Bytes in Use This measures the ratio of Committed Bytes to the Commit Limit—in other words, the amount of virtual memory in use. This indicates insufficient memory if the number is greater than 80 percent. The obvious solution for this is to add more memory.
Memory\Available Mbytes This measures the amount of physical memory, in megabytes, available for running processes. If this value is less than 5 percent of the total physical RAM, that means there is insufficient memory, and that can increase paging activity. To resolve this problem, you should simply add more memory.
Memory\Free System Page Table Entries This indicates the number of page table entries not currently in use by the system. If the number is less than 5,000, there may well be a memory leak.
Memory\Pool Non-Paged Bytes This measures the size, in bytes, of the non-paged pool. This is an area of system memory for objects that cannot be written to disk but instead must remain in physical memory as long as they are allocated. There is a possible memory leak if the value is greater than 175MB (or 100MB with the /3GB switch). A typical Event ID 2019 is recorded in the system event log.
Memory\Pool Paged Bytes This measures the size, in bytes, of the paged pool. This is an area of system memory used for objects that can be written to disk when they are not being used. There may be a memory leak if this value is greater than 250MB (or 170MB with the /3GB switch). A typical Event ID 2020 is recorded in the system event log.
Memory\Pages per Second This measures the rate at which pages are read from or written to disk to resolve hard page faults. A value greater than 1,000 indicates excessive paging and, possibly, a memory leak.
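Two of the memory thresholds above are worth working through, since one is relative (5 percent of installed RAM) and one is absolute (80 percent committed). A minimal sketch, with hypothetical sampled values:

```python
# Check Available MBytes against 5 percent of installed physical RAM,
# and % Committed Bytes in Use against the 80 percent ceiling described
# above. Input values are illustrative samples, not live counter reads.

def memory_alerts(total_ram_mb, available_mb, committed_pct):
    alerts = []
    if available_mb < 0.05 * total_ram_mb:
        alerts.append("Available MBytes below 5% of physical RAM")
    if committed_pct > 80:
        alerts.append("% Committed Bytes in Use above 80%")
    return alerts

# A 16 GB server with 600 MB available (the 5% floor is about 819 MB)
# and 85% of the commit limit in use trips both thresholds.
print(memory_alerts(total_ram_mb=16384, available_mb=600, committed_pct=85))
```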

Processor Bottleneck
An overwhelmed processor can be due to the processor itself not offering enough power or it can be due to an inefficient application. You must double-check whether the processor spends a lot of time in paging as a result of insufficient physical memory. When investigating a potential processor bottleneck, the Microsoft Service Support engineers use the following counters.
Processor\% Processor Time This measures the percentage of elapsed time the processor spends executing a non-idle thread. If the percentage is greater than 85 percent, the processor is overwhelmed and the server may require a faster processor.
Processor\% User Time This measures the percentage of elapsed time the processor spends in user mode. If this value is high, the server is busy with the application. One possible solution here is to optimize the application that is using up the processor resources.
Processor\% Interrupt Time This measures the time the processor spends receiving and servicing hardware interrupts during specific sample intervals. This counter indicates a possible hardware issue if the value is greater than 15 percent.
System\Processor Queue Length This indicates the number of threads in the processor queue. The server doesn't have enough processor power if the value is more than two times the number of CPUs for an extended period of time.

Network Bottleneck
A network bottleneck, of course, affects the server's ability to send and receive data across the network. It can be an issue with the network card on the server, or perhaps the network is saturated and needs to be segmented. You can use the following counters to diagnose potential network bottlenecks.
Network Interface\Bytes Total/Sec This measures the rate at which bytes are sent and received over each network adapter, including framing characters. The network is saturated if you discover that more than 70 percent of the interface is consumed. For a 100-Mbps NIC, saturation occurs at about 8.75MB/sec (100Mbps = 12.5MB/sec; 12.5MB/sec * 70 percent = 8.75MB/sec). In a situation like this, you may want to add a faster network card or segment the network.
Network Interface\Output Queue Length This measures the length of the output packet queue, in packets. There is network saturation if the value is more than 2. You can address this problem by adding a faster network card or segmenting the network.
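The NIC saturation arithmetic above generalizes to any link speed: convert megabits to megabytes per second, then apply the 70 percent threshold. A worked version:

```python
# Compute the Bytes Total/Sec value at which a NIC should be considered
# saturated, given its link speed in megabits per second and the
# 70 percent rule of thumb described above.

def saturation_threshold_mb(link_mbps, pct=0.70):
    mb_per_sec = link_mbps / 8   # e.g., 100 Mbps = 12.5 MB/sec
    return mb_per_sec * pct      # 70 percent of raw capacity

print(saturation_threshold_mb(100))    # 100-Mbps NIC: 8.75 MB/sec
print(saturation_threshold_mb(1000))   # gigabit NIC: 87.5 MB/sec
```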

Process Bottleneck
Server performance will be significantly affected if you have a misbehaving process or non-optimized processes. Thread and handle leaks will eventually bring down a server, and excessive processor usage will bring a server to a crawl. The following counters are indispensable when diagnosing process-related bottlenecks.
Process\Handle Count This measures the total number of handles that are currently open by a process. This counter indicates a possible handle leak if the number is greater than 10,000.
Process\Thread Count This measures the number of threads currently active in a process. There may be a thread leak if the difference between the minimum and maximum number of threads is more than 500.
Process\Private Bytes This indicates the amount of memory that this process has allocated that cannot be shared with other processes. There may be a memory leak if the difference between the minimum and maximum values is greater than 250MB.

Wrapping Up
Now you know what counters the Service Support engineers at Microsoft use to diagnose various bottlenecks. Of course, you will most likely come up with your own set of favorite counters tailored to suit your specific needs. You may want to save time by not having to add all your favorite counters manually each time you need to monitor your servers. Fortunately, there is an option in the Performance Monitor that allows you to save all your counters in a template for later use.
You may still be wondering whether you should run Performance Monitor locally or remotely. And exactly what will the performance hit be when running Performance Monitor locally? This all depends on your specific environment. The performance hit on the server is almost negligible if you set intervals to at least five minutes.
You may want to run Performance Monitor locally if you know there is a performance issue on the server, since Performance Monitor may not be able to capture data from a remote machine when it is running out of resources on the server. Running it remotely from a central machine is really best suited to situations when you want to monitor or baseline multiple servers.
Interpreting CPU Utilization for Performance Analysis
Published 06 August 09 09:02 PM | winsrvperf
CPU hardware and features are rapidly evolving, and your performance testing and analysis methodologies may need to evolve as well. If you rely on CPU utilization as a crucial performance metric, you could be making some big mistakes interpreting the data. Read this post to get the full scoop; experts can scroll down to the end of the article for a summary of the key points.

If you’re the type of person who frequents our server performance blog, you’ve probably seen (or watched) this screen more than a few times:


This is, of course, the Performance tab in Windows Task Manager. While confusion over the meaning of the Physical Memory counters is a regular question we field on the perf team, today I’m going to explain how CPU utilization (referred to here as CPU Usage) may not mean what you would expect!

[Note: In the screenshot above, CPU utilization is shown as a percentage in the top left. The two graphs on the top right show a short history of CPU usage for two cores. Each core gets its own graph in Task Manager.]

CPU utilization is a key performance metric. It can be used to track CPU performance regressions or improvements, and is a useful datapoint for performance problem investigations. It is also fairly ubiquitous; it is reported in numerous places in the Windows family of operating systems, including Task Manager (taskmgr.exe), Resource Monitor (resmon.exe), and Performance Monitor (perfmon.exe).

The concept of CPU utilization used to be simple. Assume you have a single core processor fixed at a frequency of 2.0 GHz. CPU utilization in this scenario is the percentage of time the processor spends doing work (as opposed to being idle). If this 2.0 GHz processor does 1 billion cycles worth of work in a second, it is 50% utilized for that second. Fairly straightforward.
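The single-core arithmetic above can be written out explicitly: utilization is the busy cycles divided by the cycles the core could have delivered in the interval.

```python
# Classic fixed-frequency CPU utilization: busy cycles as a fraction of
# the cycles available at the rated frequency over the interval.

def utilization_pct(busy_cycles, frequency_hz, interval_seconds=1.0):
    return 100.0 * busy_cycles / (frequency_hz * interval_seconds)

# 1 billion cycles of work on a 2.0 GHz core over one second -> 50%
print(utilization_pct(1_000_000_000, 2_000_000_000))  # 50.0
```

The rest of this post is about the ways modern hardware breaks the assumptions baked into this simple formula.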

Current processor technology is much more complex. A single processor package may contain multiple cores with dynamically changing frequencies, hardware multithreading, and shared caches. These technological advances can change the behavior of CPU utilization reporting mechanisms and increase the difficulty of performance analysis for developers, testers, and administrators. The goal of this post is to explain the subtleties of CPU utilization on modern hardware, and to give readers an understanding of which CPU utilization measurements can and cannot be compared during performance analysis.

CPU Utilization’s Uses
For those who are unaware, CPU utilization is typically used to track CPU performance regressions or improvements when running a specific piece of code. Say a company is working on a beta of their product called “Foo.” In the first test run of Foo a few weeks ago, they recorded an average CPU utilization of 25% while Foo was executing. However, in the latest build the average CPU utilization during the test run is measured at 75%. Sounds like something’s gone awry.

CPU utilization can also be used to investigate performance problems. We expect this type of scenario to become common as more developers use the Windows Performance Toolkit to assist in debugging applications. Say that Foo gets released for beta. One customer says that when Foo is running, their system becomes noticeably less responsive. That may be a tough bug to root cause. However, if the customer submits an XPerf trace, CPU utilization (and many other nifty metrics) can be viewed per process. If Foo.exe typically uses 25% CPU on their lab test machines, but the customer trace shows Foo.exe is using 99% of the CPU on their system, this could be indicative of a performance bug.

Finally, CPU utilization has important implications on other system performance characteristics, namely power consumption. Some may think the magnitude of CPU utilization is only important if you’re bottlenecked on CPU at 100%, but that’s not at all the case. Each additional % of CPU Utilization consumes a bit more juice from the outlet, which costs money. If you’re paying the electricity bill for the datacenter, you certainly care about that!

Before I go further, I want to call out a specific caveat for the more architecturally-aware folks. Earlier, I used the phrase “cycles worth of work”. I will avoid defining the exact meaning of “work” for a non-idle processor. That discussion can quickly become contentious. Metrics like Instructions Retired and Cycles per Instruction can be very architecture and instruction dependent and are not the focus of this discussion. Also, “work” may or may not include a plethora of activity, including floating point and integer computation, register moves, loads, stores, delays waiting for memory accesses and IO’s, etc. It is virtually impossible for every piece of functionality on a processor to be utilized during any given cycle, which leads to arguments about how much functionality must participate during “work” cycles.

Now, a few definitions:
Processor Package: The physical unit that gets attached to the system motherboard, containing one or more processor cores. In this blog post “processor” and “processor package” are synonymous.
Processor Core: An individual processing unit that is capable of executing instructions and doing computational work. In this blog post, the terms “CPU” and “core” are intended to mean the same thing. A “Quad-Core” processor implies four cores, or CPU’s, per processor package.
Physical Core: Another name for an instance of a processor core.
Logical Core: A special subdivision of a physical core in systems supporting Symmetric Multi-Threading (SMT). A logical core shares some of its execution path with one or more other logical cores. For example, a processor that supports Intel’s Hyper-Threading technology will have two logical cores per physical core. A “quad-core, Hyper-Threaded” processor will have 8 logical cores and 4 physical cores.
Non-Uniform Memory Access (NUMA): A type of system topology with multiple memory controllers, each responsible for a discrete bank of physical memory. Requests to each memory bank on the system may take different amounts of time, depending on where the request originated and which memory controller services the request.
NUMA node: A topological grouping of a memory controller, associated CPU’s, and associated bank of physical memory on a NUMA system.
Hardware thread: A thread of code executing on a logical core.
Affinitization: The process of manually restricting a process or individual threads in a process to run on a specific core, package, or NUMA node.
Virtual Processor: An abstract representation of a physical CPU presented to a guest virtual machine.

Comparisons & Pitfalls
CPU utilization data is almost always useful. It is a piece of information that tells you something about system performance. The real problem comes when you try to put one piece of data in context by comparing it to another piece of data from a different system or a different test run. Not all CPU utilization measurements are comparable - even two measurements taken on the same make and model of processor. There are a few sources of potential error for folks using utilization for performance analysis; hardware features and configuration, OS features and configuration, and measurement tools can all affect validity of the comparison.

1. Be wary of comparing CPU utilization from different processor makes, families, or models.
This seems obvious, but I mentioned a case study above where the Foo performance team got a performance trace back from a customer, and CPU utilization was very different from what was measured in the lab. The conclusion that 99% CPU utilization = a bug is not valid if processors are at all different, because you’re not comparing apples to apples. It can be a useful gut-check, but treat it as such.

Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization

2. Resource sharing between physical cores may affect CPU utilization (for better or worse)
Single-core processors, especially on servers, are uncommon; multi-core chips are now the norm. This complicates a utilization metric for a few reasons. Most significantly, resource sharing between processor cores (logical and physical) in a package makes “utilization” a very hard-to-define concept. L3 caches are almost always shared amongst cores; L2 and L1 might also be shared. When resource sharing occurs, the net effect on performance is workload dependent. Applications that benefit from larger caches could suffer if cache space is shared between cores, but if your workload requires synchronization, it may be beneficial for all threads to be executing with shared cache. Cache misses and other cache effects on performance are not explicitly called out in the performance counter set. So the reported utilization includes time spent waiting for cache or memory accesses, and this time can grow or shrink based on the amount and kind of resource sharing.

Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)


3. Resource sharing between logical cores may affect CPU utilization (for better or worse)
Resource sharing also occurs in execution pipelines when SMT technologies like Intel’s Hyper-threading are present. Logical cores are not the same as physical cores - execution units may be shared between multiple logical cores. Windows considers each logical core a CPU, but seeing the term “Processor 1” in Windows does not imply that the corresponding silicon is a fully functioning, individual CPU.

Consider 2 logical cores sharing some silicon on their execution path. If one of the logical cores is idle, and the other is running at full bore, we have 100% CPU utilization for one logical core. Now consider when both logical cores are active and running full bore. Can we really achieve double the “work” of the previous example? The answer is heavily dependent on the workload characteristics and the interaction of the workload with the resources being shared. SMT is a feature that improves performance in many scenarios, but it makes evaluating performance metrics more…interesting.

Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)


4. NUMA latencies may affect CPU utilization (for better or worse)
An increasing percentage of systems have a NUMA topology. NUMA and resource sharing together imply that system topology can have dramatic effects on overall application performance. Similar to the previous two pitfalls, NUMA effects on performance are workload dependent.

If you want to see which cores belong to which NUMA nodes, right click on a process in the “Processes” tab of Task Manager and click “set affinity…”. You should get a window similar to the one below, which shows the CPU-to-node mapping if a server is NUMA-based. Another way to get this information is to execute the “!NUMA” command in the Windows Debugger (windbg.exe).



Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)


5. Processor power management (PPM) may cause CPU utilization to appear artificially high
Power management features introduce more complexity to CPU utilization percentages. Processor power management (PPM) matches the CPU performance to demand by scaling the frequency and voltage of CPU’s. During low-intensity computational tasks like word processing, a core that nominally runs at 2.4 GHz rarely requires all 2.4 billion potential cycles per second. When fewer cycles are needed, the frequency can be scaled back, sometimes significantly (as low as 28% of maximum). This is very prevalent in the market - PPM is present on nearly every commodity processor shipped today (with the exception of some “low-power” processor SKUs), and Windows ships with PPM enabled by default in Vista, Windows 7, and Server 2008 / R2.

In environments where CPU frequency is dynamically changing (reminder: this is more likely than not), be very careful when interpreting the CPU utilization counter reported by Performance Monitor or any other current Windows monitoring tool. Utilization values are calculated based on the instantaneous (or possibly mean) operating frequency, not the maximum rated frequency.

Example: In a situation where your CPU is lightly utilized, Windows might reduce the operating frequency down to 50% or 28% of its maximum. When CPU utilization is calculated, Windows uses this reduced frequency as the reference point, as if it were the maximum. If a CPU nominally rated at 2.0 GHz is running at 500 MHz, and all 500 million cycles available are used, the CPU utilization would be shown as 100%. Extending the example, a CPU that is 50% utilized at 28% of its maximum frequency is using approximately 14% of the maximum possible cycles during the time interval measured, but CPU utilization would appear in the performance counter as 50%, not 14%.
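The examples above reduce to one multiplication: the share of the maximum rated cycles actually consumed is the reported utilization scaled by the current frequency as a percentage of maximum.

```python
# Convert the utilization percentage reported against the current
# operating frequency into utilization relative to the maximum rated
# frequency, as described in the examples above.

def true_utilization_pct(reported_pct, freq_pct_of_max):
    return reported_pct * freq_pct_of_max / 100.0

# 100% utilized at 25% of rated frequency (a 2.0 GHz core at 500 MHz)
print(true_utilization_pct(100, 25))   # 25.0
# 50% utilized at 28% of rated frequency -> 14% of the rated cycles
print(true_utilization_pct(50, 28))    # 14.0
```

Note that this is only a sketch of the arithmetic; as the side note below explains, the frequency counters in Perfmon are instantaneous samples, so a single reading may not represent the mean frequency over the interval.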

You can see instantaneous frequencies of CPUs in the “Performance Monitor” tool. See the “Processor Performance” object and select the “% of Maximum Frequency” counter.

[Side note related to Perfmon and power management: the “Processor Frequency” and “% of Maximum Frequency” counters are instantaneous samples, not averaged samples. Over a sample interval of one second, the frequency can change dozens of times. But the only frequency you’ll see is the instantaneous sample taken each second. Again, ETW or other more granular measurement tools should be used to obtain statistically better data for calculating utilization.]

Key takeaway #5: 2 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

Key takeaway #6: 4 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

6. Special Perfmon counters should be used to obtain CPU utilization in virtualized environments
Virtualization introduces more complexity, because allocation of work to cores is done by the hypervisor rather than the guest OS. If you want to view CPU utilization information via Performance Monitor, specific hypervisor-aware performance counters should be used. In the root partition of a Windows Server running Hyper-V, the “Hypervisor Root Virtual Processor % Total Runtime” counter can be used to track CPU utilization for the Virtual Processors to which a VM is assigned. For deeper analysis of Hyper-V Performance Counters and Processor Utilization in virtualized scenarios, see blog posts here and here.

Key takeaway #7: In a virtualized environment, unique Perfmon counters exposed by the hypervisor to the root partition should be used to get accurate CPU utilization information.


7. “% Processor Time” Perfmon counter data may not be statistically significant for short test runs
For someone performing performance testing and analysis, the ability to log CPU utilization data over time is critical. A data collector set can be configured via logman.exe to log the “% Processor Time” counter in the “Processor Information” object for this purpose. Unfortunately, counters logged in this fashion have a relatively coarse granularity in terms of time intervals; the minimum interval is one second. Relatively long sample sizes need to be taken to ensure statistical significance in the utilization data. If you need higher precision, then out-of-band Windows tools like XPerf in the Windows Performance Toolkit can measure and track CPU utilization with a finer time granularity using the Event Tracing for Windows (ETW) infrastructure.

Key takeaway #8: Perfmon is a good starting point for measuring utilization but it has several limitations that can make it less than optimal. Consider alternatives like XPerf in the Windows Performance Toolkit.



Best Practices for Performance Testing and Analysis Involving CPU Utilization
If you want to minimize the chances that hardware and OS features or measurement tools skew your utilization measurements, consider the following few steps:
1. If you’re beginning to hunt down a performance problem or are attempting to optimize code for performance, start with the simplest configuration possible and add complexity back into the mix later.
a. Use the “High Performance” power policy in Windows or disable power management features in the BIOS to avoid processor frequency changes that can interfere with performance analysis.
b. Turn off SMT, overclocking, and other processor technologies that can affect the interpretation of CPU utilization metrics.
c. Affinitize application threads to a core. This will enhance repeatability and reduce run-to-run variations. Affinitization masks can be specified programmatically, from the command line, or via the GUI in Task Manager.
d. Do NOT continue to test or run in production using this configuration indefinitely. You should strive to test in the out-of-box or planned production configuration, with all appropriate performance and power features enabled, whenever possible.
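As a sketch of the affinitization in step (c), a process can be pinned to specific cores at launch from the command line (myapp.exe is a placeholder):

```shell
rem Launch a process pinned to CPU 0. The affinity mask is given in hex:
rem 1 = core 0, 3 = cores 0 and 1, F = cores 0-3. Pinning improves
rem run-to-run repeatability during performance testing.
start /affinity 1 myapp.exe
```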

Key Takeaway #9: When analyzing performance issues or features, start with as simple a system configuration as possible, but be sure to analyze the typical customer configuration at some point as well.

2. Understand the system topology and where your application is running on the system in terms of cores, packages, and nodes when your application is not explicitly affinitized. Performance issues can suddenly appear in complex hardware topologies; ETW and XPerf in the Windows Performance Toolkit can help you to monitor this information.
a. Rebooting will generally change where unaffinitized work is allocated to CPUs on a machine. This can make topology-related performance issues reproduce intermittently, making them harder to root-cause and debug. Reboot and rerun tests several times, or explicitly affinitize to specific cores and nodes, to help flush out any issues related to system topology. This does not mean that the final implementation is required to use thread affinity, or that affinity should be used to work around potential issues; it just improves repeatability and clarity when testing and debugging.
3. Use the right performance sampling tools for the job. If your sample sets will cover a long period of time, Perfmon counters may be acceptable. ETW generally samples system state more frequently and is correspondingly more precise than Perfmon, making it effective even with shorter-duration samples. Of course, there is a tradeoff: depending on the number of ETW “hooks” enabled, you may end up gathering significantly more data, and your trace files may be large.
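A minimal xperf capture of a sampled CPU profile looks roughly like this (the trace file name is a placeholder):

```shell
rem Start an ETW kernel trace with sampled profiling (default ~1 ms sample
rem interval, far finer than Perfmon's 1-second minimum) plus call stacks.
xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile
rem ... run the workload under test ...
rem Stop tracing and merge the buffers into a trace file for analysis.
xperf -d cpu_trace.etl
```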



Finally, keep in mind that these problems are not isolated to the Windows operating system family. The increase in processor features and complexity over the past decade has made performance analysis, testing, and optimization a challenge on all platforms, regardless of OS or processor manufacturer.
And if you are comparing CPU utilization between two different test runs or systems, use the guidance in this post to double-check that the comparison makes sense. Making valid comparisons means you’ll spend more of your time chasing valid performance issues.


Summary of Key Takeaways
Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization
Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)
Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)
Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)
Key takeaway #5: 2 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency
Key takeaway #6: 4 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency
Key takeaway #7: In a virtualized environment, unique Perfmon counters exposed by the hypervisor to the root partition should be used to get accurate CPU utilization information.
Key takeaway #8: Perfmon is a good starting point for measuring utilization but it has several limitations that can make it less than optimal. Consider alternatives like XPerf in the Windows Performance Toolkit.
Key Takeaway #9: When analyzing performance issues or features, start with as simple a system configuration as possible, but be sure to analyze the typical customer configuration at some point as well.

Feel free to reply with questions or additional (or alternative) perspectives, and good luck!

Matthew Robben
Program Manager
Windows Server Performance Team
Windows Performance Monitor (PerfMon) has been around for several generations of Windows and allows you to monitor, either over time or in real-time, the performance statistics of a Windows server.
Performance Monitor can capture a plethora of information on a Windows Server and is useful in diagnosing performance problems. However, to meaningfully analyze the PerfMon data captured when troubleshooting performance issues, it is critical that you have a baseline of normal system performance for comparison. This article focuses on using PerfMon to create a performance baseline on a Windows Terminal Server, but the following information also applies to baselining any Windows-based server.
Using Performance Monitor, performance data can be captured at a variety of granularities, from total processor utilization on a server down to the processor time used by an individual Windows process. However, to obtain the information you want, it is important to understand the three fundamental levels of monitoring criteria, detailed below:
Objects: Objects are the top-most criteria for monitoring a set of attributes on the server. Typical objects include Memory, Network, Paging File, Processor, etc.
Counters: Counters are a subset of an object. For any given object, you will have multiple counters. For example, the Processor object has various counters to choose from: % processor time, % privileged time, % user time, interrupts/second, etc.
Instances: Each counter can have one or more instances. Using the example above of the Processor object, % Processor Time would have two per-processor instances in a dual-processor system (0 and 1), plus a _Total instance that aggregates them. You have the ability to monitor only one instance of a given counter if you wish.
Another way to look at this relationship is as follows (figure 1):

Figure 1
You can select the object itself, which includes all counters and all instances of each counter; a specific counter for an object, which includes all instances of that counter; or only a specific instance of a given counter (for example, instance 0 of the % Processor Time counter of the Processor object).
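This object/counter/instance hierarchy maps directly onto the counter-path syntax used by command-line tools such as typeperf, in the form \Object(Instance)\Counter:

```shell
rem Instance 0 of the % Processor Time counter of the Processor object:
typeperf "\Processor(0)\% Processor Time" -sc 5
rem Every instance of that counter (one column per processor):
typeperf "\Processor(*)\% Processor Time" -sc 5
rem The combined _Total instance:
typeperf "\Processor(_Total)\% Processor Time" -sc 5
```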
Using Performance Monitor
The default screen shows current activity on the system, measuring pages/sec, average disk queue length and processor utilization.
1. To baseline a system, select Counter Logs under Performance Logs and Alerts. By default, there is a basic counter log that measures the same three counters listed above. Although you can’t delete this sample log, you can create your own custom counter log.
2. Right-click on Counter Logs and select New Log Settings… The New Log Settings screen comes up and prompts you to name the job. As a good rule of thumb, make the job name as descriptive as possible to ease future reference. Include things like the server name and the date the baseline is being taken.

Enter the job name and click OK.
3. Now it’s time to set up the counters. You will notice that there are two buttons available – Add Objects and Add Counters. Most of the time, adding entire objects will result in too much data being collected. For a proper baseline, you only need to capture basic information about the performance of a server; granular items (such as Memory\Pool Paged Bytes) have no bearing on the baseline, so collecting them is overkill. Also, each additional counter costs the server resources to track that performance data, so adding too many counters by selecting entire objects can easily put undue strain on a server and skew your baseline results. Therefore, it’s best to add only the counters you wish to track.
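To see exactly which counters (and instances) an object exposes before deciding what to add, you can enumerate them from the command line:

```shell
rem List every counter under the Processor object
typeperf -q Processor
rem List the same counters with their instances expanded
typeperf -qx Processor
```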

Clicking the Add Counters… button will bring up the following screen (figure 2)

Figure 2
By selecting a performance object from the drop-down list, you can drill down to specific counters and instances of that object.
Below is a list of object counters that make up a good, well-rounded baseline. Include all instances of each counter except for the Network counters, which should monitor only the instances for the NICs to be included in the baseline (if appropriate). The details on what each counter gathers will be discussed in part 2 of this article.
Memory
• Pages/Sec
• Available Mbytes
• Committed Bytes
• Page Faults/Sec
Network Interface
• Bytes Total/Sec
• Packets/Sec
Paging File
• % Usage
Physical Disk
• % Disk Time
• Avg Disk Bytes/Transfer
• Avg Disk Queue Length
• Avg Disk Sec/Transfer
• Disk Transfers/Sec
Processor
• % Processor Time
• % Privileged Time
• % User Time
• Interrupts/Sec
System
• Context Switches/Sec
• Processes
• Processor Queue Length
The following counters are for Terminal Servers specifically, and will aid in translating the output into meaningful information:
Terminal Services
• Active Sessions
• Total Sessions
Terminal Services Session
• % Processor Time
• Page Faults/Sec
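The baseline above can also be scripted with logman instead of the GUI. This sketch assumes a plain-text file (baseline_counters.txt, one counter path per line) containing the counters listed above; the job name and output path are placeholders:

```shell
rem baseline_counters.txt holds one counter path per line, for example:
rem   \Memory\Pages/Sec
rem   \Memory\Available MBytes
rem   \Processor(*)\% Processor Time
rem   \System\Context Switches/Sec
rem Create the baseline job with a 15-second sample interval, binary output.
logman create counter ServerBaseline -cf baseline_counters.txt -si 00:00:15 -f bin -o C:\PerfLogs\ServerBaseline
```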
4. Once you have added the appropriate counters, you can select the sample interval. The default setting of 15 seconds is usually sufficient, but if the server is utilized rather heavily, set the sampling interval to 30 seconds or more to cut down on the impact that Performance Monitor may have on normal running conditions.

To set the interval, on the General Tab (figure 3), set the Sample data every: parameter to the desired setting, and the corresponding Units (in seconds, by default).

Figure 3
5. On the Log Files tab (figure 4), you can change the type of log file and where they are stored.

Figure 4
Typically, a binary log file is sufficient, since you will usually review the data in PerfMon. However, you have the option of using a delimited text file (though delimited files cannot be read back into PerfMon) or even streaming the data to a SQL database. For the purposes of this article, we will stick with a binary file.
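If you later need the data outside PerfMon after all, the relog utility that ships with Windows can convert a binary log to a delimited file after the fact (file names here are placeholders):

```shell
rem Convert a binary PerfMon log to CSV for analysis in a spreadsheet
relog C:\PerfLogs\ServerBaseline.blg -f csv -o C:\PerfLogs\ServerBaseline.csv
```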
Clicking the Configure… button allows you to set both the file name prefix (which defaults to the job name) and the location of the files. You can also set a maximum size for your log files to prevent them from growing too large. The default of “Maximum limit” allows the log file to grow until it consumes all space on the drive, so if drive space is short, or you will not be setting an end time/date for the job on the Schedule tab, it is a good idea to set a maximum file size. Once the log file reaches the specified size, PerfMon will stop logging information.
Another option, however, is to use a Binary Circular File for the log file. Once the log file grows to the size specified, PerfMon will begin flushing the oldest information in the log file to make room for the new data. This will ensure you always have the latest performance statistics when you stop the log, and the log file will never grow beyond the specified size.
6. Finally, the Schedule tab (figure 5) allows you to decide whether the PerfMon job will start and stop at specified times or will require manual intervention. For baselining, you would typically set a start and stop time/date. It is always good to set a stop time if you don’t set a maximum log file size; this will prevent the logs from accidentally filling the drive if you forget to turn off PerfMon. In the example below, PerfMon is set to log data for seven days.
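With logman, the file type, size cap, and schedule from steps 5 and 6 can also be applied to an existing job from the command line (job name, dates, and sizes are illustrative):

```shell
rem Switch the job to a binary circular log capped at 100 MB
logman update counter ServerBaseline -f bincirc -max 100
rem Schedule the collection to run for seven days, then stop automatically
logman update counter ServerBaseline -b 11/15/2010 00:00:00 -e 11/22/2010 00:00:00
```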

Figure 5
7. All that is left is to start collecting data. To manually start the job, right click the job name in the Counter Logs screen and select Start. Otherwise, the job will start automatically when the scheduled time arrives.

Note: No one needs to be logged on to the server for data collection. PerfMon will automatically start and stop jobs without a user being logged on.

Once the job is started, its icon will turn green in the Counter Logs screen. You can also view the log file location in Windows Explorer to see the actual log files as they grow in size.
With the job now running, PerfMon is collecting data. The most reliable data comes from running the server in production as usual: the idea is to let Performance Monitor capture performance statistics while the server is under normal use. This will provide a good baseline for future comparison.
Part 2 of this article will go into depth on how to interpret the data gathered and how to effectively use a baseline to troubleshoot future issues.
