vRealize Operations 6.6 – Stress Calculation and Rightsizing

Professional football coach Paul Brown once said, “The key to winning is poise under stress.” This is also the key to optimal performance of your vSphere data center resources. We want our resources stressed just enough to take full advantage of our capacity investment, without delivering substandard performance.

One of the vROps questions I am often asked is, “How does vROps come up with this ‘right-size’ recommendation?” It’s often followed by, “Why should I trust this?”

The answer is found with an understanding of two factors. The first is how vROps is set to analyze CPU and memory, which is configured by policy or the Monitoring Goals wizard. The default is CPU Demand | Memory Consumed.

There are three options, ranging from what is considered conservative to aggressive. ‘Allocation’, ‘Consumed’, and ‘Demand’ are terms with specific meanings for vROps capacity analysis in this regard. Within this post, I will use the term ‘demand’ in a general sense and will not delve into the differences between them.

The second is the stress policy and calculation. I’ll be focusing on this factor as it performs the heavy lifting here.

Stress is an evaluation of how much capacity something has in relation to how much is demanded of it. It can be derived for anything that provides resources, such as virtual machines, hosts, and clusters.

The first thing we need to understand is total capacity of the object we are evaluating. For a cluster, this would be the total memory of all hosts and the total GHz of all CPUs in the cluster.

The next concept is usable capacity. Usable capacity is total capacity minus general and high-availability buffers. So if we say our cluster needs to withstand the loss of one host, the CPU and memory equivalent of a single host is deducted from total capacity to arrive at usable capacity.
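To make the subtraction concrete, here is a minimal Python sketch. It assumes a cluster of identical hosts and treats the general buffer as a percentage of total capacity; the function and parameter names are my own for illustration and are not vROps settings or API calls.

```python
def usable_capacity(host_count, per_host, ha_hosts=1, general_buffer_pct=0):
    """Total capacity minus HA and general buffers (illustrative only).

    Assumes identical hosts and a general buffer expressed as a
    percentage of total capacity. Both are assumptions for this sketch.
    """
    total = host_count * per_host
    ha_buffer = ha_hosts * per_host            # capacity held back for host failures
    general_buffer = total * general_buffer_pct / 100
    return total - ha_buffer - general_buffer


# Four hosts at 20 GHz each, tolerating the loss of one host.
print(usable_capacity(4, 20.0))  # 60.0
```

The same function works for memory: pass the per-host memory instead of GHz.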

Now, it’s important to keep in mind that vROps provides us both historical and forecasted performance analysis. When we view historical data (e.g. the Stress badge), vROps calculates stress based on total capacity. When we look at forecasted analysis (e.g. Capacity Remaining or Recommended Size), vROps calculates stress based on usable capacity.

The reason for this is that we use buffers to deal with potential host failures, but looking into the past, we already know what was available at that time. For the future, we want to be sure to account for the additional capacity we will need to satisfy our HA and general buffers.

So we now know that vROps takes one of the two capacity figures and compares it with the actual requirements (aka resource Workload) being placed on it. A Stress score is the result of this analysis.

The Stress score is a percentage represented as a whole number and can exceed 100 (e.g. a group of VMs in a cluster can collectively demand more GHz than exists within the cluster capacity). vROps only takes note once demand exceeds the specified stress threshold. The stress threshold is set by policy and is 70% by default. vROps evaluates any result below the threshold as a stress score of zero, and the score begins at 1 as soon as we cross the threshold.

We obviously don’t want resources running at low stress as this means we’re spending too much on capacity. We don’t want to red-line our capacity with extremely high stress values either, resulting in poor performance. We want our resources in a band of stress that delivers optimal performance for the right investment in capacity (“Poise under stress”).

One caveat I should add on performance is that oversized objects can actually result in degraded performance as well. Additional memory and CPU bring additional overhead with them. This unnecessary overhead consumes compute resources.

As with all badges, we have the concept of color grades. These colors typically represent the criticality of the condition. For stress, we say a resource is Green from 0% stress until demand fills 25% of the area above the defined stress threshold. This green band, less than 25% above the threshold, is the “poise under stress” band. The colors go yellow, orange, and red from there.

The data set analyzed to calculate a stress score is determined by policy as well. By default, vROps begins with the worst one hour of stress over the past 30-day period. The worst hour is the one-hour window in which the demand area was greater than in any other, as the window slides across that 30-day period.

Knowing that vROps collects data points every 5 minutes by default, we can determine that it will look at 12 measurements in this sliding window, beginning with 1-12, then 2-13, then 3-14, and so on until it pans through the entire 30 days. (If my math is correct, this equals 8,629 comparisons per calculation.) This avoids results based on random spikes that will likely exist across the data, and ensures we are deriving the result from a noticeable, sustained increase in demand.
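The sliding window is easy to check with a few lines of Python. This is only a sketch of the idea described above, not the vROps implementation: the function names are hypothetical, and the "demand area" of a window is approximated as the plain sum of its samples.

```python
def window_count(days=30, interval_minutes=5, window_minutes=60):
    """Number of one-hour windows when sliding one sample at a time."""
    samples = days * 24 * 60 // interval_minutes      # 8,640 samples in 30 days
    per_window = window_minutes // interval_minutes   # 12 samples per hour
    return samples - per_window + 1


def worst_window(samples, size=12):
    """Return (start_index, demand_area) of the window with the most demand.

    Demand area is approximated as the sum of the samples in the window
    (an assumption made for this sketch).
    """
    best_start, best_area = 0, float("-inf")
    for start in range(len(samples) - size + 1):
        area = sum(samples[start:start + size])
        if area > best_area:
            best_start, best_area = start, area
    return best_start, best_area


print(window_count())  # 8629
```

Because the score comes from a full hour of samples, a single 5-minute spike inside an otherwise quiet hour contributes only one twelfth of the window.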

The stress score of that worst hour is the area of demand above the stress threshold, expressed as a percentage of the total area above the threshold for that hour.

For example (and to keep it simple), imagine we have 20 GHz of CPU (usable capacity) and 12 consecutive data points were recorded at 18 GHz in the worst hour.

Because the stress threshold is 70%, we cross into stress at 14 GHz. We observed a straight line of demand at 18 GHz across the 60-minute window. The stress zone is 6 GHz tall (usable capacity [20] minus stress threshold [14]) and 12 observations wide (5-minute intervals for 60 minutes), for a total area of 6 x 12 = 72. The demand was 4 GHz above the stress threshold for each of the 12 measurements, giving a demand area of 4 x 12 = 48. To find the stress score, divide the demand area above the stress threshold by the total stress area: 48/72 = 66.67%, or a stress score of 67.

(In actuality, the calculation vROps performs is a bit more scientific, as it calculates the area of the curves formed by the twelve data points.)
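The simplified arithmetic from the example can be written as a short Python function. Treat it as a sketch of the rectangle approximation used in this post, not the curve-based calculation vROps actually performs. The function name and the choice to round to the nearest whole number are my assumptions.

```python
def stress_score(samples, capacity, threshold_pct=70):
    """Demand area above the threshold as a percentage of the total area
    above the threshold (rectangle approximation, illustrative only)."""
    threshold = capacity * threshold_pct / 100
    total_area = (capacity - threshold) * len(samples)
    # Only the part of each sample above the threshold counts toward stress.
    demand_area = sum(max(s - threshold, 0.0) for s in samples)
    return round(100 * demand_area / total_area)


# 20 GHz of capacity, 12 samples at 18 GHz in the worst hour.
print(stress_score([18.0] * 12, 20.0))  # 67
```

Samples at or below the threshold contribute nothing, so an hour spent entirely under 14 GHz scores 0, and an hour where demand exceeds capacity scores above 100.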

In the above scenario, we can see that a stress score of 67 for this resource corresponds to 90% overall demand (18 of 20 GHz). It’s important to understand that stress score does not equal workload (i.e. demand) percentage. So if we work off of the default stress threshold of 70%, we look for less than 25% into the remaining 30% as our target score for optimal utilization. This puts us between 70% and 77.5% average demand for the measured period, with a stress score of 1 to 25.
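The relationship between stress score and demand percentage can be expressed in a line of Python. This sketch assumes flat demand across the hour, as in the example, and the function name is hypothetical.

```python
def demand_pct_for_score(score, threshold_pct=70):
    """Average demand (% of capacity) that yields a given stress score,
    assuming demand is flat across the measured hour."""
    return threshold_pct + (100 - threshold_pct) * score / 100


# Top of the green band with the default 70% threshold.
print(demand_pct_for_score(25))  # 77.5
```

With a more aggressive threshold of, say, 80%, the same score of 25 would correspond to 85% demand, which is why the policy setting matters so much.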

At midnight, vROps will scan the past 30 days again (dropping off a day and adding the day just passed), find the worst hour, and calculate again. This may be the same data points, or may be different if the last calculation was derived from the day we dropped or there is a worse hour in the day just added.

So now that we know how stress is calculated, we can begin to understand how it is used to indicate whether a resource is oversized, undersized, or rightsized. Going back to the optimal stress target (70%-77.5% demand), we will see how it is laid down as the template for this analysis.

As I mentioned earlier, when looking at the historical data for a resource, vROps calculates based on total capacity. But when looking forward for recommendations, it calculates based on usable capacity. So, if a resource has a stress badge score of 1 to 25, it doesn’t necessarily mean it will be reported as rightsized.

For rightsizing, vROps looks at the past 30-day graph in much the same way it does when calculating the stress badge score. The only difference is that it considers usable capacity.

vROps uses this score to determine if a resource is oversized, undersized, or rightsized. If the resource’s stress score is 1-25, it’s rightsized. A score well below that band indicates it is oversized. A score above that band indicates it is undersized.

To calculate the recommended size, vROps considers the stress window and calculates the additional 22.5% capacity (in the case of the default stress threshold) required on top of that to give the resource enough headroom to run in the optimal stress band. Voila, that is your recommended size.
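As a rough illustration only, here is one way to read that calculation in Python. I am assuming the recommended size is the capacity at which the worst-hour demand lands at the top of the green band (77.5% with the default threshold), leaving 22.5% headroom. The function name, the parameters, and this interpretation are mine. The exact formula vROps uses is not spelled out here.

```python
def recommended_size(worst_hour_demand, threshold_pct=70, band_pct=25):
    """Capacity at which the given demand sits at the top of the optimal
    stress band. An interpretation for illustration, not the vROps formula."""
    # Top of the green band as a percentage of capacity: 70 + 25% of 30 = 77.5
    target_pct = threshold_pct + (100 - threshold_pct) * band_pct / 100
    return worst_hour_demand / (target_pct / 100)


# Worst-hour demand of 18 GHz from the earlier example.
print(round(recommended_size(18.0), 2))  # 23.23
```

Under this reading, the 20 GHz resource from the earlier example would be reported as undersized, with a recommendation of a little over 23 GHz.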

In cases where the stress score is greater than 100, vROps has to make an assessment of ‘pent up demand’. This is not an exact number, and vROps may recommend more capacity than you might expect. Trust it, then verify.

It is good practice to apply policies that are appropriate for your various environments. Stress can be tuned more aggressively for Dev/Test environments via a separate policy. You may want to evaluate different types of workloads and create specific policy settings for each. It is critical that you have a strong understanding of the policy elements described above, and how they interrelate.

So, that’s the answer to how and why vROps makes rightsizing recommendations. The in-depth analytics performed by vROps on CPU and memory are the starting point. Leveraging policy, we are provided with a recommended configuration that optimizes capacity to performance. The result is compute SLA at optimum cost.

Special thanks to Jack White, Hicham Mourad, and Craig Risinger for their direct and indirect mentoring on vROps over the years.