
Process more with less using smarter cluster overload prevention for Dynatrace Managed

The world’s most scalable, automatic distributed tracing pushes the boundary once again with enhanced Adaptive Load Management.

Bernd Greifeneder, Dynatrace CTO

Dynatrace has been building automated distributed application instrumentation, without the need to modify source code, for over 15 years. Dynatrace PurePath technology is the foundation of distributed tracing and enables best-in-class, robust observability in an automatic and frictionless way. Dynatrace makes this easy: it works out of the box, with no silos of data, no DIY stitching together of tools, no wasted time, and no wasted resources.

Dynatrace PurePath technology captures and analyzes transactions end to end across every tier of your application technology stack, from the browser all the way down to the code and database level. In contrast to other solutions, all Dynatrace PurePaths are automatically captured by OneAgent. You can deploy OneAgent with ease and instrument new applications, hosts, or even large, additional environments.

Turnkey cluster overload protection with adaptive traffic management and control

When you vastly increase the number of PurePaths processed by a Dynatrace Managed cluster, your initial sizing of Dynatrace Managed nodes and clusters may turn out to be inadequate for the resulting volume. A Dynatrace Managed cluster may lack the necessary hardware to process all the additional incoming data. This is especially likely when:

  • There are temporary load spikes due to peak loads from monitored applications that are being load tested, or from cluster nodes that are taking over load from others that are under maintenance or being upgraded.
  • The cluster sizing doesn’t match requirements because a large number of hosts were added to monitoring after the initial sizing, and the cluster needs to be expanded.

To protect the health and integrity of your monitoring environment in such situations, Dynatrace Managed leverages Adaptive Load Reduction (ALR) on incoming traces. The ALR mechanism also ensures maximum stability when the actual load exceeds the capacity of the cluster (though a statistically valid set of requests is still captured for analysis by the Dynatrace Davis® AI causation engine). Unlike our competition, Dynatrace takes a holistic look at cluster health and automatically prevents performance issues. This is one of our self-healing solutions that enables Dynatrace to monitor your applications continuously.

Until now, ALR has had two triggers: cluster node capacity and cluster node health. Starting with Dynatrace Managed version 1.192, we’ve extended the existing cluster node health trigger to examine Dynatrace node responsiveness, which is a more predictable signal and makes for better utilization of the existing hardware. This means that you’ll receive better answers from Dynatrace Davis and capture even more high-fidelity data, because your hardware is used optimally by the improved ALR algorithm.

The new ALR algorithm gives you more precise AI answers and optimized hardware utilization

Starting with Dynatrace Managed version 1.192, cluster nodes will no longer enable ALR unless the health status of the cluster node indicates that it has reached its limits. Beyond the CPU consumption of your cluster nodes, Dynatrace also considers other factors, such as responsiveness, based on the suspension times caused by garbage collection.
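
As a rough illustration of a health-based trigger of this kind, the sketch below enables load reduction only when a node appears to have reached its limits. The names (`NodeHealth`, `should_enable_alr`), the threshold values, and the way the two signals are combined are all assumptions made for illustration; this post does not describe the actual Dynatrace algorithm or its thresholds.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Hypothetical health snapshot of one cluster node over a time window."""
    cpu_utilization: float   # fraction of CPU in use, 0.0 to 1.0
    gc_suspension_ms: float  # time the node was paused by garbage collection
    window_ms: float         # length of the observation window


def gc_suspension_ratio(health: NodeHealth) -> float:
    """Fraction of the window during which the node was suspended by GC."""
    if health.window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return health.gc_suspension_ms / health.window_ms


def should_enable_alr(health: NodeHealth,
                      cpu_limit: float = 0.90,
                      gc_limit: float = 0.10) -> bool:
    """Return True only when the node looks close to overload.

    Both limits are illustrative assumptions, not Dynatrace defaults.
    Combining the two signals with a logical OR is also an assumption.
    """
    cpu_exhausted = health.cpu_utilization >= cpu_limit
    unresponsive = gc_suspension_ratio(health) >= gc_limit
    return cpu_exhausted or unresponsive


# A node with moderate CPU use but long GC pauses is treated as unhealthy.
healthy = NodeHealth(cpu_utilization=0.50, gc_suspension_ms=200, window_ms=10_000)
gc_bound = NodeHealth(cpu_utilization=0.50, gc_suspension_ms=2_000, window_ms=10_000)
print(should_enable_alr(healthy), should_enable_alr(gc_bound))
```

The point of the sketch is that a node with spare CPU can still be unresponsive if it spends a large share of its time suspended by garbage collection, which is why a responsiveness signal can complement CPU consumption.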

What are the benefits of ALR? First, ALR enables you to access even more details about service calls so that you can understand the root cause of performance issues and optimize your applications more easily. Moreover, the new ALR mechanism is more precise, predictable, and explainable—just like our Davis AI causation engine. Dynatrace will now alert you and enable ALR only when your Managed cluster is close to being overloaded, giving you enough time to react and adjust.

With the new ALR, I finally have insight into all high-fidelity PurePaths, and we can resolve production issues more easily. All this automatically and with the same hardware.

– A Dynatrace Managed customer

To keep clusters resilient and performing well, the ALR mechanism analyzes only part of the traffic and estimates the full load from that part. Note that this is not a mechanism for controlling the current monitoring volume (for that, see adaptive traffic management) but one for ensuring that the cluster stays healthy on recommended hardware. With this latest addition, the major trigger for ALR is cluster node health, and the key performance metric is garbage collection suspension time.
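
The idea of analyzing only part of the traffic and estimating the full load can be sketched as uniform random sampling with scale-up. This is a minimal illustration under stated assumptions: the function name `process_with_reduction`, the `keep_ratio` parameter, and the use of uniform sampling are hypothetical, and the selection strategy Dynatrace actually uses is not described in this post.

```python
import random


def process_with_reduction(traces, keep_ratio, rng=None):
    """Keep roughly `keep_ratio` of the traces and estimate the full count.

    Each trace is kept independently with probability `keep_ratio`, so the
    number of kept traces divided by `keep_ratio` is an unbiased estimate
    of the total number of incoming traces.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = rng or random.Random()
    kept = [trace for trace in traces if rng.random() < keep_ratio]
    estimated_total = len(kept) / keep_ratio
    return kept, estimated_total


# Analyze about a quarter of 10,000 incoming traces and estimate the total.
incoming = list(range(10_000))
kept, estimated = process_with_reduction(incoming, keep_ratio=0.25,
                                         rng=random.Random(42))
print(len(kept), round(estimated))
```

With a large enough sample, the estimate stays close to the true volume while the node only has to process the kept fraction.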

Impact on disk space

If your Dynatrace Managed cluster frequently experiences ALR events, then after the upgrade to Dynatrace version 1.192 the cluster may process more data and traffic in order to fully utilize existing hardware. This means that disk space requirements for Dynatrace transaction storage may increase. Please watch disk space usage and extend disk capacity if needed. Otherwise, the retention time for transaction storage might be reduced once disk capacity is reached.
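
To get a feel for how a higher processed volume affects retention, a back-of-the-envelope calculation can help. The sketch below assumes a simple model in which retention is disk capacity divided by daily stored volume; the function names and all numbers are hypothetical and are not Dynatrace sizing guidance.

```python
def estimate_retention_days(disk_capacity_gb: float, daily_ingest_gb: float) -> float:
    """Days of transaction data that fit on disk under a constant daily volume."""
    if daily_ingest_gb <= 0:
        raise ValueError("daily_ingest_gb must be positive")
    return disk_capacity_gb / daily_ingest_gb


def extra_capacity_needed_gb(target_days: float,
                             daily_ingest_gb: float,
                             disk_capacity_gb: float) -> float:
    """Additional disk space needed to keep `target_days` of retention."""
    return max(0.0, target_days * daily_ingest_gb - disk_capacity_gb)


# Illustrative numbers: 1,000 GB of transaction storage, and a daily volume
# that grows from 100 GB to 125 GB once more traffic is processed.
before = estimate_retention_days(1_000, 100)
after = estimate_retention_days(1_000, 125)
extra = extra_capacity_needed_gb(10, 125, 1_000)
print(before, after, extra)
```

In this example, a 25% increase in stored volume shortens retention from 10 days to 8 days unless about 250 GB of disk space is added.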

Your feedback

Your input matters. Please share your feedback with us by posting your questions and clarifications in the Dynatrace Open Q&A forum.

Also, a special thank you goes out to the co-author of this blog post, Markus Pfleger, Dynatrace Lead Product Architect. Markus was instrumental in bringing this functionality to market and in all sizing and scaling aspects of PurePath and service processing.