
Getting the Balance Right: Adaptive Observability Optimizes IT Monitoring

Adaptive observability makes it possible to increase or decrease monitoring levels in response to the system health of specific IT operations.

In this era of accelerating digital transformation, as the speed at which business is done increases, so does the pressure to do more in less time. IT operations (ITOps) are constantly evolving and being redefined as new models for technology deployment, scaling, and change acceleration emerge and are implemented. Data is the engine driving today's digitized business world and a mission-critical enterprise asset that supports optimized operations, reduced costs, and informed decision-making. In some cases, data is what differentiates a business from its peers. Of course, any value in data is contingent on its veracity and quality.

The Hard Costs of Poor Data

Poor data quality is a key reason that enterprises fail to maximize the value of business initiatives. In 2021, Gartner reported that "every year, poor data quality costs organizations an average $12.9 million. Apart from the immediate impact on revenue, over the long-term, poor-quality data increases the complexity of data ecosystems and leads to poor decision making." Gartner analyst Melody Chien added that "good quality data provides better leads, better understanding of customers and better customer relationships. Data quality is a competitive advantage that data and analytics leaders need to improve upon continuously."

Gaining insight from data is essential to business. Understanding the state and health of that data is crucial, and supports ITOps focused on key performance indicators (KPIs) such as zero network downtime and identifying and resolving problems rapidly. The costs of any downtime are significant. A 2021 ITIC report found that for 91% of mid-sized and large enterprises, the cost of one hour of server downtime was a minimum of $300,000; furthermore, 44% reported that the figure exceeded $1 million.

Augmenting Observability with AIOps

Data observability is the ability to deduce the internal state of a system by analyzing the outputs it produces, such as logs, metrics, and traces. It's a key element of ITOps, helping enterprises to monitor, detect, and manage data and system problems before they become critical. Before observability, analysis was effectively siloed, carried out in isolated pockets. This ad-hoc approach delivered little value, resulting in a piecemeal and loosely coordinated approach to data intelligence.

The integration of artificial intelligence for IT operations (AIOps), now established as a vital function within enterprise IT strategies in its own right, adds deeper value and context to observability data. Leveraging artificial intelligence (AI) and machine learning (ML) enables enterprises to implement intelligent automation by collecting and analyzing massive amounts of data, applying reasoning and problem-solving, removing noise, and prescribing the best actions for ITOps.

For an organization to fully harness the power of AIOps and derive tangible, actionable business value from its data assets, it must have complete access to, and control of, all operational data, which starts with end-to-end visibility across its entire technology stack. Combining full-stack observability with AIOps expands and unifies cross-layer, cross-function monitoring capabilities and granularity across any number of tools and sources for different infrastructure components. Anomalies with the potential to impact the business can then be detected, automatically flagged, and in many instances automatically triaged and resolved; they are escalated for human resolution only when policy parameters dictate it or no automated remediation is available.
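To make the triage flow above concrete, here is a minimal sketch of that decision logic. It is not any specific AIOps product's API; the component names, severity scale, and escalation threshold are illustrative assumptions.

```python
# Sketch of the flag -> auto-remediate -> escalate decision described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Anomaly:
    component: str             # e.g. "payments-db" (hypothetical name)
    severity: int              # 1 (low) .. 5 (critical)
    known_fix: Optional[str]   # name of an automated runbook, if one exists

# Hypothetical policy: anything at or above this severity always goes to a human,
# even when an automated fix exists.
HUMAN_ESCALATION_SEVERITY = 4

def triage(anomaly: Anomaly) -> str:
    """Return the action ITOps should take for a flagged anomaly."""
    if anomaly.known_fix and anomaly.severity < HUMAN_ESCALATION_SEVERITY:
        return f"auto-remediate {anomaly.component} via runbook '{anomaly.known_fix}'"
    return f"escalate {anomaly.component} (severity {anomaly.severity}) to on-call engineer"

print(triage(Anomaly("payments-db", 2, "restart-replica")))  # handled automatically
print(triage(Anomaly("core-router", 5, "failover")))         # handed to a human
```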

Determining Monitoring Levels Is Key

A quintessential requirement for managing today's enterprise IT is the availability of good-quality monitoring data. Given the amount and complexity of data flowing through today's digitized enterprise systems, it's imperative that administrators have a solid understanding of monitoring requirements and systematically deploy a monitoring framework to capture the behavioral characteristics of all compute, communication, and storage components.

A structured, clearly defined, and strategic approach is a non-negotiable, as the adoption of an ad-hoc, manual, and intuition-based approach can lead to inconsistent and inadequate data collection and retention policies, which defeats the entire point of the exercise.

Monitoring metrics need to be collected at various layers, ranging from hardware metrics such as CPU and memory utilization, to the operating system and virtual machine layers, to the database and application layers, among others. While plenty of monitoring tools exist that can cover a wide variety of system components, most capable of tracking a large range of metrics and configurable as needed, the bad news is that the quality of the monitored data can often be suspect.

Bearing in mind that IT resources are limited, a significant challenge in observability is determining the appropriate monitoring level for observing the IT estate and collecting data. An approach that is either too aggressive or too conservative produces either very large or very small volumes of monitoring data, and both extremes have drawbacks that limit their practical value.

Very large amounts of data generated by very fine monitoring parameters, for example, can be difficult to store, maintain, and analyze. For instance, logging 10 metrics at a rate of one sample per second will consume about 720,000 KB per hour, one sample every five seconds about 144,000 KB per hour, and one sample every 10 minutes 1,200 KB per hour. The collection of large volumes of data can become unmanageable and carries the risk of missing genuine anomalies and valuable insights in the noise. On the other hand, a very small amount of monitoring data, generated by coarse operational parameters, carries the risk of missing events of interest, incomplete diagnosis, and insufficient insights.
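A quick back-of-the-envelope sketch reproduces the storage figures above. The article's numbers imply roughly 20 KB stored per metric per sample (value plus labels, timestamps, and indexing overhead); that per-sample cost is an assumption here, and real footprints vary by tool.

```python
# Estimate hourly storage for 10 metrics at different sampling intervals.
KB_PER_METRIC_SAMPLE = 20   # assumed storage cost per metric per sample
NUM_METRICS = 10
SECONDS_PER_HOUR = 3600

def kb_per_hour(sample_interval_seconds: float) -> float:
    samples_per_hour = SECONDS_PER_HOUR / sample_interval_seconds
    return samples_per_hour * NUM_METRICS * KB_PER_METRIC_SAMPLE

for interval in (1, 5, 600):  # 1 second, 5 seconds, 10 minutes
    print(f"every {interval:>3} s: {kb_per_hour(interval):>9,.0f} KB/hour")
# every   1 s:   720,000 KB/hour
# every   5 s:   144,000 KB/hour
# every 600 s:     1,200 KB/hour
```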

Introducing Adaptive Observability

The emergence of "adaptive observability" provides an optimal middle path, making it possible, based on intelligent deep data analytics, to increase or decrease monitoring levels in response to the system health of specific IT operations.

Adaptive observability seeks to replace the ad-hoc, manual, intuition-based approach to system monitoring with one that is systematic, automated, and analytics-based. Its central function is to intelligently assess system performance and infer the health and criticality of the various system components. This inferred health and criticality is then used to generate dynamic monitoring guidelines for those components.
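The sketch below illustrates that central idea: an inferred health score and a criticality score per component are translated into dynamic monitoring guidelines (sampling interval and metric depth). The scoring inputs, thresholds, and tiers are illustrative assumptions, not a prescribed model.

```python
# Translate inferred health and criticality into monitoring guidelines.
from typing import NamedTuple

class Guideline(NamedTuple):
    sample_interval_s: int   # how often to collect metrics
    metric_depth: str        # "basic" or "detailed"

def monitoring_guideline(health: float, criticality: float) -> Guideline:
    """Both scores are in [0, 1]; 1 = perfectly healthy / business-critical."""
    risk = (1.0 - health) * criticality   # unhealthy AND critical => highest risk
    if risk > 0.5:
        return Guideline(sample_interval_s=5, metric_depth="detailed")
    if risk > 0.2:
        return Guideline(sample_interval_s=60, metric_depth="detailed")
    return Guideline(sample_interval_s=600, metric_depth="basic")

print(monitoring_guideline(health=0.4, criticality=0.9))   # at risk -> tight monitoring
print(monitoring_guideline(health=0.98, criticality=0.9))  # healthy -> coarse monitoring
```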

Adaptive observability integrates two relatively new and active areas of research: adaptive monitoring and adaptive probing, two important approaches for the measurement, monitoring, and management of complex systems. Previously, these two approaches were used in isolation. Passive monitoring techniques compute "at-a-point" measurements and can provide fine-grained metrics, but are agnostic to end-to-end system performance, while probing-based techniques compute "end-to-end" metrics but lack an in-depth view of any single component. Adaptive observability combines the two so that they actively complement each other, producing a highly effective monitoring solution that continuously assesses ITOps and intelligently routes and reroutes monitoring levels, data-gathering depth, and frequency toward areas where issues arise.
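Here is a minimal sketch of how the two signal types can complement each other: an end-to-end probe (for example, a synthetic transaction) detects that something along a path is degraded, and passive, component-level metrics are then sampled more finely to localize it. The component names, SLO threshold, and intervals are illustrative assumptions.

```python
# Tighten passive monitoring along a path when an end-to-end probe breaches its SLO.
PROBE_LATENCY_SLO_MS = 500   # assumed end-to-end latency objective

def adjust_monitoring(probe_latency_ms: float,
                      path_components: list,
                      current_interval_s: dict) -> dict:
    """Return new per-component sampling intervals (seconds) for one probed path."""
    new_intervals = dict(current_interval_s)
    if probe_latency_ms > PROBE_LATENCY_SLO_MS:
        # End-to-end signal says the path is unhealthy: sample every component
        # on the path finely to localize the fault.
        for component in path_components:
            new_intervals[component] = min(new_intervals.get(component, 600), 10)
    else:
        # Path is healthy: relax back to the coarse default.
        for component in path_components:
            new_intervals[component] = 600
    return new_intervals

intervals = {"web-tier": 600, "app-tier": 600, "db-tier": 600}
print(adjust_monitoring(900, ["web-tier", "app-tier", "db-tier"], intervals))
```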

When an issue has been identified, the level of observability is increased, not just for the entity associated with the issue but also for other entities that influence it or are influenced by it. For example, if a database is experiencing performance problems, then the observability of the servers and disks used by the database, and of the applications served by it, will be increased accordingly. This ensures fine-grained analysis not just of the reported event but also of its possible causes and effects.
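A small sketch of that widening along the dependency graph, and of restoring baseline levels once the issue is resolved, might look like this. The graph and entity names are illustrative assumptions.

```python
# Escalate observability for an affected entity plus its dependencies and dependents,
# then restore baseline levels once the issue is resolved.
DEPENDS_ON = {                      # entity -> upstream dependencies (illustrative)
    "orders-db": ["db-server-1", "disk-array-3"],
}
SERVES = {                          # entity -> downstream dependents (illustrative)
    "orders-db": ["checkout-app", "reporting-app"],
}

def affected_entities(entity: str) -> set:
    return {entity, *DEPENDS_ON.get(entity, []), *SERVES.get(entity, [])}

def set_observability(levels: dict, entity: str, issue_open: bool) -> dict:
    """Raise monitoring for the entity and its neighbors while an issue is open."""
    new_levels = dict(levels)
    for e in affected_entities(entity):
        new_levels[e] = "fine-grained" if issue_open else "baseline"
    return new_levels

levels = set_observability({}, "orders-db", issue_open=True)      # issue detected
print(levels)
levels = set_observability(levels, "orders-db", issue_open=False) # issue resolved
print(levels)
```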

Put simply, more resources are assigned to gather more data, more frequently, to generate the intelligence needed to resolve the issue quickly and efficiently. Once the issue is resolved and the system is healthy again, it is no longer necessary to collect as much data, or to collect it so frequently, and resources are redeployed elsewhere in ITOps.

Dr. Maitreya Natu is the Chief Scientist at Digitate, a subsidiary of Tata Consultancy Services. He has received his Ph.D. degree in Computer and Information Sciences and specializes in designing and developing cognitive solutions for managing complex systems. His research interests include network management, data science, applied AI/ML, and cognitive automation. A former adjunct faculty member at both the Indian Institute of Technology, Kanpur, and the Indian Institute of Technology, Indore, Maitreya has authored more than 50 papers for international conferences and journals and holds more than 20 patents.
