Beyond Model Accuracy: Designing AI Monitoring for Real-World Healthcare

Last week, I joined leading researchers in clinical AI monitoring from UC-San Francisco, Vanderbilt, and Oregon Health & Science at the AcademyHealth Annual Research Meeting in Seattle. During our panel, I presented Vega Health’s approach to monitoring AI performance in our customer set: community and regional health systems. I made the case that the field needs to do more to translate its best research into practical guidance for the settings where most care is actually delivered. For those who couldn’t attend, my thoughts are below.

As AI adoption grows across healthcare, robust and scalable AI monitoring is an urgent need. The largest health systems today are building their own in-house monitoring capabilities. At major academic medical centers, dedicated research and analytics teams are producing sophisticated approaches to evaluating whether AI tools are “working,” understanding if they are having the intended impact on patients, clinical workflows, and financial bottom lines. Community and safety-net health systems don’t always have the same resources or expertise. As a result, many are reliant on AI vendors to monitor their own tools, and vendors often limit their monitoring to a narrow scope: whether the model is performing “accurately.” Accuracy matters and is an important foundation but is not enough to tell a health system whether an AI solution is truly working.

The more important question is whether the solution is delivering value in practice. Are end-users adopting it and using it as intended? Does it fit into clinical workflows in the health system where it is deployed? Is it creating unintended operational strain elsewhere? Is it helping health systems understand and address inequities? Most importantly, is it improving patient outcomes? The answers to those questions determine whether AI succeeds in the environments where most communities get their care.

The limits of model-centric monitoring

Academic medical centers have the infrastructure to build monitoring that goes well beyond the model itself. They can track clinical workflows, staff behavior, downstream outcomes, and equity implications. That kind of comprehensive monitoring is extremely valuable for those systems. But it is also built for environments with large analytics teams, mature data infrastructure, and models that institutions developed themselves. Vendors serving community health systems typically lack both the context and the incentive to build anything comparable. Their monitoring tends to focus on what they can most directly observe and influence, like data quality, model performance, and other technical indicators. In some cases, vendors don’t even offer monitoring, or offer it only for an additional cost. Technical performance is a necessary piece of monitoring in the healthcare environment, but it is only part of the picture.

A model can be technically sound while the surrounding workflow fails to produce the intended impact. If clinicians do not trust it, if adoption is low, if it creates friction in care delivery, or if it has unintended downstream consequences, measuring technical performance alone is ultimately meaningless. And if the interventions triggered by the model’s alerts do not have the intended impact on patient health, a strong technical model becomes worthless in practice.

This is why monitoring cannot stop at the model boundary. A health system’s goal when it implements an AI solution is not high model sensitivity and specificity. The goal is better patient outcomes and satisfaction, better clinical workflows, and more effective use of resources. As Vega Health’s advisor Suresh Balu once told us, “People are not buying the model. They are buying outcomes.” If the introduction of AI does not produce these outcomes, or if the health system cannot measure them, the value of the solution and end-user trust both plummet.

Why context has to shape monitoring

The monitoring challenge is especially acute outside large academic medical centers, which account for only 7% of healthcare delivery organizations in the United States. The frameworks that researchers are developing at those institutions are often so tailored to their specific infrastructure, staffing models, and internal expertise that adapting them for a community health system takes more effort than most such systems can spare.

Vega Health is building for the 93% of organizations that often lack the resources to effectively implement and monitor AI on their own. Community health systems rely more heavily on external vendors and have less internal capacity to interpret technical monitoring outputs. Their workflows, staffing models, and patient populations also differ substantially from the settings where many AI tools are first developed and studied.

Monitoring in community health systems must account for this different reality and be flexible enough to account for the differences among community health systems themselves. The same model implemented in three different health systems may require three different monitoring approaches because the path from output to action to outcome is shaped by local context.

From outputs to outcomes

A strong monitoring approach starts by following the full pathway from model output to real-world impact.

Not just: Is the model accurate?

But also:

Is the information reaching the right people?
Are they acting on it as intended?
Is it improving workflows or creating friction?
Is it contributing to better patient outcomes?
Is it helping surface inequities that need to be addressed?

A broader lens allows monitoring to become operationally meaningful. Take a deterioration alerting use case. Success does not end with whether an alert is statistically valid. It depends on whether that alert is communicated effectively, acted on appropriately, and translates into better outcomes without introducing new pressures elsewhere in the system. Some of the most interesting work coming out of academic medical centers is focused on grounding monitoring in the questions that clinicians and administrators really want answered rather than the technical metrics that are easiest to compute.

Monitoring must be interpretable (and simple enough to use)

Monitoring is not only about what gets measured. It is also about how results are presented, understood, and acted on. One example that stuck with me from the panel: rather than displaying a metric like “positive predictive value (PPV)” and leaving it to a clinician to interpret that statistical measure, a well-designed monitoring dashboard might ask, “How many actual patient deteriorations do I expect to see for every 100 alerts?” PPV is relevant only as the answer to that question. The question comes first; the measurement follows. That kind of design makes monitoring legible to the people who need to use it, not just the analysts who built it.
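To make that question-first framing concrete, here is a minimal sketch in Python that turns raw alert counts into the sentence a clinician would read. The function name, its arguments, and the example counts are hypothetical and chosen only for illustration; they do not describe any particular dashboard.

```python
def deteriorations_per_100_alerts(true_alerts: int, total_alerts: int) -> str:
    """Restate PPV as a plain-language answer (illustrative names and wording)."""
    if total_alerts <= 0:
        raise ValueError("total_alerts must be positive")
    if not 0 <= true_alerts <= total_alerts:
        raise ValueError("true_alerts must be between 0 and total_alerts")
    # PPV = true alerts / all alerts; scale it to "per 100 alerts" and round.
    per_100 = round(100 * true_alerts / total_alerts)
    return (
        f"For every 100 alerts, expect about {per_100} "
        "actual patient deteriorations."
    )


# Hypothetical counts: 45 of 300 alerts preceded a real deterioration.
message = deteriorations_per_100_alerts(45, 300)
print(message)
```

The underlying number is still PPV; only the presentation changes, so the person reading it does not need to know the statistical term.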

Different stakeholders need different kinds of visibility. Clinicians may need to know whether they can trust a tool’s outputs. Department leaders may need to understand adoption, workflow impact, and unintended consequences. Executive leaders may need visibility into broader operational and financial implications. Each group should have its own view, focused on the questions and related metrics that matter from its perspective. But those views should connect back to the same underlying truth, allow anyone to trace the full story from model accuracy to utilization to outcomes, and, if the desired outcomes are not being achieved, identify where the system is failing in the chain from model to impact. If technical monitoring, operational KPIs, and leadership reporting all live separately with different definitions, organizations are left trying to reconcile fragmented pieces of the performance picture, and it becomes very difficult to agree on where problems lie, let alone how to fix them.
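One way to picture tracing the chain from model to impact is as a funnel with a count at each step. The sketch below, in Python, finds the step where the largest share of cases drops out. The stage names, the counts, and the `weakest_step` helper are all hypothetical. The lowest conversion rate is a crude starting point for investigation, not evidence of cause.

```python
from typing import List, Tuple


def weakest_step(stages: List[Tuple[str, int]]) -> Tuple[str, float]:
    """Return the transition with the lowest conversion rate in an ordered funnel.

    `stages` is an ordered list of (stage name, count) pairs. Names are illustrative.
    """
    if len(stages) < 2:
        raise ValueError("need at least two stages to compare")
    worst_label, worst_rate = "", float("inf")
    # Compare each stage with the one that follows it.
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        rate = count / prev_count if prev_count else 0.0
        if rate < worst_rate:
            worst_label, worst_rate = f"{prev_name} -> {name}", rate
    return worst_label, worst_rate


# Hypothetical one-month counts for a deterioration alert.
funnel = [
    ("alerts fired", 400),
    ("alerts viewed", 320),
    ("intervention started", 96),
    ("patient improved", 72),
]
step, rate = weakest_step(funnel)
print(f"Largest drop-off: {step} ({rate:.0%} carried forward)")
```

In this made-up example most alerts are seen, but few lead to an intervention, which points attention toward workflow rather than model accuracy. Each stakeholder view could draw on the same counts, so everyone is reading from one set of definitions.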

For community health systems, the stakes around interpretability are even higher. These systems usually don’t have the large analytics and informatics teams that play a key role in interpreting and communicating statistical insights at large academic systems. It falls to the clinical staff and administrative leadership to be able to interpret from a monitoring dashboard how often staff responded to an alert and how many patients are better off because of it. If monitoring can’t answer these kinds of questions clearly and directly — and in the language that clinicians and leaders understand — then it isn’t doing its job.

A call for collaboration

The research and development being done at academic medical centers on AI monitoring is impressive, but those institutions are building the methodologies and tools that they need for themselves. The gap between where that research lives and where most healthcare is delivered remains wide, and it will not close on its own. As the monitoring field has expanded, particularly with the rise of generative AI, the landscape has become broader and harder for buyers to navigate. There is a real opportunity for researchers and practitioners to work together to take the best of what individual institutions have learned and turn it into guidance that a health system without a dedicated analytics team can apply.

Vega Health’s approach is to embed end-to-end monitoring within every solution that we deploy. We go deep with individual systems to learn their specific workflows, constraints, and patient populations, then configure our monitoring system to reflect their reality. In doing that work, we are learning a great deal. But we shouldn’t do it alone, and community health systems shouldn’t have to rely solely on vendors to fill that gap.

The research community has a real opportunity here not just to publish findings, but to engage directly with the settings where those findings need to land and to make a much broader impact. That means working with community health systems and with organizations like Vega Health that are already operating in those environments. We are here to collaborate, and we believe that kind of partnership is how good research becomes good practice. It is also how the progress being made at the leading edge of healthcare AI reaches the 93% of organizations that need it most.
