Introduction
Ensuring the reliability and effectiveness of Large Language Models (LLMs) in documentation generation requires a structured validation approach. Many organizations rely on pilot groups to assess LLM performance, often uncovering significant variations in results. This article presents a systematic framework for evaluating LLM-generated documentation, focusing on key methodological elements such as completeness, conciseness, and groundedness.
By adopting a standardized validation method, organizations can objectively assess content quality and optimize model performance for improved accuracy and efficiency.
When benchmarking LLMs for medical summarization, it is crucial to distinguish between two key capabilities:
Translation of conversational language into precise medical terminology – ensuring clinical accuracy (e.g., converting “no fever” to “afebrile”).
Adherence to medical writing style – shaping how information is conveyed concisely and effectively (e.g., summarizing reassurance statements appropriately).
Evaluating LLMs on both dimensions enables a nuanced understanding of their alignment with human-annotated notes and real-world use cases.
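As a loose illustration of keeping these two dimensions separate in an evaluation harness, the sketch below uses a keyword map and a sentence-length check as crude stand-ins for a real reviewer or judge model; the phrases and thresholds are invented for this example and are not part of Corti's benchmark.

```python
# Illustrative stand-ins for a real reviewer or judge model: the point is only
# that terminology accuracy and writing style are scored as separate dimensions.

LAY_TO_CLINICAL = {"no fever": "afebrile", "short of breath": "dyspnea"}

def terminology_score(transcript: str, note: str) -> float:
    """Share of lay phrases from the transcript that appear as clinical terms in the note."""
    expected = [term for lay, term in LAY_TO_CLINICAL.items() if lay in transcript.lower()]
    return sum(term in note.lower() for term in expected) / len(expected) if expected else 1.0

def style_score(note: str, max_words: int = 20) -> float:
    """Share of sentences that stay within a concise word budget."""
    sentences = [s.split() for s in note.split(".") if s.strip()]
    return sum(len(words) <= max_words for words in sentences) / len(sentences) if sentences else 0.0

transcript = "Patient reports no fever and denies being short of breath."
note = "Afebrile. Denies dyspnea. Reassured; no acute findings."
print(terminology_score(transcript, note), style_score(note))  # -> 1.0 1.0
```

Scoring the two dimensions separately makes it possible to see whether a model is clinically accurate but stylistically off, or vice versa.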
Methodology
Generating the perfect clinical note depends on multiple factors:
Are all clinical findings mentioned throughout the encounter documented?
Are the clinical findings placed in the correct section of the clinical note template?
Is the language formatted according to the clinician's preferences?
Corti has developed a method to quantify the accuracy of clinical notes at scale. The method uses Corti's Alignment Large Language Models (see the Alignment endpoint on docs.corti.ai), which are specialized in locating specific clinical information across related data sources; in this example, those sources are the consultation transcript, the AI-generated documentation, and the clinician's final revised documentation. From these alignments, Corti produces three measures for monitoring accuracy in production (a simplified computation sketch follows the list):
Completeness: The proportion of the clinician's final note that is also represented in the AI-generated note. Clinicians often add information to the final note that was not part of the patient encounter, so this metric will not reach 100%.
Conciseness: The proportion of the AI-generated note that also appears in the clinician's final note. If the AI performs well, the clinician should rarely need to delete information from the AI-generated note, so this metric should trend close to 100%.
Groundedness: The proportion of clinical findings in the AI-generated note that are represented in the transcript. This metric should trend close to 100%, as the AI should not generate documentation that is unsupported by the content of the dialogue.
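As a minimal sketch of how these three proportions can be computed, assume an alignment step (for example, via the Alignment endpoint mentioned above) has already extracted the clinical findings from each source; the set-based arithmetic below is illustrative only and is not Corti's production implementation.

```python
# Illustrative only: in production, findings are matched by Corti's Alignment
# LLMs rather than by exact string comparison; the sets below are toy data.

def proportion(covered_by: set[str], source: set[str]) -> float:
    """Share of findings in `source` that are also present in `covered_by`."""
    return len(source & covered_by) / len(source) if source else 1.0

transcript_findings = {"afebrile", "productive cough", "no chest pain"}
ai_note_findings = {"afebrile", "productive cough"}
final_note_findings = {"afebrile", "productive cough", "follow-up in two weeks"}

# Completeness: how much of the clinician's final note the AI note already covered.
completeness = proportion(ai_note_findings, final_note_findings)

# Conciseness: how much of the AI note the clinician kept in the final note.
conciseness = proportion(final_note_findings, ai_note_findings)

# Groundedness: how much of the AI note is supported by the transcript.
groundedness = proportion(transcript_findings, ai_note_findings)

print(f"Completeness {completeness:.0%}, Conciseness {conciseness:.0%}, Groundedness {groundedness:.0%}")
# -> Completeness 67%, Conciseness 100%, Groundedness 100%
```

In this toy example the clinician added a follow-up plan that was never discussed in the encounter, which is why completeness stays below 100% even for a well-performing model.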
By applying these metrics, it becomes possible to compare different approaches and refine the methodology to optimize the LLMs.
Performance Comparison
The table below illustrates how Corti's AI-generated notes typically compare with other ambient scribe solutions and human-generated notes:
| Approach | Completeness | Conciseness | Groundedness |
| --- | --- | --- | --- |
| Other LLM-Based Note Generation | 87% | 95% | 92% |
| Corti LLM-Based Note Generation | 94% | 98% | 97% |
| Human-Generated Note | 100% | 100% | 92% |
Final Note and Summary
Benchmarking LLM documentation is essential for ensuring that AI-generated content meets clinical and operational standards. By focusing on completeness, conciseness, and groundedness, organizations can objectively measure performance and drive continuous improvement in real time.
Corti’s structured approach provides a robust framework for assessing documentation quality, ensuring AI-generated notes align closely with clinician expectations and real-world medical documentation practices.
The Corti R&D team continuously assists healthcare providers and vendors with benchmarking generated notes to ensure the best possible outcomes for patients. Contact us if this is relevant for your organization.