Tuning Medical Text Summarization Models

Understand how Corti tunes summarization models to your organizational needs

Before reading this page, we recommend starting with Tuning Corti's AI Models, which covers typical data-tuning scenarios and when we pursue them with customers.

Overview

Different organizations have different documentation needs, so our summarization models sometimes need tuning to achieve the right level of detail in your summaries. We already have a library of Corti Assistant Section Templates - but if you need something beyond that, we can help. Data collection breaks down into three areas: data types, format, and volume.

Data types

Data for summarization training or fine-tuning consists of pairs: a source (an audio recording or its transcript) and one or more corresponding target summaries.

Audio recordings or transcripts

The source for summarization training is either audio or a text transcript. This is what the model learns from - specifically, which clinically relevant facts to extract from a consultation.

Medical summary

The target summary is a text string that captures the same essential information as the source, in a more concise form. It may follow a standard structure, such as SOAP, with sections that group the content. Model quality improves when multiple target summaries are provided per source, each with a score indicating relative quality.

Data format

Source audio can be provided in any audio format, as individual files. Source transcripts can be provided in a variety of ways - individual text files or a single combined file both work.

Summaries are often structured, so Markdown and JSON are common choices. If the summary is unformatted, a plain text file is fine.

Example data file (transcript source; for an audio source, set "source_file" to the audio file name, for example "audio.wav"):

{
  "source_file": "transcript.txt",
  "target_summaries": [
    {
      "summary": "Here is a summary of the transcript.",
      "score": 1
    },
    {
      "summary": "Here is another, but worse, summary.",
      "score": 0
    }
  ]
}
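Before sending data, it can help to check that each record follows the layout shown above. The following is a minimal Python sketch, not an official Corti tool: the function name validate_record is hypothetical, and the rule that a score is required whenever a source has more than one target summary is our assumption, based on the description of scored summaries earlier on this page.

```python
import json


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one data record (empty if none).

    Hypothetical helper: the checks mirror the example layout on this page,
    not a published schema.
    """
    problems = []

    source = record.get("source_file")
    if not isinstance(source, str) or not source:
        problems.append("source_file must be a non-empty string")

    summaries = record.get("target_summaries")
    if not isinstance(summaries, list) or not summaries:
        problems.append("target_summaries must be a non-empty list")
        return problems

    # Assumption: scores only matter when summaries are ranked against
    # each other, so require them only when there is more than one.
    score_required = len(summaries) > 1

    for i, item in enumerate(summaries):
        if not isinstance(item, dict):
            problems.append(f"target_summaries[{i}] must be an object")
            continue
        summary = item.get("summary")
        if not isinstance(summary, str) or not summary.strip():
            problems.append(f"target_summaries[{i}].summary must be non-empty text")
        if "score" in item:
            score = item["score"]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(score, bool) or not isinstance(score, (int, float)):
                problems.append(f"target_summaries[{i}].score must be a number")
        elif score_required:
            problems.append(f"target_summaries[{i}].score is missing")

    return problems


example = json.loads("""
{
  "source_file": "transcript.txt",
  "target_summaries": [
    {"summary": "Here is a summary of the transcript.", "score": 1},
    {"summary": "Here is another, but worse, summary.", "score": 0}
  ]
}
""")

print(validate_record(example))  # prints []
```

Because the example is parsed with json.loads, it also catches trailing commas and inline comments, neither of which is valid JSON.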

Data volume

Every example helps. Even a single pair is better than nothing, so don't let the numbers below put you off. That said, we expect meaningful improvements in summary quality at 10,000+ pairs of source files and corresponding target summaries.

For the best results when mapping transcripts and EHR data to clinical notes, aim for the following data levels:

🥇 Ideal: 100,000 source/summary pairs

🥈 Preferred: 1,000 source/summary pairs

🥉 Minimum: At least 1 pair
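To see where a dataset sits relative to these levels, a small helper like the one below can be used. It is only an illustrative sketch: the function name data_tier and the returned labels are hypothetical, and the thresholds are copied from the list above.

```python
def data_tier(pair_count: int) -> str:
    """Map a number of source/summary pairs to the data levels listed above."""
    if pair_count >= 100_000:
        return "ideal"
    if pair_count >= 1_000:
        return "preferred"
    if pair_count >= 1:
        return "minimum"
    return "none"


print(data_tier(2_500))  # prints preferred
```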
