Tuning Medical Coding Models

Understand how Corti tunes coding models to your organizational needs

Before reading this page, we recommend starting with Tuning Corti's AI Models, which covers typical data-tuning scenarios and when we pursue them with customers.

Overview

Every organization has specific needs: the codes they use, the workflows that need support, and the type of coding required (diagnostic, procedure, etc.). As a rule, more relevant data produces better tuning outcomes. Data collection breaks down into two areas: format and volume.

Data format

Below is an example of what the data export should look like. At a minimum, we need the clinical notes and the corresponding code(s). Beyond that, the more relevant fields you can include, the better.

The format should be JSON. If JSON isn't possible, the data still needs to be machine-readable and structured. PDF images or chart copies aren't sufficient. This data typically comes from an EHR export.

{
  "clinical_note": "The patient was admitted following (...)",
  "target_codes": {
    "ICD-10": ["A02.1", "K70.1"],
    "CPT": [...]
  },
  "medications": ["paracetamol"],
  "test_results": {
    "blood_pressure": "180/120"
  },
  "physician_id": "95847362",
  "medical_coder_id": "10450438",
  "patient_id": "123456-9876",
  "admission_id": "918273645",
  "date_of_admission": "01-01-2024",
  "department": "Oncology",
  "patient_age": 66,
  "patient_gender": "male",
  "patient_ethnicity": "caucasian",
  "patient_survived": true
}
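Before handing over an export, it can help to check that every record meets the stated minimum (a clinical note plus at least one code). The following is a minimal sketch, not a Corti tool: the function name `validate_record` is hypothetical, and it assumes each record is a JSON object using the field names from the example above.

```python
import json


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the minimum is met.

    Assumes the field names from the example export ("clinical_note",
    "target_codes"). This is an illustrative check, not an official schema.
    """
    problems = []

    note = record.get("clinical_note")
    if not isinstance(note, str) or not note.strip():
        problems.append("clinical_note is missing or empty")

    codes = record.get("target_codes")
    if not isinstance(codes, dict) or not any(codes.values()):
        problems.append("target_codes is missing or has no codes")
    else:
        # Each coding system (e.g. "ICD-10", "CPT") should map to a list of strings.
        for system, values in codes.items():
            if not isinstance(values, list) or not all(isinstance(c, str) for c in values):
                problems.append(f"target_codes[{system!r}] should be a list of strings")

    return problems


example = json.loads(
    '{"clinical_note": "The patient was admitted following (...)",'
    ' "target_codes": {"ICD-10": ["A02.1", "K70.1"]}}'
)
print(validate_record(example))
```

Optional fields such as `medications` or `department` are deliberately not checked here, since only the note and its codes are required.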

Data volume

A useful way to think about data requirements is per code. To learn any given code to a useful degree, the system needs at least 10 examples, and ideally closer to 100.

The underlying challenge is that medical codes follow a long-tail distribution. Most codes are rare; a few are very common. That imbalance means millions of records may need to be collected before rare codes appear more than 10 times.
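To see why the long tail pushes volumes so high, here is a back-of-the-envelope sketch. It assumes a code appears independently in roughly one out of every `one_in` records; the rate used in the example is hypothetical and chosen only for illustration.

```python
def expected_records_needed(one_in: int, min_examples: int = 10) -> int:
    """Expected number of records to collect before a code that appears in
    about 1 of every `one_in` records has been seen `min_examples` times.

    This is the simple expected value (min_examples / frequency); real
    collections will vary around it.
    """
    return one_in * min_examples


# Hypothetical rare code seen in about 1 of every 100,000 records.
print(expected_records_needed(one_in=100_000))
```

Under these assumptions, reaching even the 10-example minimum for such a code takes on the order of a million records.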

Corti's recommended approach: hand over data once each of the X most common codes has at least 100 cases. The right value of X depends on which codes the system needs to support accurately. In practice, the X most common codes typically account for around 50% of total consultations, not 50% of the distinct codes in use.
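One way to estimate where an export stands against this recommendation is sketched below. It is an illustration, not Corti's own tooling: the function name `codes_ready_for_handover` is hypothetical, the input is assumed to be one list of codes per consultation, and "coverage" is interpreted here as the share of consultations that carry at least one code meeting the threshold.

```python
from collections import Counter


def codes_ready_for_handover(code_lists, min_cases=100):
    """Return (codes with at least `min_cases` cases, consultation coverage).

    `code_lists` holds one list of codes per consultation. Coverage is the
    fraction of consultations containing at least one ready code (an
    assumed interpretation of "covers").
    """
    counts = Counter()
    for codes in code_lists:
        # Count each code at most once per consultation.
        counts.update(set(codes))

    ready = {code for code, n in counts.items() if n >= min_cases}
    covered = sum(1 for codes in code_lists if ready.intersection(codes))
    coverage = covered / len(code_lists) if code_lists else 0.0
    return ready, coverage


# Hypothetical export: codes "A", "B", "C" stand in for real codes.
code_lists = [["A"]] * 100 + [["B"]] * 50 + [["A", "C"]] * 50
ready, coverage = codes_ready_for_handover(code_lists)
print(len(ready), coverage)
```

Here X is simply `len(ready)`: because codes are ranked by frequency, the codes with at least 100 cases are exactly the X most common ones.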

For the best results when mapping clinical notes and EHR data to medical codes such as ICD-10 or CPT, these are the target data levels:

🥇 Ideal: 1,000+ examples per code
🥈 Preferred: 100+ examples per code
🥉 Minimum: 10+ examples per code
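The levels above can be applied to per-code counts to see which codes still need more data. The sketch below is illustrative only: the function names and the example counts are hypothetical, and the thresholds are taken directly from the list above.

```python
def data_level(n_examples: int) -> str:
    """Map a per-code example count to the data levels listed above."""
    if n_examples >= 1000:
        return "ideal"
    if n_examples >= 100:
        return "preferred"
    if n_examples >= 10:
        return "minimum"
    return "below minimum"


def tier_report(counts: dict[str, int]) -> dict[str, list[str]]:
    """Group codes by data level, given a mapping of code -> example count."""
    report = {"ideal": [], "preferred": [], "minimum": [], "below minimum": []}
    for code, n in sorted(counts.items()):
        report[data_level(n)].append(code)
    return report


# Hypothetical counts; the code names are placeholders.
report = tier_report({"CODE_A": 1500, "CODE_B": 120, "CODE_C": 10, "CODE_D": 9})
print(report)
```

Codes that land in "below minimum" are the ones for which more records would need to be collected before tuning is likely to be useful.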
