Tuning Medical Coding Models

We recommend reading Tuning Corti's AI Models before this page. It covers typical data-tuning scenarios and when we pursue them with customers.

Overview

Every organization has specific needs: the codes they use, the workflows that need support, and the type of coding required (diagnostic, procedure, etc.). As a rule, more relevant data produces better tuning outcomes. Data collection breaks down into two areas: format and volume.

Data format

Below is an example of what the data export should look like. At a minimum, we need the clinical notes and the corresponding code(s). Additional fields, such as those shown in the example, are welcome: more fields are better.

The format should be JSON. If JSON isn't possible, the data still needs to be machine-readable and structured. PDF images or chart copies aren't sufficient. This data typically comes from an EHR export.

{
    "clinical_note": "The patient was admitted following (...)",
    "target_codes": {
        "ICD-10": ["A02.1", "K70.1"],
        "CPT": [...]
    },
    "medications": ["paracetamol"],
    "test_results": {
        "blood_pressure": "180/120"
    },
    "physician_id": "95847362",
    "medical_coder_id": "10450438",
    "patient_id": "123456-9876",
    "admission_id": "918273645",
    "date_of_admission": "01-01-2024",
    "department": "Oncology",
    "patient_age": 66,
    "patient_gender": "male",
    "patient_ethnicity": "caucasian",
    "patient_survived": true
}
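Before handing over an export, it can help to check that every record meets the stated minimum (a clinical note plus at least one code). The following is a minimal sketch in Python, not a Corti tool: the function name `validate_record` and the exact checks are assumptions made for illustration, and only the field names `clinical_note` and `target_codes` come from the example above.

```python
import json

# The two fields described above as the minimum; everything else is optional.
REQUIRED_FIELDS = ("clinical_note", "target_codes")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one exported record (empty if none).

    Hypothetical helper for illustration; the checks are assumptions.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")

    if "clinical_note" in record:
        note = record["clinical_note"]
        if not isinstance(note, str) or not note.strip():
            problems.append("clinical_note must be a non-empty string")

    if "target_codes" in record:
        codes = record["target_codes"]
        if not isinstance(codes, dict) or not codes:
            problems.append("target_codes must be a non-empty object")
        elif not any(isinstance(v, list) and v for v in codes.values()):
            problems.append("target_codes must contain at least one code")

    return problems


# A trimmed-down record using values from the example export.
record = json.loads("""
{
    "clinical_note": "The patient was admitted following (...)",
    "target_codes": {"ICD-10": ["A02.1", "K70.1"]},
    "department": "Oncology"
}
""")

print(validate_record(record))
print(validate_record({"clinical_note": ""}))
```

The first call reports no problems. The second reports both the missing `target_codes` field and the empty note.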

Data volume

A useful way to think about data requirements is per code. To learn any given code to a useful degree, the system needs at least 10 examples, and ideally closer to 100.

The underlying challenge is that medical codes follow a long-tail distribution. Most codes are rare; a few are very common. That imbalance means millions of records may need to be collected before rare codes appear more than 10 times.

Corti's recommended approach: hand over data once each of the X most common codes has at least 100 cases. The right value of X depends on which codes the system needs to support accurately. In practice, the X most common codes together typically account for around 50% of total consultations, which is not the same as 50% of the distinct codes in use.
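The handover rule above can be checked against an export with a few lines of Python. This is a rough sketch under stated assumptions: each record is reduced to its list of codes, one record is treated as one consultation, and the function name `handover_readiness` and the placeholder codes `CODE_A` to `CODE_D` are invented for illustration.

```python
from collections import Counter


def handover_readiness(records, x, min_cases=100):
    """Check whether each of the x most common codes has at least min_cases records.

    `records` is a list of code lists, one per consultation (an assumption).
    Returns (ready, coverage), where coverage is the share of records that
    carry at least one of the x most common codes.
    """
    # Count each code at most once per record.
    counts = Counter()
    for codes in records:
        counts.update(set(codes))

    top = counts.most_common(x)
    top_codes = {code for code, _ in top}

    # Ready only if x distinct codes exist and all of them reach the threshold.
    ready = len(top) == x and all(n >= min_cases for _, n in top)

    covered = sum(1 for codes in records if top_codes & set(codes))
    coverage = covered / len(records) if records else 0.0
    return ready, coverage


# Toy data with placeholder codes: two common codes and a short tail.
records = (
    [["CODE_A"]] * 120
    + [["CODE_B"]] * 100
    + [["CODE_C"]] * 30
    + [["CODE_D"]] * 5
)

for x in (2, 3):
    ready, coverage = handover_readiness(records, x)
    print(f"X={x}: ready={ready}, coverage={coverage:.0%}")
```

With this toy data, X=2 is ready because both codes have at least 100 cases, while X=3 is not because the third code has only 30. The coverage figure is the share of consultations, not the share of distinct codes.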

For the best results when mapping clinical notes and EHR data to medical codes (for example ICD-10 or CPT), these are the target data levels:

🥇 Ideal: 1,000+ examples per code
🥈 Preferred: 100+ examples per code
🥉 Minimum: 10+ examples per code
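
To see where an export stands against these levels, per-code counts can be sorted into tiers. The sketch below is illustrative only: the thresholds come from the list above, while the function names, the "below minimum" label, and the placeholder codes and counts are assumptions.

```python
# Thresholds from the target data levels, highest first.
TIERS = [(1000, "ideal"), (100, "preferred"), (10, "minimum")]


def tier_for(count):
    """Return the highest tier whose threshold the example count reaches."""
    for threshold, name in TIERS:
        if count >= threshold:
            return name
    return "below minimum"  # label assumed for codes with fewer than 10 examples


def tier_report(code_counts):
    """Map each code to its tier, given a dict of code -> number of examples."""
    return {code: tier_for(n) for code, n in code_counts.items()}


# Placeholder codes and counts, made up for illustration.
example_counts = {"CODE_A": 1500, "CODE_B": 100, "CODE_C": 12, "CODE_D": 9}

for code, tier in tier_report(example_counts).items():
    print(f"{code}: {tier}")
```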