Before reading this page, start with Tuning Corti's AI Models, which covers typical data-tuning scenarios and when we pursue them with customers.
Overview
Every organization has specific needs: the codes it uses, the workflows that need support, and the type of coding required (diagnostic, procedure, etc.). As a rule, more relevant data produces better tuning outcomes. Data collection breaks down into two areas: format and volume.
Data format
Below is an example of what the data export should look like. At a minimum, we need the clinical notes and the corresponding code(s). Any additional fields you can include are helpful.
The format should be JSON. If JSON isn't possible, the data still needs to be machine-readable and structured. PDF images or chart copies aren't sufficient. This data typically comes from an EHR export.
{
  "clinical_note": "The patient was admitted following (...)",
  "target_codes": {
    "ICD-10": ["A02.1", "K70.1"],
    "CPT": [...]
  },
  "medications": ["paracetamol"],
  "test_results": {
    "blood_pressure": "180/120"
  },
  "physician_id": "95847362",
  "medical_coder_id": "10450438",
  "patient_id": "123456-9876",
  "admission_id": "918273645",
  "date_of_admission": "01-01-2024",
  "department": "Oncology",
  "patient_age": 66,
  "patient_gender": "male",
  "patient_ethnicity": "caucasian",
  "patient_survived": true
}
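Before handing over an export, it can help to check that every record meets the minimum described above (a clinical note plus at least one code). The following is a minimal sketch, not a Corti tool: the function name `validate_record` and the exact checks are assumptions made for illustration, and the field names are taken from the example record.

```python
import json

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []

    # The clinical note must be a non-empty string.
    note = record.get("clinical_note")
    if not isinstance(note, str) or not note.strip():
        problems.append("clinical_note is missing or empty")

    # target_codes must be an object with at least one non-empty code list.
    codes = record.get("target_codes")
    if not isinstance(codes, dict) or not any(codes.values()):
        problems.append("target_codes is missing or has no codes")

    return problems

# A reduced record using only the two minimum fields (illustrative values).
raw = (
    '{"clinical_note": "The patient was admitted following (...)", '
    '"target_codes": {"ICD-10": ["A02.1", "K70.1"]}}'
)
record = json.loads(raw)
print(validate_record(record))
```

Parsing with `json.loads` also surfaces syntax problems such as trailing commas, which strict JSON does not allow.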
Data volume
A useful way to think about data requirements is per code. To learn any given code to a useful degree, the system needs at least 10 examples, and ideally closer to 100.
The underlying challenge is that medical codes follow a long-tail distribution. Most codes are rare; a few are very common. That imbalance means millions of records may need to be collected before rare codes appear more than 10 times.
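To make the long-tail point concrete, here is a rough back-of-the-envelope sketch. It assumes a code appears on average once per `one_in_n` records; that rate and the function name are hypothetical, and the result is an expectation rather than a guarantee.

```python
def expected_records_needed(target_examples: int, one_in_n: int) -> int:
    """Records to collect so that a code seen once per `one_in_n` records
    is expected to appear `target_examples` times."""
    return target_examples * one_in_n

# Hypothetical rare code: appears in roughly 1 of every 100,000 records.
for target in (10, 100):
    needed = expected_records_needed(target, one_in_n=100_000)
    print(f"{target} examples -> about {needed:,} records")
```

Under this assumed rate, reaching 10 examples already takes on the order of a million records, which is why per-code targets are usually set only for the common codes.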
Corti's recommended approach: hand over data once each of the X most common codes has at least 100 cases. The right value of X depends on which codes the system needs to support accurately. In practice, the X most common codes typically account for around 50% of total consultations, not 50% of the codes in use.
When mapping clinical notes and EHR data to medical codes (such as ICD-10 or CPT), these are the target data levels:
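The handover rule above can be checked against an export with a short script. This is a sketch under stated assumptions: `handover_readiness` is a hypothetical helper, records follow the example format, and "coverage" is interpreted here as the share of records carrying at least one of the top X codes, which is one possible reading of the 50% figure.

```python
from collections import Counter

def handover_readiness(records, top_x, min_cases=100):
    """Return (ready, coverage) for the top_x most common codes.

    ready:    True if there are top_x codes and each has >= min_cases records.
    coverage: fraction of records with at least one of those codes.
    """
    code_counts = Counter()
    for rec in records:
        for system, codes in rec["target_codes"].items():
            # Count each code at most once per record, keyed by coding system.
            code_counts.update((system, code) for code in set(codes))

    top = code_counts.most_common(top_x)
    top_codes = {key for key, _ in top}
    ready = len(top) == top_x and all(n >= min_cases for _, n in top)

    covered = sum(
        1
        for rec in records
        if any(
            (system, code) in top_codes
            for system, codes in rec["target_codes"].items()
            for code in codes
        )
    )
    coverage = covered / len(records) if records else 0.0
    return ready, coverage

# Toy data (illustrative only): three records with one code, one with another.
records = (
    [{"target_codes": {"ICD-10": ["A02.1"]}}] * 3
    + [{"target_codes": {"ICD-10": ["K70.1"]}}]
)
print(handover_readiness(records, top_x=1, min_cases=3))
```

Running it for several values of `top_x` shows how coverage grows as more codes are included, which can help in choosing X.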
🥇 Ideal: 1,000+ examples per code
🥈 Preferred: 100+ examples per code
🥉 Minimum: 10+ examples per code
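As a rough illustration, the three levels above can be applied to per-code example counts as follows. The function name `data_tier`, the lowercase tier labels, and the "insufficient" label for counts below 10 are assumptions added for this sketch; the counts in the usage example are made up.

```python
from collections import Counter

def data_tier(example_count: int) -> str:
    """Map a per-code example count to the target levels listed above."""
    if example_count >= 1000:
        return "ideal"
    if example_count >= 100:
        return "preferred"
    if example_count >= 10:
        return "minimum"
    return "insufficient"  # below the stated minimum (label is an assumption)

# Made-up per-code counts, for illustration only.
counts = {"A02.1": 1500, "K70.1": 240, "code_c": 12, "code_d": 3}
tiers = {code: data_tier(n) for code, n in counts.items()}
print(tiers)
print(Counter(tiers.values()))
```

Summarizing how many codes fall into each tier gives a quick view of where more data collection would help most.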
