Evaluating Clinical Coding Models: Measuring Accuracy and Efficiency with F1 Score and Recall@5

Introduction

Clinical coding is a fundamental process in healthcare, translating medical diagnoses, procedures, and services into standardized codes. Accurate coding is essential for effective communication among healthcare providers, proper billing, and reliable statistical analysis. With the advent of machine learning, models such as Corti's have been developed to automate clinical coding, making robust evaluation necessary to ensure their efficacy.

Evaluation Metrics

Two critical metrics for assessing the performance of clinical coding models are the F1 score and Recall@5.

F1 Score

The F1 score is the harmonic mean of precision and recall, offering a single metric that reflects both the accuracy of positive predictions (precision) and the model's ability to identify all relevant positive cases (recall). In clinical coding, a high F1 score indicates that the model both assigns codes accurately and captures the full set of pertinent codes within the dataset.
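As an illustrative sketch (not Corti's actual implementation), a set-based F1 score for a single document's code assignments could be computed like this in Python; the function name and the example ICD-10 codes are hypothetical:

```python
def f1_score(predicted, actual):
    """Set-based F1 for one document's code assignments.

    predicted, actual: collections of code strings (e.g. ICD-10 codes).
    """
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # nothing to predict, nothing predicted
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

# Example: the model predicts three codes, two of which are correct,
# giving precision 2/3, recall 1.0, and F1 = 0.8.
print(f1_score({"I10", "E11.9", "J45"}, {"I10", "E11.9"}))
```

Because the harmonic mean punishes imbalance, a model that over-predicts codes (high recall, low precision) or under-predicts them (high precision, low recall) both end up with a lower F1 than a balanced model.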

Recall@5

Recall@5 measures the proportion of relevant codes included within the top five suggestions made by the model. This metric is particularly valuable in clinical settings where models provide multiple coding options, ensuring that the correct code is likely among the top recommendations.

Methodology

To evaluate the Corti model, a dataset of electronic health record (EHR) clinical documents with corresponding ICD-10 codes is used. The model's performance is assessed by comparing the predicted codes (i.e., the output of the AI model for each input EHR document) against the ground-truth codes in the dataset. Both the F1 score and Recall@5 are calculated to provide a comprehensive evaluation.
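The evaluation described above can be sketched as a loop over (predicted, ground-truth) pairs that averages both metrics across the dataset. This is a simplified illustration under the assumption that predictions arrive as confidence-ranked lists; it is not Corti's actual evaluation pipeline:

```python
def evaluate(dataset):
    """Average set-based F1 and Recall@5 over a dataset.

    dataset: list of (ranked_predictions, actual_codes) pairs, where
    ranked_predictions is a list ordered by model confidence.
    """
    f1_total = r5_total = 0.0
    for ranked, actual in dataset:
        predicted, actual = set(ranked), set(actual)
        tp = len(predicted & actual)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(actual) if actual else 0.0
        f1_total += (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        # Recall@5: how many ground-truth codes appear in the top five.
        top_5 = set(ranked[:5])
        r5_total += len(top_5 & actual) / len(actual) if actual else 0.0
    n = len(dataset)
    return f1_total / n, r5_total / n

# Example: one perfectly coded document and one with a partial match
# (precision 0.5, recall 0.5) average out to F1 = 0.75, Recall@5 = 0.75.
docs = [(["I10", "E11.9"], {"I10", "E11.9"}),
        (["I10", "J45"], {"I10", "E11.9"})]
print(evaluate(docs))
```

Averaging per-document scores (macro averaging) weights every document equally; an alternative is to pool true positives across the whole dataset (micro averaging), which weights documents by how many codes they carry.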

Conclusion

Evaluating clinical coding models with metrics such as the F1 score and Recall@5 provides an empirical understanding of model performance. Higher scores on both metrics indicate that a model is likely to perform well in operational use.
