ICD-10 coding based on semantic distance: LSI_UNED at CLEF eHealth 2020 Task 1


This paper describes our contribution to CLEF eHealth 2020 Task 1, which consists of annotating Spanish Electronic Health Records (EHRs) with CIE-10-ES codes. CIE-10-ES is the extended version of ICD-10 used in Spain. One of the sub-tasks targets the interpretability of the proposed codes, in line with recent demands in Natural Language Processing (NLP). Moreover, the ICD-10 codes assigned by hospitals usually follow an extreme, highly imbalanced distribution, which makes annotation challenging. For that reason, we explore an unsupervised method based on semantic similarity, using a representation built on the SNOMED-CT clinical terminology. Since example-based learning can capture complex patterns, we combine this method with Gradient Boosting to model the codes that have more training instances. The system achieves mAP scores of 0.517 for CIE-10-ES diagnosis codes and 0.398 for CIE-10-ES procedure codes. The mixed approach improves on the strictly supervised proposals by more than 38\% and 13\%, respectively. Finally, the unsupervised component is used to locate evidence for the codes in the EHRs, which provides greater interpretability.
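For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean average precision (mAP) over ranked code lists is commonly computed. It assumes each document has a ranked list of predicted codes without duplicates and a set of gold codes. The function names, document identifiers, and example codes are hypothetical, and the official task evaluation script may differ in details.

```python
from typing import Dict, List, Set


def average_precision(ranked_codes: List[str], gold_codes: Set[str]) -> float:
    """Average precision of one ranked code list against its gold codes.

    Assumes ranked_codes contains no duplicates (illustrative assumption).
    """
    if not gold_codes:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in gold_codes:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    # Gold codes that were never predicted contribute zero.
    return precision_sum / len(gold_codes)


def mean_average_precision(
    predictions: Dict[str, List[str]], gold: Dict[str, Set[str]]
) -> float:
    """Mean of per-document average precision over all gold documents."""
    if not gold:
        return 0.0
    scores = [
        average_precision(predictions.get(doc_id, []), codes)
        for doc_id, codes in gold.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two documents with ranked predictions and gold codes.
predictions = {
    "doc1": ["i10", "e11.9", "j45"],
    "doc2": ["k21.9", "n39.0"],
}
gold = {
    "doc1": {"i10", "j45"},
    "doc2": {"n39.0"},
}
score = mean_average_precision(predictions, gold)
```

In this toy example, the first document scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3. Because the metric rewards placing correct codes early in the ranking, it suits systems that output ordered candidate lists rather than hard decisions.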




	title     = {ICD-10 coding based on semantic distance: LSI UNED at CLEF eHealth 2020 Task 1}, 
	author    = {Almagro, Mario and Martínez, Raquel and Fresno, Víctor and Montalvo, Soto and Tissot, Hegler}, 
	volume    = {2696}, 
	booktitle = {CEUR Workshop Proceedings}, 
	publisher = {CEUR-WS}, 
	year      = {2020}