Artificial intelligence could extract information from cancer patients’ medical records: Colombian researchers

September 29, 2023

Cali, Colombia — Colombian researchers have used artificial intelligence to automatically extract information from the medical records of thousands of breast cancer patients and search it for emerging patterns, opening up possibilities for future interventions and treatments.

According to the American Cancer Society, breast cancer is the most common cancer worldwide today, accounting for 12.5% of all new annual cases globally.

Professor Oswaldo Solarte Pabón from the School of Systems and Computer Engineering of the Faculty of Engineering at the Universidad del Valle (Univalle) explained that he and his international collaborators had, for the first time, detected such patterns in clinical notes written in Spanish.

“The objective was to extract information from the clinical notes of cancer patients to find valuable patterns,” said Professor Solarte, adding that the results obtained from this research were used to structure data from a hospital in Madrid (Spain) in order to extract models for breast cancer relapse prediction and quality of life analysis.

Image: Breast cancer annotations. Credit: Solarte et al.

The Research

Hospitals around the world produce about 50 petabytes of data every year. Although 97 per cent of this data currently goes unused, that is changing, and the data has huge potential to transform the quality of healthcare.

Professor Solarte explained that, in recent years, the use of Natural Language Processing (NLP) in the biomedical field has increased the possibility of automatically extracting information from medical records. In other words, it is now possible to automate the process of reading, understanding and structuring clinical text using artificial intelligence techniques.

“This is a great achievement, as it is not feasible, and would be very costly, for doctors to extract information manually because they face a big data problem. In the case study conducted in this research, the care process of each patient generated 300 clinical notes on average and the study involved 1000 patients. In this case, 300,000 text files had to be analysed,” Professor Solarte said.

In the paper “Transformers for extracting breast cancer information from Spanish clinical narratives”, published in the journal Artificial Intelligence in Medicine, the researchers used a corpus (a large set of texts considered representative of a language) manually annotated by physicians to support the extraction of named entities (Named Entity Recognition) in the field of breast cancer.

“It is the first corpus intended to support the extraction of medical concepts of breast cancer in Spanish,” Professor Solarte said.

Professor Ernestina Menasalvas Ruiz, of the Universidad Politécnica de Madrid, a co-author of the scientific publication and Professor Solarte’s thesis supervisor, explained that the results of entity extraction and the detection of negation and uncertainty have helped in structuring patient information, which in turn has made it easier to build predictive models.
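
As a rough illustration of what this structuring step produces, the sketch below turns a short Spanish clinical note into a list of entities, each flagged as negated or uncertain. It is not the researchers' method: the paper relies on transformer models trained on a physician-annotated corpus, whereas this toy version uses simple dictionary lookup and cue words. The lexicon, the cue lists, the three-word window and the example note are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon mapping terms to entity labels (an assumption,
# not the annotation scheme used in the paper).
LEXICON = {
    "carcinoma ductal": "DIAGNOSIS",
    "metástasis": "FINDING",
    "recidiva": "FINDING",
    "tamoxifeno": "TREATMENT",
}

# Hypothetical cue words; real systems use far richer cue lists or learned models.
NEGATION_CUES = re.compile(r"\b(?:no|sin|niega)\b")
UNCERTAINTY_CUES = re.compile(r"\b(?:posible|probable|sospecha de)\b")

WINDOW = 3  # number of words before an entity that are searched for cues


@dataclass
class Entity:
    text: str
    label: str
    negated: bool
    uncertain: bool


def extract_entities(note: str) -> list[Entity]:
    """Return lexicon matches in the note with negation/uncertainty flags."""
    entities = []
    # Split into rough sentences so a cue only affects its own sentence.
    for sentence in re.split(r"[.;]\s*", note.lower()):
        for term, label in LEXICON.items():
            pos = sentence.find(term)
            if pos == -1:
                continue
            # Look only at the few words immediately before the entity.
            context = " ".join(sentence[:pos].split()[-WINDOW:])
            entities.append(
                Entity(
                    text=term,
                    label=label,
                    negated=bool(NEGATION_CUES.search(context)),
                    uncertain=bool(UNCERTAINTY_CUES.search(context)),
                )
            )
    return entities


if __name__ == "__main__":
    note = "Paciente con carcinoma ductal. Sin metástasis. Posible recidiva; inicia tamoxifeno."
    for entity in extract_entities(note):
        print(entity)
```

Once notes are reduced to records like these, they can be stored in tables and used as inputs for downstream analysis such as relapse prediction.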

“The use of new language models is a current trend in artificial intelligence and we are using them to extract medical information and validating all the results,” said Professor Menasalvas, adding that once these models are validated, the next step is to apply them to other diseases, starting with other types of tumours.

Professor Solarte also wants to use the experience gained from this research to strengthen work on deep learning and artificial intelligence in Colombia.

“The idea is to form a group dedicated to artificial intelligence applied to health,” Professor Solarte said.

Image: The diagram shows the proposed approach, which consists of three steps: (i) Corpus generation, (ii) Model training and (iii) Model validation. Credit: Solarte et al.

International Collaboration

Professor Solarte recently returned to Colombia after spending four years at the Universidad Politécnica de Madrid in Spain, where he studied for his PhD at the Centro de Tecnología Biomédica.

Professor Menasalvas, who has collaborated with the Universidad del Valle for many years, explained that internationalization is undoubtedly beneficial because of what the different teams can bring to the table.

“In this particular project where language models are a central element, collaboration is essential because of the differentiating aspects that Spanish can have in the different locations where it is spoken,” she said, adding that being able to train and validate Spanish models not only with texts from one location but from multiple sites will certainly add value and improve the performance of the models.

This article originally appeared on the Faculty of Engineering (Universidad del Valle) website and is reproduced with permission. It was written by Andrew James (NCC/Univalle).