Fast Ontology-based Retrieval and Search Tool

The ability to quickly retrieve information from databases and other data sources is critical for supporting several natural language applications. Previous work has proposed a trie-based search that integrates phonetic encoding (e.g., double metaphone) and similarity metrics (e.g., edit distance) to provide relatively high accuracy when searching dictionary entries. FOREST aims to extend previous work by including hierarchical structures of classes, assumptions, and axioms (an ontology) and to evaluate the implementation in a Named Entity Recognition (NER) task, i.e., identifying named entities in free text.
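The trie-based approximate search underlying this line of work can be sketched as follows — a minimal illustration, not the FOREST implementation, which leaves out the phonetic-encoding layer (e.g., double metaphone) that would be applied on top of the raw strings:

```python
# Minimal sketch of trie-based approximate dictionary search: one row of the
# Levenshtein DP table is computed per trie level, and branches whose best
# possible distance already exceeds the limit are pruned.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on terminal nodes

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def fuzzy_search(root, query, max_dist):
    """Return all dictionary words within `max_dist` edits of `query`."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _search(child, ch, query, first_row, results, max_dist)
    return results

def _search(node, ch, query, prev_row, results, max_dist):
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        cost = 0 if query[i - 1] == ch else 1
        row.append(min(row[i - 1] + 1,           # insertion
                       prev_row[i] + 1,          # deletion
                       prev_row[i - 1] + cost))  # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:  # prune branches that cannot recover
        for nxt_ch, child in node.children.items():
            _search(child, nxt_ch, query, row, results, max_dist)

root = TrieNode()
for w in ["smith", "smyth", "smithe", "jones"]:
    insert(root, w)
print(fuzzy_search(root, "smith", 1))  # matches within one edit: smith, smithe, smyth
```

The pruning step is what makes the trie pay off over a flat dictionary scan: shared prefixes share DP rows, and whole subtrees are skipped once no completion can stay within the edit budget.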


❱❱❱ ASK

Querying relational databases using natural language (NL2SQL)

Recent advances in natural language processing (NLP) and understanding (NLU) have renewed the interest in using natural language to query databases. Such approaches do not require users (a) to learn a complex query language (e.g., Structured Query Language, SQL), (b) to understand the exact schema of the data, or (c) to know how the data is stored. Despite the many relevant NLP preprocessing techniques, interpreting natural language sentences correctly, dealing with various forms of ambiguity, and mapping queries to the appropriate context remain persistent challenges in this field. Recent work compares NLIDB systems against a set of benchmark questions to evaluate their functionality and expressiveness, distinguishing between five different approaches: (a) keyword-based, (b) pattern-based, (c) parsing-based, (d) grammar-based, and (e) neural machine translation-based. ASK aims at developing an extensible NL2SQL framework that integrates at least the following components: a relational database, a data schema, and a hybrid rule-based and machine learning approach for language interpretation, with the goal of generalizing over multiple database schemas with minimal retraining.
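To make the hybrid idea concrete, here is a deliberately tiny pattern-based baseline of the kind the rule-based half of such a system might start from — the table and column names are hypothetical, and a real system would fall through to an ML interpreter when no pattern fires:

```python
# Toy pattern-based NL2SQL sketch (illustrative only; SCHEMA is an assumed
# toy schema, not a real database).
import re

SCHEMA = {"patients": ["name", "age", "diagnosis"]}  # hypothetical schema

PATTERNS = [
    # "show all patients" -> SELECT *
    (re.compile(r"show all (\w+)"), lambda m: f"SELECT * FROM {m.group(1)}"),
    # "how many patients" -> COUNT
    (re.compile(r"how many (\w+)"), lambda m: f"SELECT COUNT(*) FROM {m.group(1)}"),
    # "patients with diagnosis flu" -> WHERE clause
    (re.compile(r"(\w+) with (\w+) (\w+)"),
     lambda m: f"SELECT * FROM {m.group(1)} WHERE {m.group(2)} = '{m.group(3)}'"),
]

def nl2sql(question):
    q = question.lower().strip("?")
    for pattern, build in PATTERNS:
        m = pattern.search(q)
        if m and m.group(1) in SCHEMA:
            return build(m)
    return None  # a hybrid system would hand off to an ML interpreter here

print(nl2sql("How many patients?"))           # SELECT COUNT(*) FROM patients
print(nl2sql("Patients with diagnosis flu"))  # SELECT * FROM patients WHERE diagnosis = 'flu'
```

Checking the matched table name against the schema is the minimal form of the schema-awareness the paragraph describes; generalizing to new schemas then amounts to swapping the `SCHEMA` dictionary rather than rewriting rules.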



Synthetic Data Generator

Using synthetic data has proved to be a relevant approach, especially in domains dealing with sensitive data, where sharing datasets to produce evaluation tasks is not allowed, e.g., the clinical domain. The use of synthetic data alleviates the burden of anonymization, a procedure that is not necessarily fully accurate. SYNNER is presented as an extensible “Synthetic Dataset Generator” framework for domain-specific (e.g., clinical) multi-relational data. The main idea behind this work is to look at a real dataset and produce a synthetic version of it that: (a) mimics the distributions and correlations of the observed data, (b) anonymizes sensitive data when applicable (named entities), (c) shifts the temporal dimension without losing the corresponding correlations, and (d) can be used to train machine learning classifiers that still sustain accurate results when evaluated on real test data. SYNNER will be designed to be flexible in simulating distinct scenarios, e.g., when the probability of an event occurring after a point in time t must increase at a rate of 10% a year.
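Two of the pieces above can be sketched in a few lines — resampling a column from its observed marginal distribution, and the 10%-a-year scenario from the closing example. These are illustrative assumptions, not the actual SYNNER design, which would also preserve cross-column correlations:

```python
# (a) draw synthetic values that mimic an observed marginal distribution;
# (b) scale an event probability by 10% per year after a point in time t.
import random
from collections import Counter

def synthetic_column(real_values, n, seed=0):
    """Draw n synthetic values weighted by the observed value frequencies."""
    rng = random.Random(seed)
    counts = Counter(real_values)
    values, weights = zip(*counts.items())
    return rng.choices(values, weights=weights, k=n)

def scaled_probability(p0, years_after_t, annual_increase=0.10):
    """Event probability growing by `annual_increase` per year after time t, capped at 1."""
    return min(1.0, p0 * (1 + annual_increase) ** max(0, years_after_t))

real = ["flu", "flu", "flu", "cold"]
print(synthetic_column(real, 8))
print(scaled_probability(0.05, 3))  # 0.05 * 1.1**3
```

Frequency-weighted resampling keeps each column's marginal distribution but breaks row-level linkage to real patients, which is the point; correlation-preserving generation would need a joint model over columns.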


❱❱❱ WHY

Explainability Uncovered for Machine Learning Applications

In contrast with the concept of the "black box" in machine learning, where a specific prediction cannot be explained, in explainable AI the resulting predictions can be understood by humans with the support of correlated features. However, looking at feature importance over the whole dataset might mislead the explanations and entail a loss of information by not taking specific cases or clusters into account. WHY is based on low-dimensional vector representations and clustering techniques to support ML applications with domain-specific and case-centric explainability, mainly evaluated on patient data from the clinical domain.
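The contrast between dataset-level and cluster-level explanation can be illustrated as follows — an assumed, simplified approach that compares each cluster's feature means against the global means; cluster labels are given here, whereas WHY would derive them from low-dimensional embeddings:

```python
# Cluster-level explanation sketch: for each cluster, rank features by how far
# the cluster mean deviates from the global mean (toy data, made-up values).

def cluster_explanations(rows, labels, feature_names):
    """Per-cluster feature ranking by deviation from the global mean."""
    n_feat = len(feature_names)
    global_mean = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]
    out = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        mean = [sum(r[j] for r in members) / len(members) for j in range(n_feat)]
        deltas = [(feature_names[j], mean[j] - global_mean[j]) for j in range(n_feat)]
        out[lab] = sorted(deltas, key=lambda d: abs(d[1]), reverse=True)
    return out

rows = [[70, 1.2], [75, 1.1], [40, 3.9], [45, 4.1]]   # e.g., [age, biomarker]
labels = ["A", "A", "B", "B"]
expl = cluster_explanations(rows, labels, ["age", "biomarker"])
print(expl["A"][0])  # the feature that most distinguishes cluster A
```

A global feature-importance score averaged over both clusters would blur exactly the signal this surfaces: the two clusters deviate from the global mean in opposite directions.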



Converting unstructured text into knowledge graphs

Electronic health records (EHRs) are increasingly used for research, either via a patient-centric approach or in the form of decision support for clinicians, such as point-of-care alerts. Patient characteristics are extracted from EHR databases. However, much of the information in EHRs is unstructured, in the form of free text rather than in a structured form. Natural language processing (NLP) techniques can extract relevant information from free text but cannot be relied upon to be completely accurate because of typographical errors and nuances of human language. Algorithms incorporating NLP have been tested for their ability to identify patients’ medical conditions directly from clinical notes, but typically operating on a very raw data format. CLINNER aims at using real and synthetic patient data to evaluate the current state of the art (SOTA) and explore different ways of extracting and modeling clinical information found in text, converting natural language text into a structured knowledge graph representation.
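The text-to-graph step can be sketched as subject-relation-object triple extraction feeding an adjacency list — the patterns and the example note are hypothetical, and CLINNER would rely on full NLP pipelines rather than regular expressions:

```python
# Toy sketch: extract (subject, relation, object) triples from clinical-style
# free text and load them into a small knowledge graph.
import re

PATTERNS = [
    (re.compile(r"(\w+) was diagnosed with (\w+)"), "HAS_DIAGNOSIS"),
    (re.compile(r"(\w+) was prescribed (\w+)"), "PRESCRIBED"),
]

def extract_triples(text):
    triples = []
    for sentence in text.split("."):
        for pattern, relation in PATTERNS:
            for m in pattern.finditer(sentence):
                triples.append((m.group(1), relation, m.group(2)))
    return triples

def to_graph(triples):
    """Adjacency-list knowledge graph: node -> [(relation, node), ...]."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph

note = "Patient42 was diagnosed with pneumonia. Patient42 was prescribed amoxicillin."
graph = to_graph(extract_triples(note))
print(graph["Patient42"])  # [('HAS_DIAGNOSIS', 'pneumonia'), ('PRESCRIBED', 'amoxicillin')]
```

Once the notes are in this triple form, downstream queries ("which patients were prescribed X after diagnosis Y") become graph traversals rather than text searches.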



Identifying and normalizing calendar expressions in free text

Extracting and modeling temporal information from text is an important element for developing timelines and trajectories. However, to automatically analyze complex trajectory information enclosed in text (e.g., the timing of symptoms and the duration of treatment in the clinical domain), it is important to understand the related temporal aspects, anchoring each event to an absolute point in time. Previous work has analyzed the suitability of existing temporal annotation schemas for capturing timeline information, identifying challenges and possible solutions, and has proposed a novel annotation schema that could be useful for timeline reconstruction: CALendar EXpression (CALEX). CALEX aims to develop and evaluate an annotation tool that integrates more complex aspects of temporal description in text, such as temporal imprecision, and to formalize a new taxonomy for the different types of calendar expressions.
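Temporal imprecision can be handled by normalizing an expression to a date interval rather than a single date — a sketch under assumed conventions (e.g., "early" mapping to days 1-10), not the CALEX schema itself:

```python
# Sketch: normalize a (possibly imprecise) calendar expression such as
# "early June 2020" to a (start_date, end_date) interval.
import calendar
import re
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def normalize(expr):
    """Return (start_date, end_date) for the expression, or None if unparsed."""
    m = re.fullmatch(r"(early |mid |late )?(\w+) (\d{4})", expr.lower())
    if not m or m.group(2) not in MONTHS:
        return None
    qualifier, month, year = m.group(1), MONTHS[m.group(2)], int(m.group(3))
    last = calendar.monthrange(year, month)[1]  # length of the month
    # Assumed convention: thirds of the month for "early"/"mid"/"late".
    spans = {"early ": (1, 10), "mid ": (11, 20), "late ": (21, last), None: (1, last)}
    start_day, end_day = spans[qualifier]
    return (date(year, month, start_day), date(year, month, end_day))

print(normalize("early June 2020"))  # interval covering the first ten days of June
```

Representing every expression as an interval lets precise dates (a degenerate one-day interval) and imprecise ones share a single anchoring scheme when events are placed on a timeline.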


❱❱❱ KRAL & T-VEX

Knowledge Representation for Artificial Learning using Time Series Multirelational Data

KRAL & T-VEX aim to find effective ways to use low-dimensional vector representations of multi-relational data in time-sensitive machine learning applications. Previous work attempted to predict whether a patient would develop a certain clinical condition during the patient’s hospitalization. However, that approach does not follow the patient’s timeline of clinical events or determine the exact points in time at which the risk of that same condition can be assessed.
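The difference between one prediction per admission and point-in-time risk assessment can be illustrated with a sliding window over the event timeline — the event weights, window, and threshold below are made-up values for illustration; KRAL & T-VEX would learn such scores from vector representations rather than a hand-written table:

```python
# Sketch: score risk at each point along a patient timeline (hour, event_type)
# using a sliding window, instead of a single per-admission prediction.

RISK_WEIGHTS = {"fever": 0.2, "tachycardia": 0.3, "hypotension": 0.5}  # assumed

def risk_over_time(events, window_hours=24, threshold=0.6):
    """Return (hour, score, flagged) at each event time on the timeline."""
    timeline = []
    for hour, _ in events:
        recent = [e for h, e in events if hour - window_hours < h <= hour]
        score = min(1.0, sum(RISK_WEIGHTS.get(e, 0.0) for e in recent))
        timeline.append((hour, round(score, 2), score >= threshold))
    return timeline

events = [(2, "fever"), (10, "tachycardia"), (30, "hypotension")]
print(risk_over_time(events))  # risk crosses the threshold only at hour 30
```

A per-admission model would emit one label for this whole stay; the timeline view shows the earliest hour at which the risk becomes assessable, which is the capability the paragraph says prior work lacked.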


❱❱❱ Text-SEG

Text Segmentation - Identifying sections in free text: an unsupervised approach

The ability to extract information from unstructured data sources (text) is critical to support the design and enrichment of structured databases. This project explores the use of existing (and adapted) NLP and IE tools to build unsupervised text segmentation tools for free text.
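One classic unsupervised approach this project could build on is TextTiling-style segmentation: place a boundary where lexical cohesion between adjacent blocks of text drops. The sketch below is a simplified baseline under that assumption, not the Text-SEG implementation:

```python
# TextTiling-style sketch: compute bag-of-words cosine similarity between
# adjacent sentences and place boundaries at pronounced similarity dips.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def segment(sentences, depth=0.1):
    """Return indices where a new segment starts (boundary after a dip)."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]
    mean = sum(sims) / len(sims)
    return [i + 1 for i, s in enumerate(sims) if s < mean - depth]

sentences = [
    "the patient reported fever and cough",
    "fever persisted and cough worsened",
    "billing codes were updated in the system",
    "the system generated an invoice for billing",
]
print(segment(sentences))  # boundary between the clinical and billing sentences
```

Because the method needs only word overlap between neighbouring blocks, it requires no labeled section boundaries, which is what makes the segmentation unsupervised; real section identification would add smoothing, block windows larger than one sentence, and stopword handling.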