The ability to quickly retrieve information from databases and other data sources is critical for supporting several
natural language applications. Previous work has proposed a trie-based search that integrates phonetic encoding
(e.g., double metaphone) and similarity metrics (e.g., edit distance) to provide relatively high accuracy when searching
for dictionary entries. FOREST aims to extend that work by including hierarchical structures of classes, assumptions,
and axioms (an ontology) and by evaluating the implementation on a Named Entity Recognition (NER) task, i.e., identifying named entities in free text.
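As a rough illustration of the trie-based fuzzy search mentioned above, the sketch below walks a character trie while maintaining an edit-distance table, pruning branches that can no longer match. It is a minimal assumption-laden sketch: a full implementation along the lines of the cited work would also index phonetic encodings (e.g., double metaphone) rather than raw characters.

```python
# Minimal sketch of a trie with edit-distance-tolerant lookup. Keys are raw
# characters here; phonetic encoding of keys (e.g., double metaphone) is an
# additional step not shown.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set at terminal nodes

class FuzzyTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def search(self, query, max_dist=1):
        """Return (word, distance) pairs within max_dist edits of query."""
        # First row of the dynamic-programming edit-distance table.
        row = list(range(len(query) + 1))
        results = []
        for ch, child in self.root.children.items():
            self._walk(child, ch, query, row, max_dist, results)
        return results

    def _walk(self, node, ch, query, prev_row, max_dist, results):
        # Compute the next edit-distance row for this trie edge.
        row = [prev_row[0] + 1]
        for c in range(1, len(query) + 1):
            insert_cost = row[c - 1] + 1
            delete_cost = prev_row[c] + 1
            replace_cost = prev_row[c - 1] + (query[c - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:  # prune subtrees that cannot match
            for nxt, child in node.children.items():
                self._walk(child, nxt, query, row, max_dist, results)
```

Searching for "smith" in a trie containing "smith", "smyth", and "jones" with max_dist=1 returns the first two entries with distances 0 and 1, while the "jones" branch is pruned early.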
Querying relational databases using natural language (NL2SQL)
Recent advances in natural language processing (NLP) and understanding (NLU) have renewed the interest in using natural
language to query databases. Such approaches require users neither (a) to learn a complex query language
(e.g., the Structured Query Language, SQL), nor (b) to understand the exact data schema, nor (c) to know how the data is stored.
Despite the many relevant NLP preprocessing techniques, correctly interpreting natural language sentences,
dealing with various forms of ambiguity, and mapping queries to the appropriate context remain persistent challenges
in this field. Recent work compares NLIDB systems against a set of benchmark questions to evaluate their functionality and
expressiveness, distinguishing between five different approaches: (a) keyword-based, (b) pattern-based,
(c) parsing-based, (d) grammar-based, and (e) neural machine translation-based. ASK aims at developing an extensible NL2SQL
framework that integrates at least the following components: a relational database, a data schema, and a hybrid rule-based and machine
learning approach for language interpretation, with the ability to generalize over multiple database schemas.
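To make the rule-based side of such a hybrid concrete, the sketch below shows a pattern-based NL-to-SQL mapping in the spirit of approach (b). The table and column names (`patients`, `city`, `age`) are hypothetical placeholders, not ASK's actual schema; a real system would combine such rules with a learned interpreter and a schema catalogue.

```python
# A minimal pattern-based NL2SQL sketch. The schema patients(name, age, city)
# is an illustrative assumption, not a real ASK component.
import re

PATTERNS = [
    # "how many patients live in <city>"
    (re.compile(r"how many patients live in (\w+)", re.I),
     lambda m: f"SELECT COUNT(*) FROM patients WHERE city = '{m.group(1)}';"),
    # "list patients older than <n>"
    (re.compile(r"list patients older than (\d+)", re.I),
     lambda m: f"SELECT name FROM patients WHERE age > {m.group(1)};"),
]

def nl2sql(question):
    """Return the first SQL translation whose pattern matches, else None."""
    for pattern, build in PATTERNS:
        m = pattern.search(question)
        if m:
            return build(m)
    return None
```

The obvious limitation, and the motivation for the hybrid design, is that every unseen phrasing needs either a new rule or a learned fallback.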
Synthetic Data Generator
Using synthetic data has proved to be a relevant approach, especially in domains dealing with sensitive data,
where sharing datasets to produce evaluation tasks is not allowed, e.g., the clinical domain. The use of synthetic data
alleviates the burden of anonymization, a procedure that is not necessarily fully accurate.
SYNNER is presented as an extensible “Synthetic Dataset Generator” framework for domain-specific (e.g., clinical) multi-relational data.
The main idea behind this work is to look at a real dataset and produce a synthetic version of it that:
a) mimics the distributions and correlations of the observed data,
b) anonymizes sensitive data when applicable (named entities) and shifts the temporal dimension without losing the corresponding correlations, and
c) can be used to train machine learning classifiers that still sustain accurate results when evaluated on real test data.
SYNNER will be designed to be flexible in simulating distinct scenarios, e.g., when the probability of an event occurring
after a point in time t must increase at a rate of 10% per year.
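The scenario above can be sketched as a simple sampling routine, assuming the 10% increase compounds per year elapsed after t; the function names and the compounding interpretation are illustrative assumptions, not SYNNER's API.

```python
# Sketch of scenario-driven event sampling: the baseline yearly event
# probability grows by 10% (relative, compounded) for each year after an
# anchor time t. Names and rates are illustrative only.
import random

def event_probability(base_prob, years_after_t, growth=0.10):
    """Baseline probability scaled up by `growth` per year after t, capped at 1."""
    return min(1.0, base_prob * (1 + growth) ** max(0, years_after_t))

def sample_event(base_prob, years_after_t, rng):
    """Draw a single synthetic event occurrence under the scenario."""
    return rng.random() < event_probability(base_prob, years_after_t)
```

For instance, a baseline probability of 0.2 becomes 0.242 two years after t under this assumption.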
Explainability Uncovered for Machine Learning Applications
In contrast with the concept of the "black box" in machine learning, where predictions cannot be explained for a specific decision,
in explainable AI the resulting predictions can be understood by humans with the support of correlated features.
However, looking at feature importance over the whole dataset can mislead the explanations and entail a loss of information by
not taking specific cases or clusters into account.
WHY is based on low-dimensional vector representations and clustering techniques to support ML applications with domain-specific and case-centric
explainability, evaluated mainly on patient data from the clinical domain.
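One way to picture cluster-level rather than global explanations is sketched below: given cluster assignments (assumed to come from, e.g., k-means on a low-dimensional embedding), features are ranked by how far each cluster's mean deviates from the global mean. The feature names and data are hypothetical; this is not WHY's actual method, only an illustration of why global feature importance can hide cluster-specific signals.

```python
# Sketch of cluster-level (case-centric) feature explanations. Cluster
# labels are assumed precomputed; features and values are illustrative.
from statistics import mean

def cluster_explanations(rows, labels, feature_names):
    """For each cluster, rank features by deviation of the cluster mean
    from the global mean -- a local alternative to one global ranking."""
    global_means = [mean(col) for col in zip(*rows)]
    out = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        cluster_means = [mean(col) for col in zip(*members)]
        deltas = [(name, cm - gm) for name, cm, gm
                  in zip(feature_names, cluster_means, global_means)]
        # Largest absolute deviation first: the cluster's distinguishing features.
        out[lab] = sorted(deltas, key=lambda d: abs(d[1]), reverse=True)
    return out
```

On a toy dataset where two clusters differ mainly in age, each cluster's top-ranked feature is age, with opposite signs, even though the global mean would show no such signal.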
Converting unstructured text into knowledge graphs
Electronic health records (EHRs) are increasingly used for research, either via a patient-centric approach or in the form
of decision support for clinicians, such as point-of-care alerts. Patient characteristics are extracted from EHR databases.
However, much of the information in EHRs is unstructured, in the form of free text, rather than in a structured form.
Natural language processing (NLP) techniques can extract relevant information from free text but cannot be relied upon to
be completely accurate because of typographical errors and nuances of human language. Algorithms incorporating NLP have
been tested for their ability to identify patients’ medical conditions directly from clinical notes, but only on very raw data formats.
CLINNER aims at using real and synthetic patient data to evaluate the current state of the art (SOTA) and to explore different ways of extracting
and modeling clinical information found in text, converting those natural language representations into a
structured knowledge graph representation.
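A toy version of the text-to-graph step might look like the sketch below, which pattern-matches drug-dose mentions and emits subject-predicate-object triples. The extraction pattern, relation names, and patient identifier are all illustrative placeholders, far simpler than the NLP pipeline CLINNER targets.

```python
# Toy sketch: turn pattern-matched clinical mentions into graph triples.
# The regex and relation vocabulary are illustrative assumptions.
import re

DRUG_DOSE = re.compile(r"(?P<drug>[A-Z][a-z]+)\s+(?P<dose>\d+\s?mg)")

def text_to_triples(note, patient_id):
    """Return (subject, predicate, object) triples from a free-text note."""
    triples = []
    for m in DRUG_DOSE.finditer(note):
        triples.append((patient_id, "PRESCRIBED", m.group("drug")))
        triples.append((m.group("drug"), "HAS_DOSE", m.group("dose")))
    return triples
```

The triples can then be loaded into any graph store; the hard part, which this sketch sidesteps, is robust extraction under the typographical errors and linguistic nuance discussed above.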
Identifying and normalizing calendar expressions in free text
Extracting and modeling temporal information from text is an important element for developing timelines and trajectories.
However, to automatically analyze complex trajectory information enclosed in text (e.g., timing of symptoms and duration of
treatment in clinical domain), it is important to understand the related temporal aspects, anchoring each event on an
absolute point in time. Previous work has analyzed the suitability of earlier temporal annotation schemas for capturing
timeline information, identifying challenges and possible solutions, and has proposed a novel annotation schema that could
be useful for timeline reconstruction: CALendar EXpression (CALEX). CALEX aims to develop and evaluate an annotation
tool that integrates more complex aspects of temporal description in text, such as temporal imprecision,
and to formalize a new taxonomy of the different types of calendar expressions.
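The anchoring idea above can be sketched as follows, assuming a known document creation time (DCT) to resolve relative expressions. Only two illustrative patterns are handled; the taxonomy CALEX proposes would cover far more, including the imprecise expressions this sketch simply rejects.

```python
# Sketch of detecting and anchoring simple calendar expressions relative to
# a document creation time (DCT). Patterns are illustrative assumptions.
import re
from datetime import date, timedelta

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def normalize(expression, dct):
    """Map a calendar expression to an ISO date (or year-month) anchored at dct."""
    expr = expression.lower().strip()
    if expr == "yesterday":
        return (dct - timedelta(days=1)).isoformat()
    m = re.fullmatch(r"(\w+) (\d{4})", expr)  # e.g. "March 2020"
    if m and m.group(1) in MONTHS:
        return f"{int(m.group(2)):04d}-{MONTHS[m.group(1)]:02d}"
    return None  # imprecise or unsupported expression
```

For example, with a DCT of 2021-03-01, "yesterday" normalizes to 2021-02-28 and "March 2020" to the coarser anchor 2020-03, while "a few weeks ago" falls into the imprecise category that motivates the new taxonomy.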
❱❱❱ KRAL & T-VEX
Knowledge Representation for Artificial Learning using Time Series Multirelational Data
KRAL & T-VEX aim to find effective ways to use low-dimensional vector representations of multi-relational data in time-sensitive machine learning applications.
Previous work attempted to predict whether a patient would develop a certain clinical condition during hospitalization.
However, that approach does not allow following the patient's timeline of clinical events and determining the exact
points in time at which the risk of that same condition can be assessed.
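The difference between a single end-of-stay prediction and point-in-time assessment can be sketched as below, where each timestamped event updates a running risk score. The event codes and weights are hypothetical illustrations, not a clinical model; the point is that the output is a trajectory, one score per event time, rather than one final score.

```python
# Sketch of point-in-time risk assessment over a patient's event timeline.
# Event codes and weights are illustrative assumptions only.
RISK_WEIGHTS = {"fever": 0.2, "tachycardia": 0.3, "hypotension": 0.5}

def risk_trajectory(events):
    """events: [(time, code)] sorted by time -> [(time, cumulative_risk)],
    with cumulative risk capped at 1.0 so scores stay interpretable."""
    risk, out = 0.0, []
    for t, code in events:
        risk = min(1.0, risk + RISK_WEIGHTS.get(code, 0.0))
        out.append((t, round(risk, 2)))
    return out
```

A learned model would replace the fixed weights with a time-aware representation of the multi-relational history, but the interface, risk readable at any point on the timeline, is the same.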
Text Segmentation - Identifying sections in free text: an unsupervised approach
The ability to extract information from unstructured data sources (text) is critical to support the design and enrichment
of structured databases.
This project explores the use of existing (and adapted) NLP and information extraction (IE) tools to build unsupervised text segmentation tools
based on free text.
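A classic unsupervised starting point for this task is lexical cohesion in the style of TextTiling: score the token overlap between adjacent sentences and place section boundaries where cohesion drops. The sketch below is a deliberately simplified assumption-laden version (single-sentence blocks, a fixed Jaccard threshold) rather than the project's method.

```python
# TextTiling-style sketch: place boundaries at low lexical-cohesion gaps.
# Block size (one sentence) and threshold are illustrative choices.
import re

def tokens(sentence):
    """Lowercased word tokens as a set, for overlap comparison."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def segment(sentences, threshold=0.1):
    """Return index i for each gap (between sentence i-1 and i) whose
    Jaccard token overlap falls below the threshold."""
    boundaries = []
    for i in range(1, len(sentences)):
        a, b = tokens(sentences[i - 1]), tokens(sentences[i])
        union = a | b
        jaccard = len(a & b) / len(union) if union else 0.0
        if jaccard < threshold:
            boundaries.append(i)
    return boundaries
```

On four sentences where the topic shifts after the second, the only boundary reported is the gap at index 2, since the sentence pairs within each topic share most of their vocabulary.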