Prog.: PhD STI
Adm. – Grad.: 2009 – 2018
Dir.; Codir.: Stéphane Gagnon; Michal Iglewski

Integrating Semantic Web and Unstructured Information Processing Environments

El-Kass, Wassim

Unstructured information refers primarily to text, but also to any information stored without a predefined data structure. Significant advances have been made in Natural Language Processing (NLP), which now provides reliable syntactic and gazetteer annotations from Part-of-Speech (POS) tagging, Noun Phrase (NP) chunking, and Named-Entity Recognition (NER).

However, semantic annotation remains a challenging task, with precision and recall varying greatly across document types and application domains. While simple texts such as email messages in a single domain can be analyzed with consistent results, professional and scientific documents of similar size, such as news articles and abstracts, are considerably more complex, with diverse vocabulary and ambiguous meanings throughout sentences and document sections. Major difficulties remain in accurately relating concepts to one another in annotation graphs, and in combining them for further classification across a hierarchy of classes with semantic relevance and completeness.

In this thesis, we demonstrate how to use semantic web technologies, in particular ontologies and graph databases, to help improve the quality, measured by F-score, of such annotation and classification tasks. We integrate a formal ontology with a standard NLP platform, test it on a public research corpus, and report F-scores superior to those of prior machine learning algorithms.
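For readers less familiar with the metric, the following minimal sketch shows how precision, recall, and the balanced F-score (F1) are commonly computed for a set of annotations. The function name, the `(sentence_id, label)` representation, and the sample data are assumptions made for illustration only; they are not taken from the thesis or its corpus.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for two collections of annotations.

    Each annotation is any hashable item; here we assume hypothetical
    (sentence_id, label) pairs. An annotation counts as correct only if
    it appears in both collections.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sample data, invented for this example.
gold = {(1, "P"), (2, "I"), (3, "O"), (4, "P")}
predicted = {(1, "P"), (2, "I"), (3, "P")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy case two of the three predicted annotations are correct and two of the four gold annotations are found, so F1 is the harmonic mean of 2/3 and 1/2.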

We develop and test an innovative platform, the Adaptive Rules-Driven Architecture for Knowledge Extraction (ARDAKE). Our software integrates the Unstructured Information Management Architecture (UIMA) with a standard graph database that hosts our ontologies. We develop extensions to the UIMA Ruta rules language to invoke and verify class relationships from the ontology. Other extensions compute additional text metrics useful for integrating conditional, statistical, and semantic distances in token-class matching. We also develop a new iterative n-grams algorithm to combine matching rules and optimize F-scores and the area under the Receiver Operating Characteristic (ROC) curve. We propose a new pie-chart style to facilitate visualization of annotation performance evaluation. These components are integrated within a graphical interface that allows domain experts to visually compose rule sets within hierarchies of varying complexity, score and benchmark their relative performance, and improve them by integrating additional ontology sources.
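To make the idea of combining matching rules to optimize an F-score more concrete, here is a minimal sketch of a generic greedy forward selection. It is not the iterative n-grams algorithm developed in the thesis, whose details are not given in this abstract. The rule names, the representation of each rule as the set of sentence IDs it matches, and the sample data are all hypothetical.

```python
def f1_score(gold, predicted):
    """Balanced F-score of a predicted set of item IDs against a gold set."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def greedy_rule_selection(rule_matches, gold):
    """Greedily add the rule that most improves F1 of the combined matches.

    rule_matches maps a rule name to the set of item IDs it matches.
    Rules are combined by set union; selection stops as soon as no
    remaining rule improves the score.
    """
    selected, covered, best = [], set(), 0.0
    remaining = dict(rule_matches)
    while remaining:
        name, score = max(
            ((n, f1_score(gold, covered | m)) for n, m in remaining.items()),
            key=lambda pair: pair[1],
        )
        if score <= best:
            break
        selected.append(name)
        covered |= remaining.pop(name)
        best = score
    return selected, best


# Hypothetical data: IDs of sentences that truly belong to one class,
# and the IDs matched by three invented rules.
gold_ids = {1, 2, 3, 4, 5}
rules = {
    "r_age": {1, 2},
    "r_patients": {3, 4, 9},
    "r_noisy": {5, 6, 7, 8},
}

chosen, score = greedy_rule_selection(rules, gold_ids)
print(chosen, round(score, 3))
```

In this toy run the third rule is left out because its extra false positives lower the combined F1, which illustrates why a smaller rule set can score better than the union of all rules.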

Our platform is tested on a particular use case in the health sciences: the Population, Intervention, Control, and Outcome (PICO) method of medical literature analysis. We show that our platform can efficiently and automatically produce parsimonious rule sets, with higher F-scores on the P and I classes than those reported by prior authors using machine learning algorithms.