Leading Principal Investigator

Prof. Dr. Erhard Rahm

Department of Computer Science, Database Group, Chair of Databases

Leipzig University

rahm@informatik.uni-leipzig.de

Team Leads

Matthias Täschner

DSC ScaDS.AI Leipzig

Leipzig University

matthias.taeschner@uni-leipzig.de

Data Quality and Data Integration

We develop generic approaches for largely automated data cleaning and data integration, particularly for generating and maintaining large knowledge graphs. We also devise active learning techniques to generate large labeled training datasets and develop approaches that leverage large data repositories for Artificial Intelligence and Machine Learning.

Creation of knowledge graphs

We investigate the largely automatic, learning-based generation and maintenance of domain-specific knowledge graphs (KGs), e.g., for precision medicine and business applications. This requires (learning-based) approaches for data preparation, cleaning, and integration. In particular, methods are needed to identify the type of new entities, to match their attributes to the attributes in the KG, to match and fuse new entities with existing entities, and to learn new relationships among entities in the KG. Another focus is the integration of multi-model data such as time series, images, and graph data, which requires adapting the integration, storage, and processing approaches.

Supporting labeled training data generation

Acquiring labels for large-scale training data is a costly task and often requires domain experts. We investigate novel Active Learning (AL) query strategies, which encode AL as a learning problem. To train the AL query strategy, we study reinforcement and imitation learning. On top of the resulting AL strategy, we plan to develop an end-to-end platform to create high-quality data labels for different formats such as text, code, formulas, images, and more.
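To make the notion of a query strategy concrete, the following minimal sketch shows classic uncertainty sampling, a hand-written heuristic that picks the unlabeled items the current model is least sure about. It is a common baseline, not the learned strategy described above; a learned strategy would replace the fixed scoring rule with a model trained via reinforcement or imitation learning. The function name `uncertainty_query`, the example pool, and the probabilities are hypothetical.

```python
def uncertainty_query(probabilities: dict[str, float], batch_size: int = 2) -> list[str]:
    """Pick the unlabeled items whose predicted positive-class
    probability is closest to 0.5 (binary classification assumed)."""
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:batch_size]


# Hypothetical model predictions for an unlabeled pool.
pool = {"doc1": 0.97, "doc2": 0.52, "doc3": 0.10, "doc4": 0.45, "doc5": 0.80}

# Items to hand to a domain expert for labeling in this round.
to_label = uncertainty_query(pool)
print(to_label)
```

In a full active learning loop, the selected items would be labeled, added to the training set, and the model retrained before the next query round.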

Entity resolution

In ScaDS.AI, we pay special attention to scalable entity resolution, i.e., the identification of matching records that describe the same real-world entity (e.g., a customer or product). Entity resolution is one of the most important steps in the data cleaning and integration process. For parallel multi-source entity resolution, we have developed the open-source system FAMER, which incorporates award-winning clustering techniques to group all matching records together and also supports incremental clustering and cluster repair methods.
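The two basic steps, pairwise matching followed by clustering of all matches, can be illustrated with a small sketch. It is not FAMER's implementation or API: it assumes a simple string similarity from Python's standard library, a hypothetical threshold of 0.8, and connected components as the clustering method. The record identifiers and product names are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(records: dict[str, str], threshold: float = 0.8) -> list[set[str]]:
    """Group record ids into clusters of likely duplicates.

    Pairs at or above the threshold are treated as matches; clusters
    are the connected components of the match graph (union-find).
    """
    parent = {rid: rid for rid in records}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Naive all-pairs comparison; scalable systems prune candidate pairs first.
    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for rid in records:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())


# Hypothetical product records from three sources.
records = {
    "shopA:1": "Apple iPhone 13 128GB",
    "shopB:7": "apple iphone 13 128 gb",
    "shopC:3": "Samsung Galaxy S21",
    "shopA:9": "Samsung Galaxy S21 5G",
}
clusters = resolve(records)
print(clusters)
```

Connected components are the simplest clustering choice and can merge distinct entities through chains of weak matches, which is one reason more refined clustering and cluster repair methods are of interest.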

Schema matching

We investigate the semi-automated matching of schemas, in particular schemas from many sources. In this context, we develop learning-based approaches such as LEAPME. AMPLE additionally uses reinforcement learning to find high-quality matches.
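As a point of reference for what a schema matcher produces, the sketch below maps attributes of one schema to another using only the similarity of attribute names and a greedy one-to-one assignment. This is a simple name-based baseline, not LEAPME or AMPLE; the function names, the threshold of 0.6, and the example schemas are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity of two attribute names, ignoring case and separators."""
    def normalize(s: str) -> str:
        return s.lower().replace("_", "").replace("-", "")
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source: list[str], target: list[str],
                  threshold: float = 0.6) -> dict[str, str]:
    """Greedy 1:1 matching: accept the best-scoring pairs first."""
    candidates = sorted(
        ((name_similarity(s, t), s, t) for s in source for t in target),
        reverse=True,
    )
    mapping: dict[str, str] = {}
    used_targets: set[str] = set()
    for score, s, t in candidates:
        if score < threshold:
            break  # remaining pairs score even lower
        if s in mapping or t in used_targets:
            continue  # each attribute is matched at most once
        mapping[s] = t
        used_targets.add(t)
    return mapping


# Hypothetical attribute names from two product sources.
source = ["product_name", "price_eur", "brand"]
target = ["ProductName", "Price", "Manufacturer"]
mapping = match_schemas(source, target)
print(mapping)
```

Here `brand` and `Manufacturer` remain unmatched because their names share almost no characters, although they mean the same thing. Closing that gap is where learning-based matchers, which can also draw on instance values and semantic features, come in.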

Funded by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).

ScaDS.AI Dresden/Leipzig (Center for Scalable Data Analytics and Artificial Intelligence) is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig.