We develop generic approaches for largely automated data cleaning and data integration, particularly for the generation and maintenance of large knowledge graphs. Furthermore, we devise active learning techniques to generate large labeled training datasets and develop approaches to leverage large data repositories for Artificial Intelligence and Machine Learning.
We investigate the largely automatic, learning-based generation and maintenance of domain-specific knowledge graphs (KGs), e.g., for precision medicine and business applications. This requires (learning-based) approaches for data preparation, cleaning, and integration. In particular, methods are needed to identify the type of new entities, to match their attributes to the attributes in the KG, to match and fuse new entities with existing entities, and to learn new relationships among entities in the KG. Another focus is the integration of multi-model data such as time series, images, and graph data, which requires adapting the integration, storage, and processing approaches.
Acquiring labels for large-scale training data is costly and often requires domain experts. We investigate novel Active Learning (AL) query strategies that encode AL itself as a learning problem. To train the AL query strategy, we study reinforcement and imitation learning. On top of the resulting AL strategy, we plan to develop an end-to-end platform to create high-quality data labels for different formats such as text, code, formulas, images, and more.
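To make the role of a query strategy concrete, the following is a minimal sketch of a pool-based AL loop with a pluggable strategy. All names (`QueryStrategy`, `uncertainty_strategy`, `active_learning_loop`) are hypothetical, and the hand-written uncertainty heuristic on one-dimensional data is only a stand-in baseline: a learned query strategy of the kind described above would take its place.

```python
from typing import Callable, List, Sequence, Tuple

Labeled = List[Tuple[float, int]]  # (feature value, class label in {0, 1})

# Hypothetical interface: a query strategy picks the index of the pool item to label next.
QueryStrategy = Callable[[Labeled, Sequence[float]], int]


def centroid_margin(labeled: Labeled, x: float) -> float:
    """Difference between the distances of x to the two class centroids.

    A small margin means the toy nearest-centroid model is uncertain about x.
    Assumes both classes already have at least one labeled example.
    """
    class0 = [v for v, y in labeled if y == 0]
    class1 = [v for v, y in labeled if y == 1]
    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    return abs(abs(x - mean0) - abs(x - mean1))


def uncertainty_strategy(labeled: Labeled, pool: Sequence[float]) -> int:
    """Baseline heuristic: query the pool item with the smallest margin."""
    return min(range(len(pool)), key=lambda i: centroid_margin(labeled, pool[i]))


def active_learning_loop(
    labeled: Labeled,
    pool: Sequence[float],
    oracle: Callable[[float], int],
    strategy: QueryStrategy,
    budget: int,
) -> Labeled:
    """Ask the oracle (e.g. a domain expert) for at most `budget` labels."""
    labeled = list(labeled)
    pool = list(pool)
    for _ in range(min(budget, len(pool))):
        idx = strategy(labeled, pool)
        x = pool.pop(idx)
        labeled.append((x, oracle(x)))
    return labeled
```

Because the strategy is passed in as a function, a policy trained with reinforcement or imitation learning could be swapped in without changing the loop; the heuristic here only illustrates the interface.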
In ScaDS.AI, we pay special attention to scalable entity resolution, i.e., the identification of matching records describing the same real-world entity (e.g., a customer or product), which is one of the most important steps in the data cleaning and integration process. For parallel multi-source entity resolution, the open-source system FAMER has been developed; it incorporates award-winning clustering techniques to group all matches together and also supports incremental clustering and cluster repair methods.
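As an illustration of the general idea of matching records and then grouping all matches, here is a minimal single-machine sketch. It is not FAMER's implementation or one of its clustering techniques: it assumes a simple string similarity from Python's `difflib` and uses connected components as a basic clustering baseline. The function names, the record identifiers, and the threshold of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Set


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(records: Dict[str, str], threshold: float = 0.8) -> List[Set[str]]:
    """Cluster record ids whose descriptions match, using connected components.

    `records` maps a record id (e.g. "source:id") to a textual description.
    Comparing all pairs is quadratic; a scalable system would add blocking
    and parallel execution, which this sketch omits.
    """
    parent = {rid: rid for rid in records}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, Set[str]] = {}
    for rid in records:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())
```

Connected components are the simplest way to group matches, but a single wrong match can merge two unrelated clusters, which is one reason more careful clustering and cluster repair methods matter in the multi-source setting.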
We investigate the semi-automated matching of schemas, in particular across many sources. In this context, learning-based approaches such as LEAPME are developed. In AMPLE, reinforcement learning is additionally used to find high-quality matches.
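For readers unfamiliar with schema matching, the sketch below shows a deliberately simple name-based baseline that produces one-to-one attribute correspondences between two schemas. It does not reproduce LEAPME or AMPLE; the function names, the normalization, and the threshold of 0.6 are assumptions chosen for illustration, and a learning-based matcher would replace the fixed similarity function with a trained model.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def name_similarity(a: str, b: str) -> float:
    """Similarity of two attribute names, ignoring case, underscores, and spaces."""
    def normalize(s: str) -> str:
        return s.lower().replace("_", "").replace(" ", "")

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_attributes(
    source: List[str], target: List[str], threshold: float = 0.6
) -> Dict[str, str]:
    """Greedy one-to-one matching: take the best-scoring pairs first."""
    candidates = sorted(
        ((name_similarity(s, t), s, t) for s in source for t in target),
        reverse=True,
    )
    used_source, used_target = set(), set()
    matches: Dict[str, str] = {}
    for score, s, t in candidates:
        if score < threshold:
            break  # remaining candidates score even lower
        if s not in used_source and t not in used_target:
            matches[s] = t
            used_source.add(s)
            used_target.add(t)
    return matches
```

With many sources, pairwise matching like this would have to be repeated for each pair of schemas or run against a shared target schema; the sketch covers only the two-schema case.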