Research

This page provides an overview of our work on enhancing the reproducibility, explainability, and interoperability of scientific experiments and of machine learning and deep learning models across interdisciplinary domains, including biomedicine, biodiversity, biology, and data science.

Dataset, Jupyter notebooks, Computational reproducibility

Computational reproducibility of Jupyter notebooks from biomedical publications

2021-present

Jupyter notebooks bundle executable code with its documentation and output in one interactive environment, and they are a popular mechanism for documenting and sharing computational workflows. The reproducibility of the computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. We address computational reproducibility at two levels. First, using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks related to publications indexed in PubMed Central. We identified such notebooks by mining the articles' full text, located them on GitHub, and re-ran them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or their publications. Second, this study represents a reproducibility attempt in and of itself, since we applied essentially the same methodology to PubMed Central twice, two years apart. Out of 27271 notebooks from 2660 GitHub repositories associated with 3467 articles, 22578 notebooks were written in Python, including 15817 that had their dependencies declared in standard requirements files and that we attempted to re-run automatically. For 10388 of these, all declared dependencies could be installed successfully, and we re-ran the notebooks to assess reproducibility. Of these, 1203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the remaining notebooks resulted in exceptions. We zoom in on common problems, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
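The core of the automated step is: install a notebook's declared dependencies, then execute it top to bottom and record the outcome. The sketch below illustrates this with nbclient; the actual pipeline (see the repository linked below) also matches Python versions and captures detailed exception data, so treat the function names and parameters here as illustrative.

```python
# Minimal sketch of the re-run step, assuming dependencies are declared in a
# standard requirements file; illustrative only, not the full pipeline.
import subprocess

import nbformat
from nbclient import NotebookClient

def rerun_notebook(notebook_path, requirements_path, timeout=600):
    """Install declared dependencies, then execute the notebook top to bottom."""
    subprocess.run(["pip", "install", "-r", requirements_path], check=True)
    nb = nbformat.read(notebook_path, as_version=4)
    client = NotebookClient(nb, timeout=timeout, kernel_name="python3")
    try:
        client.execute()  # raises on the first failing cell
        return "success", nb
    except Exception as exc:  # e.g. ImportError, FileNotFoundError, timeout
        return f"exception: {type(exc).__name__}", nb
```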

Relevant Publications:

  • Computational reproducibility of Jupyter notebooks from biomedical publications
    Sheeba Samuel, Daniel Mietchen, 2023 (Paper)
  • Computational reproducibility of Jupyter notebooks from biomedical publications
    Sheeba Samuel, Daniel Mietchen, [arXiv preprint], 2022 (Paper)

Code: https://github.com/fusion-jena/computational-reproducibility-pmc

Initial run data: https://doi.org/10.5281/zenodo.6802158

Re-run data: https://doi.org/10.5281/zenodo.8226725

Ontology, Semantic Web, Machine Learning

ReproduceMe Ontology Network (ReproduceMeON)

2021-present

ReproduceMeON is an ontology network for the reproducibility of scientific studies. The network, which includes foundational and core ontologies, brings together different aspects of the provenance of scientific studies from various applications to support their reproducibility. The repository documents the development process of ReproduceMeON and the design methodology for developing core ontologies for the provenance of scientific experiments and machine learning using a semi-automated approach. It also provides a systematic literature review covering provenance in scientific experiments, machine learning, microscopy, and computational and scientific workflows, as well as the state-of-the-art ontologies used for the development of ReproduceMeON. Ontology matching techniques are used to select and develop a core ontology for each sub-domain and to link it to other ontologies in that sub-domain.
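As an illustration of the matching step, the sketch below compares class labels from two hypothetical sub-domain ontologies using simple string similarity, one of the basic techniques in ontology matching; the labels and the threshold are illustrative assumptions, not terms from ReproduceMeON.

```python
# Toy sketch of label-based ontology matching: propose candidate
# correspondences between two ontologies' classes above a similarity threshold.
from difflib import SequenceMatcher

provenance_classes = ["Experiment", "ExecutionEnvironment", "Dataset", "Agent"]
ml_classes = ["MLExperiment", "Environment", "TrainingDataset", "SoftwareAgent"]

def label_similarity(a, b):
    """String similarity between two class labels, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for src in provenance_classes:
    best = max(ml_classes, key=lambda tgt: label_similarity(src, tgt))
    score = label_similarity(src, best)
    if score >= 0.6:  # the threshold is a tunable assumption
        print(f"{src} ~ {best} (similarity {score:.2f})")
```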

Relevant Publications:

  • Towards an Ontology Network for the reproducibility of scientific studies.
    Sheeba Samuel, Alsayed Algergawy, Birgitta König-Ries, 8th International Workshop on Ontologies and Conceptual Modeling, co-located with FOIS, 2021 (Paper, Bibtex)

Code: https://github.com/fusion-jena/ReproduceMeON

Slides: https://doi.org/10.6084/m9.figshare.16610386.v1

Ontology, Semantic Web

The REPRODUCE-ME Data Model and Ontology

2016-2019

The REPRODUCE-ME Data Model is a generic data model for the representation of scientific experiments together with their provenance information. The aim of this model is to capture the general elements of scientific experiments for their understandability and reproducibility. An Experiment is the central element of the REPRODUCE-ME data model. The model consists of eight components: Data, Agent, Activity, Plan, Step, Setting, Instrument, and Material. The REPRODUCE-ME Data Model forms the basis of the REPRODUCE-ME ontology, which extends PROV-O and P-Plan and represents the whole picture of an experiment, describing the path it took from its design to its results. We aim to enable end-to-end reproducibility of scientific experiments by capturing and representing their complete provenance using the REPRODUCE-ME ontology.
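As a minimal sketch of how these components fit together in RDF, the snippet below describes one experiment using rdflib; the repr: IRIs and the hasStep property are assumptions, so check them against the ontology documentation linked below.

```python
# Sketch: describing one experiment with REPRODUCE-ME and PROV-O terms.
# The repr: class/property IRIs are assumptions based on the documentation at
# https://w3id.org/reproduceme/ -- verify against the published ontology.
from rdflib import Graph, Namespace, RDF

REPR = Namespace("https://w3id.org/reproduceme#")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("repr", REPR)
g.bind("prov", PROV)

g.add((EX.exp1, RDF.type, REPR.Experiment))      # Experiment: the central element
g.add((EX.alice, RDF.type, PROV.Agent))          # Agent who carried it out
g.add((EX.exp1, PROV.wasAttributedTo, EX.alice))
g.add((EX.step1, RDF.type, REPR.Step))           # one Step of the experiment's Plan
g.add((EX.exp1, REPR.hasStep, EX.step1))         # linking property: an assumption
g.add((EX.microscope, RDF.type, REPR.Instrument))
g.add((EX.step1, PROV.used, EX.microscope))      # Instrument used in the Step

print(g.serialize(format="turtle"))
```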

Relevant Publications:

  • End-to-End Provenance Representation for the Understandability and Reproducibility of Scientific Experiments using a Semantic Approach
    Sheeba Samuel, Birgitta König-Ries, Journal of Biomedical Semantics, 2022 (Paper, Bibtex)
  • REPRODUCE-ME: Ontology-based Data Access for Reproducibility of Microscopy Experiments
    Sheeba Samuel, Birgitta König-Ries, 14th Extended Semantic Web Conference (ESWC) 2017 Poster Track, 28 May-1 June 2017, Portorož, Slovenia (Paper, Bibtex)

Ontology Documentation: https://w3id.org/reproduceme/

Tool, Jupyter notebooks, Provenance, Semantic Web

ProvBook

2018-2019

ProvBook is an extension of Jupyter Notebook that captures and lets users view the provenance of notebook executions over the course of time. It also allows users to share a notebook along with its provenance in RDF and to convert the RDF back into a notebook. We use the REPRODUCE-ME ontology, which extends PROV-O and P-Plan, to describe the provenance of a notebook. This helps scientists compare their previous results with current ones, check whether their experiments produce the expected results, and query the sequence of executions using SPARQL. The notebook data in RDF can be used in combination with the experiments that used the notebooks, helping to track the complete path of a scientific experiment.
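Once a notebook's provenance is available in RDF, its execution history can be queried with SPARQL. Below is a minimal sketch using rdflib; the file name is a placeholder, and the query sticks to generic PROV-O terms, so adapt it to the structure of an actual ProvBook export.

```python
# Sketch: querying a ProvBook RDF export for executions, ordered by start time.
# The graph structure is an assumption; inspect a real export to confirm the
# class and property names it uses.
from rdflib import Graph

g = Graph()
g.parse("notebook_provenance.ttl", format="turtle")  # placeholder file name

query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?activity ?start ?end
WHERE {
    ?activity a prov:Activity ;
              prov:startedAtTime ?start ;
              prov:endedAtTime ?end .
}
ORDER BY ?start
"""
for activity, start, end in g.query(query):
    print(activity, start, end)
```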

Relevant Publications:

  • ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility
    Sheeba Samuel, Birgitta König-Ries, The 17th International Semantic Web Conference (ISWC) 2018 Demo Track, 8-12 October, 2018, Monterey, California, USA (Paper, Bibtex)

Demo: https://doi.org/10.6084/m9.figshare.6401096.v1

Code: https://github.com/Sheeba-Samuel/ProvBook

Research Data Management Platform, Semantic Web, Computational reproducibility

CAESAR: A Collaborative Environment for Scientific Analysis with Reproducibility

2016-2019

CAESAR is a framework for the end-to-end provenance management of scientific experiments. This collaborative framework allows scientists to capture, manage, query, and visualize the complete path of a scientific experiment, consisting of computational and non-computational steps, in an interoperable way.

Relevant Publications:

  • A collaborative semantic-based provenance management platform for reproducibility.
    Sheeba Samuel, Birgitta König-Ries, PeerJ Computer Science, 2022 (Paper, Bibtex)

Code: https://github.com/CaesarReceptorLight

Survey, Research Practices

Reproducibility Survey

2016-2019

The “Reproducibility Crisis”, in which researchers find it difficult to reproduce published results, currently affects several disciplines. To understand the underlying problem, it is important to first know the different research practices followed in each domain and the factors that hinder reproducibility. We performed an exploratory study by conducting a survey addressed to researchers from a range of disciplines to understand scientific experiments and research practices with respect to reproducibility. The survey findings confirm a reproducibility crisis and a strong need for sharing data, code, methods, steps, and negative as well as positive results. Insufficient metadata, lack of publicly available data, and incomplete information in study methods are considered the main reasons for poor reproducibility. The survey results also address a wide range of research questions on the reproducibility of scientific results.

Relevant Publications:

  • Understanding experiments and research practices for reproducibility: an exploratory study
    Sheeba Samuel, Birgitta König-Ries, PeerJ 9:e11140, 2021 (Paper, Bibtex)

Data Availability: http://doi.org/10.5281/zenodo.3862597

Analysis: https://mybinder.org/v2/gh/fusion-jena/ReproducibilitySurvey/master

Tool, Jupyter notebooks, Provenance, Computational reproducibility

ReproduceMeGit

2020-2022

ReproduceMeGit is a visualization tool for analyzing the reproducibility of Jupyter notebooks. It helps repository users and owners reproduce, analyze, and assess the reproducibility of any GitHub repository containing Jupyter notebooks. The tool reports how many notebooks were successfully reproducible, how many resulted in exceptions, how many produced results that differ from the original notebooks, and so on. Through its integration with ProvBook, each notebook in the repository, along with the provenance information of its execution, can also be exported in RDF.
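At its core, detecting notebooks with 'different results' amounts to comparing the outputs stored in the original .ipynb file with those of a fresh execution. The sketch below illustrates such a cell-by-cell comparison; it is an illustration of the idea rather than ReproduceMeGit's exact algorithm, and the file names are placeholders.

```python
# Sketch: cell-by-cell comparison of stored vs. re-executed notebook outputs.
import nbformat

def text_outputs(cell):
    """Collect the plain-text outputs of a code cell."""
    chunks = []
    for out in cell.get("outputs", []):
        if out.get("output_type") == "stream":
            chunks.append(out.get("text", ""))
        elif "data" in out:  # execute_result / display_data
            chunks.append(out["data"].get("text/plain", ""))
    return "".join(chunks)

original = nbformat.read("original.ipynb", as_version=4)  # placeholder names
rerun = nbformat.read("rerun.ipynb", as_version=4)

for i, (a, b) in enumerate(zip(original.cells, rerun.cells)):
    if a.cell_type == "code" and text_outputs(a) != text_outputs(b):
        print(f"cell {i}: outputs differ")
```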

Relevant Publications:

  • ReproduceMeGit: A Visualization Tool for Analyzing Reproducibility of Jupyter Notebooks.
    Sheeba Samuel, Birgitta König-Ries, Provenance and Annotation of Data and Processes - 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 + IPAW 2021 (Paper, Bibtex)

Demo: https://doi.org/10.6084/m9.figshare.12084393.v1

Code: https://github.com/fusion-jena/ReproduceMeGit

Tool, Machine Learning, Data Science, Jupyter notebooks, Provenance

MLProvLab

2021-2023

MLProvLab is a JupyterLab extension to track, manage, compare, and visualize the provenance of machine learning notebooks. The tool is designed to help data scientists and ML practitioners automatically identify the relationships between data and models in ML scripts. It efficiently and automatically tracks provenance metadata, including the datasets and modules used. It allows users to compare different runs of ML experiments, thereby supporting their decision making. The tool thus helps researchers and data scientists collect more information about their experiments and interact with it.
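The principle behind this automatic tracking can be approximated in plain IPython by registering execution hooks that record what each cell brings into the session. MLProvLab itself is implemented as a JupyterLab extension with richer tracking, so the sketch below only illustrates the idea.

```python
# Illustration of provenance tracking via IPython execution hooks: record
# which new modules appear in the session after each cell runs.
import sys

from IPython import get_ipython

provenance_log = []
_modules_before = set()

def pre_run_cell(info):
    global _modules_before
    _modules_before = set(sys.modules)

def post_run_cell(result):
    new_modules = sorted(set(sys.modules) - _modules_before)
    provenance_log.append({
        "code": result.info.raw_cell,   # the source of the executed cell
        "new_modules": new_modules,     # modules first imported by this cell
    })

ip = get_ipython()  # only available inside an IPython/Jupyter session
ip.events.register("pre_run_cell", pre_run_cell)
ip.events.register("post_run_cell", post_run_cell)
```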

Relevant Publications:

  • MLProvLab: Provenance Management for Data Science Notebooks
    Dominik Kerzel, Birgitta König-Ries, Sheeba Samuel, DE4DS Workshop co-located with BTW 2023, 2023 (Paper, Bibtex)
  • Towards Tracking Provenance from Machine Learning Notebooks.
    Dominik Kerzel, Sheeba Samuel, Birgitta König-Ries, 13th International Conference on Knowledge Discovery and Information Retrieval (KDIR), 2021 (Paper, Bibtex)

Code: https://github.com/fusion-jena/MLProvLab/

AI, Machine Learning, Deep Learning

Reproducibility of AI

2021-present

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes increasingly important that the results of ML experiments are reproducible. Unfortunately, that is often not the case. Rather, ML, like many other disciplines, faces a reproducibility crisis. In this project, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence the reproducibility of ML experiments, and we propose ways to apply FAIR data practices to ML workflows.
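One such factor is nondeterminism in training, which we examined in the [RE] reproduction listed below. As a practical illustration, the sketch pins the usual sources of randomness, assuming a PyTorch setup; even with all of this, results can still vary across library versions and hardware.

```python
# Sketch: pinning common sources of nondeterminism in a PyTorch experiment.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    """Seed Python, NumPy, and PyTorch, and request deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)   # may cost performance
    torch.backends.cudnn.benchmark = False
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some CUDA ops

set_seed(42)
```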

Relevant Publications:

  • Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles.
    Sheeba Samuel, Frank Löffler, Birgitta König-Ries, Provenance and Annotation of Data and Processes - 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 + IPAW 2021 (Paper, Bibtex)
  • [RE] Nondeterminism and Instability in Neural Network Optimization
    Waqas Ahmed, Sheeba Samuel, ML Reproducibility Challenge 2021, ReScience journal, 2022 (Paper, Bibtex)
  • A Unified Framework for Reproducibility in Deep Learning
    Waqas Ahmed, Doctoral Consortium ECAI, 2023 (to appear)
  • How Reproducible are the Results Gained with the Help of Deep Learning Methods in Biodiversity Research? [Abstract]
    Waqas Ahmed, Vamsi Krishna Kommineni, Birgitta König-Ries, Sheeba Samuel, TDWG, 9-13 October 2023, Australia (Paper)

AI, Deep Learning, Interpretability

Explainability of AI

2021-present

Deep learning models have transformed various scientific fields, including medical image analysis, drug design, speech recognition, and material inspection. While these models are widely used, their internal mechanisms remain complex and poorly understood, hindering their validation and improvement. Recent research emphasizes the need to understand model behavior and to address biases within models. Regulations like the General Data Protection Regulation advocate for transparent algorithmic decisions, making interpretability of AI models crucial rather than optional. The project aims to develop interpretability methods that leverage domain knowledge, offering human-understandable explanations extracted directly from neural networks. It integrates knowledge graphs to enhance interpretation and accuracy, focusing on an application in plant disease classification, which is essential for sustainable agriculture in a changing climate.
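As an illustration of the concept-based direction, in the spirit of TCAV-style methods, the sketch below derives a concept activation vector from layer activations and projects a gradient onto it. All arrays are synthetic stand-ins, and the exact method of the VISAPP 2023 paper listed below may differ.

```python
# Sketch of the concept-activation-vector idea (cf. TCAV, Kim et al. 2018):
# a linear probe separates 'concept' from 'random' activations; the normal of
# its decision boundary is the concept activation vector (CAV).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Layer activations for images showing a concept (e.g. leaf spots) vs. random images.
concept_acts = rng.normal(loc=1.0, size=(50, 128))
random_acts = rng.normal(loc=0.0, size=(50, 128))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # concept activation vector

# Conceptual sensitivity: gradient of the class logit w.r.t. the layer
# activations, projected onto the CAV (the gradient here is a stand-in).
grad_of_logit = rng.normal(size=128)
print(f"concept sensitivity: {float(grad_of_logit @ cav):.3f}")
```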

Relevant Publications:

  • Concept explainability for plant diseases classification
    Jihen Amara, Birgitta König-Ries, Sheeba Samuel, VISAPP, 2023 (Paper, Bibtex)

Biodiversity, Ontology, Semantic Web, Machine Learning

The Role of Ontology and Corpus Development in Biodiversity Conservation

2021-2022

Biodiversity is the variety of life on Earth, including its evolutionary, ecological, and cultural processes. It is important to understand where biodiversity is found, how it is changing over time, and the factors that drive these changes. To do this, we need to describe and integrate the conditions and measures of biodiversity. We present a core ontology for biodiversity that establishes a link between foundational and domain-specific ontologies. Furthermore, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE) generated from the metadata and abstracts of biodiversity datasets. These corpora can be used as evaluation benchmarks for the development of new computer-supported tools that require machine learning or deep learning techniques. We also describe the underlying ontology for the classes and relations used to annotate these corpora.
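As an example of how such corpora serve as benchmarks, the sketch below scores NER predictions against gold BIO annotations using the seqeval package; the tag names are illustrative, so check the published corpora for the actual entity classes and format.

```python
# Sketch: evaluating NER predictions against gold BIO tag sequences,
# as one would with a corpus like BiodivNERE. Tags here are illustrative.
from seqeval.metrics import classification_report

gold = [["B-Organism", "O", "O", "B-Habitat", "I-Habitat"]]
pred = [["B-Organism", "O", "O", "B-Habitat", "O"]]

print(classification_report(gold, pred))
```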

Relevant Publications:

  • BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain
    Nora Abdelmageed, Felicitas Löffler, Leila Feddoul, Alsayed Algergawy, Sheeba Samuel, Jitendra Gaikwad, Anahita Kazem, Birgitta König-Ries, Biodiversity Data Journal, 2022 (Paper, Bibtex)
  • A Data-driven Approach for Core Biodiversity Ontology Development.
    Nora Abdelmageed, Alsayed Algergawy, Sheeba Samuel, Birgitta König-Ries, Third International Workshop on Semantics for Biodiversity, co-located with ICBO, 2021 (Paper, Bibtex)
  • BiodivOnto: Towards a Core Ontology for Biodiversity
    Nora Abdelmageed, Alsayed Algergawy, Sheeba Samuel, Birgitta König-Ries, 18th Extended Semantic Web Conference (ESWC) 2021 Poster Track, 6-10 June, 2021 (Paper, Bibtex)

Acknowledgements

This research is supported in part by the Deutsche Forschungsgemeinschaft (DFG) through Project Z2 of the CRC/TRR 166 'High-end light microscopy elucidates membrane receptor function - ReceptorLight', by the Carl Zeiss Foundation through the project 'A Virtual Werkstatt for Digitization in the Sciences (K3)' within the program line 'Breakthroughs: Exploring Intelligent Systems for Digitization - explore the basics, use applications', and by the University of Jena through IMPULSE funding (IP 2020-10).