REPRODUCE-ME: Reproducibility of Scientific Experiments

This repository provides an overview of our work on the reproducibility of scientific experiments.

1. The REPRODUCE-ME Data Model and Ontology

The REPRODUCE-ME Data Model is a generic data model for representing scientific experiments together with their provenance information. Its aim is to capture the general elements of scientific experiments so that they can be understood and reproduced. An Experiment is the central concept of the REPRODUCE-ME data model. The model consists of eight components: Data, Agent, Activity, Plan, Step, Setting, Instrument, and Material. The REPRODUCE-ME Data Model forms the basis for the REPRODUCE-ME ontology. The REPRODUCE-ME ontology, which extends PROV-O and P-Plan, represents the whole picture of an experiment, describing the path it took from its design to its results. We aim to enable end-to-end reproducibility of scientific experiments by capturing and representing the complete provenance of a scientific experiment using the REPRODUCE-ME ontology.
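As a hedged illustration, the sketch below uses Python with rdflib to describe a minimal experiment in these terms. The namespace IRI, the resource names, and the exact use of PROV properties are assumptions for illustration only; the published ontology is authoritative.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

REPR = Namespace("https://w3id.org/reproduceme#")  # assumed namespace IRI
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("repr", REPR)
g.bind("prov", PROV)

# A hypothetical experiment with one agent and one instrument.
exp = REPR["experiment/001"]
g.add((exp, RDF.type, REPR.Experiment))
g.add((exp, RDFS.label, Literal("Example light-microscopy experiment")))

agent = REPR["agent/alice"]
g.add((agent, RDF.type, PROV.Agent))
g.add((exp, PROV.wasAssociatedWith, agent))  # illustrative PROV-O usage

instrument = REPR["instrument/microscope-1"]
g.add((instrument, RDF.type, REPR.Instrument))
g.add((exp, PROV.used, instrument))

print(g.serialize(format="turtle"))
```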

2. ProvBook

We present ProvBook, an extension of Jupyter Notebook, to capture and view the provenance of notebook executions over the course of time. It also allows users to share a notebook along with its provenance in RDF and to convert the RDF back into a notebook. We use the REPRODUCE-ME ontology, which extends PROV-O and P-Plan, to describe the provenance of a notebook. This helps scientists compare their previous results with current ones, check whether the experiments produce the expected results, and query the sequence of executions using SPARQL. The notebook data in RDF can be combined with the data of the experiments that used the notebooks, helping to track the complete path of a scientific experiment.
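As an example of such a query, the sketch below loads a hypothetical ProvBook RDF export with rdflib and lists notebook executions by time. The file name and the exact predicates are assumptions; since the ontology extends PROV-O, prov:startedAtTime and prov:endedAtTime are plausible choices, but an actual export should be checked for the predicates it uses.

```python
from rdflib import Graph

g = Graph()
g.parse("notebook_provenance.ttl", format="turtle")  # hypothetical export file

# List executions (modeled here as PROV activities) with their start and
# end times, most recent first, to compare runs over time.
query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?execution ?start ?end WHERE {
    ?execution a prov:Activity ;
               prov:startedAtTime ?start ;
               prov:endedAtTime   ?end .
}
ORDER BY DESC(?start)
"""
for row in g.query(query):
    print(row.execution, row.start, row.end)
```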

3. CAESAR: A Collaborative Environment for Scientific Analysis with Reproducibility

We present CAESAR, a framework for the end-to-end provenance management of scientific experiments. This collaborative framework allows scientists to capture, manage, query, and visualize the complete path of a scientific experiment, consisting of both computational and non-computational steps, in an interoperable way.

4. Reproducibility Survey

The “Reproducibility Crisis”, in which researchers find it difficult to reproduce published results, currently affects several disciplines. To understand the underlying problem, it is important to first know the research practices followed in each domain and the factors that hinder reproducibility. We performed an exploratory study, conducting a survey of researchers from a range of disciplines, to understand scientific experiments and research practices with respect to reproducibility. The survey findings confirm a reproducibility crisis and a strong need for sharing data, code, methods, steps, and both negative and positive results. Insufficient metadata, lack of publicly available data, and incomplete information in study methods are considered the main reasons for poor reproducibility. The survey results also address a wide range of research questions on the reproducibility of scientific results.

5. ReproduceMeGit

ReproduceMeGit is a visualization tool for analyzing the reproducibility of Jupyter Notebooks. It helps repository users and owners reproduce, analyze, and assess the reproducibility of any GitHub repository containing Jupyter Notebooks. The tool reports how many notebooks were successfully reproduced, how many raised exceptions, and how many produced results that differ from the original notebooks. Through its integration with ProvBook, each notebook in the repository, along with the provenance information of its execution, can also be exported in RDF.
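The sketch below is not ReproduceMeGit's implementation; it only illustrates the core reproducibility check under stated assumptions: re-execute every notebook found in a repository directory with nbformat/nbclient and record success or failure. A full comparison of cell outputs against the originals is omitted.

```python
import pathlib

import nbformat
from nbclient import NotebookClient
from nbclient.exceptions import CellExecutionError

def check_repository(repo_dir: str) -> dict:
    """Re-run every notebook under repo_dir and record the outcome."""
    results = {}
    for path in pathlib.Path(repo_dir).rglob("*.ipynb"):
        nb = nbformat.read(path, as_version=4)
        client = NotebookClient(nb, timeout=600)
        try:
            client.execute()
            results[str(path)] = "reproduced"
        except CellExecutionError as err:
            results[str(path)] = f"exception: {err.ename}"
    return results

if __name__ == "__main__":
    for notebook, status in check_repository(".").items():
        print(notebook, "->", status)
```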

6. MLProvLab

MLProvLab is a JupyterLab extension to track, manage, compare, and visualize the provenance of machine learning notebooks. The tool is designed to help data scientists and ML practitioners automatically identify the relationships between data and models in ML scripts. It efficiently and automatically tracks provenance metadata, including the datasets and modules used. It lets users compare different runs of ML experiments, helping them make informed decisions. The tool thus helps researchers and data scientists collect richer information about their experiments and interact with it.
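As a toy illustration of one kind of provenance metadata such a tool can capture automatically, the sketch below records which top-level modules a run loads and their installed versions. This is a stand-in for illustration, not MLProvLab's actual mechanism.

```python
import importlib.metadata
import sys

def loaded_top_level() -> set:
    """Names of all top-level modules currently loaded."""
    return {name.split(".")[0] for name in sys.modules}

before = loaded_top_level()
import statistics  # stands in for the ML libraries an experiment would import
newly = sorted(loaded_top_level() - before)

print("modules loaded by the run:", newly)
for name in newly:
    try:
        print(name, "version", importlib.metadata.version(name))
    except importlib.metadata.PackageNotFoundError:
        print(name, "(no package metadata: stdlib or local module)")
```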

7. ReproduceMe Ontology Network (ReproduceMeON)

ReproduceMeON is an ontology network for the reproducibility of scientific studies. The ontology network, which includes foundational and core ontologies, brings together different aspects of the provenance of scientific studies from various applications to support their reproducibility. The repository documents the development process of ReproduceMeON and the design methodology for developing core ontologies for the provenance of scientific experiments and machine learning using a semi-automated approach. The repository also provides a systematic literature review covering provenance, scientific experiments, machine learning, computational science, microscopy, and scientific workflows, as well as the state-of-the-art ontologies used for the development of ReproduceMeON. Ontology matching techniques are used to select and develop a core ontology for each sub-domain and to link it to other ontologies in that sub-domain.
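As a minimal illustration of label-based matching, one of the simplest such techniques, the sketch below aligns classes from two ontology files whose rdfs:labels coincide. The file names are hypothetical, and real alignment pipelines add structural and semantic similarity measures on top of this.

```python
from rdflib import Graph, RDF, RDFS
from rdflib.namespace import OWL

def class_labels(path: str) -> dict:
    """Map lower-cased rdfs:label strings to the OWL classes that carry them."""
    g = Graph()
    g.parse(path)  # format inferred from the file extension
    labels = {}
    for cls in g.subjects(RDF.type, OWL.Class):
        for label in g.objects(cls, RDFS.label):
            labels[str(label).strip().lower()] = cls
    return labels

a = class_labels("ontology_a.ttl")  # hypothetical input files
b = class_labels("ontology_b.ttl")

for label in sorted(set(a) & set(b)):
    print(f"candidate match: {a[label]} <-> {b[label]} (label: {label!r})")
```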

Acknowledgements

This research is supported in part by the Deutsche Forschungsgemeinschaft (DFG) through Project Z2 of the CRC/TRR 166 'High-end light microscopy elucidates membrane receptor function - ReceptorLight', by the Carl Zeiss Foundation through the project 'A Virtual Werkstatt for Digitization in the Sciences (K3)' within the program line 'Breakthroughs: Exploring Intelligent Systems for Digitization - explore the basics, use applications', and by the University of Jena through IMPULSE funding IP 2020-10.