Software, Datasets & Ontologies
Below is an overview of the open-source tools, ontologies, knowledge graphs, and datasets produced as part of my research on reproducibility, provenance, and biodiversity informatics. All artifacts are freely available under open licenses.
Software & Tools
Pipeline
LLM-Assisted KG Construction Pipeline
This repository contains the code, data, prompts, and results for a (semi-)automatic pipeline for ontology and knowledge graph construction
with four different Large Language Models (LLMs): Mixtral-8x22B-Instruct-v0.1, GPT-4o, GPT-3.5, and Gemini.
Pipeline
Information Retrieval Pipeline using multiple LLMs and RAG
This repository contains the code, data, prompts, and results for information retrieval from PDFs using multiple LLMs with Retrieval-Augmented Generation (RAG). The results of the individual LLMs were then combined in a hard-voting classifier,
which improves the overall quality of the results. LLMs used: Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B.
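The hard-voting step can be sketched as follows. This is a minimal illustration: the field values and answers are invented examples, not data from the actual pipeline.

```python
from collections import Counter

def hard_vote(answers):
    """Return the answer most models agree on; ties resolve to the
    answer seen first among the most frequent ones."""
    return Counter(answers).most_common(1)[0][0]

# Each element is one LLM's extracted answer for the same field of a PDF.
votes = ["Germany", "Germany", "Berlin", "Germany", "Berlin"]
print(hard_vote(votes))  # -> Germany
```

With five voters, a single hallucinating model is outvoted as long as the majority extracts the correct value.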
Analysis Code
Computational Reproducibility Analysis
Scripts and pipelines for the large-scale study of Jupyter notebook reproducibility from
biomedical publications. Covers mining PubMed Central, locating notebooks on GitHub,
executing them in clean environments, and comparing outputs — across 27,000+ notebooks
from 2,660 repositories.
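The output-comparison step might look roughly like this: a simplified sketch over plain notebook JSON, whereas the real pipeline also handles stream, image, and error outputs.

```python
def diff_outputs(published, rerun):
    """Return indices of code cells whose text outputs differ between a
    published notebook and its re-execution (both as notebook-JSON dicts)."""
    def cell_texts(nb):
        texts = []
        for cell in nb["cells"]:
            if cell.get("cell_type") != "code":
                continue
            texts.append("".join(
                "".join(o.get("text", "")) for o in cell.get("outputs", [])
            ))
        return texts
    return [i for i, (a, b) in enumerate(zip(cell_texts(published),
                                             cell_texts(rerun))) if a != b]

# Two toy notebooks: cell 0 reproduces, cell 1 yields a different number.
nb_published = {"cells": [
    {"cell_type": "code", "outputs": [{"text": ["42\n"]}]},
    {"cell_type": "code", "outputs": [{"text": ["0.51\n"]}]},
]}
nb_rerun = {"cells": [
    {"cell_type": "code", "outputs": [{"text": ["42\n"]}]},
    {"cell_type": "code", "outputs": [{"text": ["0.52\n"]}]},
]}
print(diff_outputs(nb_published, nb_rerun))  # -> [1]
```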
Jupyter Extension
ProvBook
A Jupyter Notebook extension that captures and visualizes the provenance of notebook executions
over time using the REPRODUCE-ME ontology. Stores execution history as RDF, allows side-by-side
comparison of runs, and supports SPARQL querying of experiment histories.
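Querying execution history stored as RDF amounts to matching triple patterns. The toy matcher below illustrates the idea with plain tuples; the predicate names are illustrative placeholders, not the actual REPRODUCE-ME vocabulary.

```python
# A toy triple store standing in for ProvBook's RDF execution history.
triples = [
    ("run1", "executedCell", "cell3"),
    ("run1", "startedAt", "2023-05-01T10:00"),
    ("run2", "executedCell", "cell3"),
    ("run2", "startedAt", "2023-06-11T09:30"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern (None = wildcard),
    analogous to one SPARQL basic graph pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which runs executed cell3?" ~ SELECT ?run WHERE { ?run :executedCell :cell3 }
runs = [s for s, _, _ in match(p="executedCell", o="cell3")]
print(runs)  # -> ['run1', 'run2']
```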
Visualization Tool
ReproduceMeGit
A web-based visualization tool for assessing the reproducibility of Jupyter notebooks stored
in GitHub repositories. Displays counts of reproducible, exception-throwing, and result-differing
notebooks, with RDF provenance export via ProvBook integration.
JupyterLab Extension
MLProvLab
A JupyterLab extension that automatically tracks, manages, compares, and visualizes provenance
of machine learning notebooks. Identifies relationships between data and models in ML scripts,
records datasets and modules used, and enables comparison across experimental runs.
JupyterLab Extension
MLProvCodeGen - Machine Learning Provenance Code Generator
MLProvCodeGen is a JupyterLab extension that automates the generation of machine learning training code while simultaneously capturing fine-grained provenance data for each generated experiment. It enables the seamless reproduction of
experiments from recorded provenance files and generates relational graphs that give a clear visual representation of the entire research workflow.
Platform
CAESAR
Collaborative Environment for Scientific Analysis with Reproducibility — an end-to-end provenance
management framework for scientific experiments. Allows scientists to capture, manage, query, and
visualize the complete path of an experiment, covering both computational and wet-lab steps.
Ontologies & Knowledge Graphs
Knowledge Graph
FAIR Jupyter Knowledge Graph
A knowledge graph encoding metadata about Jupyter notebook reproducibility at a granular level —
notebooks, repositories, cells, outputs, execution environments, and linked publications.
Enables semantic querying, provenance exploration, and FAIR sharing of reproducibility evidence.
Ontology
REPRODUCE-ME Ontology
A generic data model and ontology for representing scientific experiments with full provenance.
Extends PROV-O and P-Plan with eight experiment components (Data, Agent, Activity, Plan,
Step, Setting, Instrument, Material) to enable end-to-end reproducibility tracking.
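The eight components can be pictured as an in-memory record like the one below. The field names mirror the ontology's class names, but this is a sketch for intuition only, not the RDF schema itself.

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    """Illustrative container for the eight REPRODUCE-ME components."""
    data: list = field(default_factory=list)         # input/output datasets
    agents: list = field(default_factory=list)       # people, software agents
    activities: list = field(default_factory=list)   # things that happened
    plan: str = ""                                   # the overall protocol
    steps: list = field(default_factory=list)        # ordered parts of the plan
    settings: list = field(default_factory=list)     # parameters, environment
    instruments: list = field(default_factory=list)  # devices, e.g. microscopes
    materials: list = field(default_factory=list)    # physical samples, reagents

exp = Experiment(plan="cell imaging protocol",
                 agents=["Alice"],
                 instruments=["confocal microscope"])
```

Covering both computational fields (data, settings) and wet-lab fields (instruments, materials) in one record is what lets the model describe an experiment end to end.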
Ontology Network
ReproduceMeON
An ontology network for the reproducibility of scientific studies, linking foundational and
core ontologies covering scientific experiments, machine learning workflows, computational
notebooks, and microscopy. Built using a semi-automated approach with ontology matching.
Domain Ontology
BiodivOnto
A core ontology for biodiversity that links foundational and domain-specific ontologies,
covering taxa, observations, specimens, and ecological relationships. Designed to support
semantic interoperability across biodiversity data sources and NLP pipelines.
Datasets & Corpora
Survey Dataset
Reproducibility Survey Dataset
Dataset from an exploratory survey of researchers across disciplines investigating scientific
experiments and research practices relating to reproducibility. Includes questionnaire responses,
analysis notebooks, and derived statistical summaries. Published on Zenodo.
Large-Scale Dataset
Jupyter Notebooks Reproducibility Dataset
A large-scale dataset of 27,000+ Jupyter notebooks from biomedical publications (PubMed Central),
capturing reproducibility results, execution environments, error types, and linked metadata.
Available as both raw CSV and as a FAIR knowledge graph on Zenodo.
Annotated Corpus
BiodivNERE Corpus
Two gold-standard corpora for Named Entity Recognition and Relation Extraction in the
biodiversity domain, generated from biodiversity dataset metadata and publication abstracts.
Covers entity types including taxon, location, habitat, trait, and ecological process.
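Such NER annotations are typically distributed in IOB format, which a consumer collapses back into entity spans. The sentence and labels below are invented for illustration, not taken from the actual gold standard.

```python
tokens = ["Quercus", "robur", "grows", "in", "temperate", "forests"]
labels = ["B-TAXON", "I-TAXON", "O", "O", "B-HABITAT", "I-HABITAT"]

def spans(tokens, labels):
    """Collapse IOB labels into (entity_text, entity_type) spans."""
    out, cur, typ = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # a new entity begins
            if cur:
                out.append((" ".join(cur), typ))
            cur, typ = [tok], lab[2:]
        elif lab.startswith("I-") and cur:  # entity continues
            cur.append(tok)
        else:                              # outside any entity
            if cur:
                out.append((" ".join(cur), typ))
            cur, typ = [], None
    if cur:
        out.append((" ".join(cur), typ))
    return out

print(spans(tokens, labels))
# -> [('Quercus robur', 'TAXON'), ('temperate forests', 'HABITAT')]
```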