Software, Datasets & Ontologies

Below is an overview of the open-source tools, ontologies, knowledge graphs, and datasets produced as part of my research on reproducibility, provenance, and biodiversity informatics. All artifacts are freely available under open licenses.

Software & Tools
Pipeline
LLM-Assisted KG Construction Pipeline
2024 – 2025  ·  Python, LLM
This repository contains the code, data, prompts, and results for a (semi-)automatic ontology and knowledge graph construction pipeline using four Large Language Models (LLMs): Mixtral-8x22B-Instruct-v0.1, GPT-4o, GPT-3.5, and Gemini.
LLM Ontology engineering Knowledge engineering Knowledge graphs reproducibility
Pipeline
Information Retrieval Pipeline using multiple LLMs and RAG
2025 – 2026  ·  Python, LLM
This repository contains the code, data, prompts, and results for information retrieval from PDFs using multiple LLMs and the Retrieval-Augmented Generation (RAG) approach. The results from the individual LLMs were then combined in a hard voting classifier, which improves the overall quality of the results. LLMs used: Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B.
LLM Information Retrieval Retrieval-Augmented Generation Knowledge engineering reproducibility
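As a rough illustration of hard voting over several LLM answers, the sketch below picks the label returned by the most models. It is not the repository's code: the function name `hard_vote`, the model keys, and the tie-breaking rule (first answer seen wins) are assumptions made for this example.

```python
from collections import Counter


def hard_vote(answers):
    """Return the most frequent answer; ties go to the earliest answer (assumed rule)."""
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(answers)
    top = max(counts.values())
    # Walk the answers in their original order so ties are resolved deterministically.
    for answer in answers:
        if counts[answer] == top:
            return answer


# Hypothetical per-model answers to one yes/no question about a PDF.
predictions = {
    "model_a": "yes",
    "model_b": "no",
    "model_c": "yes",
}
print(hard_vote(list(predictions.values())))
```

Majority voting of this kind only needs each model's final label, so models with different output formats can be combined once their answers are normalized to a shared label set.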
Analysis Code
Computational Reproducibility Analysis
2021 – present  ·  Python
Scripts and pipelines for the large-scale study of Jupyter notebook reproducibility from biomedical publications. The workflow mines PubMed Central, locates the associated notebooks on GitHub, executes them in clean environments, and compares the outputs, covering 27,000+ notebooks from 2,660 repositories.
Python Docker PubMed Central GitHub API
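The output-comparison step can be pictured with a minimal sketch like the one below, which labels each re-execution and tallies the labels. The helper `classify_run`, the three labels, and the sample data are hypothetical and chosen only for illustration; the actual study's comparison logic may differ.

```python
from collections import Counter


def classify_run(original_outputs, rerun_outputs, error=None):
    """Label one notebook re-execution (hypothetical helper, not the study's code)."""
    if error is not None:
        # The notebook raised an exception before finishing.
        return "exception"
    if list(original_outputs) == list(rerun_outputs):
        # Every stored output matches the freshly computed one.
        return "reproducible"
    return "different results"


# Made-up runs: (stored outputs, re-executed outputs, error name or None).
runs = [
    (["42"], ["42"], None),
    (["0.91"], ["0.87"], None),
    (["ok"], [], "ModuleNotFoundError"),
]
summary = Counter(classify_run(o, r, e) for o, r, e in runs)
print(dict(summary))
```

An exact equality check is the simplest possible comparison; outputs with timestamps, memory addresses, or floating-point noise would need normalization before being compared this way.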
Jupyter Extension
ProvBook
2018 – 2019  ·  Python, Jupyter Notebook
A Jupyter Notebook extension that captures and visualizes the provenance of notebook executions over time using the REPRODUCE-ME ontology. Stores execution history as RDF, allows side-by-side comparison of runs, and supports SPARQL querying of experiment histories.
Python RDF PROV-O Jupyter
Visualization Tool
ReproduceMeGit
2020 – 2022  ·  Python, Flask
A web-based visualization tool for assessing the reproducibility of Jupyter notebooks stored in GitHub repositories. Displays counts of reproducible, exception-throwing, and result-differing notebooks, with RDF provenance export via ProvBook integration.
Python Flask GitHub API RDF
JupyterLab Extension
MLProvLab
2021 – 2023  ·  Python, TypeScript, JupyterLab
A JupyterLab extension that automatically tracks, manages, compares, and visualizes provenance of machine learning notebooks. Identifies relationships between data and models in ML scripts, records datasets and modules used, and enables comparison across experimental runs.
Python TypeScript JupyterLab RDF
JupyterLab Extension
MLProvCodeGen - Machine Learning Provenance Code Generator
2022 – 2023  ·  Python, TypeScript, JupyterLab
MLProvCodeGen is a JupyterLab extension that automates the generation of machine learning training code while capturing fine-grained provenance data according to real-world models. It enables experiments to be reproduced from recorded provenance files and generates relational graphs that give a clear visual representation of the entire research workflow.
Python TypeScript JupyterLab RDF
Platform
CAESAR
2016 – 2019  ·  Python
Collaborative Environment for Scientific Analysis with Reproducibility — an end-to-end provenance management framework for scientific experiments. Allows scientists to capture, manage, query, and visualize the complete path of an experiment, covering both computational and wet-lab steps.
Python RDF SPARQL Provenance
NLP Code
BiodivNERE
2021 – 2022  ·  Python
Code and gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE) in the biodiversity domain, generated from biodiversity dataset metadata and publication abstracts. Supports training and evaluation of NLP models for ecological text.
Python NLP NER Biodiversity
Ontologies & Knowledge Graphs
Knowledge Graph
FAIR Jupyter Knowledge Graph
2024  ·  RDF, SPARQL, Wikibase
A knowledge graph encoding metadata about Jupyter notebook reproducibility at a granular level — notebooks, repositories, cells, outputs, execution environments, and linked publications. Enables semantic querying, provenance exploration, and FAIR sharing of reproducibility evidence.
RDF SPARQL Wikibase FAIR Data
Ontology
REPRODUCE-ME Ontology
2016 – 2019  ·  OWL, RDF
A generic data model and ontology for representing scientific experiments with full provenance. Extends PROV-O and P-Plan with eight experiment components (Data, Agent, Activity, Plan, Step, Setting, Instrument, Material) to enable end-to-end reproducibility tracking.
OWL RDF PROV-O P-Plan
Ontology Network
ReproduceMeON
2021 – present  ·  OWL, RDF
An ontology network for the reproducibility of scientific studies, linking foundational and core ontologies covering scientific experiments, machine learning workflows, computational notebooks, and microscopy. Built using a semi-automated approach with ontology matching.
OWL RDF Ontology Matching SPARQL
Domain Ontology
BiodivOnto
2021 – 2022  ·  OWL
A core ontology for biodiversity that links foundational and domain-specific ontologies, covering taxa, observations, specimens, and ecological relationships. Designed to support semantic interoperability across biodiversity data sources and NLP pipelines.
OWL RDF Biodiversity
Datasets & Corpora
Survey Dataset
Reproducibility Survey Dataset
2021  ·  CSV, Jupyter Notebook
Dataset from an exploratory survey of researchers across disciplines investigating scientific experiments and research practices relating to reproducibility. Includes questionnaire responses, analysis notebooks, and derived statistical summaries. Published on Zenodo.
Survey CSV Jupyter Zenodo
Large-Scale Dataset
Jupyter Notebooks Reproducibility Dataset
2024  ·  RDF, CSV, Zenodo
A large-scale dataset of 27,000+ Jupyter notebooks from biomedical publications (PubMed Central), capturing reproducibility results, execution environments, error types, and linked metadata. Available as both raw CSV and as a FAIR knowledge graph on Zenodo.
Jupyter PubMed Central RDF Zenodo
Annotated Corpus
BiodivNERE Corpus
2022  ·  CoNLL, JSON
Two gold-standard corpora for Named Entity Recognition and Relation Extraction in the biodiversity domain, generated from biodiversity dataset metadata and publication abstracts. Covers entity types including taxon, location, habitat, trait, and ecological process.
NER Relation Extraction Biodiversity CoNLL