Software, Datasets & Ontologies
Below is an overview of the open-source tools, ontologies, knowledge graphs, and datasets produced as part of my research on reproducibility, provenance, and biodiversity informatics. All artifacts are freely available under open licenses.
Software & Tools
Pipeline
LLM-Assisted KG Construction Pipeline
This repository contains the code, data, prompts, and results for a (semi-)automatic pipeline for ontology and knowledge graph construction
with four different Large Language Models (LLMs): Mixtral-8x22B-Instruct-v0.1, GPT-4o, GPT-3.5, and Gemini.
Pipeline
Information Retrieval Pipeline using multiple LLMs and RAG
This repository contains the code, data, prompts, and results for information retrieval from PDFs using multiple LLMs with Retrieval-Augmented Generation (RAG). The results of the individual LLMs were then combined in a hard-voting classifier,
which improves the overall quality of the results. LLMs used: Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B.
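The hard-voting step can be sketched as follows. This is a minimal illustration: the field values and answers are invented examples, not data from the actual pipeline.

```python
from collections import Counter

def hard_vote(answers):
    """Return the answer most models agree on; ties resolve to the
    answer seen first among the most frequent ones."""
    return Counter(answers).most_common(1)[0][0]

# Each element is one LLM's extracted answer for the same field of a PDF.
votes = ["Germany", "Germany", "Berlin", "Germany", "Berlin"]
print(hard_vote(votes))  # -> Germany
```

With five voters, a single hallucinating model is outvoted as long as the majority extracts the correct value.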
Analysis Code
Computational Reproducibility Analysis
Scripts and pipelines for the large-scale study of Jupyter notebook reproducibility from
biomedical publications. Covers mining PubMed Central, locating notebooks on GitHub,
executing them in clean environments, and comparing outputs — across 27,000+ notebooks
from 2,660 repositories.
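The output-comparison step might look roughly like this: a simplified sketch over plain notebook JSON, whereas the real pipeline also handles stream, image, and error outputs.

```python
def diff_outputs(published, rerun):
    """Return indices of code cells whose text outputs differ between a
    published notebook and its re-execution (both as notebook-JSON dicts)."""
    def cell_texts(nb):
        texts = []
        for cell in nb["cells"]:
            if cell.get("cell_type") != "code":
                continue
            texts.append("".join(
                "".join(o.get("text", "")) for o in cell.get("outputs", [])
            ))
        return texts
    return [i for i, (a, b) in enumerate(zip(cell_texts(published),
                                             cell_texts(rerun))) if a != b]

# Two toy notebooks: cell 0 reproduces, cell 1 yields a different number.
nb_published = {"cells": [
    {"cell_type": "code", "outputs": [{"text": ["42\n"]}]},
    {"cell_type": "code", "outputs": [{"text": ["0.51\n"]}]},
]}
nb_rerun = {"cells": [
    {"cell_type": "code", "outputs": [{"text": ["42\n"]}]},
    {"cell_type": "code", "outputs": [{"text": ["0.52\n"]}]},
]}
print(diff_outputs(nb_published, nb_rerun))  # -> [1]
```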
Jupyter Extension
ProvBook
A Jupyter Notebook extension that captures and visualizes the provenance of notebook executions
over time using the REPRODUCE-ME ontology. Stores execution history as RDF, allows side-by-side
comparison of runs, and supports SPARQL querying of experiment histories.
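Querying execution history stored as RDF amounts to matching triple patterns. The toy matcher below illustrates the idea with plain tuples; the predicate names are illustrative placeholders, not the actual REPRODUCE-ME vocabulary.

```python
# A toy triple store standing in for ProvBook's RDF execution history.
triples = [
    ("run1", "executedCell", "cell3"),
    ("run1", "startedAt", "2023-05-01T10:00"),
    ("run2", "executedCell", "cell3"),
    ("run2", "startedAt", "2023-06-11T09:30"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern (None = wildcard),
    analogous to one SPARQL basic graph pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which runs executed cell3?" ~ SELECT ?run WHERE { ?run :executedCell :cell3 }
runs = [s for s, _, _ in match(p="executedCell", o="cell3")]
print(runs)  # -> ['run1', 'run2']
```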
Visualization Tool
ReproduceMeGit
A web-based visualization tool for assessing the reproducibility of Jupyter notebooks stored
in GitHub repositories. Displays counts of reproducible, exception-throwing, and result-differing
notebooks, with RDF provenance export via ProvBook integration.
JupyterLab Extension
MLProvLab
A JupyterLab extension that automatically tracks, manages, compares, and visualizes provenance
of machine learning notebooks. Identifies relationships between data and models in ML scripts,
records datasets and modules used, and enables comparison across experimental runs.
JupyterLab Extension
MLProvCodeGen - Machine Learning Provenance Code Generator
MLProvCodeGen is a JupyterLab extension that automates the generation of machine learning training code while simultaneously capturing fine-grained provenance data for each generated experiment. It enables the seamless reproduction of
experiments from recorded provenance files and generates relational graphs that give a clear visual representation of the entire research workflow.
Platform
CAESAR
Collaborative Environment for Scientific Analysis with Reproducibility — an end-to-end provenance
management framework for scientific experiments. Allows scientists to capture, manage, query, and
visualize the complete path of an experiment, covering both computational and wet-lab steps.
Ontologies & Knowledge Graphs
Knowledge Graph
FAIR Jupyter Knowledge Graph
A knowledge graph encoding metadata about Jupyter notebook reproducibility at a granular level —
notebooks, repositories, cells, outputs, execution environments, and linked publications.
Enables semantic querying, provenance exploration, and FAIR sharing of reproducibility evidence.
Ontology
REPRODUCE-ME Ontology
A generic data model and ontology for representing scientific experiments with full provenance.
Extends PROV-O and P-Plan with eight experiment components (Data, Agent, Activity, Plan,
Step, Setting, Instrument, Material) to enable end-to-end reproducibility tracking.
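The eight components can be pictured as an in-memory record like the one below. The field names mirror the ontology's class names, but this is a sketch for intuition only, not the RDF schema itself.

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    """Illustrative container for the eight REPRODUCE-ME components."""
    data: list = field(default_factory=list)         # input/output datasets
    agents: list = field(default_factory=list)       # people, software agents
    activities: list = field(default_factory=list)   # things that happened
    plan: str = ""                                   # the overall protocol
    steps: list = field(default_factory=list)        # ordered parts of the plan
    settings: list = field(default_factory=list)     # parameters, environment
    instruments: list = field(default_factory=list)  # devices, e.g. microscopes
    materials: list = field(default_factory=list)    # physical samples, reagents

exp = Experiment(plan="cell imaging protocol",
                 agents=["Alice"],
                 instruments=["confocal microscope"])
```

Covering both computational fields (data, settings) and wet-lab fields (instruments, materials) in one record is what lets the model describe an experiment end to end.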
Ontology Network
ReproduceMeON
An ontology network for the reproducibility of scientific studies, linking foundational and
core ontologies covering scientific experiments, machine learning workflows, computational
notebooks, and microscopy. Built using a semi-automated approach with ontology matching.
Domain Ontology
BiodivOnto
A core ontology for biodiversity that links foundational and domain-specific ontologies,
covering taxa, observations, specimens, and ecological relationships. Designed to support
semantic interoperability across biodiversity data sources and NLP pipelines.
Datasets & Corpora
Survey Dataset
Reproducibility Survey Dataset
Dataset from an exploratory survey of researchers across disciplines investigating scientific
experiments and research practices relating to reproducibility. Includes questionnaire responses,
analysis notebooks, and derived statistical summaries. Published on Zenodo.
Large-Scale Dataset
Jupyter Notebooks Reproducibility Dataset
A large-scale dataset of 27,000+ Jupyter notebooks from biomedical publications (PubMed Central),
capturing reproducibility results, execution environments, error types, and linked metadata.
Available as both raw CSV and as a FAIR knowledge graph on Zenodo.
Annotated Corpus
BiodivNERE Corpus
Two gold-standard corpora for Named Entity Recognition and Relation Extraction in the
biodiversity domain, generated from biodiversity dataset metadata and publication abstracts.
Covers entity types including taxon, location, habitat, trait, and ecological process.
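Such NER annotations are typically distributed in IOB format, which a consumer collapses back into entity spans. The sentence and labels below are invented for illustration, not taken from the actual gold standard.

```python
tokens = ["Quercus", "robur", "grows", "in", "temperate", "forests"]
labels = ["B-TAXON", "I-TAXON", "O", "O", "B-HABITAT", "I-HABITAT"]

def spans(tokens, labels):
    """Collapse IOB labels into (entity_text, entity_type) spans."""
    out, cur, typ = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # a new entity begins
            if cur:
                out.append((" ".join(cur), typ))
            cur, typ = [tok], lab[2:]
        elif lab.startswith("I-") and cur:  # entity continues
            cur.append(tok)
        else:                              # outside any entity
            if cur:
                out.append((" ".join(cur), typ))
            cur, typ = [], None
    if cur:
        out.append((" ".join(cur), typ))
    return out

print(spans(tokens, labels))
# -> [('Quercus robur', 'TAXON'), ('temperate forests', 'HABITAT')]
```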