ReSHACL Thesis Experiments

This repository contains the code and scripts used for my bachelor thesis experiments on ontology-aware SHACL validation with Re-SHACL and two shape-level target-class rewriting implementations (RDFLib and SPARQL/Virtuoso).

The main entry point for reproducing experiments is run.py.

Repository structure (high level)

run.py Runs the benchmarking pipeline on the provided datasets and writes outputs (timings + validation reports).
reSHACL/ Pipeline variants:
- baseline Re-SHACL (re_shacl.py)
- Re-SHACL with the RDFLib rewriting implementation integrated (re_shacl_no_tc.py)
- Re-SHACL variant that uses a shapes graph rewritten via Virtuoso/SPARQL (re_shacl_virtuoso.py). Because the shapes graph is already rewritten, the baseline target-class handling is turned off completely.
tc_engine/ Implementations of class-aware shapes rewriting:
- engine_rdflib.py (in-process RDFLib traversal + rewrite)
- engine_virtuoso.py / engine_sparql.py (endpoint-side rewrite; engine_sparql.py is experimental/test code)
source/ Note: some large artifacts are not included in the GitHub repo due to size limits.

Inputs used by the runner:
- source/Datasets/ — dataset TTL files (e.g., small.ttl, medium.ttl, large.ttl)
- source/Ontologies/ — ontology file(s) (e.g., dbpedia_ontology.owl)
- source/ShapesGraphs/ — shapes graphs and rewritten shapes artifacts
  - DBpedia_SHACL.ttl (full collection)
  - DBpedia_SHACL_selected30.ttl (selected subset used in experiments)
  - DBpedia_SHACL_selected30_no_shin.ttl (variant for the “remove sh:in” experiment)
  - expanded_shapes.ttl (example output after shapes rewriting)
Outputs/ Generated by run.py:
- RunTimeResults.txt per dataset
- validationReports/ text outputs
- violationGraph/ TTL graphs produced from the validation report
tools/ Helper scripts for dataset generation, shape selection, and preprocessing utilities (ranking, cleaning, conversions, stats, etc.).
target_class_closure.dot / target_class_closure.png Visualization of class relations (subclass/equivalence/identity) for the chosen target classes (derived from chosen shapes + ontology).
Other branch: prev_experiments Contains earlier profiling and exploratory scripts (shape ranking variants, comparisons, debugging experiments).

Requirements

Python 3.x (the project was developed with Python 3.11+)
Typical packages used in the project:
- rdflib
- pyshacl
- requests
- pandas for CSV work
- (optionally) tqdm for the graph generator

Install dependencies (example):

pip install rdflib pyshacl requests tqdm pandas

If you use the Virtuoso/SPARQL pipeline, you also need a running Virtuoso endpoint (see below).

Running the main benchmark (`run.py`)

From the project root:

python run.py

What `run.py` does

For each dataset (e.g., small, medium, large) the runner:

Loads the dataset graph from source/Datasets/<name>.ttl
Loads the ontology and combines it with the dataset graph (same RDFLib graph in memory)
Loads the selected shapes graph from source/ShapesGraphs/
Executes the chosen pipeline variant(s) (can be configured in run_experiment(...))
Validates using pyshacl
Writes results into Outputs/<dataset_name>/...

Outputs

For each dataset folder under Outputs/<dataset_name>/:

RunTimeResults.txt — timing summary per method
validationReports/*.txt — human-readable validation results
violationGraph/*.ttl — the validation report graph serialized as Turtle
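
The timing step and the output layout can be illustrated with a minimal sketch. This is not the code in run.py: the function name `time_variants`, the placeholder variant callables, and the line format written to RunTimeResults.txt are assumptions made for illustration. Only the folder and file names are taken from the description above.

```python
import time
from pathlib import Path
from typing import Callable, Dict


def time_variants(dataset_name: str,
                  variants: Dict[str, Callable[[], object]],
                  out_root: Path = Path("Outputs")) -> Dict[str, float]:
    """Run each pipeline variant once and record its wall-clock time.

    `variants` maps a method label to a zero-argument callable that would
    load the graphs, rewrite the shapes, and validate (hypothetical here).
    """
    out_dir = out_root / dataset_name
    # Create the per-dataset layout described above.
    (out_dir / "validationReports").mkdir(parents=True, exist_ok=True)
    (out_dir / "violationGraph").mkdir(parents=True, exist_ok=True)

    timings: Dict[str, float] = {}
    for label, run_variant in variants.items():
        start = time.perf_counter()
        run_variant()
        timings[label] = time.perf_counter() - start

    # One line per method; the exact format used by run.py may differ.
    lines = [f"{label}: {seconds:.3f} s" for label, seconds in timings.items()]
    (out_dir / "RunTimeResults.txt").write_text("\n".join(lines) + "\n",
                                                encoding="utf-8")
    return timings
```

Calling `time_variants("small", {"baseline": run_baseline})` with a real validation callable (here the hypothetical `run_baseline`) would create Outputs/small/ and write one timing line per method.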

Virtuoso / SPARQL pipeline notes

The SPARQL/Virtuoso variant performs shapes rewriting on an endpoint and exports a rewritten shapes graph (Turtle). Your local run typically needs:

Virtuoso running (often via Docker)
Endpoint URL configured in the engine script (commonly http://localhost:8890/sparql)
Named graphs for:
- ontology
- original shapes
- rewrite graph used during iterative rewriting

Named graphs can be uploaded to Virtuoso using its Linked Data upload feature. Note that the standard configuration of the Docker image does not allow results of more than 10k lines; this limit should be raised to a much higher value because the rewritten shapes graph is large. If anything differs in your setup, update the constants in tc_engine/engine_virtuoso.py (endpoint URL, graph URIs, etc.).
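
As a rough illustration of the export step, the sketch below builds a SPARQL request that would dump one named graph as Turtle. It is not the code in tc_engine/engine_virtuoso.py: the graph URI and the function name `build_export_request` are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed values; the real constants live in tc_engine/engine_virtuoso.py.
ENDPOINT_URL = "http://localhost:8890/sparql"
REWRITE_GRAPH = "http://example.org/graphs/shapes-rewritten"  # hypothetical URI


def build_export_request(endpoint: str, graph_uri: str) -> Request:
    """Build (but do not send) a GET request that dumps one named graph."""
    query = (
        f"CONSTRUCT {{ ?s ?p ?o }} "
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    )
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for Turtle through standard content negotiation.
    return Request(url, headers={"Accept": "text/turtle"})


request = build_export_request(ENDPOINT_URL, REWRITE_GRAPH)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` requires a running endpoint, and a large result is subject to the result-size limit described above.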

Dataset generation (DBpedia subsets)

The script below extracts a DBpedia-derived dataset using a queue-based traversal from seed nodes.

Script: `tools/dbpedia_subset_generator_local.py`

This script is configured to work with a SPARQL endpoint (DBpedia by default). It:

picks seed resources typed with selected target classes, then
expands outgoing triples up to a chosen depth, producing an RDF graph.
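
The queue-based traversal can be sketched on an in-memory graph. This is an assumption-laden illustration, not the script itself: the function name `bounded_traversal` and the `outgoing` dictionary (standing in for the SPARQL lookups) are hypothetical, and the way depth and node limits are enforced in the real script may differ.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def bounded_traversal(outgoing: Dict[str, List[Tuple[str, str]]],
                      seeds: List[str],
                      max_depth: int = 2,
                      max_nodes: int = 20000,
                      fanout_limit: int = 200) -> List[Triple]:
    """Breadth-first expansion from seed nodes over an in-memory graph.

    `outgoing` maps a subject to its (predicate, object) pairs and stands in
    for the endpoint queries a real extractor would issue.
    """
    triples: List[Triple] = []
    visited: Set[str] = set()
    queue = deque((seed, 0) for seed in seeds)

    while queue and len(visited) < max_nodes:
        node, depth = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        if depth >= max_depth:
            continue  # node is recorded but not expanded further
        # Keep at most `fanout_limit` outgoing triples per visited node.
        for predicate, obj in outgoing.get(node, [])[:fanout_limit]:
            triples.append((node, predicate, obj))
            queue.append((obj, depth + 1))
    return triples
```

In this sketch, seeds sit at depth 0 and a node is expanded only while its depth is below `max_depth`, so `max_depth=2` collects triples up to two hops away from a seed.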

Run:

python tools/dbpedia_subset_generator_local.py \
  --endpoint https://dbpedia.org/sparql \
  --dataset_name small \
  --ontology_uri http://dbpedia.org/ontology/ \
  --sameas_uri http://www.w3.org/2002/07/owl#sameAs \
  --max_depth 2 \
  --max_nodes 20000 \
  --fanout_limit 200 \
  --seed_targets http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Place \
  --seed_sample_per_class 50 \
  --out source/Datasets/small.ttl

Important arguments (exactly as in the script):

--endpoint : SPARQL endpoint URL
--dataset_name : label used in logs / output naming
--ontology_uri : prefix used to filter ontology classes (default DBpedia ontology namespace)
--sameas_uri : URI used for identity property (default owl:sameAs)
--max_depth : traversal depth (default 2)
--max_nodes : max visited nodes/resources
--fanout_limit : max outgoing triples per visited resource (default 200)
--seed_targets : list of target classes (IRIs) used to fetch initial typed resources
--seed_sample_per_class : how many typed resources per class to use as starting points
--out : output TTL file
--include_incoming : if set, also adds incoming triples (s p node) during expansion
--resume : if set, resumes from existing output file

Shape ranking and selection (building `DBpedia_SHACL_selected30.ttl`)

Script: `tools/rank_dbpedia_shapes.py`

This script ranks candidate node shapes from a larger shapes collection and writes a CSV with closure statistics and scores.

Run:

python tools/rank_dbpedia_shapes.py \
  --shapes_ttl source/ShapesGraphs/DBpedia_SHACL.ttl \
  --ontology source/Ontologies/dbpedia_ontology.owl \
  --out_csv tools/DBpedia_SHACL_ranking.csv \
  --select_by score_rich \
  --k 30 \
  --skip_class_regexes ".*Thing$" ".*Agent$" \
  --max_prop_shapes 60 \
  --max_sh_in 5 \
  --inst_cap 10000

Key arguments (exact names from the script):

--shapes_ttl : input shapes collection TTL
--ontology : ontology file used for class closure computation
--out_csv : output CSV path
--select_by : ranking key (default score_rich, alternatives include score_layers, inst_proxy, etc.)
--k : number of shapes to select (default 30)
--skip_class_regexes : regex patterns to filter out target classes (can repeat)
--skip_classes_file : file containing classes to exclude (one IRI per line)
--max_sh_in : max size of sh:in lists allowed (default 5)
--max_prop_shapes : max number of property shapes per node shape (default 60)
--inst_cap : cap for instance proxy value (default 10000)

What it outputs:

A CSV where each row is a candidate target class / node shape with:
- closure statistics (subclasses/equivalence/identity, closure size, number of rounds)
- an instance-count proxy
- ranking scores
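
To make the closure statistics more concrete, here is a minimal sketch of a fixpoint computation over subclass, equivalence, and identity links. It is not the ranking script's implementation: the function name `target_class_closure`, the prefixed-name strings, and the assumed semantics (a shape targeting class C also covers subclasses of C and anything linked to a covered class by equivalence or identity) are illustrative assumptions.

```python
from typing import Iterable, Set, Tuple

SUBCLASS_OF = "rdfs:subClassOf"
EQUIVALENT_CLASS = "owl:equivalentClass"
SAME_AS = "owl:sameAs"


def target_class_closure(target: str,
                         triples: Iterable[Tuple[str, str, str]]
                         ) -> Tuple[Set[str], int]:
    """Return the classes reachable from `target` and the rounds needed.

    A round is one full pass over the triples that adds at least one class.
    """
    triples = list(triples)
    closure = {target}
    rounds = 0
    while True:
        added: Set[str] = set()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o in closure:
                added.add(s)  # s is a subclass of a covered class
            elif p in (EQUIVALENT_CLASS, SAME_AS):
                # Symmetric relations: follow them in both directions.
                if s in closure:
                    added.add(o)
                if o in closure:
                    added.add(s)
        added -= closure
        if not added:
            return closure, rounds
        closure |= added
        rounds += 1
```

The closure size is then `len(closure)`, and the round count reflects how many passes were needed before no new class appeared.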

After ranking, the repository uses additional scripts (e.g., tools/select_target_classes.py) to extract the final selected shapes graph. Check the header/usage in that script (or run python tools/select_target_classes.py --help if it uses argparse).

Other helper scripts (tools)

tools/clean_shin.py Removes singleton sh:in constraints (used for the “no sh:in on rdf:type” experiment).
tools/nt_to_ttl.py Converts .nt output to .ttl and can skip malformed lines.
tools/stats.py Computes dataset/shapes statistics used in the thesis tables.
source/combine_ontology.py Utility to combine ontology triples with a dataset graph (if needed outside run.py).

Source code is available on GitHub: https://github.com/zzirkell/reshacl_thesis
