This repository contains the code and scripts used for my bachelor thesis experiments on ontology-aware SHACL validation with Re-SHACL and two shape-level target-class rewriting implementations (RDFLib and SPARQL/Virtuoso).
The main entry point for reproducing experiments is run.py.
Repository structure (high level)
- `run.py`: Runs the benchmarking pipeline on the provided datasets and writes outputs (timings + validation reports).
- `reSHACL/`: Pipeline variants:
  - baseline Re-SHACL (`re_shacl.py`)
  - Re-SHACL with the RDFLib implementation integrated (`re_shacl_no_tc.py`)
  - Re-SHACL variant that uses a shapes graph rewritten via Virtuoso/SPARQL (`re_shacl_virtuoso.py`). In this variant the baseline target-class handling is turned off completely.
- `tc_engine/`: Implementations of class-aware shapes rewriting:
  - `engine_rdflib.py` (in-process RDFLib traversal + rewrite)
  - `engine_virtuoso.py` / `engine_sparql.py` (endpoint-side rewrite; `engine_sparql.py` is experimental/test code)
- `source/`: Inputs used by the runner. Note: some large artifacts are not included in the GitHub repo due to size limits.
  - `source/Datasets/`: dataset TTL files (e.g., `small.ttl`, `medium.ttl`, `large.ttl`)
  - `source/Ontologies/`: ontology file(s) (e.g., `dbpedia_ontology.owl`)
  - `source/ShapesGraphs/`: shapes graphs and rewritten shapes artifacts
    - `DBpedia_SHACL.ttl` (full collection)
    - `DBpedia_SHACL_selected30.ttl` (selected subset used in experiments)
    - `DBpedia_SHACL_selected30_no_shin.ttl` (variant for the “remove sh:in” experiment)
    - `expanded_shapes.ttl` (example output after shapes rewriting)
- `Outputs/`: Generated by `run.py`:
  - `RunTimeResults.txt` per dataset
  - `validationReports/` text outputs
  - `violationGraph/` TTL graphs produced from the validation report
- `tools/`: Helper scripts for dataset generation, shape selection, and preprocessing utilities (ranking, cleaning, conversions, stats, etc.).
- `target_class_closure.dot` / `target_class_closure.png`: Visualization of class relations (subclass/equivalence/identity) for the chosen target classes (derived from chosen shapes + ontology).
- Other branch: `prev_experiments`. Contains earlier profiling and exploratory scripts (shape ranking variants, comparisons, debugging experiments).
Requirements
- Python 3.x (the project was developed with Python 3.11+)
- Typical packages used in the project:
  - `rdflib`
  - `pyshacl`
  - `requests`
  - `pandas` for CSV work
  - (optionally) `tqdm` for the graph generator
Install dependencies (example):
pip install rdflib pyshacl requests tqdm pandas

If you use the Virtuoso/SPARQL pipeline, you also need a running Virtuoso endpoint (see below).
Running the main benchmark (`run.py`)
From the project root:
python run.py

For each dataset (e.g., small, medium, large) the runner:
- Loads the dataset graph from `source/Datasets/<name>.ttl`
- Loads the ontology and combines it with the dataset graph (same RDFLib graph in memory)
- Loads the selected shapes graph from `source/ShapesGraphs/`
- Executes the chosen pipeline variant(s) (can be configured in `run_experiment(...)`)
- Validates using `pyshacl`
- Writes results into `Outputs/<dataset_name>/...`
Each dataset folder `Outputs/<dataset_name>/` contains:
- `RunTimeResults.txt`: timing summary per method
- `validationReports/*.txt`: human-readable validation results
- `violationGraph/*.ttl`: the validation report graph serialized as Turtle
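The per-dataset loop and the output layout described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in `run.py`: the function name `run_benchmark`, the `variants` mapping, and the line format written to `RunTimeResults.txt` are all assumptions, and the pipeline callables are stand-ins for the actual Re-SHACL variants.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable, Dict, List


def run_benchmark(
    datasets: List[str],
    variants: Dict[str, Callable[[str], object]],
    out_root: Path,
) -> Dict[str, Dict[str, float]]:
    """Time every pipeline variant on every dataset; write one summary per dataset."""
    timings: Dict[str, Dict[str, float]] = {}
    for name in datasets:
        out_dir = out_root / name
        out_dir.mkdir(parents=True, exist_ok=True)
        timings[name] = {}
        for label, pipeline in variants.items():
            start = time.perf_counter()
            pipeline(name)  # hypothetical callable: load, rewrite, validate
            timings[name][label] = time.perf_counter() - start
        # Assumed file format: one "<method>: <seconds> s" line per variant.
        lines = [f"{label}: {secs:.4f} s" for label, secs in timings[name].items()]
        (out_dir / "RunTimeResults.txt").write_text(
            "\n".join(lines) + "\n", encoding="utf-8"
        )
    return timings


if __name__ == "__main__":
    # Stand-in variants that do nothing; real ones would call the pipelines.
    demo_variants = {
        "baseline": lambda name: None,
        "rdflib_rewrite": lambda name: None,
    }
    with tempfile.TemporaryDirectory() as tmp:
        result = run_benchmark(["small"], demo_variants, Path(tmp))
        print(sorted(result["small"]))
```

Using `time.perf_counter()` keeps the measurement independent of system clock adjustments, which matters when runs on the larger datasets take a long time.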
Virtuoso / SPARQL pipeline notes
The SPARQL/Virtuoso variant performs shapes rewriting on an endpoint and exports a rewritten shapes graph (Turtle). Your local run typically needs:
- Virtuoso running (often via Docker)
- Endpoint URL configured in the engine script (commonly `http://localhost:8890/sparql`)
- Named graphs for:
  - ontology
  - original shapes
  - rewrite graph used during iterative rewriting
Named graphs can be uploaded to Virtuoso using the Linked Data upload. Also, the default configuration of the Virtuoso Docker image does not return more than 10k result rows per query. Raise this limit to a much higher value (typically the `ResultSetMaxRows` setting in the `[SPARQL]` section of `virtuoso.ini`), because the rewritten shapes graph is large. If anything differs in your setup, update the constants in `tc_engine/engine_virtuoso.py` (endpoint URL, graph URIs, etc.).
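Exporting a rewritten shapes graph from the endpoint could look roughly like the sketch below. It is not the code from `tc_engine/engine_virtuoso.py`: the function names and the graph URI `http://example.org/shapes-rewritten` are made up for illustration, and it uses the standard library's `urllib` instead of `requests` so that it runs without extra packages. It sends a `CONSTRUCT` query as a URL-encoded POST (standard SPARQL protocol) and asks for Turtle.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_export_query(graph_uri: str) -> str:
    """Return a CONSTRUCT query that copies every triple of one named graph."""
    return (
        "CONSTRUCT { ?s ?p ?o } "
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    )


def export_graph(endpoint: str, graph_uri: str, timeout: float = 60.0) -> str:
    """POST the query to the endpoint and return the response body as text."""
    data = urlencode({"query": build_export_query(graph_uri)}).encode("utf-8")
    request = Request(endpoint, data=data, headers={"Accept": "text/turtle"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the query; call export_graph(...) when an endpoint is running.
    # The graph URI below is a placeholder, not one used by the repository.
    print(build_export_query("http://example.org/shapes-rewritten"))
```

If the exported Turtle looks truncated, the endpoint's result row limit mentioned above is the first thing to check.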
Dataset generation (DBpedia subsets)
The script below extracts a DBpedia-derived dataset using a queue-based traversal from seed nodes.
This script is configured to work with a SPARQL endpoint (DBpedia by default). It:
- picks seed resources typed with selected target classes, then
- expands outgoing triples up to a chosen depth, producing an RDF graph.
Run:
python tools/dbpedia_subset_generator_local.py \
--endpoint https://dbpedia.org/sparql \
--dataset_name small \
--ontology_uri http://dbpedia.org/ontology/ \
--sameas_uri http://www.w3.org/2002/07/owl#sameAs \
--max_depth 2 \
--max_nodes 20000 \
--fanout_limit 200 \
--seed_targets http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Place \
--seed_sample_per_class 50 \
--out source/Datasets/small.ttl

Important arguments (exactly as in the script):
- `--endpoint`: SPARQL endpoint URL
- `--dataset_name`: label used in logs / output naming
- `--ontology_uri`: prefix used to filter ontology classes (default DBpedia ontology namespace)
- `--sameas_uri`: URI used for identity property (default `owl:sameAs`)
- `--max_depth`: traversal depth (default `2`)
- `--max_nodes`: max visited nodes/resources
- `--fanout_limit`: max outgoing triples per visited resource (default `200`)
- `--seed_targets`: list of target classes (IRIs) used to fetch initial typed resources
- `--seed_sample_per_class`: how many typed resources per class to use as starting points
- `--out`: output TTL file
- `--include_incoming`: if set, also adds incoming triples `(s p node)` during expansion
- `--resume`: if set, resumes from existing output file
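The queue-based traversal with its depth, node, and fan-out limits can be illustrated with a small in-memory sketch. This is not the script's code: `expand_subset` and the `outgoing` dictionary (a stand-in for the SPARQL endpoint) are hypothetical, and the exact meaning of "depth" here (seeds are depth 0, nodes at `max_depth` are kept but not expanded) is an assumption.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def expand_subset(
    seeds: Iterable[str],
    outgoing: Dict[str, List[Triple]],
    max_depth: int = 2,
    max_nodes: int = 20000,
    fanout_limit: int = 200,
) -> List[Triple]:
    """Breadth-first expansion of outgoing triples, starting from the seeds."""
    queue = deque((seed, 0) for seed in seeds)
    visited: Set[str] = set()
    collected: List[Triple] = []
    while queue and len(visited) < max_nodes:
        node, depth = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        if depth >= max_depth:
            continue  # node counts as visited, but is not expanded further
        # Keep at most fanout_limit outgoing triples per visited resource.
        for s, p, o in outgoing.get(node, [])[:fanout_limit]:
            collected.append((s, p, o))
            if o not in visited:
                queue.append((o, depth + 1))
    return collected


if __name__ == "__main__":
    # Toy graph with made-up node names.
    toy = {
        "a": [("a", "p", "b"), ("a", "p", "c")],
        "b": [("b", "p", "d")],
        "d": [("d", "p", "e")],
    }
    for triple in expand_subset(["a"], toy, max_depth=2):
        print(triple)
```

In this sketch every object is queued, including values that a real run would recognise as literals; an endpoint-backed version would skip those before enqueueing.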
Shape ranking and selection (building `DBpedia_SHACL_selected30.ttl`)
This script ranks candidate node shapes from a larger shapes collection and writes a CSV with closure statistics and scores.
Run:
python tools/rank_dbpedia_shapes.py \
--shapes_ttl source/ShapesGraphs/DBpedia_SHACL.ttl \
--ontology source/Ontologies/dbpedia_ontology.owl \
--out_csv tools/DBpedia_SHACL_ranking.csv \
--select_by score_rich \
--k 30 \
--skip_class_regexes ".*Thing$" ".*Agent$" \
--max_prop_shapes 60 \
--max_sh_in 5 \
--inst_cap 10000

Key arguments (exact names from the script):
- `--shapes_ttl`: input shapes collection TTL
- `--ontology`: ontology file used for class closure computation
- `--out_csv`: output CSV path
- `--select_by`: ranking key (default `score_rich`, alternatives include `score_layers`, `inst_proxy`, etc.)
- `--k`: number of shapes to select (default `30`)
- `--skip_class_regexes`: regex patterns to filter out target classes (can repeat)
- `--skip_classes_file`: file containing classes to exclude (one IRI per line)
- `--max_sh_in`: max size of `sh:in` lists allowed (default `5`)
- `--max_prop_shapes`: max number of property shapes per node shape (default `60`)
- `--inst_cap`: cap for instance proxy value (default `10000`)
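How `--select_by`, `--k`, and `--skip_class_regexes` interact can be pictured with the following sketch. It is only an illustration of "filter by regex, sort by one score column, keep the top k": the function `select_top_k`, the column name `target_class`, and the sample rows are invented here and may not match the script or its CSV.

```python
import csv
import io
import re
from typing import Dict, Iterable, List, Sequence


def select_top_k(
    rows: Iterable[Dict[str, str]],
    select_by: str = "score_rich",
    k: int = 30,
    skip_class_regexes: Sequence[str] = (),
) -> List[Dict[str, str]]:
    """Drop rows whose class matches a skip pattern, then keep the k best rows."""
    patterns = [re.compile(p) for p in skip_class_regexes]
    kept = [
        row
        for row in rows
        # "target_class" is an assumed column name.
        if not any(p.search(row["target_class"]) for p in patterns)
    ]
    kept.sort(key=lambda row: float(row[select_by]), reverse=True)
    return kept[:k]


if __name__ == "__main__":
    # Made-up rows, not results from the thesis.
    sample = io.StringIO(
        "target_class,score_rich\n"
        "http://dbpedia.org/ontology/Person,0.9\n"
        "http://www.w3.org/2002/07/owl#Thing,1.0\n"
        "http://dbpedia.org/ontology/Place,0.7\n"
    )
    rows = list(csv.DictReader(sample))
    for row in select_top_k(rows, k=2, skip_class_regexes=[r".*Thing$"]):
        print(row["target_class"], row["score_rich"])
```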
What it outputs:
- A CSV where each row is a candidate target class / node shape with:
  - closure statistics (subclasses/equivalence/identity, closure size, number of rounds)
  - an instance-count proxy
  - ranking scores
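To make "closure size" and "number of rounds" concrete, here is a small fixed-point sketch. It assumes that the closure of a target class collects its subclasses plus everything linked by equivalence or identity, and that one round is one pass that adds at least one new class. Both the semantics and the name `target_class_closure` are assumptions, not taken from the script.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def target_class_closure(
    target: str,
    subclass_of: Iterable[Edge],
    same_or_equivalent: Iterable[Edge],
) -> Tuple[Set[str], int]:
    """Return (closure, rounds) for one target class.

    subclass_of holds (subclass, superclass) pairs; same_or_equivalent holds
    pairs that are treated as symmetric (equivalence or identity links).
    """
    sub_edges = set(subclass_of)
    sym_edges = set(same_or_equivalent)
    sym_edges |= {(b, a) for a, b in sym_edges}  # make the relation symmetric
    closure = {target}
    rounds = 0
    while True:
        # Subclasses of anything already in the closure.
        added = {sub for sub, sup in sub_edges if sup in closure}
        # Classes equivalent or identical to anything already in the closure.
        added |= {b for a, b in sym_edges if a in closure}
        added -= closure
        if not added:
            return closure, rounds
        closure |= added
        rounds += 1


if __name__ == "__main__":
    # Made-up class names for illustration.
    classes, rounds = target_class_closure(
        "Person",
        subclass_of=[("Artist", "Person"), ("Painter", "Artist")],
        same_or_equivalent=[("Person", "schema:Person")],
    )
    print(sorted(classes), rounds)
```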
After ranking, the repository uses additional scripts (e.g., tools/select_target_classes.py) to extract the final selected shapes graph. Check the header/usage in that script (or run python tools/select_target_classes.py --help if it uses argparse).
Other helper scripts (tools)
- `tools/clean_shin.py`: Removes singleton `sh:in` constraints (used for the “no sh:in on rdf:type” experiment).
- `tools/nt_to_ttl.py`: Converts `.nt` output to `.ttl` and can skip malformed lines.
- `tools/stats.py`: Computes dataset/shapes statistics used in the thesis tables.
- `source/combine_ontology.py`: Utility to combine ontology triples with a dataset graph (if needed outside `run.py`).
Source code is available on GitHub: https://github.com/zzirkell/reshacl_thesis