ReSHACL Thesis Experiments

This repository contains the code and scripts used for my bachelor thesis experiments on ontology-aware SHACL validation with Re-SHACL and two shape-level target-class rewriting implementations (RDFLib and SPARQL/Virtuoso).

The main entry point for reproducing experiments is run.py.

Repository structure (high level)
  • run.py Runs the benchmarking pipeline on the provided datasets and writes outputs (timings + validation reports).

  • reSHACL/ Pipeline variants:

    • baseline Re-SHACL (re_shacl.py)
    • Re-SHACL with RDFLib implementation integrated (re_shacl_no_tc.py)
    • Re-SHACL variant that uses a shapes graph rewritten via Virtuoso/SPARQL (re_shacl_virtuoso.py); in this variant the baseline target-class handling is turned off completely.
  • tc_engine/ Implementations of class-aware shapes rewriting:

    • engine_rdflib.py (in-process RDFLib traversal + rewrite)
    • engine_virtuoso.py / engine_sparql.py (endpoint-side rewrite; engine_sparql.py is experimental/test code)
  • source/ Note: some large artifacts are not included in the GitHub repo due to size limits.

    Inputs used by the runner:

    • source/Datasets/ — dataset TTL files (e.g., small.ttl, medium.ttl, large.ttl)

    • source/Ontologies/ — ontology file(s) (e.g., dbpedia_ontology.owl)

    • source/ShapesGraphs/ — shapes graphs and rewritten shapes artifacts

      • DBpedia_SHACL.ttl (full collection)
      • DBpedia_SHACL_selected30.ttl (selected subset used in experiments)
      • DBpedia_SHACL_selected30_no_shin.ttl (variant for the “remove sh:in” experiment)
      • expanded_shapes.ttl (example output after shapes rewriting)
  • Outputs/ Generated by run.py:

    • RunTimeResults.txt per dataset
    • validationReports/ text outputs
    • violationGraph/ TTL graphs produced from the validation report
  • tools/ Helper scripts for dataset generation, shape selection, and preprocessing utilities (ranking, cleaning, conversions, stats, etc.).

  • target_class_closure.dot / target_class_closure.png Visualization of class relations (subclass/equivalence/identity) for the chosen target classes (derived from chosen shapes + ontology).

  • Other branch: prev_experiments Contains earlier profiling and exploratory scripts (shape ranking variants, comparisons, debugging experiments).

Requirements
  • Python 3.x (the project was developed with Python 3.11+)

  • Typical packages used in the project:

    • rdflib
    • pyshacl
    • requests
    • pandas for CSV work
    • (optionally) tqdm for the graph generator

Install dependencies (example):

pip install rdflib pyshacl requests tqdm pandas

If you use the Virtuoso/SPARQL pipeline, you also need a running Virtuoso endpoint (see below).

Running the main benchmark (`run.py`)

From the project root:

python run.py

What run.py does

For each dataset (e.g., small, medium, large) the runner:

  1. Loads the dataset graph from source/Datasets/<name>.ttl
  2. Loads the ontology and combines it with the dataset graph (same RDFLib graph in memory)
  3. Loads the selected shapes graph from source/ShapesGraphs/
  4. Executes the chosen pipeline variant(s) (can be configured in run_experiment(...))
  5. Validates using pyshacl
  6. Writes results into Outputs/<dataset_name>/...
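
Steps 4 to 6 amount to timing each pipeline variant and writing one summary per dataset. The following is a minimal sketch of that pattern, not the code in run.py: it assumes each variant can be wrapped in a zero-argument callable, and the helper names `time_variants` and `write_runtime_results` as well as the line format of RunTimeResults.txt are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Dict


def time_variants(variants: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each variant once and return its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for name, run in variants.items():
        start = time.perf_counter()
        run()  # in a real run this would rewrite the shapes and validate
        timings[name] = time.perf_counter() - start
    return timings


def write_runtime_results(out_root: Path, dataset: str, timings: Dict[str, float]) -> Path:
    """Write one 'method: seconds' line per variant (the format is an assumption)."""
    out_dir = out_root / dataset
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "RunTimeResults.txt"
    lines = [f"{name}: {seconds:.3f} s" for name, seconds in timings.items()]
    out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_file
```

Using `time.perf_counter()` rather than `time.time()` keeps the measurement monotonic, which matters when comparing short runs across variants.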

Outputs

For each dataset folder under Outputs/<dataset_name>/:

  • RunTimeResults.txt — timing summary per method
  • validationReports/*.txt — human-readable validation results
  • violationGraph/*.ttl — the validation report graph serialized as Turtle

Virtuoso / SPARQL pipeline notes

The SPARQL/Virtuoso variant performs shapes rewriting on an endpoint and exports a rewritten shapes graph (Turtle). Your local run typically needs:

  • Virtuoso running (often via Docker)

  • Endpoint URL configured in the engine script (commonly http://localhost:8890/sparql)

  • Named graphs for:

    • ontology
    • original shapes
    • rewrite graph used during iterative rewriting

Named graphs can be uploaded to Virtuoso using the Linked Data upload. Note that the standard configuration of the Docker image does not allow a query to return more than 10k rows. Raise this limit to a much higher value (in Virtuoso this is typically the ResultSetMaxRows setting in the [SPARQL] section of virtuoso.ini), because the rewritten shapes graph is large. If anything differs in your setup, update the constants in tc_engine/engine_virtuoso.py (endpoint URL, graph URIs, etc.).
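
To illustrate how the rewritten shapes graph could be exported from a named graph as Turtle, here is a minimal sketch that only builds the HTTP request and does not send it. It is not taken from engine_virtuoso.py: the graph URI and the function name `build_export_request` are hypothetical, and it assumes the endpoint honours the standard `Accept` header for CONSTRUCT results.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8890/sparql"  # endpoint URL mentioned above
REWRITE_GRAPH = "http://example.org/graphs/rewrite"  # hypothetical graph URI


def build_export_request(endpoint: str, graph_uri: str) -> Request:
    """Build (but do not send) a GET request that dumps one named graph."""
    query = (
        "CONSTRUCT { ?s ?p ?o } "
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    )
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for Turtle so the response can be saved directly as a .ttl file.
    return Request(url, headers={"Accept": "text/turtle"})


request = build_export_request(ENDPOINT, REWRITE_GRAPH)
print(request.full_url)
```

Sending the request (for example with `urllib.request.urlopen(request)`) requires a running endpoint, and the response is subject to the endpoint's result row limit.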

Dataset generation (DBpedia subsets)

The script below extracts a DBpedia-derived dataset using a queue-based traversal from seed nodes.

Script: tools/dbpedia_subset_generator_local.py

This script is configured to work with a SPARQL endpoint (DBpedia by default). It:

  • picks seed resources typed with selected target classes, then
  • expands outgoing triples up to a chosen depth, producing an RDF graph.
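
The queue-based traversal can be sketched as a bounded breadth-first expansion. This is an illustration under assumptions, not the script's code: an in-memory dictionary stands in for the SPARQL endpoint, the function name `expand_from_seeds` is hypothetical, and the exact way the script applies the depth, node, and fanout limits may differ.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def expand_from_seeds(
    outgoing: Dict[str, List[Tuple[str, str]]],
    seeds: List[str],
    max_depth: int = 2,
    max_nodes: int = 20000,
    fanout_limit: int = 200,
) -> Set[Triple]:
    """Breadth-first expansion over outgoing edges, bounded by depth, nodes, and fanout."""
    triples: Set[Triple] = set()
    visited: Set[str] = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(visited) < max_nodes:
        node, depth = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        if depth >= max_depth:
            continue  # the node is counted as visited but not expanded further
        # Keep at most fanout_limit outgoing triples per visited resource.
        for predicate, obj in outgoing.get(node, [])[:fanout_limit]:
            triples.add((node, predicate, obj))
            queue.append((obj, depth + 1))
    return triples


# Toy graph standing in for the endpoint (made-up identifiers).
graph = {
    "ex:alice": [("ex:knows", "ex:bob"), ("ex:bornIn", "ex:berlin")],
    "ex:bob": [("ex:knows", "ex:carol")],
    "ex:carol": [("ex:knows", "ex:dave")],
}
result = expand_from_seeds(graph, ["ex:alice"], max_depth=2)
```

With `max_depth=2`, the seed and its direct neighbours are expanded, so the triple leaving `ex:carol` is not collected.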

Run:

python tools/dbpedia_subset_generator_local.py \
  --endpoint https://dbpedia.org/sparql \
  --dataset_name small \
  --ontology_uri http://dbpedia.org/ontology/ \
  --sameas_uri http://www.w3.org/2002/07/owl#sameAs \
  --max_depth 2 \
  --max_nodes 20000 \
  --fanout_limit 200 \
  --seed_targets http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Place \
  --seed_sample_per_class 50 \
  --out source/Datasets/small.ttl

Important arguments (exactly as in the script):

  • --endpoint : SPARQL endpoint URL
  • --dataset_name : label used in logs / output naming
  • --ontology_uri : prefix used to filter ontology classes (default DBpedia ontology namespace)
  • --sameas_uri : URI used for identity property (default owl:sameAs)
  • --max_depth : traversal depth (default 2)
  • --max_nodes : max visited nodes/resources
  • --fanout_limit : max outgoing triples per visited resource (default 200)
  • --seed_targets : list of target classes (IRIs) used to fetch initial typed resources
  • --seed_sample_per_class : how many typed resources per class to use as starting points
  • --out : output TTL file
  • --include_incoming : if set, also adds incoming triples (s p node) during expansion
  • --resume : if set, resumes from existing output file

Shape ranking and selection (building `DBpedia_SHACL_selected30.ttl`)

Script: tools/rank_dbpedia_shapes.py

This script ranks candidate node shapes from a larger shapes collection and writes a CSV with closure statistics and scores.

Run:

python tools/rank_dbpedia_shapes.py \
  --shapes_ttl source/ShapesGraphs/DBpedia_SHACL.ttl \
  --ontology source/Ontologies/dbpedia_ontology.owl \
  --out_csv tools/DBpedia_SHACL_ranking.csv \
  --select_by score_rich \
  --k 30 \
  --skip_class_regexes ".*Thing$" ".*Agent$" \
  --max_prop_shapes 60 \
  --max_sh_in 5 \
  --inst_cap 10000

Key arguments (exact names from the script):

  • --shapes_ttl : input shapes collection TTL
  • --ontology : ontology file used for class closure computation
  • --out_csv : output CSV path
  • --select_by : ranking key (default score_rich, alternatives include score_layers, inst_proxy, etc.)
  • --k : number of shapes to select (default 30)
  • --skip_class_regexes : regex patterns to filter out target classes (can repeat)
  • --skip_classes_file : file containing classes to exclude (one IRI per line)
  • --max_sh_in : max size of sh:in lists allowed (default 5)
  • --max_prop_shapes : max number of property shapes per node shape (default 60)
  • --inst_cap : cap for instance proxy value (default 10000)
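
To show how these arguments could interact, here is a minimal sketch of filtering and top-k selection over CSV rows. It is not the logic of rank_dbpedia_shapes.py: the column names `target_class`, `sh_in_size`, and `prop_shapes`, the function name `select_top_k`, the assumption that higher scores rank first, and all sample values are made up for illustration.

```python
import csv
import io
import re
from typing import Dict, List, Sequence

# Made-up sample rows; the real CSV columns may be named differently.
SAMPLE = """target_class,score_rich,sh_in_size,prop_shapes
http://dbpedia.org/ontology/Person,9.5,0,40
http://dbpedia.org/ontology/Agent,12.0,0,10
http://dbpedia.org/ontology/Place,7.0,3,70
http://dbpedia.org/ontology/Film,8.0,2,20
"""


def select_top_k(
    rows: List[Dict[str, str]],
    select_by: str = "score_rich",
    k: int = 30,
    skip_class_regexes: Sequence[str] = (),
    max_sh_in: int = 5,
    max_prop_shapes: int = 60,
) -> List[Dict[str, str]]:
    """Drop excluded or oversized shapes, then keep the k best by the chosen score."""
    patterns = [re.compile(p) for p in skip_class_regexes]
    kept = []
    for row in rows:
        if any(p.search(row["target_class"]) for p in patterns):
            continue  # target class matches an exclusion regex
        if int(row["sh_in_size"]) > max_sh_in:
            continue  # sh:in list too large
        if int(row["prop_shapes"]) > max_prop_shapes:
            continue  # too many property shapes
        kept.append(row)
    kept.sort(key=lambda r: float(r[select_by]), reverse=True)
    return kept[:k]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selected = select_top_k(rows, k=2, skip_class_regexes=[".*Agent$"])
print([r["target_class"] for r in selected])
```

In this toy input, Agent is removed by the regex and Place by the property-shape cap, so Person and Film remain.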

What it outputs:

  • A CSV where each row is a candidate target class / node shape with:

    • closure statistics (subclasses/equivalence/identity, closure size, number of rounds)
    • an instance-count proxy
    • ranking scores
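
The closure statistics can be pictured as a fixpoint computation over class relations. The sketch below is one plausible reading, not the script's definition: it assumes the closure of a target class contains its subclasses and equivalent classes (transitively), and that "rounds" counts the iterations that added at least one new class. Identity links could be handled like equivalence and are left out here. The function name `class_closure` and the class names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def class_closure(
    target: str,
    sub_class_of: Iterable[Tuple[str, str]],
    equivalent: Iterable[Tuple[str, str]],
) -> Tuple[Set[str], int]:
    """Return (closure, rounds) for one target class."""
    # Map each class to the classes reachable from it in one step.
    step: Dict[str, Set[str]] = {}
    for child, parent in sub_class_of:
        step.setdefault(parent, set()).add(child)  # parent -> its subclass
    for a, b in equivalent:
        step.setdefault(a, set()).add(b)  # equivalence is symmetric
        step.setdefault(b, set()).add(a)

    closure = {target}
    frontier = {target}
    rounds = 0
    while frontier:
        reached: Set[str] = set()
        for cls in frontier:
            reached |= step.get(cls, set())
        frontier = reached - closure  # only classes not seen before
        if frontier:
            closure |= frontier
            rounds += 1
    return closure, rounds


closure, rounds = class_closure(
    "ex:Person",
    sub_class_of=[("ex:Artist", "ex:Person"), ("ex:Painter", "ex:Artist")],
    equivalent=[("ex:Person", "schema:Person")],
)
```

Here the first round adds `ex:Artist` and `schema:Person`, the second adds `ex:Painter`, and the third finds nothing new, so the closure has four classes after two rounds.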

After ranking, the repository uses additional scripts (e.g., tools/select_target_classes.py) to extract the final selected shapes graph. Check the header/usage in that script (or run python tools/select_target_classes.py --help if it uses argparse).

Other helper scripts (tools)
  • tools/clean_shin.py Removes singleton sh:in constraints (used for the “no sh:in on rdf:type” experiment).

  • tools/nt_to_ttl.py Converts .nt output to .ttl and can skip malformed lines.

  • tools/stats.py Computes dataset/shapes statistics used in the thesis tables.

  • source/combine_ontology.py Utility to combine ontology triples with a dataset graph (if needed outside run.py).

Source code is available on GitHub: https://github.com/zzirkell/reshacl_thesis
