FrenchAdmin processes French public administration data for AI applications in the public sector, with a focus on tax law (BOFiP/CGI). It downloads, processes, and embeds the data, then stores it in PostgreSQL with PgVector for vector search and in FalkorDB for knowledge graph relationships.
Key capabilities:
- LEGI: French legislative texts (Code Général des Impôts, LPF, etc.)
- JADE: Judicial decisions from French courts
- BOFiP: Tax guidance documents (Bulletin Officiel des Finances Publiques)
- Cross-reference inference: Automatic linking between JADE/BOFiP and LEGI articles for RAG and graphRAG
- Install the required apt dependencies:
  sudo apt-get update
  sudo apt-get install -y $(cat config/requirements-apt-container.txt)
- Create and activate a virtual environment:
  python3 -m venv .venv      # Create the virtual environment
  source .venv/bin/activate  # Activate the virtual environment
- Install the required Python dependencies:
  pip install -e .
  Installing in development mode (-e) allows you to use the mediatech command and modify the code without reinstalling.
  Note: Make sure your environment is properly configured before continuing.
- Set up the environment variables in a .env file based on the example in .env.example.
- Export the .env variables:
  export $(grep -v '^#' .env | xargs)
- Start the containers with Docker:
  docker compose up -d
- Verify that the containers are running:
  docker ps
  You should see:
  - pg - PostgreSQL with PgVector (vector search)
  - falkor - FalkorDB (graph database)
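For Python scripts that need the same variables without going through the shell, the .env export step above can be approximated as follows. This is a minimal sketch, not code from this repository: the helper names load_env_file and export_env_file are hypothetical, and the parsing is deliberately simple (one KEY=VALUE per line, comment lines skipped, surrounding quotes removed).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and lines starting with '#'."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes; no variable interpolation is done.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def export_env_file(path: str = ".env") -> dict[str, str]:
    """Copy the parsed values into os.environ and return them."""
    values = load_env_file(path)
    os.environ.update(values)
    return values
```

Unlike the shell one-liner, this version does not split values that contain spaces, which makes it a little more forgiving for quoted entries.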
After installation, the mediatech command is available globally and replaces python main.py.
If you encounter issues with the mediatech command, you can still use python main.py instead.
The main.py file is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.
You can use it as follows:
mediatech <command> [options]
or
python main.py <command> [options]
Command examples:
- View help:
  mediatech --help
- Create PostgreSQL tables:
  mediatech create_tables --model BAAI/bge-m3
- Download all files listed in data_config.json:
  mediatech download_files --all
- Download files from the legi source:
  mediatech download_files --source legi
- Download and process all files listed in data_config.json:
  mediatech download_and_process_files --all --model BAAI/bge-m3
- Process all data:
  mediatech process_files --all --model BAAI/bge-m3
- Split a table into subtables based on different criteria (see main.py):
  mediatech split_table --source legi
- Export PostgreSQL tables to Parquet files:
  mediatech export_tables --output data/parquet
- Upload Parquet datasets to the Hugging Face repository:
  mediatech upload_dataset --input data/parquet/bofip.parquet --dataset-name bofip
Run mediatech --help in your terminal to see all available options, or check the code directly in main.py.
If you prefer to use the Python script directly, you can always use:
python main.py <command> [options]
Examples:
python main.py download_files
python main.py create_tables --model BAAI/bge-m3
python main.py process_files --all --model BAAI/bge-m3
The processing pipeline exposes optimization switches via environment variables:
export ENABLE_BATCH_EMBEDDING=true
export ENABLE_FAST_DB_INSERT=true
export ENABLE_BATCH_GRAPH_UPSERT=true
export ENABLE_PARALLEL_PROCESSING=false
export ENABLE_PERF_TELEMETRY=true
Tuning variables:
export EMBEDDING_BATCH_MAX_SIZE=64
export FAST_DB_INSERT_PAGE_SIZE=1000
export MAX_WORKERS=4
export BATCH_SIZE_DOCS=32
When telemetry is enabled, each run writes a JSON report in data/perf_reports/.
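The switches and tuning variables above are plain environment variables, so they can be read from Python with a few lines. The sketch below shows one way this could be done; it is not the project's actual configuration code. The helper names (env_flag, env_int, load_optimization_settings), the set of strings treated as true, and the use of the README values as fallback defaults are all assumptions.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as "enabled"; anything else is False.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Interpret an environment variable as an integer tuning value."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)


def load_optimization_settings() -> dict[str, object]:
    """Collect the documented switches; fallback values are assumed, not confirmed."""
    return {
        "batch_embedding": env_flag("ENABLE_BATCH_EMBEDDING"),
        "fast_db_insert": env_flag("ENABLE_FAST_DB_INSERT"),
        "batch_graph_upsert": env_flag("ENABLE_BATCH_GRAPH_UPSERT"),
        "parallel_processing": env_flag("ENABLE_PARALLEL_PROCESSING"),
        "perf_telemetry": env_flag("ENABLE_PERF_TELEMETRY"),
        "embedding_batch_max_size": env_int("EMBEDDING_BATCH_MAX_SIZE", 64),
        "fast_db_insert_page_size": env_int("FAST_DB_INSERT_PAGE_SIZE", 1000),
        "max_workers": env_int("MAX_WORKERS", 4),
        "batch_size_docs": env_int("BATCH_SIZE_DOCS", 32),
    }
```

Calling load_optimization_settings() after exporting the variables returns a dictionary that can be logged alongside a run, which helps when comparing telemetry reports produced with different settings.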
You can run the fixed-sample benchmark helper and enforce a regression gate:
python scripts/benchmark_pipeline.py \
--command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
--runs 3 \
--run-prefix process_legi \
--reports-dir data/perf_reports
Optional baseline gate (fails if runtime degrades by more than 10%):
python scripts/benchmark_pipeline.py \
--command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
--runs 3 \
--run-prefix process_legi \
--baseline data/perf_reports/process_legi_baseline.json \
--regression-threshold 0.10

Using the update.sh Script
The update.sh script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.
To run it, execute the following command from the project root:
./scripts/update.sh
This script will:
- Wait for the PostgreSQL database to be available,
- Create or update the necessary tables in the PostgreSQL database,
- Download public files listed in data_config.json,
- Process and vectorize the data,
- Export the tables in Parquet format,
- Upload the Parquet files to Hugging Face.
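The same sequence can be driven from Python by chaining the CLI commands documented earlier. The sketch below is an illustration only and does not reproduce the contents of update.sh: the function names are hypothetical, the database wait is left out, and the upload step is omitted because upload_dataset takes per-dataset arguments.

```python
import subprocess
from typing import Callable, Sequence


def build_pipeline_commands(
    model: str = "BAAI/bge-m3", output_dir: str = "data/parquet"
) -> list[list[str]]:
    """Return the CLI invocations in the order listed above (upload omitted)."""
    base = ["python", "main.py"]
    return [
        base + ["create_tables", "--model", model],
        base + ["download_files", "--all"],
        base + ["process_files", "--all", "--model", model],
        base + ["export_tables", "--output", output_dir],
    ]


def run_pipeline(
    commands: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
) -> int:
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        runner(list(argv), check=True)
    return len(commands)
```

Passing a custom runner makes it possible to dry-run the sequence, for example run_pipeline(build_pipeline_commands(), runner=lambda argv, check: print(" ".join(argv))).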
- main.py: Main entry point with CLI for pipeline commands.
- pyproject.toml: Python project and dependency configuration.
- Dockerfile: Docker image for containerized execution; installs system dependencies and project packages.
- docker-compose.yml: Multi-container setup: PostgreSQL (PgVector) + FalkorDB.
- .github/: GitHub Actions workflows for CI/CD.
- download_and_processing/: Scripts to download and extract files from DILA (LEGI, JADE) and data.economie.gouv.fr (BOFiP).
- database/: Database management (table creation, data insertion, FalkorDB graph operations).
- docs/: Documentation and tutorials.
  - docs/hugging_face_rag_tutorial.ipynb: RAG tutorial: how to load datasets from Hugging Face and use them in a RAG pipeline.
  - docs/reconstruct_vector_database.ipynb: Tutorial: how to reconstruct a dataset from Parquet files without chunking and embedding.
  - docs/fr/: French translations of the documentation.
- utils/: Shared utilities (chunking, embedding, Hugging Face, telemetry).
- config/: Project configuration (data sources, embedding models, optimization flags).
- logs/: Log files from script execution.
- scripts/: Shell scripts for pipeline automation.
  - scripts/update.sh: Run the entire data processing pipeline.
  - scripts/periodic_update.sh: Automate the pipeline via cron.
  - scripts/backup.sh: Back up the PostgreSQL volume and config files.
  - scripts/restore.sh: Restore the PostgreSQL volume and config.
  - scripts/initial_deployment.sh: Set up a new server (Docker, dependencies).
  - scripts/containers_deployment.sh: Build and deploy Docker containers.
  - scripts/delete_old_files.sh: Delete old files from the logs/ and backups/ directories.
  - scripts/manage_checkpoint.sh: Manage checkpoint files for processing.
  - scripts/write_tchap_message.sh: Send notifications to Tchap (French government chat).
- CROSSREFERENCE.md: Technical specification for JADE/BOFiP → LEGI cross-reference inference (RAG/graphRAG).
Files created/modified:

| File | Purpose |
| --- | --- |
| database/cross_reference_manage.py | 3 new tables, catalog refresh, mention/edge CRUD |
| database/__init__.py | Added cross-reference exports |
| database/database_manage.py | Wired create_cross_reference_tables() + init_graph_schema() into create_all_tables |
| database/graph_manage.py | inject_cross_reference_edges() for APPLIES_TO/INTERPRETS |
| crossreference/__init__.py | Package exports |
| crossreference/normalizer.py | Primary + loose article number normalization |
| crossreference/alias_detector.py | CGI/LPF/CIBS family alias detection |
| crossreference/extractor.py | Article token extraction with enumeration support |
| crossreference/resolver.py | Full cascade: A→B→C→D→E |
| crossreference/fuzzy_resolver.py | rapidfuzz scoped fallback |
| crossreference/semantic_resolver.py | Cosine-distance semantic fallback |
| crossreference/confidence.py | Confidence scoring with adjustments |
| crossreference/pipeline.py | Orchestrator: catalog refresh → doc aggregation → extraction → resolution → edges → graph |
| main.py | Added infer_crossreferences CLI command |
| pyproject.toml | Added rapidfuzz, crossreference* to packages |
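The semantic fallback listed in the table relies on cosine distance, which pgvector exposes through the <=> operator and defines as 1 minus the cosine similarity. The pure-Python sketch below shows only that quantity; it is an illustration, not the code in crossreference/semantic_resolver.py, and the function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Design choice of this sketch: reject zero vectors explicitly.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

The result ranges from 0 (same direction) to 2 (opposite direction), so smaller values mean closer embeddings.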
Usage:
python main.py infer_crossreferences --source all
python main.py infer_crossreferences --source jade
python main.py infer_crossreferences --source bofip --debug
Resolution cascade per mention:
A. Exact normalized + temporal + family → 0.99
B. Loose key fallback → 0.92
C. Family-prior deterministic → 0.84
D. Fuzzy scoped (rapidfuzz, ≥96) → 0.74
E. Semantic scoped (cosine <=>) → 0.62
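The cascade above amounts to trying each stage in order and keeping the first hit together with that stage's base confidence. The sketch below illustrates the control flow only. It is not the implementation in crossreference/resolver.py: the names STAGE_CONFIDENCE, Resolver and resolve_mention are hypothetical, the resolvers are supplied by the caller, and the adjustments applied by confidence.py are not modeled.

```python
from typing import Callable, Optional, Sequence

# Stage label -> base confidence, copied from the cascade listed above.
STAGE_CONFIDENCE = {"A": 0.99, "B": 0.92, "C": 0.84, "D": 0.74, "E": 0.62}

# A resolver maps a mention to a target identifier, or None when it finds nothing.
Resolver = Callable[[str], Optional[str]]


def resolve_mention(
    mention: str, stages: Sequence[tuple[str, Resolver]]
) -> Optional[tuple[str, str, float]]:
    """Return (target, stage label, base confidence) for the first stage that matches."""
    for label, resolver in stages:
        target = resolver(mention)
        if target is not None:
            return target, label, STAGE_CONFIDENCE[label]
    return None  # No stage resolved the mention.


# Toy usage with a made-up catalog; the identifier is a placeholder, not a real LEGI id.
catalog = {"1649 quater B": "ARTICLE-PLACEHOLDER-1"}
loose_index = {key.replace(" ", "").lower(): value for key, value in catalog.items()}
stages = [
    ("A", lambda m: catalog.get(m)),
    ("B", lambda m: loose_index.get(m.replace(" ", "").lower())),
]
print(resolve_mention("1649quater b", stages))
```

Ordering the stages from most to least precise means a cheap exact match short-circuits the more expensive fuzzy and semantic lookups.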
This project is licensed under the MIT License.