Glossary

Short definitions of terms used in Data Boar documentation and configuration. Helps new readers and translators. First-column names match config keys, API fields, and code where applicable; the pt-BR file adds Portuguese glosses.

Audience: Definitions target integrators and compliance readers. The companion GLOSSARY.pt_BR.md uses Brazilian Portuguese glosses for the explanatory text; the first column stays in English when that matches config keys and reports—by design, not random code-switching.

Português (Brasil): GLOSSARY.pt_BR.md

How to use this page: Skim the theme table below and jump to the section that matches your hat (e.g. laws vs APIs vs ML). Each row is intentionally one short paragraph with pointers to long-form docs—enough to disambiguate jargon, not a tutorial.

How this glossary is organised (taxonomy)

Terms are grouped by theme (product, scanning workflow, compliance, APIs, etc.). Within each group, rows are sorted alphabetically by the first column. This is easier to browse than a single flat A–Z list when many acronyms mix domains (law vs transport vs ML). The decision to expand the glossary and keep it maintainable is recorded in ADR 0022.

Theme (section) What it covers
Product identity & data landscape Naming, dashboard nickname, data soup metaphor, hidden ingredients (cloaking, legacy stores, passwords, rich metadata, stego, tracker strings / “pixels”, scope breadcrumbs—bounded capabilities)
Session, targets, connectors & outputs One scan run, sources, code modules, Excel/heatmap, sampling behaviour
Findings, tags, overrides & levels Detection results, labels, regex (pattern language), regex override, recommendations, sensitivity, FP/FN
Risk & quasi-identifiers Combined risk and re-identification analysis; minors' data and heightened safeguards (organisational obligation to contextualise)
Laws & data-protection roles Statutes, authorities (e.g. ANPD), controller/processor, legal basis, report controls, data categories (PII, sensitive data, PHI, cardholder data), US healthcare referral (e.g. Stark Law), legal entity vs personal data (scanning), and organisational roles (DPO, CISO)
Engineering: reliability, supply chain & records SRE, toil (manual repetitive ops work vs automation), CI/CD, SBOM, ADRs, SAST, CVE/dependency hygiene, delivery culture, security posture (Zero Trust, least privilege, secure by design, shift-left), boar_fast_filter (Rust/PyO3 performance core)
APIs, authorization & transport REST, OpenAPI, OAuth2, OIDC (OpenID Connect), TLS, CSP / HSTS (headers), RBAC, CDN, WAF, DoS/DDoS, timeouts, scan parallelism, rate limiting, request body limits (customer-friendly load controls)
Detection technology Regex + supervised ML/DL (reproducible confidence on sampled data); contrast with generative LLMs; advanced inventory signals (quasi-identifiers, minors, jurisdiction hints) stay structured, not chat completions; historical context (no hype): AI_EVOLUTION_PRIMER.md
Compliance positioning & artefacts GRC (governance, risk, compliance), metadata-only reporting, samples, cross-border and governance terms used with counsel; jurisdiction hints / tension (heuristic, not applicable law); lawful-basis vocabulary; vendor chain; VBA (value-based agreement—pharma/payer vocabulary; not Office VBA); Corporate-Entity-C/WRB and Google Gemini (third-party maintainer review/chat tooling—process, not a shipped engine dependency)
ISO/IEC & ABNT NBR (management-system standards) ISMS, privacy information management (PIMS), and information-security risk management—short titles + pointers to ISO catalogue
Licensing, open core & subscriptions Intended open-core model, OSS baseline vs commercial modules, signed JWT license claims, subscription entitlements (policy drafts)
Data governance (DMBOK & lifecycle) data, data management, data governance, GGD, Data Steward / Data Owner, data lineage, data quality, DMBOK—DAMA-DMBOK + ISO/IEC 25012 vocabulary
Lab orchestration: Maestro, completão & homelab Maestro (conductor), handler (canonical taxonomy for Handle-*.ps1), persona, completão, bench track, lab-op vocabulary; Capo as musical metaphor flavor only

1. Product identity & data landscape

Hidden ingredients — how this group fits: The data soup story is deliberate: organisations rarely see every ingredient at once. Hidden ingredients are not a single detector—they are the narrative that groups bounded product capabilities (some shipped, some roadmapped or opt-in). The boar roots where you aim it (targets + config); it does not magically clear the whole forest (sampling, ethics, no covert surveillance).

Term Definition
Audit Trail The immutable record of a scan session and its findings — timestamps, connectors used, detection patterns fired, and output artefacts. Functions as compliance evidence that the scan ran, what it covered, and what it found. Stored in SQLite per session and optionally exported. Distinct from host-level OS audit logs. Metaphor: every completed performance by the boar leaves a score you can replay; the trail is the permanent record of what the boar tasted. See session, report; in the Maestro lab context see §12.
Cloaked file A file whose filename extension does not match its true format (e.g. a PDF renamed to .txt). The engine uses magic bytes / content sniffing so the scan follows the real type—metaphor: an ingredient wearing another ingredient’s label on the shelf. See PLAN_CONTENT_TYPE_AND_CLOAKING_DETECTION.md. Distinct from steganography (payload hidden inside a file that already looks like the right type).
dashBOARd Optional sub-brand nickname for the web dashboard (browser UI backed by the API). Use for nav/About-style labels; keep Data Boar as the primary product name in pitch, legal, and README. See README.md, USAGE.md.
Data Boar The product display name. A boar roots through data; the tool discovers and maps personal and sensitive data across many sources for compliance (LGPD, GDPR, CCPA, etc.). The PyPI distribution id is data-boar — see CONTRIBUTING.md.
Data Sniffing Optional POC / compliance-deck label for the engine’s discovery and sampling pass over configured targets: connectors discover structure, read bounded excerpts, and run sensitivity detection (regex / ML / optional DL)—metadata-only findings, no exfiltration. Informal plan language (“sniff harder”, “deep sniffing”) points at the same motor narrative. Use Data Sniffing in slide banks and structured technical docs when you want a stable term; the README executive pitch prefers everyday wording (sniffing with judgment) without requiring this label. See also: data soup, hidden ingredients, COMPLIANCE_FRAMEWORKS.md.
Deep Boring Optional POC / compliance-deck label for the structured report artefact side of the same run: Excel workbooks (findings sheets, Recommendations / controls wording, norm_tag alignment), optional heatmap, and recommendation overrides—“going deep into the boring compliance spreadsheet” the product helps produce as evidence, not legal conclusions. Not a second scan engine; it pairs with Data Sniffing in deck/glossary usage. See also: report, compliance sample, COMPLIANCE_FRAMEWORKS.md.
data soup The mix of data sources you scan—databases, files, APIs, Power BI, Dataverse, shares, etc. Data Boar ingests and digests this soup: it discovers structures, samples values, and reports where PII or sensitive data appears.
hidden ingredients Product metaphor (pairs with data soup): kinds of evidence that are easy to overlook in real estates—cloaked names, password-locked containers, legacy stores, rich-media surface text, optional steganography, embedded tracker strings, and scope breadcrumbs from exports of other tools (scope import from exports — see docs/plans/completed/PLAN_SCOPE_IMPORT_FROM_EXPORTS.md). Metaphor: discovery stays bounded to configured targets and roadmap (deck vocabulary: Data Sniffing); not unlimited “see everything,” not covert ops. Umbrella narrative: additional data soup formats (docs/plans/completed/PLAN_ADDITIONAL_DATA_SOUP_FORMATS.md); optional network hints: opt-in port service hints (docs/plans/PLAN_OPT_IN_NETWORK_PORT_SERVICE_HINTS.md).
legacy or abandoned data Databases, shares, or file trees that outlived clear ownership or documentation. If configured as targets, the engine samples what is reachable—the product does not certify “orphan” status. Metaphor: an old pot at the back of the kitchen (still in the soup if you point the boar at it). Inventory and retention are organisational decisions.
password-protected archive or document Containers (e.g. ZIP, 7z) or documents (e.g. PDF, Office) that require a password to decrypt. Support may include operator-supplied password lists for archives where implemented (see PLAN_COMPRESSED_FILES.md); without the secret, inner content stays invisible to the scan—expect false negatives. Metaphor: a lid you do not have; the boar cannot taste through it. Not a password cracker.
rich-media metadata Text surfaced from the side channels of images, audio, and video rather than from the raw pixel/audio payload: EXIF/XMP, ID3-style tags, video container tags, subtitle sidecars. Often opt-in (scan_rich_media_metadata, optional OCR). Metaphor: the aroma at the rim of the bowl; not claiming invisible payload extraction. Lighter than steganography. See PLAN_ADDITIONAL_DATA_SOUP_FORMATS.md Tier 3.
Safe-Hold The operational state where the product halts a scan and reports clearly rather than continuing with missing or insufficient evidence — e.g. connector authentication fails, target is unreachable, or a required config key is absent. The boar does not silently skip and produce a partial result; it surfaces the gap so the operator can decide. Metaphor: the boar pauses mid-dig when the ground is unsafe — it does not disappear. See also: Audit Trail, false negative. In the Maestro lab context see §12.
steganography Hiding a secret payload inside a cover file (image/audio/video) so the file still looks normal. Roadmap-heavy, optional—high CPU/I/O compared to metadata reads. Do not confuse with marketing / tracking pixels (see tracking references) or with cloaked extensions. See PLAN_ADDITIONAL_DATA_SOUP_FORMATS.md and Steganography (future / optional) in PLAN_CONTENT_TYPE_AND_CLOAKING_DETECTION.md. “Hidden pixels” in security discourse often means payload in pixel data (LSB-style tricks)—that belongs here, not under web analytics.
tracking references (embedded) Heuristic hits for third-party tracker URLs or hostnames in metadata or text (e.g. Tier 3b—telemetry / web beacon / marketing pixel strings in exports). Privacy-adjacent governance signal; not a replacement for PII detection. Do not confuse with image steganography. PLAN_ADDITIONAL_DATA_SOUP_FORMATS.md § Tier 3b.
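
The cloaked file row above describes checking a file's true format against its extension. A minimal Python sketch of that idea follows. The signature table, the extension map, and the function names sniff_format and is_cloaked are illustrative assumptions, not the engine's actual implementation.

```python
from pathlib import Path
from typing import Optional

# Hypothetical signature table: leading bytes -> format label (not exhaustive).
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",  # ZIP container, also used by docx/xlsx/pptx
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}

# Extensions considered consistent with each sniffed format (illustrative).
EXPECTED_EXTENSIONS = {
    "pdf": {".pdf"},
    "zip": {".zip", ".docx", ".xlsx", ".pptx"},
    "png": {".png"},
    "jpeg": {".jpg", ".jpeg"},
}


def sniff_format(header: bytes) -> Optional[str]:
    """Return a format label if the header starts with a known signature."""
    for magic, label in MAGIC_SIGNATURES.items():
        if header.startswith(magic):
            return label
    return None


def is_cloaked(filename: str, header: bytes) -> bool:
    """True when the sniffed format contradicts the filename extension."""
    sniffed = sniff_format(header)
    if sniffed is None:
        return False  # unknown content: make no claim either way
    return Path(filename).suffix.lower() not in EXPECTED_EXTENSIONS[sniffed]
```

In this sketch, a PDF renamed to notes.txt is reported as cloaked, while unknown content is left alone rather than guessed at.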

2. Session, targets, connectors & outputs

Term Definition
connector The code that talks to a target type: discovers schema or listing, samples data, runs sensitivity detection, and saves findings. New connectors are added per ADDING_CONNECTORS.md.
heatmap A visualisation (PNG and/or sheet in the Excel report) of sensitivity/risk: typically rows = tables or files, columns = sensitivity level or categories, colour = risk. Helps DPOs and compliance see “hot” areas at a glance.
report The main output bundle for a session: typically an Excel workbook (findings sheets, recommendations, optional heatmap), plus session metadata in SQLite. Output paths and filenames come from report.* config (e.g. report.output_dir). See USAGE.md.
sampling Use of representative rows or excerpts from a target (not necessarily a full copy of every row or file) to run detection within time and resource limits. Depth is connector- and config-dependent; no match in a sample is not proof that no sensitive data exists elsewhere. See USAGE.md, SENSITIVITY_DETECTION.md.
scan The operation of running the engine over one or more targets within a session (CLI one-shot, POST /scan via API, or dashboard). Synonym in docs: varredura (pt-BR).
session A single scan run. Each CLI or API scan creates a session (UUID + timestamp). Findings and failures are stored against that session; the Excel report and heatmap are generated per session.
SQLite Embedded SQL database engine used by the product to persist session metadata, findings, and related operational data (e.g. audit trail inputs) on disk—paths and retention follow operator config and deployment. Not a target type for scanning customer data in the sense of connector targets; it is the app’s local store. See USAGE.md, TECH_GUIDE.md.
target A configured data source to scan: e.g. a database (SQL/NoSQL), a filesystem path, a REST API, a share (SMB, NFS, SharePoint, WebDAV), Power BI, or Dataverse. Defined under targets in the config.
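
The sampling row above notes that a clean sample is not proof that a target is clean. The Python sketch below illustrates why, under the simplifying assumption that a sample is just the first N rows of a source. The names sample_rows and sample_has_match are hypothetical, and real connectors may sample differently.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def sample_rows(rows: Iterable[dict], limit: int) -> Iterator[dict]:
    """Yield at most `limit` rows without consuming the rest of the source."""
    return islice(rows, limit)


def sample_has_match(
    rows: Iterable[dict], limit: int, predicate: Callable[[dict], bool]
) -> bool:
    """Run a detection predicate over the bounded sample only."""
    return any(predicate(row) for row in sample_rows(rows, limit))


# Six rows; only the last one holds an identifier-like value.
rows = [{"note": "ok"}] * 5 + [{"note": "cpf 123.456.789-09"}]


def looks_sensitive(row: dict) -> bool:
    return "cpf" in row["note"]


shallow = sample_has_match(rows, 3, looks_sensitive)  # sample stops before row 6
deeper = sample_has_match(rows, 10, looks_sensitive)  # sample reaches row 6
```

Here shallow is False and deeper is True for the same data, which is the false-negative risk that sampling depth introduces.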

3. Findings, tags, overrides & levels

Term Definition
false negative / false positive False positive: a match flagged as sensitive when it is not (or not in scope). False negative: sensitive data exists but was not detected (rules, sampling, encoding, format). Tune patterns and use suggested review for ambiguous cases. See SENSITIVITY_DETECTION.md.
finding A single detection result: a location (e.g. table.column or file path), the pattern_detected, sensitivity level, and optional norm_tag. Stored as metadata in SQLite and shown in the Excel report (e.g. “Database findings”, “Filesystem findings”).
norm_tag A label that ties a finding to a regulation or framework (e.g. LGPD Art. 5, GDPR Art. 4(1), CCPA). Set by built-in patterns or in regex overrides and recommendation overrides so reports show the right “Base legal” and “Relevante para” per framework.
pattern_detected The name of the rule that matched: a built-in pattern (e.g. LGPD_CPF, EMAIL, CCPA_SSN) or a custom name from regex_overrides_file. Used in reports and for recommendation overrides (e.g. norm_tag_pattern).
recommendation override A config block (report.recommendation_overrides) that customises the Excel “Recommendations” row for a given norm_tag or pattern_detected: base legal, risk, recommendation text, priority, and “relevant for”. Lets you align report language with UK GDPR, PIPEDA, POPIA, APPI, PCI-DSS, or internal norms without code changes.
regex Regular expression (regex): a compact pattern language for matching text—literals, character classes, quantifiers (*, +, ?, {m,n}), anchors (^, $), groups, and flags; this codebase follows typical Python re rules unless documented otherwise. The detector runs regex against sampled cell text and column names to recognise structured identifiers (e.g. CPF, email). Built-in patterns ship with the product; extras use regex override. Mis-tuned patterns increase false positives/negatives. See also: regex override, pattern_detected; SENSITIVITY_DETECTION.md.
regex override A custom regex pattern defined in regex_overrides_file (or inline). Each entry has name, pattern, and optional norm_tag. The detector matches against column names and sample text; a match produces a finding with that name and norm_tag. See regex_overrides.example.yaml and SENSITIVITY_DETECTION.md. Builds on: regex (this section).
sensitivity level One of HIGH, MEDIUM, LOW. Indicates how sensitive the detected data is (e.g. direct identifiers → HIGH; quasi-identifiers or context-only → MEDIUM/LOW). Used for filtering (report.min_sensitivity), report sheets, and the heatmap.
suggested review A report row or column (e.g. Excel Suggested review (LOW)) where ID-like or ambiguous columns are flagged at LOW sensitivity for human follow-up—often driven by detection.persist_low_id_like_for_review and related settings. See SENSITIVITY_DETECTION.md.
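
The regex override, pattern_detected, and finding rows above fit together as follows: each override entry carries a name, a pattern, and an optional norm_tag, and a match against a column name or sample text yields a finding. A minimal Python sketch of that flow is below. The entry values, the functions compile_overrides and detect, and the dictionary shape of a finding are assumptions for illustration, not the product's API or file format.

```python
import re

# Hypothetical override entries with the name / pattern / norm_tag fields.
OVERRIDES = [
    {
        "name": "CUSTOM_EMPLOYEE_ID",
        "pattern": r"\bEMP-\d{6}\b",
        "norm_tag": "Internal policy",
    },
    # norm_tag is optional, so this entry omits it.
    {"name": "CUSTOM_CPF_LIKE", "pattern": r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"},
]


def compile_overrides(entries):
    """Compile each entry once so patterns are not re-parsed per sample."""
    return [
        (e["name"], re.compile(e["pattern"]), e.get("norm_tag")) for e in entries
    ]


def detect(location, column_name, samples, compiled):
    """Return one finding per override that matches the column name or a sample."""
    texts = [column_name, *samples]
    findings = []
    for name, rx, norm_tag in compiled:
        if any(rx.search(text) for text in texts):
            findings.append(
                {
                    "location": location,
                    "pattern_detected": name,
                    "norm_tag": norm_tag,
                }
            )
    return findings


compiled = compile_overrides(OVERRIDES)
badge_findings = detect("hr.staff.badge", "badge", ["EMP-004211", "n/a"], compiled)
```

Findings in this sketch hold only the location and rule names, in line with the metadata-only reporting the glossary describes.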

4. Risk & quasi-identifiers

Term Definition
aggregated identification Analysis that treats risk in combination: the report highlights tables/files where multiple quasi-identifier-style signals (and related categories) co-occur, not only isolated cell matches. Config, sheet names, and wording evolve with SENSITIVITY_DETECTION.md and product releases.
minor (children's and adolescents' data) In risk and governance, personal data about children and adolescents usually demand stronger safeguards and context-specific assessment (e.g. LGPD Art. 14—best interest; GDPR rules for child-facing services; US COPPA and state minors' privacy laws for US-facing programmes—map with counsel). Organisations must interpret context (age bands, consent, sector) and apply tighter access, retention, and purpose limits than for adults. Data Boar may surface DATE-like or aggregated identification signals; it does not decide age, lawful basis, or parental consent—that is policy and legal work. See also: aggregated identification, quasi-identifier, counsel; MINOR_DETECTION.md.
quasi-identifier A piece of data that, alone or in combination with others, can contribute to re-identifying a person (e.g. gender, job, age band, postcode). When aggregated identification is enabled, the report flags tables/files where several quasi-identifier categories appear together (LGPD Art. 5, GDPR Recital 26).
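
The aggregated identification and quasi-identifier rows above describe flagging tables or files where several quasi-identifier categories co-occur. The Python sketch below shows one way to express that grouping. The function name, the (location, category) input shape, and the min_categories threshold are hypothetical, not documented product settings.

```python
from collections import defaultdict


def aggregated_identification(findings, min_categories=3):
    """Group quasi-identifier categories by location and keep the crowded ones.

    `findings` is an iterable of (location, category) pairs; the threshold
    is an illustrative assumption.
    """
    categories_by_location = defaultdict(set)
    for location, category in findings:
        categories_by_location[location].add(category)
    return {
        location: sorted(categories)
        for location, categories in categories_by_location.items()
        if len(categories) >= min_categories
    }


findings = [
    ("hr.staff", "gender"),
    ("hr.staff", "job"),
    ("hr.staff", "postcode"),
    ("hr.staff", "job"),  # duplicates do not inflate the count
    ("web.logs", "postcode"),
]
flagged = aggregated_identification(findings)
```

With these inputs only hr.staff is flagged: it combines three distinct categories, while web.logs holds a single one.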

5. Laws & data-protection roles

Term Definition
ANPD Autoridade Nacional de Proteção de Dados (Brazil): LGPD supervisory authority—guidance, investigations, sanctions. Not a substitute for counsel on your specific facts. Official: gov.br/anpd.
cardholder data Cardholder data (CHD) and sensitive authentication data (SAD) under PCI-DSS: e.g. PAN, cardholder name, expiry, full magnetic-stripe or chip data, CVV/CVC where rules prohibit storage. Distinct from generic PII—scope is payment-card security for merchants and processors. PCI SSC; product sample: compliance-sample-pci_dss.yaml.
CCPA California Consumer Privacy Act (as amended by CPRA—California Privacy Rights Act): US state law governing consumers' personal information for many businesses; includes rights to know, delete, correct, and opt out of sale or certain sharing (with sector/size thresholds and exceptions). Data Boar references CCPA in norm_tag values and built-in patterns (e.g. CCPA_SSN); mapping and samples: COMPLIANCE_FRAMEWORKS.md. Not legal advice. Official overview: California DOJ — CCPA. See also: PII (terminology vs GDPR personal data).
CISO Chief Information Security Officer (sometimes CSO in smaller organisations): executive accountable for the information security programme—risk treatment, control alignment with policy, incident-readiness, and coordination with IT operations. Overlaps with privacy on breaches and safeguards but is not the same statutory function as DPO / LGPD Encarregado. Typical consumer of Data Boar outputs for inventory, heatmaps, and technical evidence of where sensitive data appears—not a penetration test, not legal advice. See also: DPO.
CPML Context-dependent label, not a codified term in LGPD/GDPR. In US life-science commercial compliance, the same portfolios reference CMS fraud-and-abuse rules (Stark Law, Anti-Kickback), Sunshine / transparency reporting, and FDA promotion or manufacturing quality (CGMP; CPGM guidance manuals). Internal sheets may use codes like CPML next to HCP fields; confirm in your data dictionary. Do not assume CPML equals CGMP or CPGM without source proof. In engineering, CPML may still mean e.g. convolutional perfectly matched layer (simulation). Data Boar ships no CPML pattern; use regex overrides. See also: HCP, Stark Law.
controller / processor Controller decides why and how personal data are processed; processor processes on the controller’s instructions (GDPR Art. 4(7)–(8)). LGPD: controlador / operador (Art. 5). Determines contracts and accountability; Data Boar maps where data appears—not legal classification of roles. See also: processor (operador). GDPR Art. 4; LGPD: Lei 13.709/2018.
controls (report recommendations) In the Excel Recommendations sheet, wording such as “controles” / “controls” in the recommendation column refers to organizational and technical measures teams may apply—e.g. access restrictions, encryption, minimization, retention limits, pseudonymization. This is guidance text for DPO / CISO / security follow-up, not an automatic control matrix or certification. Recommendation overrides and compliance samples tune the phrasing. Distinct from engineering “controls” (section 6). See also: recommendation override; ADR 0025.
data subject Person whom personal data are about: GDPR data subject (Art. 4(1)); LGPD titular de dados (“titular”, Art. 5)—natural person to whom the processed personal data refer; broader than “customer”/“user” in every app label. LGPD rights (Art. 18, summary): confirmation of processing, access, correction, anonymization/blocking/deletion, portability, deletion of unnecessary data, information on sharing and consent, revocation of consent, petition to ANPD, etc. GDPR (Chapter III, summary): access, rectification, erasure, restriction, portability, objection, human intervention on automated decisions. HIPAA uses individual (not “data subject”): Privacy Rule rights include access, amendment, and accounting of disclosures to PHI—different statute and scope from LGPD/GDPR. Data Boar helps map data for governance; it does not fulfil data-subject requests. LGPD Art. 5 / 18; GDPR Art. 4(1) · Chapter III; HHS — individual rights (HIPAA).
DPO Data Protection Officer; LGPD Encarregado (Art. 41)—contact for data subjects and authority on processing. Typical reader of scan outputs. LGPD Art. 41. See also: CISO (security programme; distinct role).
GDPR General Data Protection Regulation (EU 2016/679). Full text hub: gdpr-info.eu. Product mapping: COMPLIANCE_FRAMEWORKS.md.
HCP Health care professional(s) — common abbreviation in life sciences, pharma, and healthcare IT for licensed or qualified professionals (e.g. physicians, nurses, pharmacists) distinct from patients. CRM, sampling, and consent programmes often hold personal data about HCPs (names, professional IDs, territories). Data Boar does not define a dedicated “HCP” pattern; use regex overrides, norm_tag / recommendation overrides, and aggregated identification (e.g. health-related categories) where your policy requires. See also: data subject, PHI, sensitive personal data / special category data.
HHS United States Department of Health and Human Services: federal executive department that houses CMS (Medicare/Medicaid, including Stark programme administration) and the FDA (drugs, devices, biologics), and publishes HIPAA rules enforced by its Office for Civil Rights (OCR; here not optical character recognition). Often cited next to PHI and HIPAA consumer materials. Do not confuse with HSS: other fields reuse HSS (e.g. institutional names, unrelated abbreviations); check context. HHS.gov. See also: PHI, HIPAA, Stark Law.
legal basis The statutory ground that justifies processing personal data: LGPD Art. 7 (items I–X, e.g. consent, legal obligation, studies by research body, legitimate interest where applicable); GDPR Art. 6 (lawful bases). In reports, the Base legal column (base_legal in recommendation overrides) holds short citations or summaries aligned to norm_tag: reminder text for humans, not an engine deciding that a basis applies to your facts. See also: norm_tag, recommendation override; COMPLIANCE_AND_LEGAL.md.
legal entity and personal data (scanning) Under LGPD/GDPR, personal data relate to natural persons. A CNPJ (or foreign company ID) and corporate name identify a legal person; they do not cause the engine to skip emails, phones, CPF, or names of members, representatives, or sales contacts when samples match patterns—those often identify individuals. Data Boar applies regex and ML/DL to column names and sample text; it does not implement “this table is only legal entities, therefore ignore LGPD.” CNPJ columns match LGPD_CNPJ; contact fields can still produce EMAIL, PHONE, or other HIGH/MEDIUM findings. Legal characterisation is for counsel and policy; the product surfaces findings for review (sampling and tuning can still yield false negatives). See also: PII, finding.
LGPD Lei Geral de Proteção de Dados (Brazil 13.709/2018). Official law text: Planalto. Samples and labels: COMPLIANCE_FRAMEWORKS.md.
PHI Protected Health Information (US HIPAA Privacy Rule): individually identifiable health information held or transmitted by a covered entity or business associate in any form (oral, paper, electronic). ePHI is the electronic subset. Distinct legal category from LGPD/GDPR personal data labels—map programmes with counsel. See also ePHI (§9). HHS — PHI basics.
PII Personally identifiable information—information relating to an identified or identifiable natural person (aligned with GDPR/LGPD personal data). Broader umbrella than sensitive personal data, PHI, or cardholder data; those are special categories under their own rules. Surfaced via findings, sensitivity level, quasi-identifier analysis. Terminology: GDPR uses personal data (Art. 4(1)); CCPA/CPRA uses personal information—often overlapping in practice but not identical in statutory scope or exclusions (map specifics with counsel). PII is common US English in security/compliance; do not abbreviate CCPA personal information as IP (ambiguous with Internet Protocol and intellectual property). See COMPLIANCE_AND_LEGAL.md; See also: CCPA.
processor (operador) Processor (LGPD operador, Art. 5): person or entity that processes personal data on behalf of the controller, following documented instructions (contract, DPA, or equivalent)—e.g. cloud host, payroll SaaS, payment gateway, outsourced support desk. Not the same as “the operator running the scanner” in deployment docs (technical operator). Whether you are controller or processor for a given dataset is a legal fact pattern; Data Boar maps where data appears, not role conclusions. See also: controller / processor; GDPR Art. 4(8); LGPD Art. 5º.
SAR Subject Access Request (often DSAR, Data Subject Access Request): a request by a data subject to exercise access and related transparency rights, e.g. GDPR Art. 15 (confirmation of processing, copy of personal data); UK GDPR uses the same framing. LGPD Art. 18 includes access and confirmation among titular rights, typically handled via privacy / legal workflows comparable in spirit. Data Boar helps map where personal data may appear for inventory and triage; it does not automate or complete SAR/DSAR fulfilment and does not replace DPO or counsel processes. See also: data subject. GDPR Art. 15.
sensitive personal data / special category data LGPD dados pessoais sensíveis (Art. 5, II): e.g. racial/ethnic origin, religious belief, political opinion, union or similar membership, health or sex life, genetic or biometric data when tied to a person. GDPR special categories (Art. 9): overlapping list (health, biometric ID, genetics, etc.) with conditions and prohibitions. Not automatically the same list as PHI or PCI data—framework and purpose differ. Stricter rules often apply than for ordinary personal data. LGPD Art. 5; GDPR Art. 9.
Stark Law US federal physician self-referral statute (Ethics in Patient Referrals Act, 42 USC §1395nn): restricts Medicare referrals for designated health services when the referring physician (or immediate family) has a financial relationship with the receiving entity, subject to exceptions and CMS regulations. Often discussed alongside the Anti-Kickback Statute—both are fraud-and-abuse / referral rules, not the same as HIPAA privacy of PHI. Relevant when inventorying US healthcare datasets or compliance narratives that mention Stark. Data Boar maps where data appear; it does not decide Stark compliance. CMS — physician self-referral (Stark). See also: PHI, HCP.

6. Engineering: reliability, supply chain & records

Term Definition
ADR Architecture Decision Record: a short document capturing a significant technical decision, its context, and consequences. This repo keeps ADRs under docs/adr/, starting with the baseline 0000. The concept originates with Michael Nygard (2011); this repo formalises its MADR lineage as UMADR (ADR 0045). Onboarding walkthrough: DECISION_RECORDS_PRIMER.md.
MADR Markdown Any Decision Records: a widely used Markdown template and convention for writing ADRs (adr.github.io/madr), building on Michael Nygard's 2011 ADR concept. Roots of this repo's ADR format. See also: ADR, UMADR.
UMADR Unified MADR: this repository's ADR constitution (ADR 0045) — MADR/Nygard roots plus U = Human-in-the-Loop (single maintainer curator), an immutable genesis date (git history as audit trail), an extended status enum (incl. Quarantined, Obsolete, Superseded), and en_US-only prose. See also: ADR, MADR.
boar_fast_filter The Rust extension module (compiled via PyO3) that provides a high-performance pre-filter pass before the Python detection pipeline. When available, it discards chunks that cannot contain sensitive data faster than pure Python regex, reducing CPU time and improving throughput. Benchmarked against the pure-Python path in A/B completão runs (--bench-compare). See TECH_GUIDE.md; build: rust/boar_fast_filter/. Performance contract: measured delta between stable and beta tracks is the primary signal for confirming a PyO3 or Rust core fix.
Agile Manifesto The 2001 Manifesto for Agile Software Development (Agile Manifesto): values working software, collaboration, and responding to change. Maintainer docs may reference it when discussing delivery culture; it is not a product feature.
CI/CD Continuous Integration / Continuous Delivery (or Deployment): automated build, test, and deploy pipelines; relevant to release discipline and supply-chain checks.
container A packaged, runnable image (typically OCI-compatible—e.g. Docker image built from this repo’s Dockerfile, published as fabioleitao/data_boar on Docker Hub) that includes the app and runtime. Mount a config file at the documented path (e.g. /data/config.yaml) and use the same YAML/JSON config as bare-metal. See DEPLOY.md, USAGE.md (Docker quick start).
CVE Common Vulnerabilities and Exposures: public identifiers for known security issues in software components. Dependency updates and CI scans (e.g. GitHub Dependabot alerts) reference CVEs; this repo’s SECURITY.md and chore(deps) workflow treat them as input to risk triage—not a product feature. See also: SBOM; SECURITY.md.
defense in depth Layered controls: assume any single control can fail; combine network, identity, application, and monitoring layers so gaps do not become total exposure. Aligns with secure by design and Zero Trust reasoning; not a product feature—operator architecture choices apply at deployment.
DevSecOps Practices that embed security in development and operations (pipelines, dependencies, secrets hygiene). Non-breaking alignment notes: OBSERVABILITY_SRE.md.
external connectivity evaluation Maintainer lab playbook (LAB_EXTERNAL_CONNECTIVITY_EVAL.md): exercise REST and database connectors against public HTTP APIs, optional read-only third-party databases (subject to provider ToS), and intentional failure targets—supplements LAB_SMOKE_MULTI_HOST.md and CI; not a substitute for release or HOMELAB_VALIDATION alone. Session keyword external-eval. ADR 0028.
least privilege Grant only the minimum access, scope, and duration needed for a role, account, or process (humans and services). Complements Zero Trust and RBAC; see SECURITY.md and deployment choices (api.require_api_key, reverse-proxy auth).
pipeline In software delivery, an automated pipeline is the ordered sequence of jobs (lint, test, build, security scans, publish) run as CI/CD—e.g. GitHub Actions workflows in this repo (see .github/workflows/). In data engineering, “pipeline” can also mean an ETL or processing flow from sources to outputs; unless context says otherwise, maintainer docs here mean automation around the codebase.
privacy by design Build data protection into systems by default (GDPR Art. 25–style technical and organisational measures). Related to secure by design; Data Boar choices such as metadata-only findings and sampling limits reflect this posture—they do not replace organisational privacy programmes or counsel. GDPR Art. 25.
SBOM Software Bill of Materials: machine-readable inventory of software components (e.g. CycloneDX). Roadmap: ADR 0003.
provenance Verifiable record of how, where, and from what an artifact was built (build system, source commit, inputs) — the evidence a consumer checks before trusting a binary. Basis of SLSA attestations. See also: SBOM, attestation, SLSA.
attestation Signed, machine-verifiable statement about an artifact (e.g. an in-toto / SLSA predicate) binding provenance or an SBOM to a subject digest. See also: Sigstore, provenance.
SLSA Supply-chain Levels for Software Artifacts (OpenSSF) — graded framework (levels L1-L3+) for build-pipeline integrity and tamper-resistant provenance. See also: provenance, SBOM, CVE.
in-toto CNCF framework for end-to-end supply-chain integrity: each pipeline step emits a signed attestation verified against a declared layout. See also: attestation, SLSA.
Sigstore Keyless artifact / attestation signing with a public transparency log (cosign, Fulcio, Rekor); pairs with provenance and SBOM for verifiable supply chains. See also: attestation, SBOM.
SAST Static Application Security Testing: analysis of source code (and sometimes bytecode) without running the full app—e.g. CodeQL, Semgrep, Bandit in CI. Complements manual review and runtime tests; not a guarantee of absence of bugs. This repo runs selected tools in pipelines; see SECURITY.md and .github/workflows/. See also: CVE, shift-left security.
Bandit Python SAST linter for common security issues (subprocess, eval, weak crypto). Data Boar's strict CI gate runs bandit -r . -c pyproject.toml -ll -ii (MEDIUM+ severity and confidence). See also: SAST, CI/CD.
CodeQL GitHub's semantic code-analysis engine: queries code as a database to find security patterns. Runs on push / PR and on a weekly schedule. See also: SAST, Semgrep.
Semgrep Fast, pattern-based SAST across many languages. Data Boar runs the OSS p/python ruleset in CI. See also: SAST, CodeQL.
zizmor Security linter for GitHub Actions workflows (token scope, script injection, unpinned actions). An enforced CI gate (issue #732), with an optional local pre-commit stage via scripts/workflow-security-lint. See also: SAST, CI/CD, least privilege.
Gitleaks Secret-scanning tool that flags committed credentials and keys across the working tree and Git history. Runs as a dedicated CI gate. See also: SAST, CI/CD.
SonarQube / SonarCloud Continuous code-quality and security-hotspot platform; a token-gated CI gate tracking maintainability, reliability, and security findings. See also: SAST, CI/CD.
pip-audit Audits installed Python dependencies against known vulnerability advisories (CVE / PyPI); the CI dependency-audit gate. See also: CVE, SBOM.
Ruff Fast Python linter and formatter; the style/lint gate in pre-commit and the CI Lint job (ruff, ruff format). See also: CI/CD.
Syft SBOM generator (Anchore) that inventories a built image into CycloneDX JSON; complements the lockfile-based SBOM. See also: SBOM, provenance.
DAST Dynamic Application Security Testing: probes a running application (contrast with SAST on source). Listed for completeness — Data Boar's pipeline today emphasises SAST plus dependency, secret, and workflow scanning. See also: SAST.
secure by design Security requirements and controls considered from architecture onward, not only after release. Pairs with DevSecOps, shift-left security, and defense in depth. See SECURITY.md.
shift-left security Address security earlier in the lifecycle (design, code, CI gates) rather than only in production. Overlaps DevSecOps; this repo encodes part of that via lint, tests, and pre-commit—not a guarantee of maturity on its own.
SLI / SLO / SLA Service Level Indicator (what you measure), Objective (internal reliability target), Agreement (contractual commitment, often derived from SLOs). Explained in OBSERVABILITY_SRE.md §2.
SRE Site Reliability Engineering: discipline for running reliable production software (error budgets, SLOs, runbooks). The project discusses alignment in OBSERVABILITY_SRE.md.
TAMPERED Integrity state set when behaviour-critical modules (the hashed allowlist in core/integrity_anchor.py: main.py, core/detector.py, core/engine.py, core/licensing/guard.py, api/routes.py) diverge from the validated SQLite integrity anchor on startup re-verify, or when the release build digest / manifest policy check fails. With licensing.mode: enforced, the runtime fails closed: the effective tier is capped at Community and Pro/Enterprise features are denied; in open mode it logs CRITICAL and continues. Drives the -alpha trust label. Tamper-evident, not tamper-proof. See also: TINTED / -alpha, Safe-Hold; ADR 0066, INTEGRITY_CHECK_ALPHA_LOGIC.md.
TINTED / -alpha Degraded trust level (trust_level=adulterated) the runtime self-applies when the integrity anchor detects a TAMPERED state: every user-visible surface (report Info sheet, dashboard footer, GET /about, GET /status, /health, startup logs) shows the -alpha suffix — "development / not CI-validated" — and never impersonates a stable release. See also: TAMPERED, Safe-Hold; INTEGRITY_CHECK_ALPHA_LOGIC.md, ADR 0066.
toil In SRE practice (common Google-style definition): manual, repetitive, automatable operational work that tends to grow with the estate—ticket churn, copy-paste, one-off scripts, and “spreadsheet archaeology” that does not produce durable design. Healthy teams budget down toil in favour of automation and productised tooling. Data Boar targets data-governance toil for corporate and delivery teams: connector-driven discovery, sampling, structured findings, Excel reports, and heatmaps replace endless ad-hoc greps and purely manual inventories—while humans still own norms, sampling limits, and false negative risk. See also: SRE, session, report.
Zero Trust Security model: do not assume trust from network location alone; verify explicitly (identity, device health, least privilege, continuous validation). NIST SP 800-207 is a common reference. Deployment (TLS, API keys, segmentation) is an operator responsibility—the product does not implement a full ZT architecture by itself. NIST SP 800-207.
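
The TAMPERED row above describes comparing hashes of behaviour-critical modules against a validated anchor. A minimal sketch of that idea follows, assuming SHA-256 digests held in a plain dictionary. The helper names `digest` and `integrity_state`, the in-memory anchor, and the sample module contents are hypothetical illustrations, not the contents of core/integrity_anchor.py.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a module's bytes."""
    return hashlib.sha256(data).hexdigest()


def integrity_state(anchor: dict[str, str], current: dict[str, bytes]) -> str:
    """Compare current module bytes with anchored digests (hypothetical helper).

    Returns "TAMPERED" if any allowlisted module is missing or its digest
    differs from the anchor, otherwise "OK".
    """
    for path, expected in anchor.items():
        data = current.get(path)
        if data is None or digest(data) != expected:
            return "TAMPERED"
    return "OK"


# Hypothetical anchor recorded at validation time.
validated = {"main.py": b"print('scan')\n", "core/engine.py": b"ENGINE = 1\n"}
anchor = {path: digest(data) for path, data in validated.items()}

# One module changes after validation.
modified = dict(validated)
modified["core/engine.py"] = b"ENGINE = 2\n"

state_clean = integrity_state(anchor, validated)
state_modified = integrity_state(anchor, modified)
```

A check of this kind is tamper-evident rather than tamper-proof: whoever can rewrite both the modules and the anchor can make them agree again.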

7. APIs, authorization & transport

Term Definition
CDN Content Delivery Network: geographically distributed edge network that caches static content and serves traffic close to users—often combined with TLS termination, large-scale DDoS absorption, and faster delivery of scripts and assets. Data Boar is not a CDN; placing a public dashboard behind CDN → WAF → reverse proxy is an operator architecture choice, not a product feature. See also: DoS / DDoS, WAF; DEPLOY.md, SECURITY.md.
CSP Content Security Policy: HTTP response header (Content-Security-Policy) that tells browsers which script, style, and connection sources are allowed—reducing XSS and unexpected third-party loads. Data Boar sets a default CSP for the dashboard; stricter profiles are often applied at the reverse proxy—see SECURITY.md, USAGE.md (API and security). See also: HSTS; WAF.
connect and read timeouts Configurable limits in seconds, set globally and per target, that bound how long the engine waits when opening a connection (connect) and when reading from it (read) across databases, APIs, file shares, and similar targets. They stop one stuck or extremely slow target from blocking an entire session indefinitely and cap outbound wait and load; values set too low cause spurious timeouts on busy networks or during maintenance windows. Per-target overrides help when only one source is slow. See USAGE.md (Timeouts).
DoS / DDoS Denial of Service (DoS): an attack or abuse pattern that degrades or blocks availability (CPU, memory, connections, bandwidth) so legitimate use suffers. Distributed DoS (DDoS) coordinates many sources against one target—classic network/application-layer threat. Two angles for Data Boar: (1) Inbound to the dashboard/API—floods, rapid scan triggers, or huge request bodies aimed at the Data Boar service. Mitigations in product: HTTP 413 on oversized bodies, optional rate_limit (caps concurrent scans and enforces a minimum interval between scan starts; HTTP 429 when exceeded), optional api.require_api_key. Deployment: reverse proxy, TLS, network policy—see SECURITY.md, DEPLOY.md (hardening). (2) Outbound from Data Boar to customer databases, APIs, and shares—aggressive settings could accidentally overload those systems (operator “self-DoS” of the target). Mitigations: rate_limit, conservative scan.max_workers, timeouts, tuning in USAGE.md. Data Boar is not a DDoS scrubbing or CDN service; edge protection (WAF, provider anti-DDoS, firewall) remains with the customer/operator.
HSTS HTTP Strict Transport Security: policy header (Strict-Transport-Security) telling browsers to use HTTPS only for a host for a period of time. Meaningful when the app or reverse proxy serves TLS and forwarded proto headers reflect the client scheme—see TECH_GUIDE.md. See also: TLS / HTTPS; SECURITY.md.
JSON JavaScript Object Notation: text format for structured data (objects, arrays, strings, numbers, booleans, null). Data Boar uses JSON for REST response bodies (GET /health, /status, /about/json), POST /scan optional bodies, --export-audit-trail output, and can load legacy config.json (normalized to the same shape as YAML config). Prefer YAML for hand-edited operator config when possible—see YAML. USAGE.md.
OAuth2 OAuth 2.0: authorization framework. Used for machine-to-machine access to some API targets (e.g. client credentials). Not the same as end-user “login with Google” unless configured that way. See USAGE.md.
OIDC OpenID Connect (OIDC): identity layer on top of OAuth 2.0—standard way to obtain ID tokens (often JWTs) and userinfo for browser or native sign-in. Distinct from bare OAuth2 client-credentials flows used for service-to-service API access. In deployments, a reverse proxy or IdP may terminate OIDC for human access to the dashboard; that is operator integration, not the scanner’s REST target auth. See also: JWT (different use in licensing drafts; section 11).
OpenAPI OpenAPI Specification (formerly Swagger): machine-readable description of REST routes, parameters, and responses. Data Boar exposes interactive docs (e.g. /docs) generated from the app’s OpenAPI schema—useful for integrators and contract tests; see USAGE.md, TESTING.md.
RBAC Role-based access control: authorization where access to routes or features depends on a subject’s role or group. Data Boar’s HTML dashboard today relies on deployment choices (network, reverse proxy, optional global api.require_api_key); finer-grained in-app RBAC for reports is tracked as GitHub issue #86. Maintainers coordinate delivery sequencing in the internal backlog (docs/plans/, entry via README.md).
rate limiting (rate_limit) Optional limits on the API side: a cap on concurrent scans and a minimum interval between new scan starts. Endpoints that launch scans may return HTTP 429 when a limit is hit. This reduces inbound abuse of the Data Boar service and throttles how often outbound scanning work is triggered against customer systems. Pair with scan.max_workers and timeouts; production defaults should stay conservative. See USAGE.md (Rate limiting and safe concurrency).
REST Representational State Transfer: a style of HTTP APIs. Data Boar can scan REST targets with configured auth (Bearer, Basic, OAuth2 client credentials, etc.); see USAGE.md API targets section.
request body size limit The API rejects bodies above 1 MB on relevant POST routes (e.g. config and scan triggers) with HTTP 413 Payload Too Large, limiting application-layer abuse via huge payloads. See SECURITY.md (Recommendations for technicians).
reverse proxy Intermediary HTTP server (nginx, Traefik, Caddy, cloud load balancer, OAuth2 Proxy, etc.) that sits in front of the app: it often terminates TLS, forwards requests to Data Boar, and can apply auth, WAF rules, extra rate limits, and response headers. Set X-Forwarded-Proto: https (and related forwarded headers) so the app’s HTTPS / HSTS behaviour matches the client-facing scheme; see TECH_GUIDE.md. No code change is required for basic operation behind NAT, LB, or reverse proxy; see DEPLOY.md. See also: TLS / HTTPS; SECURITY.md.
scan parallelism (scan.max_workers) Within a session, how many targets are processed in parallel (scan.max_workers: 1 = sequential; higher = concurrent I/O). Lower values mean fewer simultaneous connections and less concurrent load on customer databases, APIs, and file servers—preferred on fragile networks, shared infrastructure, or during maintenance. Pair with sampling (see Session, targets section) so each target is only partially read for detection. See USAGE.md (scan block and timeouts guidance).
TLS / HTTPS Transport Layer Security encrypts HTTP (HTTPS). The dashboard/API can use PEM certificates (api.https_* or CLI) or TLS termination at a reverse proxy; see USAGE.md and SECURITY.md.
WAF Web Application Firewall: HTTP(S)-aware filter (often at a reverse proxy, load balancer, or cloud edge) that inspects requests/responses and blocks or rate-limits common abuse (e.g. injection attempts, suspicious paths, automated floods) using rules and signatures. Complements in-app mitigations (rate_limit, api.require_api_key) but does not replace them. Data Boar does not ship an embedded WAF—configure one in front of the service when internet-exposed. See also: CDN, reverse proxy, DoS / DDoS; SECURITY.md, DEPLOY.md.
YAML YAML Ain’t Markup Language: indentation-oriented structured text for configs. Data Boar’s single operator config file is YAML or JSON (--config, CONFIG_PATH, default config.yaml); compliance sample fragments live under compliance-samples/ and merge into config; regex override examples use .yaml; the dashboard /config route edits the same file as YAML. Why YAML: readable diffs and comments-friendly for DPO/operator review; the loader accepts JSON for the same keys when preferred. USAGE.md, compliance sample (§9).
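
The request body size limit and rate limiting rows above describe three rejection paths: HTTP 413 for oversized bodies, and HTTP 429 when too many scans are running or when a scan starts too soon after the previous one. A minimal sketch of that decision logic follows. The class `ScanGate`, its parameters and defaults, the exact byte count used for "1 MB", and the 202 success status are assumptions made for illustration; the real option names and behaviour are documented in USAGE.md.

```python
import time
from typing import Optional

MAX_BODY_BYTES = 1_000_000  # "1 MB"; the exact byte count is an assumption


class ScanGate:
    """Hypothetical admission check for scan-start requests."""

    def __init__(self, max_concurrent: int = 1, min_interval_s: float = 30.0):
        self.max_concurrent = max_concurrent
        self.min_interval_s = min_interval_s
        self.running = 0
        self.last_start: Optional[float] = None

    def admit(self, body_len: int, now: Optional[float] = None) -> int:
        """Return the HTTP status this request would receive."""
        now = time.monotonic() if now is None else now
        if body_len > MAX_BODY_BYTES:
            return 413  # oversized body
        if self.running >= self.max_concurrent:
            return 429  # too many concurrent scans
        if self.last_start is not None and now - self.last_start < self.min_interval_s:
            return 429  # started too soon after the previous scan
        self.running += 1
        self.last_start = now
        return 202  # accepted (assumed status for a background scan)

    def finish(self) -> None:
        """Mark one running scan as finished."""
        self.running = max(0, self.running - 1)


gate = ScanGate(max_concurrent=1, min_interval_s=30.0)
statuses = [
    gate.admit(body_len=2_000_000, now=0.0),  # body too large
    gate.admit(body_len=100, now=0.0),        # first scan starts
    gate.admit(body_len=100, now=5.0),        # previous scan still running
]
gate.finish()
statuses.append(gate.admit(body_len=100, now=10.0))  # interval not yet elapsed
statuses.append(gate.admit(body_len=100, now=45.0))  # interval elapsed
```

Passing `now` explicitly keeps the sketch deterministic; a real service would rely on a monotonic clock and would need locking around the counters when requests arrive concurrently.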

8. Detection technology

Term Definition
deterministic detection stack (product) The shipped engine combines regex and named patterns, optional structural checks, and supervised ML / DL (your training terms, fixed random_state) to emit repeatable sensitivity scores and metadata-first findings on already-sampled data—not open-ended natural-language generation. Advanced inventory signals (quasi-identifier aggregation, minor detection, anchor jurisdiction framing, jurisdiction hints / tension) follow the same audit-friendly contract (still heuristics for workshops, not counsel). Contrast: LLM row below; posture narrative: TECH_GUIDE.md, COMPLIANCE_FRAMEWORKS.md, docs/ops/LLM_AGENT_EDITING_CAUTION.md. Optional future supervised / detector experiments are tracked in maintainer-only plan material named PLAN_ADDITIONAL_DETECTION_TECHNIQUES_AND_FN_REDUCTION.md (path index: README.md Internal and reference).
LLM Large Language Model: a large neural language model (typically transformer-based) trained on broad text to generate continuations, answers, or code from a prompt—open-ended, often non-deterministic, and may hallucinate plausible false content. Distinct from Data Boar’s supervised ML / DL stack in this section. The product does not use generative LLMs for sensitivity detection. See also: deterministic detection stack (product), language model (LM) (§8b), ML / DL, TF-IDF, Random Forest; long-form SENSITIVITY_DETECTION.md; history and limits: AI_EVOLUTION_PRIMER.md.
ML / DL Colloquially “AI”; here supervised scoring on your training terms and sampled text—not LLM-style generative models (see LLM above). Regex covers deterministic patterns (e.g. CPF, email). ML uses TF-IDF + Random Forest; DL (optional) uses sentence embeddings and a small classifier when sentence-transformers is installed—see TF-IDF, Random Forest, and SENSITIVITY_DETECTION.md. Outputs confidence scores (0–100) for sensitivity bands and findings; random_state fixed for reproducibility. Scores evidence already in the target only. False positives/negatives remain possible; tune thresholds.
Random Forest Supervised ensemble of many decision trees, each trained on a bootstrap sample of rows with a random subset of features considered at each split; predictions combine votes (or averaged probabilities) and are usually more stable than those of a single tree. In Data Boar’s ML path, a RandomForestClassifier (e.g. scikit-learn) consumes TF-IDF vectors from your (text, label) pairs and sampled column/cell text; a fixed random_state keeps confidence scores reproducible. See also: TF-IDF, ML / DL; SENSITIVITY_DETECTION.md.
TF-IDF Term Frequency–Inverse Document Frequency: a classic text representation that weighs how characteristic a term is in a document relative to the corpus—very common words get low weight, more distinctive terms higher. It maps text to a sparse numeric vector for traditional ML. Here it feeds the Random Forest layer together with your sensitivity training terms. See also: Random Forest, ML / DL; SENSITIVITY_DETECTION.md.
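
The TF-IDF row above can be made concrete with a small, dependency-free calculation. This sketch uses raw term counts and the smoothed formula idf(t) = ln((1 + N) / (1 + df(t))) + 1, which corresponds to scikit-learn's default smoothing but leaves out its vector normalisation. The three sample documents and the helper name `tf_idf` are illustrative assumptions, not Data Boar training data or code.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append(
            {
                term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
                for term, count in tf.items()
            }
        )
    return weights


docs = [
    "customer name and cpf".split(),
    "customer name and email".split(),
    "customer order total".split(),
]
weights = tf_idf(docs)
# "customer" appears in every document and "cpf" in only one,
# so "cpf" receives the higher weight in the first document.
cpf_weight = weights[0]["cpf"]
customer_weight = weights[0]["customer"]
```

In an ML path like the one described in this section, vectors of this kind would feed the Random Forest layer, for example a scikit-learn `TfidfVectorizer` followed by a `RandomForestClassifier` with a fixed `random_state`.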

8b. AI history and method families (primer)

Short glossary anchors for the non-hype narrative in AI_EVOLUTION_PRIMER.md (pt-BR). Decades and “winter” timing are approximate labels used in teaching, not precise economic data.

Term Definition
AI winter A period of reduced funding and expectations for AI after hype cycles oversold results; several occurred from the 1970s onward. Useful as history, not as proof that any specific technique “failed forever.” See: AI_EVOLUTION_PRIMER.md.
deep reinforcement learning (DRL) Reinforcement learning where policies, value functions, or models are implemented with deep neural networks (e.g. DQN, actor–critic with CNN/MLP backbones). DL supplies the function approximator inside the RL loop—it is not a separate “fourth paradigm” beside ML/DL/RL, but RL + deep nets. Acronym: DRL is standard in English papers and products; “RDL” is not a widely accepted label and invites confusion. Strengths / limits: combine RL issues (reward design, exploration, deployment safety) with DL issues (data/compute, opaque failures). See also: reinforcement learning (RL); AI_EVOLUTION_PRIMER.md.
expert system Rule- and knowledge-base programs encoding human domain logic (often if–then rules) for narrow tasks (config, diagnosis checklists). Strengths: transparent logic, audit-friendly when rules are documented. Limits: brittle outside the modeled domain; expensive to maintain. Distinct from modern LLM chat. See: AI_EVOLUTION_PRIMER.md.
language model (LM) A model that assigns probabilities (or related scores) to word sequences, from n-gram models to neural LMs. Smaller LMs powered many pre-transformer NLP components (e.g. speech pipelines). LLM usually means the same family at much larger scale, typically built on the transformer architecture. See also: LLM (§8); AI_EVOLUTION_PRIMER.md.
Lisp machine 1980s workstations optimized for Lisp and AI R&D; emblematic of the symbolic / expert-system wave. Declined with cheaper general-purpose hardware and shifting ROI—not a verdict on every future approach. See: AI_EVOLUTION_PRIMER.md.
reinforcement learning (RL) Learning by trial and error with rewards: policy improves from interaction (games, robotics, recommendation re-ranking in some stacks). Strengths: strong when simulation or clear reward exists. Limits: sample-inefficient, reward mis-specification, safety when deployed open-loop. When neural networks implement the policy or value head at scale, see deep reinforcement learning (DRL). Not the Data Boar sensitivity engine. See also: deep reinforcement learning (DRL); AI_EVOLUTION_PRIMER.md.
symbolic AI Logic, search, rules, and structured representations (early “good old-fashioned AI”). Strengths: interpretable steps, provable in tiny domains. Limits: knowledge acquisition bottleneck, weak on raw unstructured text at web scale. Complements (rather than replaces) statistical ML in many systems. See: AI_EVOLUTION_PRIMER.md.
transformer (architecture) Neural sequence model built on self-attention; foundation of most LLMs since ~2017. Strengths: parallelizable training, strong performance on broad language and cross-modal tasks when scaled. Limits: heavy compute and data requirements, non-determinism at inference unless tightly constrained; not a substitute for evidence-first compliance scanning. See also: LLM (§8); AI_EVOLUTION_PRIMER.md.

9. Compliance positioning & artefacts

Term Definition
anonymization / pseudonymization Pseudonymization: replaces identifiers so data cannot be attributed to a person without additional information kept separately, such as a key (GDPR Art. 4(5)); the result generally remains personal data. Anonymization: irreversible; the data stop being personal data only if re-identification is not reasonably likely. Data Boar helps find identifiers and quasi-identifiers; it does not certify anonymization. GDPR Art. 4(5); EDPB hub: edpb.europa.eu.
anchor jurisdiction (operational framing) Not a statute picker in code: the operational and policy home teams use when interpreting a scan (where the engine runs, where the systems under inventory sit, which enterprise policy deck applies). Pairs with drifted data persona in multinational workshops. See JURISDICTION_COLLISION_HANDLING.md (pt-BR), ADR 0038.
compliance sample YAML fragment under compliance-samples/ merged into config: norm_tag vocabulary, terms, recommendation overrides. See COMPLIANCE_FRAMEWORKS.md.
counsel Legal counsel (in-house or external)—analysis applying law to facts. The product supplies technical evidence; it does not replace counsel. ADR 0025.
data minimization Principle: process only what is adequate, relevant and limited to what is necessary for the purposes (GDPR Art. 5(1)(c); LGPD Art. 6, III). Scans reveal where excessive or legacy data may linger; retention/deletion is a business/legal decision. GDPR Art. 5.
drifted data persona (through-traffic framing) Narrative label for rows that look foreign or multi-country relative to the anchor operation (e.g. crew manifests, passports, offshore HR IDs in a host-country system). The product does not auto-classify drift; hints and findings surface signals for counsel workshops. See JURISDICTION_COLLISION_HANDLING.md, use-cases/README.md (storyboard hub; includes port logistics).
DPIA Data Protection Impact Assessment (GDPR Arts. 35–36 style). In Brazil, comparable process: RIPD (LGPD Art. 38). Outputs can inform assessment; product does not complete or approve. GDPR Art. 35.
ePHI electronic Protected Health Information (US HIPAA): identifiable health information in electronic form for covered entities / business associates. US-only category; in Brazil use sensitive personal data (LGPD Art. 5 II) with counsel. HHS HIPAA; COMPLIANCE_AND_LEGAL.md.
FCPA Foreign Corrupt Practices Act (US): federal anti-bribery and books-and-records / internal accounting controls rules for covered persons—DOJ / SEC enforcement. Product does not detect bribes or conclude violations; optional compliance-sample-us_fcpa_internal_policy_pack.yaml adds policy lexicon for inventory. See also: COMPLIANCE_AND_LEGAL.md; ADR 0025.
GRC Governance, risk, and compliance: umbrella term for programmes that align organisational governance (policies, oversight), risk management (identify, assess, treat), and regulatory/compliance obligations. Data Boar supports technical evidence for data-related GRC work—inventory, findings, metadata-only reporting, and heatmaps—not an enterprise GRC platform and not legal advice. See also: controls (report recommendations), RoPA, DPIA, counsel; COMPLIANCE_AND_LEGAL.md; executive risk-matrix JSON contract GRC_EXECUTIVE_REPORT_SCHEMA.md.
jurisdiction hint Optional Report info text when report.jurisdiction_hints (and/or per-session opt-in) scores metadata for possible relevance to named regimes (today: US-CA, US-CO, JP heuristics in code). Heuristic only—high false positive/negative rate; not applicable law. See also: jurisdiction tension, scent origin; ADR 0026, JURISDICTION_COLLISION_HANDLING.md.
jurisdiction tension (overlapping hints) When two or more jurisdiction hints fire for the same session, or when norm tags and path tokens “pull” toward different regimes—inventory stress, not a software verdict. Organisations may apply stricter interim safeguards pending counsel; the product does not encode “most restrictive law wins.” ADR 0038.
HIPAA Health Insurance Portability and Accountability Act (US): Privacy, Security, and Breach Notification rules for PHI/ePHI. Does not replace LGPD for Brazil-linked processing. Official hub: HHS HIPAA; product labels: COMPLIANCE_FRAMEWORKS.md.
KYC Know Your Customer: identity and risk checks (often finance). Product does not verify identity or screen lists; may locate onboarding PII in the data soup. COMPLIANCE_AND_LEGAL.md.
metadata-only finding Finding recording where and what pattern matched—no raw personal values in the report bundle by default. See COMPLIANCE_AND_LEGAL.md.
PEP Politically Exposed Person (AML/CFT). Product does not screen sanctions/PEP lists; may locate possible PII in files. COMPLIANCE_AND_LEGAL.md.
RIPD Relatório de Impacto à Proteção de Dados (LGPD Art. 38): controller report when processing may pose high risk; ANPD may regulate details. Purpose similar to GDPR DPIA; not identical law. Product supplies technical evidence to inform drafting—does not file or approve. ANPD; LGPD Art. 38.
RoPA Record of Processing Activities (GDPR Art. 30–style). Technical inventory feeds RoPA maintenance; does not replace the organisational register. GDPR Art. 30.
scent origin (metaphor) Optional narrative for jurisdiction hints: multiple weak geographic signals in metadata behave like several scents at once—useful for workshops, not legal certainty. See JURISDICTION_COLLISION_HANDLING.md; ADR 0038.
SCC Standard Contractual Clauses for international transfers (EU tool). Signature and legal adequacy are outside the product; scans inform which categories exist where. Commission texts via EUR-Lex (search SCC 2021/914).
SOE State-Owned Enterprise (also state-owned entity): a legal person that is government-linked; some AML programmes apply enhanced due diligence to SOE relationships—distinct from PEP (natural persons). Product does not determine SOE status or screen lists; may locate PII and onboarding artefacts in the data soup. See also: PEP, KYC; COMPLIANCE_AND_LEGAL.md.
SOC 2 System and Organization Controls 2 (AICPA; formerly Service Organization Control): an attestation report on controls at a service organization, usually mapped to Trust Services Criteria (security; optional: availability, processing integrity, confidentiality, privacy). Type I assesses control design at a point in time; Type II assesses operating effectiveness over a period. Data Boar does not perform SOC 2 examinations or issue reports; discovery and mapping of personal/sensitive data locations plus metadata-only outputs support control design and audit preparation for systems you put in scope. See Auditable and management standards in COMPLIANCE_FRAMEWORKS.md.
SOX Sarbanes-Oxley Act of 2002 (US): internal control over financial reporting; ITGC and evidence on systems touching financial data. Not financial audit; product maps sensitive fields for governance evidence. SEC — SOX.
subprocessor Processor engaged by another processor (GDPR Art. 28(2) and 28(4)); subprocessors must be contractually bound to the same data-protection obligations. LGPD: suboperador in operator chains. Product does not manage vendor papers; maps where data appears for due diligence. GDPR Art. 28.
TIA Transfer Impact Assessment for cross-border transfers (often with SCCs). Product supports mapping evidence; not the legal conclusion. EDPB guidance: edpb.europa.eu.
VBA (value-based agreement) In pharma, life sciences, and payer contexts (often US), VBA commonly means value-based agreement or value-based contract: a commercial arrangement where price, reimbursement, or access depends on agreed outcomes, utilisation, or evidence—not only list price. Not a Data Boar product feature: the engine does not model contracts or compute outcome metrics; it may still support governance by mapping where related personal or health data appear in the data soup. Disambiguation: do not confuse with VBA = Visual Basic for Applications (Office macros)—ubiquitous in IT; check context. See also: HCP, PHI, Stark Law.
Corporate-Entity-C / WRB Internal maintainer vocabulary for the external review partner and its WRB (Corporate-Entity-C Review Briefing) email cycles—paste blocks and stable in-repo paths in Corporate-Entity-C_IN_REPO_BASELINE.md. Not a shipped product module or public API identifier. See also: Corporate-Entity-C_REVIEW_REQUEST_GUIDELINE.md.
Google Gemini (maintainer tooling) Google consumer/pro chat products used outside the sensitivity engine for bounded drafts, long transcripts, and screenshot-first bundle review—same process role as Corporate-Entity-C/WRB (evidence discipline, not a runtime dependency of customer scans). Treat outputs as unverified until reconciled with repo truth; never substitute for deterministic findings. See also: Corporate-Entity-C / WRB; GEMINI_PUBLIC_BUNDLE_REVIEW.md; LLM_AGENT_EDITING_CAUTION.md; OPERATOR_SESSION_CAPTURE_GUIDE.md.
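
The metadata-only finding row above records where a pattern matched and which pattern it was, without the matched value. A minimal sketch of that shape follows, assuming a simplified email regex and a hypothetical `Finding` record. The field names, the `scan_column` helper, and the sample target name are illustrative and do not reflect the product's report schema.

```python
import re
from dataclasses import dataclass

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Finding:
    """Hypothetical metadata-only record: location and pattern, no raw value."""

    target: str
    column: str
    pattern_name: str
    match_count: int


def scan_column(target: str, column: str, samples: list[str]) -> list[Finding]:
    """Count pattern hits in sampled cells and report metadata only."""
    hits = sum(1 for cell in samples if EMAIL.search(cell))
    if hits == 0:
        return []
    return [Finding(target, column, "EMAIL", hits)]


samples = ["ana@example.com", "n/a", "contact: joao@example.org"]
findings = scan_column("crm_db.customers", "contact", samples)
```

The record says that the `contact` column held two email-like values in the sample, which is enough to direct review, while the addresses themselves never reach the report.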

10. ISO/IEC management-system standards (incl. ABNT NBR in Brazil)

Official titles below follow ISO/IEC wording; Brazil adopts the same technical content as ABNT NBR ISO/IEC … (purchase via ABNT or ISO). Data Boar supports technical evidence (inventory, metadata-only findings, categories in the data soup); it does not certify, audit, or replace an accredited assessment.

Term Definition
ISO/IEC 27001 Information security, cybersecurity and privacy protection — Information security management systems — Requirements. Foundation for an ISMS. Certification and scope are organisational; scans help demonstrate where personal and sensitive data appear for asset/process inventory and risk treatment. Brazil: typically ABNT NBR ISO/IEC 27001. Further detail: ISO catalogue.
ISO/IEC 27005 Guidance on managing information security risks (successor to earlier editions titled Information security risk management). Structures activities for identifying, analysing, and treating information-security risk (iterative guidance—not a checklist tool). Do not confuse with ISO/IEC 27701 (privacy extension to 27001/27002). Brazil: ABNT NBR ISO/IEC 27005. Data Boar supplies discovery evidence for risks involving personal-data holdings. Further detail: ISO catalogue.
ISO/IEC 27701 Security techniques — Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management — Requirements and guidelines (PIMS). Controllers/processors map privacy controls and declared regulations (e.g. LGPD, GDPR) to the standard’s annexes. Brazil: ABNT NBR ISO/IEC 27701. Product alignment: COMPLIANCE_FRAMEWORKS.md (ISO/IEC 27701). Further detail: ISO catalogue.

11. Licensing, open core & subscriptions

Policy drafts and technical hooks—not a substitute for counsel; see linked docs.

Term Definition
JWT JSON Web Token: compact signed payload (often Authorization: Bearer) carrying claims (identity, expiry, scope). Draft commercial licensing may use signed JWTs for subscription entitlements—see LICENSING_SPEC.md. Unrelated to dashboard JSON responses unless you configure external OAuth2 / IdP flows. See also: subscription level, open core.
open core Business and licensing pattern: a core product (scanner, dashboard, baseline connectors) stays open source under a permissive or copyleft OSS license (today BSD 3-Clause in LICENSE; future options discussed in policy), while commercial or source-available add-ons may require a subscription and signed license token. Overview: LICENSING_OPEN_CORE_AND_COMMERCIAL.md; mechanics: LICENSING_SPEC.md.
OSS Open-source software: software released under a license that grants use, modification, and redistribution under stated conditions; OSI-approved licenses are a common benchmark for the open core baseline. Different sense in detector docs: “OSS Markdown” describes typical README/contributing-style filenames for public projects when tuning filesystem sensitivity—see SENSITIVITY_DETECTION.md.
subscription level Informal term for the commercial tier or entitlement associated with a paying customer or partner (e.g. trial, partner, enterprise—names and SKUs TBD). Draft policy ties production use of certain source-available modules to a valid subscription and signed license JWT; future claims may encode tier in the token—see LICENSING_OPEN_CORE_AND_COMMERCIAL.md (“Future product tiers”), LICENSING_SPEC.md. See also: open core.
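
The JWT row above describes a compact signed payload carrying claims. A minimal sketch of how such a token is built and verified follows, using HS256 (HMAC-SHA256) with a shared secret purely for illustration. The claim names `tier` and `exp`, the secret, and the choice of HS256 are assumptions; LICENSING_SPEC.md remains the reference for the draft licensing design.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode a base64url segment, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature using HMAC-SHA256 (HS256)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{b64url(mac.digest())}"


def verify(token: str, secret: bytes, now: int) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    try:
        header, payload, signature = token.split(".")
        presented = b64url_decode(signature)
    except ValueError:
        return None  # wrong number of segments or malformed base64
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    if not hmac.compare_digest(presented, mac.digest()):
        return None  # signature mismatch: wrong key or altered content
    claims = json.loads(b64url_decode(payload))
    if claims.get("exp", 0) <= now:
        return None  # expired, or no expiry claim present
    return claims


secret = b"demo-secret"  # hypothetical shared secret
token = sign({"tier": "enterprise", "exp": 2000}, secret)

valid = verify(token, secret, now=1000)
expired = verify(token, secret, now=3000)
wrong_key = verify(token, b"other-secret", now=1000)

# Swap in a payload with a later expiry but keep the old signature.
parts = token.split(".")
forged_payload = b64url(json.dumps({"tier": "enterprise", "exp": 9999}).encode())
forged = verify(f"{parts[0]}.{forged_payload}.{parts[2]}", secret, now=1000)
```

A production design would more likely use an asymmetric algorithm so that the verifier holds only a public key, and would rely on a maintained JWT library rather than hand-rolled parsing.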

12. Data governance (DMBOK & data lifecycle)

Data-management and data-governance vocabulary (DAMA-DMBOK 2nd ed.; ISO/IEC 25012 data-quality model). Audience: DPOs, data stewards, and governance leads positioning Data Boar's discovery output within a broader data-governance programme.

Term Definition
data Facts and representations (values, records, signals) that become information in context—the raw material Data Boar discovers and classifies; PII and sensitive data are subsets. Frame: DAMA-DMBOK.
data management The DAMA-DMBOK discipline of planning, controlling, and delivering data assets across the lifecycle (architecture, modelling, storage, security, quality, metadata). Data Boar supports the security and quality functions via discovery and findings. See also: data governance, DMBOK.
data governance The exercise of authority, decision rights, and accountability over data assets (policy, standards, stewardship)—the "who decides and is accountable" layer at the centre of the DMBOK wheel, above data management. See also: Data Steward, Data Owner.
GGD Brazilian Portuguese umbrella term Gestão e Governança de Dados ("data management and governance"): the combined programme pairing data-management execution with data-governance oversight.
Data Steward Role accountable for day-to-day data quality, definitions, and proper use within a data domain—the operational hands of data governance. See also: Data Owner.
Data Owner Accountable authority for a data asset or domain (classification, access policy, risk acceptance); distinct from the executing Data Steward.
data lineage Documented origin and transformations of data as it flows across systems (source → transformation → consumption); supports impact analysis and audit. Distinct from build provenance of software artefacts.
data quality Fitness of data for its intended use, measured by dimensions (accuracy, completeness, consistency, timeliness, validity, uniqueness); formalised in ISO/IEC 25012. Data Boar findings surface gaps that affect quality and risk.
DMBOK Data Management Body of Knowledge (DAMA International): the reference framework organising data-management functions around a governance core.
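
The data quality row above names measurable dimensions. As a rough sketch of how two of them could be expressed as ratios, the Python below computes completeness and uniqueness for one field over a list of records. The ratio definitions, the function names, and the sample `email` field are assumptions for illustration; they are not taken from ISO/IEC 25012 or from Data Boar's own code.

```python
def completeness(records, field):
    """Share of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def uniqueness(records, field):
    """Share of the non-empty values of `field` that are distinct."""
    values = [r[field] for r in records if r.get(field) not in (None, "")]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical sample: one empty value and one duplicate.
sample = [
    {"email": "a@example.org"},
    {"email": "a@example.org"},
    {"email": ""},
    {"email": "b@example.org"},
]
print(completeness(sample, "email"))  # 3 of 4 records are filled
print(uniqueness(sample, "email"))    # 2 distinct values among 3 filled
```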

See also: COMPLIANCE_FRAMEWORKS.md (regulations and extensibility), USAGE.md (config and API), SENSITIVITY_DETECTION.md (patterns and ML/DL), TECH_GUIDE.md (deterministic stack vs LLM hype), AI_EVOLUTION_PRIMER.md (AI history and method families—no hype), OBSERVABILITY_SRE.md (/health, /status). Full index: README.md · README.pt_BR.md.


13. Lab orchestration: Maestro, completão & homelab

Terms used in the Maestro lab orchestration layer and the completão validation ritual. This vocabulary is operator-facing (not product user-facing) and applies to the homelab environment used for integration testing and release confidence. Full reference: docs/ops/MAESTRO_ARCHITECTURE_AND_ROADMAP.md.

Musical metaphor hierarchy (from the Maestro design — narrative flavor, not formal taxonomy): Maestro (conductor) → handler (section lead, informally called Capo / capo di sezione in the metaphor) → Lab node (musician) → Persona (instrument) → lab-completao-host-smoke.sh (the music performed). The canonical technical term for Handle-*.ps1 scripts is always handler.

Term Definition
Audit Trail In the Maestro context: the JSONL event stream and log artefacts collected under docs/private/homelab/reports/ after each completão run — the durable, operator-readable evidence of every performance. Maps to the canonical Data Boar Audit Trail (immutable session and finding record); in lab orchestration the trail covers smoke-test outcomes, benchmark metrics, and host-environment telemetry.
bench track Isolation tag (stable or beta) applied to an A/B benchmark run. Each track gets its own ephemeral workdir on the lab host (/tmp/databoar_bench/<track>/) and a dedicated container port, preventing cross-contamination of metrics between the two versions under comparison. Set via --bench-track on lab-completao-host-smoke.sh or -BenchTrack on Maestro.ps1.
Capo From Italian: capo di sezione (section chief/lead) — used here as a musical metaphor for the handler role, not as a formal taxonomy term. In narrative docs, a handler may be described as a Capo to evoke the orchestra metaphor: each handler leads its instrument section without the Maestro needing to know implementation details. The canonical code term is handler. See handler, persona.
completão Brazilian Portuguese slang for "the full works" / "the whole thing done right." A completão run exercises all configured lab nodes, across all declared personas, against all available target surfaces — the lab equivalent of a full dress rehearsal before release. Contrasted with CI/pytest (controlled, synthetic, GitHub-hosted) and dev-PC gate (check-all). Session keyword: completao. Runbook: LAB_COMPLETAO_RUNBOOK.md.
handler A PowerShell script (scripts/maestro/handlers/Handle-<persona>.ps1) that implements the orchestration logic for one persona. Handlers are the Capos of the Maestro system: each receives context from Maestro.ps1 (node, ref, benchmark flags) and is fully responsible for how its section performs. Existing handlers: baremetal, docker, dockerswarm, podman, microk8s, lxd, web, target_postgres, target_mariadb, target_mongodb, target_nfs, target_sshfs, target_cifs. Roadmap: target_oracle, loadtest.
inventory The private JSON manifest (docs/private/homelab/data/inventory.json) that maps each lab node to its SSH credentials, repo path, and declared personas. This is the score the Maestro reads before dispatching. Never tracked in public Git (PII: real hostnames, IPs, user accounts).
lab-op The homelab environment used for Data Boar integration testing — a set of Linux hosts (bare-metal and VMs) reachable via SSH from the operator's Windows dev PC. Distinct from CI (GitHub-hosted runners) and from customer deployment environments.
Maestro The central, inventory-driven orchestrator (scripts/maestro/Maestro.ps1). Like an orchestral conductor, Maestro holds the bird's-eye view: it reads the inventory (the score), checks which nodes are reachable, builds and distributes artefacts, dispatches each Capo in sequence, and collects evidence. It does not know how Docker or PostgreSQL work; that is the Capo's domain.
persona The declared capability role of a lab node, listed in the inventory. A single node may have multiple personas (e.g. ["docker", "web", "target_postgres"]). Personas are the instruments in the metaphor: they define what the node plays, not who the node is. The Maestro dispatches the matching handler for each persona in order (container personas first, web last).
Safe-Hold In the Maestro context: the behaviour of halting orchestration and reporting clearly when a required precondition is missing — e.g. no inventory file found (hard exit 1), SSH DOWN on a node (skip with warning). Maps to the canonical Data Boar Safe-Hold (scan suspended due to missing or insufficient evidence).
sentinel file A completion signal written by lab-completao-host-smoke.sh at the end of a successful run (.completao_done_$RUN_ID). The -Collect phase polls for this file to synchronise artefact collection with the async tmux execution. This prevents the race condition where -Collect runs before the smoke script has finished writing metrics. See PLAN_MAESTRO_BENCHMARK_METRICS_AND_FIX.md Slice 2.
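
The sentinel file row above describes polling for a completion marker before collecting results. A minimal sketch of that polling pattern in Python follows; it is not the Maestro implementation, which is written in PowerShell and checks a remote lab host. The `workdir` parameter, the timeout and interval defaults, and the function name are hypothetical; only the `.completao_done_<RUN_ID>` file name comes from the glossary.

```python
import time
from pathlib import Path


def wait_for_sentinel(workdir, run_id, timeout_s=600.0, interval_s=5.0):
    """Poll until the sentinel exists; return False if the timeout expires."""
    sentinel = Path(workdir) / f".completao_done_{run_id}"
    deadline = time.monotonic() + timeout_s
    while True:
        if sentinel.exists():
            return True   # the run finished writing, collection is safe
        if time.monotonic() >= deadline:
            return False  # give up rather than collect partial metrics
        time.sleep(interval_s)
```

Returning `False` on timeout, instead of collecting anyway, keeps the caller from reading metrics that are still being written, which is the race the sentinel is meant to close.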