Merged
3 changes: 3 additions & 0 deletions ARCHITECTURE.md
@@ -103,6 +103,7 @@ Each provider module generates a specific type of fake data:
| company.rs | company, job, catch_phrase | Business data |
| network.rs | url, domain_name, ipv4, ipv6, mac_address | Network identifiers |
| finance.rs | credit_card, iban | Financial identifiers with valid checksums |
| packages.rs | commit_sha, semver, calver, spdx_license, git_username, pypi/npm/cargo/gem/maven package names, version constraints, maven_coordinate, pypi_requirement | Package-registry data for PyPI, npm, Maven, Cargo, RubyGems |
| records.rs | records | Structured data from schema DSL (Rust-only, not yet exposed to Python) |

All providers follow the same pattern:
@@ -125,6 +126,8 @@ Static data organized by locale, embedded at compile time as `&'static [&str]`:
- `countries.rs`: ~200 countries
- `companies.rs`: Company name components
- `tlds.rs`: ~20 top-level domains
- `spdx_licenses.rs`: 50 common SPDX license identifiers
- `packages.rs`: package-name keywords, modifiers, Maven/npm scope components, pre-release tags, Maven qualifiers

Each data file includes tests for uniqueness and non-empty values.
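
The uniqueness and non-empty checks mentioned above can be sketched as follows. This is a minimal Python illustration of the idea, not the crate's Rust tests; `check_static_table` and `SAMPLE_LICENSES` are hypothetical names, and the sample table is not the project's real data.

```python
from collections import Counter


def check_static_table(name, values):
    """Assert that a static data table is non-empty, has no blank entries,
    and contains no duplicates. Returns the number of entries."""
    assert values, f"{name}: table is empty"
    blank = [i for i, v in enumerate(values) if not v.strip()]
    assert not blank, f"{name}: blank entries at indexes {blank}"
    dupes = sorted(v for v, count in Counter(values).items() if count > 1)
    assert not dupes, f"{name}: duplicate entries {dupes}"
    return len(values)


# Hypothetical sample table standing in for an embedded data file.
SAMPLE_LICENSES = ("MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only")

print(check_static_table("spdx_licenses", SAMPLE_LICENSES))  # 4
```

Checking for duplicates at test time matters for `unique=True` generation, where a repeated entry would make the usable pool smaller than its length suggests.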

41 changes: 40 additions & 1 deletion CHANGELOG.md
@@ -7,8 +7,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.4.0] - 2026-04-17

### Added

- **Package Registry Providers**: Cross-ecosystem fake data for seeding PyPI,
npm, Maven, Cargo, and RubyGems test databases. 22 method pairs
(44 Python-visible methods).
- Cross-ecosystem primitives: `commit_sha()` / `short_commit_sha()`,
`semver()` / `semver_prerelease()`, `calver()`, `spdx_license()` (50
common IDs), `git_username()` (enforces GitHub's rules: alphanumerics
and single hyphens, no leading/trailing hyphen, no consecutive hyphens,
≤ 39 chars).
- Ecosystem-specific versions: `pypi_version()` (PEP 440 — includes
pre/post/dev releases), `maven_version()` (with qualifiers like
`-SNAPSHOT`, `.RELEASE`, `.Final`, `-RC1`).
- Version constraints: `pypi_version_specifier()` (PEP 440),
`npm_version_range()`, `cargo_version_req()`, `maven_version_range()`,
`gem_version_requirement()`.
- Package identity: `pypi_package_name()` (PEP 503 normalised: lowercase
`[a-z0-9-]`, hyphen as the sole separator),
`npm_package_name()` (plain or `@scope/pkg`), `cargo_package_name()`,
`gem_name()`, `maven_group_id()` (reverse domain),
`maven_artifact_id()`, `maven_coordinate()` (GAV form
`group:artifact:version`).
- Full requirement line: `pypi_requirement()` (e.g.,
`requests>=2.0.0,<3.0.0`).
- All batch methods support parallel generation via `set_parallel()`.
- **Parallel Generation**: Opt-in multi-threaded batch generation via Rayon
- `set_parallel(enabled, num_threads=None)`: Enable/disable parallel mode
- `get_parallel()` / `get_num_threads()`: Query current parallel settings
@@ -18,6 +43,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- ~3.3x speedup at 100K+ items (names: 83ms -> 25ms for 1M items)
- `unique=True` always uses sequential path (requires shared state)
- Criterion benchmarks for parallel vs sequential comparison
- **Streaming file writer**: `records_to_file(path, n, schema, ...)` generates
records in chunks and writes each chunk to disk, keeping peak memory bounded
by `chunk_size` regardless of `n`. Supports CSV, NDJSON, SQL, and Parquet
with auto-detection from the file extension. Includes an optional progress
callback and an `estimate_memory()` utility.
- **Serialized output formats** for `records()` — serialised directly in Rust,
avoiding the cost of materialising Python objects before serialising:
- `records_csv()` — RFC 4180 CSV with header row
- `records_json()` — JSON array with proper scalar types
- `records_ndjson()` — newline-delimited JSON
- `records_parquet()` — Parquet bytes via the Arrow path
- `records_sql()` — ANSI SQL `INSERT`s, batched at 1000 rows,
with identifier quoting
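
The chunked-writing idea behind the streaming file writer can be sketched in plain Python. This is an illustration under stated assumptions, not forgery's implementation: `write_ndjson_chunked`, `make_record`, `on_progress`, and `demo_record` are hypothetical names, and only NDJSON output is shown.

```python
import json
import os
import tempfile


def write_ndjson_chunked(path, n, make_record, chunk_size=1000, on_progress=None):
    """Write n records as NDJSON, holding at most chunk_size records in memory."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        while written < n:
            size = min(chunk_size, n - written)
            # Only this chunk is materialised; it is dropped after the write.
            chunk = [make_record(written + i) for i in range(size)]
            fh.writelines(json.dumps(rec) + "\n" for rec in chunk)
            written += size
            if on_progress is not None:
                on_progress(written, n)
    return written


def demo_record(i):
    # Hypothetical stand-in for a schema-driven record generator.
    return {"id": i, "name": f"pkg-{i}", "version": f"1.0.{i % 10}"}


progress = []
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "packages.ndjson")
    total = write_ndjson_chunked(
        out, 2500, demo_record, chunk_size=1000,
        on_progress=lambda done, n: progress.append(done),
    )
    with open(out, encoding="utf-8") as fh:
        line_count = sum(1 for _ in fh)

print(total, line_count, progress)  # 2500 2500 [1000, 2000, 2500]
```

Because each chunk is written and discarded before the next one is built, peak memory depends on `chunk_size` rather than on `n`, which is the property the changelog entry describes.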

## [0.3.0] - 2026-03-17

@@ -160,6 +198,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- SonarCloud integration for code quality
- CodeQL static analysis

-[Unreleased]: https://github.com/williajm/forgery/compare/v0.3.0...HEAD
+[Unreleased]: https://github.com/williajm/forgery/compare/v0.4.0...HEAD
[0.4.0]: https://github.com/williajm/forgery/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/williajm/forgery/compare/v0.2.0...v0.3.0
[0.1.0]: https://github.com/williajm/forgery/releases/tag/v0.1.0
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "forgery"
-version = "0.3.0"
+version = "0.4.0"
edition = "2021"
description = "Fake data at the speed of Rust"
license = "MIT"
81 changes: 81 additions & 0 deletions README.md
@@ -412,6 +412,87 @@ License plate formats by locale:
| `it_IT` | AB 123 CD | `"FG 482 HJ"` |
| `ja_JP` | 300 12-34 | `"500 38-47"` |

### Package Registry Data

For seeding test databases of package registries (PyPI, npm, Maven, Cargo, RubyGems).
Cross-ecosystem primitives share one API; ecosystem-specific shapes have their own
methods.

**Cross-ecosystem primitives**

| Batch | Single | Description |
|-------|--------|-------------|
| `commit_shas(n)` | `commit_sha()` | 40-hex-char git commit SHA |
| `short_commit_shas(n)` | `short_commit_sha()` | 7-hex-char short SHA |
| `semvers(n)` | `semver()` | SemVer `MAJOR.MINOR.PATCH` |
| `semver_prereleases(n)` | `semver_prerelease()` | Pre-release (e.g. `1.2.3-alpha.1+build.5`) |
| `calvers(n)` | `calver()` | CalVer in mixed schemes (`YYYY.MM.DD`, `YY.MM`, ...) |
| `spdx_licenses(n)` | `spdx_license()` | SPDX identifier (50 common IDs) |
| `git_usernames(n)` | `git_username()` | GitHub/GitLab/Bitbucket-compatible username |
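
The username rules that the changelog lists for `git_username()` (alphanumerics and single hyphens, no leading or trailing hyphen, no consecutive hyphens, at most 39 characters) can be checked with a short validator. This is a sketch for test assertions; `is_valid_git_username` is a hypothetical helper and not part of forgery's API.

```python
import re

# One or more alphanumeric runs joined by single hyphens. This rules out
# leading, trailing, and consecutive hyphens in one pattern.
_USERNAME_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_valid_git_username(name):
    """Return True if name satisfies the rules described in the changelog."""
    return len(name) <= 39 and _USERNAME_RE.fullmatch(name) is not None


print(is_valid_git_username("tiny-logger42"))  # True
print(is_valid_git_username("-leading"))       # False
print(is_valid_git_username("double--hyphen")) # False
print(is_valid_git_username("a" * 40))         # False
```

A validator like this is handy in a test suite that asserts every generated username would be accepted by the target system.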

**Ecosystem-specific versions** (where SemVer alone doesn't cover the format)

| Batch | Single | Description |
|-------|--------|-------------|
| `pypi_versions(n)` | `pypi_version()` | PEP 440 (pre/post/dev releases) |
| `maven_versions(n)` | `maven_version()` | Maven version with qualifiers (`-SNAPSHOT`, `.RELEASE`, ...) |

**Version constraints**

| Batch | Single | Description |
|-------|--------|-------------|
| `pypi_version_specifiers(n)` | `pypi_version_specifier()` | PEP 440 (e.g. `>=1.2,<2.0`, `~=1.0`) |
| `npm_version_ranges(n)` | `npm_version_range()` | npm (e.g. `^1.2.3`, `~1.2.3`, `1.x`) |
| `cargo_version_reqs(n)` | `cargo_version_req()` | Cargo (e.g. `^1.0`, `~1.2`) |
| `maven_version_ranges(n)` | `maven_version_range()` | Maven (e.g. `[1.0,2.0)`) |
| `gem_version_requirements(n)` | `gem_version_requirement()` | RubyGems (e.g. `~> 1.2`) |
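
To show what a Maven range such as `[1.0,2.0)` means, the sketch below parses a single two-bound range and tests membership. It assumes purely numeric dotted versions and ignores qualifiers, single-version pins, and multi-range sets; `parse_maven_range`, `version_key`, and `in_maven_range` are hypothetical helpers, not forgery functions.

```python
def parse_maven_range(spec):
    """Parse '[1.0,2.0)' into (lower, lower_inclusive, upper, upper_inclusive).

    A missing bound, as in '(,1.0]', is returned as None.
    """
    spec = spec.strip()
    if len(spec) < 3 or spec[0] not in "[(" or spec[-1] not in "])":
        raise ValueError(f"not a bracketed range: {spec!r}")
    lower, sep, upper = spec[1:-1].partition(",")
    if not sep:
        raise ValueError(f"single-version pins are not handled: {spec!r}")
    return (lower.strip() or None, spec[0] == "[",
            upper.strip() or None, spec[-1] == "]")


def version_key(version, width=4):
    """Turn '1.2' into (1, 2, 0, 0) so that '1.0' and '1.0.0' compare equal."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def in_maven_range(version, spec):
    lower, lower_inclusive, upper, upper_inclusive = parse_maven_range(spec)
    v = version_key(version)
    if lower is not None:
        lo = version_key(lower)
        if v < lo or (v == lo and not lower_inclusive):
            return False
    if upper is not None:
        hi = version_key(upper)
        if v > hi or (v == hi and not upper_inclusive):
            return False
    return True


print(in_maven_range("1.5.0", "[1.0,2.0)"))  # True
print(in_maven_range("2.0", "[1.0,2.0)"))    # False
```

Square brackets include the bound and parentheses exclude it, so `[1.0,2.0)` accepts `1.0` but rejects `2.0`.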

**Package identity**

| Batch | Single | Description |
|-------|--------|-------------|
| `pypi_package_names(n)` | `pypi_package_name()` | PEP 503 normalised (lowercase `[a-z0-9-]`) |
| `npm_package_names(n)` | `npm_package_name()` | Plain or `@scope/pkg` (~30% scoped) |
| `cargo_package_names(n)` | `cargo_package_name()` | Rust-ident flavour |
| `gem_names(n)` | `gem_name()` | RubyGems gem name |
| `maven_group_ids(n)` | `maven_group_id()` | Reverse domain (e.g. `com.example.tools`) |
| `maven_artifact_ids(n)` | `maven_artifact_id()` | Lowercase with hyphens |
| `maven_coordinates(n)` | `maven_coordinate()` | GAV (`group:artifact:version`) |
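
Two of the shapes in this table are easy to verify in a test. The sketch below applies the name normalisation given in PEP 503 and splits a GAV coordinate into its three parts; `normalize_pypi_name` and `split_gav` are hypothetical helpers written for illustration, not forgery functions.

```python
import re


def normalize_pypi_name(name):
    """Apply PEP 503 normalisation: runs of '-', '_' and '.' become one '-',
    and the result is lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_gav(coordinate):
    """Split 'group:artifact:version' into a 3-tuple, rejecting other shapes."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected group:artifact:version, got {coordinate!r}")
    return tuple(parts)


print(normalize_pypi_name("Friendly_Bard.Tools"))
# friendly-bard-tools
print(split_gav("com.example.tools:widget-core:1.2.3-SNAPSHOT"))
# ('com.example.tools', 'widget-core', '1.2.3-SNAPSHOT')
```

A name that is already normalised is unchanged by `normalize_pypi_name`, so `normalize_pypi_name(name) == name` is a simple assertion to run against generated PyPI names.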

**Full requirement lines**

| Batch | Single | Description |
|-------|--------|-------------|
| `pypi_requirements(n)` | `pypi_requirement()` | e.g. `requests>=2.0.0,<3.0.0` |

```python
from forgery import Faker

fake = Faker()
fake.seed(42)
fake.pypi_requirement() # 'requests>=2.0.0,<3.0.0'
fake.maven_coordinate() # 'com.example.tools:widget-core:1.2.3-SNAPSHOT'
fake.npm_package_name() # '@types/fast-parser'
fake.spdx_license() # 'Apache-2.0'
fake.git_username() # 'tiny-logger42'
fake.commit_sha() # 'a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2'
```

The nine batch methods below accept `unique=True` for no-duplicate output,
matching the `names(n, unique=True)` pattern — useful when seeding registry
tables that have a unique-name constraint. Exhausting the combinatorial pool
raises `ValueError`:

```python
fake.pypi_package_names(100, unique=True) # 100 distinct package names
fake.maven_coordinates(500, unique=True) # 500 distinct GAVs
fake.spdx_licenses(60, unique=True) # ValueError: only 50 SPDX IDs available
```

Methods with `unique` support: `pypi_package_names`, `npm_package_names`,
`cargo_package_names`, `gem_names`, `maven_group_ids`, `maven_artifact_ids`,
`maven_coordinates`, `git_usernames`, `spdx_licenses`.
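
The pool-exhaustion behaviour can be illustrated with a small stand-in. This is a sketch of the general technique (sampling without replacement from a finite pool), not forgery's implementation; `unique_batch` and `SAMPLE_POOL` are hypothetical names.

```python
import random


def unique_batch(pool, n, seed=None):
    """Draw n distinct values from pool, or raise ValueError if the pool is too small."""
    if n > len(pool):
        raise ValueError(
            f"requested {n} unique values but only {len(pool)} are available"
        )
    # random.sample draws without replacement, so the result has no duplicates
    # as long as the pool itself has none.
    return random.Random(seed).sample(pool, n)


# Hypothetical pool standing in for a finite data table.
SAMPLE_POOL = ["MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only", "ISC"]

batch = unique_batch(SAMPLE_POOL, 3, seed=42)
print(len(batch), len(set(batch)))  # 3 3

try:
    unique_batch(SAMPLE_POOL, 6)
except ValueError as exc:
    print(exc)  # requested 6 unique values but only 5 are available
```

Failing loudly when the pool is exhausted is preferable to silently returning fewer rows, since a short batch would otherwise surface later as a confusing row-count mismatch in the seeded table.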

### Profile

| Batch | Single | Description |
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "maturin"

[project]
name = "forgery"
-version = "0.3.0"
+version = "0.4.0"
description = "Fake data at the speed of Rust"
readme = "README.md"
license = { text = "MIT" }