Add Kreuzberg Codex plugins #220

Open
Goldziher wants to merge 1 commit into hashgraph-online:main from Goldziher:codex/add-kreuzberg-plugins

Conversation

@Goldziher

Adds three related Kreuzberg plugins to Tools & Integrations:

  • Kreuzberg: local document extraction for 91+ formats.
  • Kreuzberg Cloud: managed extraction workflows with usage, job tracking, sandbox, and presigned-upload skills.
  • Kreuzcrawl: web crawling and scraping workflows for Codex.

This also updates the generator and PR validator to support multiple Codex plugin manifests from a single monorepo by matching README entries to the appropriate nested .codex-plugin/plugin.json root and validating nested bundle directories.
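
As a rough illustration of the monorepo matching described above, the sketch below finds every nested `.codex-plugin/plugin.json` root in a checkout and picks the one a README entry points at. It is not the code from this PR: `find_plugin_roots`, `match_plugin_root`, and the idea that a README entry carries a repository-relative subpath are all assumptions made for the example.

```python
from __future__ import annotations

from pathlib import Path, PurePosixPath
from typing import Optional


def find_plugin_roots(repo_dir: Path) -> list[PurePosixPath]:
    """Return each directory (relative to repo_dir) containing .codex-plugin/plugin.json."""
    roots = []
    for manifest in repo_dir.rglob(".codex-plugin/plugin.json"):
        # The plugin root is the parent of the .codex-plugin directory.
        plugin_root = manifest.parent.parent.relative_to(repo_dir)
        roots.append(PurePosixPath(plugin_root.as_posix()))
    return sorted(roots)


def match_plugin_root(
    entry_subpath: str, roots: list[PurePosixPath]
) -> Optional[PurePosixPath]:
    """Pick the root whose path equals the (hypothetical) subpath of a README entry."""
    # An empty subpath means the plugin lives at the repository root (".").
    wanted = PurePosixPath(entry_subpath.strip("/") or ".")
    for root in roots:
        if root == wanted:
            return root
    return None
```

With three manifests in one repository, `find_plugin_roots` would return three roots, and each README entry would resolve to exactly one of them or to `None` when no manifest matches.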

Validation run locally:

  • python3 scripts/check-alphabetical.py README.md
  • python3 scripts/validate-plugin-pr.py --base-ref origin/main
  • python3 -m py_compile scripts/generate_plugins_json.py scripts/validate-plugin-pr.py
  • git diff HEAD --check
  • uvx plugin-scanner scan plugins/kreuzberg-dev/plugins/plugins/kreuzberg --format text -> 91/100
  • uvx plugin-scanner scan plugins/kreuzberg-dev/plugins/plugins/kreuzberg-cloud --format text -> 91/100
  • uvx plugin-scanner scan plugins/kreuzberg-dev/plugins/plugins/kreuzcrawl --format text -> 91/100

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces three new plugins (kreuzberg, kreuzberg-cloud, and kreuzcrawl) along with their associated skills, assets, and comprehensive reference documentation. It also updates the marketplace metadata, reorders several plugins alphabetically, and enhances the generator and validator scripts to support monorepos containing multiple plugins. The code review feedback highlights a critical bug in the plugin generator script where sibling or nested plugins in a monorepo could delete each other's mirrored files during shutil.rmtree operations, and identifies a casing mismatch in a TypeScript configuration example within the advanced features documentation.

Comment on lines 355 to 360
    destination_root = PLUGINS_ROOT / plugin["owner"] / plugin["repo"]
    if use_subpath and plugin_root_relative:
        destination_root = destination_root / PurePosixPath(plugin_root_relative)
    # Clear destination to avoid stale files from previous runs (Thread 2 fix)
    if destination_root.exists():
        shutil.rmtree(destination_root)

critical

In a monorepo with multiple plugins, one plugin may sit at the repository root (relative path "") while another sits in a subdirectory. If the root plugin is processed after the subdirectory plugin, calling shutil.rmtree on the root plugin's destination_root deletes the subdirectory plugin's mirrored files. To keep sibling or nested plugins from deleting each other's files, clear the repository's base destination directory once per repository instead of clearing destination_root once per plugin.

    repo_root = PLUGINS_ROOT / plugin["owner"] / plugin["repo"]
    import threading
    global _cleared_repos, _cleared_repos_lock
    if "_cleared_repos" not in globals():
        _cleared_repos = set()
        _cleared_repos_lock = threading.Lock()
    owner_repo = f"{plugin['owner']}/{plugin['repo']}"
    with _cleared_repos_lock:
        if owner_repo not in _cleared_repos:
            if repo_root.exists():
                shutil.rmtree(repo_root)
            _cleared_repos.add(owner_repo)
    destination_root = repo_root
    if use_subpath and plugin_root_relative:
        destination_root = destination_root / PurePosixPath(plugin_root_relative)
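
For comparison, here is a self-contained sketch of the same "clear once per repository" idea, using a module-level set instead of `globals()`. It assumes the generator processes plugins sequentially in a single thread, so it omits the lock. `prepare_destination` and its parameters are hypothetical names, not part of the actual script.

```python
import shutil
from pathlib import Path, PurePosixPath

# "owner/repo" keys whose mirror directory was already cleared during this run.
_cleared_repos: set[str] = set()


def prepare_destination(
    plugins_root: Path, owner: str, repo: str, plugin_root_relative: str = ""
) -> Path:
    """Return the mirror directory for one plugin, clearing the repo mirror only once."""
    repo_root = plugins_root / owner / repo
    key = f"{owner}/{repo}"
    if key not in _cleared_repos:
        # First plugin seen for this repository: drop stale files from earlier runs.
        if repo_root.exists():
            shutil.rmtree(repo_root)
        _cleared_repos.add(key)
    if plugin_root_relative:
        destination = repo_root / PurePosixPath(plugin_root_relative)
    else:
        destination = repo_root
    destination.mkdir(parents=True, exist_ok=True)
    return destination
```

Because the deletion is keyed on the repository rather than on each plugin's destination, a root-level plugin processed after a nested one no longer removes the nested plugin's mirrored files.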

Comment on lines +903 to +907

    ```typescript
    import { ExtractionConfig, extractFile } from '@kreuzberg/node';

    const config: ExtractionConfig = {
      securityLimits: {

medium

In the TypeScript example for securityLimits, the keys are written in snake_case (e.g., max_file_size), but the Node.js/TypeScript SDK uses camelCase for configuration fields (e.g., maxFileSize). Update these keys to match the SDK's naming conventions.
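
A casing mismatch like this can be caught mechanically. The sketch below scans a documentation snippet for snake_case object keys and proposes camelCase replacements. It is a hypothetical helper, not part of the repository's validator, and its regular expression only handles simple `key: value` lines.

```python
import re

# Matches a lowercase snake_case key at the start of a line, e.g. "  max_file_size:".
SNAKE_KEY = re.compile(r"^\s*([a-z]+(?:_[a-z0-9]+)+)\s*:", re.MULTILINE)


def to_camel_case(name: str) -> str:
    """Convert a snake_case identifier to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def find_snake_case_keys(snippet: str) -> dict[str, str]:
    """Map each snake_case key found in the snippet to its camelCase form."""
    return {key: to_camel_case(key) for key in SNAKE_KEY.findall(snippet)}
```

For example, a snippet containing the line `max_file_size: 100_000_000,` would be reported with the suggestion `maxFileSize`, while keys already in camelCase are ignored.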

Suggested change
    ```typescript
    import { ExtractionConfig, extractFile } from '@kreuzberg/node';
    const config: ExtractionConfig = {
      securityLimits: {
        maxFileSize: 100_000_000, // 100 MB
        maxArchiveFiles: 1000,
        maxTextLength: 10_000_000, // 10 MB of text
        maxPages: 10000,
        maxConcurrentExtractions: 4
