LLM Compact Serializer

A dynamic, schema-driven Python library designed to drastically reduce LLM token usage by compressing complex JSON objects into a strict, hierarchical format.

🚀 Why Use This?

When building LLM applications (using GPT-4, Gemini, Claude, etc.), passing large JSON arrays in the prompt consumes many tokens, because every object repeats the same keys, quotes, and whitespace.

Standard JSON (Heavy):

[
  {"product": "Apple", "price": "1.20", "category": "Fruit"},
  {"product": "Banana", "price": "0.80", "category": "Fruit"}
]

Compact Protocol (Efficient):

|{Apple, 1.20, Fruit}|;|{Banana, 0.80, Fruit}|

Impact:

Reduces token count from ~55 tokens (JSON) to ~18 tokens (Compact) for this example.
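
The comparison above can be reproduced with a short standalone sketch. It does not use the library's API: `to_compact` is a hypothetical helper written only for this illustration, and character count is a rough proxy for tokens, since real token counts depend on the model's tokenizer.

```python
import json

rows = [
    {"product": "Apple", "price": "1.20", "category": "Fruit"},
    {"product": "Banana", "price": "0.80", "category": "Fruit"},
]

def to_compact(rows, field_order):
    """Hypothetical helper: emit values positionally and drop the keys."""
    objects = (
        "|{" + ", ".join(str(row[field]) for field in field_order) + "}|"
        for row in rows
    )
    return ";".join(objects)

json_text = json.dumps(rows, indent=2)
compact_text = to_compact(rows, ["product", "price", "category"])

print(compact_text)
print(len(json_text), "characters as JSON vs", len(compact_text), "as compact")
```

The keys appear once in the schema instead of once per object, so the saving grows with the number of rows.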

Key Benefits

Token Efficiency: Reduces input payload size by 40-60% for repetitive data structures.

Schema Enforcement: Generates explicit instructions that tell the LLM which field each position holds, which helps reduce misread or hallucinated values.

Dynamic: Works with any Python object or dictionary; just define the schema at runtime.

Recursive: Supports deeply nested objects and lists via the |{...}| syntax.

📦 Installation

Clone the repository

git clone https://github.com/Ryujose/llm-compact-serializer.git

Install dependencies using Poetry

poetry install

⚡ Quick Start

This example demonstrates how to serialize a product list with nested destination data.

  1. Define Your Schema: Tell the serializer what your data looks like. Field order matters, because values are emitted positionally.
from llm_compact_serializer.domain.schema import CompactSchema, FieldConfig
from llm_compact_serializer.core.prompt_builder import PromptBuilder

# Define a nested schema for complex objects
destination_schema = CompactSchema(
    name="Destination",
    fields=[
        FieldConfig(source_name="address"),
        FieldConfig(source_name="phones", is_list=True) # Handles arrays [x, y]
    ]
)

# Define the root schema
product_schema = CompactSchema(
    name="Product",
    fields=[
        FieldConfig(source_name="name"),
        FieldConfig(source_name="price"),
        FieldConfig(source_name="destination", nested_schema=destination_schema)
    ]
)
  2. Prepare Your Data: You can use dictionaries, Pydantic models, or dataclasses.
data = [
    {
        "name": "MacBook Pro", 
        "price": "1200€", 
        "destination": {
            "address": "Silicon Valley, CA",
            "phones": [5550199, 5550200]
        }
    }
]
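
Supporting dictionaries, Pydantic models, and dataclasses implies reading a field by name regardless of the container type. A minimal sketch of one way to do that follows; `read_field` is a hypothetical accessor for illustration, not a function exposed by this library.

```python
from dataclasses import dataclass

@dataclass
class Destination:
    address: str
    phones: list

def read_field(obj, name):
    """Hypothetical accessor: key lookup for dicts, attribute lookup otherwise."""
    if isinstance(obj, dict):
        return obj.get(name)
    # Dataclasses and Pydantic models both expose fields as attributes.
    return getattr(obj, name, None)

as_dict = {"address": "Silicon Valley, CA", "phones": [5550199, 5550200]}
as_dataclass = Destination(address="Silicon Valley, CA", phones=[5550199, 5550200])

for source in (as_dict, as_dataclass):
    print(read_field(source, "address"), read_field(source, "phones"))
```

Returning `None` for a missing field lets a serializer treat absent and empty values the same way.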
  3. Generate the Prompt: The PromptBuilder automatically generates the protocol instructions and injects your compressed data.
builder = PromptBuilder(product_schema)
base_prompt = "Analyze the following orders: [INPUT]"

final_prompt = builder.build(base_prompt, data, data_marker="[INPUT]")
print(final_prompt)
  4. Output (What the LLM Sees)
[COMPACT_HIERARCHICAL_PROTOCOL]
[INSTRUCTIONS]
1. Interpret input strictly as a Recursive Compact Hierarchy.
2. Syntax: Complex objects enclosed in |{ }|, separated by comma.
3. Structure Mapping:
# 1 = Product (Root)
# 1a = name
# 1b = price
# 1.1 = destination
# 1.1a = address
# 1.1b* = phones
[END_PROTOCOL]

Analyze the following orders: |{MacBook Pro, 1200€, |{Silicon Valley, CA, [5550199, 5550200]}|}|
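
The "Structure Mapping" labels in the output above follow a visible pattern: letters for plain fields, dotted numbers for nested objects, and `*` for list fields. The sketch below reproduces those labels from a plain-dict stand-in for the schema. It is an inference from this single example, not the library's implementation: the dict layout, the `mapping_lines` function, and the choice to list nested objects after plain fields are all assumptions.

```python
from string import ascii_lowercase

# Plain-dict stand-in for the schema; NOT the library's CompactSchema.
schema = {
    "name": "Product",
    "fields": [
        {"source_name": "name"},
        {"source_name": "price"},
        {
            "source_name": "destination",
            "nested": {
                "name": "Destination",
                "fields": [
                    {"source_name": "address"},
                    {"source_name": "phones", "is_list": True},
                ],
            },
        },
    ],
}

def mapping_lines(schema, index="1", label=None):
    """Hypothetical: build the '# 1a = name' style lines for a schema."""
    # The root is titled by schema name; nested objects by their field name.
    title = f"{schema['name']} (Root)" if label is None else label
    lines = [f"# {index} = {title}"]
    letters = iter(ascii_lowercase)
    nested_fields = []
    for field in schema["fields"]:
        if "nested" in field:
            nested_fields.append(field)  # assumed: numbered after plain fields
            continue
        star = "*" if field.get("is_list") else ""
        lines.append(f"# {index}{next(letters)}{star} = {field['source_name']}")
    for n, field in enumerate(nested_fields, start=1):
        lines += mapping_lines(field["nested"], f"{index}.{n}", field["source_name"])
    return lines

print("\n".join(mapping_lines(schema)))
```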

🏗 Architecture

The project follows Clean Architecture principles to ensure modularity and ease of testing.

llm-compact-serializer/
├── .github/
│   └── workflows/
│       ├── ci.yml              # CI/CD: Tests & Linting
│       └── publish.yml         # CD: Publish to PyPI
├── src/
│   └── llm_compact_serializer/
│       ├── __init__.py
│       ├── domain/             # Schema definitions (The "Rules")
│       │   ├── __init__.py
│       │   └── schema.py
│       └── core/               # The Engine (Generic Logic)
│           ├── __init__.py
│           └── serializer.py
├── tests/
│   └── ...                     # more tests
├── LICENSE                     # MIT
├── README.md
├── pyproject.toml              # Poetry Config
└── poetry.lock

The Protocol Rules

Object Wrapping: All objects are wrapped in |{ ... }|.

Separators: Fields are separated by a comma (,). Objects in a list are separated by a semicolon (;).

Recursion: A field can contain another object, creating a nested structure: |{ val1, |{ val2 }| }|.

Arrays: Simple lists are wrapped in [...].

Missing Data: None or empty values are automatically replaced with - to maintain positional integrity.

Sanitization: Commas found within data values are automatically replaced (e.g., Doe, John -> Doe John) to prevent parsing errors.
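
The rules above can be sketched as a small recursive encoder. This is a standalone illustration and not the library's code: the function names are hypothetical, and it serializes dictionaries in insertion order rather than from a `CompactSchema`.

```python
def sanitize(value):
    """Drop commas from a scalar so they are not mistaken for separators."""
    return str(value).replace(",", "")

def encode_value(value):
    # Missing data: None or empty values become "-" to keep positions stable.
    if value is None or value in ("", [], {}):
        return "-"
    # Recursion: a nested object is wrapped like any other object.
    if isinstance(value, dict):
        return encode_object(value)
    # Arrays: simple lists are wrapped in [...].
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(encode_value(item) for item in value) + "]"
    return sanitize(value)

def encode_object(obj):
    # Object wrapping: |{ ... }| with comma-separated fields.
    return "|{" + ", ".join(encode_value(v) for v in obj.values()) + "}|"

def encode_list(objects):
    # Objects in a top-level list are separated by ";".
    return ";".join(encode_object(obj) for obj in objects)

records = [
    {"name": "Doe, John", "email": None,
     "address": {"city": "Madrid", "tags": ["a", "b"]}},
    {"name": "Ann", "email": "ann@example.com",
     "address": {"city": "Oslo", "tags": []}},
]

print(encode_list(records))
# |{Doe John, -, |{Madrid, [a, b]}|}|;|{Ann, ann@example.com, |{Oslo, -}|}|
```

Replacing missing values with `-` rather than omitting them is what keeps each value aligned with its position in the schema.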

🧪 Testing

We use pytest for comprehensive testing, covering unit logic and end-to-end integration.

# Run all tests
poetry run pytest

# Run with coverage report
poetry run pytest --cov=src

Key Test Scenarios

test_serializer.py: Verifies primitive handling, recursive nesting logic, and sanitization (handling commas in data).

test_integration.py: Validates the full workflow (Schema -> Data -> Prompt) using complex real-world examples.

🤝 Contributing

  1. Fork the repository.

  2. Create a feature branch (git checkout -b feat/amazing-feature).

  3. Commit your changes (git commit -m 'feat: Add amazing feature').

  4. Push to the branch (git push origin feat/amazing-feature).

  5. Open a Pull Request.

📄 License

Distributed under the MIT License. See LICENSE for more information.
