paralabelgen is a Python library for generating discrete paragraph labels
from concept extraction, graph communities, and interpretable assignment rules.
- PyPI distribution: paralabelgen
- Python import package: labelgen
- Repository: https://github.com/HuRuilizhen/labelgen
pip install paralabelgen

If you want to use the default spaCy extractor, install a compatible English pipeline such as:

python -m spacy download en_core_web_sm

en_core_web_sm is the recommended default model, but you can point
spacy_model_name at another installed compatible spaCy pipeline.
from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]
generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)

from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"
generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)

LabelGeneratorConfig.extractor_mode supports three modes:
- spacy: default public extractor using spaCy noun chunks and entities
- heuristic: deterministic fallback extractor using rule-based spans
- llm: provider-backed concept extraction using a unified OpenAI-compatible chat-completions client
If extractor_mode is unset, the legacy use_nlp_extractor compatibility flag
is still respected. New code should prefer extractor_mode.
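
The precedence between the two settings can be sketched as follows. This is an illustration only, not labelgen's implementation: the helper name `resolve_extractor_mode` is hypothetical, and the mapping of the legacy flag (`True` to spacy, `False` to heuristic) is an assumption that the text above does not spell out.

```python
from typing import Optional

VALID_MODES = ("spacy", "heuristic", "llm")


def resolve_extractor_mode(
    extractor_mode: Optional[str],
    use_nlp_extractor: bool = True,
) -> str:
    """Pick the effective extractor mode (hypothetical helper)."""
    if extractor_mode is not None:
        # An explicit mode always wins over the legacy flag.
        if extractor_mode not in VALID_MODES:
            raise ValueError(f"unknown extractor_mode: {extractor_mode!r}")
        return extractor_mode
    # Assumption: the legacy flag only chooses between the two non-LLM extractors.
    return "spacy" if use_nlp_extractor else "heuristic"


print(resolve_extractor_mode(None, use_nlp_extractor=False))
print(resolve_extractor_mode("llm", use_nlp_extractor=False))
```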
The LLM extraction path is opt-in and synchronous. The current provider layer is unified around one OpenAI-compatible client and supports:
- openai
- mistral
- qwen
- ollama
- deepseek
Configure the provider and model under config.extraction.llm:
- provider
- model
- api_key_env_var
- base_url
- organization
- timeout_seconds
- max_retries
- temperature
- max_output_tokens
- batch_size
- max_concepts_per_paragraph
Set the corresponding API key in the expected environment variable:
- OPENAI_API_KEY
- MISTRAL_API_KEY
- DASHSCOPE_API_KEY
- DEEPSEEK_API_KEY
- OLLAMA_API_KEY for authenticated or proxied Ollama deployments
For local Ollama usage, the default base URL is:
http://localhost:11434/v1
Local Ollama runs do not require an API key by default. When provider="ollama",
the client also disables reasoning by default to preserve output budget for the
final JSON payload.
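
A quick pre-flight check before an LLM run might look like the sketch below. It uses only the standard library. The helper name `missing_api_key` is hypothetical, and the provider-to-variable pairing (in particular qwen with DASHSCOPE_API_KEY) is inferred from the variable names rather than confirmed here.

```python
import os
from typing import Mapping, Optional

# Assumption: pairing inferred from the provider and variable names listed above.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "qwen": "DASHSCOPE_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "ollama": "OLLAMA_API_KEY",
}


def missing_api_key(
    provider: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the name of the unset variable, or None if nothing is missing."""
    env = os.environ if environ is None else environ
    if provider == "ollama":
        # Local Ollama runs do not require a key by default.
        return None
    variable = API_KEY_ENV_VARS[provider]
    return None if env.get(variable) else variable


problem = missing_api_key("openai", environ={})
if problem is not None:
    print(f"set {problem} before running the LLM extractor")
```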
config.extraction.llm.output_contract_mode controls how aggressively the
provider client tries to enforce a structured response:
- auto: try stronger output contracts before falling back
- json_schema: require JSON-schema structured output
- json_object: require JSON-object mode
- prompt_only: rely only on prompt instructions
auto is the recommended default. For OpenAI-compatible providers, the client
tries:
- json_schema
- then json_object
- then prompt_only
and only falls back when the provider clearly rejects the stronger contract.
DeepSeek follows a narrower auto sequence based on the official API
documentation:
- json_object
- then prompt_only
The LLM extractor now prefers provider-enforced structured output when the configured endpoint supports OpenAI-compatible JSON schema response formatting.
- prompt guidance is still used, but it is no longer the only output contract
- structured output is enforced first when available
- if an OpenAI-compatible endpoint rejects a stronger contract, the client degrades to a weaker output contract on the same LLM path
- the extractor does not silently fall back to spacy or heuristic
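
The degradation order described above can be illustrated with a small standard-library sketch. It is not the library's client code: `ContractRejected`, `contract_sequence`, `request_with_fallback`, and the `send` callable are all hypothetical names standing in for whatever the provider layer actually does.

```python
from typing import Callable, List

FULL_SEQUENCE = ["json_schema", "json_object", "prompt_only"]


class ContractRejected(Exception):
    """Hypothetical error: the provider clearly refused the output contract."""


def contract_sequence(provider: str, mode: str = "auto") -> List[str]:
    """List the contracts to try, strongest first."""
    if mode not in ("auto", *FULL_SEQUENCE):
        raise ValueError(f"unknown output_contract_mode: {mode!r}")
    if mode != "auto":
        return [mode]  # an explicit mode is never degraded
    if provider == "deepseek":
        return ["json_object", "prompt_only"]  # narrower auto sequence
    return list(FULL_SEQUENCE)


def request_with_fallback(provider: str, mode: str, send: Callable[[str], str]) -> str:
    """Try each contract in order; stay on the LLM path the whole time."""
    last_error = None
    for contract in contract_sequence(provider, mode):
        try:
            return send(contract)
        except ContractRejected as error:
            last_error = error
    raise RuntimeError("every output contract was rejected") from last_error


attempts = []


def fake_send(contract: str) -> str:
    # Stand-in for a provider call that does not support JSON schema.
    attempts.append(contract)
    if contract == "json_schema":
        raise ContractRejected("json_schema response format unsupported")
    return '{"concepts": []}'


reply = request_with_fallback("openai", "auto", fake_send)
print(attempts)  # ['json_schema', 'json_object']
```

Note that the loop raises when every contract is rejected instead of switching extractors, which mirrors the last bullet above.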
For routine evaluation runs, prefer a conservative configuration:
- temperature = 0.0
- batch_size = 1 or a small batch size
- cache_enabled = True
- record_extraction_artifacts = False
For local Ollama models, batch_size = 1 is the safest default for benchmark
and smoke-test runs.
This keeps runs reproducible and avoids writing extra local artifacts unless you actually need them.
When you need to inspect provider behavior, you can enable artifacts:
- record_extraction_artifacts = True
- record_raw_response_text = True only when raw provider output is needed
- record_paragraph_text = True only when paragraph text is safe to store
- record_paragraph_metadata = True only when metadata is safe to store
Artifact recording is optional and should stay disabled by default for routine usage.
- cache_enabled=True stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
- cache invalidation includes both prompt_version and the effective prompt text
- artifacts are intended for local evaluation and debugging workflows, not as a default production feature
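
One common way to get this kind of invalidation is to hash everything that defines the request into the cache key. The sketch below shows the idea with the standard library. The function name `cache_key` and the exact set of hashed fields are assumptions, not the library's on-disk format.

```python
import hashlib
import json


def cache_key(prompt_version: str, prompt_text: str, model: str, paragraph: str) -> str:
    """Derive a stable key from everything that shapes the provider request."""
    payload = json.dumps(
        {
            "prompt_version": prompt_version,
            "prompt_text": prompt_text,
            "model": model,
            "paragraph": paragraph,
        },
        sort_keys=True,  # stable field order keeps the hash reproducible
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


old = cache_key("v1", "Extract concepts as JSON.", "gpt-5-mini", "Some paragraph.")
new = cache_key("v2", "Extract concepts as JSON.", "gpt-5-mini", "Some paragraph.")
print(old != new)  # True: bumping prompt_version yields a different key
```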
The repository includes a local benchmark harness for extractor comparisons:
- benchmark/run_benchmark.py
- benchmark/summarize_results.py
Benchmark inputs are local development assets and should live under
experiment/. The benchmark loader accepts:
- .jsonl
- .json

Each record must provide:
- text

and may optionally provide:
- id
Benchmark code is for development evaluation only and is excluded from release artifacts.
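
To check that an input file matches the record shape above before running the harness, a small validator along these lines may help. It is a standalone sketch, not the harness's loader: `load_records` is a hypothetical name, a .json file is assumed to hold a top-level array of records, and filling a missing id with the record's position is also an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_records(path: str) -> List[Dict[str, str]]:
    """Read .jsonl or .json records and require a string 'text' in each."""
    source = Path(path)
    raw = source.read_text(encoding="utf-8")
    if source.suffix == ".jsonl":
        items = [json.loads(line) for line in raw.splitlines() if line.strip()]
    elif source.suffix == ".json":
        items = json.loads(raw)  # assumption: a top-level JSON array
    else:
        raise ValueError(f"unsupported benchmark input: {source.suffix!r}")
    if not isinstance(items, list):
        raise ValueError("expected a list of records")

    records = []
    for index, item in enumerate(items):
        if not isinstance(item, dict) or not isinstance(item.get("text"), str):
            raise ValueError(f"record {index} has no string 'text' field")
        # Assumption: fall back to the position when 'id' is absent.
        records.append({"id": str(item.get("id", index)), "text": item["text"]})
    return records


with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.jsonl"
    sample.write_text('{"id": "a", "text": "first"}\n{"text": "second"}\n', encoding="utf-8")
    records = load_records(str(sample))
print(records)
```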
The current TechQA benchmark comparisons include:
- heuristic
- spacy
- llm:ollama
- llm:mistral
- llm:deepseek
For a small manual LLM-path verification outside the default test suite, run the example script with one provider/model pair and a valid API key in the expected environment variable:
OPENAI_API_KEY=... .venv/bin/python examples/llm_extraction.py

This is intended as a lightweight manual smoke test for provider connectivity and parsing, not as part of the default automated suite.
The main public entrypoints are:
- LabelGenerator
- LabelGeneratorConfig
- Paragraph, Concept, ConceptMention, Community, ParagraphLabels
- dump_result() and load_result()
Detailed API notes are available in docs/public_api.md.
Runnable examples are available in examples/:
- examples/basic_usage.py
- examples/custom_config.py
- examples/save_and_load.py
- examples/llm_extraction.py
- fit() learns concepts and communities from a corpus
- transform() applies previously learned communities to new paragraphs
- fit_transform() learns and labels the same input in one pass
- use_graph_community_detection=True uses Leiden community detection
- use_graph_community_detection=False uses deterministic connected components
- the default spaCy path requires the configured spaCy model to be installed
- the LLM path requires valid provider configuration and credentials
- local Ollama usage does not require credentials unless your deployment is explicitly authenticated
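
To give a feel for what "deterministic connected components" means in practice, here is a standalone sketch that groups concepts appearing in the same paragraph. It does not use labelgen. The co-occurrence edge rule and the function name `connected_components` are assumptions made for illustration, since the notes above do not say how the library builds its graph.

```python
from itertools import combinations
from typing import Dict, List, Set


def connected_components(paragraph_concepts: List[Set[str]]) -> List[List[str]]:
    """Group concepts that are linked through shared paragraphs."""
    # Assumption: two concepts are connected when they share a paragraph.
    neighbors: Dict[str, Set[str]] = {}
    for concepts in paragraph_concepts:
        for concept in concepts:
            neighbors.setdefault(concept, set())
        for left, right in combinations(sorted(concepts), 2):
            neighbors[left].add(right)
            neighbors[right].add(left)

    seen: Set[str] = set()
    components: List[List[str]] = []
    # Visiting nodes in sorted order makes the output independent of set ordering.
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for other in sorted(neighbors[node]):
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


groups = connected_components(
    [
        {"language models", "developers"},
        {"language models", "production systems"},
        {"monitoring"},
    ]
)
print(groups)
# [['developers', 'language models', 'production systems'], ['monitoring']]
```

Unlike a modularity-based method such as Leiden, this grouping has no randomness or resolution setting: the same input always yields the same components.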