Added language detection through the distant path of AA. #1031

Draft
eephyne wants to merge 4 commits into calibrain:main from eephyne:feature-search_wider

@eephyne eephyne commented May 28, 2026

  • Added language detection through the distant path of AA.
  • Added an option in the settings to disable this feature.

This prevents shelfmark from dropping many results when the language is missing in AA.
With this option enabled, it parses the distant path (I don't know what it is actually called) to look for a language and sets the language accordingly.

I tested it with different requests and languages, and the parsing seems OK to me, but it can probably be improved.
The option can be enabled or disabled in the settings under Direct Download > Download Source.

- Added option in setting to disable this feature.
Copilot AI review requested due to automatic review settings May 28, 2026 14:47

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in feature to the Direct Download source that infers a book's language from the "distant path" (file path in search results) when the language metadata is missing, with corresponding settings, parsing logic, and tests.

Changes:

  • New DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH checkbox setting (default off).
  • Logic in direct_download.py to extract a distant path from result rows, detect language via bracket/keyed/name/code patterns with alias mapping from book-languages.json, and apply local language filtering after parsing.
  • Extensive new tests covering detection, false-positive avoidance, legacy behavior, and search-level local filtering.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| shelfmark/config/settings.py | Adds new CheckboxField for the path-language toggle. |
| shelfmark/release_sources/direct_download.py | Implements distant-path extraction, language inference, alias map, and conditional local filtering in search_books. |
| tests/config/test_download_settings.py | Verifies the new settings field exists with expected default/description. |
| tests/direct_download/test_search_queries.py | Adds tests for distant-path language detection, edge cases, and the new local filtering path. |

Comment on lines +709 to +722
detected_from_path = _detect_language_from_distant_path(distant_path)

# Temporary visual diagnostics for field mapping and path-language inference.
if _is_language_from_path_enabled():
    logger.info(
        "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
        record_id,
        _short_debug(title),
        _short_debug(language),
        _short_debug(detected_from_path),
        _short_debug(distant_path, limit=260),
        _short_debug(cells[10].get_text(" ", strip=True), limit=140),
        _short_debug(row.get_text(" ", strip=True), limit=260),
    )
Comment on lines +712 to +713
if _is_language_from_path_enabled():
    logger.info(
        _short_debug(row.get_text(" ", strip=True), limit=260),
    )

if _is_language_from_path_enabled() and _is_missing_or_placeholder_language(language):
Comment on lines +290 to +292

def _extract_distant_path(row: Tag) -> str | None:
    """Extract distant path hints from a direct-download search row."""
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
_LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
_AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})
r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
re.IGNORECASE,
)
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
Comment on lines +576 to +581
# When path-language inference is enabled, language filtering must happen
# after row parsing, otherwise source-side lang filters drop rows too early.
if not (path_language_enabled and requested_langs):
    for value in filters.lang or []:
        if value and value != "all":
            filters_query += f"&lang={quote(value)}"
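
A minimal sketch of the conditional above as written at this revision, assuming a hypothetical `build_lang_query` helper and a plain list standing in for `filters.lang`; only `urllib.parse.quote` is a real library call. It illustrates that the `&lang=` parameters are skipped only when path inference is enabled and at least one concrete language was requested.

```python
from urllib.parse import quote


def build_lang_query(langs, path_language_enabled):
    """Return the `&lang=` query fragment, or "" when filtering is deferred.

    Hypothetical helper for illustration; not the PR's actual function.
    """
    # Assumption: "requested" means every non-empty value other than "all".
    requested = [value for value in (langs or []) if value and value != "all"]
    # Defer to local filtering only when inference is on AND a language was requested.
    if path_language_enabled and requested:
        return ""
    return "".join(f"&lang={quote(value)}" for value in requested)


print(build_lang_query(["fr", "all"], path_language_enabled=False))  # &lang=fr
print(repr(build_lang_query(["fr"], path_language_enabled=True)))  # ''
```

With the feature disabled the query is built exactly as before, so the toggle only changes behaviour for searches that carry a language filter.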
Comment on lines +463 to +480
import shelfmark.release_sources.direct_download as dd

captured_url: dict[str, str] = {}

original_get = dd.config.get

def _fake_get(key: str, default=None, user_id=None):
    del user_id
    if key == "DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH":
        return True
    return original_get(key, default)

monkeypatch.setattr(dd.config, "get", _fake_get)
monkeypatch.setattr(dd.network, "get_aa_base_url", lambda: "https://mirror.example")
monkeypatch.setattr(dd.network, "AAMirrorSelector", lambda: object())

def _fake_html_get_page(url: str, selector, allow_bypasser_fallback=False):
    del selector, allow_bypasser_fallback
Comment thread shelfmark/release_sources/direct_download.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 07:42

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comment on lines +726 to +729
# Temporary visual diagnostics for field mapping and path-language inference.
if _is_language_from_path_enabled():
    logger.info(
        "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
Comment on lines +224 to +226
_LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
    r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
)
r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
re.IGNORECASE,
)
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
Comment on lines +266 to +270
def _language_alias_to_code() -> dict[str, str]:
    """Build alias->code map from bundled language metadata."""
    global _LANGUAGE_ALIAS_TO_CODE
    if _LANGUAGE_ALIAS_TO_CODE is not None:
        return _LANGUAGE_ALIAS_TO_CODE
Comment on lines +593 to +596
if not (path_language_enabled and requested_langs):
    for value in filters.lang or []:
        if value and value != "all":
            filters_query += f"&lang={quote(value)}"
Comment on lines +462 to +465
def test_search_books_filters_language_locally_when_path_language_enabled(monkeypatch):
    import shelfmark.release_sources.direct_download as dd

    captured_url: dict[str, str] = {}
)
_LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
_LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
eephyne added 2 commits May 29, 2026 13:39
…eedback

Downgrade temporary per-row language diagnostics from INFO to DEBUG to reduce production log noise.
Cache path-language toggle per parsed row and skip distant-path extraction when the feature is disabled.
Make language alias cache initialization thread-safe with a lock to avoid race conditions on first load.
Reduce language false positives by tightening name-token matching and preferring non-ambiguous strong candidates.
Keep server-side language filtering enabled and apply local path-based filtering as an additional refinement.
Remove redundant normalized placeholder handling for em-dash language values.
Update Direct Download setting description to document language-filter trade-offs.
Fix test indentation consistency and update assertions for restored server-side lang query behavior.
Add regression coverage for bracket-order ambiguity (e.g., EN marker appearing before FR marker).
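
One of the commit notes above mentions making the alias cache initialization thread-safe with a lock. A minimal sketch of that general pattern (double-checked lazy initialization) follows. The names `_ALIAS_CACHE`, `_ALIAS_LOCK`, and the placeholder `_load_aliases` are hypothetical; the PR's actual implementation is not shown in this thread and may differ.

```python
import threading

_ALIAS_CACHE = None  # hypothetical stand-in for _LANGUAGE_ALIAS_TO_CODE
_ALIAS_LOCK = threading.Lock()


def _load_aliases():
    """Placeholder loader; the PR reads book-languages.json instead."""
    return {"french": "fr", "deutsch": "de"}


def language_alias_to_code():
    """Lazily build the alias map once, even under concurrent first calls."""
    global _ALIAS_CACHE
    if _ALIAS_CACHE is None:  # fast path: no lock is taken once initialized
        with _ALIAS_LOCK:
            if _ALIAS_CACHE is None:  # re-check after acquiring the lock
                _ALIAS_CACHE = _load_aliases()
    return _ALIAS_CACHE


print(language_alias_to_code()["french"])  # fr
```

The second `is None` check inside the lock is what prevents two threads that both passed the first check from loading the data twice.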
…accepts books without language metadata and adjusts the language filters for lgli files.
Copilot AI review requested due to automatic review settings June 4, 2026 14:22

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment on lines +1443 to +1452
CheckboxField(
key="DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH",
label="Detect Language From Distant Path",
description=(
"When language metadata is missing, parse the distant path and set language "
"from tags like [BD FR]. Falls back to unknown when not detected. "
"Note: source-side language filters still apply and may exclude poorly tagged rows."
),
default=False,
),
Comment on lines +758 to +769
# Temporary visual diagnostics for field mapping and path-language inference.
if path_language_enabled:
    logger.debug(
        "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
        record_id,
        _short_debug(title),
        _short_debug(language),
        _short_debug(detected_from_path),
        _short_debug(distant_path, limit=260),
        _short_debug(cells[10].get_text(" ", strip=True), limit=140),
        _short_debug(row.get_text(" ", strip=True), limit=260),
    )
Comment on lines +774 to +779
logger.debug(
    "DD lang debug resolved | id=%s | final_lang=%s | fallback=%s",
    record_id,
    _short_debug(language),
    "unknown" if detected_from_path is None else "detected",
)
Comment on lines +323 to +328
normalized = re.sub(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    r".\1",
    normalized,
    flags=re.IGNORECASE,
)
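
To make the effect of this substitution concrete, here is a small self-contained sketch using the same pattern. The wrapper name `rejoin_extension` and the sample strings are made up for illustration. The pattern removes whitespace that sits directly before a known file extension and leaves other text alone.

```python
import re

# Same pattern as in the snippet above, compiled once.
_EXT_GAP = re.compile(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    flags=re.IGNORECASE,
)


def rejoin_extension(path_text: str) -> str:
    """Collapse whitespace before a known extension (hypothetical wrapper)."""
    return _EXT_GAP.sub(r".\1", path_text)


print(rejoin_extension("[BD FR] Book .cbz"))  # [BD FR] Book.cbz
print(rejoin_extension("notes .txt"))  # notes .txt (extension not in the list)
```

Because the replacement reuses the captured group, the original letter case of the extension is preserved.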
Comment on lines +280 to +286
mapping: dict[str, str] = {}
data_path = Path(__file__).resolve().parents[2] / "data" / "book-languages.json"

try:
    raw = json.loads(data_path.read_text(encoding="utf-8"))
except (OSError, ValueError, TypeError):
    _LANGUAGE_ALIAS_TO_CODE = {}
Comment on lines +434 to +440
def _book_matches_requested_languages(book_language: str | None, requested: set[str]) -> bool:
    """Return True when a book language matches normalized requested filters.

    A book whose language is unknown (None) passes through: the server-side
    ``&lang=`` filter already constrained the result set, so dropping rows
    that simply lack metadata would hide relevant results.
    """
@eephyne eephyne marked this pull request as draft June 4, 2026 15:10
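
The pass-through behaviour described in the `_book_matches_requested_languages` docstring above could look roughly like the sketch below. The function body is an assumption, since only the signature and docstring appear in the review excerpt. In particular, treating an empty `requested` set as "no filter" and lowercasing the book language are guesses, not confirmed behaviour.

```python
def book_matches_requested_languages(book_language, requested):
    """Hypothetical body for the documented behaviour; not the PR's code."""
    if not requested:
        return True  # assumption: an empty filter set means "accept everything"
    if book_language is None:
        return True  # unknown language passes through, as the docstring states
    # Assumption: `requested` already holds normalized lowercase codes.
    return book_language.strip().lower() in requested


print(book_matches_requested_languages(None, {"fr"}))  # True
print(book_matches_requested_languages("FR", {"fr"}))  # True
print(book_matches_requested_languages("en", {"fr"}))  # False
```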
NemesisHubris added a commit to NemesisHubris/litfinder that referenced this pull request Jun 6, 2026
When DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH is enabled, parse the file
path column in AA search results (e.g. lgli/N:\...\[BD FR] Book.cbz)
to infer language when AA's own metadata is missing or unknown.

Detection priority:
1. Explicit bracket tags: [FR], [BD FR], [En]
2. Keyed markers: "BD FR", "language: fr"
3. Full language names: "french", "deutsch"
4. Loose 2-3 char codes (ambiguous ones like "en"/"de" require
   bracket or key context to avoid false positives)

When enabled with a language filter, the server-side &lang= parameter
is suppressed and filtering is done locally so lgli files without AA
language metadata are not excluded before the path can be inspected.

Also relaxes _parse_search_result_row to only require title + format,
allowing sparse lgli rows (missing author/publisher/year) to pass through.
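
The detection priority listed in the commit message above can be sketched as a simple first-match cascade. This is an illustration under stated assumptions, not the PR's implementation: the `CODES` and `NAMES` tables are tiny made-up stand-ins for the map built from book-languages.json, `detect_language` is a hypothetical name, and the tie-breaking the PR mentions elsewhere (preferring strong, non-ambiguous candidates) is left out. The keyed and loose patterns and the ambiguous-code set are modelled on the snippets quoted in the review comments.

```python
import re

# Assumptions for illustration only; the PR derives its map from book-languages.json.
CODES = {"fr", "en", "de", "es", "it"}
NAMES = {"french": "fr", "english": "en", "deutsch": "de", "german": "de"}
AMBIGUOUS = {"de", "en", "it", "la", "no", "or", "is", "in"}

BRACKET = re.compile(r"\[([^\]]+)\]")
KEYED = re.compile(r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b", re.IGNORECASE)
NAME = re.compile(r"[a-z]{3,}")
LOOSE = re.compile(r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])")


def detect_language(path):
    """Return a language code inferred from a path, or None (hypothetical helper)."""
    if not path:
        return None
    # 1. Explicit bracket tags such as [FR], [BD FR], [En].
    for content in BRACKET.findall(path):
        for token in content.lower().split():
            if token in CODES:
                return token
    # 2. Keyed markers such as "BD FR" or "language: fr".
    for match in KEYED.finditer(path):
        code = match.group(1).lower()
        if code in CODES:
            return code
    # 3. Full language names such as "french" or "deutsch".
    for token in NAME.findall(path.lower()):
        if token in NAMES:
            return NAMES[token]
    # 4. Loose 2-3 letter codes; ambiguous ones are ignored without context.
    for match in LOOSE.finditer(path):
        code = match.group(1).lower()
        if code in CODES and code not in AMBIGUOUS:
            return code
    return None


print(detect_language(r"lgli/N:\comics\[BD FR] Book.cbz"))  # fr
print(detect_language("scans/de/file.pdf"))  # None ("de" alone is ambiguous)
```

Ordering the checks from most to least explicit means a weak signal such as a bare two-letter directory name is consulted only after every stronger marker has failed.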