Added language detection through the distant path of AA. #1031
Draft
eephyne wants to merge 4 commits into
- Added an option in the settings to disable this feature.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in feature to the Direct Download source that infers a book's language from the "distant path" (file path in search results) when the language metadata is missing, with corresponding settings, parsing logic, and tests.
Changes:
- New `DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH` checkbox setting (default off).
- Logic in `direct_download.py` to extract a distant path from result rows, detect language via bracket/keyed/name/code patterns with alias mapping from `book-languages.json`, and apply local language filtering after parsing.
- Extensive new tests covering detection, false-positive avoidance, legacy behavior, and search-level local filtering.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| shelfmark/config/settings.py | Adds new CheckboxField for the path-language toggle. |
| shelfmark/release_sources/direct_download.py | Implements distant-path extraction, language inference, alias map, and conditional local filtering in search_books. |
| tests/config/test_download_settings.py | Verifies the new settings field exists with expected default/description. |
| tests/direct_download/test_search_queries.py | Adds tests for distant-path language detection, edge cases, and the new local filtering path. |
Comment on lines +709 to +722

    detected_from_path = _detect_language_from_distant_path(distant_path)

    # Temporary visual diagnostics for field mapping and path-language inference.
    if _is_language_from_path_enabled():
        logger.info(
            "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
            record_id,
            _short_debug(title),
            _short_debug(language),
            _short_debug(detected_from_path),
            _short_debug(distant_path, limit=260),
            _short_debug(cells[10].get_text(" ", strip=True), limit=140),
            _short_debug(row.get_text(" ", strip=True), limit=260),
        )
Comment on lines +712 to +713

    if _is_language_from_path_enabled():
        logger.info(

            _short_debug(row.get_text(" ", strip=True), limit=260),
        )

    if _is_language_from_path_enabled() and _is_missing_or_placeholder_language(language):
Comment on lines +290 to +292

    def _extract_distant_path(row: Tag) -> str | None:
        """Extract distant path hints from a direct-download search row."""

    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
    _LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
    _LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
    _AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})

        r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
        re.IGNORECASE,
    )
    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
Comment on lines +576 to +581

    # When path-language inference is enabled, language filtering must happen
    # after row parsing, otherwise source-side lang filters drop rows too early.
    if not (path_language_enabled and requested_langs):
        for value in filters.lang or []:
            if value and value != "all":
                filters_query += f"&lang={quote(value)}"
Comment on lines +463 to +480

    import shelfmark.release_sources.direct_download as dd

    captured_url: dict[str, str] = {}

    original_get = dd.config.get

    def _fake_get(key: str, default=None, user_id=None):
        del user_id
        if key == "DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH":
            return True
        return original_get(key, default)

    monkeypatch.setattr(dd.config, "get", _fake_get)
    monkeypatch.setattr(dd.network, "get_aa_base_url", lambda: "https://mirror.example")
    monkeypatch.setattr(dd.network, "AAMirrorSelector", lambda: object())

    def _fake_html_get_page(url: str, selector, allow_bypasser_fallback=False):
        del selector, allow_bypasser_fallback
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines +726 to +729

    # Temporary visual diagnostics for field mapping and path-language inference.
    if _is_language_from_path_enabled():
        logger.info(
            "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
Comment on lines +224 to +226

    _LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
        r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
    )
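To make the behavior of this code-token pattern easier to judge, here is a small standalone demonstration. The regex is copied from the excerpt above; the helper name `code_tokens` and the sample paths are made up for illustration and are not part of the PR.

```python
import re

# Pattern copied from the review excerpt: a 2-3 letter token that sits
# between path-like delimiters (or at the start/end of the string).
CODE_TOKEN_PATTERN = re.compile(
    r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
)


def code_tokens(path: str) -> list:
    """Return every 2-3 letter candidate token found in the path (hypothetical helper)."""
    return CODE_TOKEN_PATTERN.findall(path)


print(code_tokens("lgli/fr/Book.cbz"))
print(code_tokens("[BD FR] x"))
print(code_tokens("x/fr/de/y"))
```

Because the trailing delimiter is matched with a lookahead, it is not consumed, so adjacent tokens such as `/fr/de/` are both found. The pattern also picks up file extensions such as `cbz`, which suggests that candidates still need to be checked against a list of known language codes.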
        r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
        re.IGNORECASE,
    )
    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
Comment on lines +266 to +270

    def _language_alias_to_code() -> dict[str, str]:
        """Build alias->code map from bundled language metadata."""
        global _LANGUAGE_ALIAS_TO_CODE
        if _LANGUAGE_ALIAS_TO_CODE is not None:
            return _LANGUAGE_ALIAS_TO_CODE
Comment on lines +593 to +596

    if not (path_language_enabled and requested_langs):
        for value in filters.lang or []:
            if value and value != "all":
                filters_query += f"&lang={quote(value)}"
Comment on lines +462 to +465

    def test_search_books_filters_language_locally_when_path_language_enabled(monkeypatch):
        import shelfmark.release_sources.direct_download as dd

        captured_url: dict[str, str] = {}

    )
    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
    _LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
    _LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
…eedback
- Downgrade temporary per-row language diagnostics from INFO to DEBUG to reduce production log noise.
- Cache path-language toggle per parsed row and skip distant-path extraction when the feature is disabled.
- Make language alias cache initialization thread-safe with a lock to avoid race conditions on first load.
- Reduce language false positives by tightening name-token matching and preferring non-ambiguous strong candidates.
- Keep server-side language filtering enabled and apply local path-based filtering as an additional refinement.
- Remove redundant normalized placeholder handling for em-dash language values.
- Update Direct Download setting description to document language-filter trade-offs.
- Fix test indentation consistency and update assertions for restored server-side lang query behavior.
- Add regression coverage for bracket-order ambiguity (e.g., EN marker appearing before FR marker).
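The lock-guarded alias cache mentioned in this commit message might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `load_aliases` is a hypothetical stand-in for reading `book-languages.json`, and `LOAD_CALLS` exists only to show that the load runs once.

```python
import threading
from typing import Optional

ALIAS_CACHE: Optional[dict] = None
ALIAS_LOCK = threading.Lock()
LOAD_CALLS = []  # records each load, only to demonstrate single initialization


def load_aliases() -> dict:
    """Hypothetical stand-in for parsing the bundled language metadata file."""
    LOAD_CALLS.append(1)
    return {"french": "fr", "deutsch": "de"}


def language_alias_to_code() -> dict:
    """Return the alias->code map, building it at most once across threads."""
    global ALIAS_CACHE
    # Fast path: the map is already built, so no lock is needed.
    if ALIAS_CACHE is not None:
        return ALIAS_CACHE
    with ALIAS_LOCK:
        # Re-check inside the lock: another thread may have built the map
        # while this one was waiting.
        if ALIAS_CACHE is None:
            ALIAS_CACHE = load_aliases()
        return ALIAS_CACHE
```

The second check inside the lock is what prevents two threads that both saw `None` from each loading the file.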
…accepts books without language metadata and adjusts the language filters for lgli files.
Comment on lines +1443 to +1452

    CheckboxField(
        key="DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH",
        label="Detect Language From Distant Path",
        description=(
            "When language metadata is missing, parse the distant path and set language "
            "from tags like [BD FR]. Falls back to unknown when not detected. "
            "Note: source-side language filters still apply and may exclude poorly tagged rows."
        ),
        default=False,
    ),
Comment on lines +758 to +769

    # Temporary visual diagnostics for field mapping and path-language inference.
    if path_language_enabled:
        logger.debug(
            "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
            record_id,
            _short_debug(title),
            _short_debug(language),
            _short_debug(detected_from_path),
            _short_debug(distant_path, limit=260),
            _short_debug(cells[10].get_text(" ", strip=True), limit=140),
            _short_debug(row.get_text(" ", strip=True), limit=260),
        )
Comment on lines +774 to +779

    logger.debug(
        "DD lang debug resolved | id=%s | final_lang=%s | fallback=%s",
        record_id,
        _short_debug(language),
        "unknown" if detected_from_path is None else "detected",
    )
Comment on lines +323 to +328

    normalized = re.sub(
        r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
        r".\1",
        normalized,
        flags=re.IGNORECASE,
    )
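As a quick illustration of what this substitution does, the sketch below applies the same pattern to made-up strings in which a space ended up before the file extension. The helper name `rejoin_extension` and the sample inputs are assumptions for the example only.

```python
import re

# Same pattern as in the excerpt: whitespace followed by a known extension.
EXTENSION_GAP = re.compile(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    re.IGNORECASE,
)


def rejoin_extension(text: str) -> str:
    """Remove whitespace that separates a file name from a known extension."""
    return EXTENSION_GAP.sub(r".\1", text)


print(rejoin_extension("[BD FR] Book .cbz"))   # [BD FR] Book.cbz
print(rejoin_extension("Book .txt"))           # unchanged: .txt is not in the list
```

The trailing `\b` keeps the pattern from firing on longer suffixes that merely start with a known extension, and the backreference preserves the original letter case.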
Comment on lines +280 to +286

    mapping: dict[str, str] = {}
    data_path = Path(__file__).resolve().parents[2] / "data" / "book-languages.json"

    try:
        raw = json.loads(data_path.read_text(encoding="utf-8"))
    except (OSError, ValueError, TypeError):
        _LANGUAGE_ALIAS_TO_CODE = {}
Comment on lines +434 to +440

    def _book_matches_requested_languages(book_language: str | None, requested: set[str]) -> bool:
        """Return True when a book language matches normalized requested filters.

        A book whose language is unknown (None) passes through: the server-side
        ``&lang=`` filter already constrained the result set, so dropping rows
        that simply lack metadata would hide relevant results.
        """
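The pass-through rule described in this docstring can be sketched as follows. The excerpt shows only the signature and docstring, so the body here is an assumption, including treating an empty request set as "no filter" and lowercasing the book language before comparing.

```python
from typing import Optional, Set


def book_matches_requested_languages(
    book_language: Optional[str], requested: Set[str]
) -> bool:
    """Sketch: True when the book should survive local language filtering."""
    # Assumption: no requested languages means nothing is filtered out.
    if not requested:
        return True
    # Unknown language passes through, as the docstring describes.
    if book_language is None:
        return True
    # Assumption: requested codes are already normalized to lowercase.
    return book_language.strip().lower() in requested


print(book_matches_requested_languages(None, {"fr"}))   # True
print(book_matches_requested_languages("FR", {"fr"}))   # True
print(book_matches_requested_languages("en", {"fr"}))   # False
```

Letting unknown rows through favors recall over precision: some results in the wrong language may remain, but rows that simply lack metadata are not hidden.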
NemesisHubris added a commit to NemesisHubris/litfinder that referenced this pull request Jun 6, 2026
When DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH is enabled, parse the file path column in AA search results (e.g. lgli/N:\...\[BD FR] Book.cbz) to infer language when AA's own metadata is missing or unknown.

Detection priority:
1. Explicit bracket tags: [FR], [BD FR], [En]
2. Keyed markers: "BD FR", "language: fr"
3. Full language names: "french", "deutsch"
4. Loose 2-3 char codes (ambiguous ones like "en"/"de" require bracket or key context to avoid false positives)

When enabled with a language filter, the server-side &lang= parameter is suppressed and filtering is done locally so lgli files without AA language metadata are not excluded before the path can be inspected.

Also relaxes _parse_search_result_row to only require title + format, allowing sparse lgli rows (missing author/publisher/year) to pass through.
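The detection priority in this commit message can be sketched roughly as below. This is a minimal illustration, not the commit's actual code: the function name `detect_language_from_path`, the tiny `NAME_TO_CODE` and `KNOWN_CODES` tables, and the bracket and word regexes are assumptions made for the example. The keyed-marker regex and the ambiguous-code set are borrowed from the review excerpts earlier on this page.

```python
import re
from typing import Optional

# Hypothetical, tiny tables; the real code builds its map from book-languages.json.
NAME_TO_CODE = {"french": "fr", "deutsch": "de", "german": "de", "english": "en"}
KNOWN_CODES = {"fr", "de", "en", "es", "it"}
# Codes that are also everyday words, so they need bracket or key context.
AMBIGUOUS = {"de", "en", "it", "la", "no", "or", "is", "in"}

BRACKET = re.compile(r"\[([^\]]+)\]")
KEYED = re.compile(
    r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b", re.IGNORECASE
)
WORD = re.compile(r"[A-Za-z]+")


def detect_language_from_path(path: str) -> Optional[str]:
    # 1. Explicit bracket tags such as [FR], [BD FR], [En].
    for content in BRACKET.findall(path):
        for token in WORD.findall(content):
            if token.lower() in KNOWN_CODES:
                return token.lower()
    # 2. Keyed markers such as "BD FR" or "language: fr".
    for match in KEYED.finditer(path):
        code = match.group(1).lower()
        if code in KNOWN_CODES:
            return code
    tokens = [t.lower() for t in WORD.findall(path)]
    # 3. Full language names such as "french" or "deutsch".
    for token in tokens:
        if token in NAME_TO_CODE:
            return NAME_TO_CODE[token]
    # 4. Loose codes, skipping ambiguous ones that lack bracket or key context.
    for token in tokens:
        if token in KNOWN_CODES and token not in AMBIGUOUS:
            return token
    return None


print(detect_language_from_path(r"lgli/N:\...\[BD FR] Book.cbz"))  # fr
print(detect_language_from_path("notes/en route to it.pdf"))        # None
```

Checking the strongest signals first means a stray short word later in the path cannot override an explicit tag, and returning `None` leaves the language unknown instead of guessing.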
This is to avoid shelfmark discarding many results because of missing language metadata in AA.
With this option enabled, it parses the distant path (I don't know what it is actually called) to look for a language and sets the language accordingly.
I tested it with different requests and languages, and the parsing seems OK to me, but it can probably be improved.
The option can be enabled or disabled in the settings under Direct Download > Download Source.