feat: add new page text extraction methods and improve content parsing #17

Open
jankzl wants to merge 1 commit into ctxinf:dev from jankzl:feature/heuristic-dom-content-extract

Conversation

jankzl commented Apr 12, 2026

  • Introduced a new extraction method: a DOM heuristic, offered as an alternative to @mozilla/readability.
  • Updated localization files for English, Simplified Chinese, and Traditional Chinese to include new extraction method messages.
  • Enhanced Summary component to allow users to select the extraction method.
  • Implemented logic in useSummary composable to handle extraction method changes and update webpage content accordingly.
  • Refactored page-read utility functions to support new extraction methods and improve content parsing.
  • Added new properties to WebpageContent type to track extraction method and input text length.
  • Updated default configuration to set the default extraction method to readability.

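The type and configuration changes listed above might look roughly like the following. This is a minimal sketch, not the PR's actual code: the field names `extractMethod` and `inputTextLength`, the `'dom-heuristic'` key, and the `buildWebpageContent` helper are all hypothetical. Only `pageTextExtractMethod` and the `readability` default come from this thread.

```typescript
// Hypothetical union of the two extraction methods discussed in this PR.
type PageTextExtractMethod = 'readability' | 'dom-heuristic';

// Assumed shape: the real WebpageContent type has other fields not shown in this thread.
interface WebpageContent {
  title: string;
  text: string;
  extractMethod: PageTextExtractMethod; // hypothetical name: which method produced `text`
  inputTextLength: number;              // hypothetical name: length of the extracted text
}

// The PR description states that the default method is readability.
const defaultConfig: { pageTextExtractMethod: PageTextExtractMethod } = {
  pageTextExtractMethod: 'readability',
};

// Hypothetical helper that fills in the two tracking fields.
function buildWebpageContent(
  title: string,
  text: string,
  method: PageTextExtractMethod = defaultConfig.pageTextExtractMethod,
): WebpageContent {
  return { title, text, extractMethod: method, inputTextLength: text.length };
}
```

Recording the method alongside the text lets the UI show which extractor produced the content that was sent to the model.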
Author

jankzl commented Apr 12, 2026

The default content extraction method, @mozilla/readability, failed to extract the important section of one specific page (https://developer.apple.com/documentation/bundleresources/information-property-list/nsapptransportsecurity/nsallowsarbitraryloads?language=objc). This branch introduces a DOM heuristic method so that more of the page content reaches the LLM(s).

@jankzl jankzl changed the base branch from main to dev April 13, 2026 01:38
<span class="font-bold"> general</span> extract method.
Two <span class="font-bold">general</span> extract methods are available now.
<br />
<span class="font-bold">@mozilla/readability</span> remains the default for classic article pages.
Owner

This explanation is no longer needed here.

<template #trigger> @mozilla/readability </template>
</Select>
<div class="w-[28rem] max-w-full">
<RadioGroup v-model="pageTextExtractMethod" class="gap-3">
Owner

This layout is currently horizontal. It needs to be vertical: title on top, selection area below.

import { minimatch } from 'minimatch';



Owner

I wrote the code for this core flow by hand, before agent-assisted programming had matured, and it has been through a great deal of manual testing and debugging. Even though the code is a pile of crap, please keep the changes minimal and do not refactor the core flow.

})
return
}

Owner

For this feature, my understanding is that only this spot needs to change: decide which function to call based on the selected option, and use the result it returns.
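A minimal sketch of the single-point change described in this comment, assuming two extractor functions with the same signature. The names `extractByMethod`, `extractWithReadability`, `extractWithDomHeuristic`, and the `'dom-heuristic'` key are hypothetical, and the function bodies are placeholders rather than the real extraction logic.

```typescript
// Hypothetical option values: the setting's real values are not shown in this thread.
type PageTextExtractMethod = 'readability' | 'dom-heuristic';

// Placeholder: the real function would run @mozilla/readability on the page.
function extractWithReadability(html: string): string {
  return `readability:${html.length}`;
}

// Placeholder: the real function would run the DOM heuristic from this PR.
function extractWithDomHeuristic(html: string): string {
  return `dom-heuristic:${html.length}`;
}

// The only branching point: pick the extractor from the option and return its result.
function extractByMethod(method: PageTextExtractMethod, html: string): string {
  switch (method) {
    case 'dom-heuristic':
      return extractWithDomHeuristic(html);
    case 'readability':
    default:
      // Falling back to readability keeps the existing behaviour as the default.
      return extractWithReadability(html);
  }
}
```

Keeping the branch in one place leaves the surrounding core flow untouched, which matches the request for minimal changes.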

<SummaryDialog class="mt-[-1px] min-h-16 overflow-y-auto max-h-[--webpage-summary-panel-dialog-max-height]"
style="overflow-anchor: auto" ref="summaryDialog">
<template #top-right-buttons>
<div class="flex items-center gap-1 rounded-md border bg-background/80 p-1" :title="t('Extract_method')">
Owner

The UI has no space reserved for this Toggle Buttons Group, so it looks very odd right now. I suggest dropping the Toggle Buttons Group and letting users configure this only in the settings page.

Comment thread src/types/summary.ts
}

export type SummaryInput = WebpageContent & { summaryLanguage: string, currentSelection?: string }
export type SummaryInput = WebpageContent & { summaryLanguage: string, currentSelection?: string, currentModel?: string }
Owner

`currentModel` does not seem to be used here?

Comment thread README.md

or download from [Github Releases](https://github.com/slow-groovin/webpage-summary/releases) and manually install

### Load Modified Extension In Firefox
Owner

This project currently uses bun as its package manager. Please confirm whether this section was verified by hand, and whether any further notes need to be added.

Owner

ctxinf left a comment

Thank you for the submission. Please revise it according to the comments. Also, are your other two submissions follow-ups related to this feature? If so, please push those commits to the same branch so that they form a single PR.

Comment thread src/utils/page-read.ts
Owner

This file could be refactored, for example by adding a folder utils/extract and splitting each piece of functionality into its own file, with every file exporting an entry function of the same type.
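One way the suggested layout could look, shown as a single file for brevity. This is a sketch under assumptions: the file names in the comments, the `PageTextExtractor` and `ExtractResult` types, and the tag-stripping placeholder are all hypothetical. In the real layout each section would live in its own file under utils/extract and export its entry function.

```typescript
// utils/extract/types.ts (hypothetical): the shared entry-point type.
interface ExtractResult {
  text: string;
  method: string;
}
type PageTextExtractor = (html: string) => ExtractResult;

// Placeholder used by both stubs below. It only strips tags and collapses whitespace;
// it is NOT what @mozilla/readability or the PR's DOM heuristic actually does.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// utils/extract/readability.ts (hypothetical): would wrap @mozilla/readability.
const extractReadability: PageTextExtractor = (html) => ({
  text: stripTags(html),
  method: 'readability',
});

// utils/extract/dom-heuristic.ts (hypothetical): would hold the DOM heuristic.
const extractDomHeuristic: PageTextExtractor = (html) => ({
  text: stripTags(html),
  method: 'dom-heuristic',
});

// utils/extract/index.ts (hypothetical): one registry, every entry has the same type.
const extractors: Record<'readability' | 'dom-heuristic', PageTextExtractor> = {
  readability: extractReadability,
  'dom-heuristic': extractDomHeuristic,
};
```

Because every entry shares one function type, adding a further method would mean adding one file and one registry line, without touching the callers.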

ctxinf

This comment was marked as low quality.


2 participants