Improve NLP Tool Output

In #4 and FreeAndFair/TuskMobileVoting#60, we discussed how rough the NLP tool's current output is: the raw output includes fragments of LaTeX equations, footnote markers, and similar debris. In #4, I addressed this manually by running the combined histograms through an LLM with some manual cleanup stages ("eliminate everything that starts with a symbol", "eliminate everything that doesn't have at least one word in it", etc.) and, for the verb phrases, having it coalesce phrases with the same primary verb. For the future, we should consider some extensions to the NLP tool to:
- automatically do the kind of cleanup I did manually, either via an LLM API or programmatically where that is straightforward
- perform better OCR on PDF files to ensure that odd kerning and LaTeX artifacts don't cause misreadings (this is much harder than it sounds, and is likely far too much effort for us to attempt any time soon)
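
For the programmatic route, here is a rough sketch of what the two quoted cleanup rules and the verb coalescing might look like. It is not the tool's actual code: it assumes a histogram is a mapping from phrase to count, that "a word" means at least two consecutive letters, and that the primary verb of a verb phrase is simply its first token (no lemmatization). The function names `clean_histogram` and `coalesce_by_verb` are hypothetical.

```python
import re
from collections import Counter

# Assumption: "at least one word" means a run of two or more letters.
_WORD = re.compile(r"[A-Za-z]{2,}")


def clean_histogram(hist):
    """Drop phrases that start with a symbol or contain no word."""
    cleaned = Counter()
    for phrase, count in hist.items():
        text = phrase.strip()
        if not text:
            continue
        # "eliminate everything that starts with a symbol"
        if not text[0].isalnum():
            continue
        # "eliminate everything that doesn't have at least one word in it"
        if not _WORD.search(text):
            continue
        # Phrases that differ only in surrounding whitespace are merged.
        cleaned[text] += count
    return cleaned


def coalesce_by_verb(verb_phrases):
    """Sum counts of verb phrases sharing the same primary verb.

    Assumption: the primary verb is the first token, lowercased.
    """
    totals = Counter()
    for phrase, count in verb_phrases.items():
        tokens = phrase.split()
        if tokens:
            totals[tokens[0].lower()] += count
    return totals
```

Something like `coalesce_by_verb(clean_histogram(raw_verb_phrases))` would chain the two steps. Anything that needs real linguistic judgment (for example, treating "verifies" and "verify" as the same verb) is not covered here and is probably where an LLM API or a proper lemmatizer would come in.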

Improve NLP Tool Output #8
