In #4 and FreeAndFair/TuskMobileVoting#60, we discussed the fact that the current output of the NLP tool is pretty rough: the raw output includes things like fragments of LaTeX equations and footnote markers. I addressed this manually in #4 by running the combined histograms through an LLM with some manual cleanup stages ("eliminate everything that starts with a symbol", "eliminate everything that doesn't have at least one word in it", etc.) and, for the verb phrases, having it coalesce phrases with the same primary verb. In the future, we should consider extending the NLP tool to:
- automatically do the kind of cleanup I did manually, either via an LLM API or programmatically where that is straightforward
- perform better OCR on PDF files to ensure that odd kerning and LaTeX artifacts don't cause misreadings (this is much harder than it sounds, and is likely far too much effort for us to attempt any time soon)
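
For the programmatic side of the first item, here is a rough Python sketch of what the two quoted cleanup rules and the verb coalescing might look like. It is not the tool's actual code: the function names, the `dict` of phrase to count as the histogram shape, the definition of a "word" as a run of two or more ASCII letters, and the use of the first token as the "primary verb" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of two or more ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]{2,}")


def starts_with_symbol(phrase):
    """True if the first non-space character is neither a letter nor a digit."""
    stripped = phrase.strip()
    return bool(stripped) and not stripped[0].isalnum()


def clean_histogram(hist):
    """Drop entries that start with a symbol or contain no word at all."""
    cleaned = Counter()
    for phrase, count in hist.items():
        phrase = phrase.strip()
        if not phrase or starts_with_symbol(phrase):
            continue  # "eliminate everything that starts with a symbol"
        if not WORD_RE.search(phrase):
            continue  # "eliminate everything that doesn't have at least one word in it"
        cleaned[phrase] += count
    return cleaned


def coalesce_by_primary_verb(verb_hist):
    """Sum counts of verb phrases that share the same first token.

    Assumption: the first token of each phrase is its primary verb.
    No lemmatization is done, so "verify" and "verifies" stay separate.
    """
    merged = Counter()
    for phrase, count in verb_hist.items():
        tokens = phrase.split()
        if tokens:
            merged[tokens[0].lower()] += count
    return merged


# Hypothetical input, made up for this example.
raw = {
    "\\frac{a}{b}": 3,
    "verify the ballot": 5,
    "verify signature": 2,
    "1 2": 4,
    "^3": 1,
    "encrypt the vote": 2,
}
cleaned = clean_histogram(raw)
verbs = coalesce_by_primary_verb(cleaned)
```

Anything these rules can't catch (e.g. equation fragments that happen to start with a letter) would still need the LLM pass or a smarter filter.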