Carlos Sanchez edited this page Oct 22, 2023 · 1 revision

Contentapi has a background service that attaches OCR text to images, using a locally installed program to generate the text. The service is off by default.

Setup

Currently, the only program supported is https://github.com/tesseract-ocr/tesseract, but more may be added in the future if required.

To enable tesseract OCR, first make sure the tesseract command is available on the command line; on Linux, several distros reportedly provide a package for it. Then, in your appsettings.json file, locate the "OcrCrawlConfig" : { "Program" : "none" ... section and replace "none" with "tesseract". The next time contentapi runs, it will slowly crawl images starting with the newest, processing 10 every minute. Both the number of images per interval and the interval length can be changed.
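The two setup steps can be sketched in Python. This is only an illustration, not part of contentapi: the helper names are hypothetical, and it assumes appsettings.json is strict JSON (no comments) and sits in the current directory. Only the "OcrCrawlConfig" section and its "Program" key come from the description above.

```python
import json
import shutil
from pathlib import Path


def tesseract_on_path() -> bool:
    """Return True if a `tesseract` executable can be found on PATH."""
    return shutil.which("tesseract") is not None


def enable_tesseract(settings: dict) -> dict:
    """Return a copy of the settings with OcrCrawlConfig.Program set to "tesseract".

    Every other key, inside and outside the section, is carried over unchanged.
    """
    updated = dict(settings)
    section = dict(updated.get("OcrCrawlConfig", {}))
    section["Program"] = "tesseract"
    updated["OcrCrawlConfig"] = section
    return updated


def enable_in_file(path: str = "appsettings.json") -> None:
    """Rewrite the settings file in place.

    Assumption: the file parses as plain JSON. Formatting is normalized on write.
    """
    if not tesseract_on_path():
        raise RuntimeError("tesseract was not found on PATH; install it first")
    file = Path(path)
    settings = json.loads(file.read_text(encoding="utf-8-sig"))
    file.write_text(json.dumps(enable_tesseract(settings), indent=2) + "\n", encoding="utf-8")
```

Calling enable_in_file() before starting contentapi would apply the change; editing the file by hand works just as well and keeps the original formatting.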

Note that there are no options for configuring tesseract through contentapi just yet, sorry.

Where is the OCR?

If the OCR run succeeds (even if no text is found), the OCR text is placed in the "ocr-crawl" value on the image. You can use the API to search for OCR text with a request query like "!valuelike(@key, @value)" with values { "key" : "ocr-crawl", "value" : "%searchtext%" }.
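As a rough sketch, the query and values for such a search could be assembled like this. The function name and the "query"/"values" dictionary keys are illustrative assumptions; how these pieces fit into a full API request (endpoint, surrounding fields) is not covered here. The % on each side of the search text is copied from the example above and presumably acts as a wildcard.

```python
def ocr_search(search_text: str) -> dict:
    """Build the query string and values for a substring search over OCR text.

    Hypothetical helper: the returned shape only mirrors the example in the text.
    """
    return {
        "query": "!valuelike(@key, @value)",
        "values": {"key": "ocr-crawl", "value": f"%{search_text}%"},
    }
```

For example, ocr_search("invoice") produces the value "%invoice%" for the "ocr-crawl" key.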

If the OCR run fails (for example, the program exits with an error or cannot be found), the "ocr-fail" value is set instead, containing the error message.
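A client could tell the three cases apart with a small check like the one below. It assumes, purely for illustration, that an image's values are available as a mapping from key to string; the function name and the status labels are hypothetical.

```python
from typing import Mapping, Optional, Tuple


def ocr_status(values: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Classify an image by its OCR-related values.

    Returns ("done", text), ("failed", error message), or ("pending", None)
    when neither key is present, e.g. the crawler has not reached the image yet.
    """
    if "ocr-crawl" in values:
        # Success, even when the text is empty because nothing was found.
        return ("done", values["ocr-crawl"])
    if "ocr-fail" in values:
        return ("failed", values["ocr-fail"])
    return ("pending", None)
```

Checking for the key rather than for non-empty text matters here, because a successful run that finds no text still sets "ocr-crawl".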
