OCR
Contentapi has a background service that can attach OCR text to images, using a local program to generate the text. This service is off by default.
Currently, the only supported program is https://github.com/tesseract-ocr/tesseract, but more may be added in the future if required.
To enable tesseract OCR, first make sure the tesseract command is available on the command line. On Linux, several distros reportedly provide packages for it. Then, in your appsettings.json file, locate the "OcrCrawlConfig" : { "Program" : "none" ... section and replace "none" with "tesseract". The next time contentapi runs, it will slowly crawl images starting with the newest, processing 10 every minute. You can change both how many images it processes per interval and how often the interval runs.
Note that there are no options for configuring tesseract itself through contentapi just yet, sorry.
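The two setup steps above (check that tesseract is on the command line, then switch "Program" from "none" to "tesseract") can be sketched in Python roughly as follows. This is only an illustration, not part of contentapi: the function name `enable_tesseract` and the `find_program` parameter are hypothetical, and the sketch works on an already-parsed settings dictionary rather than editing appsettings.json on disk, since your file may contain other sections that should be left alone.

```python
import shutil


def enable_tesseract(settings: dict, find_program=shutil.which) -> dict:
    """Return a copy of the parsed settings with the OCR crawler set to tesseract.

    `find_program` defaults to shutil.which, which looks the command up on PATH.
    It is a parameter only so the check can be replaced in tests (hypothetical helper).
    """
    # Step 1: the tesseract command must be available on the command line.
    if find_program("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH; install it first")

    # Step 2: replace "none" with "tesseract" in the OcrCrawlConfig section.
    updated = dict(settings)
    crawl_config = dict(updated.get("OcrCrawlConfig", {}))
    crawl_config["Program"] = "tesseract"
    updated["OcrCrawlConfig"] = crawl_config
    return updated
```

The original dictionary is left unmodified, so you can compare the old and new settings before writing anything back to appsettings.json by hand.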
If successful (even if no text is found), the OCR text is placed in a value named "ocr-crawl" on the image. You can use the API to search for OCR text with a request query like "!valuelike(@key, @value)" with values { "key" : "ocr-crawl", "value" : "%searchtext%" }.
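As a rough sketch of how that search could be assembled, the Python below builds a request body around the query and values shown above. Only the query string and the "key" / "value" entries come from this page. The surrounding envelope ("values", "requests", "type", "fields") and the function name `build_ocr_search` are assumptions made for illustration, so check them against the request format your contentapi instance actually expects. The sketch also assumes the % characters act as wildcards around the search text.

```python
import json


def build_ocr_search(search_text: str) -> dict:
    """Build a search body for images whose OCR text contains search_text."""
    return {
        # From this page: the value key to match and the pattern to match it against.
        "values": {"key": "ocr-crawl", "value": f"%{search_text}%"},
        # Assumed envelope: one request for content using the query from this page.
        "requests": [
            {
                "type": "content",
                "fields": "*",
                "query": "!valuelike(@key, @value)",
            }
        ],
    }


# Show the JSON that would be sent as the request body.
print(json.dumps(build_ocr_search("receipt"), indent=2))
```

Keeping the search text in "values" rather than pasting it into the query string mirrors the example above, where the query only refers to @key and @value.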
If unsuccessful (the program exits in a failure state, is not found, or fails for any other reason), the "ocr-fail" value is set with the error message instead.
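The two outcomes can be told apart by checking which value is present on an image. The sketch below assumes you already have the image's values as a plain Python dictionary. How you obtain that dictionary from the API is not covered here, and the function name `ocr_status` and the status labels are hypothetical.

```python
from typing import Optional, Tuple


def ocr_status(values: dict) -> Tuple[str, Optional[str]]:
    """Classify an image by its OCR values: done, failed, or not yet crawled."""
    # Test for the key itself, not for truthiness: a successful run that
    # found no text still sets "ocr-crawl", possibly to an empty string.
    if "ocr-crawl" in values:
        return ("done", values["ocr-crawl"])
    if "ocr-fail" in values:
        return ("failed", values["ocr-fail"])
    # Neither value is set, so the crawler has not reached this image yet.
    return ("pending", None)
```

For example, `ocr_status({"ocr-fail": "tesseract not found"})` returns `("failed", "tesseract not found")`, while an image with no OCR values at all is reported as pending.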