I still have problems with the quality of OCR text recognition in OpenKM, with Tesseract installed.
When I extract the text manually on the command line, the result is first class.
Just one example (always the same JPG file): the text should be "Allmersbach im Tal", and on the command line I get "Allmersbach im Tal" --> fine!
When I let OpenKM extract the text, the extracted text stored in the database is bad.
From OpenKM I get "Gummersbacher Jim Tal" or "Angeknackst", and "EUR" becomes "EUER".
My configuration:
ubuntu: Ubuntu 18.04.2 LTS
tomcat: 8.5.24
openkm: 6.3.6 build: 87d181f CE
system.ocr: /usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng
system.ocr.rotate: 0;90;180;270;
system.pdf.force.ocr: true
registered.text.extractors: com.openkm.extractor.Tesseract3TextExtractor
system.openoffice.dictionary: /home/openkm/dictonarys/dict-de_de-frami_2017-01-12.oxt
system.openoffice.path: /usr/lib/libreoffice
The following languages are installed:
user@server:# tesseract --list-langs
List of available languages (4):
eng
osd
deu_frak
deu
For both languages, deu_frak and deu, the newest training data is installed.
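To compare the two runs like for like, the command from system.ocr could be expanded by hand and then run in a shell. This is only a minimal sketch: it assumes that OpenKM substitutes ${fileIn} and ${fileOut} literally into that string, which I have not verified, and the function name and the sample paths are made up for illustration.

```python
import shlex
from string import Template

# Value of system.ocr, copied from the configuration above.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng"


def expand_ocr_command(template, file_in, file_out):
    """Fill in the two placeholders and return the command as an argument list.

    Assumption: OpenKM replaces ${fileIn} and ${fileOut} literally.
    """
    # Quote the paths first so that spaces in file names survive the split.
    filled = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(filled)


if __name__ == "__main__":
    # Hypothetical sample paths; tesseract appends ".txt" to the output base.
    argv = expand_ocr_command(SYSTEM_OCR, "/tmp/sample.jpg", "/tmp/sample_out")
    print(shlex.join(argv))
```

Running the printed command on the same JPG file should show whether the language option alone already produces the bad text, or whether the difference only appears inside OpenKM.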
The problem is that with such results a full-text search always fails.
Any help or hints on how I can improve the results?
Thank you in advance
OpaHeinz