• Where is the OCR menu in the open source version

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #26087  by vincentk222
 
Where is the OCR menu in the open source version?

Here are my settings:

system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate String 90;180;270;
system.pdf.force.ocr Boolean Active
 #26141  by vincentk222
 
Under general features, I can see that OCR is marked green.
I have also configured:
system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate String 90;180;270;
system.pdf.force.ocr Boolean Active


But how does it work? If I have an image-only PDF, can I make it searchable?
What does this OCR function actually do?
 #26151  by jllort
 
Keep in mind that there is a document content index queue (administration -> stats -> queue). If a document has not been processed yet, you will not be able to search its content.

I suggest taking a look at administration -> database query.
Use JDBC and run a query against OKM_NODE_DOCUMENT (one column indicates whether the text has been extracted, with value T, and another column shows the extracted text).
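
As a rough sketch of the kind of query meant here: the column names NBS_UUID, NDC_TEXT_EXTRACTED and NDC_TEXT are assumptions and should be checked against your own schema. The in-memory SQLite table and its two rows are made-up stand-ins so the example runs on its own; in OpenKM only the SELECT statement would be pasted into the database query screen.

```python
import sqlite3

# Assumed column names; verify them against your OpenKM database schema.
QUERY = """
SELECT NBS_UUID, NDC_TEXT_EXTRACTED, NDC_TEXT
FROM OKM_NODE_DOCUMENT
WHERE NDC_TEXT_EXTRACTED = 'T'
"""


def extracted_rows(conn):
    """Return the rows whose text extraction flag is set to 'T'."""
    return conn.execute(QUERY).fetchall()


# Hypothetical stand-in table with invented sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
    [("uuid-1", "T", "sample extracted text"), ("uuid-2", "F", None)],
)

rows = extracted_rows(conn)
for uuid, flag, text in rows:
    print(uuid, flag, text)
```

A row flagged T with an empty or dash-only text column would suggest that extraction ran but the OCR engine produced nothing useful.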

A final consideration: depending on the resolution of the images in the PDF and similar factors, some OCR engines will work better than others. In last year's tests, Tesseract seemed to give better results than Cuneiform, going by the latest released versions.
 #26191  by vincentk222
 
The extraction was done.
My mistake was believing that OCR adds a text layer to the PDF, but this is not the case.

If the document is a PDF (scanned image), the extracted text is empty, so I think no OCR is done.
If the file is a TIF, OCR is processed, but the result is only dashes: --------------------------------- ------------------------ ----------------------
 #26230  by jllort
 
Open source OCR engines cannot work with low-resolution images. I suggest extracting the image from the PDF and running the OCR application from a terminal to see the results. For example, ABBYY OCR capture gives good results with 100 dpi images. Keep in mind that an open source solution will not always perform as well as a commercial one; otherwise nobody would buy the commercial product.
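
One possible way to reproduce that terminal test is sketched below. It expands the system.ocr value from the first post by hand and builds an image extraction command. The use of pdfimages from poppler-utils, the helper names and the file paths are all assumptions for illustration; how OpenKM itself substitutes ${fileIn} and ${fileOut} is not shown in this thread.

```python
import shlex
from string import Template

# Value of system.ocr as posted above.
OCR_TEMPLATE = "/usr/bin/tesseract ${fileIn} ${fileOut}"


def build_extract_command(pdf_path, image_prefix):
    """Build a pdfimages call that dumps embedded images as TIFF files.

    Assumes poppler-utils is installed; this tool is not mentioned in the thread.
    """
    return ["pdfimages", "-tiff", pdf_path, image_prefix]


def build_ocr_command(template, file_in, file_out):
    """Expand the ${fileIn}/${fileOut} placeholders into an argument list."""
    expanded = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(expanded)


# Hypothetical paths, for illustration only.
extract_cmd = build_extract_command("/tmp/scan.pdf", "/tmp/scan_img")
ocr_cmd = build_ocr_command(OCR_TEMPLATE, "/tmp/scan page.tif", "/tmp/scan_out")

print(" ".join(shlex.quote(arg) for arg in extract_cmd))
print(" ".join(shlex.quote(arg) for arg in ocr_cmd))
```

The printed lines can be pasted into a shell, or each list can be passed to subprocess.run(cmd, check=True). When called this way, Tesseract writes its result to the output base name plus .txt, so the text can be inspected directly and compared with what ends up in the database.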
