• OCR not working in OpenKM configured to MySQL

  • OpenKM has many interesting features, but a configuration process is needed to show its full potential.
Forum rules: Before asking a question, please consult the documentation wiki or use the forum's search function. Remember that we have neither a crystal ball nor mind-reading powers, so to report an error you need to tell us which version of OpenKM you are using, as well as your browser and operating system. For more information, see How to Report Bugs Effectively.
 #21051  by Muhammad Imran
 
Hi,
I have installed OpenKM 6.2.0 on Windows 7. It was working well with the embedded HSQLDB database. Then I configured it to use MySQL instead of HSQLDB. Right now, OpenKM is working except for OCR (full-text) search.

In the console window I can see that image.JPG is extracted successfully with Tesseract 3.0.
After that, there is a problem:
Code:
Caused by: java.sql.SQLException: Incorrect string value: '\xEF\xAC\x81\xEF\xAC\x81...' for column 'NDC_TEXT' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3562)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3494)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1960)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2114)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2696)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2105)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2398)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2316)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2301)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:105)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:105)
        at org.hibernate.persister.entity.AbstractEntityPersister.update(AbstractEntityPersister.java:2595)
I don't know what the problem is.
Can anyone tell me how to get rid of this problem?
Please give me a hint or point me to a wiki link that would help resolve it.
Last edited by Muhammad Imran on Wed Jan 30, 2013 8:03 am, edited 2 times in total.
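
The escaped bytes quoted in the error message can be decoded to see which character MySQL rejected. The sketch below uses Python purely for illustration. It suggests, though nothing in this thread confirms it, that the NDC_TEXT column uses a character set (such as MySQL's latin1) that cannot represent the "fi" ligature that OCR engines often emit. The helper names `describe` and `fits_charset` are hypothetical, and Python's cp1252 codec is only a stand-in for MySQL's latin1.

```python
import unicodedata

# Bytes copied from the error message: '\xEF\xAC\x81\xEF\xAC\x81...'
rejected = b"\xEF\xAC\x81\xEF\xAC\x81"


def describe(raw):
    """Decode raw as UTF-8 and return the code point and name of each character."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in raw.decode("utf-8")]


def fits_charset(text, charset):
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


text = rejected.decode("utf-8")
print(describe(rejected))            # each entry: U+FB01 LATIN SMALL LIGATURE FI
print(fits_charset(text, "cp1252"))  # False: no slot for the ligature
print(fits_charset(text, "utf-8"))   # True

# Assumption: folding ligatures to plain letters before storage is acceptable.
folded = unicodedata.normalize("NFKC", text)
print(folded, fits_charset(folded, "cp1252"))
```

If this reading is right, converting the column to a UTF-8 character set would be the direct fix; NFKC folding is only a workaround and changes the stored text.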
 #21093  by jllort
 
There's a bug in the text extraction feature of version 6.2.0 (the problem occurs when you index some UTF-16 files: Chinese, Russian, etc.). If you upgrade to 6.2.2, I think it is already solved there. Take a look at the migration guide for how to do it: http://wiki.openkm.com/index.php/Migration_Guide
 #21138  by Muhammad Imran
 
Thanks for your reply, jllort!
I have migrated to OpenKM 6.2.2 successfully. Now OCR (full-text search) works only for "image.png" and "document.docx".
In the console window I can now see the following error:
Code:
Caused by: java.sql.SQLException: Incorrect string value: '\xEF\xAC\x81\xEF\xAC\x81...' for column 'NDC_TEXT' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3562)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3494)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1960)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2114)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2696)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2105)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2398)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2316)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2301)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:105)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:105)
        at org.hibernate.persister.entity.AbstractEntityPersister.update(AbstractEntityPersister.java:2595)
What should I do to get full-text search working for .pdf, .JPG, .txt, etc.?
 #21203  by Muhammad Imran
 
Hi jllort,
Thanks for replying.

I have installed the nightly build from integration.openkm.com, but there is still a problem with PDF text extraction.
In the console window I can now see the following error:
Code:
2013-01-31 13:55:01,973 [Thread-15] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-01-31 13:55:01,974 [Thread-15] WARN  com.openkm.dao.NodeDocumentDAO - There  was a problem extracting text from '/okm:root/Testing/DatabaseBasics.pdf': Too few text extracted
How can I get this fixed?
 #21224  by bgrr
 
I have the same problem with version 6.2.2 build 7815 on Ubuntu 12.04.1 LTS.

JPG works fine, and text PDFs (PDFs with selectable text) work fine.

But a scanned PDF (a PDF containing only a raster image) gives me the same error, both when I try Tesseract from within OpenKM and from the command line.

Installed: Tesseract 3.02 and ImageMagick 6.6.9-7 2012-08-17 Q16
 #21234  by jllort
 
It could be that the image was scanned at too low a resolution. Can you try a higher resolution? If you also get the problem on the command line, concentrate on solving it there first.
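
Testing from the command line can be scripted so that different resolutions are easy to compare. This is a minimal sketch, assuming ImageMagick's `convert` (with Ghostscript available for PDF input) and `tesseract` are both on the PATH. The file names, the 300 dpi default, and the helper names `build_ocr_commands` and `run_ocr` are illustrative assumptions, not OpenKM settings.

```python
import subprocess


def build_ocr_commands(pdf_path, work_base, dpi=300, lang="eng"):
    """Return the two command lines: rasterize page 1 of the PDF, then OCR it."""
    tif_path = work_base + ".tif"
    # -density must come before the input so the PDF is rasterized at that dpi;
    # "[0]" selects only the first page.
    convert_cmd = ["convert", "-density", str(dpi), pdf_path + "[0]",
                   "-depth", "8", tif_path]
    # tesseract appends ".txt" to the output base name itself.
    tesseract_cmd = ["tesseract", tif_path, work_base, "-l", lang]
    return convert_cmd, tesseract_cmd


def run_ocr(pdf_path, work_base, dpi=300, lang="eng"):
    """Run both commands and return the recognized text."""
    for cmd in build_ocr_commands(pdf_path, work_base, dpi, lang):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    with open(work_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling, for example, `run_ocr("scan.pdf", "page", dpi=150)` and then `run_ocr("scan.pdf", "page", dpi=300)` and comparing the amount of text returned would show whether resolution is the limiting factor.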
