Hello there,
I encountered a problem with PDF text extraction. Text extraction from PDFs with an image layer (scanned documents) works fine. Text extraction from other document types (Word, text, etc.) also works, and those documents are indexed as expected. However, certain PDF files are not indexed. I investigated further using the “Administration -> Utilities -> Check text extraction” function.
I tested 5 files containing the same text: one line of English and one line of Cyrillic. Each file contains the following text:
    This is text.
    Това е текст.
Here are the results of my investigation:
- Text file (txt), ANSI encoding: works, although the Cyrillic text is read incorrectly, with ISO 8859-1 characters instead of Windows-1251 ones. That is to be expected for an ANSI-encoded file, which carries no encoding information.
- Text file (txt), UTF-8 encoding: works fine, correct encoding.
- Word file (docx, MS Word 2010): works fine, correct encoding.
- PDF file (produced with Word, "Save As PDF" function): Doesn’t extract any text, neither English text nor Cyrillic.
- PDF file (produced through printing to CutePDF printer 3.0/Ghostscript): Doesn’t extract any text, neither English text nor Cyrillic.
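To illustrate the ANSI result above, here is a minimal Python sketch of what I assume happens to the text file: the Cyrillic line is stored as Windows-1251 bytes but decoded as ISO 8859-1. The variable names are only for illustration, and this is not OpenKM's actual extractor code. It does not explain the PDF failures, only the garbled characters in the first test.

```python
# Sketch of the suspected encoding mismatch for the ANSI text file.
# Assumption: the file was saved as Windows-1251 and read as ISO 8859-1.
text = "Това е текст."

# Bytes as they would be stored in the ANSI (Windows-1251) file:
# every Cyrillic letter occupies exactly one byte.
raw = text.encode("cp1251")

# Decoding with the wrong single-byte code page never fails,
# it just maps each byte to an unrelated Latin character.
misread = raw.decode("iso-8859-1")

# Decoding with the code page the file was written in restores the text.
correct = raw.decode("cp1251")

print(misread)
print(correct)
```

Because both code pages are single-byte, the wrong decoding produces no error, so the extraction is reported as working even though the indexed text is unusable for Cyrillic searches.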
Another clue: I previously had problems copying and pasting text from some PDF files, especially ones made with PDF printers. I found that this topic helped me. I think it is somehow connected with the problem described above.
Here is the information about installation:
Server: OpenKM – community edition, version 6.2.5 (build: 8109), running on Windows 7 Pro SP1 with Apache Tomcat 7.0.27, JRE 7 Update 45, OpenOffice 4.0.1, Tesseract 3.02, ImageMagick 6.8.7, MS SQL Server 2008 R2 Express edition.
Client: Google Chrome 31.0.1650.57 on Windows 7 Pro SP 1.
Configuration settings:
Here are my test files:
Attachments
Two text files: the first is ANSI encoded (code page Windows-1251), the second is UTF-8 encoded.
(431 Bytes) Downloaded 299 times
The Word document, "printed" to CutePDF.
(11.18 KiB) Downloaded 353 times
The Word document, saved as PDF from Word.
(83.37 KiB) Downloaded 371 times
The Word 2010 document.
(12.48 KiB) Downloaded 400 times
