• Full text indexing of 'application/octet-stream' is not supported - text document

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
Forum rules: Before asking, please check the documentation wiki or use the forum's search feature. And remember that we don't have a crystal ball and are not mind readers, so if you post about an issue, tell us which OpenKM version you are using, as well as your browser and operating system versions. For more info, read How to Report Bugs Effectively.
 #44947  by maniachhz
 
Hello guys.

I now have millions of license files (plain text, but named with a .lic suffix), each containing one serial number. Please see the attached file:
license.png
I want to upload these licenses to the server so that people can find and download them by searching for the file name or the serial number.

After uploading a license, I can find it by searching for the file name, but not by searching for the serial number.

I checked the log: the extractor seems to identify the file's MIME type as 'application/octet-stream'. Even after I added 'lic' to the extensions:
added_lic_extension.png
the extractor still did not work correctly.
Code: Select all
2017-11-29 11:47:25,942 [http-bio-0.0.0.0-8181-exec-1] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/licenses/Used_licenses/2017/20171009/00:25:82:07:11:02.lic': Full text indexing of 'application/octet-stream' is not supported
2017-11-29 11:47:25,958 [http-bio-0.0.0.0-8181-exec-1] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=bf61988e-9b6d-450b-8b0c-5bc321602733, docPath=/okm:root/licenses/Used_licenses/2017/20171009/00:25:82:07:0B:DA.lic, docVerUuid=d271f150-2170-4050-8ba7-0f8a9d5c88cf, date=Tue Nov 28 15:56:32 HKT 2017}
2017-11-29 11:47:25,959 [http-bio-0.0.0.0-8181-exec-1] WARN  com.openkm.extractor.RegisteredExtractors- Text extraction failure: Full text indexing of 'application/octet-stream' is not supported
2017-11-29 11:47:25,959 [http-bio-0.0.0.0-8181-exec-1] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/licenses/Used_licenses/2017/20171009/00:25:82:07:0B:DA.lic': Full text indexing of 'application/octet-stream' is not supported
But the utility tool "Check text extraction" works:
check_extraction.png
My questions:
1. Can you tell me how to set the MIME type to 'application/plain' for these files?
2. There are millions of license files named after MAC addresses, each containing one serial number. People sometimes search for them by serial number, so I want to enable full-text indexing, but doing so will take up a lot of space in the database. Do you have any suggestions for my case?

Thanks.
 #44963  by jllort
 
Go to administration and set the MIME type to "text/plain" for this extension.
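To illustrate why an unregistered extension ends up as 'application/octet-stream', here is a minimal sketch using Python's standard mimetypes module. It is only an analogy for extension-based lookup and not OpenKM's actual implementation; the file name and the fallback function are assumptions made for the example.

```python
import mimetypes


def guess_mime(filename, registry):
    """Look up the MIME type by extension, falling back to generic binary."""
    mime, _encoding = registry.guess_type(filename)
    return mime or "application/octet-stream"


# A private registry that starts with Python's built-in extension table only.
registry = mimetypes.MimeTypes()
name = "00_19_3b_05_14_1e.lic"  # example name in the style used in this thread

before = guess_mime(name, registry)      # '.lic' is unknown: generic fallback
registry.add_type("text/plain", ".lic")  # register the extension
after = guess_mime(name, registry)       # now resolved as plain text

print(before, "->", after)
```

Once the extension is mapped to a text type, a text extractor has something it can work with; documents stored before the change keep whatever type was recorded at upload time.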
If you have already inserted the documents, you can modify the MIME type currently registered in the database. You can try this:
Code: Select all
UPDATE OKM_NODE_DOCUMENT SET NDC_MIME_TYPE = 'text/plain' WHERE NDC_NAME LIKE '%.lic';
Then, to put all the documents back in the indexing queue:
Code: Select all
UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED = 'F' WHERE NDC_NAME LIKE '%.lic';
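Before running statements like these against a production database, it can help to rehearse them on a throwaway copy. The sketch below uses an in-memory SQLite table that borrows only the three column names from the statements above; the real OpenKM schema has more columns and runs on a different database engine, the sample rows are made up, and the '%.lic' pattern assumes the files end in .lic.

```python
import sqlite3

# Toy stand-in for the real table: only the columns the statements touch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NDC_NAME TEXT, NDC_MIME_TYPE TEXT, NDC_TEXT_EXTRACTED TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
    [
        ("00_25_82_07_11_02.lic", "application/octet-stream", "T"),
        ("manual.pdf", "application/pdf", "T"),  # must stay untouched
    ],
)

# Fix the MIME type, then flag the same rows for text extraction again.
conn.execute(
    "UPDATE OKM_NODE_DOCUMENT SET NDC_MIME_TYPE = 'text/plain' "
    "WHERE NDC_NAME LIKE '%.lic'"
)
conn.execute(
    "UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED = 'F' "
    "WHERE NDC_NAME LIKE '%.lic'"
)

rows = conn.execute(
    "SELECT NDC_NAME, NDC_MIME_TYPE, NDC_TEXT_EXTRACTED "
    "FROM OKM_NODE_DOCUMENT ORDER BY NDC_NAME"
).fetchall()
for row in rows:
    print(row)
```

Only the .lic row changes; the PDF row keeps its type and its extracted flag.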
Is the docs file an updated MIME type definition? It is more convenient for us if you fork the code at https://github.com/openkm/document-management-system and open a merge request with your changes. (The merge request should target the 6.3-DEV branch.)
 #44968  by maniachhz
 
jllort wrote: Thu Nov 30, 2017 6:45 pm Go to administration and update the "text/plain" for these extension.
@jllort,

After setting 'text/plain' for this extension, re-indexing with the utility tool, and then re-uploading the files, the extractor works as I expected.

Two more questions:
There are millions of license files (each file is 128 bytes) named after MAC addresses, each containing one serial number. People sometimes search for them by serial number, so I must enable full-text indexing, but it will take up a lot of space in the database.
1. In my case, do you think I need to enable full-text indexing for these files?
2. I tried uploading these small files (128 bytes) by calling the SOAP API from Python; it took 24 hours to upload 400 thousand files. Is there a quicker way to upload these files? Using the REST API?
Last edited by maniachhz on Wed Dec 06, 2017 11:04 am, edited 1 time in total.
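For scale, the figures quoted in this post can be turned into a rate and a rough size estimate. The sketch below uses only the numbers given above (400 thousand files in 24 hours, 128 bytes per file); the one-million-file projection is an illustrative assumption, and the size figure covers the raw text only, not whatever overhead the database or the search index adds.

```python
files_uploaded = 400_000
elapsed_seconds = 24 * 60 * 60

# Observed throughput of the SOAP upload.
files_per_second = files_uploaded / elapsed_seconds
ms_per_file = 1000 * elapsed_seconds / files_uploaded

# Projection at the same rate for one million files (illustrative figure).
hours_per_million = 1_000_000 / files_per_second / 3600

# Raw text volume if every 128-byte file is stored as extracted text.
raw_text_mb_per_million = 1_000_000 * 128 / 1_000_000

print(f"{files_per_second:.2f} files/s, {ms_per_file:.0f} ms per file")
print(f"{hours_per_million:.0f} hours per million files")
print(f"{raw_text_mb_per_million:.0f} MB of raw text per million files")
```

That works out to about 4.6 files per second (roughly 216 ms per file), about 60 hours per million files at the same rate, and about 128 MB of raw text per million files.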
 #44977  by jllort
 
Are you using sequential or parallel logic? If you go for parallel, create as many threads as cores.
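A minimal sketch of the "as many threads as cores" idea, using Python's standard thread pool. The upload_one function is a hypothetical placeholder: it stands in for whatever SOAP or REST call actually sends one file, and here it simply returns the path so the example runs on its own.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def upload_one(path):
    """Hypothetical placeholder for the real per-file upload call."""
    return path


def upload_all(paths, upload=upload_one, workers=None):
    """Upload paths in parallel, using one worker thread per CPU core by default."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the input paths.
        return list(pool.map(upload, paths))


paths = [f"license_{i:04d}.lic" for i in range(8)]  # made-up file names
results = upload_all(paths)
print(len(results), "files processed")
```

Whether more threads actually help depends on where the time goes; if the server side is the bottleneck, extra client threads may change little.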
Another thing to consider is where you are copying the documents; I suppose into separate folders? Here the logic is important, for example if you plan a logic based on:
1- create or check the folder (some extra time)
2- upload the file into the existing folder

If you decide on another kind of cataloging (upload each 1000 documents into folder X, the next ones into X+1, etc.) you can gain a lot of performance, but for this kind of solution the cataloging must be done through automation. The idea is to upload everything into the same folder, /okm:import, and then, with an automation class linked to the document creation event, catalog each document in the background.
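The "each 1000 documents into folder X, the next ones into X+1" scheme comes down to deriving the target folder from a running counter. A minimal sketch follows; the root path and the zero-padded folder naming are assumptions made for illustration, not an OpenKM convention.

```python
from collections import defaultdict


def bucket_folder(index, per_folder=1000, root="/okm:root/licenses"):
    """Return the target folder for the document with this running index."""
    return f"{root}/{index // per_folder:06d}"


def plan_buckets(names, per_folder=1000, root="/okm:root/licenses"):
    """Group document names by target folder without touching any server."""
    plan = defaultdict(list)
    for index, name in enumerate(names):
        plan[bucket_folder(index, per_folder, root)].append(name)
    return dict(plan)


names = [f"license_{i:05d}.lic" for i in range(2500)]  # made-up file names
plan = plan_buckets(names)
for folder, members in plan.items():
    print(folder, len(members))
```

Because the folder is computed rather than looked up, the "create or check folder" step is only needed once per bucket instead of once per file.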

If you work without security (remove all roles and users) you can gain some extra performance. It all depends on how the system will be used, by how many users, etc. It is quite difficult to suggest a direction without more detailed information about what the system will be used for and whether it is for many users or a single user. With that information I can give you some clues.
 #45004  by maniachhz
 
Thanks jllort.

Previously I uploaded into a directory layout like this:
previous_directory.png
Now I have added deeper levels to the directory layout, as shown below, with a limit of 500 documents per folder, and uploaded with 4 parallel uploads (I also tried 10, 8, 5 and 3), but the upload speed is still below my expectations.
Next time I will try the import utility in the GUI; maybe it is a better option.
now_directory.png
Since I added deeper levels to the directory layout, I now need to purge the previous folders. I tried to purge them in the web GUI, but it failed because there are too many files (some folders contain more than 300 thousand documents).

I can find every document's UUID and nbs_name in the okm_node_base table, but I don't know which associated UUIDs need to be deleted in other tables.
I want to purge the documents by removing their UUIDs in the database backend. Can you tell me all the tables that hold associated UUIDs? (I mean, a UUID may be stored in several database tables, and I need to purge all of them.)
Last edited by maniachhz on Fri Dec 08, 2017 12:06 am, edited 1 time in total.
 #45017  by jllort
 
You must purge the documents through the API; otherwise garbage will be left in the repository (database or file system). You can write a small script to do it: first remove only the documents from the trash, and at the end the folders.

Take a look:
https://docs.openkm.com/kcenter/view/ok ... rsal-.html
https://docs.openkm.com/kcenter/view/ok ... html#purge
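The ordering described here (documents first, folders last) is a post-order traversal. The sketch below shows only that ordering: list_children, purge_document and purge_folder are hypothetical callables that would wrap the real API calls from the linked documentation, and the in-memory tree is made up so the example runs on its own.

```python
def purge_tree(folder, list_children, purge_document, purge_folder):
    """Purge every document under folder, then the folder itself."""
    for kind, path in list_children(folder):
        if kind == "folder":
            purge_tree(path, list_children, purge_document, purge_folder)
        else:
            purge_document(path)
    # The folder is empty at this point, so it can be purged last.
    purge_folder(folder)


# Made-up tree standing in for the repository.
tree = {
    "/old": [("folder", "/old/2017"), ("document", "/old/a.lic")],
    "/old/2017": [("document", "/old/2017/b.lic")],
}
log = []
purge_tree(
    "/old",
    list_children=lambda folder: tree.get(folder, []),
    purge_document=lambda path: log.append(("document", path)),
    purge_folder=lambda path: log.append(("folder", path)),
)
for entry in log:
    print(entry)
```

With folders holding hundreds of thousands of documents, a real script would probably also want to page through children and log progress so that it can be resumed.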
 #45026  by maniachhz
 
Hi jllort,

I tried uploading the files with the import utility, but the upload speed is also slow.
spend_time.png
Maybe just uploading a tgz (tar.gz) archive to the server is a good way. Does OpenKM support extracting the names of the files compressed inside a tgz? (I mean: after uploading the tgz archive to the server, e.g. when the file 00:00:00:11:22:33.lic is inside the archive 2017001188.tgz, searching for the name 00:00:00:11:22:33.lic in the web GUI should return 2017001188.tgz.) If this feature is not supported, can you tell me how to do it?

Thanks
 #45032  by jllort
 
Zip files are supported, but I think tar.gz is not. Anyway, how big is this file? OpenKM does not do magic: when you upload a zip file, OpenKM uncompresses it into a folder on the file system and then uploads the files in the same way as when you upload files directly from the file system.
 #45035  by maniachhz
 
Anyway how big is this file ... because OpenKM does not do magic.
Each file inside the archive (zip file) is 677 bytes.
When you upload a zip file, openkm uncompress in the file system ( folder ) and then upload in the same way you are uploading files directly from file system.
There are too many files (millions) once OpenKM uncompresses the zip files, which makes file system backups difficult.
What I need: upload the zip file and have OpenKM extract the names of the files inside it, without uncompressing them to the file system. Any suggestions?

Thanks.
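Reading the names of the files inside a zip archive does not require uncompressing anything to disk, because the archive carries a directory of its entries. The sketch below shows this with Python's standard zipfile module; it is not an OpenKM feature, and the archive built in memory, its entry names and its contents are made up for the example.

```python
import io
import zipfile


def zip_entry_names(data):
    """Return the file names stored in a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # infolist() reads the archive's directory; no entry is extracted.
        return [info.filename for info in archive.infolist() if not info.is_dir()]


# Build a small archive in memory as stand-in test data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00_19_3B_05_14_1c.lic", "placeholder content 1")
    archive.writestr("00_19_3b_05_14_1e.lic", "placeholder content 2")

names = zip_entry_names(buffer.getvalue())
searchable_text = "\n".join(names)  # one name per line, ready to be indexed
print(searchable_text)
```

A list like this is the kind of text that would have to reach the search index for a search on an inner file name to return the archive.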
 #45062  by maniachhz
 
Hi jllort,

I have an idea for my case.

I uploaded the zip file to OpenKM, then manually updated ndc_text with an SQL client, matching on nbs_uuid:
Code: Select all
update okm_node_document 
set ndc_language='en', ndc_text='00_19_3B_05_14_1c.lic,00_19_3b_05_14_1e.lic', ndc_text_extracted='T' 
where nbs_uuid='b60898d9-49c4-4617-bbc8-7de0c8b2f65b' 
nbs_text.png

and rebuilt the indexes with the utility "Rebuild indexes".

Then I searched for the name 00_19_3b_05_14_1e.lic in the GUI (I also tried searching for 00_19_3B_05_14_1c.lic,00_19_3b_05_14_1e.lic), but got nothing.
Can you tell me what I am missing? Or how can I add the ndc_text value to Lucene manually so that it becomes searchable?
search_content.png
 #45079  by jllort
 
Unless you create your own TextExtractor from the source code, you will not succeed with this. Take a look at some of the existing ones at https://github.com/openkm/document-mana ... /extractor for example https://github.com/openkm/document-mana ... actor.java
 #45105  by maniachhz
 
Hi jllort,

I want to update the Lucene index manually with PyLucene (this is not a good idea, but I am more familiar with Python than with Java, and it is only a temporary measure).

Can you tell me how to find out which Lucene version I am currently using? ( The OpenKM I currently use was downloaded from https://sourceforge.net/projects/openkm/files/6.3.4/).
 #45108  by jllort
 
Consider making a backup before this kind of experiment, and obviously do it with OpenKM stopped.

Here is the official community code.
https://github.com/openkm/document-management-system

In this file you will see all the library versions
https://github.com/openkm/document-mana ... er/pom.xml
The Lucene version is 3.1.0.
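Looking up a library version in a Maven pom.xml can also be scripted. The sketch below parses a made-up, minimal POM fragment with Python's standard XML parser; the sample content is an assumption written for this example (only the 3.1.0 figure comes from this thread), and a real pom.xml may declare versions through property placeholders, which this sketch does not resolve.

```python
import xml.etree.ElementTree as ET

# Maven POM files declare this default XML namespace.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Made-up sample; the real pom.xml is much longer.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>3.1.0</version>
    </dependency>
  </dependencies>
</project>"""


def dependency_versions(pom_text, group_id):
    """Map artifactId to declared version for one groupId."""
    root = ET.fromstring(pom_text)
    found = {}
    for dep in root.iterfind(".//m:dependency", NS):
        if dep.findtext("m:groupId", default="", namespaces=NS) == group_id:
            artifact = dep.findtext("m:artifactId", default="", namespaces=NS)
            found[artifact] = dep.findtext("m:version", default="", namespaces=NS)
    return found


versions = dependency_versions(SAMPLE_POM, "org.apache.lucene")
print(versions)
```

Knowing the exact version matters here because an index written by one Lucene release is not necessarily readable or writable by a client built against another.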
