• De-duplication of data

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
 #7778  by Viral Raithatha
 
Hi,

I have been using OpenKM for a long time.
My question is: does OpenKM support de-duplication or not?

Thank you,
Viral Raithatha
 #7779  by jllort
 
I don't fully understand the de-duplication concept. Could you please be more specific and give a minimal example of what you want to do?
 #7780  by Viral Raithatha
 
What I mean is: is there a way to find identical files in the system, delete the extra copies, and keep only one file that all the references point to?

E.g. I have one file, Temp.pdf, uploaded into both Folder1 and Folder2.
Physically, Temp.pdf is stored at two different locations on the disk.
With de-duplication, only one copy of the file is saved, and both folders point to that single file.

The benefit is that it requires less disk space and the data is easier to find.

Did you get it?

Thank you,
Viral Raithatha
 #7781  by jllort
 
It could easily be done, but the major problem is, for example, this scenario:

You upload temp.pdf to folder 1 with rights 1.
Another user uploads temp.pdf to folder 2 with rights 2.

In that case, which is the correct option? We have not implemented this kind of repository compacting by default because there are problems, like the one I just described, that could have several solutions, and not all of them are valid for all environments (user needs).

A better option, I think, is to approach the problem from another point of view. Rather than adding some option at upload time, I think it is better to have a mechanism that the administrator can use to compact the repository. That could be done in two ways, or with a combination of both. The first is a report that lists duplicate files, after which you take whatever action you want. The other is an automatic task in the scheduler.

I think that is a better solution than something done by default in the OpenKM core because, as I said, there are several ways to solve it, but it seems not all of them would be valid for all users.

The logic of a scheduler task could be:
1- The scheduler task runs every day
2- Search only the files uploaded in the last day
3- Check whether there is another file with the same size (if one exists, compare the binary content; if they are equal, take some action)
4- Send a mail to some users about what the task has done
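As a rough illustration of steps 2 and 3, here is a minimal sketch in Python that works on a plain directory tree rather than on an OpenKM repository. The function name `find_recent_duplicates`, its parameters, and the use of the file modification time as a stand-in for the upload date are all assumptions made for this example; a real task would go through the OpenKM scheduler and its own API instead.

```python
import filecmp
import os
import time
from collections import defaultdict


def find_recent_duplicates(root, max_age_seconds=24 * 60 * 60, now=None):
    """Return (recent_file, older_file) pairs whose binary content is identical.

    Hypothetical helper: scans a plain directory tree, not an OpenKM repository.
    """
    now = time.time() if now is None else now

    # Index every file under root by its size in bytes.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        paths.sort(key=lambda p: (os.path.getmtime(p), p))  # oldest first
        for i, candidate in enumerate(paths):
            # Step 2: only consider files modified within the last day
            # (modification time is used here as a stand-in for upload time).
            if now - os.path.getmtime(candidate) > max_age_seconds:
                continue
            # Step 3: same size, so compare the full binary content.
            for earlier in paths[:i]:
                if filecmp.cmp(earlier, candidate, shallow=False):
                    duplicates.append((candidate, earlier))
                    break
    return duplicates
```

The returned pairs could then feed whatever action and notification mail (step 4) fits the environment.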

(A good option, better than deleting, could be to move those files to some folder and let the administrator decide whether to delete them. In your case it seems you want to delete them directly; then it might be useful to notify the user who uploaded the file that it has been deleted because it is a duplicate, and tell them where the remaining copy is. Etc.)
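The "move instead of delete" idea could look roughly like the following Python sketch, again on a plain file system. The name `quarantine_duplicates`, the review folder, and the report format are hypothetical, chosen only for this example; the report lines are the kind of text that could go into the notification mail.

```python
import os
import shutil


def quarantine_duplicates(duplicates, quarantine_dir):
    """Move each duplicate into quarantine_dir and return one report line per file.

    `duplicates` is a list of (duplicate_path, kept_path) pairs.
    Hypothetical helper: works on plain files, not on an OpenKM repository.
    """
    os.makedirs(quarantine_dir, exist_ok=True)
    report = []
    for index, (duplicate, kept) in enumerate(duplicates):
        # Prefix a counter so files with the same name do not overwrite each other.
        target = os.path.join(quarantine_dir, f"{index}_{os.path.basename(duplicate)}")
        shutil.move(duplicate, target)
        report.append(f"{duplicate} duplicates {kept}; moved to {target}")
    return report
```

Nothing is deleted here, so the administrator can still restore a file if the match turns out to be unwanted.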

Obviously that needs some extra work (OpenKM parametrization / configuration), but with it the system will do exactly what you want.

Another solution could be, for example, to assign a workflow to all documents and perform the operation after a file is uploaded, or to use scripting to do it.

I think a workflow or a scheduler task are the more powerful ways to do it.
 #7906  by pavila
 
I think hard disk space is currently cheap enough to make this feature not very useful. Anyway, if you configure the OpenKM repository to use a DataStore to store the documents, this subsystem already has this feature. You can have two testing.pdf files in your repository, but only one copy of the binary content is stored.
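To make the idea concrete, here is a toy Python sketch of one common way a store can keep a single binary for several documents: the content is keyed by a hash of its bytes, and each document path only holds that key. The `ContentStore` class is purely illustrative and is not the DataStore's actual code; that the DataStore works along these lines internally is an assumption here.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (illustrative only, not OpenKM code)."""

    def __init__(self):
        self.blobs = {}  # content digest -> binary content, stored once
        self.paths = {}  # document path -> content digest

    def put(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Identical content yields the same digest, so it is stored only once.
        self.blobs.setdefault(digest, data)
        self.paths[path] = digest
        return digest

    def get(self, path):
        return self.blobs[self.paths[path]]


store = ContentStore()
store.put("/Folder1/testing.pdf", b"%PDF example bytes")
store.put("/Folder2/testing.pdf", b"%PDF example bytes")
print(len(store.paths), "documents,", len(store.blobs), "stored binary")
```

Both paths stay independent entries, so each can keep its own permissions while sharing the same stored content.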
