No primary keys on tables

Here we will discuss how to make customizations and improvements to the OpenKM source code.
 #24441  by Alexander
 
Hi.

I'm trying to create a "federated" OpenKM installation using PostgreSQL and Bucardo.

I ran into tables without primary keys.

42. Table: public.okm_node_mail_bcc DB: mydb_node1 PK: none
48. Table: public.okm_node_mail_cc DB: mydb_node1 PK: none
49. Table: public.okm_node_mail_reply DB: mydb_node1 PK: none
50. Table: public.okm_node_mail_to DB: mydb_node1 PK: none

etc.

Is there any reason to create these tables without primary keys?
 #24458  by jllort
 
I've taken a look at:
Code: Select all
SELECT * FROM OKM_NODE_MAIL_BCC;

It really is created from Hibernate annotations and has two columns, NMB_NODE and NML_BCC. This is the internal definition:
Code: Select all
@ElementCollection
@Column(name = "NML_BCC")
@CollectionTable(name = "OKM_NODE_MAIL_BCC", joinColumns = { @JoinColumn(name = "NMB_NODE") })
@Field(index = Index.TOKENIZED, store = Store.YES)
@FieldBridge(impl = SetFieldBridge.class)
private Set<String> bcc = new HashSet<String>();
NMB_NODE is a foreign key (the primary key of NodeMail) and NML_BCC is a mail string (a BCC address). It can happen, for example, that the same BCC address appears twice for the same mail, so a unique constraint on (NMB_NODE, NML_BCC) would not be valid, and a primary key on (NMB_NODE) alone would not be valid either. We could probably add some annotation to generate an automatic primary key, but I do not know whether it would bring any benefit, because this table is only used to store the BCC addresses related to a mail node. The internal query should be something like:
Code: Select all
select * from OKM_NODE_MAIL_BCC where NMB_NODE='node uuid';
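
For example (purely hypothetical data, just to illustrate the point), nothing in the schema stops the same address appearing twice for the same node:
Code: Select all
-- Both inserts succeed because there is no primary key or unique constraint,
-- which is also why (NMB_NODE, NML_BCC) could not have been declared as the key.
INSERT INTO OKM_NODE_MAIL_BCC (NMB_NODE, NML_BCC)
VALUES ('some-node-uuid', 'someone@example.com'),
       ('some-node-uuid', 'someone@example.com');

SELECT * FROM OKM_NODE_MAIL_BCC WHERE NMB_NODE = 'some-node-uuid';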
 #24472  by Alexander
 
On one hand, database theory strongly recommends NOT using tables without primary keys, because they cause a lot of problems (C.J. Date - Database Design and Relational Theory: Normal Forms and All That Jazz (Theory in Practice)); one of those problems is replication, which is what my post is about.

On the other hand, is it a Hibernate problem?

I found a ticket in the Hibernate Jira:
https://hibernate.atlassian.net/browse/HHH-7750

Is that the reason?
 #24486  by jllort
 
It's exactly the same case. I do not know whether it's a problem with the Hibernate annotations; with Hibernate it is not a good idea to think in terms of the table implementation, but either way it obviously ends up as a table or a couple of them.

What do you have in mind to do with the database? You're talking about replication... are you thinking about some database replication configuration? A cluster or similar? And if that's the case, which database do you have in mind?
 #24535  by Alexander
 
jllort wrote:It's exactly the same case. I do not know whether it's a problem with the Hibernate annotations; with Hibernate it is not a good idea to think in terms of the table implementation, but either way it obviously ends up as a table or a couple of them.

What do you have in mind to do with the database? You're talking about replication... are you thinking about some database replication configuration? A cluster or similar? And if that's the case, which database do you have in mind?
I'm trying to replicate the database over a "far" (long-distance) internet link with master-master replication, using Bucardo on top of PostgreSQL (http://bucardo.org/).

I solved the primary key problem by writing a stored procedure that adds primary keys to those tables.
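
It does roughly the equivalent of this for every table reported without a PK (a simplified sketch; the sync_id column name is my own, Bucardo just needs some unique, not-null key):
Code: Select all
-- Add a surrogate bigserial primary key to each of the PK-less collection tables.
DO $$
DECLARE
  t text;
BEGIN
  FOREACH t IN ARRAY ARRAY['okm_node_mail_bcc', 'okm_node_mail_cc',
                           'okm_node_mail_reply', 'okm_node_mail_to'] LOOP
    EXECUTE format('ALTER TABLE public.%I ADD COLUMN sync_id BIGSERIAL PRIMARY KEY', t);
  END LOOP;
END $$;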
 #24543  by jllort
 
Without primary keys, the database sync application is not able to work correctly?
If in a few days you get this kind of configuration running and you want to share it with the community, we will open a section on the wiki for it.
 #24545  by Alexander
 
jllort wrote:Without primary keys, the database sync application is not able to work correctly?
Yes. Primary keys are required by many (maybe all?) row-based replication applications. (The other type is log-based replication.)
jllort wrote: If in a few days you get this kind of configuration running and you want to share it with the community, we will open a section on the wiki for it.
I've got "large object NNN does not exists" error.
The reason is method used postgresql to store BLOB data. Bucardo not able to replicate pg_largeobject system table.
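
The binary content is not stored in the OpenKM tables themselves; each row only keeps an OID reference into that system catalog, for example:
Code: Select all
-- Large objects live in the pg_largeobject system catalog (PostgreSQL 9.0+),
-- so copying the user tables alone never transfers the actual BLOB data.
SELECT count(*) FROM pg_largeobject_metadata;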
 #24548  by jllort
 
We use LOBs for long character data, but binary information (I mean file binary data) is normally stored on the hard disk, although it can be configured to be stored in the DBMS. This kind of table column is needed to store text strings longer than 2048 characters.

I have not asked yet: why do you want to replicate the databases? Are they on the same network or not? If they are not on the same network, why are you not looking at a master/slave PostgreSQL cluster configuration (MySQL has some kind of configuration along these lines, and I suppose PostgreSQL has a similar solution)?

What is the final goal you're looking for?
 #24551  by Alexander
 
jllort wrote:We use LOBs for long character data, but binary information (I mean file binary data) is normally stored on the hard disk, although it can be configured to be stored in the DBMS. This kind of table column is needed to store text strings longer than 2048 characters.

I have not asked yet: why do you want to replicate the databases? Are they on the same network or not? If they are not on the same network, why are you not looking at a master/slave PostgreSQL cluster configuration (MySQL has some kind of configuration along these lines, and I suppose PostgreSQL has a similar solution)?

What is the final goal you're looking for?
The idea is to create a configuration with two active OpenKM installations, one in Europe and one in China, with continuous data exchange between them. A single active installation is not good because of the poor quality of the internet channel between China and Europe.

A solution with rsync for the files and Bucardo for the DB seems good, but for now I'm trying to solve it with DB replication.
 #24555  by jllort
 
I imagined something like what you're talking about. Two more questions:
- Are you a Java developer (do you have Java skills)?
- Will both installations be able to write new documents, delete, etc.? Are you thinking of bidirectional sync or only one direction?
 #24558  by Alexander
 
jllort wrote:I imagined something like what you're talking about. Two more questions:
- Are you a Java developer (do you have Java skills)?
I have some skills, but my Java level is not high.
jllort wrote: - Will both installations be able to write new documents, delete, etc.? Are you thinking of bidirectional sync or only one direction?
Bidirectional sync (master-master replication).
 #24559  by jllort
 
If you're planning bidirectional sync, then stop for a while, because syncing only via the database is not enough. That's why I asked whether you have some Java skills. We have been thinking a lot about replication; we've actually solved the clustering configuration (on the same intranet) and the master-slave mirror, but the master-master case, although it has been thought about, is not implemented (or solved).

There are several problems with this kind of synchronization; I will try to give you a minimal example:
Uploading a file through the API executes an internal process which calls the Lucene indexer, and then the file is indexed. But if you make this change only at the database level (without calling the API), the application will never know that this file should be indexed. The problem you'll get is that files uploaded in master A will be searchable in master A but not in B, and the same with B.

Synchronization cannot be done only at the database or file (rsync) level; although it is a good approximation (and valid for master-slave), it is not valid for a master-master configuration.

Option 1
Based on the solution you propose: track new files on master A and master B and create a process to index those specific files with Lucene. I think that could be enough, but although it sounds easy it is not so trivial, because we can fall into an infinite cycle of updates caused by a database update loop (a change in master A needs a change in master B to reindex, which generates a database change in master B, which fires a change in master A). I think it can be solved without ending up in an infinite loop, but it must be thought through with care.

Option 2
We have thought of doing it with a Java task based on the activity log operations, extending the import/export features, plus some minimal table sync (basically users, roles and other administration tables, but not everything). The idea is that, for example, these actions appear in the log:
1- Created folder XX
2- Created document path YY
3- Renamed
4- Moved
5- Changed grants

Evaluate the sequence and create export actions with the related files (similar to export, but as a collection of actions to be executed in a certain order), and then something similar to import, applied in the same order (obviously there can be collisions, problems, folders that do not exist, etc.). This option is based on syncing every few minutes, every hour or only daily (that is the idea of this option); basically it is not live sync.
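
As a rough illustration of the idea (the table and column names here are from memory and not verified), the task would periodically read the new activity records and turn them into an ordered list of export actions:
Code: Select all
-- Hypothetical sketch: pick up the actions recorded since the last successful
-- sync and replay/export them in order on the other master.
SELECT act_date, act_user, act_action, act_item, act_params
  FROM okm_activity
 WHERE act_date > '2013-08-01 00:00:00'   -- timestamp of the last sync
 ORDER BY act_date;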

Option 3
Real-time sync via web services (that can make the system slower, and I think it does not solve the connectivity problem you have).

We do not have enough hands - at this moment - to do Option 2, which is the more interesting one for us, although Option 1 is also interesting. If you're able to work on it, contact us via http://www.openkm.com/en/contact.html indicating the URL of this post and we will contact you directly. I think it is an interesting feature (configuration), and if you wish to work on it we will try to help you more closely.
 #24561  by Alexander
 
jllort wrote:If you're planning bidirectional sync, then stop for a while, because syncing only via the database is not enough. That's why I asked whether you have some Java skills. We have been thinking a lot about replication; we've actually solved the clustering configuration (on the same intranet) and the master-slave mirror, but the master-master case, although it has been thought about, is not implemented (or solved).

There are several problems with this kind of synchronization; I will try to give you a minimal example:
Uploading a file through the API executes an internal process which calls the Lucene indexer, and then the file is indexed. But if you make this change only at the database level (without calling the API), the application will never know that this file should be indexed. The problem you'll get is that files uploaded in master A will be searchable in master A but not in B, and the same with B.
Are there other pitfalls, apart from search, in this scenario?
jllort wrote:
Synchronization cannot be done only at the database or file (rsync) level; although it is a good approximation (and valid for master-slave), it is not valid for a master-master configuration.

Option 1
Based on the solution you propose: track new files on master A and master B and create a process to index those specific files with Lucene. I think that could be enough, but although it sounds easy it is not so trivial, because we can fall into an infinite cycle of updates caused by a database update loop (a change in master A needs a change in master B to reindex, which generates a database change in master B, which fires a change in master A). I think it can be solved without ending up in an infinite loop, but it must be thought through with care.

Option 2
We have thought of doing it with a Java task based on the activity log operations, extending the import/export features, plus some minimal table sync (basically users, roles and other administration tables, but not everything). The idea is that, for example, these actions appear in the log:
1- Created folder XX
2- Created document path YY
3- Renamed
4- Moved
5- Changed grants

Evaluate the sequence and create export actions with the related files (similar to export, but as a collection of actions to be executed in a certain order), and then something similar to import, applied in the same order (obviously there can be collisions, problems, folders that do not exist, etc.). This option is based on syncing every few minutes, every hour or only daily (that is the idea of this option); basically it is not live sync.

Option 3
Real-time sync via web services (that can make the system slower, and I think it does not solve the connectivity problem you have).

We do not have enough hands - at this moment - to do Option 2, which is the more interesting one for us, although Option 1 is also interesting. If you're able to work on it, contact us via http://www.openkm.com/en/contact.html indicating the URL of this post and we will contact you directly. I think it is an interesting feature (configuration), and if you wish to work on it we will try to help you more closely.
Option 2 seems to be much more complicated - it requires deep understanding of OpenKM internals.
Option 1 - could it be a process that scans the DB looking for [not indexed content]?
 #24563  by jllort
 
About other possible failures, I should think about it for a while. Basically there are some conflicts that can have a problematic solution, for example:
folderX is moved in master A to location Y
folderX is moved in master B to location Z
then we try to synchronize it; here we have a conflict that should be resolved by human intervention

Another:
document X is deleted in master A (which means it has been moved into some user's trash)
document X is updated in master B with version+1

Finally:
the delete grant on folder X has been removed for user X in master A
in master B, user X deletes folder X

There are a lot of cases in a parallel scenario that we must keep in mind, and database sync will not solve them. It all depends on how close to real time the database sync is (but if we're talking real time, then you need a cluster configuration, not this kind of configuration).

For simple cases Option 1 will work fine, but precisely because it is simple, in some cases you'll get problems and you'll not be able to take control of them, which can be a problem. In my opinion - we have been thinking a lot about it during the last two years - Option 2 seems the right direction.

If you want to work on it, we will give you orientation on how to do it and help with the first steps (the basic application structure). Obviously we will analyze the document, folder, etc. cases; we will give you the cases and logic that should be taken into consideration, review your code, etc. What we cannot take on is all of the work, because we're involved in other features and do not have enough time for it. Right now most of the OpenKM team staff is on holiday; next week one of our main core developers comes back, and if you want we can think about it.

About Option 1 - I do not know anything about how the database sync behaves when the same node is modified at the same time in both masters. With more information on that I will be able to predict how it will go with OpenKM.
 #24565  by Alexander
 
Please suppose for a moment that conflict problems do not exist.
Are there problems other than conflicts - for example, do other technical problems (like the Lucene problem) exist? E.g. Hibernate uses a PostgreSQL sequence, and manual changing of the starting number is required?
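
If so, I suppose the usual trick would be to interleave the value ranges on each server, something like this (only a sketch, and I have not checked which sequence names OpenKM actually creates):
Code: Select all
-- On the Europe server: generate odd values; on the China server: even values.
ALTER SEQUENCE hibernate_sequence INCREMENT BY 2 RESTART WITH 1;  -- Europe
-- ALTER SEQUENCE hibernate_sequence INCREMENT BY 2 RESTART WITH 2;  -- China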


jllort wrote:About other possible failures, I should think about it for a while. Basically there are some conflicts that can have a problematic solution, for example:
folderX is moved in master A to location Y
folderX is moved in master B to location Z
then we try to synchronize it; here we have a conflict that should be resolved by human intervention

Another:
document X is deleted in master A (which means it has been moved into some user's trash)
document X is updated in master B with version+1

Finally:
the delete grant on folder X has been removed for user X in master A
in master B, user X deletes folder X

There are a lot of cases in a parallel scenario that we must keep in mind, and database sync will not solve them. It all depends on how close to real time the database sync is (but if we're talking real time, then you need a cluster configuration, not this kind of configuration).
Yes, you mention some of the standard replication conflict problems.
jllort wrote: For simple cases Option 1 will work fine, but precisely because it is simple, in some cases you'll get problems and you'll not be able to take control of them, which can be a problem. In my opinion - we have been thinking a lot about it during the last two years - Option 2 seems the right direction.
Option 2 has the same conflict problems. Have you considered standard database replication procedures? It seems that separate streams for metadata and data exchange solve a lot of problems for light (~300 operations per hour) installations with a permanent (but poor) connection. Typically it takes a few seconds to transfer a table row (without BLOBs).

Another way is to mark each entity with "master server" data (a column) and prohibit changes at any location other than its "master server", using something analogous to a two-phase locking algorithm for changes.
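
At the database level it could look roughly like this (only a sketch; the table, column and setting names are made up):
Code: Select all
-- Each row remembers which server owns it; a trigger rejects local changes made
-- anywhere else. Changes applied by the replication itself would need to bypass it.
ALTER TABLE okm_node_base ADD COLUMN origin_server TEXT NOT NULL DEFAULT 'europe';

CREATE OR REPLACE FUNCTION reject_foreign_changes() RETURNS trigger AS $$
BEGIN
  IF OLD.origin_server <> current_setting('okm.local_server') THEN
    RAISE EXCEPTION 'row owned by %, change it there', OLD.origin_server;
  END IF;
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_reject_foreign
  BEFORE UPDATE OR DELETE ON okm_node_base
  FOR EACH ROW EXECUTE PROCEDURE reject_foreign_changes();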
jllort wrote: If you want to work on it, we will give you orientation on how to do it and help with the first steps (the basic application structure). Obviously we will analyze the document, folder, etc. cases; we will give you the cases and logic that should be taken into consideration, review your code, etc. What we cannot take on is all of the work, because we're involved in other features and do not have enough time for it. Right now most of the OpenKM team staff is on holiday; next week one of our main core developers comes back, and if you want we can think about it.
It very much depends on complexity.
jllort wrote: About Option 1 - I do not know anything about how the database sync behaves when the same node is modified at the same time in both masters. With more information on that I will be able to predict how it will go with OpenKM.
Many algorithms exist. Some of them use "last come, last applied", others require manual resolution, and so on. In my particular case, prohibiting changes on the other server is enough.
