First sync of big folders when most of the files are the same

stefanochiappini · November 7, 2017, 5:21pm

Hello everybody,
I'm migrating to ownCloud from a previous cloud storage solution "XYZ".
I successfully installed the server (10.0.3) and the desktop client on two Windows workstations.

Now I have the following problem: on the two PCs (let's say PC-home and PC-work) I have the same folder "cloud_data" with 200 GB of data, already synchronized with the old XYZ software.
On PC-work, after stopping the XYZ client, I moved the cloud_data folder inside the designated ownCloud folder, and all the data inside it was correctly copied to the ownCloud server.
Then I did the same on PC-home (stop the XYZ client and move the folder). I was expecting a very fast sync between
PC-home and the server, because the folder content was exactly the same. Instead it started an endless
upload and download process, because all files were considered new, regardless of their actual content.

Now my question is: did I do something wrong (and if so, what?), or is it true that ownCloud can't recognize
two identical files before uploading/downloading to/from the server?
Is there an easy way to implement this feature?

Thanks in advance for any hint.
S.

dragotin · November 7, 2017, 9:06pm

It is true that the ownCloud client by default does not compare files by their content. The reason is very simple: comparing files byte by byte takes time, and for big files the cost can be pretty severe.

That is why we decided to keep a little database (the "sync journal") that contains something like a fingerprint of each file. That is used to detect changes.
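
As a rough illustration only: the sketch below is not the client's actual journal format, which is not described in this thread. The fingerprint fields (size and modification time) and the names `fingerprint` and `changed_files` are assumptions made for the example. It shows why a file with no journal entry is treated as changed, without its content ever being read:

```python
import os
import tempfile

def fingerprint(path):
    """Cheap fingerprint (assumed fields): size and mtime, no content read."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def changed_files(paths, journal):
    """Return paths whose fingerprint differs from the journal entry,
    including paths that have no journal entry at all."""
    return [p for p in paths if journal.get(p) != fingerprint(p)]

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "a.txt")
    with open(path, "w") as f:
        f.write("hello")

    journal = {}                              # fresh client: empty journal
    print(changed_files([path], journal))     # the file is listed: no entry yet

    journal[path] = fingerprint(path)         # entry recorded after a sync
    print(changed_files([path], journal))     # []: fingerprint matches
```

With an empty journal every file looks new, which would be consistent with the behaviour described in the first post.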

So for your concrete use case: it is best to go through the lengthy process again, letting the client build up the journal on both machines. I would recommend letting PC-work sync the files up to the server, and letting PC-home sync them down from the server. That way you have a clean setup.

stefanochiappini · November 8, 2017, 2:11pm

Thank you dragotin for your explanation.
Unfortunately the full process can take days (or weeks) on a slow network link, considering
also the need to repeat the procedure for a laptop... a big waste of time.

From what I remember from the documentation of the other XYZ cloud storage solution I was using before,
the server applied a sort of "hashing" function to the uploaded files, and that information
could give two important advantages:
1) recognizing which files are already present at the destination or unchanged, thus avoiding unnecessary data transfer;
2) data deduplication on the storage side: if the same file is stored in several folders, even belonging to different users, it is stored only once.
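
A toy Python sketch of both points, for illustration only. It does not reflect how XYZ or ownCloud are implemented; SHA-256 as the hash and the `DedupStore` class with its methods are assumptions made for the example.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: each distinct content is kept once."""

    def __init__(self):
        self.blobs = {}   # sha256 hex digest -> file content
        self.names = {}   # (user, path) -> sha256 hex digest

    def has_content(self, digest):
        """Point 1: a client could ask this before transferring any data."""
        return digest in self.blobs

    def put(self, user, path, data):
        """Register data under (user, path).
        Returns True only if the content was not stored before."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.names[(user, path)] = digest   # point 2: names share one blob
        return is_new

store = DedupStore()
print(store.put("alice", "docs/report.pdf", b"same bytes"))   # True
print(store.put("bob", "backup/report.pdf", b"same bytes"))   # False
print(len(store.blobs))                                       # 1
```

Two users and two paths end up referring to a single stored copy, and the digest lookup is what would let a client skip the upload.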

I think these features are very useful, both from the user perspective and for the administrators, and I would like to see them implemented in ownCloud too.
If it might be useful, I can provide the link to the source code of the XYZ cloud solution (the code is free, the project is now dead and the company that was developing it no longer exists). Maybe it could give hints to the developers...

Any comment is welcome, especially suggestions on how to proceed, in case what I said sounds interesting to other people too.

Bye,
S.

dragotin · November 8, 2017, 8:20pm

Yes, you are right. A hash over the content of files is useful for the use cases that you mention. We did not do that back then for various reasons: you need to realise that ownCloud, especially the server side, is a very open and integrating system. That makes it hard to implement the calculation of the hashes in some cases, for example when integrating external storages. Today that would probably be solvable, but five or six years ago we saw challenges.

That said, I think your use case is valid. If I am not mistaken, nowadays the server already knows the hashes (or checksums). So what the client could do if it does not have a sync journal is:

1) For every local file, check if a file with the same name and size exists on the server.
2) If it does, calculate the hash of the file locally and compare that to the one on the server.
3) If both are equal, the client knows that both files are the same, and it can fill the entry in the local sync journal from the values from the server, without downloading the file.

Maybe. Not sure
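
The steps above could be sketched roughly as follows in Python. This is a sketch under stated assumptions, not the client's code: `server_index` stands in for whatever the server would report (here a plain dict of relative path to size and checksum), SHA-1 is an assumed checksum type, and `reconcile` and `sha1_of` are hypothetical names.

```python
import hashlib
import os
import tempfile

def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def reconcile(local_root, server_index, journal):
    """server_index: relative path -> (size, checksum), hypothetical format.
    Fills `journal` for files proven identical and returns the sorted
    relative paths that still need a real upload or download."""
    to_sync = []
    for dirpath, _dirs, files in os.walk(local_root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, local_root).replace(os.sep, "/")
            remote = server_index.get(rel)
            # Same name and same size on the server?
            if remote is None or remote[0] != os.path.getsize(full):
                to_sync.append(rel)
            # Only then pay for hashing the local file.
            elif sha1_of(full) == remote[1]:
                journal[rel] = remote       # identical: no transfer needed
            else:
                to_sync.append(rel)
    return sorted(to_sync)

with tempfile.TemporaryDirectory() as root:
    for name, data in [("same.txt", b"abc"), ("edited.txt", b"abd"), ("new.txt", b"zzz")]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(data)
    abc = hashlib.sha1(b"abc").hexdigest()
    server_index = {"same.txt": (3, abc), "edited.txt": (3, abc)}
    journal = {}
    print(reconcile(root, server_index, journal))   # ['edited.txt', 'new.txt']
    print(sorted(journal))                          # ['same.txt']
```

Checking name and size first keeps the expensive hashing limited to files that are plausible matches.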

I think this is worth a ticket in the github bugtracker, marked as Enhancement. If you create one, please cc me on that.

stefanochiappini · November 9, 2017, 4:20pm

dragotin wrote: "I think this is worth a ticket in the github bugtracker"

Well, then I'll open it and see what happens.
Do you think it belongs in the "core" or the "desktop client" repository?
From your description it looks like the client has to do all the work, while the server already has what is needed.

Bye,
S.

dragotin · November 9, 2017, 5:32pm

Yes, client is the right place.

stefanochiappini · November 10, 2017, 10:06am

Ok, I opened issue #6153. I'm not sure if I used
the correct way to CC you (sorry, I don't have much practice with GitHub).

Bye,
S.