GSoC 2017 - Allow remote-delta on file synchronization

gsoc

#1

Hello,

I am interested in submitting a proposal for ownCloud in the scope of Google Summer of Code 2017.

I am currently a graduate student at Georgia Tech with 3+ years of professional experience in web/HTTP applications, and I have skills in Python, Java, C++ and PHP (it has been a long time since I last used PHP, but I can manage for sure).

I saw ownCloud's list for the GSoC projects but what is most appealing to me is the partial file synchronization issue:
Allow remote-delta on sync files
Sync only the file change, not entire file

Could this be a GSoC project for 2017?

An ultra-rough timeline would be:

  • May 4 - 30:
    Research ways to implement delta-file sync and decide on the solution (I saw there are already some ideas like the synching protocol or the rsync protocol), consider potential bottlenecks.

  • June:
    Implement the protocol as an HTTP extension on the server

  • July:
    Implement the protocol on the client

  • August:
    Some buffer for testing, unexpected issues and issues I have not considered in this rough timeline.

If you agree, I could write a more well-defined proposal with your help.

Thanks!
George


#2

You might find some small hints from me, e.g. https://central.owncloud.org/t/improving-data-sync-process/2517


#3

Applications open today, 20th March: https://developers.google.com/open-source/gsoc/. From April 3 to 24, organizations review and select student proposals.

@giorgosp However, you should be aware that ownCloud's distributed system does not operate on blocks of data, as Dropbox, GDrive and others do; real files are stored on the server side. This means there is no actual delta there. ownCloud needs to support object storage: https://www.druva.com/blog/object-storage-versus-block-storage-understanding-technology-differences/

This implies that delta sync does not come out of the box here. You cannot just write to an object storage file at some position, as you would with fseek and fwrite. You would need to copy the file from the storage, apply the delta and store it as [updated file]/[version].
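The copy, apply and store steps can be sketched roughly as below. This is only an illustration: `InMemoryObjectStore`, `apply_delta` and the `report.txt/v1` style keys are hypothetical names made up for the example and are not ownCloud or object storage APIs. The only assumption is that the backend offers whole-object get and put and nothing else.

```python
class InMemoryObjectStore:
    """Hypothetical stand-in for an object storage backend (whole-object get/put only)."""

    def __init__(self):
        self._objects = {}

    def get(self, key):
        return self._objects[key]

    def put(self, key, data):
        # Objects are replaced as a whole; there is no seek-and-write.
        self._objects[key] = bytes(data)


def apply_delta(store, key, new_key, offset, patch):
    """Fetch the whole object, patch a byte range in memory, store the result under new_key."""
    buf = bytearray(store.get(key))            # 1. copy the full file out of storage
    buf[offset:offset + len(patch)] = patch    # 2. apply the delta to the copy
    store.put(new_key, bytes(buf))             # 3. write the full file back as a new version


store = InMemoryObjectStore()
store.put("report.txt/v1", b"hello world")
apply_delta(store, "report.txt/v1", "report.txt/v2", 6, b"WORLD")
print(store.get("report.txt/v2"))  # b'hello WORLD'
```

Note that the old version stays untouched in this sketch, which matches the [updated file]/[version] idea above.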

This means that if you have a 10MB file locally, you need to store it on the server as a single 10MB piece. Now let's assume that you change a bit in your 10MB file and chunk it into 1MB pieces. You detect that only one 1MB piece changed, so you transfer just that 1MB to the server (yes, and this is a big win!). But when this PUT arrives on the server, the server needs one storage round trip to fetch and copy the file, then applies your delta, and then needs another round trip to write the full 10MB back to storage (however, on modern servers this is a very fast operation compared to the transfer over the network).
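The numbers in this example can be reproduced with a small sketch. It assumes fixed 1MB chunks compared at fixed offsets with SHA-256, which is an assumption for illustration and not what the ownCloud client does; `chunk_checksums` and `changed_chunks` are made-up helper names. A fixed-offset comparison like this only catches in-place edits; rsync-style rolling checksums would be needed to cope with inserted or removed bytes.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1MB, matching the example above


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """SHA-256 hex digest of every fixed-size chunk, in order."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def changed_chunks(old_sums, new_sums):
    """Indices of chunks whose checksum differs or that exist only in the new version."""
    return [i for i, s in enumerate(new_sums)
            if i >= len(old_sums) or old_sums[i] != s]


old = bytes(10 * CHUNK_SIZE)            # the 10MB file as last synced
new = bytearray(old)
new[3 * CHUNK_SIZE + 5] ^= 0x01         # flip one bit inside chunk 3
new = bytes(new)

dirty = changed_chunks(chunk_checksums(old), chunk_checksums(new))
upload_bytes = sum(len(new[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]) for i in dirty)
print(dirty, upload_bytes, len(new))    # [3] 1048576 10485760
```

So the client uploads 1MB instead of 10MB, while the server still has to read and rewrite the full 10MB object.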

If you are still interested in the project, I hope the above description helped you understand the requirements. For the application, you definitely need to know both C++ (and maybe some PHP) to develop on server and client. You need to understand HTTP well to easily get around WebDAV. You also need a well-structured understanding of how networks work. During your Summer of Code you will work with your mentor (probably me) as well as all the other developers, who will help you understand the code base.

I also strongly encourage you to submit a patch during the application period and between May 4 and 30 (the introduction session with your mentor, most probably me with the support of other colleagues). We will announce later what those patches could be, so that you get a good introduction to the code.


#4

Hi @mrow4a, thank you for the reply!

I saw the videos from the above link; however, the thread seems to be more about bundling.

From my understanding, the delta sync project has two challenges: detecting on the client which part of the file changed, and applying those changes to the object file on the server.

I am interested in this project and willing to write a more detailed proposal for it. Could you please guide me on what kinds of details I should include in the proposal?

Meanwhile, I will check the issues on GitHub in order to make a first patch. I believe I will cope just fine with the skills required by the project.

Thanks!
George


#5

Yep, the links are mostly there to give you a hint of how other things fit into the code, what the code looks like and where to search for things. For more details on sync, please go through the articles on @dragotin's blog https://dragotin.wordpress.com/2015/03/13/owncloud-etags-and-fileids/ and https://cds.cern.ch/record/1970463

For more details and small patches, I need to discuss with the other desktop guys. We have a meeting this week.

Hope to hear from you soon!


#6

I think you already have all the information required to write the project proposal on the GSoC page. After the proposal deadline, we will probably ask for some patches on client/server to verify your application (to verify that you will be capable of completing the project on time and that you know how to use tools like GitHub, make for C++ code, etc.). Please mind that the deadline for proposals is fast approaching, while the acceptance period is very long, which gives us time to verify your application.


#7

Hi @mrow4a,

I have submitted a first version of the proposal through the GSoC website.

In my reading around this project, I saw that ownCloud does not want to perform many computations on the server. As such, in my proposal I focus on incremental file upload. However, I believe that download should be addressed too, because otherwise clients will need to re-download the whole file even if only a part of it has changed. In the download case, though, the checksum comparison would have to be performed on the server. I have to think more about download; maybe there is a better solution.
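To make the download case concrete, here is a rough sketch of what I mean, not a worked-out design. It assumes fixed-size chunks and SHA-256 checksums, and `server_diff` and `client_patch` are hypothetical names: the client sends the checksums of its local copy, the server compares them against its current file and returns only the chunks that differ.

```python
import hashlib


def checksums(data, chunk_size):
    """SHA-256 hex digest of every fixed-size chunk, in order."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def server_diff(server_data, client_sums, chunk_size):
    """Server side: chunks the client is missing or holds stale, plus the current file length."""
    delta = {}
    for start in range(0, len(server_data), chunk_size):
        idx = start // chunk_size
        chunk = server_data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()   # this is the computation done on the server
        if idx >= len(client_sums) or client_sums[idx] != digest:
            delta[idx] = chunk
    return delta, len(server_data)


def client_patch(local_data, delta, total_len, chunk_size):
    """Client side: resize the local copy to the server's length and overwrite changed chunks."""
    buf = bytearray(local_data[:total_len].ljust(total_len, b"\0"))
    for idx, chunk in delta.items():
        start = idx * chunk_size
        buf[start:start + len(chunk)] = chunk
    return bytes(buf)


CHUNK = 4  # tiny chunk size so the example is easy to follow
server_copy = b"AAAABBBBCCCCDD"
local_copy = b"AAAAxxxxCCCC"

delta, total_len = server_diff(server_copy, checksums(local_copy, CHUNK), CHUNK)
patched = client_patch(local_copy, delta, total_len, CHUNK)
print(sorted(delta), patched == server_copy)  # [1, 3] True
```

The cost is that the server has to hash the whole file for every such request, unless the checksums are cached somewhere.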

Looking forward to your comments.
Thanks!