Files Bundling extension - multiple files update/creation in various locations on the server

mrow4a · September 15, 2016, 3:14pm

Should I create separate development PRs?

felixboehm · September 15, 2016, 3:46pm

How many move / mkdir / delete operations are in a single sync run - expected average …?
Let us understand if bundling for other operations than put will really improve sync, right?

mrow4a · September 15, 2016, 3:56pm

These are only the data about PUT and GET operations. I think I will ask our biggest customers for more detailed characteristics of their PUT/GET/DELETE/MOVE operations, how often they do them, and so on. I also included some geolocations of users, since bundling makes the most sense for high latencies.

Also some statistics about operations at CERN, which you can find at http://cs3.ethz.ch/ :

CERN:

SWITCH:

jfd · September 20, 2016, 8:58am

Correct me if I am wrong:

The client currently calculates a list of operations for every sync run, right?
We could use a single bundle containing the sync operations in order as they have been calculated?
We still want like 2-4? parallel connections?

Does the current implementation bundle all CREATION PUT requests? Or does it create eg. a bundle per dir?

What is the reason for not bundling the UPDATE PUT requests?

Obviously, as you already stated above MOVE and MKCOL need to come before PUT and DELETE.

Why do you want to abort on DELETE errors? I’d say it depends on the error. If the resource is already gone we can ignore the error. On a 500 we should stop the sync, but I’d leave 500 out of the expected sync operations. If a PUT fails we should collect the errors and return a list of each cause. Again, depending on the error type we need to decide if we can continue or not. I currently can’t think of an error that would let us continue, Out of Quota maybe …
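The per-error decision described above could be sketched roughly as follows. This is only an illustration, not client code: the `OpResult` type, the verdict names, and the choice of 404 for "already gone" and 507 for "out of quota" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class OpResult:
    method: str  # e.g. "PUT" or "DELETE"
    path: str
    status: int  # HTTP status returned for this single operation


def decide(result: OpResult) -> str:
    """Return 'ok', 'ignore', 'report' or 'abort' for one operation."""
    if 200 <= result.status < 300:
        return "ok"
    # Assumption: deleting a resource that is already gone is harmless.
    if result.method == "DELETE" and result.status == 404:
        return "ignore"
    # Assumption: out of quota (507) is reported, but the sync may continue.
    if result.status == 507:
        return "report"
    # Any other server-side failure stops the sync run.
    if result.status >= 500:
        return "abort"
    # Remaining client errors are collected and returned as a list of causes.
    return "report"


def run(results):
    """Walk the results in order; return (completed, collected_errors)."""
    errors = []
    for result in results:
        verdict = decide(result)
        if verdict in ("report", "abort"):
            errors.append(result)
        if verdict == "abort":
            return False, errors
    return True, errors
```

Which statuses end up in which bucket is exactly the open question in the paragraph above; the mapping here is a placeholder.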

In any case let’s do this step by step. Having bundles for CREATE PUT is already a huge step forward.

Now, regarding multipart/mixed / multipart/related. While multipart/related would allow us to describe the whole sync, we would have to invent e.g. a content type containing information about binary blobs, as in RFC 2387 - The MIME Multipart/Related Content-type. Reading through the spec, it looks better suited to bundling multiple resources that can be aggregated into a single resource that can then be rendered. It looks a little fragile when one of the resources is broken / corrupted, e.g. during transmission.

multipart/mixed would require us to add headers for the target directory, eg. “Content-Location”. I think we don’t necessarily need to add an X-OC-METHOD to each body part, because at least currently we are bundling only CREATE PUT operations. We could also add UPDATE PUT without needing to give a X-OC-METHOD. We could even say that multipart/mixed body parts derive the destination from the request so an upload of images to the same folder does not need any additional headers in the body parts.

Since we require to execute MOVE/MKCOL before any PUT / DELETE we can bundle them together as well. mkdir in ownCloud is always recursive so the four bundles in order are MKCOL, MOVE, PUT, DELETE. Each could be contained in a separate multipart/mixed request. We could start a new bundle to execute operations on a different directory to save the ‘Content-Location’ header … Then again I don’t think we can gain a lot … because the new request also contains a ton of other headers…

So … no need to invent headers for multipart/mixed I think.
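To make the multipart/mixed idea concrete, here is a minimal sketch of how such a body could be assembled by hand. The function name, the default header name `X-OC-Path` for the per-part destination, and the fixed `application/octet-stream` type are assumptions for illustration only; which header actually carries the destination is still open in this thread.

```python
import uuid


def build_mixed_bundle(files, boundary=None, path_header="X-OC-Path"):
    """Build a multipart/mixed body from (destination_path, bytes) pairs.

    Returns (content_type_header_value, body_bytes).
    """
    boundary = boundary or uuid.uuid4().hex
    chunks = []
    for path, data in files:
        # Each body part carries its own destination and length.
        part_headers = (
            f"--{boundary}\r\n"
            f"{path_header}: {path}\r\n"
            "Content-Type: application/octet-stream\r\n"
            f"Content-Length: {len(data)}\r\n"
            "\r\n"
        )
        chunks.append(part_headers.encode("utf-8"))
        chunks.append(data)
        chunks.append(b"\r\n")
    # Closing delimiter ends the multipart body.
    chunks.append(f"--{boundary}--\r\n".encode("ascii"))
    return f'multipart/mixed; boundary="{boundary}"', b"".join(chunks)
```

For example, `build_mixed_bundle([("/photos/a.jpg", b"...")])` gives a body that could be sent in a single POST. If the parts derived their destination from the request URL instead, the per-part path header could be dropped entirely.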

mrow4a · September 20, 2016, 9:12am

The client currently calculates a list of operations for every sync run, right?

Yes, in the current bundle implementation the sync client orders the operations by directory:

So that all MOVE/MKDIR/PUT operations are done per directory and finish before any POST bundle job is started.

When it is finished, it means all the big files (we want big files to be synced first) and preconditions are satisfied, so that we can continue with small files. These are bundled into multipart/mixed requests and sent in 5 parallel flows. In the current implementation bundles are cross-directory, since all the preconditions have already been satisfied. This has one clear advantage: it is very easy to document and present the logic. It is also extremely easy to implement on the server, since the server just checks the paths and methods and then executes the proper operations. You have a single endpoint to do that. Header overhead will be negligible, since in separate requests you would need to send this information anyway.

I think segregation per directory makes sense only for preconditions and maybe for DELETE DIR jobs. It makes no sense for UPDATE/CREATE jobs; there the only limitation is how many files a bundle can hold and how big it can be per request. This way you could optimize it with dynamic chunking and http/2.

mrow4a · September 20, 2016, 9:14am

Hmm, I think you might be right. It makes no sense to abort everything there. I think an interesting approach could be a DELETE multipart request with a Content-Location header, so that you delete the files per directory. The parts inside the multipart are then the names of the files in that directory. But anyway, you need to put the file names in the header of each part.

Nice

mrow4a · September 20, 2016, 9:19am

Since we require to execute MOVE/MKCOL before any PUT / DELETE we can bundle them together as well. mkdir in ownCloud is always recursive so the four bundles in order are MKCOL, MOVE, PUT, DELETE. Each could be contained in a separate multipart/mixed request. We could start a new bundle to execute operations on a different directory to save the ‘Content-Location’ header … Then again I don’t think we can gain a lot … because the new request also contains a ton of other headers…
So … no need to invent headers for multipart/mixed I think.

I would not do that. If MKCOL fails, you have just transported 10MB of data for nothing. There is a need for 3 separate multipart messages (Preconditions, Transfers, Deletes). As stated before, this way you could very clearly define your sync in 3 stages, so that e.g. Preconditions are done per directory, Transfers cross-directory and Deletes per directory.
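A rough sketch of that three-stage split, assuming the client already has a flat list of `(method, path)` operations for the sync run. The function and the returned dictionary layout are made up for illustration; only the grouping rule (preconditions and deletes per directory, transfers cross-directory) comes from the post above.

```python
import posixpath
from collections import defaultdict

PRECONDITIONS = {"MKCOL", "MOVE"}
TRANSFERS = {"PUT"}
DELETES = {"DELETE"}


def plan_stages(operations):
    """Split (method, path) operations into the three sync stages."""
    preconditions = defaultdict(list)  # grouped per parent directory
    transfers = []                     # one cross-directory list
    deletes = defaultdict(list)        # grouped per parent directory
    for method, path in operations:
        directory = posixpath.dirname(path)
        if method in PRECONDITIONS:
            preconditions[directory].append((method, path))
        elif method in TRANSFERS:
            transfers.append((method, path))
        elif method in DELETES:
            deletes[directory].append((method, path))
        else:
            raise ValueError(f"unsupported method: {method}")
    return {
        "preconditions": dict(preconditions),
        "transfers": transfers,
        "deletes": dict(deletes),
    }
```

Each stage would then be sent only after the previous one has finished, so a failed MKCOL is noticed before any file data goes over the wire.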

mrow4a · September 20, 2016, 9:31am

@jfd

What do you think about the Content-Disposition header?
https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html
Section 19.5.1 Content-Disposition

This way we could include X-OC-Method and X-OC-Path in one header.

EDIT:
Ok, we cannot use that. This is not how it is meant to be used.

EDIT:
We cannot use the Content-Location and Location headers; these are response headers: List of HTTP header fields - Wikipedia

DeepDiver1975 · October 13, 2016, 1:12pm

In addition to all of this: do we have an understanding if such an approach is still necessary as soon as http/2 is used?

see HTTP pipelining - Wikipedia

mrow4a · October 13, 2016, 1:15pm

This is another topic I discussed with Kuba. I will compare and combine both things.

mrow4a · November 16, 2016, 2:58am

Update from 16 Nov:

Bundling now goes from prototype to implementation:

github.com/owncloud/core

[9.2] Bundling plugin

master ← bundling_plugin

opened 03:45PM - 10 Aug 16 UTC

mrow4a

+2476 -8

Update 16 Nov 2016:

Research of performance of cloud synchronization services like ownCloud, Seafile and Dropbox has shown that on-premise services show better performance characteristics than public clouds when syncing big files (higher transfer rates in both upload and download could be obtained due to simple implementation and smaller activity of users for specific bandwidth) and are very competitive when syncing mixtures of files. Unlike typical web services, cloud sync and share is characterized by a request load/number per user that, for some specific activity types, far exceeds the typical load on a web server. Underutilized upload/download bandwidth and long distribution tails (penalizing transfers of small files over WAN) are characteristic of services using the current ownCloud synchronization protocol. An important factor in synchronization performance is also the number of operations performed per single-file request on the web server.

Along with http/2 extensions - which will be addressed separately - this feature should reduce the impact of latency and significantly lower the number of requests to the server, making the server more lightweight, utilizing the pipe better and in turn syncing files faster.

With 0 latency, having sync on the local machine, the following scenario has been tested (in this scenario latency/locality strongly favours traditional http/1 puts):

https://s3.owncloud.com/owncloud/index.php/s/kSVQvr3y7EdmZ6b?path=%2F

1000 files of 1kB and 100 files of 100kB - total 11MB.

- Number of requests has been reduced from 1100 to 15 requests (typical number for web content)
- Sync time on the test machine has been reduced, on average, from 37s to 31s (taking also into account recent sync performance improvement for single puts in the folder, which bundling improved out of the box from concept https://github.com/owncloud/client/pull/5230, https://github.com/owncloud/client/pull/5274)

This is the profile for 1 (ONE!) PUT of a 22kB file using standard http/1:

![selection_066](https://cloud.githubusercontent.com/assets/13368647/20332335/47b62bce-abab-11e6-9391-6dc42eaa4019.png)

This is the profile for 1 (ONE!) Bundle containing 10 (TEN) 22kB files - total 220kB using standard http/1:

![selection_068](https://cloud.githubusercontent.com/assets/13368647/20332505/871be4e2-abac-11e6-97d8-b6b8f716ae01.png)

Bundled request requirements

> 1. A bundle Request receives a 207 Multi-Status Response with the individual 20x, 30x, 40x, 50x statuses for each file. It receives a 400 Bad Request response with an error message if the Request was malformed.
>
> 2. Request body can be any mime-type, with full implementation freedom.
>
> 3. Request finishes with delivering the last part of a successful response after all linked operations have been successfully finished, or is aborted immediately in case of request cancellation/termination.
>
> 4. If the request cannot be executed or the response cannot be correctly constructed, the request has to be aborted and an error 4xx-5xx has to be returned for the whole request.

We had already implemented both prototypes, for multipart/mixed and multipart/related, discussed them a lot and tried them out. I found the following limitations for each type of request, starting with the order of implementation:

Multipart/related:

- This mime type includes in the first part the list of files to be created, in a key->value manner, where the key is the path and the value is the metadata for that file. The response is created based on the keys in the metadata part, and files are reconstructed from binary contents in the request body, referenced by Content-ID
- This mime-type allows you to easily return a response for each file, because the key-value structure and validation at the beginning allow you to correctly construct the response for each of the keys/files (even if a content-id is missing, you can return a response for the specific file saying that its binary content has not been found) or to return a parsing error at the beginning.
- This is a high performance solution, where files are added to OC as they are read from the request body, and it allows you to use chunked transfer encoding for the response while files are being added.
- Disadvantage is that the list of files is specified in the first part, and the binary contents are anonymous when reading the request without the first part.
- Disadvantage is that, in case of a parsing error, the request has occupied bandwidth for nothing - this should however not happen in practice.

Multipart/mixed:

- This mime type includes in each part the headers (metadata for a file) and, in the part body, the actual file body.
- As in multipart/related, we can use chunked transfer encoding for the response.
- Advantage here is that, reading the pure request, we have independent parts in the request.
- This mime-type requires you to parse the request body on the fly in order to read the body of the request and simultaneously add it to OC. In order to correctly construct the response for each of the files, each part has to be parsed and validated, since it serves as a container for the file.
- In this mime-type, in order to validate the request at the beginning, one would need to parse the request body and save it in memory or on disk. The other option is to seek in the request body, however this is dangerous and unpredictable. As a result, at its peak the bundle has to occupy memory equal to the chunk size - typically 10MB for the request itself and 10MB for in-memory storage - and this will decrease with each file added to OC. This gives 20MB per request, and 20GB per 1000 simultaneous requests at the peak moment.
- In this mime-type one needs to validate and parse independent parts on the fly. In case of a single part parsing error, one would need to raise an error for the full request, since constructing the response for that file might not be possible due to a missing URL header for that file, which will result in no response for the previous files already added to OC for that bundle. Having half of the files acknowledged for that request and afterwards returning an error for the whole request invalidates the architectural concept for requests in OC (refer to points 3 and 4).
- Disadvantage is that, in case of a parsing error, the request has occupied bandwidth for nothing - this should however not happen in practice.

The implementation will follow multipart/related in this version.

- [x] Bundle feature passes proof of concept tests
- [x] Integrated error handling
- [ ] Full unit tests coverage
- [ ] Documentation
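The multipart/related layout described in the PR (a first part mapping each path to its metadata, with binary parts referenced by Content-ID) can be illustrated with a small sketch. It is not the plugin's implementation: the JSON encoding, the field names `content-id` and `content-length`, and the 201/400 statuses are assumptions chosen for the example.

```python
import json


def build_metadata(files):
    """Build the first (metadata) part from (path, data) pairs.

    Returns (metadata_json, parts), where parts maps Content-ID -> bytes.
    """
    metadata = {}
    parts = {}
    for index, (path, data) in enumerate(files):
        content_id = str(index)
        metadata[path] = {"content-id": content_id, "content-length": len(data)}
        parts[content_id] = data
    return json.dumps(metadata), parts


def per_file_statuses(metadata_json, parts):
    """Derive one status per path, as a 207 Multi-Status body would list them."""
    # A parse failure here raises, which corresponds to failing the whole request.
    metadata = json.loads(metadata_json)
    statuses = {}
    for path, meta in metadata.items():
        data = parts.get(meta["content-id"])
        if data is None or len(data) != meta["content-length"]:
            # Binary content missing or truncated: only this file fails.
            statuses[path] = 400
        else:
            statuses[path] = 201
    return statuses
```

Because every path is known from the first part, a missing binary part still yields a per-file entry in the response, which is the property the PR cites in favour of multipart/related.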