Files Bundling extension - multiple files update/creation in various locations on the server

expert

#1

I want to create a prototype app for bundling support on both the server and the client. The idea behind bundling is to have a single PHP process handle the creation/update of files on the server, instead of a separate PUT request for each file.

I use the word "prototype" because I want to verify the design with the extended smashbox framework, which is itself still in the prototype phase.

Here are some specifications:

  • A POST request from the client triggers one PHP process, which handles the bundle processing.
  • The bundle is a non-compressed ZIP file containing files from various locations on the client side, each associated with a unique ID.
  • The POST request contains WebDAV details for each file, mapped by unique ID. This mapping is required to determine the final location of each file.
  • Upon receiving the POST request, the server extracts each file to its required location.
  • This feature could run in parallel with, or be made obsolete by, other features such as HTTP2 or other WebDAV extensions like MPUT, if they turn out to perform better. HTTP2 could also turn out to be a further improvement on top of bundling.

Unknowns:
* Where is the proper place in the server code to implement this without breaking compatibility with older clients?
* Are there any implications on the server side that would prevent the implementation?

Please share your thoughts.
Piotr

CURRENT:
* For now: no ZIP; a multipart request with JSON containing the metadata, plus the files


#2

Notes:
- implementing a POST requires a plugin to be registered - https://github.com/owncloud/core/blob/master/apps/dav/lib/DAV/Sharing/Plugin.php#L104

  • because this is webdav it has to be implemented as part of core

Questions:
- what will be the URL to which the POST will be sent?
- how will the mapping be communicated?
- what does unique ID mean in this context?
- how will additional metadata be transmitted (e.g. mtime)?
- what does the response look like?
- how is error handling supposed to work (e.g. one file is not allowed to be put into the target location and so on)?


#3

- what will be the url to which the POST will be sent?
That should be discussed

- how will the mapping be communicated?
- what does unique ID mean in this context?
Files inside the ZIP will not be stored under their usual filenames, but under an ID assigned by the client. Extraction from the ZIP will be done in the following manner:

copy("zip://".$path."#".$UNIQUE_ID, $PATH_TO_DIRECTORY_FROM_UNIQUEID_MAPPING );

The mapping could be communicated in two ways: either an XML file describing the bundle structure is included in the ZIP file, or the POST is multipart, where one part is the ZIP and the other is the XML.

- how will additional metadata be transmitted (e.g. mtime)?
To be discussed; I have too little knowledge of what mtime is.

- what does the response look like?
The response will report a status code for each of the files.

- how is error handling supposed to work (e.g. one file is not allowed to be put into the target location and so on)?
The same as with single files, but applied per bundled file, as reported in the response.
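
A minimal sketch of the extraction idea from this post, written in Python rather than PHP purely for illustration. It assumes the mapping arrives as a plain dictionary from unique ID to a path relative to the user's root folder. The helper names `build_bundle` and `extract_bundle`, the mapping format and the per-file status codes are all hypothetical and not part of any agreed protocol.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def build_bundle(entries):
    """Pack {unique_id: bytes} into a non-compressed (stored) ZIP held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as bundle:
        for unique_id, data in entries.items():
            bundle.writestr(unique_id, data)
    return buffer.getvalue()


def extract_bundle(zip_bytes, mapping, root):
    """Write each ID-named entry to the destination given by the mapping.

    Returns {unique_id: status_code}; the codes are illustrative only.
    """
    root = Path(root).resolve()
    statuses = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        for unique_id in bundle.namelist():
            target = mapping.get(unique_id)
            if target is None:
                statuses[unique_id] = 400  # no mapping entry for this ID
                continue
            destination = (root / target).resolve()
            # Refuse destinations that escape the user's root folder.
            if root not in destination.parents:
                statuses[unique_id] = 403
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(bundle.read(unique_id))
            statuses[unique_id] = 201
    return statuses


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as user_root:
        data = build_bundle({"id-1": b"key=value\n", "id-2": b"\xff\xd8"})
        mapping = {"id-1": "docs/config.cfg", "id-2": "photos/zombie.jpg"}
        print(extract_bundle(data, mapping, user_root))
```

One failing entry does not stop the others here, which matches the per-file error handling suggested above; whether the real server should behave that way is still open.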


#4

You can also use $_FILES['userfile']['tmp_name'] as the $path for an uploaded ZIP, so you never have to move or extract the uploaded ZIP file.


#5

Please check HTTP multipart requests, might be a good fit here: http://stackoverflow.com/questions/16958448/what-is-http-multipart-request


#6

Discussed: This should go in as a new Plugin in app/dav/lib/server for "method:POST", handling the URL `/remote.php/dav/files/$user`


#7

https://developers.google.com/drive/v3/web/manage-uploads#multipart

with:

Content-Type: multipart/related; boundary=foo_bar_baz

Seems to be better than zipping. We already have all metadata inside.

https://packagist.org/packages/riverline/multipart-parser


#8

With MIME multipart we are limited to POST requests, but that is an acceptable limitation.
Remember that we have to upload metadata (expected ETag for existing files, modification time) and receive other metadata, such as the new ETag of the file.

This is the plan for how to proceed here:

  1. How do others do it? Are there alternatives to HTTP multipart POST? Check GDrive, S3, Dropbox and friends
  2. Create a POST request to /remote.php/dav/files/${user}/ with multipart
  3. Check the available multipart request subtypes to determine which one is correct, and find a suitable multipart response. Remember the metadata transport: Filename, ETags, modification times.
  4. Discuss Pros/Cons of the multipart specs
  5. Implement the Sabre Plugin for the server
  6. Implement it in the clients
  7. Document the protocol

#9

How do others do it? Are there alternatives to HTTP multipart POST? Check GDrive, S3, Dropbox and friends
According to people at this year's CS3 conference of cloud storage solution users (mainly ownCloud) who ran performance tests of various public file storage servers (see the Bocchi slides), only Dropbox supports bundling. However, Dropbox sends control metadata traffic to servers distributed around the globe and streams[?] data to/from a centralized point in the US. Because they are based on object storage, they do not face the same problems as ownCloud, which is filesystem-based by design, with integrated metadata and data flow.

Create a POST request to /remote.php/dav/files/${user}/ with multipart
Implemented in the bundling implementation branch and, for now, being tested with:

curl -X POST -H 'Content-Type: multipart/related; boundary=boundary_1234' --proxy $proxy --cookie "XDEBUG_SESSION=MROW4A;path=/;" --data-binary $'--boundary_1234\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{\r\n\t"title": "TestFile"\r\n}\r\n\r\n--boundary_1234\r\nContent-Type: image/jpeg\r\n\r\n' --data-binary "@$testfile" --data-binary $'\r\n--boundary_1234--\r\n' http://$user:$pass@$server/remote.php/dav/files/$user/

#10

3. Check the available multipart request subtypes to determine which one is correct, and find a suitable multipart response. Remember the metadata transport: Filename, ETags, modification times.
4. Discuss Pros/Cons of the multipart specs
My suggestion is to use the following structure:

Content-Type: multipart/related; boundary=related_boundary

--related_boundary
Content-Type: application/json; charset=UTF-8
{ "JSON WITH FILES METADATA" : "GOES HERE"}

--related_boundary
Content-Type: application/whatever
Content-ID: 1
FILE_CONTENTS

--related_boundary
Content-Type: application/whateverelse
Content-ID: 2
FILE_CONTENTS

--related_boundary
Content-Type: application/whateverelse
....
--related_boundary--

RFC about multipart/related

The primary content type for the message would be multipart/related which, as the specification says, is used to indicate that each message part is a component of an aggregate whole. It is intended for compound objects consisting of several inter-related components, where proper processing cannot be achieved by individually displaying the constituent parts. The message consists of a root part (by default, the first) which references other parts, which may in turn reference other parts. Message parts are commonly referenced by the Content-ID part header.

The first root part will contain json structured metadata for the other Content-ID referenced parts.
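
To make the proposed layout concrete, here is a minimal Python sketch that assembles such a body. The function name `build_related_bundle` and the shape of the JSON metadata (a map from Content-ID to a filename) are placeholders, not an agreed format. File bytes are written unencoded, on the assumption that the HTTP transport is 8-bit clean and that the boundary string does not occur in any payload.

```python
import json


def build_related_bundle(files, boundary="related_boundary"):
    """Assemble a multipart/related body.

    files: list of (filename, content_type, data_bytes).
    Returns (content_type_header_value, body_bytes).
    """
    # Root part: JSON metadata keyed by the Content-ID of each file part.
    metadata = {str(i): {"filename": name} for i, (name, _, _) in enumerate(files)}
    delimiter = b"--" + boundary.encode("ascii")
    chunks = [
        delimiter,
        b"Content-Type: application/json; charset=UTF-8",
        b"",  # empty line separates part headers from part body
        json.dumps(metadata).encode("utf-8"),
    ]
    for i, (_, content_type, data) in enumerate(files):
        chunks += [
            delimiter,
            b"Content-Type: " + content_type.encode("ascii"),
            b"Content-ID: " + str(i).encode("ascii"),
            b"",
            data,  # raw bytes, no base64
        ]
    # Closing delimiter, followed by a final CRLF.
    chunks += [delimiter + b"--", b""]
    body = b"\r\n".join(chunks)
    header = "multipart/related; boundary=%s" % boundary
    return header, body
```

The returned header value would be sent as the request's Content-Type, and the body as the POST payload.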

Advantages:
* Easy reference to the files from 'Content-ID' parts
* Well defined and widely used with php libraries on the net

Disadvantages:
* Might require base64 encoding?

Other solutions:

multipart/form-data[RFC1867]
Potential alternative. The multipart/form-data content type is intended to allow information providers to express file upload requests uniformly, and to provide a MIME-compatible representation for file upload responses. It is used to submit file uploads from HTML forms.

multipart/alternative
Does not fit the picture, since it is designed to carry, usually, two parts with the same logical content presented differently, e.g. one raw-text part and one HTML-formatted part.

multipart/digest
Does not fit the picture; it is designed for bundling a collection of messages, such as a mail digest.

multipart/mixed[RFC1521]
Potential alternative, usually used for a combination of multipart/alternative and uploaded files.
The multipart/mixed content type is used when the body parts are independent and need to be bundled in a particular order. When a UA does not recognize a multipart subtype, it will treat the message as multipart/mixed.

multipart/parallel[RFC1521]
The purpose of the multipart/parallel content type is to display all of the parts simultaneously on hardware and software that can do so. For instance, an image file can be displayed while a sound file is playing.


#11

@DeepDiver1975

What about using https://github.com/sroze/SRIORestUploadBundle
forking it and modifying UploadProcessor and other dependencies

https://github.com/sroze/SRIORestUploadBundle/blob/0a6ae601fbc23782b338630f9cfceff7a15cac7a/Processor/MultipartUploadProcessor.php

so that it validates "our form" as metadata, communicates with our "storage handler", and uses Sabre\HTTP\RequestInterface instead of Symfony\Component\HttpFoundation\Request. It will require modifying/adding/deleting scripts while keeping the logic of the structure. As I understand it, this should go into the 3rdparty section of ownCloud, shouldn't it?


#12

In what respect would that be helpful?

The actual implementation of the sabre dav plugin which will handle post will at most be 100-200 sloc together with one dependency which can parse multipart/related.


#13

@DeepDiver1975 Not sure if I understand. I just want to use a library to validate and parse my multipart/related HTTP message. Most libraries expect a specific request class to be passed to the function. We use Sabre\HTTP\RequestInterface as the class passed to e.g. the handleBundledUpload function, as a result of:

$server->on('method:POST', array($this, 'handleBundledUpload'));

I wanted to reuse the concept from
https://github.com/sroze/SRIORestUploadBundle/blob/0a6ae601fbc23782b338630f9cfceff7a15cac7a/Processor/MultipartUploadProcessor.php
and refactor it to be compatible with our bundle app


#14

Now I understand - yes - that code can be adapted to work with Sabre's request interface.
But there is no need to create a library for that - just add the lines to the plugin


#15

To clarify: should I fork it into the ownCloud repository, delete the unused code and refactor that SRIORestUploadBundle lib?


#16

or simply copy the 20 sloc you need :wink:


#17

@DeepDiver1975 In the branch https://github.com/owncloud/core/compare/bundling_plugin you can find the proposed architecture for the bundling plugin. The code first reads and decodes the JSON metadata, and then reads the headers and content of each appended file:

Request

curl -X POST -H 'Content-Type: multipart/related; boundary=boundary_1234' --proxy $proxy --cookie "XDEBUG_SESSION=MROW4A;path=/;" \
    --data-binary $'--boundary_1234\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n{"0":{"filename":"config.cfg"},"1":{"filename":"zombie.jpg"}}\r\n\r\n--boundary_1234\r\nContent-Type: text/plain\r\nContent-ID: 0\r\n\r\n' \
    --data-binary "@$testfile1" \
    --data-binary $'\r\n\r\n--boundary_1234\r\nContent-Type: image/jpeg\r\nContent-ID: 1\r\n\r\n' \
    --data-binary "@$testfile2" \
    --data-binary $'\r\n--boundary_1234--\r\n' \
    http://$user:$pass@$server/remote.php/dav/files/$user/

Response

TODOs:
1. I will suggest a JSON metadata structure for the files; let's discuss that
2. Implement metadata reading and the way files are added to the database
3. Implement the multistatus response based on point 2
4. Implement tests and drive further development (exception handling etc.) based on that set of tests
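
The read order described in this post (JSON metadata first, then the headers and content of each appended part) can be sketched in Python as follows. This is not the code from the branch; `parse_related_bundle` and `pair_files` are hypothetical names, and the sketch assumes every part carries at least one header, file parts carry a Content-ID, and the boundary never occurs inside a payload.

```python
import json


def parse_related_bundle(body, boundary):
    """Split a multipart/related body into (metadata, {content_id: bytes})."""
    delimiter = b"\r\n--" + boundary.encode("ascii")
    # Prefix a CRLF so the first delimiter looks like all the others.
    sections = (b"\r\n" + body).split(delimiter)
    parts = []
    for section in sections[1:]:  # sections[0] is the (ignored) preamble
        if section.startswith(b"--"):
            break  # closing delimiter reached
        # Drop the CRLF that ends the delimiter line, then split headers from content.
        raw_headers, _, content = section[2:].partition(b"\r\n\r\n")
        headers = {}
        for line in raw_headers.decode("ascii").split("\r\n"):
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        parts.append((headers, content))
    if not parts:
        raise ValueError("bundle contains no parts")
    # The root part (the first one) holds the JSON metadata.
    metadata = json.loads(parts[0][1].decode("utf-8"))
    files = {headers["content-id"]: content for headers, content in parts[1:]}
    return metadata, files


def pair_files(metadata, files):
    """Join metadata and payloads on Content-ID, giving {filename: bytes}."""
    return {entry["filename"]: files[cid] for cid, entry in metadata.items()}
```

Calling `pair_files(*parse_related_bundle(body, "boundary_1234"))` on a body shaped like the request above would yield one payload per filename, which is the information a per-file multistatus response needs.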


#18

Looks good - please open a pull request - we can discuss the next steps there!

THX


#19

ownCloud Client PR: https://github.com/owncloud/client/pull/5155

ownCloud Server PR: https://github.com/owncloud/core/pull/25760


#20

After the conference and the presentation of the bundling prototype results, it was discussed that it might make sense to simplify the internal structure of the bundle by using the multipart/mixed type instead of the multipart/related type.

The primary subtype for multipart, "mixed", is intended for use when
the body parts are independent and need to be bundled in a particular
order. Any multipart subtypes that an implementation does not
recognize must be treated as being of subtype "mixed".

This will require additional X-OC headers inside the multipart/mixed message parts. Each part will now be independent, carry its own file headers and, if needed, contain appended data. Since bundling may be used to reduce the number of requests for all types of operations, it makes sense to create independent parts and let the server logic decide how to react to specific errors for specific types of requests:

  1. A predefined condition is that MOVE and MKDIR operations should be bundled first into multipart/mixed messages and block operations like PUT and DELETE. We need to be sure that the preconditions are satisfied, and any failure in MOVE/MKDIR will abort the whole sync. This is also motivated by the fact that MOVE/MKDIR operations carry no data payload, so an abort will not waste memory or bandwidth.

  2. MOVE/MKDIR may be bundled together in "directories order". These two operations cannot be bundled with other WebDAV-like operations, and they have to go first, before any other.

  3. Any error in a MOVE/MKDIR operation should raise an error for the whole bundle; only successful operations should be written to the local client database, and the whole sync should be aborted.

  4. If the MOVE/MKDIR operations did not abort the sync and did not signal an error, the PUT operations can be taken from the job queue and bundled into multipart/mixed messages limited by chunk size and number of files. A PUT operation bundle should block DELETE operations.

  5. A failure in a DELETE operation should abort the whole sync, and only successful operations should be taken into account.

Final note:
The current implementation creates separate MOVE/MKDIR/DELETE requests and separate requests for UPDATE PUT operations, and creates a bundled multipart/mixed message for CREATION PUT requests.
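
The ordering rules above can be illustrated with a small client-side planner, sketched here in Python. It is not the client implementation: `Job`, `plan_sync` and the two limits are hypothetical, and "directories order" is interpreted as parents before children, approximated by path depth.

```python
from dataclasses import dataclass


@dataclass
class Job:
    method: str    # "MKDIR", "MOVE", "PUT" or "DELETE"
    path: str
    size: int = 0  # payload size in bytes, only meaningful for PUT


def plan_sync(jobs, max_bundle_bytes, max_bundle_files):
    """Return an ordered list of (kind, [jobs]) phases; each phase blocks the next."""
    structure = [j for j in jobs if j.method in ("MKDIR", "MOVE")]
    puts = [j for j in jobs if j.method == "PUT"]
    deletes = [j for j in jobs if j.method == "DELETE"]

    phases = []
    if structure:
        # Assumed meaning of "directories order": shallower paths first.
        structure.sort(key=lambda j: j.path.count("/"))
        phases.append(("structure", structure))

    # Pack PUT jobs into bundles limited by total size and number of files.
    bundle, bundle_bytes = [], 0
    for job in puts:
        over_count = len(bundle) >= max_bundle_files
        over_size = bundle_bytes + job.size > max_bundle_bytes
        if bundle and (over_count or over_size):
            phases.append(("put-bundle", bundle))
            bundle, bundle_bytes = [], 0
        bundle.append(job)
        bundle_bytes += job.size
    if bundle:
        phases.append(("put-bundle", bundle))

    if deletes:
        phases.append(("delete", deletes))
    return phases
```

A caller would run the phases in order and stop the sync at the first failing "structure" or "delete" phase, recording only the operations that succeeded. A single PUT larger than the size limit still gets a bundle of its own in this sketch.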