Supporting s3 in reva



Last week I helped get the ownCloud Design System into phoenix. While that was a welcome break from reva and nexus, I wanted to go back to the server side and flesh out some of the storage providers. EOS support is mostly there and a rudimentary POSIX local storage provider exists, so it is time to make s3 storages accessible. Back in the day I designed and implemented the primary object storage support and had a good understanding of what to expect. Or so I thought. It turns out there are some dead ends along the way. They are tightly linked to the file cache, or in the new architecture, to the lack thereof.

The file cache is dead

If you remember the underlying problems, one of the main goals for reva is getting rid of the file cache table.

There are two ways of using an object storage.

  • We can either consume the keys as they are and use a / delimiter to bolt a directory concept on top of it. It looks like a directory tree, but it isn’t.

  • Or we separate the directory tree (the metadata) from the binary blob data, storing only the latter in the object store.
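The first approach can be sketched without any SDK: given the flat list of keys, a / delimiter is enough to fake immediate children, which is essentially what a ListObjectsV2 call with a delimiter gives you. A minimal Go sketch (the function name and sample keys are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// listChildren emulates ListObjectsV2 with a "/" delimiter: given flat
// object keys, it returns the immediate children of prefix. Keys that
// share a deeper prefix are collapsed into a single pseudo-directory,
// the way s3 reports CommonPrefix elements.
func listChildren(keys []string, prefix string) []string {
	seen := map[string]bool{}
	for _, k := range keys {
		if !strings.HasPrefix(k, prefix) || k == prefix {
			continue
		}
		rest := strings.TrimPrefix(k, prefix)
		if i := strings.Index(rest, "/"); i >= 0 {
			seen[prefix+rest[:i+1]] = true // collapse to a "CommonPrefix"
		} else {
			seen[k] = true
		}
	}
	out := make([]string, 0, len(seen))
	for k := range seen {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	keys := []string{
		"docs/readme.md",
		"docs/img/logo.png",
		"docs/img/icon.png",
		"music/song.mp3",
	}
	// docs/img/ only "exists" because two keys share that prefix
	fmt.Println(listChildren(keys, "docs/"))
}
```

Note that docs/img/ only appears in the listing because two keys happen to share that prefix; delete both objects and the "directory" vanishes, which is exactly why it only looks like a tree.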

Both approaches are implemented in ownCloud 10 and have been for a while, so we had a lot of time to learn what works and what does not.

Splitting out the metadata yields much better performance, because metadata operations no longer need a trip to the s3 storage: no slow HTTP request, no XML marshalling and unmarshalling, no copying of objects to rename or move a file (to implement move, s3 does a copy followed by a delete). While this is nice when setting up a new instance, you often want to make an existing s3 storage available to your users. This is the use case I wanted to look into … aaaand, well, let me try to explain the possible trade-offs that I see.

Atomicity and propagating changes

Our motivation is to let the sync client actually sync files when they change. For this miracle, we need to keep one fundamental WebDAV property intact: the ETAG of the root node needs to change when one of its child elements changes. WebDAV actually wants that to be atomic, and ownCloud takes care of that when it is involved in the change. When the change is triggered outside of ownCloud, atomicity is not the problem. Propagating the ETAG change up the tree becomes the challenge. Why?
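To illustrate what propagation means, here is a minimal Go sketch that walks from a changed node up to the root and derives a fresh ETAG for every ancestor. The ETAG scheme (hash of path plus child ETAG) is invented for this example and is not how ownCloud actually computes ETAGs:

```go
package main

import (
	"crypto/md5"
	"fmt"
	"path"
)

// propagate walks from a changed node up to the root, deriving a fresh
// ETAG for every ancestor so clients polling the root see the change.
// The scheme (md5 of parent path + child ETAG) is made up for this sketch.
func propagate(etags map[string]string, p string) {
	for p != "/" {
		parent := path.Dir(p)
		h := md5.Sum([]byte(parent + etags[p]))
		etags[parent] = fmt.Sprintf("%x", h)
		p = parent
	}
}

func main() {
	etags := map[string]string{"/a/b/file.txt": "v2"}
	propagate(etags, "/a/b/file.txt")
	// /a/b, /a and / now all carry fresh ETAGs
	fmt.Println(etags["/"])
}
```

With a database-backed file cache this walk is a handful of UPDATE statements; the whole question of this post is where to run it when there is no such table.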

In ownCloud 10 we use the file cache table to propagate the ETAG, with the same mechanism we use for all other storages. In reva we want to push the metadata down into the storage itself, if possible. Unfortunately, s3 is not a well-defined standard. It is just a protocol that is being re-implemented in several products, mostly by looking at the AWS docs, with varying degrees of completeness. minio, for example, has no tagging support and uses policies instead of ACLs. So we cannot even attach a UUID to a file to re-identify it when it is moved in the rudimentary file tree. Why rudimentary?

Rudimentary directory tree

An ETAG can either be based on the file content alone or take metadata into account. For directories the ETAG should change when the file listing of the folder changes. For WebDAV that includes mtime and ETAG changes, recursively, to make sync work properly. The sync client polls every 30 seconds to check if the root ETAG changed. If it did, it walks down the tree to update the changed resources accordingly.

For that to work without the ownCloud file cache we at least need to be able to fetch some metadata for a folder. The minimal amount would be an mtime. Together with the object key we can calculate an ETAG that can be used for syncing. Now, s3 only has two ways of getting an mtime: a HEAD request and a ListObjects(V2) operation. The former (at least for minio) returns a 404 on directory objects (keys ending in the / delimiter), the latter aggregates keys with the same prefix into a list of CommonPrefix elements that contain only the prefix, not an mtime. So it seems we cannot efficiently fetch mtimes, or any metadata at all, for directory objects, which makes it kind of hard to read metadata that has been attached to them. Sharing might be implemented via ACLs or some other product-specific way, like policies for minio. But comments and tags would need an external db, or would they?

Storing additional metadata

If we can attach metadata to objects, a simple oc-file-id=<uuid> tag would be enough to glue any metadata we need to any object. The metadata itself, like tags, comments or sharing information, could live in a different place. A traditional SQL database? Or we could store it inside the object store as new objects. That would allow using any built-in replication mechanism of the storage to deal with redundancy or geo-distribution. We can also cache metadata to speed up reads. Together with ETAGs, cache invalidation can be done efficiently.

That last sentence is a fallacy. Why?

Remember the rudimentary directory tree? Or should I say: missing mtime and ETAG for directories? What can be done about that? Do a file listing of a folder to calculate an ETAG for the parent directory? Then what? Do it recursively, so we may finally be able to have an ETAG for the root? Oh no, optimize it by ignoring the delimiter and just fetching all keys? In a bucket with a million keys?

Long live the file cache

So the metadata is there, but we cannot get at it reasonably fast. If we cannot rely on the s3 storage to provide us with the necessary metadata in an efficient way, we can introduce a cache that we use to store the calculated metadata ourselves. Mind you, the new cache can always be rebuilt by rereading all keys. Rereading is always the fallback solution to pick up changes in the storage to make sync clients eventually see a new ETAG on the root. But there are better strategies than periodically re-scanning the storage.

  • The most efficient one is using notifications. This is again storage-dependent, but it is the cleanest solution.

  • Another possible way is to let users manually update the metadata by navigating through the tree in phoenix. The PROPFINDs can be used to fetch the metadata from the s3 storage and update it in the cache.

The first iteration

For now I will implement a storage provider without a cache that uses a fixed time as the mtime for all directories instead of now() (which is what minio's mc command does). That will allow browsing the files in phoenix, but prevent the sync client from constantly re-syncing all folders just because their mtime keeps changing. Then we can introduce a cache, maybe a persistent one like QuarkDB, and try updating the mtime when phoenix is used to browse files. Finally, we can start leveraging product-specific push notifications where products support them.

That’s it from the storage world. A lot of work to be done.

If you know how other s3 products behave for HEAD requests on directory objects or think I may have missed something let me know below!

I’ll be on vacation next week and then hope to get some of the upstream changes merged. I’ll let you know how it went!