Duplicate Files


#1

If I understand correctly, there is no way presently to configure my ownCloud to detect and alert me when someone uploads a duplicate file. Is this a candidate for an app?

I have a client with two requirements:

  1. Check for existing file with same name
  2. Check for existing file with same file type and size

Thank you.


#2

I have a client with two requirements:
Check for existing file with same name
Check for existing file with same file type and size

This is pretty much useless to detect duplicate files. Just a small example:
"file1.txt" with content "Hello World". Now I upload another file called "file1.txt" with the content "middleworld". The file is detected as duplicated, but clearly it isn't.
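To make that example concrete, here is a minimal Python sketch. The tuples and the `content_digest` helper are made up for illustration and are not ownCloud code. Both files share a name, an extension, and even a size of 11 bytes, so both proposed checks would flag them, yet a content hash tells them apart:

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical (name, content) pairs taken from the example above.
old = ("file1.txt", b"Hello World")
new = ("file1.txt", b"middleworld")

same_name = old[0] == new[0]
same_size = len(old[1]) == len(new[1])
same_content = content_digest(old[1]) == content_digest(new[1])

print(same_name, same_size, same_content)
```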

Now, why do you want to detect duplicates? ownCloud provides its own storage view to each user. It would be very bad if a new user who has not uploaded anything yet tries to upload a file and gets a "this file is duplicated" message because another user has uploaded the same file to their own space.

The only viable way to implement this properly is to do it per ownCloud storage. There are several problems with this approach:

  • The FS API doesn't provide a way to search files. I guess the main problem here is that the filecache table might have incomplete data (there might be some changes in the backend not synced) and searching against the backend might be impossible (you might need to search in a FS with millions of files).
  • If we opt for the easy option to check the filecache table in the DB, I don't think the storage and the checksum columns are indexed. This means that, for big instances, the performance will likely be awful. Unless there are more things where those columns are useful to search by, core won't likely add those indexes, and the DB admin shouldn't add them by hand.
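To illustrate the indexing point, here is a small sketch using Python's built-in `sqlite3` module. The `filecache_demo` table is a simplified stand-in invented for this example, not the real filecache schema, and SQLite is only used because it runs anywhere. It asks the database for its query plan before and after an index on the two columns exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical stand-in for a file cache table.
conn.execute(
    "CREATE TABLE filecache_demo ("
    " fileid INTEGER PRIMARY KEY, storage INTEGER, path TEXT, checksum TEXT)"
)


def query_plan(connection):
    """Return SQLite's plan for a lookup by storage and checksum."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT fileid FROM filecache_demo WHERE storage = ? AND checksum = ?",
        (1, "abc"),
    ).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)
conn.execute(
    "CREATE INDEX idx_storage_checksum ON filecache_demo (storage, checksum)"
)
plan_with_index = query_plan(conn)

print(plan_without_index)  # a full table scan
print(plan_with_index)     # a search that uses idx_storage_checksum
```

Without the index, every lookup has to walk the whole table, which is the performance problem described above for big instances.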

I think such an app will be a waste of time and won't work as everyone would expect. Note that, in addition, I don't think any FS has the option to detect duplicate files, so I wonder if ownCloud can implement this.


#3

Hi. Thank you for your response.

"file1.txt" with content "Hello World". Now I upload another file called "file1.txt" with the content "middleworld". The file is detected as duplicated, but clearly it isn't.

Correct, but this is precisely what they want -- they need to be notified that someone did that and then examine the situation manually.

Now, why do you want to detect duplicates? ownCloud provides its own storage view to each user.

In this particular use case, all files are stored in one user's space. They want to know, for example, if someone uploads a duplicate file with the name Video-Venice Beach-2014-June.mp4, because given the way they are used to storing their data, that would be a red flag that they would want to be alerted to in order to investigate.

I think such an app will be a waste of time and won't work as everyone would expect.

All we need is to run this search every time a file is added, or a name changed I suppose. I would have to investigate the details more, but perhaps searching the filecache table might indeed be sufficient.
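Roughly, the check I have in mind looks like the Python sketch below. Everything in it is an assumption for illustration: `find_candidates` is a made-up name, the existing files are plain `(path, size)` tuples rather than rows from the filecache table, and "file type" is approximated by the file extension.

```python
from pathlib import PurePosixPath


def find_candidates(existing, new_file):
    """Return (same_name, same_type_and_size) matches for new_file.

    existing is an iterable of hypothetical (path, size_in_bytes) tuples,
    and new_file is one such tuple for the file that was just added.
    """
    new_path, new_size = new_file
    new_name = PurePosixPath(new_path).name
    new_type = PurePosixPath(new_path).suffix.lower()

    same_name, same_type_and_size = [], []
    for path, size in existing:
        candidate = PurePosixPath(path)
        # Requirement 1: an existing file with the same name.
        if candidate.name == new_name:
            same_name.append(path)
        # Requirement 2: an existing file with the same type and size.
        if candidate.suffix.lower() == new_type and size == new_size:
            same_type_and_size.append(path)
    return same_name, same_type_and_size
```

Anything returned here would only trigger a notification; a person would then look at the files and decide whether it is a real duplicate.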

Note that, in addition, I don't think any FS has the option to detect duplicate files, so I wonder if ownCloud can implement this.

There is fslint, fdupes, dupeGuru, SearchMyFiles, Duplicate Files Finder and others. The point is that duplication detection is a real need that some people have.

The other option is to write an independent script to use fdupes and run it via cron.
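If fdupes itself were not available, a standalone script in the same spirit could look something like the Python sketch below. It is only an illustration of the general idea (group by size first, then hash only the files whose size collides) and makes no claim to match what fdupes does internally. The function names are made up.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

A cron job could call `find_duplicates` on the data directory and mail the result whenever the returned list is not empty.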


#4

It's better if they manually select what files they don't want to be duplicated. They might want just a small set of files, not all of them.
This is a better approach because you can store whatever data you need in your own table for your own app, and set the DB indexes you need. This approach has several advantages:

  • Fast search: either you have a small set of files to check, or you have proper indexes in the DB, so a lookup stays fast regardless of the size of the file set.
  • Fast checks: when a file is uploaded you need to check if there is a file with the same content in your own table. As said above, this is fast (if it's done properly). Then you need to check if the file is really present (it could have been deleted). This should also be fast since you should be checking a specific path.
  • No worries about previous state: if there were duplicate files before one of them was marked, the duplicates will remain. This is a trade-off accepted in order to keep performance as fast as possible.
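As a rough sketch of that idea in Python with `sqlite3`: the table name `watched_files`, the function names, and the use of SQLite are all assumptions for illustration, not an existing ownCloud API. The app keeps its own indexed table of marked files, looks up the checksum of each upload there, and then confirms that the matching path is still present.

```python
import os
import sqlite3


def open_watchlist(db_path=":memory:"):
    """Create the app's own table, with an index on the checksum column."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watched_files ("
        " checksum TEXT NOT NULL, path TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_watched_checksum"
        " ON watched_files (checksum)"
    )
    return conn


def watch(conn, checksum, path):
    """Mark a file as one that should not be duplicated."""
    conn.execute(
        "INSERT INTO watched_files (checksum, path) VALUES (?, ?)",
        (checksum, path),
    )


def find_existing_duplicates(conn, checksum, exists=os.path.exists):
    """Return watched paths with this checksum that are still present."""
    rows = conn.execute(
        "SELECT path FROM watched_files WHERE checksum = ?", (checksum,)
    )
    # The file may have been deleted since it was marked, so check each path.
    return [path for (path,) in rows if exists(path)]
```

The `exists` parameter is only there so the presence check can be swapped for whatever the storage backend offers.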

Such an app seems a better choice. The standard (not enforced) constraint of isolation that is expected of an app is maintained, and whatever the app needs from core can be obtained via standard APIs or public classes.

Nope. Those are applications that run on top of the FS, not FS options that prevent you from creating duplicate files. You can create a script that does practically the same thing by connecting to the ownCloud instance via WebDAV. Then you'll likely cry because either the application takes forever (it might need to scan millions of files) or the ownCloud FS is write-locked and you can't upload anything until the scan finishes (the former being more likely than the latter).

For me, those apps are fine for a personal computer where you scan 10000 files or so, or even fewer since you might scan only specific personal folders which are local, but for a server with multiple users and external storages they seem very impractical.

My main point here is that the approach of those apps doesn't scale. We have to deal with an unknown number of files (billions, why not?) on slow file systems (external storages such as dropbox, gdrive, etc). Restricting the functionality to "it works fine for fewer than 10000 local files" isn't a good approach when people will need to deal with much more than that.


#5

It may be worth calculating a checksum hash for each file after it is changed, created, or uploaded, and storing that hash in the database. When a new file arrives, compare its hash with the existing database entries. If there is a match, do not create the file or move it out of tmp; create a link to the existing file instead.
Checking against the database is significantly faster than recalculating the hashes of previously uploaded files every time.
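A rough Python sketch of that flow, under several assumptions: `store_upload` is a made-up name, a plain dict stands in for the database table of hashes, the "link" is a hard link, and tmp and the destination are on the same file system.

```python
import hashlib
import os


def store_upload(tmp_path, dest_path, index):
    """Store an uploaded file, or hard-link it to an identical existing one.

    index is a dict mapping content digest to the path of the first stored
    copy; it stands in for a database table in this sketch.
    Returns "linked" or "stored".
    """
    digest = hashlib.sha256()
    with open(tmp_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    key = digest.hexdigest()

    existing = index.get(key)
    if existing is not None and os.path.exists(existing):
        # Same content is already stored: link to it and drop the upload.
        os.link(existing, dest_path)
        os.remove(tmp_path)
        return "linked"

    # New content: move it out of tmp and remember its hash.
    os.replace(tmp_path, dest_path)
    index[key] = dest_path
    return "stored"
```

Note that with hard links a later in-place edit of one path changes the content seen through the other, so a real implementation would need to handle modification separately.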