Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Externalized attachments #1021

Open
fflorent opened this issue Jun 6, 2024 · 1 comment
Open

Externalized attachments #1021

fflorent opened this issue Jun 6, 2024 · 1 comment
Labels
Attachments enhancement New feature or request

Comments

@fflorent
Copy link
Collaborator

fflorent commented Jun 6, 2024

The idea has been proposed by @paulfitz #886 (comment)

If I understand correctly, and I tend to think that's also something we would like on the ANCT's side: the idea would be to allow creating an attachment column where the file would not be injected in the document but rather in an external storage like S3.

There are several advantages, especially reducing the grist document size and thus allowing increasing the attachment size limit.

@fflorent fflorent added enhancement New feature or request Attachments labels Jun 6, 2024
@paulfitz
Copy link
Member

paulfitz commented Jun 6, 2024

Thanks for opening this @fflorent ! Going to dump in some thoughts a developer prepared about this in 2022. Grist has changed somewhat since, and also this was not a plan just one person's thoughts (I personally disagree with some of it), but the set of concerns raised could be helpful for things about this project.

Externalizing attachments

External store

We need a generic interface for storing and retrieving file data that can be implemented in different ways. One obvious data store is S3-compatible stores. Theoretically, the local filesystem might also work.

Migration

Once at least one store is implemented, _gristsys_Files could be deprecated, and a special migration could move the data from there to the external store. The usual Python migration system won’t be enough on its own because the data engine doesn’t see _gristsys_Files, but maybe it could make an external call to node to deal with that.

Downloading documents

We still want to be able to download a single self-contained .grist sqlite file containing all the attachments. When this happens, we’d need to:

  1. Make a copy of the database file
  2. Download all the externalized attachments and put them in the copy, perhaps back in _gristsys_Files
  3. Give that to the user to download
  4. When the document is uploaded again, perform the same process as the migration to move data to the external store.

This would also allow using downloaded documents in older versions of Grist.

Serving attachments without the DocWorker

Currently the client uses a special DocWorker API to view and download attachments. To serve the files, the DocWorker retrieves them from _gristsys_Files. In the first iteration of work, this would be changed to retrieving them from the external store instead. But in the long term, it would be nice if the client could bypass the DocWorker and retrieve the files directly from the store. S3 would work well for this, but other types of store may not allow this.

Deleting externalized attachments

Attachments are likely to contain sensitive data, and storing them longer than necessary is a security risk. When a user deletes an attachment, it’s reasonable for them to expect it to actually be deleted eventually, just like any other data, so that it can’t be leaked. This applies whether they deleted a row, a document, or an entire organisation. We can’t actually fully delete the data immediately in the first case because deleted rows still live in the snapshot history, but we should delete them eventually.

This is like the problem of tracking attachments referenced within a document, on a much larger scale. In this case actually tracking the references (or maybe just their counts) from documents to the external store seems essential. These would need to be updated whenever a document is copied or deleted within a Grist installation. We’d need to consider:

  • “Duplicate Document”
  • “Work on a copy”
  • Other ways of ‘forking’ such as from fiddle mode or templates
  • Creation and pruning of snapshots.
  • Deleting a document permanently.

Downloading a document ‘disconnects’ it from the Grist installation so it doesn’t need to be counted. It has its own copy of the attachments so it should either delete or ignore the metadata about the externalized data.

Encryption

An alternative to tracking attachment references to allow deleting them is to encrypt the attachment data to avoid the need to delete it. Each attachment file would have a unique encryption key stored only in the corresponding row of _grist_Attachments . Once all copies of that row are fully deleted, the encryption key should be lost, and decrypting the data in the external store should become impossible. That means we don’t ever have to delete the actual data, so we don’t need to keep track of references to it.

Another security benefit of encryption is that if someone gains access to the data in the external attachments store, they can’t actually read it unless they also have the referencing documents.

One downside is that serving attachments directly from S3 instead of the DocWorker becomes more tricky. Decrypting and displaying a single encrypted file in the browser using SubtleCrypto and createObjectURL seems straightforward. But it’s a lot more delicate to handle a user scrolling through a grid filled with thumbnails, displaying them all efficiently and then reclaiming memory after they disappear from view.

Access Control

Would need thinking about. Important to preserve the property that the existing metadata (particularly fileIdent) is not enough to download the file, so that access is properly revoked even if someone has a past copy of the metadata. It might also be nice if the download URL couldn’t be computed purely from the file content, so that someone with a local copy of a file can’t test whether it exists in the document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Attachments enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants