Skip to content

For cloud storage client library, consider adding integrity checking functionality to compute checksums for downloads and uploads #660

@houglum

Description

@houglum

Context: Cloud Storage already offers the ability to supply a known checksum when uploading an object [1]:

Cloud Storage supports server-side validation for uploads, but client-side validation is also a common approach. If your application has already computed the object’s MD5 or CRC32c prior to starting the upload, you can supply it with the upload request, and Cloud Storage will only create the object if the hash you provided matches the value we calculated.

But for downloads and uploads where the MD5 or CRC32C checksum(s) is not known beforehand, clients have to take responsibility for integrity checking. W.r.t downloads, this is mentioned in the same docs page:

A download integrity check can be performed by hashing downloaded data on the fly and comparing your results to server-supplied hashes. You should discard downloaded data with incorrect hash values, and [...].

The same strategy could be used for uploads where the checksum(s) isn't known beforehand.

Suggestion: From a reliability standpoint, it makes sense that a storage client library should take care of integrity checks for all local-to-cloud uploads and cloud-to-local downloads. We ought to be doing this to protect users from data corruption (in the event that a user doesn't provide us with the file's checksum up-front). While other client libraries didn't have this functionality at their inception, it's since been added, e.g. in the Python storage library ([2], [3]) and the dotnet storage library ([4], [5]). As mentioned in the Python library issue [2], the MD5 sums were used there because a fast CRC32C implementation (the C extension of the crcmod module) is not guaranteed to be available on all Python installations; since this is a C++ library and we can ensure we have a fast implementation of it, we should use CRC32C (guaranteed to be available for all objects, including composite objects), rather than MD5 checksumming.

  • Note: The Python library actually deletes a corrupted download file, while I think the C#/dotnet library just throws an error without deleting it. Deleting the file seems more appealing and doesn't seem hard to tack on to this functionality, but removing the corrupted file means we lose some ability to debug the corruption (e.g. what's the nature of the corruption? Did we just miss a byte at the end? etc.). We should think about the pros and cons of both approaches if we decide to add this integrity checking functionality.

[1] https://cloud.google.com/storage/docs/hashes-etags
[2] googleapis/google-resumable-media-python#22
[3] googleapis/google-cloud-python#4133
[4] googleapis/google-cloud-dotnet#395
[5] jskeet/google-cloud-dotnet@aaa2aa0

Metadata

Metadata

Assignees

Labels

api: storageIssues related to the Cloud Storage API.type: feature request‘Nice-to-have’ improvement, new feature or different behavior or design.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions