Some Cloud Storage Services


The log4j-s3-search project now supports Azure Blob Storage (“Azure”) and Google Cloud Storage (“GCS”) in addition to AWS S3 (“S3”).

While working on adding the two options, I learned a bit about the three storage services. These services’ similarity to one another may not be a coincidence. After all, they are competing offerings from competing service providers, so months/years of competitive analyses may just yield a bunch of similar things.

HTTP and Language SDKs

All three services offer a basic HTTP (REST) interface for storing and retrieving blobs.

However, all three services also have language bindings, or client SDKs, for popular programming languages (e.g. Java, Python, C, etc.). In my experience, working with these SDKs is definitely easier than crafting the HTTP requests yourself. This is especially true for AWS S3, considering the logic required to sign each request.

Storage Model

The models are similar for all three:

  • A global namespace is used for the top-level container of blobs, which is referred to as either a container (Azure) or a bucket (S3 and GCS).
  • Within each container are key-value pairs, where the key is a string that often follows a path-like format to mimic a file system’s folder hierarchy (e.g. “/documents/2020/April/abc.doc”).
    The web consoles for these services may interpret keys in that format and present a tree-like interface to further the illusion of a hierarchy. Keep in mind, however, that underneath it all a key is just a string of characters.
  • The value is a binary stream (a “blob”) of arbitrary data. Typically the services allow attaching metadata to the key-value entry. One common metadata property is “Content-Type,” which is used much like the HTTP header of the same name: to hint to consumers of the blob what the content is (e.g. “text/plain,” “application/json,” “application/gzip,” etc.). A sketch of setting it follows this list.
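
For example, with the S3 Java SDK (the same v1 SDK used later in this post), a Content-Type hint can be attached at upload time. A minimal sketch; the bucket, key, and file names here are placeholders:

ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType("application/gzip");  // hint for whoever reads the blob back

PutObjectRequest request = new PutObjectRequest(
    "mybucket", "logs/2020/April/app.log.gz", new File("app.log.gz"));
request.setMetadata(metadata);
s3Client.putObject(request);  // an AmazonS3 client, built as shown later in this post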

Not-So-Quick Walkthrough

The following steps are what I went through in order to upload a file to each of the services.

In order to use these services, of course, an account with the service is required. All three services have a “free” period for first-time users.

Set Up Authentication

S3

Sign into https://console.aws.amazon.com/

Create an access key. Despite the name, creating one actually yields two values: an access key and a corresponding secret key. The process is a bit clumsy: be sure to write down and/or download and save the secret key, because it is only shown at creation time. The access key itself remains listed in the console, but if the secret key is lost, a new access key needs to be created.

Create a subdirectory .aws and a file credentials under your user’s home directory ($HOME in Linux/Mac, %USERPROFILE% in Windows):

.aws/
    credentials

The contents of the file should be something like (substitute in your actual access and secret keys, of course):

[default]
aws_access_key_id = ABCDEFGABCDEFGABCDEFG
aws_secret_access_key = EFGABCDEFG/ABCDEFGABCDEFGAABCDEFGABCDEFG

That should be sufficient for development purposes.
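
The default client builder used later in this post picks these credentials up automatically. If you want to sanity-check that the profile can be found, here is a quick sketch using the v1 SDK’s profile provider (the profile name "default" matches the file above):

AWSCredentials credentials =
    new ProfileCredentialsProvider("default").getCredentials();
System.out.println("Loaded access key: " + credentials.getAWSAccessKeyId());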

Azure

Sign into https://portal.azure.com/

Create a Storage account. Under the Storage account’s Settings is Access keys, where a pair of keys should already have been generated. Either one will work; just copy down the Connection string for the key you want to use.

The connection string will be used to authenticate when using the SDK.

Optional: a common pattern is to set an environment variable AZURE_STORAGE_CONNECTION_STRING whose value is the connection string, and have the code look it up at runtime. This avoids hard-coding the connection string into the source code.
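
For reference, the connection string copied from the portal looks roughly like this (the account name and key here are placeholders):

DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=AbCdEf...==;EndpointSuffix=core.windows.net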

GCS

Sign into https://console.cloud.google.com/

Create a project. Then create a Service account within the project.

In the project’s IAM > Permissions page, add the appropriate “Storage *” roles to the service account.

Add “Storage Admin” to include everything. After a while, the “Over granted permissions” column will show the permissions your code actually uses, and you can pare the roles down then.

Then create a key for the Service account. The recommended type is JSON; creating the key downloads a JSON file that will be needed later.

Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the full path to where the JSON file is stored.
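
If you would rather not depend on that environment variable, the credentials can also be loaded explicitly in code. A sketch, assuming the JSON key file sits at a placeholder path:

Storage storage = StorageOptions.newBuilder()
    .setCredentials(ServiceAccountCredentials.fromStream(
        new FileInputStream("/path/to/service-account-key.json")))  // placeholder path
    .build()
    .getService();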

Write Code to Upload File

The examples below are in Java.

S3

String bucketName = "mybucket";
String key = "myfile";
File file = new File(...);  // file to upload

AmazonS3 client = AmazonS3ClientBuilder
    .standard()
    .build();
if (!client.doesBucketExist(bucketName)) {
    client.createBucket(bucketName);
}
PutObjectRequest por = new PutObjectRequest(
    bucketName, key, file);
PutObjectResult result = client.putObject(por);

Azure

This uses the v8 (Legacy) API, which is what I ended up using. To do this with the newer v12 API, see https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-java#upload-blobs-to-a-container

String containerName = "mycontainer";
String key = "myfile";
File file = new File(...);  // file to upload

String connectionString = System.getenv(
    "AZURE_STORAGE_CONNECTION_STRING");
CloudStorageAccount account = CloudStorageAccount.parse(
    connectionString);
CloudBlobClient blobClient = account.createCloudBlobClient();
CloudBlobContainer container = blobClient.getContainerReference(
    containerName);
boolean created = container.createIfNotExists(
    BlobContainerPublicAccessType.CONTAINER, 
    new BlobRequestOptions(), new OperationContext());

CloudBlockBlob blob = container.getBlockBlobReference(key);
blob.uploadFromFile(file.getAbsolutePath());
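
For comparison, the rough v12 equivalent (per the quickstart linked above) looks something like the following. This is a sketch only, since I stayed on v8:

BlobServiceClient serviceClient = new BlobServiceClientBuilder()
    .connectionString(connectionString)
    .buildClient();
BlobContainerClient container = serviceClient
    .getBlobContainerClient(containerName);
if (!container.exists()) {
    container.create();
}
container.getBlobClient(key).uploadFromFile(file.getAbsolutePath());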

GCS

While the other two services have convenient methods to upload a file, the GCS Java SDK does not; it only has a variant that uploads a byte[], which is dangerous if your data can be large because the entire payload has to be held in memory.

Internet to the rescue, I guess: a Stack Overflow answer (linked in the code below) offers one solution, implementing our own buffered uploader:

private void uploadToStorage(
    Storage storage, File uploadFrom, BlobInfo blobInfo) 
    throws IOException {
    // Based on: https://stackoverflow.com/questions/53628684/how-to-upload-a-large-file-into-gcp-cloud-storage

    // Small files: a single byte[] upload is fine.
    if (uploadFrom.length() < 1_000_000) {
        storage.create(
            blobInfo, Files.readAllBytes(uploadFrom.toPath()));
    } else {
        // Larger files: stream through a WriteChannel in 10 KB chunks
        // so the whole file never has to be held in memory at once.
        try (WriteChannel writer = storage.writer(blobInfo)) {
            byte[] buffer = new byte[10_240];
            try (InputStream input = Files.newInputStream(uploadFrom.toPath())) {
                int limit;
                while ((limit = input.read(buffer)) >= 0) {
                    writer.write(
                        ByteBuffer.wrap(buffer, 0, limit));
                }
            }
        }
    }
}

With that defined, the upload code is:

String bucketName = "mybucket";
String key = "myfile";
File file = new File(...);  // file to upload

Storage storage = StorageOptions.getDefaultInstance()
    .getService();
Bucket bucket = storage.get(bucketName);
if (null == bucket) {
    bucket = storage.create(BucketInfo.of(bucketName));
}

BlobId blobId = BlobId.of(bucketName, key);
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
uploadToStorage(storage, file, blobInfo);
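
As with the other two services, metadata such as Content-Type can be attached here through the BlobInfo builder. For example (assuming a gzipped payload):

BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
    .setContentType("application/gzip")  // optional hint, as discussed earlier
    .build();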