Some Cloud Storage Services

Java, programming

The log4j-s3-search project now supports Azure Blob Storage (“Azure”) and Google Cloud Storage (“GCS”) in addition to AWS S3 (“S3”).

While working on adding the two options, I learned a bit about the three storage services. These services’ similarity to one another may not be a coincidence. After all, they are competing offerings from competing service providers, so months/years of competitive analyses may just yield a bunch of similar things.

HTTP and Language SDKs

All three services have a basic HTTP interface.

However, all three services also have language bindings, or client SDKs, for popular programming languages (e.g. Java, Python, C, etc.). In my experience, working with these SDKs is definitely easier than dealing with the services yourself over raw HTTP. This is especially true for AWS S3, considering the logic required to sign a request.

Storage Model

The models are similar for all three:

  • A global namespace is used for a container of blobs. This is referred to as either a container (Azure) or a bucket (S3 and GCS).
  • Within each container are key-value pairs where the key is a string of characters that may resemble a path, mimicking the folder hierarchy used in file systems (e.g. “/documents/2020/April/abc.doc”).
    The consoles for these services may interpret keys of that format and present a hierarchical, tree-like interface to further the illusion of a hierarchy. However, keep in mind that underneath, in the implementation, a key is just a string of characters.
  • The value is a binary stream (a “blob”) of arbitrary data. Typically the services allow attaching metadata to the key-value entry. One common metadata property is “Content-Type,” which is used like the HTTP header of the same name: to hint to consumers of the blob what the content is (e.g. “text/plain,” “application/json,” “application/gzip,” etc.). A quick sketch of setting this with the S3 SDK follows this list.
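For instance, with the S3 Java SDK (v1) the content type can be attached at upload time. A minimal sketch; the bucket name, key, and payload are just placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

byte[] data = "{\"greeting\":\"hello\"}".getBytes(StandardCharsets.UTF_8);

// Hint to downstream consumers what the blob contains.
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType("application/json");
metadata.setContentLength(data.length);

AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
s3.putObject(new PutObjectRequest(
    "mybucket", "documents/2020/April/abc.json",
    new ByteArrayInputStream(data), metadata));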

Not-So-Quick Walkthrough

The following steps are what I went through in order to upload a file into the services.

In order to use these services, of course, an account with the service is required. All three services have a “free” period for first-time users.

Set Up Authentication

S3

Sign into https://console.aws.amazon.com/

Create an access key. Despite the name, creating one actually yields two values: an access key and a corresponding secret key. The process is a bit clumsy: be sure to write down and/or download and save the secret key because it is only shown at creation time. The access key remains listed in the console, but if the secret key is lost, a new access key needs to be created.

Create a subdirectory .aws and a file credentials under your user’s home directory ($HOME in Linux/Mac, %USERPROFILE% in Windows):

.aws/
    credentials

The contents of the file should be something like (substitute in your actual access and secret keys, of course):

[default]
aws_access_key_id = ABCDEFGABCDEFGABCDEFG
aws_secret_access_key = EFGABCDEFG/ABCDEFGABCDEFGAABCDEFGABCDEFG

That should be sufficient for development purposes.
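With that file in place, the Java SDK’s default credentials provider chain should pick the keys up automatically (environment variables and system properties are also checked first). A quick sanity check, assuming the v1 SDK:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;

// Resolves credentials from environment variables, system properties,
// and finally ~/.aws/credentials; throws if nothing usable is found.
AWSCredentials credentials =
    DefaultAWSCredentialsProviderChain.getInstance().getCredentials();
System.out.println("Using access key: " + credentials.getAWSAccessKeyId());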

Azure

Sign into https://portal.azure.com/

Create a Storage account. One of the Settings for the Storage account is Access keys, where a pair of keys should have been generated. Either one will work; just copy down the Connection string of the key to use.

The connection string will be used to authenticate when using the SDK.

Optional: one common pattern I see is to create an environment variable AZURE_STORAGE_CONNECTION_STRING whose value is the connection string. The code then simply looks up the environment variable for the value, which avoids hard-coding the connection string into the source code.

GCS

Sign into https://console.cloud.google.com/

Create a project. Then create a Service account within the project.

In the project’s IAM > Permissions page, add the appropriate “Storage *” roles to the service account.

Add “Storage Admin” to include everything. After a while, the “Over granted permissions” column will show the actual permissions needed based on your code’s usage, and you can adjust accordingly.

Then create a key for the Service account. The recommended type is JSON. This will download a JSON file that will be needed.

Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the full path to where the JSON file is stored.
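The GCS Java SDK’s application-default-credentials lookup will then find that file on its own; nothing needs to be hard-coded. A minimal check, assuming the google-auth-library is on the classpath:

import com.google.auth.oauth2.GoogleCredentials;

// Throws IOException if GOOGLE_APPLICATION_CREDENTIALS does not point
// to a readable service-account JSON key file.
GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();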

Write Code to Upload File

The examples below are in Java.

S3

String bucketName = "mybucket";
String key = "myfile";
File file = new File(...);  // file to upload

// The default client picks up credentials from ~/.aws/credentials (see above).
AmazonS3 s3 = AmazonS3ClientBuilder
    .standard()
    .build();
if (!s3.doesBucketExist(bucketName)) {
    s3.createBucket(bucketName);
}
PutObjectRequest por = new PutObjectRequest(
    bucketName, key, file);
PutObjectResult result = s3.putObject(por);

Azure

This uses the v8 (legacy) API, which is what I ended up going with. To do this with the newer v12 API, see https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-java#upload-blobs-to-a-container

String containerName = "mycontainer";
String key = "myfile";
File file = new File(...);  // file to upload

String connectionString = System.getenv(
    "AZURE_STORAGE_CONNECTION_STRING");
CloudStorageAccount account = CloudStorageAccount.parse(
    connectionString);
CloudBlobClient blobClient = account.createCloudBlobClient();
CloudBlobContainer container = blobClient.getContainerReference(
    containerName);
boolean created = container.createIfNotExists(
    BlobContainerPublicAccessType.CONTAINER, 
    new BlobRequestOptions(), new OperationContext());

CloudBlockBlob blob = container.getBlockBlobReference(key);
blob.uploadFromFile(file.getAbsolutePath());

GCS

While the other two services have convenient methods for uploading a file, the GCS Java SDK does not; it only offers a version that takes a byte[], which is dangerous if your data can be large.

Internet to the rescue, I guess: the Stack Overflow question referenced in the code below suggests implementing our own buffering uploader:

private void uploadToStorage(
    Storage storage, File uploadFrom, BlobInfo blobInfo) 
    throws IOException {
    // Based on: https://stackoverflow.com/questions/53628684/how-to-upload-a-large-file-into-gcp-cloud-storage

    if (uploadFrom.length() < 1_000_000) {
        storage.create(
            blobInfo, Files.readAllBytes(uploadFrom.toPath()));
    } else {
        try (WriteChannel writer = storage.writer(blobInfo)) {
            byte[] buffer = new byte[10_240];
            try (InputStream input = Files.newInputStream(uploadFrom.toPath())) {
                int limit;
                while ((limit = input.read(buffer)) >= 0) {
                    writer.write(
                        ByteBuffer.wrap(buffer, 0, limit));
                }
            }
        }
    }
}

With that defined, then the upload code is:

String bucketName = "mybucket";
String key = "myfile";
File file = new File(...);  // file to upload

Storage storage = StorageOptions.getDefaultInstance()
    .getService();
Bucket bucket = storage.get(bucketName);
if (null == bucket) {
    bucket = storage.create(BucketInfo.of(bucketName));
}

BlobId blobId = BlobId.of(bucketName, key);
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
uploadToStorage(storage, file, blobInfo);

Fast Guide to Launching an EC2 Instance w/ SSH Access

AWS, ec2, ssh, Windows

Concepts

Minimal number of concepts to understand:

  • Key pair — a pair of public and private cryptographic keys that will be used to establish a secure shell/terminal to the launched EC2 instance.
  • Security group — a group of access rules that determine what network traffic can go into (inbound rules) and out of (outbound rules) the EC2 instance.
  • IAM role — a collection of rules that determine what AWS services the EC2 instance will have access to (and what kind of access). E.g. read-only access to S3.
  • AMI — an image that prescribes an OS and some software to run when an EC2 instance comes up.

Shared Key Pair

The only thing shared between EC2 and the SSH program that matters in this example is the key pair. The instructions here describe how to create a new key pair.

Creating a Key Pair

  • Log into the AWS console. The remaining steps to launch an instance will be done in the AWS Console.
  • Access Services > EC2 > Key Pairs from AWS Console.
  • Click “Create Key Pair”
  • Give it a name “KP”
  • Once it’s created, a “.pem” file will be downloaded. Remember the name and where the file is downloaded. It will be needed later.

Create a Security Group

  • Access Services > EC2 > Security Groups
  • Click “Create Security Group” to create a security group. Name it “SG.”
  • In the “Inbound” rules, add an entry for Type SSH, Protocol TCP, Port Range 22. For the Source, select “My IP” to let the tool automatically select your IP address.
  • Add other rules to open up more ports as needed.

Create an IAM Role

  • Access Services > Security, Identity, & Compliance > IAM > Roles
  • Click “Create Role” and select “EC2” (as opposed to Lambda)
  • Click “Next: Permissions”
  • Add permissions as needed (e.g. add “AmazonS3ReadOnlyAccess” if read-only access to S3 is needed).
  • Give the role a name and description.

Launch Instance

  • Access Services > EC2 > EC2 Dashboard
  • Click “Launch Instance”
  • Select an appropriate AMI (e.g. any of the Amazon Linux ones) to use for the EC2 instance. For the instance type, start with “t2.nano” to experiment with since it’s cheapest. Once the instance is up and running, larger instance types can be used as needed.
  • Click “Next: Configure Instance Details.”
  • For IAM role, select the role created above. Everything else can stay as-is.
  • Click “Next: Add Storage.”
  • Edit as desired, but the default is fine.
  • Click “Next: Add Tags.”
  • Add tags as needed. These are optional.
  • Click “Next: Configure Security Group.”
  • Choose “Select an existing security group” and select the security group “SG” created above.
  • Click “Review and Launch.”
  • Click “Launch” after everything looks right.
  • A modal comes up to allow selection of a key pair to use to access the instance. Select “KP” as created above.
  • Continue the launch.
  • Click on the “i-xxxxxxxxx” link to see the status of the instance.
  • Wait until Instance State is “running” and Status Checks is “2/2.”
  • Note the “Public DNS (IPv4)” value. It is the host name to SSH into.

Connecting to The EC2 Instance

Windows with Bitvise SSH Client

  • Download and start Bitvise SSH Client.
  • Click “New profile”
  • Go to “Login” tab
  • Click the link “Client key manager”
  • Click “Import”
  • Change file filter to (*.*)
  • Locate the .pem file downloaded above and import it. Accept default settings.
  • In the “Server” tab, enter the Public DNS host name from above. Use port 22.
  • In the “Authentication” section, enter “ec2-user” as the Username.
  • Use “publickey” as the Initial Method.
  • For “Client key,” select the profile created earlier when importing the .pem file.
  • Click “Log in” and confirm any dialogs.

Mac OS

Change the permission of the downloaded .pem file to allow only the owner access:

chmod 400 ~/Downloads/mykey.pem

Use ssh with the .pem file:

ssh -i ~/Downloads/mykey.pem ec2-user@xx-xx-xx-xx.yyyy.amazonaws.com

EC2: “chmod go+rw ~” breaks SSH

AWS, ec2, programming, ssh

“chmod go+rw ~” breaks SSH

Quick note: running

chmod go+rw /home/ec2-user

could break subsequent attempts to SSH into the EC2 instance.

When all the usual suspects regarding SSH identity files, keypairs, etc. are ruled out, one not-so-well-documented cause for the dreaded

Permission denied (publickey).

error could be that the default permissions on /home/ec2-user were modified.

The permissions can be modified temporarily in order to perform some tasks. However, before exiting that SSH session, be sure to restore the original permissions (0700, i.e. chmod 700 /home/ec2-user) on that home directory, lest all subsequent SSH attempts fail.

 

SimpleDB w/ AWS SDK

programming

Problem to solve

A flow on the App server triggers an event that needs to be handled by several listeners, each of which then needs to run for a couple of seconds (or more). Obviously the Web request has to return relatively fast, so doing all that synchronously is out of the question.

One solution is to use a thread pool with N worker threads to do these queued-up tasks. But that approach also takes processing power away from the App server: it works sometimes and not others, depending on the load and cost of these tasks.

Another solution, if timing is not that important, is to push up a request into a persistent message queue and have worker processes pull requests off of that queue and process each.  Nice separation of concerns.  Decoupling. Etc.  All the nice things that message queue vendors say.

Each message has to be processed in sequence relative to others.  In other words, FIFO.

Why SimpleDB?

There are full-blown message queue solutions (ZeroMQ & ActiveMQ come to mind). But then I’d have to get the “DevOps” people to stand one up and manage it which is kinda like pulling teeth.

Then there’s the good ol’ stand-by: PostgreSQL. Others have used a dedicated table for queuing purposes, with a “status” column and values like “Pending,” “Processing,” “Success,” and “Fail.” It’s persistent, and the synthetic row ID can be used for sequencing to get the FIFO iteration. But I have always hated the idea of using a table to implement a queue, mostly because of all the hassle of checking in a migration script to create the damn table and the boiler-plate process to go through. And the schema is inflexible if I ever need to tweak the message structure, so I either need to worry about schema changes or break first normal form.

Then there are SaaS offerings.  We use EC2 and various other services from the AWS stack, so it makes sense to see what can work.

SQS — not FIFO

Initially I had gone with SQS for a work queue; it has worked well for some other things I’ve done in the past. BUT in this case I actually need strict ordering, which SQS does not provide. I tried to get around that using a timestamp in the message, but unless I read all the messages (or a sufficient number of them), I can’t tell whether the first message I get is the right one. Then I have to either cache the messages locally to sort and work through them, and/or re-queue the rest, which pushes them further back in the queue when they should probably be at the head, and the sequencing just gets worse. Yuck.

Comes SimpleDB

So then comes SimpleDB. It’s persistent, almost like an RDBMS without the strict schema and DDL BS (those are good things for domain data, to be sure–just not for my message queue).

So each “row” will have the attribute “timestamp”, which currently is just the Unix time (System.currentTimeMillis())–a convenient long that is easy to sort by to get the FIFO behavior. The rest of the attributes are basically whatever I need for my message.
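For reference, enqueueing such a “row” with the AWS SDK for Java (v1) looks roughly like this; the domain and attribute names are just my examples:

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

import java.util.Arrays;
import java.util.UUID;

AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.standard().build();

// Each message becomes an item; the item name just needs to be unique.
String itemName = UUID.randomUUID().toString();
// SimpleDB compares values as text, so lexicographic ordering of the raw
// millis only matches numeric ordering while the digit counts are equal.
String timestamp = String.valueOf(System.currentTimeMillis());

sdb.putAttributes(new PutAttributesRequest(
    "my-domain", itemName, Arrays.asList(
        new ReplaceableAttribute("timestamp", timestamp, true),
        new ReplaceableAttribute("payload", "whatever the message needs", true))));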

AWS Console: no SimpleDB here

I had to check my glasses because I could not find a page/console for SimpleDB. Almost every service in the AWS suite has at least a page for managing instances of the service. DynamoDB has one, for instance. (Incidentally, I didn’t go with DynamoDB since it just felt like overkill at the time.)

Anyway. No management page. So it’s a bit frustrating to get things up to test and to figure out what went wrong when things don’t work. There IS a tool: SimpleDB Scratchpad (https://aws.amazon.com/code/JavaScript/1137), but it needs to be downloaded and “installed.” The installation process is more than just un-packaging the files:

  1. Change the various endpoints to point to the correct one for the region I want (i.e. change “sdb.amazonaws.com” to “sdb.us-west-2.amazonaws.com” for the US-WEST-2 region).
  2. Pre-fill the key and secret in the navbar since it’s tedious to type that in all the time.

Once that’s all done, simply bringing up webapp/index.html as a file in a browser pretty much works–except not in Chrome, because it thinks it’s more clever than you about JavaScript security. Fine. Firefox it is.

Weird SELECT syntax

The SELECT query syntax is deceptively similar to SQL, but it is not really SQL. Obviously joins are out of the question, but even simple queries have nuances (a sketch using the Java SDK follows this list):

  • If your table (“domain”) name contains anything other than alphanumeric characters and _, OR if it starts with a digit, you need to quote it with back-ticks (`). E.g. SELECT * FROM `my-domain` …
  • To sort on (aka ORDER BY) something, that thing has to appear in the WHERE clause (huh?!). E.g. SELECT … WHERE timestamp > ‘0’ ORDER BY timestamp
  • Why the quotes around 0? All literals seem to be of type text, even things like the timestamp.
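Putting those nuances together, a select through the Java SDK looks roughly like this (again, the domain and attribute names are my examples):

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;

AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.standard().build();

// Back-ticks because the domain name contains a dash, the sort attribute
// repeated in the WHERE clause, and a quoted literal even though it is numeric.
String query = "select * from `my-domain` "
    + "where timestamp > '0' order by timestamp asc limit 100";

for (Item item : sdb.select(new SelectRequest(query)).getItems()) {
    System.out.println(item.getName() + ": " + item.getAttributes());
}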