GDAL and Google Cloud Storage (GCS)

GDAL supports Virtual File Systems, which allow the GDAL library and command-line tools to work with files on network storage. This is critical in the era of Cloud-Native Geospatial, where accessing and sharing geospatial data via cloud storage services is becoming standard practice. In this post, we will see how to use GDAL command-line tools to read and write data on Google Cloud Storage (GCS) using the /vsigs file system handler. We will focus exclusively on Google Cloud Storage – but the same concepts apply when using other cloud services such as AWS S3 or Azure Blob Storage. Similarly, other GDAL-based tools – such as rasterio – will be able to access the data from GCS using the same configuration options shown here.

The post covers the following topics:

  • Reading Files from Public GCS Buckets
  • Creating Private GCS Buckets and Uploading Data
  • Configuring Authentication and Reading Data from Private GCS Buckets
  • Writing Data to Private GCS Buckets
  • Using Environment Variables

Note 1: This post assumes familiarity with the GDAL command-line tools and that you have GDAL installed on your machine. You will find detailed instructions for installation in our course material for Mastering GDAL Tools.

Note 2: The code snippets are split over multiple lines for readability using the Windows line-continuation character ^. If you are running these on Mac/Linux, replace it with \ instead.

Reading Files from Public GCS Buckets

Public files do not require any authentication for read access. We specify filenames in the form /vsigs/<gcs bucket name>/<folder name>/<file name> for GDAL to locate the file. When reading public files, we can set the GS_NO_SIGN_REQUEST=YES configuration option to disable authentication. The following command accesses public data from the GCS bucket spatialthoughts-public-data and shows the file metadata.

gdalinfo [gcs_file_path] --config GS_NO_SIGN_REQUEST YES
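
If you are scripting this, the /vsigs path layout described above can be captured in a small Python helper. This is just an illustrative sketch – vsigs_path is a hypothetical function, not part of GDAL – and the folder and file names in the example are placeholders:

```python
def vsigs_path(bucket, *parts):
    """Build a GDAL /vsigs/ path from a bucket name and path components."""
    return '/'.join(['/vsigs', bucket, *parts])

# Placeholder folder/file names inside the public bucket used in this post
print(vsigs_path('spatialthoughts-public-data', 'folder', 'file.tif'))
# prints /vsigs/spatialthoughts-public-data/folder/file.tif
```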

Creating Private GCS Buckets and Uploading Data

If you already have a GCS bucket and know how to upload data, you can skip to the next section. If you are new to GCS, follow this guide to set up your private bucket.

Head over to the Google Cloud Console. If this is your first time, you will need to accept the Terms of Service and set up a billing account. To create a cloud storage bucket, you have to first create a Cloud Project. From your Cloud Console Dashboard, find the Projects dropdown menu and click New Project. Enter a Project name and a unique Project ID. Click Create.

Once the project is created, go to Cloud Storage Browser.

Click +Create at the top to create a new bucket. You need to give your bucket a globally unique name. The other options can be kept at their defaults or changed as per your requirements. Click Create to create the bucket with the chosen options.

Once created, click on the bucket name. You can then use the menu options to create folders within the bucket and upload files. If you want to upload a large amount of data, you can use the gcloud utility, which is covered in the next section.

Now you have a private cloud storage bucket ready with some data – we will learn how to access this via GDAL command-line tools.

Configuring Authentication and Reading Data from Private GCS Buckets

When accessing private data with GDAL, you need to add certain configuration options so that GDAL can authenticate correctly. There are several options available for authentication. We will cover the following methods:

  • Using the .boto configuration file: Simple, but tied to a specific machine.
  • Using OAuth2 authentication: Easy to set up and portable across different machines.
  • Using Service Account credentials: Most complex to set up, but recommended for production use.

To obtain the authentication credentials, you will need to install the gcloud CLI on your machine. Follow the official instructions for installation.

Once installed, run the following command and follow the instructions to sign in to your account and select the Cloud Project that is linked with your Cloud Storage bucket.

gcloud init

You may also have to run the following command to complete authentication.

gcloud auth login

Authentication Using the .boto configuration file

Once you have successfully initialized gcloud on your machine, it saves the authentication credentials to a file named .boto. We can find the location of this file and use it with GDAL tools to supply the credentials. When you install gcloud, you also get another command-line utility named gsutil, which is used to work with Google Cloud Storage data. Let’s find the location of the .boto file using the following command.

gsutil version -l

You may see multiple paths listed under config path(s). Pick the one that has the file containing the credentials. You can use type <filename> on Windows or cat <filename> on Mac/Linux to check which is the correct file.

Now we are ready to read the files stored in the private GCS bucket with GDAL. The following command uses the CPL_GS_CREDENTIALS_FILE configuration option, which points to the location of the .boto file. With this, GDAL will be able to access and read the data.

gdalinfo [gcs_file_path] ^
    --config CPL_GS_CREDENTIALS_FILE [.boto_file_path]

OAuth2 authentication

We can also specify the authentication parameters on the command line itself. This makes the commands more portable, allowing you to run them from any machine. Let’s check the contents of the .boto file we used in the previous method. Enter the following command.

Windows Users

type [.boto_file_path]

Linux/Mac Users

cat [.boto_file_path]

We need the values of client_id, client_secret and gs_oauth2_refresh_token shown here. Copy them and construct a command line with the configuration options as below. Replace the values in [] with your own. You can now run this command on any machine and read data from your private bucket.

gdalinfo [gcs_file_path] ^
    --config GS_OAUTH2_CLIENT_ID [client_id] ^
    --config GS_OAUTH2_CLIENT_SECRET [client_secret] ^
    --config GS_OAUTH2_REFRESH_TOKEN [gs_oauth2_refresh_token]
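
These values can also be extracted from the .boto file programmatically. The sketch below assumes the standard .boto layout written by gcloud/gsutil, where the refresh token lives in a [Credentials] section and the client ID and secret in an [OAuth2] section; it parses an inline dummy file to stay self-contained:

```python
import configparser

# Dummy .boto contents mirroring the layout produced by gcloud/gsutil
# (the values here are placeholders, not real credentials)
sample = """
[Credentials]
gs_oauth2_refresh_token = dummy-refresh-token

[OAuth2]
client_id = dummy-client-id
client_secret = dummy-client-secret
"""

config = configparser.ConfigParser()
config.read_string(sample)

# Pull out the three values needed for the GDAL configuration options
refresh_token = config['Credentials']['gs_oauth2_refresh_token']
client_id = config['OAuth2']['client_id']
client_secret = config['OAuth2']['client_secret']

print(client_id, client_secret, refresh_token)
```

To use this with a real .boto file, replace read_string with config.read('[.boto_file_path]').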

Using Service Account Credentials

If you plan to run GDAL jobs on a server, it is preferable to use a Service Account. A service account is a special kind of account with specific permissions and is used by automated jobs. You can create a service account from Google Cloud Console and obtain the required credentials. From your Google Cloud Console Dashboard, go to APIs & Services → Credentials.

Click on + Create Credentials and select Service account.

Once the account is created, you will have an email address for the service account. Make a note of that. Next, click on KEYS.

Click ADD KEY → Create new key.

Select JSON as the key type and click CREATE. A new file with .json extension will be downloaded to your computer. This file contains the key to access the service account.

GDAL expects the key file to contain only the private key, and the file must begin with -----BEGIN PRIVATE KEY-----. Open the .json file in a text editor and copy the text of the private key into a new file.

Save this file as key.txt.
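
This extraction step can also be scripted. When the .json file is parsed, the \n escape sequences inside the private_key field become real line breaks, which gives the format expected in key.txt. A minimal sketch using an inline dummy key (the email address and key material below are fake placeholder values, though the field names match the downloaded service account JSON):

```python
import json

# Dummy service account JSON (structure matches the downloaded file,
# but the key material here is fake)
key_json = json.loads("""
{
  "type": "service_account",
  "client_email": "my-account@my-project.iam.gserviceaccount.com",
  "private_key": "-----BEGIN PRIVATE KEY-----\\nFAKEKEYDATA\\n-----END PRIVATE KEY-----\\n"
}
""")

# json parsing has already turned the \\n escapes into real newlines,
# so the value can be written out directly as key.txt
with open('key.txt', 'w') as f:
    f.write(key_json['private_key'])

# The service account email is used with GS_OAUTH2_CLIENT_EMAIL
print(key_json['client_email'])
```

For the real file, replace json.loads with json.load(open('your-key-file.json')).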

We are now ready to use the service account credentials with GDAL commands. We specify the service account email and path to the key.txt file as configuration options. GDAL uses these to authenticate and read the file from the private bucket.

gdalinfo [gcs_file_path] ^
    --config GS_OAUTH2_PRIVATE_KEY_FILE key.txt ^
    --config GS_OAUTH2_CLIENT_EMAIL [oauth2_client_email]

Writing Data to Private GCS Buckets

In the previous section, we saw three different authentication methods and used them to read data from a private GCS bucket. We can use any of those methods to also write data to a private GCS bucket.

When creating GeoTIFF files on GCS, you must set the configuration option CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE to YES. This makes GDAL write the output to a local temporary file and upload it to the bucket when the dataset is closed, since the GCS file system does not support random writes. With this option, we can use the gdal_translate command to take a local file and convert it to a Cloud-Optimized GeoTIFF directly in our private bucket.

gdal_translate -of COG [input_file] [gcs_file_path] ^
    --config GS_OAUTH2_CLIENT_ID [client_id] ^
    --config GS_OAUTH2_CLIENT_SECRET [client_secret] ^
    --config GS_OAUTH2_REFRESH_TOKEN [gs_oauth2_refresh_token] ^
    --config CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE YES

Using Environment Variables

A good practice is to store sensitive credentials as environment variables instead of configuration options specified on the command-line. This allows you to safely store your code in a version control system without exposing the credentials. You are also able to change/update the authentication credentials without making any changes to your code.

Let’s say we want to use OAuth2 authentication and store the credentials as environment variables. To set the environment variables, use the following syntax.

Windows

set GS_OAUTH2_CLIENT_ID=[client_id]
set GS_OAUTH2_CLIENT_SECRET=[client_secret]
set GS_OAUTH2_REFRESH_TOKEN=[gs_oauth2_refresh_token]

Linux/Mac

export GS_OAUTH2_CLIENT_ID=[client_id]
export GS_OAUTH2_CLIENT_SECRET=[client_secret]
export GS_OAUTH2_REFRESH_TOKEN=[gs_oauth2_refresh_token]

The environment variables need to be set just once per session. Once they are set, GDAL commands will automatically use them to perform authentication. With the environment variables set, we can simplify our command for writing data to a private bucket.

gdal_translate -of COG [input_file] [gcs_file_path] ^
    --config CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE YES
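
When running such commands from scripts, it can help to verify that the variables are actually set before invoking GDAL. A minimal sketch – missing_credentials is a hypothetical helper, and the dummy values simulate a session where the variables have already been exported:

```python
import os

# The three variables required for OAuth2 authentication
REQUIRED_VARS = [
    'GS_OAUTH2_CLIENT_ID',
    'GS_OAUTH2_CLIENT_SECRET',
    'GS_OAUTH2_REFRESH_TOKEN',
]

def missing_credentials():
    """Return the names of required variables that are not set."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

# Simulate a session where the variables have been exported
for name in REQUIRED_VARS:
    os.environ.setdefault(name, 'dummy-value')

print(missing_credentials())
# prints [] when all credentials are set
```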

Note on Security

API keys allow automated systems to access your account, and they must be kept secret. You should delete or revoke keys if they are ever leaked or exposed.

As a precaution, all keys and tokens used in the screenshots of this post have been deleted/revoked.

Deleting Service Account Key

You can delete previously obtained JSON keys for a service account directly from the Google Cloud Console → APIs & Services → Credentials.

Deleting Refresh Tokens

There is no user interface for deleting the refresh tokens obtained when you run gcloud init or gcloud auth login. You need to revoke them by calling a URL to prevent further usage. Here’s an example Python script that you can run from your system to revoke a token. Save the following lines as revoke.py, replacing [gs_oauth2_refresh_token] with your token.

import requests

exposed_token = '[gs_oauth2_refresh_token]'

# Call the OAuth2 revocation endpoint to invalidate the token
response = requests.post('https://oauth2.googleapis.com/revoke',
    params={'token': exposed_token},
    headers={'content-type': 'application/x-www-form-urlencoded'})

# A 200 status code indicates the token was revoked successfully
print(response.status_code)

Then run the script using the command below.

python revoke.py
