Access Data from the Ocean Color Instrument (OCI)#

Authors: Anna Windle (NASA, SSAI), Ian Carroll (NASA, UMBC), Carina Poulin (NASA, SSAI)

PREREQUISITES

This notebook has the following prerequisites:

  • An Earthdata Login account is required to access data from the NASA Earthdata system, including NASA ocean color data.

  • There are no prerequisite notebooks for this module.

Summary#

In this example we will use the earthaccess package to search for OCI products on NASA Earthdata. The earthaccess package, published on the Python Package Index and conda-forge, facilitates discovery and use of all NASA Earth Science data products by providing an abstraction layer for NASA’s Common Metadata Repository (CMR) API and by simplifying requests to NASA’s Earthdata Cloud. Searching for data is more approachable using earthaccess than low-level HTTP requests, and the same goes for S3 requests.

In short, earthaccess helps authenticate with Earthdata Login, makes search easier, and provides a streamlined way to load data into xarray containers. For more on earthaccess, visit the documentation site. Be aware that earthaccess is under active development.

To understand the discussions below on downloading and opening data, we need to clearly understand where our notebook is running. There are three cases to distinguish:

  1. The notebook is running on the local host. For instance, you started a Jupyter server on your laptop.

  2. The notebook is running on a remote host, but it does not have direct access to the NASA Earthdata Cloud. For instance, you are running in GitHub Codespaces.

  3. The notebook is running on a remote host that does have direct access to the NASA Earthdata Cloud. At this time, we cannot provide a “for instance” which is available to everyone.

Learning Objectives#

At the end of this notebook you will know:

  • How to store your NASA Earthdata Login credentials with earthaccess

  • How to use earthaccess to search for OCI data using search filters

  • How to download OCI data, but only when you need to

Contents#

  1. Setup

  2. NASA Earthdata Authentication

  3. Search for Data

  4. Download Data

1. Setup#

We begin by importing the only package used in this notebook. If you have created an environment following the guidance provided with this tutorial, then the import will be successful.

import earthaccess

We also need pathlib for directory creation, at least until earthaccess version 0.9.1 is available.

import pathlib

Back to top

2. NASA Earthdata Authentication#

Next, we authenticate using our Earthdata Login credentials. Authentication is not needed to search publicly available collections in Earthdata, but is always needed to access data. We can use the login method from the earthaccess package. This will create an authenticated session when we provide a valid Earthdata Login username and password. The earthaccess package will search for credentials defined by environment variables or within a .netrc file saved in the home directory. If credentials are not found, an interactive prompt will allow you to input credentials.

auth = earthaccess.login(persist=True)
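
If you prefer to be explicit about where credentials come from, login also accepts a strategy argument. The line below is an optional, hedged sketch using the strategy names documented for recent earthaccess releases; verify them against the documentation for your installed version.

# Optional: read credentials only from environment variables
# (EARTHDATA_USERNAME and EARTHDATA_PASSWORD); use strategy="netrc"
# to read only from the .netrc file instead.
auth = earthaccess.login(strategy="environment")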

Back to top

3. Search for Data#

Collections on NASA Earthdata are discovered with the search_datasets function, which accepts an instrument filter as an easy way to get started. Each of the items in the list of collections returned has a “short-name”.

results = earthaccess.search_datasets(instrument="oci")
Datasets found: 19
for item in results:
    summary = item.summary()
    print(summary["short-name"])
PACE_OCI_L1A_SCI
PACE_OCI_L1B_SCI
PACE_OCI_L1C_SCI
PACE_OCI_L2_AOP_NRT
PACE_OCI_L2_BGC_NRT
PACE_OCI_L2_IOP_NRT
PACE_OCI_L2_PAR_NRT
PACE_OCI_L3B_CHL_NRT
PACE_OCI_L3B_IOP_NRT
PACE_OCI_L3B_KD_NRT
PACE_OCI_L3B_PAR_NRT
PACE_OCI_L3B_POC_NRT
PACE_OCI_L3B_RRS_NRT
PACE_OCI_L3M_CHL_NRT
PACE_OCI_L3M_IOP_NRT
PACE_OCI_L3M_KD_NRT
PACE_OCI_L3M_PAR_NRT
PACE_OCI_L3M_POC_NRT
PACE_OCI_L3M_RRS_NRT
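
The search_datasets function accepts additional filters beyond instrument. As a hedged sketch, the keyword argument below adds a free-text search on top of the instrument filter; confirm the available parameters in the earthaccess documentation for your installed version.

# Hedged sketch: combine the instrument filter with a free-text keyword
# search. The matching collections depend on the current CMR holdings.
results = earthaccess.search_datasets(instrument="oci", keyword="chlorophyll")
for item in results:
    print(item.summary()["short-name"])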

Next, we use the search_data function to find granules within a collection. Let’s use the short_name for the PACE/OCI Level-2 quick-look, or near real time (NRT), product for biogeochemical properties (although you can search for granules across collections too).

The count argument limits the number of granules returned and stored in the results list, not the number of granules found.

results = earthaccess.search_data(
    short_name="PACE_OCI_L2_BGC_NRT",
    count=1,
)
Granules found: 8967
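
To see the distinction, compare the number reported above with the length of the list we actually received; this small check is not part of the original workflow.

# count=1 caps the list at a single granule, even though the search
# matched many more in the collection.
len(results)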

We can refine our search by passing more parameters that describe the spatiotemporal domain of our use case. Here, we use the temporal parameter to request a date range and the bounding_box parameter to request granules that intersect with a bounding box. We can even provide a cloud_cover range to limit results to granules with a lower percentage of cloud cover. We do not provide a count, so we’ll get all granules that satisfy the constraints.

tspan = ("2024-05-01", "2024-05-16")
bbox = (-76.75, 36.97, -75.74, 39.01)
clouds = (0, 50)
results = earthaccess.search_data(
    short_name="PACE_OCI_L2_BGC_NRT",
    temporal=tspan,
    bounding_box=bbox,
    cloud_cover=clouds,
)
Granules found: 3

Displaying a result shows its direct download link: try it! The link will download one granule to your local machine, which may or may not be what you want to do. Even if you are running the notebook on a remote host, this download link will open a new browser tab or window and offer to save a file to your local machine. If you are running the notebook locally, this may be of use. However, in the next section we’ll see how to download all the results with one command.

results[0]

Data: PACE_OCI.20240502T172807.L2.OC_BGC.V1_0_0.NRT.nc

Size: 18.65 MB

Cloud Hosted: True

Data Preview
results[1]

Data: PACE_OCI.20240508T174100.L2.OC_BGC.V1_0_0.NRT.nc

Size: 18.45 MB

Cloud Hosted: True

Data Preview
results[2]

Data: PACE_OCI.20240513T171853.L2.OC_BGC.V1_0_0.NRT.nc

Size: 19.24 MB

Cloud Hosted: True

Data Preview
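
If you would rather collect these links programmatically than click through the rich display, each result also exposes its URLs. The short sketch below uses the data_links method available on earthaccess search results; the exact set of links returned can vary with where the code is running.

# Print the data URL(s) attached to each granule in the search results.
for granule in results:
    print(granule.data_links())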

Back to top

4. Download Data#

An upcoming tutorial will need access to Level-1 files, whether or not we have direct access to the Earthdata Cloud, so let’s go ahead and download a couple granules. As always, we start with a call to earthaccess.search_data.

results = earthaccess.search_data(
    short_name="PACE_OCI_L1B_SCI",
    temporal=tspan,
    bounding_box=bbox,
    count=2,
)
Granules found: 23

Now, we first need to understand the alternative to downloading granules, since you may be surprised that there is an alternative at all. The earthaccess.open function accepts the list of results from earthaccess.search_data and returns a list of file-like objects. No actual files are transferred.

paths = earthaccess.open(results)
Opening 2 granules, approx size: 3.47 GB
using endpoint: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials

The file-like objects held in paths can each be read like a normal file. Here we load the first few bytes without any specialized reader.

with paths[0] as file:
    line = file.readline().strip()
line
b'\x89HDF'

Of course that doesn’t mean anything (or does it? 😉), because this is a binary file that needs a reader which understands the file format.

The earthaccess.open function is used when you want to read bytes directly from a remote filesystem without downloading a whole file. When running code on a host with direct access to the NASA Earthdata Cloud, you don’t need to download the data and earthaccess.open is the way to go.
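
A reader such as xarray can work with these file-like objects directly. The sketch below is only an illustration: it assumes xarray and an HDF5-capable backend (such as h5netcdf) are installed, and it assumes the OCI Level-1B measurements live in a group named "observation_data"; adjust as needed for your files.

import xarray as xr

# Re-open the granules (the context manager above closed paths[0]) and let
# xarray read directly from the remote file-like object. The group name is
# an assumption about the OCI L1B layout.
paths = earthaccess.open(results)
dataset = xr.open_dataset(paths[0], group="observation_data")
dataset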

Now, let’s look at the earthaccess.download function, which is used to copy files onto a filesystem local to the machine executing the code. For this function, provide the output of earthaccess.search_data along with a directory where earthaccess will store downloaded granules.

Even if you only want to read a slice of the data, and downloading seems unnecessary, using earthaccess.open on a host without direct access to the NASA Earthdata Cloud will give very poor performance. This is not a problem with “the cloud” or with earthaccess; it has to do with the data format and may soon be resolved.

Let’s continue with downloading the list of granules!

directory = pathlib.Path("L1B")
directory.mkdir(exist_ok=True)
paths = earthaccess.download(results, directory)
 Getting 2 granules, approx download size: 3.47 GB
Accessing cloud dataset using dataset endpoint credentials: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials
Downloaded: L1B/PACE_OCI.20240501T165311.L1B.nc
Downloaded: L1B/PACE_OCI.20240501T165811.L1B.nc

The paths list now contains paths to actual files on the local filesystem.

paths
[PosixPath('L1B/PACE_OCI.20240501T165311.L1B.nc'),
 PosixPath('L1B/PACE_OCI.20240501T165811.L1B.nc')]
Anywhere in any of these notebooks where paths = earthaccess.open(...) is used to read data directly from the NASA Earthdata Cloud, you need to substitute paths = earthaccess.download(..., local_path) before running the notebook on a local host or a remote host that does not have direct access to the NASA Earthdata Cloud.
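
As a concrete sketch of that substitution, the snippet below switches between the two calls with a single flag; the in_earthdata_cloud variable is hypothetical and must be set according to where your notebook is running.

# Hypothetical flag: True only on a host with direct access to the
# NASA Earthdata Cloud; False on a laptop or an ordinary remote host.
in_earthdata_cloud = False

if in_earthdata_cloud:
    paths = earthaccess.open(results)
else:
    paths = earthaccess.download(results, directory)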