Object Storage and Cloud Optimized Datasets

Guillaume Eynard-Bontemps, Hugues Larat, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2024-08-01

Object Storage

Object Storage concepts

Object storage is a computer data storage architecture that manages data as objects, as opposed to other storage architectures like file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier. (Wikipedia)
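
To make the definition concrete, here is a minimal in-memory sketch of the three parts of an object (data, metadata, unique identifier) and of the flat key namespace. The names `StoredObject` and `ToyObjectStore` are hypothetical and chosen for illustration only; real object stores add distribution, replication and access control on top of this idea.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: the payload, free-form metadata, and a unique identifier."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ToyObjectStore:
    """Hypothetical in-memory store with a flat key namespace (no directories)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Whole-object write: an existing key is replaced, never edited in place.
        metadata["checksum"] = hashlib.sha256(data).hexdigest()
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[key] = obj
        return obj.object_id

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # "Folders" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("images/2024/a.jpg", b"\xff\xd8...", content_type="image/jpeg")
store.put("images/2024/b.jpg", b"\xff\xd8...", content_type="image/jpeg")
store.put("logs/run.txt", b"ok", content_type="text/plain")
print(store.list(prefix="images/"))
```

Note that there is no rename, append or partial update: the only operations are writing, reading and listing whole objects by key, which is a large part of what makes the model easy to scale out.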

But Why?

  • Scalability. Scale out, in principle without limit.
  • Security/Reliability/Availability. Erasure coding.
  • Cost. Commodity hardware.
  • Performance (bandwidth). Just needs a good network, and scale.
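
The erasure coding idea mentioned above can be sketched with the simplest possible code: k data fragments plus one XOR parity fragment, each stored on a different server. This is an illustrative assumption, not what any particular product does; production systems typically use Reed-Solomon style codes with several coding fragments so that more than one loss can be tolerated. The function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments, lost_index):
    """Rebuild one lost fragment by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(fragments) if i != lost_index]
    return reduce(xor_bytes, survivors)


payload = b"object storage!!"       # 16 bytes, split into 4 data fragments
fragments = encode(payload, k=4)    # 4 data + 1 parity, one per server
lost = 2                            # pretend the server holding fragment 2 died
rebuilt = recover(fragments, lost)
print(rebuilt == fragments[lost])
```

With k = 4 the storage overhead is 25 %, compared with 200 % for keeping three full replicas, which is why erasure coding helps with both reliability and cost.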

POSIX vs Object Storage

Object store Architecture

Ceph @CNES

Quiz

What makes object storage efficient? (multiple answers possible)

  • Answer A: The use of high-end and complex hardware for storage solutions
  • Answer B: The multiplication of standard servers with JBOD storage
  • Answer C: High performance network
  • Answer D: POSIX API
  • Answer E: Dedicated API developed for simple transactions
Answer

Answer link Key: mz

Processing and Cloud Optimized datasets

Object storage interface and libraries

  • Just HTTP REST calls: GET, PUT
  • With authentication and authorization: access key and secret access key.
  • Easier with the AWS CLI or similar tools:
# Needs a credentials file in your $HOME directory (e.g. ~/.aws/credentials), or a dynamically obtained key
aws s3 cp /tmp/foo/ s3://bucket/ --recursive --exclude "*" --include "*.jpg"
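
To show what "just HTTP REST calls" means in practice, the sketch below starts a toy HTTP endpoint on localhost and talks to it with plain PUT and GET requests from the Python standard library. Everything here is an assumption made for illustration: `ToyObjectHandler` and `BLOBS` are hypothetical, and the authentication step (request signing with the access key and secret access key) that a real S3-like service requires is deliberately left out.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOBS = {}  # URL path -> bytes; stands in for the real storage backend


class ToyObjectHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # Store the request body under the URL path used as the object key.
        length = int(self.headers.get("Content-Length", 0))
        BLOBS[self.path] = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = BLOBS.get(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


server = HTTPServer(("127.0.0.1", 0), ToyObjectHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# Bypass any proxy configured in the environment for this local test.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# PUT uploads a whole object under a key...
req = urllib.request.Request(f"{base}/bucket/hello.txt", data=b"hello", method="PUT")
with opener.open(req) as resp:
    put_status = resp.status

# ...and GET downloads it again.
with opener.open(f"{base}/bucket/hello.txt") as resp:
    fetched = resp.read()

server.shutdown()
server.server_close()
print(put_status, fetched)
```

Tools such as the AWS CLI wrap exactly this kind of exchange, adding credential handling, retries and multi-part transfers.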

Interfaces and libraries exist for major programming languages.

Processing framework and Object store

All major data processing frameworks are compatible with S3-like interfaces.

Just replace the leading / or file:// with s3://.

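import dask.dataframe as dd

# Read CSV files from an S3 bucket (requires the s3fs package)
df = dd.read_csv('s3://bucket/path/to/data-*.csv')
# Same idea for Parquet files on Google Cloud Storage (requires the gcsfs package)
df = dd.read_parquet('gcs://bucket/path/to/data-*.parq')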

More on this in the next part of the course.

ARCO Data

Analysis Ready Cloud Optimized Data.

Thanks to Ryan Abernathey.

What is Analysis Ready?

  • Think in Datasets, not data files
  • No need for tedious homogenizing and cleaning steps
  • Curated and cataloged

Figure: How do data scientists spend their time? (CrowdFlower Data Science Report, 2016)

ARCO Data (2)

What is Cloud Optimized?

  • Compatible with object storage (access via HTTP)
  • Supports lazy access and intelligent subsetting
  • Integrates with high-level analysis libraries and distributed frameworks

Cloud Optimized GeoTIFF

  • Metadata at the start of the file only
  • Tiling instead of strips (chunks)
  • Compression
  • Overviews (zoom out)
  • Readable with HTTP range requests, and therefore from object storage!
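
The combination of tiling and range requests can be sketched as follows: once a reader has the tile index from the start of the file, it can work out which tiles cover a pixel window and fetch only those bytes. This is a simplified sketch under stated assumptions: the `tile_index` values are invented, and the helper names are hypothetical. In a real TIFF the offsets and lengths come from the TileOffsets and TileByteCounts tags.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return (tile_row, tile_col) indices intersecting a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def range_header(offset, length):
    """Format an HTTP Range header value; byte positions are inclusive."""
    return f"bytes={offset}-{offset + length - 1}"


# Hypothetical tile index, as a reader would obtain it from the header at the
# start of the file: (tile_row, tile_col) -> (byte offset, byte length).
tile_index = {
    (0, 0): (4096, 20000), (0, 1): (24096, 18000),
    (1, 0): (42096, 21000), (1, 1): (63096, 19000),
}

# A 100x100 pixel window starting at column 600, row 10 touches a single tile.
needed = tiles_for_window(col_off=600, row_off=10, width=100, height=100)
ranges = [range_header(*tile_index[t]) for t in needed]
print(needed, ranges)
```

Each entry of `ranges` would be sent as the `Range` header of a GET request, so only the tiles that intersect the window are transferred instead of the whole image.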

Zarr

  • Python library for storage of chunked, compressed N-dimensional arrays
  • Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo)
  • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
  • Can store arrays in memory, directories, zip files, or any object implementing the Python mutable mapping interface (such as a dictionary)
  • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage
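
The chunking idea above can be imitated with the standard library alone: split an array into fixed-size chunks, compress each one, and store it under its own key in a mutable mapping. This is a toy sketch of the principle, not Zarr's actual format or API; `write_chunked` and `read_slice` are hypothetical helpers, and the array is one-dimensional to keep the code short.

```python
import struct
import zlib


def write_chunked(store, name, values, chunk_len):
    """Split a 1-D list of ints into chunks, compress each, store by key."""
    n_chunks = -(-len(values) // chunk_len)  # ceiling division
    for i in range(n_chunks):
        chunk = values[i * chunk_len:(i + 1) * chunk_len]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # little-endian int64
        store[f"{name}/{i}"] = zlib.compress(raw)


def read_slice(store, name, start, stop, chunk_len):
    """Read values[start:stop], decompressing only the chunks that overlap.

    Assumes 0 <= start < stop <= number of stored values.
    """
    out, touched = [], []
    for i in range(start // chunk_len, (stop - 1) // chunk_len + 1):
        raw = zlib.decompress(store[f"{name}/{i}"])
        chunk = struct.unpack(f"<{len(raw) // 8}q", raw)
        lo = max(start - i * chunk_len, 0)
        hi = min(stop - i * chunk_len, len(chunk))
        out.extend(chunk[lo:hi])
        touched.append(i)
    return out, touched


store = {}  # any mutable mapping works: a dict here, a bucket wrapper elsewhere
write_chunked(store, "temperature", list(range(1000)), chunk_len=100)
values, touched = read_slice(store, "temperature", 250, 420, chunk_len=100)
print(len(store), touched)
```

Because every chunk is a separate key, replacing the dictionary with a mapping backed by an object store turns each chunk read into one independent GET, which is what makes this layout suit parallel, partial access.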

Parquet

  • Chunked binary file
  • Compressed
  • Metadata easily accessible

See more

Quiz

Object storage is always performant for scientific data processing.

Answer

Answer link Key: qp