Getting started#

fcollections is a library for reading collections of files. Its primary goal is to combine the selection, reading, and concatenation of files within a common model.

Let’s set up a minimal case with stub data for the SWOT altimetry mission.

import tempfile
import numpy as np
import xarray as xr

# Create stub data
path = tempfile.mkdtemp()
ds = xr.Dataset(
    data_vars={
        "ssha": (("num_lines", "num_pixels"), np.random.random((9860, 69))),
        "swh": (("num_lines", "num_pixels"), np.random.random((9860, 69))),
    }
)
ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Expert_001_011_20240101T000000_20240101T030000_PGC0_01.nc')
ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Expert_001_012_20240101T030000_20240101T060000_PGC0_01.nc')
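
The stub file names carry the information that is later used for filtering: subset, cycle number, pass number, time range, and version. As an illustration only, the sketch below extracts these fields with a regular expression. The pattern and the `parse_name` helper are assumptions written for this example; they are not the convention used internally by the library.

```python
import re
from datetime import datetime

# Hypothetical pattern written for this example; the library's own
# file name convention may differ.
PATTERN = re.compile(
    r"SWOT_L2_LR_SSH_(?P<subset>[A-Za-z]+)_"
    r"(?P<cycle_number>\d{3})_(?P<pass_number>\d{3})_"
    r"(?P<start>\d{8}T\d{6})_(?P<end>\d{8}T\d{6})_"
    r"(?P<version>P[A-Z0-9]{3}_\d{2})\.nc"
)


def parse_name(filename):
    """Return the fields encoded in a file name, or None if it does not match."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are zero-padded in the name
    fields["cycle_number"] = int(fields["cycle_number"])
    fields["pass_number"] = int(fields["pass_number"])
    # Time bounds use a compact ISO-like format
    for key in ("start", "end"):
        fields[key] = datetime.strptime(fields[key], "%Y%m%dT%H%M%S")
    return fields


fields = parse_name(
    "SWOT_L2_LR_SSH_Expert_001_011_20240101T000000_20240101T030000_PGC0_01.nc"
)
print(fields)
```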

Implementations#

When confronted with a collection of files, the first step is to check whether an implementation matching the data already exists. Available implementations are listed in the catalog.

From the catalog, we can see that NetcdfFilesDatabaseSwotLRL2 matches our file names. If no implementation is available, users can build their own by following the creation procedure.

Listing files#

An implementation can be used by simply giving it the path to the data. An important endpoint of the implementation is the ability to list the files matching given criteria:

from fcollections.implementations import NetcdfFilesDatabaseSwotLRL2

fc = NetcdfFilesDatabaseSwotLRL2(path)
fc.list_files(cycle_number=1)
cycle_number pass_number time subset version filename
0 1 11 [2024-01-01T00:00:00.000000, 2024-01-01T03:00:... ProductSubset.Expert PGC0_01 /tmp/tmpuc3c7es_/SWOT_L2_LR_SSH_Expert_001_011...
1 1 12 [2024-01-01T03:00:00.000000, 2024-01-01T06:00:... ProductSubset.Expert PGC0_01 /tmp/tmpuc3c7es_/SWOT_L2_LR_SSH_Expert_001_012...

Listing files using filters is the first step toward subsetting the set of files.
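
To make the idea of listing by criteria concrete, here is a minimal sketch that scans a folder and keeps the files whose encoded fields equal the requested values. The `list_matching` helper, the simplified pattern, and the third file name (cycle 2) are made up for this example; the library's `list_files` is not implemented this way as far as this page states.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical, simplified pattern: only the fields needed for the example
PATTERN = re.compile(
    r"SWOT_L2_LR_SSH_(?P<subset>[A-Za-z]+)_"
    r"(?P<cycle_number>\d{3})_(?P<pass_number>\d{3})_.*\.nc"
)


def list_matching(folder, **criteria):
    """Return the sorted names of files whose fields equal the criteria."""
    selected = []
    for file in sorted(Path(folder).iterdir()):
        match = PATTERN.fullmatch(file.name)
        if match is None:
            continue  # not part of the collection
        fields = {
            "subset": match["subset"],
            "cycle_number": int(match["cycle_number"]),
            "pass_number": int(match["pass_number"]),
        }
        if all(fields[key] == value for key, value in criteria.items()):
            selected.append(file.name)
    return selected


# Empty stand-in files are enough because only the names are inspected
folder = tempfile.mkdtemp()
for name in (
    "SWOT_L2_LR_SSH_Expert_001_011_20240101T000000_20240101T030000_PGC0_01.nc",
    "SWOT_L2_LR_SSH_Expert_001_012_20240101T030000_20240101T060000_PGC0_01.nc",
    "SWOT_L2_LR_SSH_Expert_002_011_20240122T000000_20240122T030000_PGC0_01.nc",
):
    Path(folder, name).touch()

cycle_one = list_matching(folder, cycle_number=1)
single_pass = list_matching(folder, cycle_number=1, pass_number=12)
print(cycle_one)
print(single_pass)
```

Adding a criterion narrows the selection, which is the same behaviour the filters provide when they are later passed to the query method.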

Query data#

Another important endpoint is the ability to read the file contents using the query method.

fc.query()
<xarray.Dataset> Size: 22MB
Dimensions:       (num_lines: 19720, num_pixels: 69)
Dimensions without coordinates: num_lines, num_pixels
Data variables:
    ssha          (num_lines, num_pixels) float64 11MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
    swh           (num_lines, num_pixels) float64 11MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
    cycle_number  (num_lines) uint16 39kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    pass_number   (num_lines) uint16 39kB 11 11 11 11 11 11 ... 12 12 12 12 12

The method returns an xarray.Dataset containing the combined data of all files matching the regex specified by the implementation.
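
The output above has 19720 lines because the two files of 9860 lines each are joined along num_lines, and the cycle and pass numbers taken from the file names are repeated once per line. The NumPy sketch below reproduces only this shape arithmetic under those assumptions; it is not the library's implementation, which returns lazily loaded dask-backed variables.

```python
import numpy as np

# Stand-ins for the contents of two files, keyed by (cycle, pass):
# each array has the shape (num_lines, num_pixels)
rng = np.random.default_rng(0)
per_file = {
    (1, 11): rng.random((9860, 69)),
    (1, 12): rng.random((9860, 69)),
}

# Join the files along num_lines (axis 0)
ssha = np.concatenate(list(per_file.values()), axis=0)

# Repeat each file's cycle and pass numbers once per line
cycle_number = np.concatenate(
    [np.full(data.shape[0], cycle, dtype=np.uint16)
     for (cycle, _), data in per_file.items()]
)
pass_number = np.concatenate(
    [np.full(data.shape[0], pass_, dtype=np.uint16)
     for (_, pass_), data in per_file.items()]
)

print(ssha.shape, cycle_number.shape, pass_number.shape)
```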

It is possible to load only a subset of the data by applying filters in the query. For example, giving the cycle_number and pass_number arguments selects one half orbit of our altimetry mission.

fc.query(cycle_number=1, pass_number=11)
<xarray.Dataset> Size: 11MB
Dimensions:       (num_lines: 9860, num_pixels: 69)
Dimensions without coordinates: num_lines, num_pixels
Data variables:
    ssha          (num_lines, num_pixels) float64 5MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
    swh           (num_lines, num_pixels) float64 5MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
    cycle_number  (num_lines) uint16 20kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    pass_number   (num_lines) uint16 20kB 11 11 11 11 11 11 ... 11 11 11 11 11

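Variable selection is also available to return only part of the data. The cycle_number and pass_number variables derived from the file names are still included in the result: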

ds = fc.query(selected_variables=['ssha'])
list(ds.variables)
['ssha', 'cycle_number', 'pass_number']

Filter types#

Each implementation has its own filters. By order of availability, the user should consult:

  • The Query overview section of the implementation’s Documentation (see the catalog)

  • The API documentation of the implementation’s method (see the catalog)

  • The interactive help displayed in a Jupyter notebook or IPython interpreter

fc.query?
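
The trailing `?` is IPython and Jupyter syntax. In a plain Python interpreter, the built-in `help` function and the standard `inspect` module give similar information. The sketch below applies them to a hypothetical stand-in function, since the exact signature of `fc.query` is not shown on this page; whether the real method exposes each filter as a named parameter depends on how the library builds its methods.

```python
import inspect


def query(cycle_number=None, pass_number=None, selected_variables=None):
    """Hypothetical stand-in for a method accepting filters as keywords."""


# Names of the accepted keyword arguments
filters = list(inspect.signature(query).parameters)
print(filters)

# Docstring of the function, also part of what help(query) displays
print(inspect.getdoc(query))
```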

Filter values#

The possible values of a given filter can be displayed:

fc.filter_values('version')
{PGC0_01}

Only filters whose information is encoded in the intermediate folders can be scanned quickly; other filters trigger a full scan. To ensure optimal performance, call this method with layouts enabled, with files organized in folders (see the advanced section), and on filters whose information is encoded in the folder names.

If you are working on a small set of data, it is safe to ignore the resulting PerformanceWarning:

from fcollections.core import PerformanceWarning
import warnings
warnings.simplefilter("ignore", PerformanceWarning)
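
The filter above stays active for the rest of the session. To silence the warning for a single call only, the standard `warnings.catch_warnings` context manager can be used instead. The sketch below uses a locally defined stand-in class and a hypothetical `full_scan` function so that it runs without the library; with fcollections installed, the class imported above would take the place of the stand-in, assuming it is a regular `Warning` subclass.

```python
import warnings


class PerformanceWarning(UserWarning):
    """Stand-in for fcollections.core.PerformanceWarning (assumption)."""


def full_scan():
    """Hypothetical slow call that emits the warning."""
    warnings.warn("full scan triggered", PerformanceWarning)
    return {"PGC0_01"}


with warnings.catch_warnings():
    # The filter only applies inside this block
    warnings.simplefilter("ignore", PerformanceWarning)
    values = full_scan()

# The previous warning filters are restored once the block exits
print(values)
```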

Access metadata#

The database can display information about the variables and attributes contained in the file collection using the variables_info method:

# Display the dimensions, variables and attributes found in the files
fc.variables_info()
Group: /
Dimensions
    num_lines: 9860
    num_pixels: 69
Variables
    ssha
        name: ssha
        dtype: float64
        dimensions: ('num_lines', 'num_pixels')
        _FillValue: nan
    swh
        name: swh
        dtype: float64
        dimensions: ('num_lines', 'num_pixels')
        _FillValue: nan
Attributes

In a notebook, the result is rendered as a simple collapsible tree view whose number of nesting levels depends on the data you manipulate.

Subsets#

Errors on mixed subsets#

To return consistent results, most methods must work on a homogeneous subset of data. When multiple subsets are mixed (for example Expert and Unsmoothed datasets), filters matching the partitioning keys must be given. If these filters are missing, an error listing the possible choices is raised.

# Create an Unsmoothed file; the collection now mixes Expert and Unsmoothed datasets
ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Unsmoothed_001_012_20240101T030000_20240101T060000_PGC0_01.nc')

# This will not work because we don't know if we need to display Expert or
# Unsmoothed metadata
fc.variables_info()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 6
      2 ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Unsmoothed_001_012_20240101T030000_20240101T060000_PGC0_01.nc')
      3 
      4 # This will not work because we don't know if we need to display Expert or
      5 # Unsmoothed metadata
----> 6 fc.variables_info()

File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:285, in _create_method.<locals>.wrapped(self, *args, **kwargs)
    284 def wrapped(self, *args, **kwargs):
--> 285     return getattr(self, internal_name)(*args, **kwargs)

File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:791, in FilesDatabase._variables_info(self, **kwargs)
    788     msg = f"{inspect.stack()[0][3]} got unexpected keyword arguments {unknown}"
    789     raise TypeError(msg)
--> 791 df = self._files(**kwargs, unmix=True)
    792 if len(df.filename) == 0:
    793     warnings.warn('No files found with current filters "%s"' % kwargs)

File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:540, in FilesDatabase._files(self, sort, deduplicate, unmix, predicates, stat_fields, **kwargs)
    524 postprocesses = map(
    525     lambda item: item[1],
    526     filter(
   (...)    536     ),
    537 )
    539 for postprocess in postprocesses:
--> 540     df = postprocess(df)
    542 return df

File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:1205, in SubsetsUnmixer.__call__(self, df)
   1196 # Pick one subset using panda duplicate handling
   1197 subsets = [
   1198     (
   1199         dict(zip(grouping_keys, group))
   (...)   1203     for group in df_grouped.groups
   1204 ]
-> 1205 subset = self.pick_subset(subsets)
   1207 group_name = tuple(subset[k] for k in grouping_keys)
   1208 group_name = group_name if len(subset) > 1 else group_name[0]

File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:1155, in SubsetsUnmixer.pick_subset(self, subsets, **subset_filters)
   1146         ambiguity = {
   1147             key: df_subsets[key].unique().tolist()
   1148             for key in manual_pick
   1149             if len(df_subsets[key].unique()) > 1
   1150         }
   1151         msg = (
   1152             "Subsets could not be unmixed, the following keys are "
   1153             f"duplicated and should be fixed manually: {ambiguity}"
   1154         )
-> 1155         raise ValueError(msg)
   1157 subset = df_subsets.iloc[-1].to_dict()
   1158 logger.info("Picked subset %s", subset)

ValueError: Subsets could not be unmixed, the following keys are duplicated and should be fixed manually: {'subset': [<ProductSubset.Unsmoothed: 4>, <ProductSubset.Expert: 2>]}
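
The check behind this error can be pictured with the simplified sketch below: files are filtered, the distinct subsets that remain are collected, and an error is raised when more than one is left. The `pick_subset` function and the file records are hypothetical and written for this example; they do not reproduce the library's SubsetsUnmixer.

```python
def pick_subset(files, **filters):
    """Return the single subset left after filtering, or raise if ambiguous."""
    remaining = [
        record for record in files
        if all(record[key] == value for key, value in filters.items())
    ]
    subsets = sorted({record["subset"] for record in remaining})
    if len(subsets) > 1:
        raise ValueError(
            f"Subsets could not be unmixed, choose one of: {subsets}"
        )
    return subsets[0] if subsets else None


# Hypothetical file records mixing two subsets
files = [
    {"subset": "Expert", "pass_number": 11},
    {"subset": "Expert", "pass_number": 12},
    {"subset": "Unsmoothed", "pass_number": 12},
]

try:
    pick_subset(files)
except ValueError as error:
    message = str(error)
    print(message)

# Giving the partitioning key removes the ambiguity
chosen = pick_subset(files, subset="Expert")
print(chosen)
```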

Compatibility matrix#

The following table summarizes which methods can work on mixed data. Most methods need homogeneous data and will require filtering the subset.

Method            Works on mixed data?
list_files        Yes
variables_info    No
filter_values     No
query             No
map               No

Listing subsets#

The subsets present on the file system can be listed using the subsets property:

fc.subsets
[{'subset': <ProductSubset.Unsmoothed: 4>},
 {'subset': <ProductSubset.Expert: 2>}]

One of the returned choices must be selected and used as a filter to work on a homogeneous dataset.

# Use the enumeration name for filtering a specific subset
fc.variables_info(subset='Expert')
Group: /
Dimensions
    num_lines: 9860
    num_pixels: 69
Variables
    ssha
        name: ssha
        dtype: float64
        dimensions: ('num_lines', 'num_pixels')
        _FillValue: nan
    swh
        name: swh
        dtype: float64
        dimensions: ('num_lines', 'num_pixels')
        _FillValue: nan
Attributes
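
Since the subsets listing contains enumeration members while the filter is given by name, the conversion between the two can be scripted. The sketch below uses a stand-in `ProductSubset` enumeration that reproduces only the two members and values printed above; the real enumeration belongs to the library and may hold more members.

```python
from enum import Enum


class ProductSubset(Enum):
    """Stand-in enumeration with the two members shown in the listing."""
    Expert = 2
    Unsmoothed = 4


# Same shape as the listing returned by the subsets property
subsets = [
    {"subset": ProductSubset.Unsmoothed},
    {"subset": ProductSubset.Expert},
]

# Replace each enumeration member by its name to obtain keyword filters
filters = [
    {key: member.name for key, member in subset.items()}
    for subset in subsets
]
print(filters)
```

Each resulting dictionary can be unpacked as keyword arguments; for instance, `fc.variables_info(**filters[1])` is the same call as the one with `subset='Expert'` shown above.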