Getting started#
fcollections is a library for reading collections of files. Its
primary goal is to combine the selection, reading and concatenation of files
within a common model.
Let’s set up a minimal case with stub data for the SWOT altimetry mission.
import tempfile
import numpy as np
import xarray as xr
# Create stub data
path = tempfile.mkdtemp()
ds = xr.Dataset(data_vars={
"ssha": (('num_lines', 'num_pixels'), np.random.random((9860, 69))),
"swh": (('num_lines', 'num_pixels'), np.random.random((9860, 69))),})
ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Expert_001_011_20240101T000000_20240101T030000_PGC0_01.nc')
ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Expert_001_012_20240101T030000_20240101T060000_PGC0_01.nc')
Implementations#
When faced with a collection of files, the first step is to check whether an implementation matching the data already exists. Such an implementation may be found in the catalog.
From the catalog, we can see that NetcdfFilesDatabaseSwotLRL2
matches our file names. If no implementation is available, users can
build their own by following the creation procedure.
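To make "matches our file names" concrete, here is a minimal sketch of how a file name convention can be decoded with a regular expression. The pattern `SWOT_PATTERN` and the helper `parse_name` are hypothetical and inferred only from the two stub file names above; they are not the pattern shipped with NetcdfFilesDatabaseSwotLRL2.

```python
import re
from datetime import datetime

# Hypothetical pattern inferred from the stub file names; an assumption,
# not the regex used by fcollections.
SWOT_PATTERN = re.compile(
    r"SWOT_L2_LR_SSH_(?P<subset>[A-Za-z]+)_"
    r"(?P<cycle_number>\d{3})_(?P<pass_number>\d{3})_"
    r"(?P<start>\d{8}T\d{6})_(?P<end>\d{8}T\d{6})_"
    r"(?P<version>[A-Z0-9]+_\d+)\.nc"
)


def parse_name(filename):
    """Return the fields encoded in a file name, or None if it does not match."""
    match = SWOT_PATTERN.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are zero-padded in the name; convert them to integers.
    fields["cycle_number"] = int(fields["cycle_number"])
    fields["pass_number"] = int(fields["pass_number"])
    for key in ("start", "end"):
        fields[key] = datetime.strptime(fields[key], "%Y%m%dT%H%M%S")
    return fields


fields = parse_name(
    "SWOT_L2_LR_SSH_Expert_001_011_20240101T000000_20240101T030000_PGC0_01.nc"
)
print(fields["subset"], fields["cycle_number"], fields["pass_number"], fields["version"])
```

A file name that does not follow the convention makes `parse_name` return `None`, which is one simple way for a collection to ignore unrelated files.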
Listing files#
An implementation is used by simply giving it the path to the data. An important endpoint of the implementation is the ability to list the files matching given criteria:
from fcollections.implementations import NetcdfFilesDatabaseSwotLRL2
fc = NetcdfFilesDatabaseSwotLRL2(path)
fc.list_files(cycle_number=1)
| | cycle_number | pass_number | time | subset | version | filename |
|---|---|---|---|---|---|---|
| 0 | 1 | 11 | [2024-01-01T00:00:00.000000, 2024-01-01T03:00:... | ProductSubset.Expert | PGC0_01 | /tmp/tmpuc3c7es_/SWOT_L2_LR_SSH_Expert_001_011... |
| 1 | 1 | 12 | [2024-01-01T03:00:00.000000, 2024-01-01T06:00:... | ProductSubset.Expert | PGC0_01 | /tmp/tmpuc3c7es_/SWOT_L2_LR_SSH_Expert_001_012... |
Listing files with filters is the first step toward subsetting the file set.
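The idea of filtering a listing can be sketched in plain Python. The function `list_matching` and the `records` list below are hypothetical stand-ins for the table above, and the sketch only handles equality tests; the filters of a real implementation may be richer.

```python
def list_matching(records, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Stand-in for the listing table: one dictionary per stub file.
records = [
    {"cycle_number": 1, "pass_number": 11, "subset": "Expert", "version": "PGC0_01"},
    {"cycle_number": 1, "pass_number": 12, "subset": "Expert", "version": "PGC0_01"},
]

# Both stub files belong to cycle 1, only one of them to pass 12.
print(len(list_matching(records, cycle_number=1)))
print(list_matching(records, cycle_number=1, pass_number=12))
```

Combining several keyword arguments narrows the selection, which is the same intuition behind passing both `cycle_number` and `pass_number` to a query.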
Query data#
Another important endpoint is the ability to read the file contents using the
query method.
fc.query()
<xarray.Dataset> Size: 22MB
Dimensions: (num_lines: 19720, num_pixels: 69)
Dimensions without coordinates: num_lines, num_pixels
Data variables:
ssha (num_lines, num_pixels) float64 11MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
swh (num_lines, num_pixels) float64 11MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
cycle_number (num_lines) uint16 39kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
pass_number (num_lines) uint16 39kB 11 11 11 11 11 11 ... 12 12 12 12 12
The method returns an xarray.Dataset containing the combined data for
all files matching the regex specified by the implementation.
It is possible to load only a subset of the data by applying filters in the
query. For example, giving the cycle_number and pass_number arguments
will select one half orbit of our altimetry mission.
fc.query(cycle_number=1, pass_number=11)
<xarray.Dataset> Size: 11MB
Dimensions: (num_lines: 9860, num_pixels: 69)
Dimensions without coordinates: num_lines, num_pixels
Data variables:
ssha (num_lines, num_pixels) float64 5MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
swh (num_lines, num_pixels) float64 5MB dask.array<chunksize=(9860, 69), meta=np.ndarray>
cycle_number (num_lines) uint16 20kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
pass_number (num_lines) uint16 20kB 11 11 11 11 11 11 ... 11 11 11 11 11
Variable selection is also available to return only part of the data:
ds = fc.query(selected_variables=['ssha'])
list(ds.variables)
['ssha', 'cycle_number', 'pass_number']
Filter types#
Each implementation has its own filters. In order of availability, the user should consult:
- The Query overview section of the implementation’s Documentation (see the catalog)
- The API documentation of the implementation’s method (see the catalog)
- The prompted help displayed in a Jupyter notebook or Python interpreter
fc.query?
Filter values#
The possible values for a given filter can be displayed:
fc.filter_values('version')
{PGC0_01}
Only filters whose information is encoded in the intermediate folders can be scanned quickly; the others trigger a full scan. For optimal performance, call this method with layouts enabled, with the files organized in folders (see the advanced section), and on filters whose information is encoded in those folders.
If you are working on a small set of data, it is safe to ignore this warning:
from fcollections.core import PerformanceWarning
import warnings
warnings.simplefilter("ignore", PerformanceWarning)
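The filter above applies to the whole session. A minimal sketch of a narrower alternative, using the standard `warnings.catch_warnings()` context manager so the suppression only lasts for one block. `StubPerformanceWarning` and `slow_scan` are hypothetical stand-ins so the example runs without fcollections; with the library installed, the imported PerformanceWarning would take the place of the stub class.

```python
import warnings


class StubPerformanceWarning(UserWarning):
    """Hypothetical stand-in for fcollections.core.PerformanceWarning."""


def slow_scan():
    # Mimics a call that warns about a full scan before returning its result.
    warnings.warn("full scan triggered", StubPerformanceWarning)
    return {"PGC0_01"}


# record=True collects the warnings that get through, to make the effect visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only the performance category; other warnings still surface.
    warnings.simplefilter("ignore", StubPerformanceWarning)
    values = slow_scan()
    warnings.warn("unrelated", DeprecationWarning)

print(values)
print([w.category.__name__ for w in caught])
```

Once the `with` block exits, the previous warning filters are restored, so later calls report performance warnings again.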
Access metadata#
The database can display information about the variables and attributes
contained in the file collection using the variables_info method:
# Display the dimensions, variables and attributes of the collection
fc.variables_info()
Group: /
Dimensions
| num_lines | 9860 |
| num_pixels | 69 |
Variables
ssha
| name | ssha |
| dtype | float64 |
| dimensions | ('num_lines', 'num_pixels') |
| _FillValue | nan |
swh
| name | swh |
| dtype | float64 |
| dimensions | ('num_lines', 'num_pixels') |
| _FillValue | nan |
Attributes
The result is a simple collapsible tree view, with as many levels of nesting as the data you manipulate requires.
Subsets#
Errors on mixed subsets#
To return consistent results, most methods must work on a homogeneous subset of data. If multiple subsets are mixed (for example Expert and Unsmoothed datasets), filters matching the partitioning keys must be given. If these filters are missing, an error listing the possible choices is raised.
# Create an Unsmoothed file; this mixes the Expert and Unsmoothed datasets
ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Unsmoothed_001_012_20240101T030000_20240101T060000_PGC0_01.nc')
# This will not work because we don't know if we need to display Expert or
# Unsmoothed metadata
fc.variables_info()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[10], line 6
2 ds.to_netcdf(f'{path}/SWOT_L2_LR_SSH_Unsmoothed_001_012_20240101T030000_20240101T060000_PGC0_01.nc')
3
4 # This will not work because we don't know if we need to display Expert or
5 # Unsmoothed metadata
----> 6 fc.variables_info()
File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:285, in _create_method.<locals>.wrapped(self, *args, **kwargs)
284 def wrapped(self, *args, **kwargs):
--> 285 return getattr(self, internal_name)(*args, **kwargs)
File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:791, in FilesDatabase._variables_info(self, **kwargs)
788 msg = f"{inspect.stack()[0][3]} got unexpected keyword arguments {unknown}"
789 raise TypeError(msg)
--> 791 df = self._files(**kwargs, unmix=True)
792 if len(df.filename) == 0:
793 warnings.warn('No files found with current filters "%s"' % kwargs)
File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:540, in FilesDatabase._files(self, sort, deduplicate, unmix, predicates, stat_fields, **kwargs)
524 postprocesses = map(
525 lambda item: item[1],
526 filter(
(...) 536 ),
537 )
539 for postprocess in postprocesses:
--> 540 df = postprocess(df)
542 return df
File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:1205, in SubsetsUnmixer.__call__(self, df)
1196 # Pick one subset using panda duplicate handling
1197 subsets = [
1198 (
1199 dict(zip(grouping_keys, group))
(...) 1203 for group in df_grouped.groups
1204 ]
-> 1205 subset = self.pick_subset(subsets)
1207 group_name = tuple(subset[k] for k in grouping_keys)
1208 group_name = group_name if len(subset) > 1 else group_name[0]
File ~/work/fcollections/fcollections/src/fcollections/core/_filesdb.py:1155, in SubsetsUnmixer.pick_subset(self, subsets, **subset_filters)
1146 ambiguity = {
1147 key: df_subsets[key].unique().tolist()
1148 for key in manual_pick
1149 if len(df_subsets[key].unique()) > 1
1150 }
1151 msg = (
1152 "Subsets could not be unmixed, the following keys are "
1153 f"duplicated and should be fixed manually: {ambiguity}"
1154 )
-> 1155 raise ValueError(msg)
1157 subset = df_subsets.iloc[-1].to_dict()
1158 logger.info("Picked subset %s", subset)
ValueError: Subsets could not be unmixed, the following keys are duplicated and should be fixed manually: {'subset': [<ProductSubset.Unsmoothed: 4>, <ProductSubset.Expert: 2>]}
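The check behind this error can be illustrated with a simplified sketch: group the file records by partitioning key and refuse to continue when a key takes more than one value. The functions `find_ambiguity` and `require_single_subset` and the `records` list are hypothetical; this is an illustration of the idea, not the code of SubsetsUnmixer.

```python
def find_ambiguity(records, keys):
    """Map each partitioning key to its values when more than one is present."""
    ambiguity = {}
    for key in keys:
        values = sorted({record[key] for record in records})
        if len(values) > 1:
            ambiguity[key] = values
    return ambiguity


def require_single_subset(records, keys=("subset",)):
    """Return the records unchanged, or raise ValueError if subsets are mixed."""
    ambiguity = find_ambiguity(records, keys)
    if ambiguity:
        raise ValueError(f"Subsets are mixed, filter on: {ambiguity}")
    return records


# Stand-in records for the three stub files written so far.
records = [
    {"subset": "Expert", "pass_number": 11},
    {"subset": "Expert", "pass_number": 12},
    {"subset": "Unsmoothed", "pass_number": 12},
]

try:
    require_single_subset(records)
except ValueError as error:
    print(error)

# Filtering on the ambiguous key first makes the check pass.
expert_only = [r for r in records if r["subset"] == "Expert"]
print(len(require_single_subset(expert_only)))
```

The error message carries the ambiguous keys and their values, which tells the caller exactly which filter to add.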
Compatibility matrix#
The following table summarizes which methods can work on mixed data. Most methods need homogeneous data and will require filtering the subset.
| Method | Works on mixed data? |
|---|---|
| | Yes |
| | No |
| | No |
| | No |
| | No |
Listing subsets#
Subsets that are on the file system can be listed using the
subsets property.
fc.subsets
[{'subset': <ProductSubset.Unsmoothed: 4>},
{'subset': <ProductSubset.Expert: 2>}]
One of the returned choices must be selected and used as a filter to work on a homogeneous dataset.
# Use the enumeration name for filtering a specific subset
fc.variables_info(subset='Expert')
Group: /
Dimensions
| num_lines | 9860 |
| num_pixels | 69 |
Variables
ssha
| name | ssha |
| dtype | float64 |
| dimensions | ('num_lines', 'num_pixels') |
| _FillValue | nan |
swh
| name | swh |
| dtype | float64 |
| dimensions | ('num_lines', 'num_pixels') |
| _FillValue | nan |
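The call above filters with the enumeration name `'Expert'` rather than with the enumeration member. A minimal sketch of how such a name can be resolved with the standard `enum` module; `StubProductSubset` and `resolve` are hypothetical, the two values are copied from the repr shown in the subsets listing, and whether fcollections resolves the string this way is an assumption.

```python
from enum import Enum


class StubProductSubset(Enum):
    """Hypothetical stand-in; the two values are copied from the repr above."""

    Expert = 2
    Unsmoothed = 4


def resolve(subset):
    """Accept either a member or its name and return the member."""
    if isinstance(subset, StubProductSubset):
        return subset
    # Indexing an Enum class by name raises KeyError for unknown names.
    return StubProductSubset[subset]


print(resolve("Expert"))
print(resolve(StubProductSubset.Unsmoothed).name)
```

An unknown name such as `"Basic"` raises a `KeyError` in this sketch, since the stub only declares the two subsets seen in this guide.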