Why oups?
Purpose
Targeting the management of ‘large-size’ collections of ordered datasets (more specifically time series), oups provides convenience classes and functions to ease their identification, creation, update, and loading. These datasets may contain different data (from different channels or feeds) or result from different processing of the same raw data.
oups most notably hides path management. By decorating a dataclass-like class with the @toplevel decorator, this class is turned into an index generator, providing the attributes and functions needed to derive paths to the related datasets. oups also provides an efficient update logic suited for ordered datasets (low memory footprint).
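As a minimal sketch (assuming oups's ParquetSet store class and the store path used here, neither of which is introduced in this section), a @toplevel-decorated class and the resulting write could look like:

```python
import pandas as pd
from oups import ParquetSet, toplevel

# Dataclass-like index definition: attribute values become path components.
@toplevel
class DatasetIndex:
    country: str
    city: str

idx = DatasetIndex('germany', 'berlin')
df = pd.DataFrame({'timestamp': pd.date_range('2021/01/01', periods=3, freq='1h'),
                   'temperature': [8.4, 5.3, 4.9]})

ps = ParquetSet('my_store', DatasetIndex)  # root directory of the collection
ps[idx] = df  # the dataset is written to a path derived from 'idx'
```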
Alternatives
Other libraries already exist to manage collections of datasets. Many I have not tested, for instance Arctic; one I have tested is pystore. Being based on Dask, pystore supports parallelized reading/writing out of the box. Its update logic can be reviewed in collection.py. Without elaborating on its possible performance issues, and focusing only on the applicability of this logic: the current procedure drops any duplicate rows except the last, duplicates being identified over all columns but not the index (the latter being necessarily a DatetimeIndex as per the pystore implementation). This hard-coded logic may not suit all dataflows.
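To make the consequence concrete, here is a toy pandas sketch (not pystore's actual code) of this keep-last deduplication over all columns; note how a legitimate row is silently dropped because another row happens to carry the same values:

```python
import pandas as pd

existing = pd.DataFrame({'price': [1.0, 2.0]},
                        index=pd.to_datetime(['2021-01-01', '2021-01-02']))
new = pd.DataFrame({'price': [2.0, 3.0]},
                   index=pd.to_datetime(['2021-01-03', '2021-01-04']))

combined = pd.concat([existing, new])
# Duplicates are identified over all columns, ignoring the DatetimeIndex,
# and only the last occurrence is kept.
deduped = combined[~combined.duplicated(keep='last')]
# The row at 2021-01-02 (price=2.0) is dropped: it carries the same column
# values as the row at 2021-01-03, even though its timestamp differs.
```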
In comparison, the current version of oups:
- is not based on Dask but directly on fastparquet; no parallelized reading/writing is possible;
- provides an efficient update function with user-defined logic for optionally dropping duplicates, as sketched below.
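As a hedged illustration of what user-defined duplicate handling means in practice (the function and parameter names below, such as ordered_on and duplicates_on, are assumptions made for this sketch, not a confirmed oups signature):

```python
import pandas as pd

# Hypothetical sketch: deduplication restricted to user-chosen columns,
# the kind of logic oups lets the user configure, instead of a dedup
# hard-coded over all columns as in pystore.
def update(existing, new, ordered_on, duplicates_on=None):
    combined = pd.concat([existing, new])
    if duplicates_on:  # the user decides which columns identify a duplicate
        combined = combined.drop_duplicates(subset=duplicates_on, keep='last')
    return combined.sort_values(ordered_on, kind='mergesort')  # stable sort
```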