Collection indexing
Motivation
Datasets are gathered in a parent directory as a collection. Each of them materializes as parquet files located in a child directory whose naming is derived from a user-defined index.
By formalizing this index as a dataclass-like class, index management (user scope) is decoupled from path management (oups scope).
Proposal
oups provides 2 class decorators for defining an indexing logic.
- @toplevel is compulsory, and defines the naming logic of the first directory level,
- @sublevel is optional, and can be used as many times as sub-directory levels are required.
By splitting indexes into different directory levels, related datasets can be gathered in common directories. A first level could for instance specify physical quantities in different places, and a second one could specify the sampling frequency of the measures.
Each of these levels is thus specified by a class. A class corresponding to a parent directory must embed the class of its sub-level as its last attribute.
Example
from oups import sublevel, toplevel

@sublevel
class Sampling:
    frequency: str

@toplevel
class Measure:
    quantity: str
    city: str
    sampling: Sampling

# Define different indexes for temperature in Berlin.
berlin_1D = Measure('temperature', 'berlin', Sampling('1D'))
berlin_1W = Measure('temperature', 'berlin', Sampling('1W'))

# Store data in a new collection
from os import path as os_path
import pandas as pd  # aliased as 'pd', the name used below
from oups import ParquetSet

dirpath = os_path.expanduser('~/Documents/code/data/weather_kbase')
ps = ParquetSet(dirpath, Measure)

dummy_data_1D = pd.DataFrame(
    {'timestamp': pd.date_range('2021/01/01', '2021/01/05', freq='1D'),
     'temperature': range(10, 15)})
dummy_data_1W = pd.DataFrame(
    {'timestamp': pd.date_range('2021/01/01', '2021/01/14', freq='1W'),
     'temperature': range(10, 12)})

# Each index instance is the key under which a dataset is stored.
ps[berlin_1D] = dummy_data_1D
ps[berlin_1W] = dummy_data_1W
Created folders and files are then organized as illustrated below.
data
|- weather_kbase
   |- temperature-berlin
      |- 1D
      |  |- _common_metadata
      |  |- _metadata
      |  |- part.0.parquet
      |
      |- 1W
         |- _common_metadata
         |- _metadata
         |- part.0.parquet
@toplevel
The @toplevel decorator provides attributes and functions which are used by a ParquetSet instance to
- generate paths from attribute values (__str__ and to_path methods),
- generate a class instance from a path (from_path classmethod).
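To make the two directions concrete, here is a minimal sketch of a path round trip written with the standard dataclasses module only. It is not oups's implementation: the helpers index_to_path and index_from_path, and the constants FIELDS_SEP and DIR_SEP, are hypothetical names chosen for illustration, and the '/' level separator is an assumption.

```python
from dataclasses import dataclass, fields, is_dataclass

FIELDS_SEP = "-"  # assumed separator between fields of one level
DIR_SEP = "/"     # assumed separator between directory levels


@dataclass(order=True, frozen=True)
class Sampling:
    frequency: str


@dataclass(order=True, frozen=True)
class Measure:
    quantity: str
    city: str
    sampling: Sampling


def index_to_path(index) -> str:
    """Join plain fields with FIELDS_SEP; a nested dataclass becomes a deeper level."""
    parts, sub_path = [], None
    for field in fields(index):
        value = getattr(index, field.name)
        if is_dataclass(value):
            sub_path = index_to_path(value)
        else:
            parts.append(str(value))
    level = FIELDS_SEP.join(parts)
    return level if sub_path is None else level + DIR_SEP + sub_path


def index_from_path(cls, path: str):
    """Rebuild an instance of `cls` from a path produced by `index_to_path`."""
    level, _, remainder = path.partition(DIR_SEP)
    raw_values = level.split(FIELDS_SEP)
    kwargs = {}
    for field in fields(cls):
        if is_dataclass(field.type):
            # The nested level is parsed from what follows the first DIR_SEP.
            kwargs[field.name] = index_from_path(field.type, remainder)
        else:
            # Cast the text back to the annotated type (int or str).
            kwargs[field.name] = field.type(raw_values.pop(0))
    return cls(**kwargs)


berlin_1D = Measure("temperature", "berlin", Sampling("1D"))
path = index_to_path(berlin_1D)               # 'temperature-berlin/1D'
restored = index_from_path(Measure, path)     # equal to berlin_1D
```

The parsing step only works if field values never contain the separators themselves, which is one reason to validate values at instantiation.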
It modifies the __init__ method of the decorated class so that attribute values are checked at instantiation; use of any forbidden character or combination raises the related exception.
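The idea of checking values at instantiation can be sketched with a plain dataclass. This sketch uses the standard __post_init__ hook rather than a modified __init__, and both the set of forbidden characters (FORBIDDEN_CHARS) and the ValueError raised are assumptions made for illustration, not the rules or exceptions oups actually uses.

```python
from dataclasses import dataclass

# Assumption: the field separator and the path separator are forbidden in values.
FORBIDDEN_CHARS = ("-", "/")


@dataclass(order=True, frozen=True)
class Measure:
    quantity: str
    city: str

    def __post_init__(self):
        # Called at the end of the generated __init__, so an invalid value
        # prevents the instance from being returned to the caller.
        for name, value in vars(self).items():
            for char in FORBIDDEN_CHARS:
                if char in str(value):
                    raise ValueError(
                        f"forbidden character {char!r} in field {name!r}: {value!r}"
                    )


valid = Measure("temperature", "berlin")

try:
    Measure("temperature", "new-york")
    rejected = False
except ValueError:
    rejected = True
```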
It also calls the @dataclass class decorator, with the order and frozen parameters set to True. With this setting, class instances with the same attribute values compare equal.
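The effect of these parameters can be observed with the standard dataclasses module alone. The sketch below does not use oups; it applies @dataclass(order=True, frozen=True) directly to two illustrative classes to show value-based equality, hashing (which is what allows instances to serve as dictionary-style keys), ordering, and immutability.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(order=True, frozen=True)
class Sampling:
    frequency: str


@dataclass(order=True, frozen=True)
class Measure:
    quantity: str
    city: str
    sampling: Sampling


a = Measure("temperature", "berlin", Sampling("1D"))
b = Measure("temperature", "berlin", Sampling("1D"))
c = Measure("temperature", "berlin", Sampling("1W"))

same_value = a == b                   # distinct objects, same field values
as_keys = {a: "daily", c: "weekly"}   # frozen instances with eq are hashable
lookup = as_keys[b]                   # b hashes and compares like a
ordered = sorted([c, a])              # order=True compares fields in declaration order

mutation_blocked = False
try:
    a.city = "paris"                  # frozen=True forbids attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
```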
Some other characteristics are:
- @toplevel accepts an optional fields_sep parameter to define the character separating fields (by default -). This separator applies to all levels.
- A decorated class can have any number of attributes (also named fields), but only of types int or str.
- If an attribute is a @sublevel-decorated class, it is necessarily positioned last.
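These constraints can be illustrated by a small decorator that takes an optional keyword and inspects the fields of the class it decorates. The name toplevel_sketch is hypothetical, storing the separator as a class attribute is an assumption, and the TypeError raised is an illustrative choice; this is a sketch of the general pattern, not the way oups implements @toplevel.

```python
from dataclasses import dataclass, fields, is_dataclass


def toplevel_sketch(cls=None, *, fields_sep="-"):
    """Hypothetical stand-in: usable bare or with a `fields_sep` keyword."""

    def wrap(klass):
        klass = dataclass(order=True, frozen=True)(klass)
        all_fields = fields(klass)
        for position, field in enumerate(all_fields):
            is_last = position == len(all_fields) - 1
            if is_dataclass(field.type):
                # A nested level is only accepted as the last attribute.
                if not is_last:
                    raise TypeError(f"sub-level field {field.name!r} must come last")
            elif field.type not in (int, str):
                raise TypeError(f"field {field.name!r} must be int or str")
        # Assumption: keep the separator on the class for later path building.
        klass.fields_sep = fields_sep
        return klass

    # Bare use passes the class directly; use with arguments returns the decorator.
    return wrap if cls is None else wrap(cls)


@toplevel_sketch
class DefaultSep:
    quantity: str
    year: int


@toplevel_sketch(fields_sep=".")
class DotSep:
    quantity: str
    city: str
```

Performing the checks at decoration time reports a misdeclared index class when the module is imported, rather than when the first dataset is written.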
@sublevel
Likewise,
- a decorated class can have any number of attributes, but only of types int or str.
- if yet another deeper sub-level is defined (using a @sublevel-decorated class), it necessarily has to be positioned as the last attribute.
@sublevel is here only an alias for @dataclass, with the order and frozen parameters set to True.