Collection indexing

Motivation

Datasets are gathered in a parent directory as a collection. Each of them materializes as parquet files located in a child directory whose naming is derived from a user-defined index.

By formalizing this index through a likewise dataclass, index management (user scope) is dissociated from path management (oups scope).

Proposal

oups provides 2 class decorators for defining an indexing logic.

  • @toplevel is compulsory, and defines naming logic of the first directory level,

  • @sublevel is optional, and can be used as many times as number of sub-directories are required

By splitting indexes into different directory levels, related datasets can be gathered in common directories. A first level could for instance specify physical quantities in different places, and a second one could specify the sampling frequency of the measures.

Each of these levels is thus specified by a class. Those corresponding to a parent directory necessarily embed as last attribute the sub-level-related class.

Example

from oups import sublevel, toplevel

@sublevel
class Sampling:
    frequency: str
@toplevel
class Measure:
    quantity: str
    city: str
    sampling: Sampling
# Define different indexes for temperature in Berlin.
berlin_1D = Measure('temperature', 'berlin', Sampling('1D'))
berlin_1W = Measure('temperature', 'berlin', Sampling('1W'))

# Store data in a new collection
from os import path as os_path
import pandas
from oups import ParquetSet

dirpath = os_path.expanduser('~/Documents/code/data/weather_kbase')
ps = ParquetSet(dirpath, Measure)
dummy_data_1D = pd.DataFrame(
                   {'timestamp':pd.date_range('2021/01/01', '2021/01/05', freq='1D'),
                    'temperature':range(10,15)})
dummy_data_1W = pd.DataFrame(
                   {'timestamp':pd.date_range('2021/01/01', '2021/01/14', freq='1W'),
                    'temperature':range(10,12)})
ps[berlin_1D] = dummy_data_1D
ps[berlin_1W] = dummy_data_1W

Created folders and files are then organized as illustrated below.

data
|- weather_kbase
   |- temperature-berlin
      |- 1D
      |  |- _common_metadata
      |  |- _metadata
      |  |- part.0.parquet
      |
      |- 1W
         |- _common_metadata
         |- _metadata
         |- part.0.parquet

@toplevel

@toplevel decorator provides attributes and functions which are used by a ParquetSet instance to

  • generate paths from attributes values (__str__ and to_path methods),

  • generate class instance (from_path classmethod)

It modifies the __init__ method of decorated class so that attributes values are checked at instantiation, and use of any forbidden character or combination raises related exception.

It also calls @dataclass class decorator, with order and frozen parameters set as True. This setting enables equality between class instances with same attributes values.

Some other characteristics are:

  • @toplevel accepts an optional fields_sep parameter to define the character separating fields (by default -). This separator applies to all levels.

  • Decorated class can have any number of attributes (also named fields), but only of types int or str.

  • If an attribute is a @sublevel-decorated class, it is necessarily positioned last.

@sublevel

Likewise,

  • decorated class can have any number of attributes, but only of types int or str.

  • if yet another deeper sub-level is defined (using a @sublevel-decorated class), it necessarily has to be positioned as last attribute.

@sublevel is here only an alias for @dataclass, with order and frozen parameters set as True.