Store Architecture
==================

The ``oups.store`` module provides the core functionality for managing collections of ordered parquet datasets. It consists of three main components working together to provide efficient storage, indexing, and querying of time-series data.

Overview
--------

The store architecture is designed around three key components:

1. **Indexer**: Provides a schema-based indexing system for organizing datasets
2. **OrderedParquetDataset**: Manages individual parquet datasets with ordering validation
3. **Store**: Provides a collection interface for multiple indexed datasets

Main Components
---------------

Indexer
~~~~~~~

The indexer system allows you to define hierarchical schemas for organizing your datasets, using dataclasses decorated with ``@toplevel`` and optionally ``@sublevel``. This provides a structured way to organize related datasets in common directories.

**Motivation**

Datasets are gathered in a parent directory as a collection. Each dataset materializes as parquet files located in a child directory whose name is derived from a user-defined index. Formalizing this index through dataclasses dissociates index management (user scope) from path management (*oups* scope).

**Decorators**

*oups* provides two class decorators for defining indexing logic:

- ``@toplevel`` is compulsory and defines the naming logic of the first directory level
- ``@sublevel`` is optional and can be used as many times as needed for sub-directories

**@toplevel Decorator**

The ``@toplevel`` decorator:

- Generates *paths* from attribute values (``__str__`` and ``to_path`` methods)
- Generates class instances (``from_path`` classmethod; see the round-trip sketch at the end of this subsection)
- Validates attribute values at instantiation
- Calls ``@dataclass`` with the ``order`` and ``frozen`` parameters set to ``True``
- Accepts an optional ``fields_sep`` parameter (default ``-``) to define the field separator
- Only accepts ``int`` or ``str`` attribute types
- If an attribute is a ``@sublevel``-decorated class, it must be positioned last

**@sublevel Decorator**

The ``@sublevel`` decorator:

- Is an alias for ``@dataclass`` with ``order`` and ``frozen`` set to ``True``
- Only accepts ``int`` or ``str`` attribute types
- If another, deeper sub-level is defined, it must be positioned as the last attribute

**Hierarchical Example**

.. code-block:: python

    from oups.store import toplevel, sublevel

    @sublevel
    class Sampling:
        frequency: str

    @toplevel
    class Measure:
        quantity: str
        city: str
        sampling: Sampling

    # Define different indexes for temperature in Berlin
    berlin_1D = Measure('temperature', 'berlin', Sampling('1D'))
    berlin_1W = Measure('temperature', 'berlin', Sampling('1W'))

    # When this indexer is connected to a Store, the directory structure will look like:
    # temperature-berlin/
    # ├── 1D/
    # │   ├── file_0000.parquet
    # │   └── file_0001.parquet
    # └── 1W/
    #     ├── file_0000.parquet
    #     └── file_0001.parquet

**Simple Example**

.. code-block:: python

    from oups.store import toplevel

    @toplevel
    class TimeSeriesIndex:
        symbol: str
        date: str

    # This creates a schema where datasets are organized as:
    # symbol-date/ (e.g., "AAPL-2023.01.01/")
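For reference, here is a minimal sketch of the path round-trip these helpers provide, based only on the behaviour described above. ``TimeSeriesIndex`` mirrors the simple example; the exact argument accepted by ``from_path`` (a plain string here) is an assumption, so consult the :doc:`api` reference for the authoritative signatures.

.. code-block:: python

    from oups.store import toplevel

    @toplevel
    class TimeSeriesIndex:
        symbol: str
        date: str

    key = TimeSeriesIndex("AAPL", "2023.01.01")

    # __str__ renders the key using the field separator (default "-"),
    # matching the directory naming shown above.
    str(key)                                  # "AAPL-2023.01.01"

    # from_path rebuilds an instance from such a path fragment
    # (string argument assumed; it may also accept a Path object).
    same_key = TimeSeriesIndex.from_path("AAPL-2023.01.01")
    assert same_key == key                    # instances are ordered, frozen dataclasses

Because instances are ordered and frozen dataclasses, they are hashable and compare by value, which is how the ``Store`` described below can use them as dictionary-like keys.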
OrderedParquetDataset
~~~~~~~~~~~~~~~~~~~~~

``OrderedParquetDataset`` is the core class for managing individual parquet datasets with strict ordering validation.

**Key Features:**

- **Ordered Storage**: Data is stored in row groups ordered by a specified column
- **Incremental Updates**: Efficiently merge new data with existing data
- **Row Group Management**: Automatic splitting and merging of row groups
- **Metadata Tracking**: Comprehensive metadata for each row group
- **Metadata Updates**: Add, update, or remove custom key-value metadata
- **Duplicate Handling**: Configurable duplicate detection and removal
- **Write Optimization**: Configurable row group sizes and merge strategies

**File Structure:**

.. code-block::

    parent_directory/
    ├── my_dataset/                # Dataset directory
    │   ├── file_0000.parquet      # Row group files
    │   └── file_0001.parquet
    ├── my_dataset_opdmd           # Metadata file
    └── my_dataset.lock            # Lock file

**Example:**

.. code-block:: python

    from oups.store import OrderedParquetDataset
    import pandas as pd

    # Create or load a dataset
    dataset = OrderedParquetDataset("/path/to/dataset", ordered_on="timestamp")

    # Write data
    df = pd.DataFrame({
        "timestamp": pd.date_range("2023-01-01", periods=1000),
        "value": range(1000)
    })
    dataset.write(df=df)

    # Read data back
    result = dataset.to_pandas()

Store
~~~~~

The ``Store`` class provides a collection interface for managing multiple ``OrderedParquetDataset`` instances organized according to an indexer schema.

**Key Features:**

- **Schema-based Organization**: Uses indexer schemas for dataset discovery
- **Lazy Loading**: Datasets are loaded on demand
- **Collection Interface**: Dictionary-like access to datasets
- **Cross-dataset Operations**: Advanced querying across multiple datasets
- **Automatic Discovery**: Finds existing datasets matching the schema

**Example:**

.. code-block:: python

    from oups.store import Store
    from oups.store import toplevel

    @toplevel
    class StockIndex:
        symbol: str
        year: str

    # Create store
    store = Store("/path/to/data", StockIndex)

    # Access datasets
    aapl_2023 = store[StockIndex("AAPL", "2023")]

    # Iterate over all datasets
    for key in store:
        dataset = store[key]
        print(f"Dataset {key} has {len(dataset)} row groups")

Advanced Features
-----------------

Write Method
~~~~~~~~~~~~

The ``write()`` function provides advanced data writing capabilities.

**Parameters:**

- ``row_group_target_size``: Control row group sizes (int or pandas frequency string)
- ``duplicates_on``: Specify columns for duplicate detection
- ``max_n_off_target_rgs``: Control row group coalescing behavior (see the sketch below)
- ``key_value_metadata``: Store custom metadata (supports add/update/remove operations)

**Example:**

.. code-block:: python

    from oups.store import write

    # Write with time-based row groups and metadata
    write(
        "/path/to/dataset",
        ordered_on="timestamp",
        df=df,
        row_group_target_size="1D",  # One row group per day
        duplicates_on=["timestamp", "symbol"],
        key_value_metadata={
            "source": "market_data",
            "version": "2.1",
            "processed_by": "data_pipeline"
        }
    )

    # Update existing metadata (add new, update existing, remove with None)
    write(
        "/path/to/dataset",
        ordered_on="timestamp",
        df=new_df,
        key_value_metadata={
            "version": "2.2",               # Update existing
            "last_updated": "2023-12-01",   # Add new
            "processed_by": None            # Remove existing
        }
    )
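The examples above do not exercise ``max_n_off_target_rgs``. The sketch below shows how it can be combined with an integer ``row_group_target_size``; the specific values and the interpretation in the comments are illustrative assumptions, and the precise coalescing semantics are documented in the :doc:`api` reference.

.. code-block:: python

    import pandas as pd
    from oups.store import write

    df = pd.DataFrame({
        "timestamp": pd.date_range("2023-01-01", periods=250_000, freq="s"),
        "value": range(250_000),
    })

    write(
        "/path/to/dataset",
        ordered_on="timestamp",
        df=df,
        row_group_target_size=100_000,   # target size expressed as a row count
        max_n_off_target_rgs=4,          # assumed: tolerate up to 4 off-target row groups before coalescing
    )

The intent of such a setting is typically to avoid rewriting existing files on every small append, at the cost of row group sizes that temporarily drift from the target.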
iter_intersections
~~~~~~~~~~~~~~~~~~

The ``iter_intersections()`` method enables efficient querying across multiple datasets with overlapping ranges.

**Key Features:**

- **Range Queries**: Query specific ranges (time, numeric, etc.) across multiple datasets
- **Intersection Detection**: Automatically finds overlapping row groups
- **Memory Efficient**: Streams data without loading entire datasets
- **Synchronized Iteration**: Iterates through multiple datasets in sync

**Example:**

.. code-block:: python

    # Query multiple datasets for overlapping data
    keys = [StockIndex("AAPL", "2023"), StockIndex("GOOGL", "2023")]

    for intersection in store.iter_intersections(
        keys,
        start=pd.Timestamp("2023-01-01"),
        end_excl=pd.Timestamp("2023-02-01")
    ):
        for key, df in intersection.items():
            print(f"Data from {key}: {len(df)} rows")

Best Practices
--------------

1. **Indexer Design**: Design your indexer schema to match your data access patterns
2. **Ordered Column**: Choose an appropriate column for ordering (typically a timestamp)
3. **Row Group Size**: Balance between query performance and storage efficiency
4. **Duplicate Handling**: Use ``duplicates_on`` when data quality is a concern
5. **Metadata**: Use key-value metadata to store important dataset information

See Also
--------

- :doc:`api` - Complete API reference
- :doc:`tutorial` - Getting started guide