Why *oups*?
===========

Purpose
-------

*oups* (Ordered Updatable Parquet Store) is designed for managing large collections of ordered datasets, particularly time-series data. It provides a framework for efficient storage, indexing, and querying of structured data with validated ordering.

**Key Design Goals:**

* **Schema-driven Organization**: Use dataclass schemas to automatically organize and discover datasets
* **Ordered Storage Validation**: Verify strict ordering within datasets for optimal query performance
* **Efficient Updates**: Support incremental data updates with intelligent merging strategies
* **Memory Efficiency**: Minimize memory footprint during read/write operations
* **Concurrent Access**: Provide safe concurrent access through file-based locking
* **Flexible Querying**: Enable cross-dataset queries and range-based data retrieval

**Core Features:**

* **Hierarchical Indexing**: Define complex organizational schemas using ``@toplevel`` decorated dataclasses
* **Row Group Management**: Use the Parquet file structure to optimize both storage and query performance
* **Duplicate Handling**: Configurable duplicate detection and removal
* **Metadata Support**: Rich metadata storage alongside datasets
* **Range Queries**: Efficient querying across time ranges and multiple datasets simultaneously

Use Cases
---------

Typical use cases for *oups* include:

* **Financial Time Series**: Managing market data, trading records, and risk metrics across multiple instruments
* **IoT Data Collection**: Organizing sensor data from multiple devices and locations
* **Analytics Pipelines**: Storing intermediate and final results of data processing workflows
* **Research Data**: Managing experimental datasets with complex organizational requirements

Alternatives
------------

Several alternatives exist for managing dataset collections:

**Arctic (MongoDB-based)**

- Provides powerful time-series storage
- Requires MongoDB infrastructure
- More complex deployment and maintenance

**PyStore (Dask-based)**

- Supports parallelized operations
- Less flexible organizational schemas
- Performance concerns in some scenarios

**DuckDB or DataFusion**

- Excellent query performance
- SQL-based querying

**Direct Parquet + File Management**

- Maximum control over file structure
- Requires implementing indexing, updates, and concurrency manually
- This is how *oups* started

*oups* Advantages
------------------

Compared to these alternatives, *oups* offers:

* **Pure Python Implementation**: No external database dependencies
* **Flexible Duplicate Handling**: User-defined logic for handling duplicate rows
* **Automated Path Management**: Schema-driven directory organization
* **Incremental Updates**: Efficient merging of new data with existing datasets
* **Ordering Validation**: Built-in verification of data ordering for optimal performance
* **Simple API**: An interface that does not require SQL knowledge
* **Lock-based Concurrency**: Safe concurrent access without complex coordination

Example Comparison
-------------------

**Traditional Approach:**

.. code-block:: python

    import pandas as pd

    # Manual path management
    path = f"/data/{symbol}/{year}/{month}/data.parquet"

    # Manual duplicate handling: read everything, concatenate,
    # deduplicate, re-sort, and rewrite the whole file
    existing_df = pd.read_parquet(path)
    new_df = pd.concat([existing_df, new_data])
    new_df = new_df.drop_duplicates().sort_values('timestamp')
    new_df.to_parquet(path)

**With oups:**

.. code-block:: python

    @toplevel
    class DataIndex:
        symbol: str
        year: str
        month: str

    store = Store("/data", DataIndex)
    key = DataIndex("AAPL", "2023", "01")

    # Automatic path management, duplicate handling, and ordering
    store[key].write(
        df=new_data,
        ordered_on='timestamp',
        duplicates_on=['timestamp', 'symbol']
    )
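
To make the roles of ``ordered_on`` and ``duplicates_on`` more concrete, the sketch below reproduces the merge step of the traditional approach in plain pandas, in memory and without any file I/O. It is only an illustration under stated assumptions: ``merge_ordered`` is a hypothetical helper, not part of *oups*, the "new rows win" rule is one possible duplicate policy, and the sample data is made up.

```python
import pandas as pd


def merge_ordered(existing, new, ordered_on, duplicates_on):
    """Merge `new` into `existing`, deduplicate, and sort on `ordered_on`.

    Hypothetical helper for illustration only; not the oups implementation.
    """
    merged = pd.concat([existing, new], ignore_index=True)
    # Assumed policy: keep the last occurrence so new rows replace stored ones.
    merged = merged.drop_duplicates(subset=duplicates_on, keep="last")
    # A stable sort keeps the relative order of rows sharing the same key.
    merged = merged.sort_values(ordered_on, kind="stable").reset_index(drop=True)
    # Defensive check that the result is ordered on the chosen column.
    if not merged[ordered_on].is_monotonic_increasing:
        raise ValueError(f"data is not ordered on {ordered_on!r}")
    return merged


# Made-up sample data: one overlapping timestamp between the two frames.
existing = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2023-01-02", "2023-01-03"]),
        "symbol": ["AAPL", "AAPL"],
        "price": [125.0, 126.0],
    }
)
new_data = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2023-01-03", "2023-01-04"]),
        "symbol": ["AAPL", "AAPL"],
        "price": [126.5, 127.0],
    }
)

result = merge_ordered(
    existing,
    new_data,
    ordered_on="timestamp",
    duplicates_on=["timestamp", "symbol"],
)
print(result)
```

Passing an explicit ``subset`` to ``drop_duplicates`` matters here: without it, the two rows for 2023-01-03 differ in ``price`` and would both be kept.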