Why oups?

Purpose

oups (Ordered Updatable Parquet Store) is designed for managing large collections of ordered datasets, particularly time-series data. It provides a comprehensive framework for efficient storage, indexing, and querying of structured data with validated ordering.

Key Design Goals:

  • Schema-driven Organization: Use dataclass schemas to automatically organize and discover datasets

  • Ordered Storage Validation: Verify strict ordering within datasets for optimal query performance

  • Efficient Updates: Support incremental data updates with intelligent merging strategies

  • Memory Efficiency: Minimize memory footprint during read/write operations

  • Concurrent Access: Provide safe concurrent access through file-based locking

  • Flexible Querying: Enable cross-dataset queries and range-based data retrieval
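
The ordering-validation goal above reduces to a simple invariant check. A minimal sketch in plain Python (the function name `is_ordered` is illustrative, not part of the oups API):

```python
from operator import lt, le

def is_ordered(values, strict=True):
    """Return True if `values` is sorted ascending (strictly if `strict`)."""
    cmp = lt if strict else le
    # Compare each element with its successor.
    return all(cmp(a, b) for a, b in zip(values, values[1:]))

# A strictly increasing timestamp column passes; a repeated
# timestamp fails the strict check but passes the non-strict one.
print(is_ordered([1, 2, 3]))
print(is_ordered([1, 2, 2]))
print(is_ordered([1, 2, 2], strict=False))
```

Validating this invariant at write time is what lets range queries later skip whole row groups without scanning them.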

Core Features:

  • Hierarchical Indexing: Define complex organizational schemas using @toplevel decorated dataclasses

  • Row Group Management: Leverage parquet's row-group structure to optimize both storage and query performance

  • Duplicate Handling: Configurable duplicate detection and removal

  • Metadata Support: Rich metadata storage alongside datasets

  • Range Queries: Efficient querying across time ranges and multiple datasets simultaneously
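
The idea behind schema-driven hierarchical indexing can be sketched with a plain stdlib dataclass; oups's actual `@toplevel` decorator adds this kind of path derivation automatically (the `to_path` helper below is illustrative, not oups code):

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DataIndex:
    symbol: str
    year: str
    month: str

def to_path(index) -> str:
    """Derive a directory path from the index fields, in declaration order."""
    return "/".join(str(getattr(index, f.name)) for f in fields(index))

key = DataIndex("AAPL", "2023", "01")
print(to_path(key))  # AAPL/2023/01
```

Because the schema is a dataclass, the same class that names a dataset also documents its organizational hierarchy, and discovery is just the inverse mapping from paths back to instances.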

Use Cases

Consider using oups for:

  • Financial Time Series: Managing market data, trading records, and risk metrics across multiple instruments

  • IoT Data Collection: Organizing sensor data from multiple devices and locations

  • Analytics Pipelines: Storing intermediate and final results of data processing workflows

  • Research Data: Managing experimental datasets with complex organizational requirements

Alternatives

Several alternatives exist for managing dataset collections:

Arctic (MongoDB-based)
  • Provides powerful time-series storage

  • Requires MongoDB infrastructure

  • More complex deployment and maintenance

PyStore (Dask-based)
  • Supports parallelized operations

  • Less flexible organizational schemas

  • Performance concerns in some scenarios

DuckDB or DataFusion
  • Excellent query performance

  • SQL-based querying

  • Query engines rather than dataset stores: organizing, incrementally updating, and locking many parquet datasets remains manual

Direct Parquet + File Management
  • Maximum control over file structure

  • Requires implementing indexing, updates, and concurrency manually

  • This is how oups started

oups Advantages

Compared to these alternatives, oups offers:

  • Pure Python Implementation: No external database dependencies

  • Flexible Duplicate Handling: User-defined logic for handling duplicate rows

  • Automated Path Management: Schema-driven directory organization

  • Incremental Updates: Efficient merging of new data with existing datasets

  • Ordering Validation: Built-in verification of data ordering for optimal performance

  • Simple API: A small Python interface that requires no SQL knowledge

  • Lock-based Concurrency: Safe concurrent access without complex coordination
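
The lock-based concurrency advantage rests on a well-known pattern: create a lock file atomically, so only one process can hold it at a time. A minimal stdlib sketch of that pattern (not oups's actual locking code):

```python
import os
import contextlib

@contextlib.contextmanager
def file_lock(path):
    """Hold an exclusive lock by atomically creating a lock file."""
    fd = None
    try:
        # O_EXCL makes creation fail with FileExistsError if another
        # process already holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        yield
    finally:
        if fd is not None:
            os.close(fd)
            os.remove(path)

with file_lock("dataset.lock"):
    pass  # read-modify-write the dataset safely here
```

Because the lock lives on the filesystem next to the data, no broker or database is needed to coordinate writers.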

Example Comparison

Traditional Approach:

import pandas as pd

# Manual path management
path = f"/data/{symbol}/{year}/{month}/data.parquet"

# Manual duplicate handling: the whole file is read, concatenated,
# deduplicated, sorted, and rewritten on every update.
existing_df = pd.read_parquet(path)
new_df = pd.concat([existing_df, new_data])
new_df = new_df.drop_duplicates(subset=['timestamp', 'symbol']).sort_values('timestamp')
new_df.to_parquet(path)

With oups:

from oups import Store, toplevel

@toplevel
class DataIndex:
    symbol: str
    year: str
    month: str

store = Store("/data", DataIndex)
key = DataIndex("AAPL", "2023", "01")

# Automatic path management, duplicate handling, and ordering
store[key].write(
    df=new_data,
    ordered_on='timestamp',
    duplicates_on=['timestamp', 'symbol']
)