API Reference

This section provides detailed API documentation for the main store components.

Indexer Functions

Core Classes

OrderedParquetDataset

Store

Write Operations

Utility Functions

Type Definitions

The following are important type definitions used throughout the store module:

Index Types

An indexer class is a dataclass decorated with @toplevel; its fields define the schema used to organize datasets. Nested levels are declared with @sublevel, as shown under Hierarchical Indexing.

Ordered Column Types

The ordered_on parameter accepts:

  • str: Single column name

  • Tuple[str]: Multi-index column name, with one entry per column level (for hierarchical columns)
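
As a small illustration of the two accepted shapes, the sketch below uses a hypothetical alias, OrderedOn, and a hypothetical helper, ordered_on_label. Neither is part of oups; they only show how a plain column name differs from a multi-index column name written as a tuple of level names. The alias is spelled Tuple[str, ...] here so that tuples of any length type-check, which is an assumption about the intended meaning of Tuple[str].

```python
from typing import Tuple, Union

# Hypothetical alias mirroring the two accepted forms described above.
OrderedOn = Union[str, Tuple[str, ...]]


def ordered_on_label(ordered_on: OrderedOn) -> str:
    """Return a readable label for a column reference (illustration only)."""
    if isinstance(ordered_on, str):
        # Flat columns: a single name.
        return ordered_on
    # Hierarchical columns: one entry per column level.
    return " / ".join(ordered_on)


print(ordered_on_label("timestamp"))
print(ordered_on_label(("price", "close")))
```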

Row Group Target Size Types

The row_group_target_size parameter accepts:

  • int: Target number of rows per row group

  • str: Pandas frequency string (e.g., "1D", "1H") for time-based grouping
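
To make the difference between the two kinds of target concrete, here is a standard-library sketch. The helper split_into_row_groups is hypothetical and is not how oups sizes row groups; it only contrasts splitting by row count with splitting by calendar day, and it handles just the "1D" frequency.

```python
from datetime import datetime, timedelta
from itertools import groupby
from typing import List, Union


def split_into_row_groups(
    timestamps: List[datetime], target: Union[int, str]
) -> List[List[datetime]]:
    """Split sorted timestamps into groups (illustration only, not oups code)."""
    if isinstance(target, int):
        # Row-count target: consecutive chunks of at most `target` rows.
        return [timestamps[i:i + target] for i in range(0, len(timestamps), target)]
    if target == "1D":
        # Time-based target: one group per calendar day.
        return [list(g) for _, g in groupby(timestamps, key=lambda ts: ts.date())]
    raise ValueError(f"this sketch only handles int or '1D', got {target!r}")


# Six rows, one every 8 hours from midnight: 3 rows on each of 2 days.
start = datetime(2023, 1, 1)
rows = [start + timedelta(hours=8 * i) for i in range(6)]

by_count = split_into_row_groups(rows, 4)
by_day = split_into_row_groups(rows, "1D")
print([len(g) for g in by_count])
print([len(g) for g in by_day])
```

With a row-count target, group boundaries ignore the values in the ordered column; with a time-based target, boundaries follow the calendar and group sizes vary with data density.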

Key-Value Metadata

Custom metadata is passed as a Dict[str, str] and stored alongside the parquet files.
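
Because both keys and values are strings, any non-string value has to be serialized before it is passed in. The sketch below shows one possible approach using JSON. The helper to_key_value_metadata is hypothetical and not part of oups.

```python
import json
from typing import Any, Dict


def to_key_value_metadata(raw: Dict[str, Any]) -> Dict[str, str]:
    """Convert arbitrary values to strings (hypothetical helper, not part of oups)."""
    return {
        # Strings pass through unchanged; everything else is JSON-encoded.
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in raw.items()
    }


metadata = to_key_value_metadata(
    {"source": "bloomberg", "version": 1.0, "tags": ["eod", "equity"]}
)
print(metadata)
```

Values encoded this way can be recovered with json.loads when the metadata is read back, provided the reader knows which keys were JSON-encoded.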

Examples

Basic Usage

from oups.store import toplevel, Store, OrderedParquetDataset
import pandas as pd

# Define indexer schema
@toplevel
class MyIndex:
    category: str
    subcategory: str

# Create store
store = Store("/path/to/data", MyIndex)

# Create sample data
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=1000),
    "value": range(1000)
})

# Access dataset and write data
key = MyIndex("stocks", "tech")
dataset = store[key]
dataset.write(df=df, ordered_on="timestamp")

Advanced Write Options

from oups.store import write

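# Time-based row groups with duplicate handling.
# `df` is assumed to contain both "timestamp" and "symbol" columns;
# the DataFrame from Basic Usage would need a "symbol" column added.
write(
    "/path/to/dataset",
    ordered_on="timestamp",
    df=df,
    row_group_target_size="1D",  # Daily row groups
    duplicates_on=["timestamp", "symbol"],  # Columns used to identify duplicates to drop
    max_n_off_target_rgs=2,  # Coalesce small row groups
    key_value_metadata={"source": "bloomberg", "version": "1.0"},
)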

Cross-Dataset Queries

# Query multiple datasets simultaneously
keys = [MyIndex("stocks", "tech"), MyIndex("stocks", "finance")]

for intersection in store.iter_intersections(
    keys,
    start=pd.Timestamp("2023-01-01"),
    end_excl=pd.Timestamp("2023-02-01")
):
    for key, df in intersection.items():
        print(f"Processing {key}: {len(df)} rows")
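
The parameter name end_excl suggests a half-open interval in which start is included and the end bound is excluded. Assuming that reading, the standard-library sketch below shows which timestamps would fall inside the range used above. The helper in_half_open_range is hypothetical and not part of oups.

```python
from datetime import datetime
from typing import List


def in_half_open_range(ts: datetime, start: datetime, end_excl: datetime) -> bool:
    """True when start <= ts < end_excl (assumed semantics, illustration only)."""
    return start <= ts < end_excl


start = datetime(2023, 1, 1)
end_excl = datetime(2023, 2, 1)
candidates: List[datetime] = [
    datetime(2022, 12, 31),  # before start: excluded
    datetime(2023, 1, 1),    # equal to start: included
    datetime(2023, 1, 31),   # inside the range: included
    datetime(2023, 2, 1),    # equal to end_excl: excluded
]
selected = [ts for ts in candidates if in_half_open_range(ts, start, end_excl)]
print(selected)
```

Half-open ranges make consecutive queries easy to chain: the end_excl of one query can be reused as the start of the next without reading any row twice.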

Hierarchical Indexing

from oups.store import toplevel, sublevel

@sublevel
class DateInfo:
    year: str
    month: str

@toplevel
class HierarchicalIndex:
    symbol: str
    date_info: DateInfo

# This creates paths like: AAPL/2023-01/
key = HierarchicalIndex("AAPL", DateInfo("2023", "01"))
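
To show how a nested key can map to a path such as AAPL/2023-01, the sketch below rebuilds the idea with plain standard-library dataclasses. It is an analogy only and not the oups implementation: the classes DateInfoSketch and HierarchicalIndexSketch, the helper to_path, and both separators are assumptions chosen to reproduce the path in the comment above. The sketch also assumes that a nested dataclass is the last field of its parent.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class DateInfoSketch:
    year: str
    month: str


@dataclass(frozen=True)
class HierarchicalIndexSketch:
    symbol: str
    date_info: DateInfoSketch


def to_path(key, level_sep: str = "/", field_sep: str = "-") -> str:
    """Join plain fields with `field_sep`; open a new level at a nested dataclass."""
    parts = []
    for f in fields(key):
        value = getattr(key, f.name)
        if is_dataclass(value):
            # A nested dataclass starts a new directory level.
            return field_sep.join(parts) + level_sep + to_path(value, level_sep, field_sep)
        parts.append(str(value))
    return field_sep.join(parts)


key_sketch = HierarchicalIndexSketch("AAPL", DateInfoSketch("2023", "01"))
print(to_path(key_sketch))
```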