API Reference

This section provides detailed API documentation for the main store components.

Indexer Functions

toplevel(index_class=None, *, field_sep: str = '-') → Type | Callable

Turn the decorated class into an indexing schema.

The decorated class is equipped with methods and attributes for use with a ParquetSet instance. It has to be defined as one would define a class decorated with @dataclass.

Parameters:

field_sep (str, default '-') – Character to use as separator between fields of the dataclass.

Returns:

Decorated class.

field_sep

Fields separator (can’t assign).

Type:

str

depth

Number of levels, including ‘toplevel’ (can’t assign).

Type:

int

Notes

Decorating with @toplevel actually calls @dataclass, with parameters set to:

  • order=True,

  • frozen=True

When the class is instantiated, a validation step is conducted on attribute types and values.

  • An instance can only be composed of int or str attributes, with optionally a dataclass object coming in last position;

  • Attribute values cannot contain forbidden characters such as ‘/’ and ‘self.field_sep’.

sublevel(index_class)

Define a subdirectory level.

This decorator is essentially an alias of the @dataclass decorator, with parameters set to:

  • order=True,

  • frozen=True

is_toplevel(toplevel) → bool

Return True if the argument is a toplevel-decorated class.

Returns True if ‘toplevel’ (class or instance) has been decorated with @toplevel. It checks for the presence of a ‘field_sep’ attribute and a ‘from_path’ method.
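
The following sketch shows the three indexer functions together. It is a minimal sketch: the ‘oups’ import path is an assumption, to be adjusted to your installation.

    # Minimal sketch; the import path is an assumption.
    from oups import is_toplevel, sublevel, toplevel

    @sublevel
    class Sampling:
        frequency: str

    @toplevel
    class Measure:
        quantity: str
        sampling: Sampling  # a dataclass attribute must come last

    key = Measure("temperature", Sampling("1h"))

    print(Measure.field_sep)      # '-' (default separator)
    print(Measure.depth)          # 2: 'toplevel' plus one sublevel
    print(is_toplevel(Measure))   # True
    print(is_toplevel(Sampling))  # False

Because the decorated classes are frozen and ordered, their instances can serve directly as sorted keys.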

Core Classes

OrderedParquetDataset

class OrderedParquetDataset(dirpath: str | Path, ordered_on: str | None = None, lock_timeout: int | None = None, lock_lifetime: int | None = 15)

Base class for Ordered Parquet Dataset with shared functionality.

This class contains all shared attributes, properties, and methods between the full OrderedParquetDataset and its read-only version.

_file_ids_n_digits

Number of digits to use for ‘file_id’ in filename. It is kept as an attribute to avoid recomputing it at each call to ‘get_parquet_filepaths()’.

Type:

int

_lock

Exclusive lock held for the object’s entire lifetime.

Type:

Lock

_max_allowed_file_id

Maximum allowed file id. Kept as a hidden attribute to avoid recomputing it at each call to ‘write_row_group_files()’.

Type:

int

_max_n_rows

Maximum allowed number of rows in a row group. Kept as a hidden attribute to avoid recomputing it at each call to ‘write_row_group_files()’.

Type:

int

dirpath

Directory path from where to load data.

Type:

Path

is_newly_initialized

True if this dataset instance was just created and has no existing metadata file. False if the dataset was loaded from existing files.

Type:

bool

key_value_metadata

Key-value metadata, from the user, including the ‘ordered_on’ column name.

Type:

Dict[str, str]

max_file_id

Maximum file id in current directory.

Type:

int

ordered_on

Column name by which row groups are ordered. Can be set either at OrderedParquetDataset instantiation or in ‘kwargs’ of the ‘write()’ method. Once set, it cannot be changed.

Type:

str

row_group_stats

Row group statistics, with columns:

  • “ordered_on_min”: min value in ‘ordered_on’ column for this group,

  • “ordered_on_max”: max value in ‘ordered_on’ column for this group,

  • “n_rows”: number of rows in this group,

  • “file_id”: an int indicating the file id for this group.

Type:

DataFrame

remove_from_disk()

Remove all dataset files from disk and update in-memory state.

to_pandas()

Return data as a pandas dataframe.

write()

Write data to disk, merging with existing data.

__del__()

Release the lock when the object is garbage collected. Uses reference counting to ensure the lock is only released when all instances are gone.

__getitem__(item: int | slice) → ReadOnlyOrderedParquetDataset

Select among the row groups using an integer or a slice.

__len__()

Return number of row groups in the dataset.

_align_file_ids()

Align file ids to row group position in the dataset.

_release_lock()

Release lock with reference counting.

_remove_row_group_files()

Remove row group files from disk. Row group indexes are also removed from row_group_stats.

_sort_row_groups()

Sort row groups according to their min value in ‘ordered_on’ column.

_write_metadata_file()

Write metadata to disk.

_write_row_group_files()

Write row groups as files to disk, one row group per file.

Notes

  • There is one row group per file.

  • Dataset metadata are written to a separate file in parquet format, located at the same level as the dataset directory (not within it). This way, given the directory path, another parquet reader can read the dataset without being confused by this metadata file.

  • File ids (in file names) have the same number of digits. This is to ensure that files can be read in the correct order by other parquet readers.

  • When creating an OrderedParquetDataset object, a lock is acquired and held for the object’s entire lifetime. The purpose is to provide race-condition-free exclusive access, suitable for scenarios with limited concurrent processes. The lock is acquired with a timeout and a lifetime. The timeout is the maximum time to wait for lock acquisition, in seconds. The lifetime is the expected maximum lifetime of the lock, as a timedelta or an integer number of seconds, relative to when the lock is acquired. Reading and writing operations refresh the lock to its initially provided lifetime.
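
A minimal usage sketch follows. The import path and the exact ‘write()’ keyword arguments are assumptions, to be adjusted to your installation.

    import pandas as pd

    # Minimal sketch; the import path is an assumption.
    from oups import OrderedParquetDataset

    df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=6, freq="1h"),
        "value": range(6),
    })

    # Instantiation acquires an exclusive lock, held for the object's
    # entire lifetime (see the note above).
    opd = OrderedParquetDataset("data/temperature", ordered_on="timestamp")
    opd.write(df=df)        # write data, merging with existing data
    print(len(opd))         # number of row groups
    print(opd.to_pandas())  # whole dataset as a pandas DataFrame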

Store

class Store(basepath: str | Path, indexer: Type[dataclass])

Sorted list of keys (indexes to parquet datasets).

basepath

Directory path to the set of parquet datasets.

Type:

Path

indexer

Indexer schema (class) to be used to index parquet datasets.

Type:

Type[dataclass]

keys

Set of indexes of existing parquet datasets.

Type:

SortedSet

_needs_keys_refresh

Flag indicating that the ‘keys’ property needs to be refreshed from disk. Set to True when a new ‘OrderedParquetDataset’ is accessed but doesn’t yet have a metadata file on disk. When True, the next access to the ‘keys’ property will rescan the filesystem to update the keys collection.

Type:

bool

get()

Return the OrderedParquetDataset instance corresponding to key.

iter_intersections()

Iterate over row group intersections across multiple datasets in store.

__getitem__()

Return the OrderedParquetDataset instance corresponding to key.

__delitem__()

Remove dataset from parquet set.

__iter__()

Iterate over keys.

__len__()

Return number of datasets.

__repr__()

List of datasets.

__contains__()

Assess presence of this dataset.

Notes

SortedSet is the data structure retained for keys instead of SortedList, as its __contains__ appears faster.
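
A minimal usage sketch follows, reusing a toplevel-decorated indexer. Import paths and the ‘write()’ call on the returned dataset are assumptions, to be adjusted to your installation.

    import pandas as pd

    # Minimal sketch; import paths are assumptions.
    from oups import Store, toplevel

    @toplevel
    class Dataset:
        country: str
        city: str

    store = Store("data/weather", Dataset)
    key = Dataset("germany", "berlin")

    df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=3, freq="1D"),
        "temperature": [3.1, 2.5, 4.2],
    })

    # '__getitem__' returns the OrderedParquetDataset for this key;
    # writing to it creates the dataset if it does not exist yet.
    store[key].write(ordered_on="timestamp", df=df)

    print(key in store)  # True, via __contains__
    print(len(store))    # number of datasets
    for k in store:      # iterate over sorted keys
        print(k)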

Write Operations

write(dirpath: str | Path | OrderedParquetDataset, ordered_on: str | Tuple[str], df: pandas.DataFrame | None = None, row_group_target_size: str | int | None = 6345000, duplicates_on: str | List[str] | List[Tuple[str]] = None, max_n_off_target_rgs: int = None, key_value_metadata: Dict[str, str] = None, **kwargs)

Write data to disk at the location specified by ‘dirpath’.

Parameters:
  • dirpath (Union[str, Path, OrderedParquetDataset]) – If a string or a Path, the directory to which the pandas DataFrame is written. If an OrderedParquetDataset, the dataset to which the pandas DataFrame is written.

  • ordered_on (Union[str, Tuple[str]]) – Name of the column with respect to which the dataset is in ascending order. With a column multi-index, the name of the column is a tuple. This column makes it possible to know ‘where’ to insert new data into existing data, i.e. to complete or correct past records (it does not allow removing prior data).

  • df (Optional[DataFrame], default None) – Data to write. If None, a resize of the Ordered Parquet Dataset may nonetheless be performed.

  • row_group_target_size (Optional[Union[int, str]]) – Target size of row groups. If not set, defaults to 6_345_000, which for a dataframe with 6 columns of float64 or int64 results in a memory footprint (RAM) of about 290MB. It can also be a pandas freqstr, to gather data by timestamp over a defined period.

  • duplicates_on (Union[str, List[str], List[Tuple[str]]], optional) – Column names according to which ‘row duplicates’ can be identified (i.e. rows sharing the same values on these specific columns) so as to drop them. Duplicates are only identified in new data and in existing recorded row groups that overlap with new data. When duplicates are dropped, only the last is kept. To identify row duplicates using all columns, an empty list [] can be used instead of all column names. If not set, defaults to None, meaning no row is dropped.

  • max_n_off_target_rgs (int, optional) –

    Max expected number of ‘off target’ row groups. If ‘row_group_target_size’ is an int, then a ‘complete’ row group is one whose size is ‘close to’ row_group_target_size (>= 80%). If ‘row_group_target_size’ is a pandas freqstr, and if there are several row groups in the last period defined by the freqstr, then these row groups are considered incomplete. When evaluating the number of ‘incomplete’ row groups, only those at the end of an existing dataset are accounted for. ‘Incomplete’ row groups in the middle of ‘complete’ row groups are not accounted for (they can be created by insertion of new data in the middle of existing data). If not set, defaults to None.

    • A None value induces no coalescing of row groups. If there is no drop of duplicates, new data is systematically appended.

    • A value of 0 or 1 means that new data should systematically be merged with the last existing row group, to ‘complete’ it (if it is not ‘complete’ already).

  • key_value_metadata (Dict[str, str], optional) – Key-value metadata to write or update in the dataset. Please see the fastparquet documentation for the update logic when a None value is used.

  • **kwargs – Additional parameters to pass to ‘OrderedParquetDataset.write_row_group_files()’.

Notes

  • When writing a dataframe with this function,

    • the index of the dataframe is not written to disk.

    • the parquet file scheme is ‘hive’ (one row group per parquet file).

  • Coalescing of off target size row groups is triggered if the actual number of off target row groups is larger than max_n_off_target_rgs. This assessment is only made if max_n_off_target_rgs is set. Otherwise, new data is simply appended, without prior check.

  • When duplicates_on is set, the ‘ordered_on’ column is added to the duplicates_on list if not already part of it. The purpose is to enable a first approximate search for duplicates, so as to load only the data of interest.

  • For simple data appending, i.e. without need to drop duplicates, it is advised to keep the ordered_on and duplicates_on parameters set to None, as these parameters trigger unnecessary evaluations.

  • Off target size row groups are row groups:

    • either not reaching the maximum number of rows if ‘row_group_target_size’ is an int,

    • or several row groups lying in the same time period if ‘row_group_target_size’ is a pandas ‘freqstr’.

  • When incorporating new data within recorded data, existing off target size row groups are only resized if they intersect with new data. Otherwise, new data is simply added, without merging with existing off target size row groups.
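
A short sketch of a typical call follows. The import path is an assumption; the parameters are those documented above.

    import pandas as pd

    # Minimal sketch; the import path is an assumption.
    from oups import write

    df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=4, freq="1h"),
        "value": [1.0, 2.0, 3.0, 4.0],
    })

    # Insert 'df' into the dataset at 'data/values', keeping it ordered
    # on 'timestamp'. Rows duplicated on 'timestamp' are dropped (last
    # kept), and trailing off target size row groups are coalesced once
    # more than 2 of them exist.
    write(
        "data/values",
        ordered_on="timestamp",
        df=df,
        duplicates_on="timestamp",
        max_n_off_target_rgs=2,
    )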

Utility Functions

check_cmidx(cmidx: pandas.MultiIndex)

Check if column multi-index complies with fastparquet requirements.

The fastparquet library requires names for each level in a MultiIndex. Also, column names have to be tuples of strings.

Parameters:

cmidx (MultiIndex) – MultiIndex to check.
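
A short sketch; the import path is an assumption.

    import pandas as pd

    # Minimal sketch; the import path is an assumption.
    from oups import check_cmidx

    # Level names are set and column names are tuples of strings, so
    # this MultiIndex complies with fastparquet requirements.
    cmidx = pd.MultiIndex.from_tuples(
        [("price", "open"), ("price", "close")],
        names=["kind", "field"],
    )
    check_cmidx(cmidx)  # passes silently; raises for a non-compliant index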

conform_cmidx(df: pandas.DataFrame)

Conform pandas column multi-index.

Library fastparquet has several requirements to handle column MultiIndex.

  • It requires names for each level in a MultiIndex. If these are not set, they are set to ‘’, an empty string.

  • It requires column names to be tuples of strings. If a component is not a string (for instance a float or an int), it is turned into a string.

DataFrame is modified in-place.

Parameters:

df (DataFrame) – DataFrame with a column multi-index to check and possibly adjust.

Returns:

None
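
A short sketch illustrating both adjustments; the import path is an assumption.

    import pandas as pd

    # Minimal sketch; the import path is an assumption.
    from oups import conform_cmidx

    df = pd.DataFrame(
        [[1.0, 2.0]],
        columns=pd.MultiIndex.from_tuples([("price", 1), ("price", 2)]),
    )

    conform_cmidx(df)  # modifies 'df' in place, returns None

    # Unnamed levels are now named '', and non-string components have
    # been turned into strings.
    print(df.columns.names)  # FrozenList(['', ''])
    print(list(df.columns))  # [('price', '1'), ('price', '2')]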