API Reference
This section provides detailed API documentation for the main store components.
Indexer Functions
- toplevel(index_class=None, *, field_sep: str = '-') → Type | Callable
Turn the decorated class into an indexing schema.
The decorated class is equipped with methods and attributes to be used with a ParquetSet instance. It has to be defined as one would define a class decorated by @dataclass.
- Parameters:
field_sep (str, default '-') – Character to use as separator between fields of the dataclass.
- Returns:
Decorated class.
- field_sep
Fields separator (cannot be assigned).
- Type:
str
- depth
Number of levels, including 'toplevel' (cannot be assigned).
- Type:
int
Notes
@dataclass is actually called when decorating with @toplevel, with parameters set to order=True and frozen=True.
When the class is instantiated, a validation step is conducted on attribute types and values:
An instance can only be composed of int, str, or a dataclass object coming in last position;
Attribute values cannot incorporate forbidden characters such as '/' and self.field_sep.
- sublevel(index_class)
Define a subdirectory level.
This decorator is actually an alias of the @dataclass decorator, with parameters set to order=True and frozen=True.
- is_toplevel(toplevel) → bool
Return True if the class is toplevel-decorated.
Returns True if 'toplevel' (class or instance) has been decorated with '@toplevel'. It checks the presence of the 'field_sep' attribute and the 'from_path' method.
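Example
A minimal sketch of declaring an indexing schema with these decorators. The import path and the field names are assumptions; adjust them to your installation:
```python
# Import path is an assumption; adjust to where the library exposes
# 'toplevel', 'sublevel' and 'is_toplevel' in your installation.
from oups import is_toplevel, sublevel, toplevel


@sublevel
class Sampling:
    frequency: str           # e.g. '1h'


@toplevel                    # default 'field_sep' applies
class Instrument:
    exchange: str            # only str and int fields are allowed...
    symbol: str
    sampling: Sampling       # ...except a dataclass field, which must come last


# Validation happens at instantiation: values must not contain '/' or 'field_sep'.
key = Instrument("binance", "btcusdt", Sampling("1h"))

print(is_toplevel(Instrument))   # True
print(is_toplevel(Sampling))     # False: @sublevel is a plain dataclass
print(key.depth)                 # 2: the top level plus one sublevel
```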
Core Classes
OrderedParquetDataset
- class OrderedParquetDataset(dirpath: str | Path, ordered_on: str | None = None, lock_timeout: int | None = None, lock_lifetime: int | None = 15)
Base class for Ordered Parquet Dataset with shared functionality.
This class contains all shared attributes, properties, and methods between the full OrderedParquetDataset and its read-only version.
- _file_ids_n_digits
Number of digits to use for ‘file_id’ in filename. It is kept as an attribute to avoid recomputing it at each call to ‘get_parquet_filepaths()’.
- Type:
int
- _lock
Exclusive lock held for the object’s entire lifetime.
- Type:
Lock
- _max_allowed_file_id
Maximum allowed file id. Kept as a hidden attribute to avoid recomputing it at each call to ‘write_row_group_files()’.
- Type:
int
- _max_n_rows
Maximum allowed number of rows in a row group. Kept as a hidden attribute to avoid recomputing it at each call to ‘write_row_group_files()’.
- Type:
int
- dirpath
Directory path from where to load data.
- Type:
Path
- is_newly_initialized
True if this dataset instance was just created and has no existing metadata file. False if the dataset was loaded from existing files.
- Type:
bool
- key_value_metadata
User-provided key-value metadata, including the ‘ordered_on’ column name.
- Type:
Dict[str, str]
- max_file_id
Maximum file id in current directory.
- Type:
int
- ordered_on
Column name to order row groups by. Can be set either at OrderedParquetDataset instantiation or in ‘kwargs’ of the ‘write()’ method. Once set, it cannot be changed.
- Type:
str
- row_group_stats
Row group statistics, with one row per row group:
“ordered_on_min”: min value in ‘ordered_on’ column for this group,
“ordered_on_max”: max value in ‘ordered_on’ column for this group,
“n_rows”: number of rows in this group,
“file_id”: an int indicating the file id for this group.
- Type:
DataFrame
- remove_from_disk()
Remove all dataset files from disk and update in-memory state.
- to_pandas()
Return data as a pandas dataframe.
- write()
Write data to disk, merging with existing data.
- __del__()
Release lock when object is garbage collected. Uses reference counting to ensure lock is only released when all instances are gone.
- __getitem__(self, item: int | slice) → 'ReadOnlyOrderedParquetDataset'
Select among the row groups using an integer or a slice.
- __len__()
Return number of row groups in the dataset.
- _align_file_ids()
Align file ids to row group position in the dataset.
- _release_lock()
Release lock with reference counting.
- _remove_row_group_files()
Remove row group files from disk. Row group indexes are also removed from row_group_stats.
- _sort_row_groups()
Sort row groups according to their min value in the ‘ordered_on’ column.
- _write_metadata_file()
Write metadata to disk.
- _write_row_group_files()
Write row groups as files to disk, one row group per file.
Notes
There is one row group per file.
Dataset metadata are written in a separate file in parquet format, located at the same level as the dataset directory (not within the directory). This way, if provided the directory path, another parquet reader can read the dataset without being confused by this metadata file.
File ids (in file names) have the same number of digits. This is to ensure that files can be read in the correct order by other parquet readers.
When creating an OrderedParquetDataset object, a lock is acquired and held for the object’s entire lifetime. The purpose is to provide race-condition-free exclusive access suitable for scenarios with limited concurrent processes. The lock is acquired with a timeout and a lifetime. The timeout is the maximum time to wait for lock acquisition, in seconds. The lifetime is the expected maximum lifetime of the lock, as a timedelta or an integer number of seconds, relative to when the lock is acquired. Reading and writing operations refresh the lock to the lifetime it was initially given.
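Example
A minimal usage sketch. The import path and the exact keyword names accepted by ‘write()’ are assumptions, as they are not spelled out above:
```python
import pandas as pd

# Import path is an assumption; adjust to your installation.
from oups import OrderedParquetDataset

df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=6, freq="1h"),
        "value": range(6),
    }
)

# The exclusive lock is acquired here and held for the object's lifetime.
opd = OrderedParquetDataset("/tmp/example_opd", ordered_on="timestamp")

# Merge new data with whatever is already recorded on disk
# (keyword name 'df' is assumed here).
opd.write(df=df)

print(len(opd))                  # number of row groups
print(opd.to_pandas().tail())    # full dataset back as a pandas DataFrame

# Integer/slice selection returns a read-only view over row groups.
last_row_group = opd[-1:]
```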
Store
- class Store(basepath: str | Path, indexer: Type[dataclass])
Sorted list of keys (indexes to parquet datasets).
- basepath
Directory path to the set of parquet datasets.
- Type:
Path
- indexer
Indexer schema (class) to be used to index parquet datasets.
- Type:
Type[dataclass]
- keys
Set of indexes of existing parquet datasets.
- Type:
SortedSet
- _needs_keys_refresh
Flag indicating that the ‘keys’ property needs to be refreshed from disk. Set to True when a new ‘OrderedParquetDataset’ is accessed but doesn’t yet have a metadata file on disk. When True, the next access to the ‘keys’ property will rescan the filesystem to update the keys collection.
- Type:
bool
- get()
Return the OrderedParquetDataset instance corresponding to ‘key’.
- iter_intersections()
Iterate over row group intersections across multiple datasets in store.
- __getitem__()
Return the OrderedParquetDataset instance corresponding to ‘key’.
- __delitem__()
Remove dataset from parquet set.
- __iter__()
Iterate over keys.
- __len__()
Return number of datasets.
- __repr__()
List of datasets.
- __contains__()
Assess presence of this dataset.
Notes
SortedSet is the data structure retained for keys instead of SortedList, as its __contains__ appears faster.
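Example
A sketch of a Store driving datasets through a toplevel-decorated indexer. The import paths and the keyword names passed to ‘write()’ are assumptions:
```python
import pandas as pd

# Import paths are assumptions; adjust to your installation.
from oups import Store, toplevel


@toplevel
class DatasetIndex:
    category: str
    name: str


store = Store("/tmp/example_store", DatasetIndex)

key = DatasetIndex("weather", "paris")
df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=3, freq="1D"),
        "temperature": [4.1, 5.0, 3.7],
    }
)

# '__getitem__' returns the OrderedParquetDataset for this key; 'ordered_on'
# is passed through 'write()' kwargs since the dataset does not exist yet.
store[key].write(ordered_on="timestamp", df=df)

print(key in store)          # __contains__
print(len(store))            # number of datasets in the store
for existing_key in store:   # __iter__ over sorted keys
    print(existing_key)
```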
Write Operations
- write(dirpath: str | Path | OrderedParquetDataset, ordered_on: str | Tuple[str], df: pandas.DataFrame | None = None, row_group_target_size: str | int | None = 6345000, duplicates_on: str | List[str] | List[Tuple[str]] = None, max_n_off_target_rgs: int = None, key_value_metadata: Dict[str, str] = None, **kwargs)
Write data to disk at the location specified by ‘dirpath’.
- Parameters:
dirpath (Union[str, Path, OrderedParquetDataset]) – If a string or a Path, the directory to which the pandas dataframe is written. If an OrderedParquetDataset, the dataset to which the pandas dataframe is written.
ordered_on (Union[str, Tuple[str]]) – Name of the column with respect to which the dataset is in ascending order. If the column index is a multi-index, the name of the column is a tuple. It allows knowing ‘where’ to insert new data into existing data, i.e. completing or correcting past records (but it does not allow removing prior data).
df (Optional[DataFrame], default None) – Data to write. If None, a resize of the Ordered Parquet Dataset may however be performed.
row_group_target_size (Optional[Union[int, str]]) – Target size of row groups. If not set, defaults to 6_345_000, which for a dataframe with 6 columns of float64 or int64 results in a memory footprint (RAM) of about 290MB. It can also be a pandas freqstr, to gather data by timestamp over a defined period.
duplicates_on (Union[str, List[str], List[Tuple[str]]], optional) – Column names according to which ‘row duplicates’ can be identified (i.e. rows sharing the same values on these specific columns) so as to drop them. Duplicates are only identified in new data and in existing recorded row groups that overlap with new data. If duplicates are dropped, only the last one is kept. To identify row duplicates using all columns, an empty list [] can be used instead of all column names. If not set, defaults to None, meaning no row is dropped.
max_n_off_target_rgs (int, optional) – Max expected number of ‘off target’ row groups. If ‘row_group_target_size’ is an int, then a ‘complete’ row group is one whose size is ‘close to’ row_group_target_size (>= 80%). If ‘row_group_target_size’ is a pandas freqstr, and if there are several row groups in the last period defined by the freqstr, then these row groups are considered incomplete. To evaluate the number of ‘incomplete’ row groups, only those at the end of an existing dataset are accounted for. ‘Incomplete’ row groups in the middle of ‘complete’ row groups are not accounted for (they can be created by insertion of new data in the middle of existing data). If not set, defaults to None. A None value induces no coalescing of row groups; if there is no drop of duplicates, new data is systematically appended. A value of 0 or 1 means that new data should systematically be merged with the last existing row group to ‘complete’ it (if it is not ‘complete’ already).
key_value_metadata (Dict[str, str], optional) – Key-value metadata to write, or update in dataset. Please see fastparquet for updating logic in case of None value being used.
**kwargs – Additional parameters to pass to ‘OrderedParquetDataset.write_row_group_files()’.
Notes
When writing a dataframe with this function:
The index of the dataframe is not written to disk.
The parquet file scheme is ‘hive’ (one row group per parquet file).
Coalescing of off-target-size row groups is triggered if the actual number of off-target row groups is larger than max_n_off_target_rgs. This assessment is only made if max_n_off_target_rgs is set. Otherwise, new data is simply appended, without prior check.
When duplicates_on is set, the ‘ordered_on’ column is added to the duplicates_on list, if not already part of it. The purpose is to enable a first approximate search for duplicates, so as to load only the data of interest.
For simple data appending, i.e. without need to drop duplicates, it is advised to keep the ordered_on and duplicates_on parameters set to None, as these parameters trigger unnecessary evaluations.
Off-target-size row groups are row groups:
either not reaching the maximum number of rows, if ‘row_group_target_size’ is an int,
or lying several in the same time period, if ‘row_group_target_size’ is a pandas ‘freqstr’.
When incorporating new data within recorded data, existing off-target-size row groups will only be resized if there is an intersection with new data. Otherwise, new data is only added, without merging with existing off-target-size row groups.
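Example
A sketch of a call with duplicate dropping and row group coalescing enabled. The import path, directory and column names are placeholders:
```python
import pandas as pd

# Import path is an assumption; adjust to your installation.
from oups import write

df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=10, freq="1h"),
        "value": range(10),
    }
)

# Merge 'df' into the dataset at the given directory, keeping it ordered on
# 'timestamp'. Rows overlapping existing data and sharing the same 'timestamp'
# are considered duplicates (only the last is kept), and trailing off-target
# row groups are coalesced once more than 2 of them exist.
write(
    "/tmp/example_dataset",
    ordered_on="timestamp",
    df=df,
    row_group_target_size="1D",   # one row group per day
    duplicates_on="timestamp",
    max_n_off_target_rgs=2,
)
```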
Utility Functions
- check_cmidx(cmidx: pandas.MultiIndex)
Check if column multi-index complies with fastparquet requirements.
Library fastparquet requires names for each level in a MultiIndex. Also, column names have to be tuples of strings.
- Parameters:
cmidx (MultiIndex) – MultiIndex to check.
- conform_cmidx(df: pandas.DataFrame)
Conform pandas column multi-index.
Library fastparquet has several requirements to handle a column MultiIndex:
It requires names for each level in a MultiIndex. If they are not set, they are set to ‘’, an empty string.
It requires column names to be tuples of strings. If an element is not a string (for instance a float or an int), it is turned into a string.
DataFrame is modified in-place.
- Parameters:
df (DataFrame) – DataFrame with a column multi-index to check and possibly adjust.
- Returns:
None
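Example
A sketch of conforming a column MultiIndex before writing with fastparquet. The import path is an assumption:
```python
import pandas as pd

# Import path is an assumption; adjust to your installation.
from oups import check_cmidx, conform_cmidx

df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples([("price", 1), ("price", 2)]),
)

# Level names are missing and the second tuple elements are ints: both break
# the fastparquet requirements described above.
conform_cmidx(df)          # modifies 'df' in place: names set to '', ints cast to str
check_cmidx(df.columns)    # should now pass the compliance check

print(df.columns.names)    # level names now set (empty strings)
print(list(df.columns))    # [('price', '1'), ('price', '2')]
```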