Quickstart
ParquetSet and indexing
An instance of the ParquetSet class gathers a collection of datasets. Instantiating ParquetSet requires two things: a collection path and a dataset indexing logic.
Collection path
This is the path of the directory (existing or not) in which one subdirectory per dataset is, or will be, stored.
Indexing logic
An indexing logic is formalized by a decorated class. Individual indices are then materialized by instantiating this class, and more specifically by the values of the instance attributes.
The class itself is declared just like a dataclass, except that @toplevel is used as the class decorator instead of @dataclass.
from os import path as os_path
from oups import ParquetSet, toplevel
# Define an indexing logic to generate each individual dataset folder name.
@toplevel
class DatasetIndex:
    country: str
    city: str
# Define a collection path.
dirpath = os_path.expanduser('~/Documents/code/data/weather_knowledge_base')
# Initialize a parquet dataset collection.
ps = ParquetSet(dirpath, DatasetIndex)
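To make the idea of "an index is defined by the attribute values of an instance" more concrete, here is a minimal sketch that uses only the standard library. It assumes a frozen dataclass as a stand-in for an @toplevel-decorated class and a plain dict as a stand-in for the collection. Both are illustrative assumptions, not how oups is implemented, and the name DatasetIndexSketch is hypothetical.

```python
from dataclasses import dataclass, asdict

# Stand-in for an @toplevel-decorated class (assumption, for illustration only).
@dataclass(frozen=True)
class DatasetIndexSketch:
    country: str
    city: str

idx_a = DatasetIndexSketch('germany', 'berlin')
idx_b = DatasetIndexSketch('germany', 'berlin')
idx_c = DatasetIndexSketch('france', 'paris')

# Two instances carrying the same attribute values compare equal,
# so they designate the same entry.
same = idx_a == idx_b

# A plain dict stands in for the collection: the index instance is the key.
collection = {idx_a: 'berlin records', idx_c: 'paris records'}
looked_up = collection[idx_b]

# The attribute values are what identify the dataset.
fields = asdict(idx_a)
```

In this sketch, freezing the dataclass makes instances hashable, which is what lets them serve as dict keys in the same way an index is used as a key on a ParquetSet.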
Writing new data
import pandas as pd
# Index of a first dataset, for some temperature records related to Berlin.
idx1 = DatasetIndex('germany', 'berlin')
# Data to be recorded.
df1 = pd.DataFrame({'timestamp': pd.date_range('2021/01/01', '2021/01/05', freq='1D'),
                    'temperature': range(10, 15)})
# Populate parquet collection with a first dataset.
ps[idx1] = df1
The weather_knowledge_base folder has now been created and contains the new data.
data
|- weather_knowledge_base
   |- germany-berlin
      |- _common_metadata
      |- _metadata
      |- part.0.parquet
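The folder name germany-berlin in the tree above suggests that the attribute values of the index are joined with a hyphen, in declaration order. The sketch below reproduces that mapping with a plain dataclass. The helpers to_dirname and from_dirname and the class DatasetIndexSketch are hypothetical names introduced for illustration, and the joining rule is inferred from the tree rather than taken from the oups implementation.

```python
import os
from dataclasses import dataclass, astuple

# Stand-in for the DatasetIndex class (assumption, for illustration only).
@dataclass(frozen=True)
class DatasetIndexSketch:
    country: str
    city: str

def to_dirname(index, sep='-'):
    """Join the attribute values, in declaration order, with a separator."""
    return sep.join(str(value) for value in astuple(index))

def from_dirname(name, sep='-'):
    """Rebuild an index from a folder name (values must not contain sep)."""
    return DatasetIndexSketch(*name.split(sep))

# Folder name for the Berlin dataset, and its location in the collection.
dirname = to_dirname(DatasetIndexSketch('germany', 'berlin'))
dataset_path = os.path.join('weather_knowledge_base', dirname)

# Going back from the folder name to an index.
recovered = from_dirname(dirname)
```

One consequence of such a scheme is that attribute values containing the separator would make the reverse mapping ambiguous, so this sketch only round-trips for values without a hyphen.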
Reading existing data
# Read data as a pandas dataframe.
df = ps[idx1].pdf