# Polars

> **Note**
> Install with `pip install labox[polars]`
Labox provides a high-performance serializer for Polars DataFrames backed by the Parquet columnar format. The integration uses Polars' native Parquet I/O, giving fast reads and writes and output that stays compatible with the broader data ecosystem.
## DataFrame Serializer

The `ParquetDataFrameSerializer` serializes Polars DataFrames using the Parquet format, which provides excellent compression ratios and fast read/write performance for analytical workloads.
### Basic Usage

A default instance of the serializer is available as `parquet_dataframe_serializer`:
```python
import polars as pl

from labox.extra.polars import parquet_dataframe_serializer

df = pl.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "score": [95.5, 87.2, 92.8, 89.1],
        "active": [True, False, True, True],
    }
)

serialized_data = parquet_dataframe_serializer.serialize_data(df)
```
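To get the DataFrame back, pass the payload through the serializer again. The method name below, `deserialize_data`, is an assumption mirroring `serialize_data`; check the labox API reference for the exact counterpart:

```python
# Hypothetical round trip: `deserialize_data` is assumed as the
# counterpart of `serialize_data`; verify the name against the labox API.
restored = parquet_dataframe_serializer.deserialize_data(serialized_data)

# Parquet preserves both values and dtypes, so the frames should match.
assert restored.equals(df)
```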
### Custom Configuration
You can create a custom serializer instance with specific Parquet options:
```python
from labox.extra.polars import ParquetDataFrameSerializer

# Configure write options for better compression
serializer = ParquetDataFrameSerializer(
    dump_args={
        "compression": "zstd",
        "compression_level": 3,
        "statistics": True,
        "row_group_size": 50000,
    },
    load_args={
        "parallel": "auto",
        "use_statistics": True,
        "rechunk": True,
    },
)
```
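These options mirror the parameters of Polars' own Parquet I/O. As a rough sketch in plain Polars, without labox, and assuming `dump_args` is forwarded to `DataFrame.write_parquet` and `load_args` to `pl.read_parquet`, the configuration above corresponds to:

```python
import io

import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "score": [95.5, 87.2, 92.8]})

# Write side: the same options as dump_args above.
buf = io.BytesIO()
df.write_parquet(
    buf,
    compression="zstd",
    compression_level=3,
    statistics=True,
    row_group_size=50_000,
)

# Read side: the same options as load_args above.
buf.seek(0)
restored = pl.read_parquet(
    buf,
    parallel="auto",
    use_statistics=True,
    rechunk=True,
)
```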
### Advanced Options

The serializer supports extensive Parquet configuration through `ParquetDumpArgs` and `ParquetLoadArgs`:
```python
# Example with advanced write options
write_optimized_serializer = ParquetDataFrameSerializer(
    dump_args={
        "compression": "snappy",
        "statistics": {"string_columns": True, "null_count": False},
        "data_page_size": 1024 * 1024,  # 1 MB pages
        "use_pyarrow": False,  # use the native Polars engine
    }
)

# Example with advanced read options
read_optimized_serializer = ParquetDataFrameSerializer(
    load_args={
        "parallel": "row_groups",
        "low_memory": True,
        "use_statistics": True,
        "rechunk": False,  # keep original chunking
    }
)
```
## Content Type

The serializer uses the standard content type `application/vnd.apache.parquet` to identify Parquet format data, ensuring compatibility with other Parquet-based tools and systems.
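Because the payload is standard Parquet, any Parquet reader can consume it. A minimal sketch, assuming `serialize_data` yields the raw Parquet bytes (the exact return type depends on the labox API):

```python
import io

import polars as pl

# Assumption: `raw` is the raw Parquet byte payload produced by the serializer.
raw = parquet_dataframe_serializer.serialize_data(df)
restored = pl.read_parquet(io.BytesIO(raw))
```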
## Performance Considerations

- **Compression:** The default compression offers a good balance of speed and size. For write-heavy workloads, consider `"lz4"` or `"snappy"`; for storage-optimized scenarios, use `"zstd"` or `"brotli"` (see the sketch below for a quick size comparison).
- **Parallel Processing:** Polars can use multiple CPU cores during read operations. Set `parallel="auto"` in `load_args` to let it pick the best strategy.
- **Statistics:** Collecting statistics during writes can significantly speed up filtered reads, at the cost of slower write operations.
For more details on the available options, refer to the Polars Parquet documentation.