
.. image:: https://github.com/dask/fastparquet/actions/workflows/main.yaml/badge.svg
    :target: https://github.com/dask/fastparquet/actions/workflows/main.yaml

.. image:: https://readthedocs.org/projects/fastparquet/badge/?version=latest
    :target: https://fastparquet.readthedocs.io/en/latest/
fastparquet is a python implementation of the `parquet
format <https://github.com/apache/parquet-format>`_, aiming to integrate
into python-based big data work-flows. It is used implicitly by
the projects Dask, Pandas and intake-parquet.
We offer a high degree of support for the features of the parquet format, and very competitive performance, in a small install size and codebase.
Details of this project, how to use it and comparisons to other work can be found in the documentation_.
.. _documentation: https://fastparquet.readthedocs.io
Requirements
------------

(all development is against recent versions in the default anaconda channels
and/or conda-forge)

Required:

- numpy
- pandas
- cramjam
- fsspec

Supported compression algorithms:

- Available by default:

  - gzip
  - snappy
  - brotli
  - lz4
  - zstandard

- Optionally supported:

  - `lzo <https://github.com/jd-boyd/python-lzo>`_
Installation
------------

Install using conda, to get the latest compiled version::
    conda install -c conda-forge fastparquet
or install from PyPI::
    pip install fastparquet
You may wish to install numpy first, to help pip's resolver. This may install an appropriate wheel, or compile from source. For the latter, you will need a suitable C compiler toolchain on your system.
You can also install the latest version from github::

    pip install git+https://github.com/dask/fastparquet

in which case you should also have cython installed, to be able to rebuild
the C files.
Usage
-----

Please refer to the documentation_.
Reading
~~~~~~~
.. code-block:: python

    from fastparquet import ParquetFile

    pf = ParquetFile('myfile.parq')
    df = pf.to_pandas()
    df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
You may specify which columns to load and which of those to keep as
categorical (if the data uses dictionary encoding). The file path can be a
single file, a metadata file pointing to other data files, or a directory
(tree) containing data files; the latter is what hive/spark typically
produce.
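Because each row-group can be loaded independently, datasets larger than
memory can be processed piecewise. Below is a minimal sketch, not taken from
the project's documentation, assuming a hive-style dataset directory
``mydata/`` containing columns ``col1`` and ``col2``; it uses
``ParquetFile.iter_row_groups``, which yields one dataframe per row-group:

.. code-block:: python

    from fastparquet import ParquetFile

    # 'mydata/' is a hypothetical directory (tree) of data files, e.g. as
    # written by hive/spark or by write(..., file_scheme='hive').
    pf = ParquetFile('mydata/')

    # Process one row-group at a time, loading only the needed columns,
    # so peak memory is bounded by the largest row-group.
    total = 0.0
    for chunk in pf.iter_row_groups(columns=['col1', 'col2']):
        total += chunk['col2'].sum()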
Writing
~~~~~~~
.. code-block:: python

    from fastparquet import write

    write('outfile.parq', df)
    write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
          compression='GZIP', file_scheme='hive')
The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data-types and plain encoding are supported, so expect performance to be similar to numpy.savez.
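To illustrate the ``file_scheme='hive'`` output mentioned above, here is a
hedged sketch using the ``partition_on`` argument of ``write``; the frame,
column names and output path are illustrative assumptions, not part of the
original text:

.. code-block:: python

    import pandas as pd
    from fastparquet import write

    # Illustrative frame with a low-cardinality column to partition on.
    df = pd.DataFrame({'region': ['eu', 'us', 'eu', 'us'],
                       'value': [1.0, 2.0, 3.0, 4.0]})

    # Produces a directory tree, outdir/region=eu/... and
    # outdir/region=us/..., readable by ParquetFile('outdir') as well as
    # by hive/spark.
    write('outdir', df, file_scheme='hive', partition_on=['region'],
          compression='SNAPPY')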
History
-------

This project forked in October 2016 from parquet-python_, which was not
designed for vectorised loading of big data or parallel access.

.. _parquet-python: https://github.com/jcrobak/parquet-python