nanoarrow

Python bindings to the nanoarrow C library

0.6.0
PyPI

Maintainers: 1

nanoarrow for Python

The nanoarrow Python package provides bindings to the nanoarrow C library. Like the nanoarrow C library, it provides tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces.

Installation

The nanoarrow Python bindings are available from PyPI and conda-forge:

pip install nanoarrow
conda install nanoarrow -c conda-forge

Development versions (based on the main branch) are also available:

pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
    --prefer-binary --pre nanoarrow

If you can import the namespace, you're good to go!

import nanoarrow as na

Data types, arrays, and array streams

The Arrow C Data and Arrow C Stream interfaces comprise three structures: the ArrowSchema, which represents the data type of an array; the ArrowArray, which represents the values of an array; and the ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema. These concepts map to nanoarrow.Schema, nanoarrow.Array, and nanoarrow.ArrayStream in the Python package.

na.int32()

<Schema> int32

na.Array([1, 2, 3], na.int32())

nanoarrow.Array<int32>[3]
1
2
3

The nanoarrow.Array can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., pyarrow.ChunkedArray, polars.Series) support this.

chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
chunked

nanoarrow.Array<int32>[6]
1
2
3
4
5
6

Whereas chunks of an Array are always fully materialized when the object is constructed, the chunks of an ArrayStream have not necessarily been resolved yet.

stream = na.ArrayStream(chunked)
stream

nanoarrow.ArrayStream<int32>

with stream:
    for chunk in stream:
        print(chunk)

nanoarrow.Array<int32>[3]
1
2
3
nanoarrow.Array<int32>[3]
4
5
6

The nanoarrow.ArrayStream also provides an interface to nanoarrow's Arrow IPC reader:

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)

nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>

These objects implement the Arrow PyCapsule interface for both producing and consuming and are interchangeable with pyarrow objects in many cases:

import pyarrow as pa

pa.field(na.int32())

pyarrow.Field<: int32>

pa.chunked_array(chunked)

<pyarrow.lib.ChunkedArray object at 0x12a49a250>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5,
    6
  ]
]

pa.array(chunked.chunk(1))

<pyarrow.lib.Int32Array object at 0x11b552500>
[
  4,
  5,
  6
]

na.Array(pa.array([10, 11, 12]))

nanoarrow.Array<int64>[3]
10
11
12

na.Schema(pa.string())

<Schema> string

Low-level C library bindings

The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using nanoarrow.c_schema(), nanoarrow.c_array(), and nanoarrow.c_array_stream().

Schemas

Use nanoarrow.c_schema() to convert an object to an ArrowSchema and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Schema, pyarrow.DataType, and pyarrow.Field).

na.c_schema(pa.decimal128(10, 3))

<nanoarrow.c_schema.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:

Using c_schema() is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use Schema():

schema = na.Schema(pa.decimal128(10, 3))
schema.precision, schema.scale

(10, 3)

The CSchema object cleans up after itself: when the object is deleted, the underlying ArrowSchema is released.

Arrays

You can use nanoarrow.c_array() to convert an array-like object to an ArrowArray, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Array, pyarrow.RecordBatch).

na.c_array(["one", "two", "three", None], na.string())

<nanoarrow.c_array.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (4754305168, 4754307808, 4754310464)
- dictionary: NULL
- children[0]:

Using c_array() is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use Array():

array = na.Array(["one", "two", "three", None], na.string())
array.to_pylist()

['one', 'two', 'three', None]

array.buffers

(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
 nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
 nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))
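The three buffers printed above follow the Arrow layout for string arrays: a validity bitmap, an offsets buffer, and a data buffer. The following pure-Python sketch (the `decode_strings` helper is hypothetical and not part of nanoarrow) shows how those buffers combine into values. It assumes the bitmap is packed least-significant bit first, which is why the first three bits in the printed view are set.

```python
def decode_strings(validity, offsets, data, length):
    """Decode an Arrow string array from its three buffers."""
    values = []
    for i in range(length):
        # A missing validity buffer means every element is valid
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            # Element i spans data[offsets[i]:offsets[i + 1]]
            values.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


validity = bytes([0b00000111])  # elements 0, 1, 2 valid; element 3 null
offsets = [0, 3, 6, 11, 11]
data = b"onetwothree"
decoded = decode_strings(validity, offsets, data, 4)
print(decoded)
```

Note that the null element still has an entry in the offsets buffer; it simply spans zero bytes.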

Advanced users can create arrays directly from buffers using c_array_from_buffers():

na.c_array_from_buffers(
    na.string(),
    2,
    [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
)

<nanoarrow.c_array.CArray string>
- length: 2
- offset: 0
- null_count: 0
- buffers: (0, 5002908320, 4999694624)
- dictionary: NULL
- children[0]:

Array streams

You can use nanoarrow.c_array_stream() to convert an object representing a sequence of CArrays with a common CSchema to an ArrowArrayStream and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.RecordBatchReader, pyarrow.ChunkedArray).

pa_batch = pa.record_batch({"col1": [1, 2, 3]})
reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
array_stream = na.c_array_stream(reader)
array_stream

<nanoarrow.c_array_stream.CArrayStream>
- get_schema(): struct<col1: int64>

You can pull the next array from the stream using .get_next() or use it like an iterator. The .get_next() method will raise StopIteration when there are no more arrays in the stream.

for array in array_stream:
    print(array)

<nanoarrow.c_array.CArray struct<col1: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
  'col1': <nanoarrow.c_array.CArray int64>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers: (0, 2642948588352)
    - dictionary: NULL
    - children[0]:

Use ArrayStream() for a higher level interface:

reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
na.ArrayStream(reader).read_all()

nanoarrow.Array<non-nullable struct<col1: int64>>[3]
{'col1': 1}
{'col1': 2}
{'col1': 3}

Development

Python bindings for nanoarrow are managed with setuptools. This means you can build the project using:

git clone https://github.com/apache/arrow-nanoarrow.git
cd arrow-nanoarrow/python
pip install -e .

Tests use pytest:

# Install dependencies
pip install -e ".[test]"

# Run tests
pytest -vvx

CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.
