Quivr is a Python library which provides great containers for Arrow data.
Quivr's Tables are like DataFrames, but with strict schemas to enforce types and expectations. They are backed by the high-performance Arrow memory model, making them well-suited for streaming IO, RPCs, and serialization/deserialization to Parquet.
Documentation is at https://quivr.readthedocs.org. A changelog is in this repository to document changes in each released version. This blog post introduces some of the motivation for quivr.
Data engineering involves taking analysis code and algorithms which were prototyped, often on pandas DataFrames, and shoring them up for production use.
While DataFrames are great for ad-hoc exploration, visualization, and prototyping, they aren't as great for building sturdy applications. We don't want to throw everything out here: vectorized computations are often absolutely necessary for data work. But what if we could have those vectorized computations together with strict schemas that enforce types and expectations? That is what Quivr's Tables try to provide.
Check out this repo, and pip install it.
Your main entrypoint to Quivr is through defining classes which represent your tables. You write a subclass of quivr.Table, annotating it with Columns that describe the data you're working with, and quivr will handle the rest.
import quivr as qv
import pyarrow as pa

class Coordinates(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()
    vx = qv.Float64Column()
    vy = qv.Float64Column()
    vz = qv.Float64Column()
Then, you can construct tables from data:
import numpy as np

coords = Coordinates.from_kwargs(
    x=np.array([ 1.00760887, -2.06203093,  1.24360546, -1.00131722]),
    y=np.array([-2.7227298 ,  0.70239707,  2.23125432,  0.37269832]),
    z=np.array([-0.27148738, -0.31768623, -0.2180482 , -0.02528401]),
    vx=np.array([ 0.00920172, -0.00570486, -0.00877929, -0.00809866]),
    vy=np.array([ 0.00297888, -0.00914301,  0.00525891, -0.01119134]),
    vz=np.array([-0.00160217,  0.00677584,  0.00091095, -0.00140548]),
)
# Sort the table by the z column. This returns a copy.
coords_z_sorted = coords.sort_by("z")
print(len(coords))
# prints 4
# Access any of the columns as a numpy array with zero copy:
xs = coords.x.to_numpy()
# Present the table as a pandas DataFrame, with zero copy if possible:
df = coords.to_dataframe()
You can embed one table's definition within another, and you can make columns nullable:
class AsteroidOrbit(qv.Table):
    designation = qv.StringColumn()
    mass = qv.Float64Column(nullable=True)
    radius = qv.Float64Column(nullable=True)
    coords = Coordinates.as_column()
# You can construct embedded columns from Arrow StructArrays, which you can get from
# other Quivr tables using the to_structarray() method with zero copy.
orbits = AsteroidOrbit.from_kwargs(
    designation=np.array(["Ceres", "Pallas", "Vesta", "2023 DW"]),
    mass=np.array([9.393e20, 2.06e21, 2.59e20, None]),
    radius=np.array([4.6e6, 2.7e6, 2.6e6, None]),
    coords=coords,
)
When you reference columns, you'll get numpy arrays which you can use to do computations:
import numpy as np
print(np.quantile(orbits.mass + 10, 0.5))
You can also access the columns as Arrow Arrays to do computations using the pyarrow compute kernels:
import pyarrow.compute as pc
median_mass = pc.quantile(pc.add(orbits.mass, 10), q=0.5)
# median_mass is a pyarrow.Scalar, which you can get the value of with .as_py()
print(median_mass.as_py())
There is a very extensive set of functions available in the pyarrow.compute package. These computations will, in general, use all available cores and do vectorized computations which are very fast.
Because Quivr tables are just Python classes, you can customize the behavior of your tables by adding or overriding methods. For example, if you want to add a method to compute the total mass of the asteroids in the table, you can do so like this:
class AsteroidOrbit(qv.Table):
    designation = qv.StringColumn()
    mass = qv.Float64Column(nullable=True)
    radius = qv.Float64Column(nullable=True)
    coords = Coordinates.as_column()

    def total_mass(self):
        return pc.sum(self.mass)
You can also use this to add "meta-columns" which are combinations of other columns. For example:
class CoordinateCovariance(qv.Table):
    matrix_values = qv.ListColumn(pa.float64(), 36)

    @property
    def matrix(self):
        # This is a numpy array of shape (n, 6, 6)
        return self.matrix_values.to_numpy().reshape(-1, 6, 6)

class AsteroidOrbit(qv.Table):
    designation = qv.StringColumn()
    mass = qv.Float64Column(nullable=True)
    radius = qv.Float64Column(nullable=True)
    coords = Coordinates.as_column()
    covariance = CoordinateCovariance.as_column()
orbits = load_orbits() # Analogous to the example above
# Compute the determinant of the covariance matrix for each asteroid
determinants = np.linalg.det(orbits.covariance.matrix)
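The reshape in the matrix property is ordinary numpy; a standalone sketch of the same trick, using a synthetic flat array in place of matrix_values:

```python
import numpy as np

# Flattened 6x6 covariance values for two rows, as a list column would store them.
flat = np.arange(2 * 36, dtype=np.float64)

# -1 lets numpy infer the row count; each row becomes a 6x6 matrix, row-major.
matrices = flat.reshape(-1, 6, 6)

print(matrices.shape)        # (2, 6, 6)
print(matrices[0, 0, 5])     # 5.0 -- last element of row 0's first matrix row
```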
You can validate that the data inside a Table matches constraints you define. Only a small number of validators are currently implemented, mostly for numeric checks, but more will be added as use cases emerge.
To add data validation, use the validator= keyword when defining columns. For example:
import quivr as qv
from quivr.validators import gt, ge, le, and_, is_in

class Observation(qv.Table):
    id = qv.Int64Column(validator=gt(0))
    ra = qv.Float64Column(validator=and_(ge(0), le(360)))
    dataset_id = qv.StringColumn(validator=is_in(["ztf", "nsc", "skymapper"]))
    unvalidated = qv.Int64Column()
This Observation table has validators that check that:
- the id column's values are greater than 0
- the ra column's values are between 0 and 360, inclusive
- the dataset_id column only has strings in the set {"ztf", "nsc", "skymapper"}
When an Observation instance is created using the from_kwargs method, these validation checks are run by default. This can be disabled by calling Observation.from_kwargs(..., validate=False).
In addition, an instance can be explicitly validated by calling the .validate() method, which will raise a quivr.ValidationError if there are any failures.
Also, tables have an .is_valid() method which returns a boolean indicating whether they pass validation.
You can also filter by expressions on the data; see the Arrow documentation for more details. You can use this to construct a quivr Table from an Arrow Table with a matching schema:
big_orbits = AsteroidOrbit(orbits.table.filter(orbits.table["mass"] > 1e21))
If you're plucking out rows that match a single value, you can use the "select" method on the Table:
# Get the orbit of Ceres
ceres_orbit = orbits.select("designation", "Ceres")
Feather is a fast, zero-copy serialization format for Arrow tables. It can be used for interprocess communication, or for working with data on disk via memory mapping.
orbits.to_feather("orbits.feather")
orbits_roundtripped = AsteroidOrbit.from_feather("orbits.feather")
# use memory mapping to work with a large file without copying it into memory
orbits_mmap = AsteroidOrbit.from_feather("orbits.feather", memory_map=True)
You can serialize your tables to Parquet files, and read them back:
orbits.to_parquet("orbits.parquet")
orbits_roundtripped = AsteroidOrbit.from_parquet("orbits.parquet")
See the Arrow documentation for more details on the Parquet format used.