Project to persist Pandas data structures in a MongoDB database.
pip install antarctic
This project (unlike the popular arctic project, which I admire)
is built on top of MongoEngine.
MongoEngine is an ORM for MongoDB. MongoDB stores documents.
We introduce a new field and extend the Document class
to make Antarctic a convenient choice for storing Pandas (time series) data.
We first introduce a new field, the PandasField:
>>> import mongomock
>>> import pandas as pd
>>> from mongoengine import Document, connect
>>> from antarctic.pandas_field import PandasField
>>> client = connect('mongoenginetest', host='mongodb://localhost', mongo_client_class=mongomock.MongoClient)
>>> class Portfolio(Document):
...     nav = PandasField()
...     weights = PandasField()
...     prices = PandasField()
A Portfolio object works exactly the way you would expect:
>>> data = pd.read_csv("src/tests/resources/price.csv", index_col=0, parse_dates=True)
>>> p = Portfolio()
>>> p.nav = data["A"].to_frame(name="nav")
>>> p.prices = data[["B","C","D"]]
>>> portfolio = p.save()
>>> nav = p.nav["nav"]
>>> prices = p.prices
Behind the scenes we convert the DataFrame objects
into parquet byte streams and
store them in a MongoDB database.
Parquet is a language-independent format, so the stored data should also be readable from R.
In most cases we store many documents of the same type,
e.g. a collection of Portfolios or Symbols rather than a single Portfolio or Symbol.
For this purpose we have developed the abstract XDocument class,
which builds on the Document class of MongoEngine.
It provides convenient tools to simplify looping
over all or a subset of Documents of the same type, e.g.
>>> from antarctic.document import XDocument
>>> from antarctic.pandas_field import PandasField
>>> class Symbol(XDocument):
...     price = PandasField()
We define a bunch of symbols and assign a price to each (or some) of them:
>>> s1 = Symbol(name="A", price=data["A"].to_frame(name="price")).save()
>>> s2 = Symbol(name="B", price=data["B"].to_frame(name="price")).save()
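>>> import numpy as np
>>> # Loop over a subset of symbols selected by name
>>> for symbol in Symbol.subset(names=["B"]):
...     print(symbol)
>>> # Collect symbols in a dictionary
>>> symbols = Symbol.to_dict(objects=[s1, s2])
>>> # Attach reference data to individual symbols
>>> s1.reference["MyProp1"] = "ABC"
>>> s2.reference["MyProp2"] = "BCD"
>>> # Extract reference and time series data, or apply a function to each symbol
>>> print(Symbol.reference_frame(objects=[s1, s2]))
>>> print(Symbol.frame(series="price", key="price"))
>>> print(Symbol.apply(func=lambda x: x.price["price"].mean(), default=np.nan))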
The XDocument class exposes DataFrames for both reference and time series data.
There is an apply
method for applying a function to all documents or a subset of them.
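The idea of applying a function across documents with a fallback value can be sketched in plain Python. This is only an illustration under assumptions: `FakeSymbol` and `apply_to_all` are hypothetical stand-ins rather than antarctic classes, and the assumed semantics is that `default` is used whenever the function cannot be evaluated for a document (for example because no price was stored).

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

import pandas as pd


@dataclass
class FakeSymbol:
    """Hypothetical in-memory stand-in for a stored document."""

    name: str
    price: Optional[pd.DataFrame] = None


def apply_to_all(
    objects: Iterable[FakeSymbol], func: Callable[[FakeSymbol], Any], default: Any
) -> Iterator[Tuple[str, Any]]:
    """Yield (name, func(obj)); fall back to `default` if func fails (assumed semantics)."""
    for obj in objects:
        try:
            yield obj.name, func(obj)
        except (TypeError, KeyError, AttributeError):
            yield obj.name, default


# Made-up sample data, for illustration only.
index = pd.date_range("2024-01-01", periods=3, freq="D")
symbols = [
    FakeSymbol("A", pd.DataFrame({"price": [1.0, 2.0, 3.0]}, index=index)),
    FakeSymbol("B"),  # no price stored for this symbol
]

# Mean price per symbol; symbols without a price get NaN.
means = dict(
    apply_to_all(symbols, func=lambda s: s.price["price"].mean(), default=math.nan)
)
```

A fallback value keeps the loop going when some documents lack the requested series, which matters once prices are assigned to only some of the symbols.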
Database vs. Datastore
A collection of JSON or byte stream representations of Pandas objects
is not exactly a database. Appending is rather expensive, as one has
to extract the original Pandas object, append to it and convert
the new object back into a JSON or byte stream representation.
Clever sharding can mitigate such effects but at the end of the day
you shouldn't update such objects too often. Often practitioners
use a small database for recording (e.g. over the last 24h) and
update the MongoDB database once a day. Reading Pandas objects
back out of such a setup is extremely fast.
Such setups are often called DataStores.
Running
make install
will install uv and create
the virtual environment defined in
pyproject.toml and locked in uv.lock.
We install marimo on the fly within the aforementioned
virtual environment. Executing
make marimo
will install and start marimo.