pandas-maxminddb

Fast geolocation library for Pandas Dataframes, built on Numpy C-FFI

Version: 0.2.1 (PyPI)


Pandas Maxmind

Provides fast and convenient geolocation bindings for Pandas Dataframes. It uses numpy ndarrays internally, which makes it faster than naively applying a lookup function per column. Based on maxminddb-rust.

Features

  • Supports both MMAP and in-memory implementations
  • Supports parallelism (useful for very big datasets)
  • Comes with pre-built wheels, so there is no need to install and maintain an external C library to get (better than) C performance

Installation

  • The minimum supported Python version is 3.8
  • pip install pandas_maxminddb
  • The preferred way is to use a precompiled binary wheel, as this requires no toolchain and is the fastest to install.
  • If you want to build from source, any platform that Rust has a target for is supported.

Pre-built wheels

The wheels are built against the following numpy and pandas distributions:

  • If you're on Windows / macOS / Linux, there is no need to do anything extra.
  • If you use ARMv7 (Raspberry Pi and similar), use PiWheels with --extra-index-url=https://www.piwheels.org/simple and install libatlas-base-dev for numpy.
  • If you use a musl-based distro such as Alpine, use Alpine-wheels with --extra-index-url https://alpine-wheels.github.io/index and install libstdc++ for pandas.

Refer to the build workflow for details.

Py | win x86 | win x64 | macOS x86_64 | macOS AArch64 | linux x86_64 | linux i686 | linux AArch64 | linux ARMv7 | musl linux x86_64
3.8🚫
3.9🚫
3.10🚫🚫🚫

Usage

Importing pandas_maxminddb adds the Pandas geo extension, which allows you to add columns in place. This example uses a context manager to manage the reader's lifetime:

import pandas as pd
from pandas_maxminddb import open_database

ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})
with open_database('./GeoLite.mmdb/GeoLite2-City.mmdb') as reader:
    ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'])
ips
                ip         city  postcode  state  country
0     75.63.106.74      Houston     77070     TX       US
1  132.206.246.203     Montreal       H3A     QC       CA
2    94.226.237.31     Kapellen      2950    VLG       BE
3   128.119.189.49  Northampton     01060     MA       US
4     2.30.253.245       London      SW15    ENG       GB
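
The `ips.geo` attribute used above is the kind of extension that pandas exposes through its accessor mechanism. The following is a minimal sketch of how an accessor that adds columns in place can be registered with `pd.api.extensions.register_dataframe_accessor`. The accessor name `toygeo`, the class `ToyGeoAccessor`, and the dictionary-based lookup are hypothetical stand-ins chosen for illustration; they are not the package's actual implementation.

```python
import pandas as pd


# Hypothetical accessor name "toygeo", chosen so it cannot clash with the
# real `geo` accessor that pandas_maxminddb provides.
@pd.api.extensions.register_dataframe_accessor("toygeo")
class ToyGeoAccessor:
    def __init__(self, df):
        # pandas passes in the DataFrame the accessor is attached to.
        self._df = df

    def geolocate(self, ip_column, table, target):
        # Assigning a new column mutates the wrapped DataFrame in place.
        # `table` is a plain dict standing in for a real database reader.
        self._df[target] = self._df[ip_column].map(table)


# Toy lookup table; the two values are taken from the output shown above.
toy_table = {"75.63.106.74": "US", "2.30.253.245": "GB"}

ips = pd.DataFrame({"ip": ["75.63.106.74", "2.30.253.245"]})
ips.toygeo.geolocate("ip", toy_table, "country")
print(ips)
```

Because the decorator runs at import time, importing the module that defines the accessor is enough to make it available on every DataFrame.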

Without context manager

You can also instantiate the reader yourself, e.g.:

import pandas as pd
from pandas_maxminddb import ReaderMem, ReaderMmap

reader = ReaderMem('./GeoLite.mmdb/GeoLite2-City.mmdb')
ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})
ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'])
ips

Parallelism

If the dataset is big enough and you have spare cores, you might benefit from using them. Currently only ReaderMem is supported:

import pandas as pd
from pandas_maxminddb import ReaderMem

reader = ReaderMem('./GeoLite.mmdb/GeoLite2-City.mmdb')
ips = pd.DataFrame(data={
    'ip': ["75.63.106.74", "132.206.246.203", "94.226.237.31", "128.119.189.49", "2.30.253.245"]})
ips.geo.geolocate('ip', reader, ['country', 'city', 'state', 'postcode'], parallel=True)
ips
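
As a rough illustration of the general idea behind chunked parallel lookups, the sketch below splits a list of addresses into fixed-size chunks, processes the chunks on a thread pool, and stitches the results back together in order. It is not the package's implementation: `lookup_chunk` is a hypothetical stand-in for a database lookup, and the chunk size of 1024 is borrowed from the figure quoted in the Benchmarks section. In pure Python, threads do not speed up CPU-bound work, so this shows only the split-and-merge structure.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024  # assumption: borrowed from the benchmark setup


def lookup_chunk(chunk):
    # Hypothetical stand-in for a per-chunk database lookup:
    # returns the last octet of each IPv4 address as an int.
    return [int(ip.rsplit(".", 1)[1]) for ip in chunk]


def parallel_lookup(ips, chunk_size=CHUNK_SIZE, workers=4):
    # Split the input into consecutive chunks of at most `chunk_size` items.
    chunks = [ips[i:i + chunk_size] for i in range(0, len(ips), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so rows stay aligned.
        results = pool.map(lookup_chunk, chunks)
        return [value for chunk in results for value in chunk]


ips = [f"10.0.{i // 256}.{i % 256}" for i in range(5000)]
last_octets = parallel_lookup(ips)
print(len(last_octets))
```

Keeping the results in input order matters here because each result has to line up with the row of the DataFrame it came from.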

Benchmarks

  • Tested on an M1 Max with a chunk size of 1024 on a 100k dataset; refer to the benchmark for details
Name (time in ms) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS | Rounds | Iterations
test_benchmark_pandas_parallel_mem_maxminddb | 52.7588 (1.0) | 57.4206 (1.0) | 54.0573 (1.0) | 1.1782 (1.15) | 53.8497 (1.0) | 1.4194 (1.09) | 4;1 | 18.4989 (1.0) | 20 | 1
test_benchmark_pandas_mmap_maxminddb | 240.0050 (4.55) | 244.3257 (4.26) | 242.2177 (4.48) | 1.9017 (1.85) | 243.1021 (4.51) | 3.2122 (2.46) | 2;0 | 4.1285 (0.22) | 5 | 1
test_benchmark_pandas_mem_maxminddb | 241.4630 (4.58) | 244.2553 (4.25) | 242.8391 (4.49) | 1.0288 (1.0) | 242.7672 (4.51) | 1.3064 (1.0) | 2;0 | 4.1180 (0.22) | 5 | 1
test_benchmark_c_maxminddb | 1,010.6569 (19.16) | 1,055.1080 (18.38) | 1,021.3691 (18.89) | 18.9273 (18.40) | 1,013.3819 (18.82) | 12.9544 (9.92) | 1;1 | 0.9791 (0.05) | 5 | 1
test_benchmark_python_maxminddb | 9,021.2686 (170.99) | 9,188.7629 (160.03) | 9,071.0055 (167.80) | 70.0512 (68.09) | 9,039.7811 (167.87) | 84.7766 (64.89) | 1;0 | 0.1102 (0.01) | 5 | 1
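
To make the table easier to read, here is a small worked calculation based on the Mean column. It assumes that the figure in parentheses is each mean divided by the fastest mean, and that OPS is operations per second (1000 divided by the mean in milliseconds); both assumptions are consistent with the reported numbers. The shortened dictionary keys are labels chosen for this sketch.

```python
# Mean times in milliseconds, copied from the Mean column of the table.
mean_ms = {
    "pandas_parallel_mem": 54.0573,
    "pandas_mmap": 242.2177,
    "pandas_mem": 242.8391,
    "c_maxminddb": 1021.3691,
    "python_maxminddb": 9071.0055,
}

# The fastest benchmark serves as the 1.0 baseline.
baseline = min(mean_ms.values())

summary = {}
for name, mean in mean_ms.items():
    relative = mean / baseline  # factor shown in parentheses in the table
    ops = 1000.0 / mean         # operations per second
    summary[name] = (relative, ops)
    print(f"{name:22s} {relative:8.2f}x  {ops:8.4f} ops/s")
```

Read this way, the parallel in-memory reader is roughly 4.5 times faster than the single-threaded readers, about 19 times faster than the C extension benchmark, and about 168 times faster than the pure Python benchmark.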

Extending

Because Dataframe columns are flat arrays and geolocation data comes in a hierarchical format, you might need to provide more mappings to serve your particular use case. To do that, follow the Development section to set up your environment and then:
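
To illustrate the mismatch between flat columns and hierarchical records described above, here is a minimal sketch that is independent of the package's internals. It maps flat column names to paths inside a nested record shaped like a GeoLite2 City result. The `COLUMN_PATHS` table, the `extract` and `flatten` helpers, and the choice of which nested field feeds each column are assumptions made for illustration; the sample values come from the first row of the usage output.

```python
# Sample record shaped like a GeoLite2 City lookup result (illustrative).
record = {
    "country": {"iso_code": "US"},
    "city": {"names": {"en": "Houston"}},
    "subdivisions": [{"iso_code": "TX"}],
    "postal": {"code": "77070"},
}

# Hypothetical mapping: flat column name -> path into the nested record.
COLUMN_PATHS = {
    "country": ("country", "iso_code"),
    "city": ("city", "names", "en"),
    "state": ("subdivisions", 0, "iso_code"),
    "postcode": ("postal", "code"),
}


def extract(record, path):
    """Follow `path` through nested dicts/lists; return None if any step is missing."""
    node = record
    for key in path:
        try:
            node = node[key]
        except (KeyError, IndexError, TypeError):
            return None
    return node


def flatten(record, columns):
    """Build one flat row containing only the requested columns."""
    return {name: extract(record, COLUMN_PATHS[name]) for name in columns}


row = flatten(record, ["country", "city", "state", "postcode"])
print(row)
```

In this sketch, supporting an additional column amounts to adding one entry to the mapping table, which is the kind of extra mapping the paragraph above refers to.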

Development

Setting up environment

  • git clone --recurse-submodules git@github.com:andrusha/pandas-maxminddb.git
  • PYTHON_CONFIGURE_OPTS="--enable-shared" asdf install
  • PYTHON_CONFIGURE_OPTS="--enable-shared" python -m venv .venv
  • source .venv/bin/activate
  • pip install nox
  • nox -s test
  • PYTHONPATH=.venv/lib/python3.8/site-packages cargo test --no-default-features

libmaxminddb

In order to run nox -s bench properly, you need libmaxminddb installed, as per the maxminddb instructions, before installing the Python package, so that the C extension can be benchmarked.

On macOS this requires the following:

  • brew install libmaxminddb
  • PATH="/opt/homebrew/Cellar/libmaxminddb/1.7.1/bin:$PATH" LDFLAGS="-L/opt/homebrew/Cellar/libmaxminddb/1.7.1/lib" CPPFLAGS="-I/opt/homebrew/Cellar/libmaxminddb/1.7.1/include" pip install maxminddb --force-reinstall --verbose --no-cache-dir
