🚨 Active Supply Chain Attack:node-ipc Package Compromised.Learn More →

Book a Demo Sign in

starlings

Package Overview

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

starlings

A Python package for comparing entity resolutions from different processes

PyPI

Version: 0.2.3

Maintainers: 1

Starlings

High-performance entity resolution evaluation for Python

Starlings treats entities as what they truly are: collections of records across multiple datasets. Compare entity resolution methods efficiently using set operations rather than pairwise record comparisons.

Quickstart

import starlings as sl

# Your entity resolution results from two different methods
splink_results = [
    {"customers": ["cust_001", "cust_002"], "transactions": ["txn_100"]},
    {"customers": ["cust_003"], "transactions": ["txn_101", "txn_102"]},
]

dedupe_results = [
    {"customers": ["cust_001"], "transactions": ["txn_100"]},
    {"customers": ["cust_002", "cust_003"], "transactions": ["txn_101", "txn_102"]},
]

# Create frame and declare datasets upfront for efficiency
frame = sl.EntityFrame()
frame.declare_dataset("customers")
frame.declare_dataset("transactions")

# Add both collections
frame.add_method("splink", splink_results)
frame.add_method("dedupe", dedupe_results)

# Calculate hashes for integrity and comparison
frame.splink.add_hash("blake3")
frame.dedupe.add_hash("blake3")

# Compare how well the methods agree
comparisons = frame.compare_collections("splink", "dedupe")
avg_similarity = sum(c['jaccard'] for c in comparisons) / len(comparisons)
print(f"Average agreement: {avg_similarity:.2f}")

What is starlings?

Traditional entity resolution evaluation compares pairs of records. starlings compares entities as complete objects - collections of records that span multiple datasets.

Instead of asking "do these two records match?", starlings asks "how well do these two methods agree on what constitutes this entity?"

Core concepts

Entity: A collection of record IDs across multiple datasets

# Entity representing "Michael Smith"
{
    "customers": ["cust_001", "cust_045"],
    "transactions": ["txn_555", "txn_777"], 
    "addresses": ["addr_123"]
}

EntityCollection: Entities from one method (like pandas Series)
EntityFrame: Multiple collections for comparison (like pandas DataFrame)

Installation

# From PyPI (when available)
pip install entityframe

# From source
git clone https://github.com/yourusername/entityframe
cd entityframe
just install && just build

Working with hashes and metadata

import starlings as sl

# Method 1: Conservative clustering
method1 = [
    {"customers": ["john_1"], "emails": ["john@email.com"]},
    {"customers": ["john_2"], "emails": ["j.smith@work.com"]},
]

# Method 2: Aggressive clustering  
method2 = [
    {"customers": ["john_1", "john_2"], "emails": ["john@email.com", "j.smith@work.com"]},
]

# Create frame and declare datasets upfront for efficiency
frame = sl.EntityFrame()
frame.declare_dataset("customers")
frame.declare_dataset("emails")

# Add the collections
frame.add_method("conservative", method1)
frame.add_method("aggressive", method2)

# Calculate hashes for entity integrity (batch processed for performance)
frame.conservative.add_hash("blake3")
frame.aggressive.add_hash("blake3")

# Access entities with metadata
entity = frame.conservative[0]
print(f"Entity: {entity}")
# {'metadata': {'hash': b'\x7a\x2f\x8c\x91...'}, 'customers': ['john_1'], 'emails': ['john@email.com']}

# Verify hash integrity later
if frame.conservative.verify_hashes("blake3"):
    print("All hashes verified successfully")

# See how methods differ
results = frame.compare_collections("conservative", "aggressive")
for result in results:
    print(f"Entity {result['entity_index']}: {result['jaccard']:.2f} similarity")

Working with pre-existing metadata

import starlings as sl

# Entities with existing metadata (from previous processing)
existing_entities = [
    {
        "customers": ["alice_1", "alice_2"], 
        "emails": ["alice@personal.com", "a.johnson@company.com"],
        "metadata": {
            "id": 1,
            "hash": b"\x3b\x4c\x9d\xa2\x5f\x7e\x8b\x1c",
            "created_at": "2024-01-15T10:30:00Z",
            "source": "import_batch_001"
        }
    },
    {
        "customers": ["bob_smith"],
        "emails": ["bob.smith@email.org"],
        "metadata": {
            "id": 2,
            "hash": b"\x9e\x2a\x7f\x3c\x8d\x1b\x6e\x4a",
            "created_at": "2024-01-15T10:31:00Z", 
            "source": "import_batch_001"
        }
    }
]

# Create frame and add method with existing entities
frame = sl.EntityFrame()
frame.declare_dataset("customers")
frame.declare_dataset("emails")
frame.add_method("imported", existing_entities)

# Access entity with preserved metadata
entity = frame.imported[0]
print(f"ID: {entity['metadata']['id']}")
print(f"Hash: {entity['metadata']['hash'].hex()}")
print(f"Source: {entity['metadata']['source']}")

# Verify existing hashes match the data
if frame.imported.verify_hashes("blake3"):
    print("All imported hashes verified successfully")

Working with individual entities

# Create an entity manually
entity = ef.Entity()
entity.add_records("customers", [1, 2, 3])
entity.add_records("transactions", [100, 101])

# Query the entity
print(f"Total records: {entity.total_records()}")
print(f"Customer records: {entity.get_records('customers')}")
print(f"Has transactions: {entity.has_dataset('transactions')}")

# Compare two entities
other_entity = ef.Entity()
other_entity.add_records("customers", [2, 3, 4])
similarity = entity.jaccard_similarity(other_entity)
print(f"Jaccard similarity: {similarity:.2f}")

Working with multiple methods

starlings makes it simple to compare different entity resolution approaches:

frame = sl.EntityFrame()

# Add results from different methods
frame.add_method("splink", splink_results)
frame.add_method("dedupe", dedupe_results) 
frame.add_method("custom_rules", rules_results)

# Compare any two methods
comparisons = frame.compare_collections("splink", "dedupe")
print(f"Methods agree on {len([c for c in comparisons if c['jaccard'] > 0.8])} entities")

# See what methods you have
print(f"Methods: {frame.get_collection_names()}")
print(f"Total entities per method: {frame.total_entities() // len(frame.get_collection_names())}")

API overview

`EntityFrame`

frame = sl.EntityFrame()
frame = sl.EntityFrame.with_datasets(["ds1", "ds2"])  # Pre-declare datasets (optional)
frame.add_method(name, entity_data)              # Add method results 
frame.add_collection(name, collection)           # Add pre-built collection  
frame.compare_collections(name1, name2)          # Compare two methods
frame.get_collection(name)                       # Get specific collection
frame.get_collection_names()                     # List all methods
frame.total_entities()                           # Count total entities
frame.entity_has_dataset(method, index, name)    # Check if entity has dataset

`EntityCollection`

collection = ef.EntityCollection("method_name")
collection.get_entity(index)                     # Get entity by index
collection.get_entities()                        # Get all entities
collection.len()                                 # Number of entities
collection.compare_with(other_collection)        # Compare with another collection
collection.total_records()                       # Count all records
collection.is_empty()                            # Check if empty

`Entity`

entity = ef.Entity()
entity.add_record(dataset, record_id)           # Add single record
entity.add_records(dataset, record_ids)         # Add multiple records
entity.get_records(dataset)                     # Get records for dataset
entity.jaccard_similarity(other_entity)         # Compare entities
entity.total_records()                          # Count all records

Data format

starlings expects your entity resolution results as a list of dictionaries:

entity_data = [
    {
        "dataset1": ["record_1", "record_2"],    # Records from dataset1
        "dataset2": ["record_10"],               # Records from dataset2
    },
    {
        "dataset1": ["record_3"],
        "dataset3": ["record_20", "record_21"],  # Different datasets per entity
    },
    # ... more entities
]

Performance

starlings is built for scale:

Memory efficient: Handles millions of entities without breaking a sweat
Fast comparisons: Optimised set operations for entity comparison
Rust powered: High-performance core with Python convenience

Handle millions of entities efficiently.

Development

just install    # Install dependencies
just build      # Build Rust extension  
just test       # Run all tests
just format     # Format code

License

MIT License

Keywords

entity-resolution

data-processing

rust

FAQs

What is starlings?

Is starlings well maintained?

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

starlings

Starlings

Quickstart

What is starlings?

Core concepts

Installation

Working with hashes and metadata

Working with pre-existing metadata

Working with individual entities

Working with multiple methods

API overview

EntityFrame

EntityCollection

Entity

Data format

Performance

Development

License

Keywords

Related posts

TeamPCP and BreachForums Launch $1,000 Contest for Supply Chain Attacks

Packagist Urges Immediate Composer Update After GitHub Actions Token Leak

`EntityFrame`

`EntityCollection`

`Entity`