canonmap

CanonMap - A Python library for entity canonicalization and mapping with enhanced configuration and response models

Version 0.2.43 on PyPI. Maintainers: 1

CanonMap

CanonMap is a Python library for generating and managing canonical entity artifacts from various data sources. It provides a streamlined interface for processing data files and generating standardized artifacts that can be used for entity matching and data integration.

Features

Flexible Input Support: Process data from:
- CSV/JSON files
- Directories of data files
- Pandas DataFrames
- Python dictionaries
Artifact Generation:
- Generate canonical entity lists
- Create database schemas (supports multiple database types)
- Generate semantic embeddings for entities
- Clean and standardize field names
- Process metadata fields
Database Support:
- DuckDB (default)
- SQLite
- BigQuery
- MariaDB
- MySQL
- PostgreSQL
Enhanced Configuration:
- Separate configuration for artifacts and embeddings
- Optional GCP integration with bucket management
- Flexible sync strategies for cloud storage
- Comprehensive error handling and logging
- Local-only mode for development and testing

Installation

Lightweight Installation (Core Features Only)

pip install canonmap

Full Installation (Including Embedding Support)

pip install "canonmap[embedding]"

Note: The lightweight installation includes all core features (GCP integration, file processing, schema generation) but excludes embedding functionality. If you need semantic embeddings, use the full installation with [embedding] extras.
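
Because the lightweight installation omits embedding support, a script may want to check for the optional dependency before requesting embeddings. The following is a minimal sketch, assuming the [embedding] extra installs the sentence_transformers package; that is a guess based on the model names used in the examples below, not something this README states. `module_available` is a hypothetical helper, not part of canonmap.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the [embedding] extra installs sentence-transformers.
EMBEDDINGS_AVAILABLE = module_available("sentence_transformers")
print(f"Embedding support installed: {EMBEDDINGS_AVAILABLE}")
```

A flag like this could then be passed as generate_embeddings in an ArtifactGenerationRequest, so the same script runs under either installation.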

Quick Start

Local-Only Mode (Recommended for Development)

from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# Simple local-only configuration
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None  # No GCS integration
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=None  # No GCS integration
)

# Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)

# Configure artifact generation
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id")
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes")
    ],
    generate_schemas=True,
    generate_embeddings=True,
    generate_semantic_texts=True
)

# Generate artifacts
response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")

With GCP Integration

from canonmap import (
    CanonMap,
    CanonMapGCPConfig,
    CanonMapCustomGCSConfig,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# 1. Set up base GCP configuration
base_gcp = CanonMapGCPConfig(
    gcp_service_account_json_path="path/to/service_account.json",
    troubleshooting=False
)

# 2. Configure GCS for artifacts and embeddings
artifacts_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-artifacts-bucket",
    bucket_prefix="artifacts/",
    auto_create_bucket=True,
    sync_strategy="refresh"
)

embedding_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-models-bucket",
    bucket_prefix="models/",
    auto_create_bucket=True,
    sync_strategy="refresh"
)

# 3. Create application-specific configs
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=artifacts_gcs
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=embedding_gcs
)

# 4. Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=False
)

# 5. Configure artifact generation
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id")
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes")
    ],
    generate_schemas=True,
    generate_embeddings=True,
    generate_semantic_texts=True
)

# 6. Generate artifacts
response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")

Artifact Generation Example

from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
    ArtifactGenerationResponse
)

# Set up configurations (local-only for this example)
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None  # Local-only mode
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models",
    gcs_config=None  # Local-only mode
)

# Initialize CanonMap
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)

# Create generation request
gen_req = ArtifactGenerationRequest(
    input_path="input",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    semantic_fields=[
        SemanticField(table_name="passing", field_name="description"),
        SemanticField(table_name="rushing", field_name="notes"),
    ],
    generate_schemas=True,
    save_processed_data=True,
    generate_semantic_texts=True
)

# Generate artifacts
resp: ArtifactGenerationResponse = cm.generate_artifacts(gen_req)

# Access response details
print(f"Status: {resp.status}")
print(f"Generated {len(resp.generated_artifacts)} artifacts")
print(f"Processing time: {resp.processing_stats.processing_time_seconds:.2f} seconds")

Entity Mapping Example

from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse
)

# Initialize CanonMap (reusing configs from above)
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config
)

# Create mapping request
mapping_request = EntityMappingRequest(
    entities=["tim brady", "jake alan"],
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"])
    ],
    num_results=3,
)

# Map entities
resp: EntityMappingResponse = cm.map_entities(mapping_request)

# Access mapping results
print(f"Processed {resp.total_entities_processed} entities")
print(f"Found {resp.total_matches_found} matches")

for mapping in resp.mappings:
    print(f"\nEntity: {mapping.entity}")
    for match in mapping.matches:
        print(f"  Match: {match.matched_entity} (Score: {match.score:.3f})")
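
The loop above prints every candidate. When only the top candidate per entity is needed, the results can be reduced as sketched below. The dataclasses are stand-ins that mirror only the attributes used in the example (entity, matches, matched_entity, score); `Match`, `Mapping`, `best_matches`, and `min_score` are hypothetical names, and the sketch assumes that a higher score means a better match.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Match:
    matched_entity: str
    score: float


@dataclass
class Mapping:
    entity: str
    matches: List[Match] = field(default_factory=list)


def best_matches(mappings: List[Mapping], min_score: float = 0.0) -> Dict[str, Optional[str]]:
    """Pick the highest-scoring match per entity, or None if nothing reaches min_score."""
    result: Dict[str, Optional[str]] = {}
    for mapping in mappings:
        candidates = [m for m in mapping.matches if m.score >= min_score]
        best = max(candidates, key=lambda m: m.score, default=None)
        result[mapping.entity] = best.matched_entity if best else None
    return result


# Invented sample values for illustration only.
sample = [
    Mapping("tim brady", [Match("Tom Brady", 0.91), Match("Tim Boyle", 0.55)]),
    Mapping("jake alan", []),
]
print(best_matches(sample, min_score=0.6))
```

With a real response, `resp.mappings` would take the place of `sample`, provided its items expose the same attributes.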

Configuration Options

CanonMapGCPConfig

Base GCP configuration with service account and troubleshooting settings:

gcp_service_account_json_path: Path to GCP service account JSON file
troubleshooting: Enable detailed logging and validation

CanonMapCustomGCSConfig

Bucket-specific configuration extending the base GCP config:

gcp_config: Base GCP configuration
bucket_name: GCS bucket name
bucket_prefix: Optional prefix for bucket operations
auto_create_bucket: Automatically create bucket if it doesn't exist
auto_create_bucket_prefix: Automatically create prefix directory
sync_strategy: Sync strategy ("none", "missing", "overwrite", "refresh")

CanonMapArtifactsConfig

Configuration for artifact storage and management:

artifacts_local_path: Local directory for artifacts
gcs_config: Optional GCS configuration for artifact storage
troubleshooting: Enable troubleshooting mode

CanonMapEmbeddingConfig

Configuration for embedding model management:

embedding_model_hf_name: HuggingFace model name
embedding_model_local_path: Local path for model storage
gcs_config: Optional GCS configuration for model storage
troubleshooting: Enable troubleshooting mode

ArtifactGenerationRequest

Comprehensive configuration for artifact generation:

Input/Output:
- input_path: Path to a data file or directory, or an in-memory DataFrame or dict
- source_name: Logical source name
- table_name: Logical table name
Directory Processing:
- recursive: Process subdirectories
- file_pattern: File matching pattern (e.g., "*.csv")
- table_name_from_file: Use filename as table name
Entity Processing:
- entity_fields: List of fields to treat as entities
- semantic_fields: List of fields to extract as individual semantic text files
- use_other_fields_as_metadata: Include non-entity fields as metadata
Generation Options:
- generate_canonical_entities: Generate entity list
- generate_schemas: Generate database schema
- generate_embeddings: Generate semantic embeddings
- generate_semantic_texts: Generate semantic text files from semantic_fields
- save_processed_data: Save cleaned data
- database_type: Target database type
- normalize_field_names: Standardize field names
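
When input_path points at a directory, the recursive and file_pattern options decide which files are picked up. The sketch below previews such a selection with pathlib. It only approximates the behaviour described above and is not CanonMap's own matching code; `preview_input_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import List


def preview_input_files(input_dir: str, file_pattern: str = "*.csv", recursive: bool = False) -> List[Path]:
    """List files under input_dir matching file_pattern, optionally descending into subdirectories."""
    root = Path(input_dir)
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    return sorted(p for p in matches if p.is_file())


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "passing.csv").write_text("player\n")
    (root / "notes.txt").write_text("ignore me\n")
    (root / "archive").mkdir()
    (root / "archive" / "rushing.csv").write_text("rusher_name\n")

    top_level = [p.name for p in preview_input_files(tmp)]
    everything = [p.name for p in preview_input_files(tmp, recursive=True)]

print(top_level)
print(everything)
```

Running a preview like this before generate_artifacts() can catch a pattern that matches nothing, or one that matches more files than intended.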

Response Models

ArtifactGenerationResponse

Comprehensive response containing:

status: Success/failure status
message: Human-readable message
generated_artifacts: List of generated artifacts with metadata
processing_stats: Detailed processing statistics
errors: List of errors encountered
warnings: List of warnings
gcp_upload_info: GCP upload details
Convenience paths for common artifacts
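
A short summary of these fields is often enough for logging. The sketch below builds one from the attribute names listed above, using a stand-in object because the concrete types are not documented here; status may be a string or an enum, for instance. `summarize_response` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_response(resp) -> str:
    """Build a one-line summary from the documented response fields."""
    errors = list(getattr(resp, "errors", None) or [])
    warnings = list(getattr(resp, "warnings", None) or [])
    artifacts = list(getattr(resp, "generated_artifacts", None) or [])
    return (
        f"status={resp.status} artifacts={len(artifacts)} "
        f"errors={len(errors)} warnings={len(warnings)}"
    )


# Stand-in object with invented values; a real response comes from generate_artifacts().
fake = SimpleNamespace(
    status="success",
    generated_artifacts=["schema", "entities"],
    errors=[],
    warnings=["empty column skipped"],
)
print(summarize_response(fake))
```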

EntityMappingResponse

Detailed mapping results including:

status: Success/failure status
mappings: List of entity mappings with matches
total_entities_processed: Number of entities processed
total_matches_found: Total number of matches found
processing_stats: Performance metrics
configuration_summary: Request configuration summary
errors: List of errors encountered
warnings: List of warnings

API Mode

For API deployments, initialize CanonMap with api_mode=True:

canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=True  # Enables API-specific optimizations
)

Output

The generate_artifacts() method returns an ArtifactGenerationResponse containing:

Generated artifacts with metadata
Processing statistics and timing information
Error and warning information
GCP upload details (if applicable)
Convenience paths to common artifacts

Semantic Text Files

When semantic_fields is specified, CanonMap creates zip files containing individual text files for each non-null semantic field value:

Single table: {source}_{table}_semantic_texts.zip
Multiple tables: {source}_semantic_texts.zip (combined)
File naming: {table_name}_row_{row_index}_{field_name}.txt
Content: Raw text content from the specified semantic fields
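
The naming scheme above makes the archives easy to read back with the standard library. The following is a minimal sketch, assuming the text files are UTF-8 encoded and sit at the top level of the archive; neither detail is stated here. `read_semantic_texts` is a hypothetical helper, and the demo archive is built in memory with invented content.

```python
import io
import re
import zipfile
from typing import Dict, Tuple

# Pattern for {table_name}_row_{row_index}_{field_name}.txt
NAME_RE = re.compile(r"^(?P<table>.+?)_row_(?P<row>\d+)_(?P<field>.+)\.txt$")


def read_semantic_texts(zip_bytes: bytes) -> Dict[Tuple[str, int, str], str]:
    """Map (table_name, row_index, field_name) to the text stored in the archive."""
    texts: Dict[Tuple[str, int, str], str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            match = NAME_RE.match(name)
            if match is None:
                continue  # skip anything that does not follow the naming scheme
            key = (match["table"], int(match["row"]), match["field"])
            texts[key] = archive.read(name).decode("utf-8")  # assumed encoding
    return texts


# Build a small in-memory archive with invented content to exercise the parser.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("passing_row_0_description.txt", "Quarterback, 2020 season")
    archive.writestr("rushing_row_3_notes.txt", "Missed two games")

parsed = read_semantic_texts(buffer.getvalue())
print(parsed[("passing", 0, "description")])
```

For a file on disk, the bytes could come from `Path(...).read_bytes()`. Table names that themselves contain `_row_` followed by digits would make the pattern ambiguous.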

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
