CanonMap
CanonMap is a Python library for generating and managing canonical entity artifacts from various data sources. It provides a streamlined interface for processing data files and generating standardized artifacts that can be used for entity matching and data integration.
Features
Installation
Lightweight Installation (Core Features Only)
pip install canonmap
Full Installation (Including Embedding Support)
pip install "canonmap[embedding]"
Note: The lightweight installation includes all core features (GCP integration, file processing, schema generation) but excludes embedding functionality. If you need semantic embeddings, use the full installation with the [embedding] extra.
Quick Start
Local-Only Mode (Recommended for Development)
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
)

# Local-only: artifacts are written to disk and no GCS config is attached
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None,
)

# Embedding model name and the local directory where it is stored
embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=None,
)

canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
)

# Describe the input file and which fields to treat as entities or semantic text
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id"),
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes"),
    ],
    generate_schemas=True,
    generate_embeddings=True,
    generate_semantic_texts=True,
)

response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")
With GCP Integration
from canonmap import (
    CanonMap,
    CanonMapGCPConfig,
    CanonMapCustomGCSConfig,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
)

# Base GCP settings shared by both bucket configurations below
base_gcp = CanonMapGCPConfig(
    gcp_service_account_json_path="path/to/service_account.json",
    troubleshooting=False,
)

# Bucket used for generated artifacts
artifacts_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-artifacts-bucket",
    bucket_prefix="artifacts/",
    auto_create_bucket=True,
    sync_strategy="refresh",
)

# Bucket used for embedding models
embedding_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-models-bucket",
    bucket_prefix="models/",
    auto_create_bucket=True,
    sync_strategy="refresh",
)

artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=artifacts_gcs,
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=embedding_gcs,
)

canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=False,
)

request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id"),
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes"),
    ],
    generate_schemas=True,
    generate_embeddings=True,
    generate_semantic_texts=True,
)

response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")
Artifact Generation Example
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
    ArtifactGenerationResponse,
)

artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None,
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models",
    gcs_config=None,
)

cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
)

# The input path and the entity/semantic fields reference two tables: passing and rushing
gen_req = ArtifactGenerationRequest(
    input_path="input",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    semantic_fields=[
        SemanticField(table_name="passing", field_name="description"),
        SemanticField(table_name="rushing", field_name="notes"),
    ],
    generate_schemas=True,
    save_processed_data=True,
    generate_semantic_texts=True,
)

resp: ArtifactGenerationResponse = cm.generate_artifacts(gen_req)
print(f"Status: {resp.status}")
print(f"Generated {len(resp.generated_artifacts)} artifacts")
print(f"Processing time: {resp.processing_stats.processing_time_seconds:.2f} seconds")
Entity Mapping Example
from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse,
)

# artifacts_config and embedding_config are the objects built in the
# Artifact Generation Example above
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
)

# Look up two free-text names against the "player" field of the "passing" table
mapping_request = EntityMappingRequest(
    entities=["tim brady", "jake alan"],
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"]),
    ],
    num_results=3,
)

resp: EntityMappingResponse = cm.map_entities(mapping_request)
print(f"Processed {resp.total_entities_processed} entities")
print(f"Found {resp.total_matches_found} matches")

for mapping in resp.mappings:
    print(f"\nEntity: {mapping.entity}")
    for match in mapping.matches:
        print(f"  Match: {match.matched_entity} (Score: {match.score:.3f})")
Configuration Options
CanonMapGCPConfig
Base GCP configuration with service account and troubleshooting settings:
- gcp_service_account_json_path: Path to the GCP service account JSON file
- troubleshooting: Enable detailed logging and validation
CanonMapCustomGCSConfig
Bucket-specific configuration extending the base GCP config:
- gcp_config: Base GCP configuration
- bucket_name: GCS bucket name
- bucket_prefix: Optional prefix for bucket operations
- auto_create_bucket: Automatically create the bucket if it doesn't exist
- auto_create_bucket_prefix: Automatically create the prefix directory
- sync_strategy: Sync strategy ("none", "missing", "overwrite", "refresh")
CanonMapArtifactsConfig
Configuration for artifact storage and management:
- artifacts_local_path: Local directory for artifacts
- gcs_config: Optional GCS configuration for artifact storage
- troubleshooting: Enable troubleshooting mode
CanonMapEmbeddingConfig
Configuration for embedding model management:
- embedding_model_hf_name: HuggingFace model name
- embedding_model_local_path: Local path for model storage
- gcs_config: Optional GCS configuration for model storage
- troubleshooting: Enable troubleshooting mode
ArtifactGenerationRequest
Comprehensive configuration for artifact generation:
- Input/Output:
  - input_path: Path to a data file or directory, or a DataFrame/dict
  - source_name: Logical source name
  - table_name: Logical table name
- Directory Processing:
  - recursive: Process subdirectories
  - file_pattern: File matching pattern (e.g., "*.csv")
  - table_name_from_file: Use the filename as the table name
- Entity Processing:
  - entity_fields: List of fields to treat as entities
  - semantic_fields: List of fields to extract as individual semantic text files
  - use_other_fields_as_metadata: Include non-entity fields as metadata
- Generation Options:
  - generate_canonical_entities: Generate entity list
  - generate_schemas: Generate database schema
  - generate_embeddings: Generate semantic embeddings
  - generate_semantic_texts: Generate semantic text files from semantic_fields
  - save_processed_data: Save cleaned data
  - database_type: Target database type
  - normalize_field_names: Standardize field names
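Before running a directory job, it can help to preview which files a given file_pattern would select. The sketch below is not part of CanonMap: preview_tables is a hypothetical helper, and it assumes that file_pattern uses glob-style matching (as in Python's pathlib) and that table_name_from_file takes the file name without its extension. CanonMap's actual matching rules may differ.

```python
from pathlib import Path


def preview_tables(input_dir, file_pattern="*.csv", recursive=False):
    """Map assumed table names (file stems) to the files a pattern matches.

    Assumption: CanonMap's file_pattern behaves like a pathlib glob and
    table_name_from_file uses the file name without its extension.
    """
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays at the top level
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    # Files sharing a stem collapse to one entry (the last path in sort order)
    return {path.stem: path for path in sorted(matches) if path.is_file()}
```

Calling preview_tables("input", recursive=True) returns a dict keyed by file stem, which can be compared against the table_name values used in entity_fields and semantic_fields.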
Response Models
ArtifactGenerationResponse
Comprehensive response containing:
- status: Success/failure status
- message: Human-readable message
- generated_artifacts: List of generated artifacts with metadata
- processing_stats: Detailed processing statistics
- errors: List of errors encountered
- warnings: List of warnings
- gcp_upload_info: GCP upload details
- Convenience paths for common artifacts
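The fields above can be folded into a short report for logs or a CLI. This is a minimal sketch, not a CanonMap function: summarize_response is a hypothetical helper that reads only the attribute names listed above, and it makes no assumption about what values status takes.

```python
def summarize_response(resp):
    """Return a short, printable report for a response object.

    Only attributes named in this README are read. The getattr defaults keep
    the helper usable if a field is missing or None.
    """
    errors = list(getattr(resp, "errors", None) or [])
    warnings = list(getattr(resp, "warnings", None) or [])
    artifacts = list(getattr(resp, "generated_artifacts", None) or [])
    lines = [
        f"Status: {getattr(resp, 'status', 'unknown')}",
        f"Message: {getattr(resp, 'message', '')}",
        f"Artifacts: {len(artifacts)}",
    ]
    lines += [f"ERROR: {error}" for error in errors]
    lines += [f"WARNING: {warning}" for warning in warnings]
    return "\n".join(lines)
```

For example, print(summarize_response(cm.generate_artifacts(gen_req))) prints the status first and then one line per error or warning.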
EntityMappingResponse
Detailed mapping results including:
- status: Success/failure status
- mappings: List of entity mappings with matches
- total_entities_processed: Number of entities processed
- total_matches_found: Total number of matches found
- processing_stats: Performance metrics
- configuration_summary: Request configuration summary
- errors: List of errors encountered
- warnings: List of warnings
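A common follow-up is to keep only the top match per entity. The sketch below is an illustration rather than CanonMap functionality: best_matches is a hypothetical helper that relies on the mappings, entity, matches, matched_entity and score attributes shown in the Entity Mapping Example, and it assumes that a higher score means a closer match.

```python
def best_matches(resp, min_score=0.0):
    """Return {entity: (matched_entity, score)} for the top match per entity.

    Assumption: a higher score means a closer match. Entities with no match
    at or above min_score map to None.
    """
    result = {}
    for mapping in resp.mappings:
        candidates = [m for m in mapping.matches if m.score >= min_score]
        if candidates:
            top = max(candidates, key=lambda m: m.score)
            result[mapping.entity] = (top.matched_entity, top.score)
        else:
            result[mapping.entity] = None
    return result
```

Raising min_score is one way to drop weak matches, but a sensible threshold depends on how CanonMap scales its scores, which this README does not specify.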
API Mode
For API deployments, initialize CanonMap with api_mode=True:
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=True,
)
Output
The generate_artifacts() method returns an ArtifactGenerationResponse containing:
- Generated artifacts with metadata
- Processing statistics and timing information
- Error and warning information
- GCP upload details (if applicable)
- Convenience paths to common artifacts
Semantic Text Files
When semantic_fields is specified, CanonMap creates zip files containing individual text files for each non-null semantic field value:
- Single table: {source}_{table}_semantic_texts.zip
- Multiple tables: {source}_semantic_texts.zip (combined)
- File naming: {table_name}_row_{row_index}_{field_name}.txt
- Content: Raw text content from the specified semantic fields
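The naming convention above is enough to read an archive back into structured records using only the standard library. The following is a sketch under stated assumptions, not a CanonMap API: read_semantic_texts is a hypothetical helper, the text files are assumed to be UTF-8 encoded, and a table name that itself contains "_row_" followed by digits would be split ambiguously.

```python
import re
import zipfile

# Pattern for {table_name}_row_{row_index}_{field_name}.txt
_NAME_RE = re.compile(r"^(?P<table>.+?)_row_(?P<row>\d+)_(?P<field>.+)\.txt$")


def read_semantic_texts(zip_source):
    """Yield (table_name, row_index, field_name, text) per archive member.

    zip_source may be a path or a binary file-like object.
    Assumption: member files are UTF-8 encoded.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Match on the base name in case members sit inside a folder
            match = _NAME_RE.match(name.rsplit("/", 1)[-1])
            if match is None:
                continue  # skip anything that does not follow the convention
            text = archive.read(name).decode("utf-8")
            yield match["table"], int(match["row"]), match["field"], text
```

For a single-table run, the path would follow the {source}_{table}_semantic_texts.zip pattern inside the configured artifacts directory; the exact location depends on artifacts_local_path.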
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.