
A Python package providing comprehensive tools for data type handling, validation, text processing, and streaming data analysis.
splurge-tools is a collection of Python utilities focused on data type handling, validation, text processing, and streaming data analysis.

Install from PyPI:

```bash
pip install splurge-tools
```
- `type_helper.py`: Comprehensive type validation, conversion, and inference utilities with support for strings, numbers, dates, times, booleans, and collections
- `dsv_helper.py`: Delimited separated value parsing with streaming support, column profiling, and data analysis
- `tabular_data_model.py`: In-memory data model for tabular datasets with multi-row header support
- `typed_tabular_data_model.py`: Type-safe data model with schema validation and type enforcement
- `streaming_tabular_data_model.py`: Memory-efficient streaming data model for large datasets (>100MB)
- `text_file_helper.py`: Text file processing with streaming support, header/footer skipping, and memory-efficient operations
- `string_tokenizer.py`: String parsing and tokenization utilities with delimited value support
- `case_helper.py`: Text case transformation utilities (camelCase, snake_case, kebab-case, etc.)
- `text_normalizer.py`: Text normalization and cleaning utilities
- `data_validator.py`: Data validation framework with custom validation rules
- `data_transformer.py`: Data transformation utilities for converting between formats
- `random_helper.py`: Random data generation for testing, including realistic test data and secure Base58-like string generation with guaranteed character diversity

```python
from splurge_tools.dsv_helper import DsvHelper
from splurge_tools.streaming_tabular_data_model import StreamingTabularDataModel

# Process a large CSV file without loading it into memory
stream = DsvHelper.parse_stream("large_dataset.csv", delimiter=",")
model = StreamingTabularDataModel(stream, header_rows=1, chunk_size=1000)

# Iterate through data efficiently
for row in model:
    # Process each row
    print(row)

# Or get rows as dictionaries
for row_dict in model.iter_rows():
    print(row_dict["column_name"])
```
```python
from splurge_tools.type_helper import String, DataType

# Infer data types
data_type = String.infer_type("2023-12-25")  # DataType.DATE
data_type = String.infer_type("123.45")      # DataType.FLOAT
data_type = String.infer_type("true")        # DataType.BOOLEAN

# Convert values with validation
date_val = String.to_date("2023-12-25")
float_val = String.to_float("123.45", default=0.0)
bool_val = String.to_bool("true")
```
```python
from splurge_tools.dsv_helper import DsvHelper

# Parse and profile columns
data = DsvHelper.parse("data.csv", delimiter=",")
profile = DsvHelper.profile_columns(data)

# Get column information
for col_name, col_info in profile.items():
    print(f"{col_name}: {col_info['datatype']} ({col_info['count']} values)")
```
```python
from splurge_tools.random_helper import RandomHelper

# Generate Base58-like strings with guaranteed character diversity
api_key = RandomHelper.as_base58_like(32)  # Contains alpha, digit, and symbol
print(api_key)  # Example: "A3!bC7@dE9#fG2$hJ4%kL6&mN8*pQ5"

# Generate without symbols (alpha + digits only)
token = RandomHelper.as_base58_like(16, symbols="")
print(token)  # Example: "A3bC7dE9fG2hJ4kL"

# Generate with custom symbols and secure mode
secure_id = RandomHelper.as_base58_like(20, symbols="!@#$", secure=True)
print(secure_id)  # Example: "A3!bC7@dE9#fG2$hJ4"
```
Set up a development environment:

```bash
git clone https://github.com/jim-schilling/splurge-tools.git
cd splurge-tools
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```
Run tests using pytest:

```bash
python -m pytest tests/
```
The project uses several tools to maintain code quality. Run all quality checks:

```bash
black .
isort .
flake8 splurge_tools/ tests/ --max-line-length=120
mypy splurge_tools/
python -m pytest tests/ --cov=splurge_tools
```
Build distribution:

```bash
python -m build --sdist
```
Changelog highlights from recent releases:

- Removed `DataModelFactory`, `ComponentFactory`, and `create_data_model()`.
- `splurge_tools/factory.py` now provides:
  - `create_in_memory_model(data, *, header_rows=1, skip_empty_rows=True)`
  - `create_streaming_model(stream, *, header_rows=1, skip_empty_rows=True, chunk_size=1000)`
- Removed `TypedTabularDataModel`. Typed access is now provided via a lightweight view: `TabularDataModel.to_typed(type_configs: dict[DataType, Any] | None = None)`.
- Simplified `StreamingTabularDataProtocol` documentation to a minimal, unified interface.
- Dropped `ResourceManagerProtocol` usage in favor of direct context managers.
- `splurge_tools/tabular_utils.py`: shared utilities for tabular processing
  - `process_headers()` — multi-row header merging and normalization
  - `normalize_rows()` — row padding and empty-row filtering
  - `should_skip_row()` and `auto_column_names()` helpers
- `TabularDataModel.to_typed()` typed view:
  - iteration (`__iter__`, `iter_rows`, `iter_rows_as_tuples`)
  - row access (`row`, `row_as_list`, `row_as_tuple`)
  - column access (`column_values`, `cell_value`, `column_type`)
- Internals now share `tabular_utils`; usages updated to `.to_typed()`; removed factory usage.
- `DataTransformer` import cleanup (removed dependency on the deleted `TypedTabularDataModel`).
- Removed `splurge_tools/typed_tabular_data_model.py` and all references.
- The typed view accepts optional `type_configs` (per `DataType`).
- Migration:
  - `create_data_model(data)` → `create_in_memory_model(data)` or `create_streaming_model(stream)`
  - `ComponentFactory.create_validator()` / `create_transformer()` → instantiate classes directly
  - Replace `TypedTabularDataModel(...)` with `TabularDataModel(...).to_typed(...)` (see the sketch after this list).
  - Replace `ComponentFactory.create_resource_manager(...)` with `safe_file_operation(...)` or `FileResourceManager` directly.
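A minimal migration sketch, assuming the in-memory model accepts the row data returned by `DsvHelper.parse` (the file name and the decision to use defaults are illustrative):

```python
from splurge_tools.dsv_helper import DsvHelper
from splurge_tools.factory import create_in_memory_model

# Previously: create_data_model(data) or TypedTabularDataModel(...)
data = DsvHelper.parse("data.csv", delimiter=",")
model = create_in_memory_model(data, header_rows=1, skip_empty_rows=True)

# Typed access is now a lightweight view over the in-memory model
typed_view = model.to_typed()  # optionally pass type_configs={...}
for row in typed_view.iter_rows():
    print(row)
```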
- Secure Float Range Generation: Enhanced the `RandomHelper.as_float_range()` method with a new `secure` parameter for cryptographically secure random float generation:
  - Uses the `secrets` module when `secure=True` for cryptographically secure randomness within the `RandomHelper` class (sketched below).
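A hedged sketch of the `secure` flag; the positional lower/upper bound arguments shown here are assumptions and may differ from the actual signature:

```python
from splurge_tools.random_helper import RandomHelper

# Standard pseudo-random float in a range (lower/upper arguments assumed)
value = RandomHelper.as_float_range(0.0, 1.0)

# Cryptographically secure variant backed by the secrets module
secure_value = RandomHelper.as_float_range(0.0, 1.0, secure=True)
```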
- Comprehensive Examples Suite: Added complete set of working examples demonstrating all major library features:
  - `01_type_inference_and_validation.py`: Type inference, conversion, and validation utilities
  - `02_dsv_parsing_and_profiling.py`: DSV parsing, streaming, and column profiling
  - `03_tabular_data_models.py`: In-memory, streaming, and typed tabular data models
  - `04_text_processing.py`: Text normalization, case conversion, and tokenization
  - `05_validation_and_transformation.py`: Data validation, transformation, and factory patterns
  - `06_random_data_generation.py`: Random data generation including secure methods
  - `07_comprehensive_workflows.py`: End-to-end ETL and streaming data processing workflows
  - `examples/README.md`: Comprehensive documentation for all examples
  - `examples/run_all_examples.py`: Test runner with performance metrics and feature coverage
- `DataTransformer.pivot()`: Corrected parameter names (`index_cols`, `columns_col`, `values_col`) (see the sketch after this list)
- `DataTransformer.group_by()`: Fixed aggregation parameter structure (`group_cols`, `agg_dict`)
- `DataTransformer.transform_column()`: Updated parameter names (`column`, `transform_func`)
- `TextNormalizer` methods: Corrected method names (`remove_special_chars`, `remove_control_chars`)
- `Validator` utility methods: Fixed parameter signatures for validation utilities
- Replaced `TypedTabularDataModel` with `TabularDataModel.to_typed()` for typed access
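A sketch of the corrected keyword names. How a `DataTransformer` is constructed, the in-memory data layout, and the aggregation spec format are all assumptions here, so treat this as illustrative only:

```python
from splurge_tools.data_transformer import DataTransformer
from splurge_tools.factory import create_in_memory_model

# Hypothetical setup: wrap an in-memory model built from a header row plus data rows
data = [
    ["region", "quarter", "sales"],
    ["east", "Q1", "100"],
    ["east", "Q2", "120"],
    ["west", "Q1", "90"],
]
model = create_in_memory_model(data, header_rows=1)
transformer = DataTransformer(model)  # constructor signature is an assumption

# Corrected keyword names from this release (column names are illustrative)
pivoted = transformer.pivot(index_cols=["region"], columns_col="quarter", values_col="sales")
grouped = transformer.group_by(group_cols=["region"], agg_dict={"sales": "sum"})  # agg format assumed
renamed = transformer.transform_column(column="region", transform_func=str.upper)
```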
- Common Utilities Module: Added new `common_utils.py` module containing reusable utility functions to reduce code duplication across the package (a brief sketch follows this list):
  - `deprecated_method()`: Decorator for marking methods as deprecated with customizable warning messages
  - `safe_file_operation()`: Safe file path validation and operation handling with comprehensive error handling
  - `ensure_minimum_columns()`: Utility for ensuring data rows have minimum required columns with padding
  - `safe_index_access()`: Safe list/tuple index access with bounds checking and helpful error messages
  - `safe_dict_access()`: Safe dictionary key access with default values and error context
  - `validate_data_structure()`: Generic data structure validation with type checking and empty data handling
  - `create_parameter_validator()`: Factory function for creating parameter validation functions from validator dictionaries
  - `batch_validate_rows()`: Iterator for validating and filtering tabular data rows with column count constraints
  - `create_error_context()`: Utility for creating detailed error context information for debugging
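A brief, hypothetical sketch of two of these helpers; the decorator message argument and the `default=` keyword are assumptions, not documented signatures:

```python
from splurge_tools.common_utils import deprecated_method, safe_dict_access

class LegacyLoader:
    @deprecated_method("use DsvHelper.parse_stream instead")  # message argument assumed
    def load(self, path: str) -> list[str]:
        return []

# Fallback value instead of a KeyError; the default keyword is assumed
status = safe_dict_access({"status": "ok"}, "missing_key", default="unknown")
print(status)
```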
- Validation Utilities Module: Added new `validation_utils.py` module providing a centralized `Validator` class with consistent error handling:
  - `Validator.is_non_empty_string()`: String validation with whitespace handling options
  - `Validator.is_positive_integer()`: Integer validation with range constraints and bounds checking
  - `Validator.is_valid_range()`: Numeric range validation with inclusive/exclusive bounds
  - `Validator.is_valid_path()`: Path validation with existence checking and permission validation
  - `Validator.is_valid_encoding()`: Text encoding validation with fallback options
  - `Validator.is_iterable_of_type()`: Generic iterable validation with element type checking
  - All methods follow the `is_*` naming convention and return validated values or raise specific exceptions (see the sketch after this list)
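A hedged sketch of the `is_*` style; only the method names come from the changelog, and single-argument calls are an assumption:

```python
from splurge_tools.validation_utils import Validator

# Each is_* method returns the validated value or raises a specific exception
name = Validator.is_non_empty_string("  splurge  ")
chunk_size = Validator.is_positive_integer(1000)
encoding = Validator.is_valid_encoding("utf-8")
```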
- Modernized type hints to use union syntax (`|`) instead of `Optional` and `Union` imports:
  - Updated `data_transformer.py`, `data_validator.py`, `dsv_helper.py`, `random_helper.py`, `string_tokenizer.py`, `tabular_data_model.py`
  - Removed `Optional` and `Union` imports
- `tests/test_common_utils.py`: Complete test coverage for common utility functions (96% coverage)
- `tests/test_validation_utils.py`: Comprehensive validation utility testing (94% coverage)
- Added `StreamingTabularDataProtocol`, specifically designed for streaming data models with methods optimized for memory-efficient processing:
  - `column_names`, `column_count`, `column_index()` for metadata access
  - `__iter__()`, `iter_rows_as_dicts()`, `iter_rows_as_tuples()` for data iteration
  - `reset_stream()` for stream position management
- Added `DataValidatorProtocol` with required methods `validate()`, `get_errors()`, and `clear_errors()`
- Added `DataTransformerProtocol` with required methods `transform()` and `can_transform()`
- Added `TypeInferenceProtocol` with required methods `can_infer()`, `infer_type()`, and `convert_value()` (see the sketch after this list)
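To illustrate the shape of these protocols, here is a rough `typing.Protocol` sketch based only on the method names listed above; parameter and return types are assumptions and are not taken from the library source:

```python
from typing import Any, Protocol

class DataValidatorProtocol(Protocol):
    def validate(self, data: Any) -> bool: ...
    def get_errors(self) -> list[str]: ...
    def clear_errors(self) -> None: ...

class DataTransformerProtocol(Protocol):
    def transform(self, data: Any) -> Any: ...
    def can_transform(self, data: Any) -> bool: ...

class TypeInferenceProtocol(Protocol):
    def can_infer(self, value: Any) -> bool: ...
    def infer_type(self, value: Any) -> Any: ...
    def convert_value(self, value: Any) -> Any: ...
```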
- Added the `as_base58_like()` method for generating Base58-like strings with guaranteed character diversity:
  - Uses the `SYMBOLS` constant for security
- Added `BASE58_ALPHA`, `BASE58_DIGITS`, and `SYMBOLS` constants to `RandomHelper`:
  - `BASE58_ALPHA`: 49 characters (excludes O, I, l from the standard alphabet)
  - `BASE58_DIGITS`: 9 characters (excludes 0, uses 1-9 only)
  - `SYMBOLS`: 26 special characters for secure string generation
- Added a `TypeInference` class implementing `TypeInferenceProtocol` for type inference operations
- Added a `ResourceManager` base class implementing `ResourceManagerProtocol` with abstract methods `_create_resource()` and `_cleanup_resource()`
- `tests/test_factory_protocols.py` - Factory protocol testing
- `tests/test_type_inference.py` - TypeInference class and protocol testing
- `tests/test_data_validator_comprehensive.py` - Comprehensive DataValidator testing (98% coverage)
- `tests/test_factory_comprehensive.py` - Comprehensive Factory testing (87% coverage)
- `tests/test_resource_manager_comprehensive.py` - Comprehensive ResourceManager testing (84% coverage)
- Expanded `test_random_helper.py` with comprehensive `as_base58_like()` testing (97% coverage)
- Updated `StreamingTabularDataModel` to implement `StreamingTabularDataProtocol` instead of `TabularDataProtocol`:
  - `row_count`, `column_type`, `column_values`, `cell_value`, `row`, `row_as_list`, `row_as_tuple` are no longer part of the streaming interface
  - `Union[TabularDataProtocol, StreamingTabularDataProtocol]` is used based on model type
- Updated the `DataValidator` class to explicitly implement `DataValidatorProtocol` (see the sketch after this list):
  - `validate()` method now returns `bool` instead of `Dict[str, List[str]]`
  - `get_errors()` method returning the list of error messages
  - `clear_errors()` method to reset error state
  - `_errors` list to track validation errors
  - `validate_detailed()` method for backward compatibility
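A hedged usage sketch of this protocol-style API; how validation rules are registered is not documented above, so the no-argument constructor and the input record shape are assumptions:

```python
from splurge_tools.data_validator import DataValidator

validator = DataValidator()  # rule registration omitted; exact setup API is an assumption

record = {"name": "Ada", "age": "42"}
if not validator.validate(record):          # returns bool
    for message in validator.get_errors():  # list of error messages
        print(message)
    validator.clear_errors()                # reset error state

detailed = validator.validate_detailed(record)  # legacy dict-style result
```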
- Updated the `DataTransformer` class to explicitly implement `DataTransformerProtocol`:
  - `transform()` method providing general transformation capability
  - `can_transform()` method to check transformability
  - Uses `TabularDataProtocol` for broader compatibility
- Updated `ComponentFactory` methods to return proper protocol types instead of `Any`
- Typing and static-analysis fixes:
  - `case_helper.py` and `text_normalizer.py`
  - `type_helper.py`, by restructuring type checks
  - Fixed `None` attribute access by adding proper `isinstance()` checks
  - Added explicit generic parameters (`Iterator[Any]`, `list[Any]`, `dict[str, DataType]`)
  - Updated `PathLike` type annotations to `PathLike[str]`
  - Handle `None` values properly:
    - `string_tokenizer.py`, `base58.py`: parameter types changed to `str | None` or `Any`
    - `random_helper.py`: for the `start` parameter
- Added `utility_helper.py` module with base-58 encoding/decoding utilities (see the round-trip sketch after this list):
  - `encode_base58()` function for converting binary data to base-58 strings
  - `decode_base58()` function for converting base-58 strings to binary data
  - `is_valid_base58()` function for validating base-58 string format
  - `ValidationError` exception class for utility validation errors
  - `test_utility_helper.py`
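A short round-trip sketch. The import path follows the module name given above, and the exact signatures (bytes in, string out) are assumptions based on the descriptions:

```python
from splurge_tools.utility_helper import decode_base58, encode_base58, is_valid_base58

payload = b"splurge-tools"
encoded = encode_base58(payload)               # binary data -> base-58 string

if is_valid_base58(encoded):                   # validate base-58 format
    assert decode_base58(encoded) == payload   # base-58 string -> binary data
```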
- Optimized the `profile_values()` function in `type_helper.py` using weighted incremental checks at 25%, 50%, and 75% of data processing to short-circuit early when a definitive type can be determined. This provides significant performance improvements for large datasets (>10,000 items) while maintaining accuracy.
- Returns the `MIXED` type early when both numeric/temporal types and string types are detected, avoiding unnecessary processing.
- Added the `use_incremental_typecheck` parameter (default: `True`) to control whether incremental checking is used, allowing users to disable the optimization if needed (see the sketch after this list).
- Added a performance benchmark (`examples/profile_values_performance_benchmark.py`) demonstrating 2-3x performance improvements for large datasets.
- Updated documentation in `type_helper.py` to accurately reflect the simplified implementation.
- Tests now use `unittest.TestCase` for improved test organization and consistency.
- Removed the unused `os` import from `type_helper.py` to improve code cleanliness.
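A sketch of the flag described above; `profile_values` is a module-level function in `type_helper.py`, so the import shown here is an assumption:

```python
from splurge_tools.type_helper import profile_values

values = [str(i) for i in range(50_000)]  # large, homogeneous dataset of digit strings

# Incremental checks at 25%/50%/75% can short-circuit early (default behavior)
inferred = profile_values(values)

# Disable the optimization to force a full scan
inferred_full = profile_values(values, use_incremental_typecheck=False)
```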
- `test_dsv_helper.py`: Kept core parsing tests; moved file I/O and streaming tests to `test_dsv_helper_file_stream.py`
- `test_streaming_tabular_data_model.py`: Kept core streaming model tests; moved complex scenarios and edge cases to `test_streaming_tabular_data_model_complex.py`
- `test_text_file_helper.py`: Kept core text file operations; moved streaming tests to `test_text_file_helper_streaming.py`
- Removed the `DataType` import from `test_dsv_helper.py`
- Removed `Iterator` imports from the streaming tabular data model test files
- Refactored the `type_helper.py` `String` class for improved performance and maintainability:
  - Pattern constants (`_DATE_PATTERNS`, `_TIME_PATTERNS`, `_DATETIME_PATTERNS`)
  - Precompiled regexes (`_FLOAT_REGEX`, `_INTEGER_REGEX`, `_DATE_YYYY_MM_DD_REGEX`, etc.)
- Updated tests to match the new `profile_columns` method keys (`datatype` instead of `type` and no `count` key) and adjusted the error message regex in streaming tabular data model tests.
- Fixed datetime parsing in the `type_helper.py` `String` class: updated the `_DATETIME_YYYY_MM_DD_REGEX` and `_DATETIME_MM_DD_YYYY_REGEX` patterns to properly handle microseconds with `[.]?\d+` instead of the incorrect `[.]?\d{5}` pattern.
- Fixed the `profile_values` function where collections of all-digit strings that could be interpreted as different types (DATE, TIME, DATETIME, INTEGER) were being classified as MIXED instead of INTEGER. The function now prioritizes the INTEGER type when all values are all-digit strings (with optional +/- signs) and there is a mix of DATE, TIME, DATETIME, and INTEGER interpretations.
- Fixed a bug where the `profile_values` function would fail when given a non-reusable iterator (e.g., a generator). The function now uses a two-pass approach that materializes the values into a list whenever the special-case logic is needed, ensuring correctness with generators.
- Removed the `multi_row_headers` parameter from `TabularDataModel`, `StreamingTabularDataModel`, and `DsvHelper.profile_columns`. Multi-row header merging is now controlled solely by the `header_rows` parameter.
- Streamlined the `StreamingTabularDataModel` API to focus on streaming functionality by removing random access methods (`row()`, `row_as_list()`, `row_as_tuple()`, `cell_value()`) and column analysis methods (`column_values()`, `column_type()`). This creates a cleaner, more consistent streaming paradigm.
- Use the `header_rows` parameter for multi-row header merging. Any usage of `multi_row_headers` has been removed.
- Refactored `test_string_tokenizer.py` for improved maintainability and clarity. Test coverage and edge case handling remain comprehensive.
- `DsvHelper.parse_stream`: process data without loading the entire dataset into memory.
- Tests for `StreamingTabularDataModel` with 26 test methods.
- An example demonstrating `StreamingTabularDataModel` usage, including a memory usage comparison with traditional loading methods.
- Updated both data models (`StreamingTabularDataModel`, `TabularDataModel`) to properly handle empty headers by filling them with `column_<index>` names. Headers like `"Name,,City"` now correctly become `["Name", "column_1", "City"]`.
- Updated `StringTokenizer.parse` to preserve empty fields instead of filtering them out. This ensures that `"Name,,City"` is parsed as `["Name", "", "City"]` instead of `["Name", "City"]`, maintaining data integrity (see the sketch after this list).
- Fixed `StreamingTabularDataModel` to properly handle uneven rows and dynamically expand columns during iteration.
- `StreamingTabularDataModel` provides significant memory savings for large datasets by processing data in configurable chunks rather than loading entire files into memory.
- Expanded tests for `StreamingTabularDataModel` with comprehensive edge case testing.
- Added tests covering `DsvHelper.parse_stream` and various data formats.
- Updated `StringTokenizer` tests to reflect the new behavior of preserving empty fields.
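A sketch of the preserved-empty-field behavior; the expected output comes from the note above, while the `delimiter` keyword is assumed from the DSV-oriented API:

```python
from splurge_tools.string_tokenizer import StringTokenizer

# Empty fields are preserved rather than dropped
fields = StringTokenizer.parse("Name,,City", delimiter=",")
assert fields == ["Name", "", "City"]
```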
- Added a `skip_header_rows` parameter to the `preview()` method, allowing users to skip header rows when previewing file contents.
- Now uses `collections.deque` in the `load_as_stream()` method, improving performance from O(n) to O(1) for footer row operations.
- Fixed handling of `header_rows=0`. Column names are now properly generated as `["column_0", "column_1", "column_2"]` when no headers are provided.
- Fixed an `IndexError` in the `row()` method when accessing uneven data rows. Added proper padding logic to ensure row data has enough columns before access.
- Added `DsvHelper.profile_columns`, a new method that generates a simple data profile from parsed DSV data, inferring column names and datatypes.
- Added tests for `DsvHelper.profile_columns` and improved validation of DSV parsing logic, including edge cases for all supported datatypes.
- All methods with default parameter values now use the pattern `def myfunc(value: str, *, trim: bool = True)`. This enforces keyword-only arguments for all default values, improving clarity and consistency. This is a breaking change and may require updates to any code that calls these methods positionally for defaulted parameters.

This project is licensed under the MIT License - see the LICENSE file for details.
This project is configured to build source distributions only (no wheels). To build a source distribution:
```bash
# Using the build script (recommended)
python build_sdist.py

# Or using build directly
python -m build --sdist
```

The source distribution will be created in the `dist/` directory as a `.tar.gz` file.
Run the test suite:

```bash
pytest
```

Run with coverage:

```bash
pytest --cov=splurge_tools --cov-report=html
```
Jim Schilling