HashStash is a versatile caching library for Python that supports multiple storage engines, serializers, and encoding options. It provides a simple dictionary-like interface for caching data with various backend options. HashStash is designed to be easy to use, flexible, and efficient.
Dictionary-like interface, except absolutely anything can be either a key or value (even unhashable entities like sets or unpicklable entities like lambdas, local functions, etc)
Multiprocessing support: connection pooling and locking parallelize operations as much as the specific engine allows
Functions like stash.run and decorators like @stashed_result cache the results of function calls
Functions like stash.map and decorators like @stash_mapped parallelize function calls across many objects, with stashed results
Easy dataframe assembly from cached contents
Engines:
File-based: pairtree, lmdb, sqlite, diskcache
Server-based: redis, mongo
In-memory: memory
Serializers:
Transportable between Python versions: hashstash, jsonpickle
Not transportable between Python versions: pickle
Compressors:
External compressors (with dependencies): lz4, blosc
Built-in compressors (no dependencies): zlib, gzip, bz2
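As a quick illustration of these options, here is a minimal sketch instantiating two of these configurations (the lmdb engine and lz4 compressor assume the optional dependencies described in the installation section below):

from hashstash import HashStash
# in-memory stash: fastest to create, but nothing is persisted to disk
mem_stash = HashStash(engine="memory")
# recommended file-based setup: lmdb engine with lz4 compression
fast_stash = HashStash(engine="lmdb", serializer="hashstash", compress="lz4")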
HashStash requires no dependencies by default, but you can install optional dependencies to get the best performance.
Default installation (no dependencies): pip install hashstash
Installation with only the recommended/optimal settings (lmdb engine, lz4 compression, and pyarrow dataframe serialization): pip install hashstash[rec]
Full installation with all optional dependencies: pip install hashstash[all]
Development installation: pip install hashstash[dev]
For all options see pyproject.toml under [project.optional-dependencies].
!pip install -qU hashstash[rec]
Here's a quick example of how to use HashStash.
from hashstash import HashStash
# Create a stash instance
stash = HashStash()
# or customize:
stash = HashStash(
# naming
root_dir="project_stash", # root directory of the stash (default: default_stash)
# if not an absolute path, will be ~/.cache/hashstash/[root_dir]
dbname="sub_stash", # name of "database" or subfolder (default: main)
# engines
engine="pairtree", # or lmdb, sqlite, diskcache, redis, mongo, or memory
serializer="hashstash", # or jsonpickle or pickle
compress='lz4', # or blosc, bz2, gzip, zlib, or raw
b64=True, # base64 encode keys and values
# storage options
append_mode=False, # store all versions of a key/value pair
clear=True # clear on init
)
# show stash type and path
print(stash)
# show stash config
stash.to_dict()
↓
PairtreeHashStash(~/.cache/hashstash/project_stash/sub_stash/pairtree.hashstash.lz4+b64/data.db)
{'root_dir': '/Users/ryan/.cache/hashstash/project_stash',
'dbname': 'sub_stash',
'engine': 'pairtree',
'serializer': 'hashstash',
'compress': 'lz4',
'b64': True,
'append_mode': False,
'is_function_stash': False,
'is_tmp': False,
'filename': 'data.db'}
Literally anything can be a key or value, including lambdas, local functions, sets, dataframes, dictionaries, etc:
# ...traditional dictionary keys...
stash["bad"] = "cat" # string key
stash[("bad","good")] = "cat" # tuple key
# ...unhashable keys...
stash[{"goodness":"bad"}] = "cat" # dict key
stash[["bad","good"]] = "cat" # list key
stash[{"bad","good"}] = "cat" # set key
# ...func keys...
def func_key(x): pass
stash[func_key] = "cat" # function key
lambda_key = lambda x: x
stash[lambda_key] = "cat" # lambda key
# ...very unhashable keys...
import pandas as pd
df_key = pd.DataFrame(
{"name":["cat"],
"goodness":["bad"]}
)
stash[df_key] = "cat" # dataframe key
# all should equal "cat":
(
stash["bad"],
stash[("bad","good")],
stash[{"goodness":"bad"}],
stash[["bad","good"]],
stash[{"bad","good"}],
stash[func_key],
stash[lambda_key],
stash[df_key]
)
↓
('cat', 'cat', 'cat', 'cat', 'cat', 'cat', 'cat', 'cat')
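This works because each key is reduced to its serialized form, so any object that serializes to the same bytes addresses the same entry. A rough sketch of the idea, purely illustrative and not HashStash's actual implementation:

import hashlib, pickle
def key_address(obj) -> str:
    # hash a serialized key to a fixed-length address (illustrative only)
    return hashlib.md5(pickle.dumps(obj)).hexdigest()
# equal objects serialize identically, so they resolve to the same address
assert key_address(("bad", "good")) == key_address(("bad", "good"))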
HashStash fully implements the MutableMapping interface, providing all the standard dictionary methods, including:
# get()
assert stash.get(df_key) == "cat"
assert stash.get('fake_key') is None
# __contains__
assert df_key in stash
# __len__
assert len(stash) == 8 # from earlier
# keys()
from hashstash import *
for i,key in enumerate(stash.keys()):
    pass
# values()
for value in stash.values():
    assert value == "cat"
# items()
for i, (key, value) in enumerate(stash.items()):
    print(f'Item #{i+1}:\n{key} >>> {value}\n')
↓
Item #1:
{'good', 'bad'} >>> cat
Item #2:
{'goodness': 'bad'} >>> cat
Item #3:
bad >>> cat
Item #4:
name goodness
0 cat bad >>> cat
Item #5:
('bad', 'good') >>> cat
Item #6:
['bad', 'good'] >>> cat
Item #7:
<function func_key at 0x12846c160> >>> cat
Item #8:
<function <lambda> at 0x1291c0160> >>> cat
Other dictionary functions:
# pop()
assert stash.pop(df_key) == "cat"
assert df_key not in stash
# setdefault()
assert stash.setdefault(df_key, "new_cat_default") == "new_cat_default"
assert stash.get(df_key) == "new_cat_default"
# update()
another_dict = {'new_key_of_badness': 'cat'}
stash.update(another_dict)
assert stash['new_key_of_badness'] == "cat"
# update() with another stash
another_stash = HashStash(engine='memory').clear()
another_stash[[1,2,3]] = "cat"
stash.update(another_stash)
assert stash[[1,2,3]] == "cat"
You can also iterate over the keys and values as they actually exist in the data store, i.e. serialized, compressed, and encoded:
_keys(): Return an iterator over the encoded keys
_values(): Return an iterator over the encoded values
_items(): Return an iterator over the encoded key-value pairs
These methods are used internally and are not normally needed.
print('\nIterating over ._items():')
for encoded_key,encoded_value in stash._items():
    print(encoded_key, 'is the serialized, compressed, and encoded key for', encoded_value)
    decoded_key = stash.decode_key(encoded_key)
    decoded_value = stash.decode_value(encoded_value)
    print(decoded_key, 'is the decoded, uncompressed, and deserialized key for', decoded_value)
    break
↓
Iterating over ._items():
b'NwAAAPETeyJfX3B5X18iOiAiYnVpbHRpbnMuc2V0IiwgIl9fZGF0YRwA8AFbImdvb2QiLCAiYmFkIl19' is the serialized, compressed, and encoded key for b'BQAAAFAiY2F0Ig=='
{'good', 'bad'} is the decoded, uncompressed, and deserialized key for cat
HashStash provides two ways of stashing results.
First, here's an expensive function:
# Here's an expensive function
num_times_computed = 0
def expensive_computation(names,goodnesses=['good']):
    import random
    global num_times_computed
    num_times_computed += 1
    print(f'Executing expensive_computation time #{num_times_computed}')
    ld=[]
    for n in range(1_000_000):
        d={}
        d['name']=random.choice(names)
        d['goodness']=random.choice(goodnesses)
        d['random']=random.random()
        ld.append(d)
    return random.sample(ld,k=10)
names = ['cat', 'dog']
goodnesses=['good','bad']
# execute 2 times -- different results
unstashed_result1 = expensive_computation(names, goodnesses=goodnesses)
unstashed_result2 = expensive_computation(names, goodnesses=goodnesses)
↓
Executing expensive_computation time #1
Executing expensive_computation time #2
stash.run()
# set up a stash to run the function in
functions_stash = HashStash('functions_stash', clear=True)
# execute time #3
stashed_result1 = functions_stash.run(expensive_computation, names, goodnesses=goodnesses)
# calls #4-5 will not execute but return stashed result
stashed_result2 = functions_stash.run(expensive_computation, names, goodnesses=goodnesses)
stashed_result3 = functions_stash.run(expensive_computation, names, goodnesses=goodnesses)
assert stashed_result1 == stashed_result2 == stashed_result3
↓
Executing expensive_computation time #3
@stash.stashed_result
from hashstash import stashed_result
@functions_stash.stashed_result # or @stashed_result("functions_stash") [same HashStash call args/kwargs]
def expensive_computation2(names, goodnesses=['good']):
    return expensive_computation(names, goodnesses=goodnesses)
# will run once
stashed_result4 = expensive_computation2(names, goodnesses=goodnesses)
# then cached even when calling it normally
stashed_result5 = expensive_computation2(names, goodnesses=goodnesses)
stashed_result6 = expensive_computation2(names, goodnesses=goodnesses)
assert stashed_result4 == stashed_result5 == stashed_result6
↓
Executing expensive_computation time #4
Once a function is stashed via either of the methods above, you can access its stash as an attribute of the function:
# function now has .stash attribute, from either method
func_stash = expensive_computation.stash
func_stash2 = expensive_computation2.stash
assert len(func_stash) == len(func_stash2)
print(f'Function results cached in {func_stash}\n')
# can iterate over its results normally. Keys are: (args as tuple, kwargs as dict)
func_stash = func_stash2
for key, value in func_stash.items():
    args, kwargs = key
    print(f'Stashed key = {key}')
    print(f'Called args: {args}')
    print(f'Called kwargs: {kwargs}')
    print(f'\nStashed value = {value}')
# you can get result via normal get
stashed_result7 = func_stash.get(((names,), {'goodnesses':goodnesses}))
# or via special get_func function which accepts function call syntax
stashed_result8 = func_stash.get_func(names, goodnesses=goodnesses)
assert stashed_result7 == stashed_result8 == stashed_result5 == stashed_result6
↓
Function results cached in LMDBHashStash(~/.cache/hashstash/functions_stash/lmdb.hashstash.lz4/stashed_result/__main__.expensive_computation/lmdb.hashstash.lz4/data.db)
Stashed key = ((['cat', 'dog'],), {'goodnesses': ['good', 'bad']})
Called args: (['cat', 'dog'],)
Called kwargs: {'goodnesses': ['good', 'bad']}
Stashed value = [{'name': 'dog', 'goodness': 'bad', 'random': 0.5057600020943653}, {'name': 'dog', 'goodness': 'bad', 'random': 0.44942716869985244}, {'name': 'dog', 'goodness': 'bad', 'random': 0.04412090932878976}, {'name': 'dog', 'goodness': 'good', 'random': 0.26390218890484296}, {'name': 'dog', 'goodness': 'good', 'random': 0.8861568169357764}, {'name': 'dog', 'goodness': 'bad', 'random': 0.8113840172104607}, {'name': 'dog', 'goodness': 'bad', 'random': 0.29450288091375965}, {'name': 'cat', 'goodness': 'good', 'random': 0.10650085474589033}, {'name': 'dog', 'goodness': 'bad', 'random': 0.10346094332240874}, {'name': 'cat', 'goodness': 'bad', 'random': 0.29552371113906584}]
You can also map functions across many objects, with stashed results, using stash.map. By default it uses two fewer processes than the number of available CPUs to start computing results in the background; in the meantime it returns a StashMap object.
import time, random

def expensive_computation3(name, goodnesses=['good']):
    time.sleep(random.randint(1,5))
    return {'name':name, 'goodness':random.choice(goodnesses)}
# this returns a custom StashMap object instantly, computing results in background (if num_proc>1)
stash_map = functions_stash.map(expensive_computation3, ['cat','dog','aardvark','zebra'], goodnesses=['good', 'bad'], num_proc=2)
stash_map
↓
Mapping __main__.expensive_computation3 across 4 objects [2x]: 0%| | 0/4 [00:00<?, ?it/s]
StashMap([StashMapRun(__main__.expensive_computation3('cat', goodnesses=['good', 'bad']) >>> ?),
StashMapRun(__main__.expensive_computation3('dog', goodnesses=['good', 'bad']) >>> ?),
StashMapRun(__main__.expensive_computation3('aardvark', goodnesses=['good', 'bad']) >>> ?),
StashMapRun(__main__.expensive_computation3('zebra', goodnesses=['good', 'bad']) >>> ?)])
# iterate over results as they come in:
timestart=time.time()
for result in stash_map.results_iter():
    print(f'[+{time.time()-timestart:.1f}] {result}')
↓
Mapping __main__.expensive_computation3 across 4 objects [2x]: 50%|█████ | 2/4 [00:05<00:04, 2.42s/it]
[+5.0] {'name': 'cat', 'goodness': 'good'}
[+5.0] {'name': 'dog', 'goodness': 'good'}
[+5.0] {'name': 'aardvark', 'goodness': 'good'}
Mapping __main__.expensive_computation3 across 4 objects [2x]: 100%|██████████| 4/4 [00:09<00:00, 2.16s/it]
[+9.0] {'name': 'zebra', 'goodness': 'bad'}
# or wait for as a list
stash_map.results
↓
[{'name': 'cat', 'goodness': 'good'},
{'name': 'dog', 'goodness': 'good'},
{'name': 'aardvark', 'goodness': 'good'},
{'name': 'zebra', 'goodness': 'bad'}]
# or by .items() or .keys() or .values()
for (args,kwargs),result in stash_map.items():
    print(f'{args} {kwargs} >>> {result}')
↓
('cat',) {'goodnesses': ['good', 'bad']} >>> {'name': 'cat', 'goodness': 'good'}
('dog',) {'goodnesses': ['good', 'bad']} >>> {'name': 'dog', 'goodness': 'good'}
('aardvark',) {'goodnesses': ['good', 'bad']} >>> {'name': 'aardvark', 'goodness': 'good'}
('zebra',) {'goodnesses': ['good', 'bad']} >>> {'name': 'zebra', 'goodness': 'bad'}
# the next time, it will return stashed results, and compute only new values
stash_map2 = functions_stash.map(expensive_computation3, ['cat','dog','elephant','donkey'], goodnesses=['good', 'bad'], num_proc=2)
stash_map2
↓
Mapping __main__.expensive_computation3 across 4 objects [2x]: 0%| | 0/4 [00:00<?, ?it/s]
StashMap([StashMapRun(__main__.expensive_computation3('cat', goodnesses=['good', 'bad']) >>> ?),
StashMapRun(__main__.expensive_computation3('dog', goodnesses=['good', 'bad']) >>> ?),
StashMapRun(__main__.expensive_computation3('elephant', goodnesses=['good', 'bad']) >>> ?),
StashMapRun(__main__.expensive_computation3('donkey', goodnesses=['good', 'bad']) >>> ?)])
# heavily customizable
stash_map3 = functions_stash.map(
expensive_computation3,
objects=['cat','parrot'], # 2 new animals
options=[{'goodnesses':['bad']}, {}], # list of dictionaries for specific keyword arguments
goodnesses=['good', 'bad'], # keyword arguments common to all function calls
num_proc=4, # number of processes to use
preload=True, # start loading stashed results on init
precompute=True, # start computing stashed results
progress=True, # show progress bar
desc="Mapping expensive_computation3", # description for progress bar
ordered=True, # maintain order of input
stash_runs=True, # store individual function runs
stash_map=True, # store the entire map result
_force=False, # don't force recomputation if results exist
)
↓
# Can also use as a decorator
from hashstash import stash_mapped

@stash_mapped('function_stash', num_proc=1)
def expensive_computation4(name, goodnesses=['good']):
    time.sleep(random.randint(1,5))
    return {'name':name, 'goodness':random.choice(goodnesses)}
expensive_computation4(['mole','lizard','turkey'])
↓
StashMap([StashMapRun(__main__.expensive_computation4('mole', root_dir='function_stash') >>> {'name': 'mole', 'goodness': 'good'}),
StashMapRun(__main__.expensive_computation4('lizard', root_dir='function_stash') >>> {'name': 'lizard', 'goodness': 'good'}),
StashMapRun(__main__.expensive_computation4('turkey', root_dir='function_stash') >>> {'name': 'turkey', 'goodness': 'good'})])
HashStash can assemble DataFrames from cached contents, even nested ones. First, examples from earlier:
# assemble list of flattened dictionaries from cached contents
func_stash.ld # or stash.assemble_ld()
# assemble dataframe from flattened dictionaries of cached contents
print(func_stash.df) # or stash.assemble_df()
↓
name goodness random
0 dog bad 0.505760
1 dog bad 0.449427
2 dog bad 0.044121
3 dog good 0.263902
4 dog good 0.886157
5 dog bad 0.811384
6 dog bad 0.294503
7 cat good 0.106501
8 dog bad 0.103461
9 cat bad 0.295524
Nested data flattening:
# can also work with nested data
nested_data_stash = HashStash(engine='memory', dbname='assembling_dfs')
# populate stash with random animals
import random
for n in range(100):
    nested_data_stash[f'Animal {n+1}'] = {
        'name': (cat_or_dog := random.choice(['cat', 'dog'])),
        'goodness': (goodness := random.choice(['good', 'bad'])),
        'etc': {
            'age': random.randint(1, 10),
            'goes_to':{
                'heaven':True if cat_or_dog=='dog' or goodness=='good' else False,
            }
        }
    }
# assemble dataframe from flattened dictionaries of cached contents
print(nested_data_stash.df) # or stash.assemble_df()
↓
name goodness etc.age etc.goes_to.heaven
_key
Animal 1 cat good 9 True
Animal 2 cat bad 8 False
Animal 3 cat good 6 True
Animal 4 dog bad 7 True
Animal 5 dog bad 10 True
... ... ... ... ...
Animal 96 dog bad 2 True
Animal 97 dog bad 8 True
Animal 98 cat bad 9 False
Animal 99 cat good 5 True
Animal 100 cat good 9 True
[100 rows x 4 columns]
Keep track of all versions of a key/value pair. All engines can track the version number; the "pairtree" engine tracks timestamps as well.
append_stash = HashStash("readme_append_mode", engine='pairtree', append_mode=True, clear=True)
key = {"name":"cat"}
append_stash[key] = {"goodness": "good"}
append_stash[key] = {"goodness": "bad"}
print(f'Latest value: {append_stash.get(key)}')
print(f'All values: {append_stash.get_all(key)}')
print(f'All values with metadata: {append_stash.get_all(key, with_metadata=True)}')
↓
Latest value: {'goodness': 'bad'}
All values: [{'goodness': 'good'}, {'goodness': 'bad'}]
All values with metadata: [{'_version': 1, '_timestamp': 1725652978.878733, '_value': {'goodness': 'good'}}, {'_version': 2, '_timestamp': 1725652978.878886, '_value': {'goodness': 'bad'}}]
Can also get metadata on dataframe:
print(append_stash.assemble_df(with_metadata=True))
↓
name goodness
_version _timestamp
1 1.725653e+09 cat good
2 1.725653e+09 cat bad
HashStash provides a tmp method for creating temporary caches. The temporary cache is automatically cleared and removed at the end of the with block:
with stash.tmp() as tmp_stash:
    tmp_stash["key"] = "value"
    print("key" in tmp_stash)
print("key" in tmp_stash)
↓
True
False
HashStash supports multiple serialization methods:
serialize: Serializes Python objects
deserialize: Deserializes data back into Python objects
from hashstash import serialize, deserialize
data = pd.DataFrame({"name": ["cat", "dog"], "goodness": ["good", "bad"]})
serialized_data = serialize(data, serializer="hashstash") # or jsonpickle or pickle
deserialized_data = deserialize(serialized_data, serializer="hashstash")
data.equals(deserialized_data)
↓
True
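Because the hashstash serializer also handles otherwise unpicklable objects such as lambdas (see the profiling notes below), a round trip like the following should work as well; this is a sketch based on that claim rather than an official example:

# round-trip a lambda, which the standard pickle module cannot handle
double = lambda x: x * 2
restored = deserialize(serialize(double, serializer="hashstash"), serializer="hashstash")
assert restored(21) == 42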
HashStash provides functions for encoding and compressing data:
encode: Encodes and optionally compresses data
decode: Decodes and decompresses data
These functions are used internally by HashStash but can also be used directly:
from hashstash import encode, decode
data = b"Hello, World!"
encoded_data = encode(data, compress='lz4', b64=True)
decoded_data = decode(encoded_data, compress='lz4', b64=True)
data == decoded_data
↓
True
LMDB is the fastest engine, followed by the custom "pairtree" implementation.
Pickle is by far the fastest serializer, but it is not transportable between Python versions. HashStash is generally faster than jsonpickle, and can serialize more data types (including lambdas and functions within functions), but it produces larger file sizes.
LZ4 is the fastest compressor, but it requires an external dependency. BZ2 is the slowest, but it provides the best compression ratio.
The fastest combination of parameters is the LMDB engine with the pickle serializer and either no compression (raw), LZ4, or blosc; pairtree with the same settings is next fastest.
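To check these trade-offs on your own hardware, a rough timing sketch along the following lines can help; the bench helper is hypothetical (not part of HashStash), and absolute numbers will vary by machine:

import time
from hashstash import HashStash

def bench(engine, compress, n=1000):
    # time n small writes against a fresh stash with this engine/compressor
    stash = HashStash(f"bench_{engine}_{compress}", engine=engine,
                      serializer="pickle", compress=compress, clear=True)
    t0 = time.time()
    for i in range(n):
        stash[f"key_{i}"] = {"n": i}
    return time.time() - t0

for engine in ("lmdb", "pairtree"):
    for compress in ("raw", "lz4"):
        print(f"{engine:>8} + {compress:<4}: {bench(engine, compress):.3f}s")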
To run the tests, clone this repository and run pytest in the root project directory.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the GNU General Public License.