smaz-py3
Small string compression using the smaz compression algorithm.
This library wraps the original C code, so it should be quite fast. It also has a
test suite that uses hypothesis-based property testing, a fancy way of saying that
the tests are run with randomly generated strings covering most of Unicode, to
better guard against edge cases.
Why do I need this?
You are working with tons of short strings (text messages, URLs, ...) and want to save
space.
According to the original code and notes, it achieves the best compression with English
strings (up to 50%) that do not contain a lot of numbers. However, other languages
might work as well (allegedly still up to 30%).
Note that in certain cases, it is possible that the compression increases the size.
Keep that in mind and maybe first run some tests. Measuring size is explained in the
example below as well.
How do I use this?
Let's install:
$ pip install smaz-py3
Note: the -py3 is important. There is an original release, kudos to Benjamin
Sergeant, but it does not work with Python 3+.
Now, a usage example.
import smaz
compressed = smaz.compress("The quick brown fox jumps over the lazy dog.")
decompressed = smaz.decompress(b'H\x00\xfeq&\x83\xfek^sA)\xdc\xfa\x00\xfej&-<\x95\xe7\r\x0b\x89\xdbG\x18\x06;n')
assert decompressed == "The quick brown fox jumps over the lazy dog."
How much did we compress?
original_size = len("The quick brown fox jumps over the lazy dog.".encode("utf-8"))
compressed_size = len(compressed)
compression_ratio = 1 - compressed_size / original_size
So we saved about 30% (0.295 * 100 and some rounding 😉).
If the compression ratio were below 0, the compressed output would actually be larger
than the original string. Yes, this can happen. Again, smaz works best on small strings.
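The size check above can be wrapped in a small helper. This is a minimal sketch, not part of smaz-py3: compression_ratio and worth_compressing are hypothetical names, and the compress argument is assumed to be any callable that takes a str and returns bytes, such as smaz.compress.

```python
def compression_ratio(text, compress):
    """Return the fraction of bytes saved; a negative value means the output grew."""
    original_size = len(text.encode("utf-8"))
    if original_size == 0:
        return 0.0  # nothing to save on an empty string
    compressed_size = len(compress(text))
    return 1 - compressed_size / original_size


def worth_compressing(text, compress):
    """True only if compressing actually makes the string smaller."""
    return compression_ratio(text, compress) > 0
```

With smaz-py3 installed, calling compression_ratio with the example sentence and smaz.compress should give roughly the 0.295 computed above. Running worth_compressing over a sample of your own strings is one way to find out how often compression backfires on your data.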
A small note about NULL bytes
Currently, smaz-py3 does not support strings with NULL bytes (\x00) in compression:
>>> import smaz
>>> smaz.compress("The quick brown fox\x00 jumps over the lazy dog.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: embedded null character
My reasoning behind this is that in most scenarios you want to clean those bytes away
beforehand anyway. If you think this is wrong, please open an
issue on GitHub. I am happy to get further input!
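If your input may contain NULL bytes, one way to clean them away beforehand is to strip them before calling smaz.compress. This is a minimal sketch: strip_nulls is a hypothetical helper, not part of smaz-py3. Keep in mind that once the bytes are removed, the decompressed text will no longer match the original input exactly.

```python
def strip_nulls(text):
    """Remove every NULL character so the string can be passed to smaz.compress."""
    return text.replace("\x00", "")


cleaned = strip_nulls("The quick brown fox\x00 jumps over the lazy dog.")
# cleaned no longer contains "\x00" and can be handed to smaz.compress(cleaned)
```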
Migrating from Python 2 smaz
If you have been using the Python 2 smaz library,
this Python 3 version exposes the same API, so it is a drop-in replacement.
Important: While developing this extension, I think I found a bug in the original
library. Using Python 2.7.16:
>>> import smaz
>>> smaz.compress("The quick brown fox jumps over the lazy dog.")
'H'
>>> small = smaz.compress("The quick brown fox jumps over the lazy dog.")
>>> smaz.decompress(small)
'The'
So, if you are upgrading from the Python 2 library, please make sure that you are not
affected by this. smaz-py3 is not prone to this bug.
Behind the scenes, smaz uses NULL bytes in compression. However, when converting from
C back to a Python string object, NULL is used to mark the end of the string. The
above sentence, compressed, has the NULL byte right after the H (H\x00\xfeq…).
That's why it stops right then and there. Again, smaz-py3 is not affected by this,
mostly because I got lucky in choosing this example sentence.
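The truncation can be reproduced without smaz. The sketch below uses the standard library ctypes module to read the compressed bytes from the usage example as a NULL-terminated C string. It only illustrates the C string convention and is not how either wrapper is actually implemented.

```python
import ctypes

# Compressed form of the example sentence, copied from the usage example.
compressed = b'H\x00\xfeq&\x83\xfek^sA)\xdc\xfa\x00\xfej&-<\x95\xe7\r\x0b\x89\xdbG\x18\x06;n'

buffer = ctypes.create_string_buffer(compressed)

# .value reads the buffer as a NULL-terminated C string,
# so it stops at the first \x00, right after the "H".
as_c_string = buffer.value

# .raw returns the whole memory block; slicing drops the
# extra terminator that ctypes appends to the buffer.
full_bytes = buffer.raw[:len(compressed)]
```

Here as_c_string holds only the leading H, matching the Python 2 output shown above, while full_bytes still contains all of the compressed data.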
Credits
Credit where credit is due: first to antirez's SMAZ compression,
and then to the original Python 2 wrapper by Benjamin Sergeant.