Official docs: https://hojichar.github.io/HojiChar/hojichar.html
Text preprocessing is far from a one-size-fits-all process. Depending on the data source and the specific task at hand, various steps including normalization, noise removal, and filtering may be necessary. Not all texts require the same level of preprocessing. For instance, relatively clean texts may only need minimal filtering, while "dirtier" sources like Common Crawl data often require more thorough processing. As a result, the preprocessing profile has to be tailored to each specific domain.
Many preprocessing operations can be viewed as filters: taking a string as input, applying a transformation, and outputting the processed string. Even though these operations might seem straightforward individually, managing them in a multi-layered, efficient manner can be challenging.
Inspired by torchvision.transforms and iver56/audiomentations, HojiChar addresses these challenges. It enables users to define each text processing step as a class inheriting from hojichar.Filter and to use hojichar.Compose to chain them together into a single filter. By writing out the Compose recipe as a profile, the preprocessing pipeline for a specific domain's text becomes portable. Moreover, Compose automatically logs various metrics for each filter, such as byte changes, processing time, and the number of rejected texts. This allows users to assess the validity of each operation and to weigh trade-offs between computation time and performance.
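The Filter/Compose design described above can be illustrated with a small dependency-free sketch. This is a simplified stand-in, not hojichar's implementation; the class names mirror the concepts, and the per-filter metrics are reduced to byte deltas and wall-clock time:

```python
import time

class Filter:
    """Stand-in for the filter concept: string in, string out."""
    def apply(self, text: str) -> str:
        raise NotImplementedError

class Compose(Filter):
    """Chains filters and records simple per-filter metrics."""
    def __init__(self, filters):
        self.filters = filters
        self.stats = []  # one entry per filter application

    def apply(self, text: str) -> str:
        for f in self.filters:
            start = time.perf_counter()
            before = len(text.encode("utf-8"))
            text = f.apply(text)
            self.stats.append({
                "name": type(f).__name__,
                "diff_bytes": len(text.encode("utf-8")) - before,
                "time": time.perf_counter() - start,
            })
        return text

class Strip(Filter):
    def apply(self, text):
        return text.strip()

class Lower(Filter):
    def apply(self, text):
        return text.lower()

cleaner = Compose([Strip(), Lower()])
print(cleaner.apply("  Hello World  "))  # -> hello world
```

Because Compose itself satisfies the same string-in, string-out contract as a single filter, pipelines built this way stay composable.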
While there are other text normalization tools available, most are designed to perform a specific set of operations. Text preprocessing, despite its importance in the LLM era, is often considered a mundane task compared to machine learning or artificial intelligence tasks. As a result, many existing solutions can be ad hoc, poorly maintained, or inadequately tested. Recognizing these issues, we developed HojiChar as a robust tool for configuring text preprocessing.
Install HojiChar with:
pip install hojichar
If you want to use the additional filters, install the package with the following command:
pip install hojichar[all]
The Compose class in HojiChar allows you to create a sequence of text processing filters.
from hojichar import Compose, document_filters
cleaner = Compose([
    document_filters.JSONLoader(key="text"),
    document_filters.AcceptJapanese(),
    document_filters.DocumentLengthFilter(min_doc_len=0, max_doc_len=1000),
    document_filters.ExampleHojiChar(),
    document_filters.JSONDumper()
])
When a Compose object is called, it accepts a string and returns the processed string.
>>> cleaner('{"text": "こんにちは、"}')
{"text": "こんにちは、<hojichar>"}
The filter pipeline above accomplishes the following steps:

1. Extracts the string from the 'text' key in the JSON object.
2. Discards the document if it is not written in Japanese.
3. Discards the document if its length is not between 0 and 1000 characters.
4. Appends <hojichar> to the string.
5. Dumps the document back to JSON.

The filters used in the pipeline are predefined filters found in hojichar.filters.
While HojiChar provides some fundamental text processing filters and plans to add more in the future, users can also define their custom filters.
A filter composing a Compose object is a class that inherits the Filter class and implements the text processing within the apply method.
from hojichar.core.filter_interface import Filter

class YourFilter(Filter):
    def apply(self, document):
        text = document.text

        # Write your text transformation here...

        document.text = text
        return document
The apply method accepts a hojichar.Document type as an argument and returns it after the transformations. The Document is a class that encapsulates a string.
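For intuition, the document-passing style and the discard mechanism can be sketched with plain Python. This is an illustrative stand-in, not hojichar itself; it includes the extras metadata and is_rejected flag described below, and the pipeline loop is a simplification of what Compose does:

```python
class Document:
    """Stand-in for hojichar.Document: wraps a string plus metadata."""
    def __init__(self, text):
        self.text = text
        self.is_rejected = False
        self.extras = {}  # metadata shared between filters

class DiscardShort:
    """Example filter: flags documents shorter than min_len."""
    def __init__(self, min_len):
        self.min_len = min_len

    def apply(self, doc):
        if len(doc.text) < self.min_len:
            doc.is_rejected = True
        return doc

def run(filters, docs):
    """Simplified pipeline loop: skip a document once it is rejected."""
    out = []
    for doc in docs:
        for f in filters:
            f.apply(doc)
            if doc.is_rejected:
                break
        if not doc.is_rejected:
            out.append(doc.text)
    return out

print(run([DiscardShort(3)], [Document("ab"), Document("hello")]))  # -> ['hello']
```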
The Document class can have additional metadata via the extras attribute. This allows you to associate values with the document that can be utilized in subsequent filters.

Reject documents

hojichar.Document has an is_rejected attribute. If a filter sets this flag to True, Compose will discard the document during processing.

Definition of __init__ for a custom filter
When creating a user-defined class and applying a custom constructor, make sure to initialize the parent class.
class YourFilter(Filter):
    def __init__(self, your_param, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.your_param = your_param

    def apply(self, document):
        text = document.text
        text = process(text, self.your_param)
        document.text = text
        return document
This is because the Filter class implicitly accepts several arguments, one of which is p.
cleaner = Compose([
    document_filters.JSONLoader(key="text"),
    document_filters.AcceptJapanese(p=0.5),
    document_filters.JSONDumper()
])
The p argument passed to the document_filters.AcceptJapanese constructor determines the probability of applying the filter; with probability 1-p, it acts as an identity function. This behavior is defined in the parent class hojichar.Filter.
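The probabilistic behavior of p can be sketched independently of hojichar. The wrapper below is a hypothetical illustration of the mechanism, not the Filter base class itself:

```python
import random

class ProbabilisticFilter:
    """Applies `transform` with probability p; otherwise identity."""
    def __init__(self, transform, p=1.0):
        self.transform = transform
        self.p = p

    def apply(self, text):
        if random.random() < self.p:
            return self.transform(text)
        return text  # identity, taken with probability 1 - p

always = ProbabilisticFilter(str.upper, p=1.0)
never = ProbabilisticFilter(str.upper, p=0.0)
print(always.apply("abc"))  # -> ABC
print(never.apply("abc"))   # -> abc
```

Since random.random() returns a value in [0, 1), p=1.0 guarantees the transform runs and p=0.0 guarantees it never does; intermediate values apply it stochastically.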
Even though a Compose object behaves as a text-in, text-out function when called, Compose itself also inherits from the Filter class. Therefore, applying the apply method to a Compose object uses the hojichar.Document class as both input and output.
The Compose class behaves like a Filter. If you add a Compose object as one of the filters in the constructor of Compose, its filters will be unfolded recursively.
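The recursive unfolding can be illustrated with a minimal sketch (not hojichar's code): a Compose that splices a nested Compose's filters into its own flat list at construction time.

```python
class Filter:
    """Stand-in base class for the sketch."""
    pass

class Compose(Filter):
    def __init__(self, filters):
        self.filters = []
        for f in filters:
            if isinstance(f, Compose):
                # Unfold: inline the nested Compose's filters.
                self.filters.extend(f.filters)
            else:
                self.filters.append(f)

class A(Filter): pass
class B(Filter): pass

inner = Compose([A(), B()])
outer = Compose([A(), inner])
print([type(f).__name__ for f in outer.filters])  # -> ['A', 'A', 'B']
```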
You can access various statistics regarding the processing performed by Compose through Compose.statistics or Compose.statistics_obj.

Compose.statistics is a dictionary like the following:
{
  "total_info": {
    "processed_num": 10928,
    "discard_num": 5513,
    "input_MB": 104.514584,
    "output_MB": 25.33024,
    "cumulative_time": 114.071047143,
    "total_token_num": 0
  },
  "layers_info": [
    {
      "name": "0-JSONLoader",
      "discard_num": 0,
      "diff_MB": -1.9647932052612305,
      "cumulative_time": 0.420034328,
      "params": {
        "name": "JSONLoader",
        "p": 1,
        "skip_rejected": true,
        "key": "text",
        "ignore": true
      }
    },
    {
      "name": "1-DocumentNormalizer",
      "discard_num": 0,
      "diff_MB": -1.5221118927001953,
      "cumulative_time": 8.286988707,
      "params": {
        "name": "DocumentNormalizer",
        "p": 1,
        "skip_rejected": true
      }
    },
    {
      "name": "2-DocumentLengthFilter",
      "discard_num": 344,
      "diff_MB": -0.05566596984863281,
      "cumulative_time": 0.093768306,
      "params": {
        "name": "DocumentLengthFilter",
        "p": 1,
        "skip_rejected": true,
        "min_doc_len": 100,
        "max_doc_len": null
      }
    }
  ]
}
Compose.statistics_obj is a hojichar.StatsContainer class. The hojichar.StatsContainer class stores the raw values of the statistics dictionary, and addition operations are defined so you can easily total statistics from runs processed with the same filters. You can get the statistics dictionary by calling Compose.statistics_obj.get_human_readable_values().
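How addable statistics objects work can be sketched as follows. This stand-in mimics the described behavior of hojichar.StatsContainer with a few fields; the real class tracks more metrics:

```python
class StatsContainer:
    """Sketch of an addable statistics container (illustrative only)."""
    def __init__(self, processed_num=0, discard_num=0, cumulative_time=0.0):
        self.processed_num = processed_num
        self.discard_num = discard_num
        self.cumulative_time = cumulative_time

    def __add__(self, other):
        # Summing two containers totals statistics from runs that
        # used the same filter configuration.
        return StatsContainer(
            self.processed_num + other.processed_num,
            self.discard_num + other.discard_num,
            self.cumulative_time + other.cumulative_time,
        )

    def get_human_readable_values(self):
        return {
            "processed_num": self.processed_num,
            "discard_num": self.discard_num,
            "cumulative_time": round(self.cumulative_time, 3),
        }

total = StatsContainer(100, 10, 1.5) + StatsContainer(50, 5, 0.5)
print(total.get_human_readable_values())
# -> {'processed_num': 150, 'discard_num': 15, 'cumulative_time': 2.0}
```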
Parallel application of Compose

The hojichar.Parallel class allows for the application of Compose to an iterable of Document objects concurrently. This class empowers users to process vast collections of documents by harnessing the power of multiple CPU cores.
Example usage of the Parallel class to process a very large JSON Lines file concurrently:
import hojichar

input_file = "your_text.jsonl"
input_doc_iter = (hojichar.Document(line) for line in open(input_file))

cleaner = hojichar.Compose([
    hojichar.document_filters.JSONLoader(),
    hojichar.document_filters.DocumentNormalizer(),
    # Insert your filters
    hojichar.document_filters.JSONDumper(),
])

with hojichar.Parallel(cleaner, num_jobs=10) as pfilter:
    out_doc_iter = pfilter.imap_apply(input_doc_iter)
    with open("your_processed_text.jsonl", "w") as fp:
        for doc in out_doc_iter:
            fp.write(doc.text + "\n")
- Use the Parallel class within a with statement.
- Parallel.imap_apply(doc_iter) processes an iterator of Document objects and returns an iterator of the processed documents.
- For additional options of the Parallel class, please refer to the official documentation.

HojiChar provides CLI tools for the text preprocessing pipeline.
Users define a series of preprocessing steps in a Python file as a profile.
Example:
cat <your_text.jsonl> | hojichar -p your_preprocessing_profile.py -o your_text_preprocessed.jsonl
hojichar --help
usage: hojichar [-h] --profile <profile.py> [--args ARGS [ARGS ...]] [--output OUTPUT] [--input INPUT] [--dump-stats <path to stats.json>] [--exit-on-error] [--all] [--jobs JOBS]
options:
-h, --help show this help message and exit
--profile <profile.py>, -p <profile.py>
Path to a Python file that implements your custom filter.
--args ARGS [ARGS ...]
Pass additional arguments to the profile. Use it like `--args arg1 arg2` etc. The arguments should be space-separated.
--output OUTPUT, -o OUTPUT
Specifies the path for the output file. Defaults to standard output.
--input INPUT, -i INPUT
Specifies the path for the input file. Defaults to standard input. If set this path, the progress bar is enabled.
--dump-stats <path to stats.json>
Dump statistics to file. If the file exists, it will be appended.
--exit-on-error Exit if an exception occurs during filtering. Useful for debugging custom filters.
--all A flag that specifies whether to include discarded samples. This is useful when inspecting discarded samples.
  --jobs JOBS, -j JOBS The number of parallel jobs. By default, the number of CPU cores.
FILTER profile

A hojichar.Compose object must be defined as the FILTER variable in the profile.

Example:
import json
from hojichar import Compose, Filter
from hojichar.filters.document_filters import ExampleHojiChar, JSONLoader

class JSONDumper(Filter):
    def apply(self, document):
        text = document.text
        document.text = json.dumps({"text": text}, ensure_ascii=False)
        return document

# FILTER must be a Compose object.
FILTER = Compose(
    [
        JSONLoader(),
        ExampleHojiChar(),
        JSONDumper(),
    ]
)
Pass the texts to the filter you have defined using a pipe as follows.
cat <your_file> | hojichar -p example_profile.py
hojichar.utils.load_compose.load_filter_from_file() loads this type of profile.
FACTORY profile

A callable that returns a hojichar.Compose object must be defined as the FACTORY variable in the profile. The callable can receive arguments; in this way, parameters can be passed to the profile. FACTORY provides a mechanism to pass those values as arguments to the preprocessing.

Example:
import json
from hojichar import Compose, Filter
from hojichar.filters.document_filters import JSONLoader

class AddSomething(Filter):  # Concatenate some value after every document.
    def __init__(self, something: str, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)  # Remember to initialize the parent class.
        self.something = something

    def apply(self, document):
        text = document.text + self.something
        document.text = text
        return document

class JSONDumper(Filter):
    def apply(self, document):
        text = document.text
        document.text = json.dumps({"text": text}, ensure_ascii=False)
        return document

def callback(something):
    return Compose(
        [
            JSONLoader(),
            AddSomething(something),
            JSONDumper(),
        ]
    )

# FACTORY must be a callable that returns a Compose object.
FACTORY = callback
Using a FACTORY profile with arguments in the CLI:
cat <your_file> | hojichar -p example_profile.py --args arg1 arg2
hojichar.utils.load_compose.load_parametrized_filter_from_file() or load_factory_from_file loads this type of profile.
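The loading mechanism behind these helpers can be sketched with the standard library. The code below is an illustrative assumption of how a FACTORY profile may be loaded and parametrized, not hojichar's actual loader; to keep the sketch self-contained, the profile returns a plain function rather than a Compose object:

```python
import importlib.util
import os
import tempfile

# A minimal profile: FACTORY takes an argument and returns a pipeline.
PROFILE = '''
def FACTORY(suffix):
    def pipeline(text):
        return text + suffix
    return pipeline
'''

def load_factory(path):
    """Import a profile file as a module and return its FACTORY."""
    spec = importlib.util.spec_from_file_location("profile_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.FACTORY

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as fp:
    fp.write(PROFILE)
    path = fp.name

factory = load_factory(path)
pipeline = factory("!")       # the argument plays the role of `--args`
print(pipeline("hello"))      # -> hello!
os.unlink(path)
```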
To install the package from source, execute the following commands:
git clone https://github.com/HojiChar/HojiChar.git
cd HojiChar
poetry install
To install packages related to development, use:
poetry install --extras "dev lint test doc"
Some filters incorporate doctests. You can run these tests with the command:
pytest --doctest-modules .
This command should be executed from the root of the project.
Lint and format settings are defined in pyproject.toml. You can perform linting and formatting from the root of the project using the following commands.

Linting
poetry run task lint
Formatting
poetry run task format
We use Pdoc for building the documentation. You can build the documentation using the following command:
pdoc -o docs hojichar
Run this command from the project root.
In practice, the process of building the documentation is automated by CI. When a Pull Request is merged into the main branch, the documentation is built into the docs/ directory of the docs branch. This directory is then deployed to the official documentation site by GitHub Pages.
To create a source tarball, for instance, for packaging or distribution, run the following command:
poetry build
The tarball will be created in the dist directory. This command will compile the source code, and the resulting tarball can be installed with no additional dependencies other than the Python standard library.
This command is primarily used by the project manager to create a release and upload it to PyPI.
Versions uploaded to PyPI are identified by git tags. The __version__ variable in __init__.py and the version entry in pyproject.toml are ignored. The poetry-dynamic-versioning Poetry plugin is used to implement this process.
To add the plugin, use:
poetry self add "poetry-dynamic-versioning[plugin]"
The steps to push to PyPI are as follows, although in actuality, the process is automated by CI when a GitHub release is created from the tag.
git checkout v0.1.2
poetry config pypi-token.pypi <API TOKEN>
poetry build
poetry publish
The actual task for the manager is to apply the appropriate tag to the commit to be released and to create the release from GitHub:
git tag -a v0.1.2 -m "Version 0.1.2"
git push origin v0.1.2