Official Implementation of "COLLIE: Systematic Construction of Constrained Text Generation Tasks"
We propose the COLLIE framework for easy constraint structure specification, example extraction, instruction rendering, and model evaluation.
We recommend using Python 3.9 (Python 3.10 may currently be incompatible with certain dependencies).
To install in development mode (in cloned project directory):
pip install -e .
After installation, you can access the functionality through import collie.
We will add COLLIE to PyPI soon.
There are two main ways to use COLLIE:
The dataset used in the paper is at data/all_data.dill and can be loaded by
import dill

with open("data/all_data.dill", "rb") as f:
    all_data = dill.load(f)
all_data will be a dictionary with keys as the data source and constraint type, and values as a list of constraints. For example, all_data['wiki_c07'][0] is
{
'example': 'Black market prices for weapons and ammunition in the Palestinian Authority-controlled areas have been rising, necessitating outside funding for the operation.',
'targets': ['have', 'rising', 'the'],
'constraint': ...,
'prompt': "Please generate a sentence containing the word 'have', 'rising', 'the'.",
...
}
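To get a feel for this structure, the following minimal sketch counts entries per key and per data source. It runs on a small hand-built dictionary rather than the real file; the key 'guten_c01', its entries, and the helper name summarize are hypothetical, and the sketch assumes keys follow a '<source>_<constraint id>' pattern like 'wiki_c07'.

```python
from collections import Counter

# Hypothetical stand-in mirroring the structure shown above. The real
# dictionary comes from dill.load on data/all_data.dill.
sample_data = {
    "wiki_c07": [
        {
            "example": "Black market prices for weapons and ammunition in the "
                       "Palestinian Authority-controlled areas have been rising, "
                       "necessitating outside funding for the operation.",
            "targets": ["have", "rising", "the"],
        },
    ],
    # Made-up key and entries, for illustration only.
    "guten_c01": [
        {"example": "A first made-up sentence.", "targets": [4]},
        {"example": "A second one.", "targets": [3]},
    ],
}


def summarize(all_data):
    """Return the number of entries per key and per data source."""
    per_key = {key: len(entries) for key, entries in all_data.items()}
    per_source = Counter()
    for key, entries in all_data.items():
        # Assumption: keys look like '<source>_<constraint id>'.
        source = key.rsplit("_", 1)[0]
        per_source[source] += len(entries)
    return per_key, dict(per_source)


per_key, per_source = summarize(sample_data)
print(per_key)
print(per_source)
```

The same function should work on the loaded all_data as long as its values are lists, which is what the description above states.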
Reproducing the results reported in the paper:
logs/ folder
scripts/analysis.ipynb
python scripts/run_api_models.py and python scripts/run_gpu_models.py
The framework follows a 4-step process:
To specify a constraint, you need the following concepts, defined as classes in collie/constraints.py:
Level: deriving classes InputLevel (the basic unit of the input) and TargetLevel (the level for comparing to the target value); levels include 'character', 'word', 'sentence', etc.
Transformation: defines how the input text is modified into values comparable against the provided target value; it derives classes like Count, Position, ForEach, etc.
Logic: And, Or, All, which can be used to combine constraints
Relation: a relation such as '==' or 'in' for comparing against the target value
Reduction: when the target has multiple values, you need to specify how the transformed values from the input are reduced, such as 'all', 'any', 'at least'
Constraint: the main class for combining all the above specifications
To specify a constraint, you need to provide at least the TargetLevel, Transformation, and Relation.
They are wrapped in the c = Constraint(...) initialization. Once the constraint is specified, you can use c.check(input_text, target_value) to verify any given text and target tuple.
Below is an example of specifying a "number of words" constraint.
>>> from collie.constraints import Constraint, TargetLevel, Count, Relation
>>> # A very simple "number of words" constraint.
>>> c = Constraint(
...     target_level=TargetLevel('word'),
...     transformation=Count(),
...     relation=Relation('=='),
... )
>>> print(c)
Constraint(
    InputLevel(None),
    TargetLevel(word),
    Transformation('Count()'),
    Relation(==),
    Reduction(None)
)
Check out the guide to explore more examples.
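To make the division of labor between level, transformation, and relation concrete, here is a toy sketch in plain Python. It is not the implementation in collie/constraints.py: the names LEVEL_SPLITTERS, RELATIONS, and toy_check are hypothetical, the splitting rules are simplistic assumptions, and only a count transformation is modeled.

```python
import operator
import re

# Toy stand-ins, NOT the classes in collie/constraints.py.
# Level: how the text is split into units before counting.
LEVEL_SPLITTERS = {
    "character": lambda text: list(text),
    "word": lambda text: text.split(),
    # Naive sentence split on whitespace following ., ! or ?
    "sentence": lambda text: [
        s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s
    ],
}

# Relation: how the transformed value is compared with the target.
RELATIONS = {
    "==": operator.eq,
    # Assumption: 'in' means the value is a member of the target collection.
    "in": lambda value, target: value in target,
}


def toy_check(text, target, level="word", relation="=="):
    """Count units at the given level and compare the count to the target."""
    units = LEVEL_SPLITTERS[level](text)
    value = len(units)  # plays the role of a count transformation
    return RELATIONS[relation](value, target)


print(toy_check("This is a good sentence.", 5))
print(toy_check("This is a good sentence.", 4))
print(toy_check("One. Two! Three?", 3, level="sentence"))
print(toy_check("abc", [2, 3], level="character", relation="in"))
```

Keeping the level, the transformation, and the relation as separate pieces is what lets one definition cover many concrete constraints: swapping 'word' for 'sentence' changes the task without touching the comparison logic.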
Once the constraints are defined, you can extract examples that satisfy them from the data sources (e.g., Gutenberg, Wikipedia).
To download the necessary data files, including the "Gutenberg, dammit" corpus, to the data folder, run from the root project directory:
bash download.sh
Run extraction:
python -m collie.examples.extract
This will sweep over all constraints and data sources defined in collie/examples/. To add additional examples, you can add them to the appropriate Python files.
Extracted examples can be found in the folder sample_data. The files are named as: {source}_{level}.dill. The data/all_data.dill file is simply a concatenation of all these source-level dill files.
To render a constraint, simply run:
>>> from collie.constraint_renderer import ConstraintRenderer
>>> renderer = ConstraintRenderer(
...     constraint=c,  # Defined in step one
...     constraint_value=5
... )
>>> print(renderer.prompt)
Please generate a sentence with exactly 5 words.
To check constraint satisfaction, simply run:
>>> text = 'This is a good sentence.'
>>> print(c.check(text, 5))
True
>>> print(c.check(text, 4))
False
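When evaluating a model, the per-text check is typically aggregated over many generations. The sketch below computes a simple satisfaction rate for one constraint value. The helper names satisfaction_rate and word_count_check are hypothetical, and word_count_check is only a stand-in so the example runs without collie installed; in practice, c.check from the example above could be passed in its place, since it takes the same (text, target) arguments.

```python
def satisfaction_rate(check, generations, target):
    """Fraction of generated texts for which check(text, target) is true."""
    if not generations:
        raise ValueError("generations must not be empty")
    hits = sum(1 for text in generations if check(text, target))
    return hits / len(generations)


def word_count_check(text, target):
    # Stand-in for c.check: whitespace-separated word count equals target.
    return len(text.split()) == target


# Made-up model outputs for the prompt "generate a sentence with exactly 5 words".
outputs = [
    "This is a good sentence.",    # 5 words
    "Too short.",                  # 2 words
    "Five words are right here.",  # 5 words
]

rate = satisfaction_rate(word_count_check, outputs, 5)
print(f"{rate:.2f}")
```

Passing the checker in as an argument keeps the aggregation independent of how a constraint is defined, so the same loop can be reused across constraint types.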
Please cite our paper if you use COLLIE in your work:
@inproceedings{yao2023collie,
title = {COLLIE: Systematic Construction of Constrained Text Generation Tasks},
author = {Yao, Shunyu and Chen, Howard and Wang, Austin and Yang, Runzhe and Narasimhan, Karthik},
booktitle = {ArXiv},
year = {2023},
html = {}
}
MIT. Note that this is the license for our code; each data source retains its own license.