CCSD (Combinatorial Complex Score-based Diffusion) is a sophisticated score-based diffusion model designed to generate Combinatorial Complexes using Stochastic Differential Equations. This cutting-edge approach enables the generation of complex objects with higher-order structures and relations, thereby enhancing our ability to learn underlying distributions and produce more realistic objects.
Complex object generation is a challenging problem with applications in various fields such as drug discovery. The CCSD model offers a novel approach to tackle this problem by leveraging Diffusion Models and Stochastic Differential Equations to generate Combinatorial Complexes (CC). This topological structure generalizes the different mathematical structures used in Topological/Geometric Deep Learning to represent complex objects with higher-order structures and relations. Integrating the higher-order domain during generation improves the learning of the underlying data distribution and thus allows for better data generation.
If you find this project interesting, we would appreciate your support by leaving a star ⭐ on this GitHub repository.
Code still in Alpha version!
CCSD stands out from traditional complex object generation models due to the following key advantages:
Combinatorial Complexes: The model generates Combinatorial Complexes, enabling the synthesis of complex objects with rich structures and relationships.
Score-Based Diffusion: CCSD utilizes score-based diffusion techniques, allowing for efficient, high-quality and state-of-the-art complex object generation.
Enhanced "Realism": By incorporating higher-order structures, the generated objects are more representative of the underlying data distribution.
This repository is also extensively documented and commented, which makes it easy to use, understand, deploy, and extend.
The research has been conducted by Adrien Carrel as part of his requirements for the MSc degree in Advanced Computing of Imperial College London, United Kingdom, and his requirements for the MEng in Applied Mathematics (Diplôme d'Ingénieur) at CentraleSupélec, France.
This project has been supervised by Dr. Tolga Birdal, Assistant Professor (Lecturer) in the Department of Computing of Imperial College London.
We welcome new contributors with various backgrounds and programming levels who would like to contribute to the fields of diffusion models and topological deep learning. Feel free to suggest new ideas, submit pull requests, etc.
Feel free to check our Code of Conduct if you wish to contribute.
If you encounter an error during the installation, please refer to the section Common errors below. If you are creating an Ubuntu instance on a public cloud service to train or sample from the model, you may want to use the provided post_installation_script.sh script to automate the process (just modify the Git configurations section inside the script with your details).
To get started with CCSD, you can install the package using pip by typing the command:
pip install ccsd
If you encounter an error, if you want to use the latest version, or if you prefer the command line interface, you can use CCSD locally by cloning or forking this repository to your local machine.
git clone https://github.com/AdrienC21/CCSD.git
For Windows users, the recommended version of kaleido is 0.1.0.post1. You can install it by typing:
pip install kaleido==0.1.0.post1
For Linux users, the latest version of kaleido seems fine. For those who want to use orca, you can install it via the npm command of Node.js. For the installation, type:
sudo NEEDRESTART_MODE=a apt install -y nodejs
sudo NEEDRESTART_MODE=a apt install -y npm
npm install -g electron@6.1.4 orca
Install the dependencies (see the section Dependencies below).
When installing PyTorch and its components, or when installing TopoModelX along with TopoNetX, run the commands below:
pip install torch==2.0.1 --extra-index-url https://download.pytorch.org/whl/${CUDA}
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.1+${CUDA}.html
where ${CUDA} could be cu117, cu118, or cpu if you want to use the CPU. For GPU, we recommend cu118. Also, TopoModelX should be installed after TopoNetX to avoid versioning issues.
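The ${CUDA} substitution described above can be sketched in Python as follows. This is a minimal illustration, not part of CCSD: the helper name torch_install_commands and the constants are hypothetical, and the URLs are copied from the two pip commands above.

```python
# Hypothetical helper (not part of CCSD): builds the two pip commands
# quoted above for a given ${CUDA} tag.
SUPPORTED_CUDA_TAGS = ("cu117", "cu118", "cpu")
TORCH_VERSION = "2.0.1"


def torch_install_commands(cuda_tag="cu118"):
    """Return the pip commands for PyTorch and its companion packages."""
    if cuda_tag not in SUPPORTED_CUDA_TAGS:
        raise ValueError(
            f"Unknown CUDA tag {cuda_tag!r}; expected one of {SUPPORTED_CUDA_TAGS}"
        )
    return [
        f"pip install torch=={TORCH_VERSION} "
        f"--extra-index-url https://download.pytorch.org/whl/{cuda_tag}",
        "pip install torch-scatter torch-sparse "
        f"-f https://data.pyg.org/whl/torch-{TORCH_VERSION}+{cuda_tag}.html",
    ]


if __name__ == "__main__":
    # Print the CPU-only variant; the default argument is the GPU tag
    # recommended in the text (cu118).
    for command in torch_install_commands("cpu"):
        print(command)
```

Printing the commands rather than running them keeps the sketch side-effect free; they can then be pasted into a shell.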
If you are using a Linux system, you may need to install libxrender1 by typing:
sudo apt-get install libxrender1
To test your installation, refer to the section Testing below.
If you encounter an error, please refer to the section Common errors below.
CCSD requires a recent version of Python: 3.7 will probably work, but 3.10 or higher is preferred.
It also requires the following dependencies:
Please make sure you have the required dependencies installed before using CCSD.
You can install all of them by running the command:
pip install -r requirements.txt
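Before installing the dependencies, the Python version guidance above can be checked with a few lines of Python. This is only a sketch: the function name python_support_level and the three labels are made up for illustration, and the 3.7 lower bound reflects the tentative wording of the text rather than a tested minimum.

```python
import sys

# Thresholds taken from the text above; 3.7 is stated only tentatively there.
MINIMUM_VERSION = (3, 7)
PREFERRED_VERSION = (3, 10)


def python_support_level(version=None):
    """Classify a (major, minor) version as 'too old', 'may work' or 'preferred'."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MINIMUM_VERSION:
        return "too old"
    if major_minor < PREFERRED_VERSION:
        return "may work"
    return "preferred"


if __name__ == "__main__":
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {python_support_level()}")
```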
To ensure the correctness and robustness of CCSD and to allow researchers to build upon this tool, we have provided an extensive test suite. To run the tests, clone the repository, change your directory to the root folder of this project and execute the following command:
pytest tests/ -W ignore::DeprecationWarning
If you encounter an error during testing, please refer to the section Common errors below.
The output should look like this:
==================================================== test session starts ====================================================
...
tests\utils\test_mol_utils.py .................. [ 98%]
tests\utils\test_time_utils.py .. [100%]
============================================== 128 passed in 70.03s (0:01:10) ===============================================
For more information about the scripts below and their arguments, you can type:
python <script_path> --help
If you want to use molecular datasets such as QM9 or ZINC250k, you first need to run the two following commands:
python ccsd/data/preprocess.py --dataset <dataset_name> --folder <folder_name>
python ccsd/data/preprocess_for_nspdk.py --dataset <dataset_name> --folder <folder_name>
<dataset_name> has to be chosen from this list: ["QM9", "ZINC250k"]. <folder_name> is the location of the data folder that contains the datasets (defaults to ./).
If you want to generate generic graphs datasets, you can run the following command:
python ccsd/data/data_generators.py --dataset <dataset_name> --folder <folder_name>
<dataset_name> has to be chosen from this list: ["community_small", "grid", "ego_small", "ENZYMES"]. <folder_name> is the location of the data folder that will contain the generated dataset (defaults to ./).
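The dataset commands in this section can be assembled programmatically, for instance to prepare several datasets in a row. The sketch below is not part of CCSD; the helper dataset_commands is hypothetical, while the script paths, flags, and dataset names are the ones listed above.

```python
import sys

# Dataset names copied from the two lists above.
MOLECULAR_DATASETS = ("QM9", "ZINC250k")
GENERIC_DATASETS = ("community_small", "grid", "ego_small", "ENZYMES")


def dataset_commands(dataset, folder="./"):
    """Return the command(s), as argument lists, needed to prepare a dataset."""
    if dataset in MOLECULAR_DATASETS:
        # Molecular datasets need the two preprocessing scripts, in this order.
        scripts = ["ccsd/data/preprocess.py", "ccsd/data/preprocess_for_nspdk.py"]
    elif dataset in GENERIC_DATASETS:
        scripts = ["ccsd/data/data_generators.py"]
    else:
        raise ValueError(f"Unknown dataset {dataset!r}")
    return [
        [sys.executable, script, "--dataset", dataset, "--folder", folder]
        for script in scripts
    ]


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    for name in ("QM9", "grid"):
        for command in dataset_commands(name):
            print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) from the root of the cloned repository.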
For more complete documentation of all the classes and functions, please refer to the Documentation page (link in the section below).
To use CCSD, follow the steps outlined below:
Edit your general configurations:
Edit the file config\general_config.py to provide your wandb (Weights & Biases) information (if you want to use wandb), your timezone, and some other general parameters.
Edit the engine value with the plotly engine you want to use. For Windows users, we recommend kaleido (see the installation instructions above). For Linux users, we recommend orca (see the installation instructions above).
Execute the code:
You can either use the command line or directly the CCSD class (more information in the subsections below).
Combinatorial Complexes:
To generate combinatorial complexes rather than graphs, set the parameter is_cc to True in your configuration files, specify the d_min and d_max parameters of your dataset (see the thesis for more information), and pick a ScoreNetwork model for the rank2 score predictions (see the example configurations).
Figures and output:
The figures, graphs, combinatorial complexes, molecules, etc. will all be saved into a samples folder and into the log folders logs_sample and logs_train.
The output in the command line should look like the example below: a logo is printed, followed by the current experiment information and then the training/sampling information:
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP5JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJYPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP? 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPP5YY5PPPPPPPPPP5YY5PPPPPPPPPPPPPPPPPP? .^!7?????7!^. 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPP5^. ~JYYY55557^....^?PPPPPPPPPPPPPPPP? :!J5PPPPPPPPPPP5J!. 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPP? :^ ~^ ~PPPPPPPPPPPPPPP? .?5PPPPPPPPPPPPPPPPP5! 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPP?^:^?J??777?.. .:5PPPPPPPPPPPPPP? ^5PPPPPP5?~^^~!?5PPPY!: 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPG5^:^. ^: :75PPPPPPPPPPPP? :5PPPPPP7. .7!: 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPP5!. !?^^~JJ~ .^!^^^~75PPPPP? 7PPPPPPJ 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPY!. ~YPPPPPPPPY^ .?PPPP? JPPPPPP! 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPP5557. ~YPPPPPPPPPPJ :PPPP? !PPPPPPJ 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPJ~....:~YPPPPPPPPPPPP5: 7PPPP? .YPPPPPP?. :?7^. 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPJ :5PPPPPPPPPPPPP57^...:~JPPPPP? .JPPPPPP5?!~~~!?5PPP5J!. 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPJ .5PPPPPPPPPPPPPPPP555PPPPPPPP? !5PPPPPPPPPPPPPPPPP57 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPJ^...:~YPPPPPPPPPPPPPPPPPPPPPPPPPPPP? .~?5PPPPPPPPPPP5J~. 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPP5555PPPPPPPPPPPPPPPPPPPPPPPPPPPPPP? .:~!7??77!~:. 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP? 7PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP5YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY5PPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP _____ _____ _____ _____ PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP / ____/ ____|/ ____| __ \ PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP | | | | | (___ | | | | PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP | | | | \___ \| | | | PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP | |___| |____ ____) | |__| | PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP \_____\_____|_____/|_____/ PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
----------------------------------------------------------------------------------------------------
Current experiment:
type: train
config: qm9
comment:
seed: 42
...
Help:
python main.py --help
Train a model:
Define your own config or use one from the config folder.
Use the command:
python main.py --type train --config <config_name>
where <config_name> should be the name of your configuration file.
Sample from a model:
Define your own sample config or use one from the config folder.
Use the command:
python main.py --type sample --config <config_name>
where <config_name> should be the name of your sampling configuration file.
Other:
--comment COMMENT: A single line comment for the experiment
--seed SEED: Random seed for reproducibility
--folder FOLDER: Directory to save the results, load checkpoints, load config, etc.
To choose which GPUs are used, set CUDA_VISIBLE_DEVICES before the command. For example, to use GPUs 0 and 1:
CUDA_VISIBLE_DEVICES=0,1 python main.py --type <type> --config <config_name>
or, to use only GPU 0:
CUDA_VISIBLE_DEVICES=0 python main.py --type <type> --config <config_name>
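The command-line options and the CUDA_VISIBLE_DEVICES prefix above can also be combined from a Python launcher script. The sketch below is an illustration under assumptions: main_command and gpu_environment are hypothetical helpers, not functions provided by CCSD, and they only use the flags documented above.

```python
import os
import sys


def main_command(run_type, config, comment=None, seed=None, folder=None):
    """Build the argument list for `python main.py` with the documented flags."""
    command = [sys.executable, "main.py", "--type", run_type, "--config", config]
    for flag, value in (("--comment", comment), ("--seed", seed), ("--folder", folder)):
        if value is not None:
            command += [flag, str(value)]
    return command


def gpu_environment(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    return env


if __name__ == "__main__":
    # Show what would be launched without actually starting a run.
    print(" ".join(main_command("train", "qm9", seed=42)))
    print("CUDA_VISIBLE_DEVICES=" + gpu_environment([0, 1])["CUDA_VISIBLE_DEVICES"])
```

To launch the run, the two pieces can be passed to subprocess.run(main_command("train", "qm9"), env=gpu_environment([0, 1]), check=True) from the repository root.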
from ccsd.diffusion import CCSD

params = {
    "type": "train",
    "config": "qm9_CC",
    "folder": "./",  # optional
    "comment": "test experiment",  # optional
    "seed": 42,  # optional
}
diffusion_model = CCSD(**params)  # define the object
diffusion_model.run()  # run the experiment
from ccsd import CCSD
help(CCSD)
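The CCSD class call shown above can be repeated over several seeds, for example to check how stable training is. This is a minimal sketch under assumptions: seed_sweep_params and run_sweep are hypothetical helpers, and the only CCSD calls used are the constructor and run() exactly as in the snippet above.

```python
def seed_sweep_params(config, seeds, run_type="train", folder="./"):
    """Build one parameter dictionary per seed, in the format shown above."""
    return [
        {
            "type": run_type,
            "config": config,
            "folder": folder,
            "comment": f"seed sweep, seed={seed}",
            "seed": seed,
        }
        for seed in seeds
    ]


def run_sweep(param_sets):
    """Run one experiment per parameter dictionary, one after the other."""
    # Imported lazily so that seed_sweep_params works without ccsd installed.
    from ccsd.diffusion import CCSD

    for params in param_sets:
        CCSD(**params).run()
```

Calling run_sweep(seed_sweep_params("qm9_CC", seeds=[42, 43, 44])) would then start three consecutive training runs that differ only in their seed and comment.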
Here is the link to the documentation of this library: https://ccsd.readthedocs.io/en/latest/. It contains more information regarding all the classes and functions of this package.
If you encounter an error during the installation of MOSES, please follow the steps below:
First, if you are on Windows, make sure you have installed the C++ Dev kit via Visual Studio 2022 community.
Then, for all users, install rdkit, Cython, and pomegranate using the commands:
pip install rdkit
pip install Cython
pip install pomegranate
Finally, either install MOSES directly using:
pip install molsets
Or, first install Git LFS by running:
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git-lfs install
And then use one of the two options below to install MOSES:
1 - via git
git lfs install
pip install git+https://github.com/molecularsets/moses.git
2 - install it manually (if the previous steps don't work):
git lfs install
git clone https://github.com/molecularsets/moses.git
cd moses
python setup.py install
The errors related to packages (MOSES, TopoNetX, etc.) can be fixed by running the script apply_fixes.py located in .github/workflows/ or by following the steps for each individual package below:
If you get an error related to the .append method, which no longer exists in Pandas but is still used in the MOSES package, replace the file utils.py in the MOSES package with the one provided in this repository: fixes\utils.py. The MOSES utils.py file should be located somewhere like:
C:\Users\<username>\miniconda3\lib\site-packages\moses\metrics\utils.py
or miniconda\lib\python3.11\site-packages\molsets-1.0-py3.11.egg\moses\metrics\utils.py
More precisely the error should look like this:
_mcf.append(_pains, sort=True)['smarts'].values]
...
AttributeError: 'DataFrame' object has no attribute 'append'
The fix is to replace this line (line 24) with:
pd.concat([_mcf, _pains], sort=True)['smarts'].values]
Remark: The entire operation can be done through the command line by using sed:
sed -i "24s/.*/\t\t\tpd.concat([_mcf, _pains], sort=True)[\o047smarts\o047].values]/" <path_to_moses_utils.py>
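On systems without sed (for example Windows), the same single-line replacement can be sketched in Python. The helper replace_line below is hypothetical and not part of CCSD or MOSES; it mirrors the sed command above by overwriting one line, identified by its 1-based number, and it is worth backing up the file before applying it.

```python
from pathlib import Path

# Replacement text taken from the sed command above (three leading tabs).
PATCHED_LINE = "\t\t\tpd.concat([_mcf, _pains], sort=True)['smarts'].values]"


def replace_line(path, line_number, new_line):
    """Overwrite one line (1-based) of a text file and return the old content."""
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf-8").splitlines(keepends=True)
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"{file_path} has no line {line_number}")
    old_line = lines[line_number - 1]
    # Keep the trailing newline of the original line, if it had one.
    ending = "\n" if old_line.endswith("\n") else ""
    lines[line_number - 1] = new_line + ending
    file_path.write_text("".join(lines), encoding="utf-8")
    return old_line.rstrip("\r\n")
```

With the path of the MOSES utils.py file in hand, replace_line(path_to_moses_utils, 24, PATCHED_LINE) would apply the fix; checking that the returned old line is the `_mcf.append(...)` line confirms that the right line was replaced.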
Install an older version of TopoNetX, for example by running the command:
pip install git+https://github.com/pyt-team/TopoNetX.git@a389bd8bb11c731bb98d79da8392e3396ea9db8c
Then, replace the file combinatorial_complex.py of TopoNetX with the updated one provided in this repository: fixes\combinatorial_complex.py. The file in TopoNetX should be located somewhere like:
C:\Users\<username>\miniconda3\Lib\site-packages\toponetx\classes\combinatorial_complex.py
or miniconda\lib\python3.11\site-packages\toponetx\classes\combinatorial_complex.py
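Because the install locations above vary between systems, the files to replace can be located from Python instead of guessed. The sketch below uses only the standard library; the helper locate_in_package is hypothetical, and it looks up the top-level package without importing it, so it still works when importing the package itself fails.

```python
import importlib.util
from pathlib import Path


def locate_in_package(package, *relative_parts):
    """Return the path of a file inside an installed package, or None."""
    # For a top-level name, find_spec does not import the package.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location).joinpath(*relative_parts)
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Relative paths taken from the example locations above.
    print(locate_in_package("moses", "metrics", "utils.py"))
    print(locate_in_package("toponetx", "classes", "combinatorial_complex.py"))
```

A result of None means the package is not installed in the current environment or the file is not at the expected relative path.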
If you use CCSD in your research or work, please consider citing it using the following BibTeX entry:
Carrel, A. (2023). CCSD - Combinatorial Complex Score-based Diffusion model using stochastic differential equations. (Version 1.0.0) [Computer software]. https://github.com/AdrienC21/CCSD
The Laboratory for Computational Physiology (LCP) at the Massachusetts Institute of Technology (MIT) for partially hosting me during the writing of my thesis.
Dr. Tolga Birdal for his supervision and for the valuable advice and resources that he provided.
Dr. Mustafa Hajij and the members of the pyt-team for the package TopoNetX and their work on Combinatorial Complexes and Topological Deep Learning: Topological Deep Learning: Going Beyond Graph Data.
Jo, J. et al. for their paper Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations, which served as a baseline for both the theoretical and empirical parts of this work.
All my friends and my family for the support.
Logo created by me using the icon: "topology" icon by VectorsLab from Noun Project CC BY 3.0.
See the change log for a history of all changes to CCSD.
CCSD is licensed under the MIT License. Feel free to use and modify the code as per the terms of the license.
Debug folder structure when calling some functions
Add example script and notebook
Debugging, additional plots
More configuration options with more datasets
New scripts (Tanimoto similarity, plot datasets)