cheminftools

A collection of tools for daily cheminformatics tasks.

0.1.4

PyPI

Maintainers: 1

Installation

From GitHub repo: pip install git+https://github.com/marcossantanaioc/cheminftools.git

From PyPI: pip install cheminftools

How to use

cheminftools offers a collection of cheminformatics scripts for daily tasks. Currently supported tasks include:

1 - Standardization of chemical structures

2 - Calculation of molecular descriptors

3 - Filtering datasets using your own or predefined alerts (e.g. PAINS, Dundee, Glaxo)

Standardization

A dataset of molecules can be standardized in just 1 line of code!

import pandas as pd
import numpy as np
from cheminftools.tools.sanitizer import MolCleaner
from cheminftools.tools.featurizer import MolFeaturizer
from cheminftools.tools.filtering import MolFilter
from rdkit import Chem
import json

data = pd.read_csv('../data/example_data.csv')

Sanitizing

The MolCleaner class performs sanitization tasks, including:

    1. Standardize unknown stereochemistry (Handled by the RDKit Mol file parser)
        i) Fix wiggly bonds on sp3 carbons - sets atoms and bonds marked as unknown stereo to no stereo
        ii) Fix wiggly bonds on double bonds – set double bond to crossed bond
    2. Clear S-group data from the mol file
    3. Kekulize the structure
    4. Remove H atoms (See the page on explicit Hs for more details)
    5. Normalization:
        Fix hypervalent nitro groups
        Fix KO to K+ O- and NaO to Na+ O- (Also add Li+ to this)
        Correct amides with N=COH
        Standardise sulphoxides to charge separated form
        Standardize diazonium N (atom :2 here: [*:1]-[N;X2:2]#[N;X1:3]>>[*:1]) to N+
        Ensure quaternary N is charged
        Ensure trivalent O ([*:1]=[O;X2;v3;+0:2]-[#6:3]) is charged
        Ensure trivalent S ([O:1]=[S;D2;+0:2]-[#6:3]) is charged
        Ensure halogen with no neighbors ([F,Cl,Br,I;X0;+0:1]) is charged
    6. The molecule is neutralized, if possible. See the page on neutralization rules for more details.
    7. Remove stereo from tartrate to simplify salt matching
    8. Normalise (straighten) triple bonds and allenes
    
    
    
    The curation steps from the ChEMBL structure pipeline are augmented with additional steps to identify duplicated entries:
    9. Find stereo centers
    10. Generate inchi keys
    11. Find duplicated SMILES. If the same SMILES is present multiple times, two outcomes are possible.
        i. The same compound (e.g. same ID and same SMILES)
        ii. Isomers with different SMILES, IDs and/or activities
        
        In case i), the compounds are merged by taking the median values of all numeric columns in the dataframe.
        In case ii), the compounds are further classified as 'to merge' or 'to keep' depending on the activity values.
            a) Compounds are considered for merging ('to merge') if the difference in activities is less than 1 log unit.
            b) Compounds are kept as individual entries ('to keep') if the difference in activities is larger than 1 log unit. In this case, the user can
            select which compound to keep: the one with the highest or the lowest activity.
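
The duplicate-handling rules above can be sketched in plain Python. This is a minimal illustration, not MolCleaner's actual implementation: the function name `resolve_duplicates`, its `threshold` and `keep` parameters, and the use of the max-min spread as the "difference in activities" are assumptions made for the example. The text above does not say how the merged value is computed in the 'to merge' case, or what happens at exactly 1 log unit; the sketch uses the median (by analogy with case i) and treats a spread of exactly 1 as 'to keep'.

```python
from statistics import median


def resolve_duplicates(activities, threshold=1.0, keep="highest"):
    """Classify one group of duplicated entries and return (label, value).

    activities: activity values (e.g. pIC50) recorded for the same structure.
    threshold:  spread, in log units, below which the group is merged.
    keep:       which value to retain when the group is not merged.
    """
    if keep not in ("highest", "lowest"):
        raise ValueError("keep must be 'highest' or 'lowest'")
    spread = max(activities) - min(activities)
    if spread < threshold:
        # Assumption: merged entries are summarized by their median.
        return "to merge", median(activities)
    # Assumption: a spread of exactly `threshold` falls in the 'to keep' case.
    chosen = max(activities) if keep == "highest" else min(activities)
    return "to keep", chosen


close = resolve_duplicates([6.0, 6.5, 6.25])          # spread of 0.5 log units
far = resolve_duplicates([5.0, 7.5], keep="lowest")   # spread of 2.5 log units
```

Here `close` falls in the 'to merge' case and `far` in the 'to keep' case, with the lowest activity retained.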

processed_data = MolCleaner.from_df(data, smiles_col='smiles', act_col='pIC50', id_col='molecule_chembl_id')

+-------------------------------------------------------------+-------------------------------------------------------------+
|                      processed_smiles                       |                           smiles                            |
+=============================================================+=============================================================+
|       N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12       |       N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12       |
+-------------------------------------------------------------+-------------------------------------------------------------+
|       COc1cccc(-c2cn(-c3ccc(CNCCO)cc3)c3ncnc(N)c23)c1       |       COc1cccc(-c2cn(-c3ccc(CNCCO)cc3)c3ncnc(N)c23)c1       |
+-------------------------------------------------------------+-------------------------------------------------------------+
| Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 | Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 |
+-------------------------------------------------------------+-------------------------------------------------------------+
|               C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1                |               C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1                |
+-------------------------------------------------------------+-------------------------------------------------------------+
| Cc1cc2cc(Nc3ccnc4cc(-c5ccc(CNCCN6CCNCC6)cc5)sc34)ccc2[nH]1  | Cc1cc2cc(Nc3ccnc4cc(-c5ccc(CNCCN6CCNCC6)cc5)sc34)ccc2[nH]1  |
+-------------------------------------------------------------+-------------------------------------------------------------+

Filtering

The MolFilter class removes compounds that match defined substructural alerts. The AlertMatcher class can be used to generate your own catalog of alerts from a dictionary. You can also use catalogs from RDKit, such as the PAINS catalog.

The example below shows how to create an alerts catalog starting from a CSV file of alerts, restricted to the Glaxo rule set.

Load alerts and prepare dictionary

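# Load the alert collection and keep only the Glaxo rule set
alerts_df = pd.read_csv('../data/libraries/alert_collection.csv')
alerts_df = alerts_df[alerts_df['rule_set_name'] == 'Glaxo']
# Rename without inplace=True to avoid modifying a filtered slice
alerts_df = alerts_df.rename(columns={'smarts': 'SMARTS'})
# Index by alert description, then convert to {description: {column: value}}
alerts_df_reindex = alerts_df[['description', 'SMARTS', 'rule_set_name', 'priority', 'max_matches']].set_index('description')
alerts_dict = alerts_df_reindex.to_dict(orient='index')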
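
For reference, `to_dict(orient='index')` yields a nested dictionary keyed by the index (here the alert description), with one inner dictionary of column values per alert. The sketch below builds the same shape using only the standard library, with two Glaxo alerts that appear in the filtering results of this README. The `priority` and `max_matches` values are placeholders invented for the example, not values taken from `alert_collection.csv`.

```python
import csv
import io

# Two alerts from the filtering output in this README. The priority and
# max_matches values are placeholders, not the real values from the CSV file.
raw = io.StringIO(
    "description,SMARTS,rule_set_name,priority,max_matches\n"
    "R17 acylhydrazide,[N;R0][N;R0]C(=O),Glaxo,1,0\n"
    "R21 Nitroso,[N&D2](=O),Glaxo,1,0\n"
)

alerts_dict = {}
for row in csv.DictReader(raw):
    description = row.pop("description")          # becomes the outer key
    row["priority"] = int(row["priority"])        # csv yields strings
    row["max_matches"] = int(row["max_matches"])
    alerts_dict[description] = dict(row)

# alerts_dict["R21 Nitroso"] now maps column names to that alert's values,
# e.g. alerts_dict["R21 Nitroso"]["SMARTS"] is "[N&D2](=O)".
```

Note that SMARTS patterns containing commas, such as `[Br,Cl,I][CX4;CH,CH2]`, would need to be quoted in a CSV file.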

Create matcher object from dict

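# Note: AlertMatcher is not imported in the setup snippet above; import it from
# the cheminftools module that defines it before running this step.
matcher = AlertMatcher(alerts_dict)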
catalog = matcher.create_matcher()

Run filtering

alerts_data = MolFilter.from_df(df=processed_data, smiles_column='processed_smiles', catalog=catalog)

+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
|                                  smiles                                    |         SMARTS        |          alert_name       |   rule_set_name  |
+============================================================================+=======================+===========================+==================+
|        Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1         |   [N;R0][N;R0]C(=O)   |     R17 acylhydrazide     |      Glaxo       |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
|               O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1                | [Br,Cl,I][CX4;CH,CH2] | R1 Reactive alkyl halides |      Glaxo       |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
|               O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1                |   [N;R0][N;R0]C(=O)   |     R17 acylhydrazide     |      Glaxo       |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
|               O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1                |      [N&D2](=O)       |        R21 Nitroso        |      Glaxo       |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+
| CS(=O)(=O)O[C@H]1CN[C@H](C#Cc2cc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3s2)C1 |   COS(=O)(=O)[C,c]    |      R5 Sulphonates       |      Glaxo       |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+

Featurization

The MolFeaturizer class converts SMILES into molecular descriptors. The current version supports Morgan fingerprints, atom pairs, torsion fingerprints, RDKit fingerprints, 200 constitutional descriptors, and MACCS keys.

fingerprinter = MolFeaturizer('rdkit2d')

X = fingerprinter.transform(processed_data['processed_smiles'])

X[:5, :5]
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=uint8)

Collecting data from ChEMBL

The current version of cheminftools supports queries to ChEMBL based on UniProt accession codes. The ChemblFetcher class retrieves activity data for one or more targets. Users can find the latest version of ChEMBL (and older ones) on the official page. Installation instructions are included with each ChEMBL release.

ChemblFetcher expects a configuration file for the database. This file includes the host, user, password and port needed to connect to the database. An example is shown below:

[postgresql]
host = localhost
database = customer
user = postgres
password = admindb
port = 5432
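
How ChemblFetcher parses this file internally is not shown here. As a minimal sketch, a file in this format can be read and sanity-checked with Python's standard `configparser` module before its path is passed to ChemblFetcher. The helper name `read_db_config` is hypothetical and not part of cheminftools.

```python
from configparser import ConfigParser


def read_db_config(text, section="postgresql"):
    """Parse INI-formatted text and return the connection settings as a dict."""
    # interpolation=None keeps '%' characters in passwords from being
    # interpreted as configparser interpolation syntax.
    parser = ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found")
    params = dict(parser.items(section))
    params["port"] = int(params["port"])  # values are read as strings
    return params


example = """
[postgresql]
host = localhost
database = customer
user = postgres
password = admindb
port = 5432
"""

params = read_db_config(example)
```

To check a file on disk instead, the same parser can load it with `parser.read('database.ini')`.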

from cheminftools.data.data_gather import ChemblFetcher

target_uniprot = ['P00742', 'P50613']
chembl = ChemblFetcher(database_config_filename='database.ini',  # Path to configuration file. You can find an example in the cheminftools.examples folder
                       database_name='chembl',  # Name of database
                       version='32')  # ChEMBL version to use
df = chembl.query_target_uniprot(target_uniprot=target_uniprot)

The output is a pandas DataFrame with the desired activity types (e.g. IC50, Kd, Ki) for each target in target_uniprot.

Keywords

cheminformatics computational chemistry rdkit
