selfies - npm Package Compare versions

Comparing version 2.1.1 to 2.1.2
LICENSE
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2019 Mario Krenn
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
import faulthandler
import pathlib
import random

import pandas as pd
import pytest
from rdkit import Chem

import selfies as sf

faulthandler.enable()

TEST_SET_DIR = pathlib.Path(__file__).parent / "test_sets"
ERROR_LOG_DIR = pathlib.Path(__file__).parent / "error_logs"
ERROR_LOG_DIR.mkdir(exist_ok=True, parents=True)

datasets = list(TEST_SET_DIR.glob("**/*.csv"))


@pytest.mark.parametrize("test_path", datasets)
def test_roundtrip_translation(test_path, dataset_samples):
    """Tests SMILES -> SELFIES -> SMILES translation on various datasets.
    """
    # very relaxed constraints
    constraints = sf.get_preset_constraints("hypervalent")
    constraints.update({"P": 7, "P-1": 8, "P+1": 6, "?": 12})
    sf.set_semantic_constraints(constraints)

    error_path = ERROR_LOG_DIR / "{}.csv".format(test_path.stem)
    with open(error_path, "w+") as error_log:
        error_log.write("In, Out\n")

    error_data = []
    error_found = False

    n_lines = sum(1 for _ in open(test_path)) - 1
    n_keep = dataset_samples if (0 < dataset_samples <= n_lines) else n_lines
    skip = random.sample(range(1, n_lines + 1), n_lines - n_keep)
    reader = pd.read_csv(test_path, chunksize=10000, header=0, skiprows=skip)

    for chunk in reader:
        for in_smiles in chunk["smiles"]:
            in_smiles = in_smiles.strip()

            mol = Chem.MolFromSmiles(in_smiles, sanitize=True)
            if (mol is None) or ("*" in in_smiles):
                continue

            try:
                selfies = sf.encoder(in_smiles, strict=True)
                out_smiles = sf.decoder(selfies)
            except (sf.EncoderError, sf.DecoderError):
                error_data.append((in_smiles, ""))
                continue

            if not is_same_mol(in_smiles, out_smiles):
                error_data.append((in_smiles, out_smiles))

        with open(error_path, "a") as error_log:
            for entry in error_data:
                error_log.write(",".join(entry) + "\n")

        error_found = error_found or error_data
        error_data = []

    sf.set_semantic_constraints()  # restore constraints

    assert not error_found


def is_same_mol(smiles1, smiles2):
    try:
        can_smiles1 = Chem.CanonSmiles(smiles1)
        can_smiles2 = Chem.CanonSmiles(smiles2)
        return can_smiles1 == can_smiles2
    except Exception:
        return False
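The subsampling step in the roundtrip test can be sketched on its own: given the number of data rows in a CSV and a target sample size, compute the 1-based row indices that pandas' `skiprows` should drop so that exactly `n_keep` rows survive. The helper name `sample_skiprows` and the seeded RNG below are illustrative, not part of the test suite.

```python
import random

# Illustrative sketch of the subsampling step (not part of the test
# suite): pick which of the n_lines data rows (1-based, header row
# excluded) pandas should skip so that exactly n_keep rows remain.
def sample_skiprows(n_lines, n_keep, seed=0):
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_lines + 1), n_lines - n_keep))

skip = sample_skiprows(n_lines=10, n_keep=3)
print(len(skip))  # 7 rows are skipped, 3 are kept
```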
import pytest

import selfies as sf


class Entry:

    def __init__(self, selfies, symbols, label, one_hot):
        self.selfies = selfies
        self.symbols = symbols
        self.label = label
        self.one_hot = one_hot


@pytest.fixture()
def dataset():
    stoi = {"[nop]": 0, "[O]": 1, ".": 2, "[C]": 3, "[F]": 4}
    itos = {i: c for c, i in stoi.items()}
    pad_to_len = 4

    entries = [
        Entry(selfies="",
              symbols=[],
              label=[0, 0, 0, 0],
              one_hot=[[1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0]]),
        Entry(selfies="[C][C][C]",
              symbols=["[C]", "[C]", "[C]"],
              label=[3, 3, 3, 0],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 0, 0, 1, 0],
                       [0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0]]),
        Entry(selfies="[C].[C]",
              symbols=["[C]", ".", "[C]"],
              label=[3, 2, 3, 0],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 0, 1, 0, 0],
                       [0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0]]),
        Entry(selfies="[C][O][C][F]",
              symbols=["[C]", "[O]", "[C]", "[F]"],
              label=[3, 1, 3, 4],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 1, 0, 0, 0],
                       [0, 0, 0, 1, 0],
                       [0, 0, 0, 0, 1]]),
        Entry(selfies="[C][O][C]",
              symbols=["[C]", "[O]", "[C]"],
              label=[3, 1, 3, 0],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 1, 0, 0, 0],
                       [0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0]])
    ]

    return entries, (stoi, itos, pad_to_len)


@pytest.fixture()
def dataset_flat_hots(dataset):
    flat_hots = []
    for entry in dataset[0]:
        hot = [elm for vec in entry.one_hot for elm in vec]
        flat_hots.append(hot)
    return flat_hots


def test_len_selfies(dataset):
    for entry in dataset[0]:
        assert sf.len_selfies(entry.selfies) == len(entry.symbols)


def test_split_selfies(dataset):
    for entry in dataset[0]:
        assert list(sf.split_selfies(entry.selfies)) == entry.symbols


def test_get_alphabet_from_selfies(dataset):
    entries, (vocab_stoi, _, _) = dataset

    selfies = [entry.selfies for entry in entries]
    alphabet = sf.get_alphabet_from_selfies(selfies)
    alphabet.add("[nop]")
    alphabet.add(".")

    assert alphabet == set(vocab_stoi.keys())


def test_selfies_to_encoding(dataset):
    entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset

    for entry in entries:
        label, one_hot = sf.selfies_to_encoding(
            entry.selfies, vocab_stoi, pad_to_len, "both"
        )

        assert label == entry.label
        assert one_hot == entry.one_hot

        # recover the original selfies
        selfies = sf.encoding_to_selfies(label, vocab_itos, "label")
        selfies = selfies.replace("[nop]", "")
        assert selfies == entry.selfies

        selfies = sf.encoding_to_selfies(one_hot, vocab_itos, "one_hot")
        selfies = selfies.replace("[nop]", "")
        assert selfies == entry.selfies


def test_selfies_to_flat_hot(dataset, dataset_flat_hots):
    entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset

    batch = [entry.selfies for entry in entries]
    flat_hots = sf.batch_selfies_to_flat_hot(batch, vocab_stoi, pad_to_len)
    assert flat_hots == dataset_flat_hots

    # recover the original selfies
    recovered = sf.batch_flat_hot_to_selfies(flat_hots, vocab_itos)
    assert batch == [s.replace("[nop]", "") for s in recovered]
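The padding and encoding behavior these tests check can be mirrored by a short standalone helper. `encode_symbols` below is a hypothetical sketch of what `selfies_to_encoding` does for an already tokenized symbol list, not the library's implementation.

```python
# Hypothetical sketch (not the selfies API): label- and one-hot-encode
# an already tokenized symbol list, padding with "[nop]" up to a fixed
# length, mirroring the dataset fixture above.
def encode_symbols(symbols, stoi, pad_to_len):
    padded = symbols + ["[nop]"] * (pad_to_len - len(symbols))
    label = [stoi[s] for s in padded]
    one_hot = [[1 if i == idx else 0 for i in range(len(stoi))]
               for idx in label]
    return label, one_hot

stoi = {"[nop]": 0, "[O]": 1, ".": 2, "[C]": 3, "[F]": 4}
label, one_hot = encode_symbols(["[C]", "[O]", "[C]"], stoi, pad_to_len=4)
print(label)  # [3, 1, 3, 0], matching the "[C][O][C]" entry above
```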
import faulthandler
import random

import pytest
from rdkit.Chem import MolFromSmiles

import selfies as sf

faulthandler.enable()


@pytest.fixture()
def max_selfies_len():
    return 1000


@pytest.fixture()
def large_alphabet():
    alphabet = sf.get_semantic_robust_alphabet()
    alphabet.update([
        "[#Br]", "[#Branch1]", "[#Branch2]", "[#Branch3]", "[#C@@H1]",
        "[#C@@]", "[#C@H1]", "[#C@]", "[#C]", "[#Cl]", "[#F]", "[#H]", "[#I]",
        "[#NH1]", "[#N]", "[#O]", "[#P]", "[#Ring1]", "[#Ring2]", "[#Ring3]",
        "[#S]", "[/Br]", "[/C@@H1]", "[/C@@]", "[/C@H1]", "[/C@]", "[/C]",
        "[/Cl]", "[/F]", "[/H]", "[/I]", "[/NH1]", "[/N]", "[/O]", "[/P]",
        "[/S]", "[=Br]", "[=Branch1]", "[=Branch2]", "[=Branch3]", "[=C@@H1]",
        "[=C@@]", "[=C@H1]", "[=C@]", "[=C]", "[=Cl]", "[=F]", "[=H]", "[=I]",
        "[=NH1]", "[=N]", "[=O]", "[=P]", "[=Ring1]", "[=Ring2]", "[=Ring3]",
        "[=S]", "[Br]", "[Branch1]", "[Branch2]", "[Branch3]", "[C@@H1]",
        "[C@@]", "[C@H1]", "[C@]", "[C]", "[Cl]", "[F]", "[H]", "[I]", "[NH1]",
        "[N]", "[O]", "[P]", "[Ring1]", "[Ring2]", "[Ring3]", "[S]", "[\\Br]",
        "[\\C@@H1]", "[\\C@@]", "[\\C@H1]", "[\\C@]", "[\\C]", "[\\Cl]",
        "[\\F]", "[\\H]", "[\\I]", "[\\NH1]", "[\\N]", "[\\O]", "[\\P]",
        "[\\S]", "[nop]"
    ])
    return list(alphabet)


def test_random_selfies_decoder(trials, max_selfies_len, large_alphabet):
    """Tests that SELFIES strings generated by randomly stringing together
    symbols from the SELFIES alphabet are decoded into valid SMILES.
    """
    alphabet = tuple(large_alphabet)

    for _ in range(trials):
        # create a random SELFIES and decode it
        rand_len = random.randint(1, max_selfies_len)
        rand_selfies = "".join(random_choices(alphabet, k=rand_len))
        smiles = sf.decoder(rand_selfies)

        # check that the resulting SMILES is valid
        try:
            is_valid = MolFromSmiles(smiles, sanitize=True) is not None
        except Exception:
            is_valid = False

        err_msg = "SMILES: {}\n\t SELFIES: {}".format(smiles, rand_selfies)
        assert is_valid, err_msg


def test_nop_symbol_decoder(max_selfies_len, large_alphabet):
    """Tests that the '[nop]' symbol is always skipped over.
    """
    alphabet = list(large_alphabet)
    alphabet.remove("[nop]")

    for _ in range(100):
        # create random SELFIES with and without [nop]
        rand_len = random.randint(1, max_selfies_len)
        rand_mol = random_choices(alphabet, k=rand_len)
        rand_mol.extend(["[nop]"] * (max_selfies_len - rand_len))
        random.shuffle(rand_mol)

        with_nops = "".join(rand_mol)
        without_nops = with_nops.replace("[nop]", "")

        assert sf.decoder(with_nops) == sf.decoder(without_nops)


def test_get_semantic_constraints():
    constraints = sf.get_semantic_constraints()
    assert constraints is not sf.get_semantic_constraints()  # not an alias
    assert "?" in constraints


def test_change_constraints_cache_clear():
    alphabet = sf.get_semantic_robust_alphabet()
    assert alphabet == sf.get_semantic_robust_alphabet()
    assert sf.decoder("[C][#C]") == "C#C"

    new_constraints = sf.get_semantic_constraints()
    new_constraints["C"] = 1
    sf.set_semantic_constraints(new_constraints)

    new_alphabet = sf.get_semantic_robust_alphabet()
    assert new_alphabet != alphabet
    assert sf.decoder("[C][#C]") == "CC"

    sf.set_semantic_constraints()  # restore the default constraints


def test_invalid_or_unsupported_smiles_encoder():
    malformed_smiles = [
        "",
        "(",
        "C(Cl)(Cl)CC[13C",
        "C(CCCOC",
        "C=(CCOC",
        "CCCC)",
        "C1CCCCC",
        "C(F)(F)(F)(F)(F)F",  # violates bond constraints
        "C=C1=CCCCCC1",  # violates bond constraints
        "CC*CC",  # uses wildcard
        "C$C",  # uses $ bond
        "S[As@TB1](F)(Cl)(Br)N",  # unrecognized chirality
        "SOMETHINGWRONGHERE",
        "1243124124",
    ]

    for smiles in malformed_smiles:
        with pytest.raises(sf.EncoderError):
            sf.encoder(smiles)


def test_malformed_selfies_decoder():
    with pytest.raises(sf.DecoderError):
        sf.decoder("[O][=C][O][C][C][C][C][O][N][Branch2_3")


def random_choices(population, k):  # random.choices was added in Python 3.6
    return [random.choice(population) for _ in range(k)]


def test_decoder_attribution():
    sm, am = sf.decoder(
        "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]", attribute=True)

    # check that the P atom is attributed to the [P] symbol
    for ta in am:
        if ta.token == "P":
            for a in ta.attribution:
                if a.token == "[P]":
                    return
    raise ValueError("Failed to find P in attribution map")


def test_encoder_attribution():
    smiles = "C1([O-])C=CC=C1Cl"
    indices = [0, 3, 3, 3, 5, 7, 8, 10, None, None, 12]
    s, am = sf.encoder(smiles, attribute=True)

    for i, ta in enumerate(am):
        if ta.attribution:
            assert indices[i] == ta.attribution[0].index, \
                f"found {ta[1]}; should be {indices[i]}"
        if ta.token == "[Cl]":
            assert "Cl" in [a.token for a in ta.attribution], \
                "Failed to find Cl in attribution map"
import pytest

import selfies as sf


def decode_eq(selfies, smiles):
    s = sf.decoder(selfies)
    return s == smiles


def test_branch_and_ring_at_state_X0():
    """Tests SELFIES with branches and rings at state X0 (i.e., at the
    very beginning of a SELFIES). These symbols should be skipped.
    """
    assert decode_eq("[Branch3][C][S][C][O]", "CSCO")
    assert decode_eq("[Ring3][C][S][C][O]", "CSCO")
    assert decode_eq("[Branch1][Ring1][Ring3][C][S][C][O]", "CSCO")


def test_branch_at_state_X1():
    """Tests SELFIES with branches at state X1 (i.e., at an atom that can
    only make one bond). In this case, the branch symbol should be skipped.
    """
    assert decode_eq("[C][C][O][Branch1][C][I]", "CCOCI")
    assert decode_eq("[C][C][C][O][#Branch3][C][I]", "CCCOCI")


def test_branch_and_ring_decrement_state():
    """Tests that the branch and ring symbols properly decrement the
    derivation state.
    """
    assert decode_eq("[C][C][C][Ring1][Ring1][#C]", "C1CC1=C")
    assert decode_eq("[C][=C][C][C][#Ring1][Ring1][#C]", "C=C1CC1")
    assert decode_eq("[C][O][C][C][=Ring1][Ring1][#C]", "COCCC")
    assert decode_eq("[C][=C][Branch1][C][=C][#C]", "C=C(C)C")


def test_branch_at_end_of_selfies():
    """Tests SELFIES that have a branch symbol as their very last symbol.
    """
    assert decode_eq("[C][C][C][C][Branch1]", "CCCC")
    assert decode_eq("[C][C][C][C][#Branch3]", "CCCC")


def test_ring_at_end_of_selfies():
    """Tests SELFIES that have a ring symbol as their very last symbol.
    """
    assert decode_eq("[C][C][C][C][C][Ring1]", "CCCC=C")
    assert decode_eq("[C][C][C][C][C][Ring3]", "CCCC=C")


def test_branch_with_no_atoms():
    """Tests SELFIES that have a branch with no atoms in it. Such branches
    should not appear in the output SMILES.
    """
    s = "[C][Branch1][Ring2][Branch1][Branch1][Branch1][F]"
    assert decode_eq(s, "CF")
    s = "[C][Branch1][Ring2][Ring1][Ring1][Branch1][F]"
    assert decode_eq(s, "CF")
    s = "[C][=Branch1][Ring2][Branch1][C][Cl][F]"
    assert decode_eq(s, "C(Cl)F")

    # special case: [#Branch3] takes Q_1, Q_2 = [O] and Q_3 = ''.
    # However, there are no more symbols in the branch.
    assert decode_eq("[C][C][C][C][#Branch3][O][O]", "CCCC")


def test_oversized_branch():
    """Tests SELFIES that have a branch with Q larger than the length
    of the SELFIES.
    """
    assert decode_eq("[C][Branch2][O][O][C][C][S][F][C]", "CCCSF")
    assert decode_eq("[C][#Branch2][O][O][#C][C][S][F]", "C#CCSF")


def test_oversized_ring():
    """Tests SELFIES that have a ring with Q so large that the (Q + 1)-th
    previously derived atom does not exist.
    """
    assert decode_eq("[C][C][C][C][Ring1][O]", "C1CCC1")
    assert decode_eq("[C][C][C][C][Ring2][O][C]", "C1CCC1")

    # special case: [Ring2] takes Q_1 = [O] and Q_2 = '', leading to
    # Q = 9 * 16 + 0 (i.e., an oversized ring)
    assert decode_eq("[C][C][C][C][Ring2][O]", "C1CCC1")

    # special case: a ring from the 1st atom to itself should not be formed
    assert decode_eq("[C][Ring1][O]", "C")


def test_branch_at_beginning_of_branch():
    """Tests SELFIES that have a branch immediately at the start of a branch.
    """
    # [C@]((Br)Cl)F
    s = "[C@][=Branch1][Branch1][Branch1][C][Br][Cl][F]"
    assert decode_eq(s, "[C@](Br)(Cl)F")

    # [C@](((Br)Cl)I)F
    s = "[C@][#Branch1][Branch2][=Branch1][Branch1][Branch1][C][Br][Cl][I][F]"
    assert decode_eq(s, "[C@](Br)(Cl)(I)F")

    # [C@]((Br)(Cl)I)F
    s = "[C@][#Branch1][Branch2][Branch1][C][Br][Branch1][C][Cl][I][F]"
    assert decode_eq(s, "[C@](Br)(Cl)(I)F")


def test_ring_at_beginning_of_branch():
    """Tests SELFIES that have a ring immediately at the start of a branch.
    """
    # CC1CCC(1CCl)F
    s = "[C][C][C][C][C][=Branch1][Branch1][Ring1][Ring2][C][Cl][F]"
    assert decode_eq(s, "CC1CCC1(CCl)F")

    # CC1CCS(Br)(1CCl)F
    s = "[C][C][C][C][S][Branch1][C][Br]" \
        "[=Branch1][Branch1][Ring1][Ring2][C][Cl][F]"
    assert decode_eq(s, "CC1CCS1(Br)(CCl)F")


def test_branch_and_ring_at_beginning_of_branch():
    """Tests SELFIES that have a branch and a ring immediately at the start
    of a branch.
    """
    # CC1CCCS((Br)1Cl)F
    s = "[C][C][C][C][C][S][#Branch1][#Branch1][Branch1][C][Br]" \
        "[Ring1][Branch1][Cl][F]"
    assert decode_eq(s, "CC1CCCS1(Br)(Cl)F")

    # CC1CCCS(1(Br)Cl)F
    s = "[C][C][C][C][C][S][#Branch1][#Branch1][Ring1][Branch1]" \
        "[Branch1][C][Br][Cl][F]"
    assert decode_eq(s, "CC1CCCS1(Br)(Cl)F")


def test_ring_immediately_following_branch():
    """Tests SELFIES that have a ring immediately following a branch.
    """
    # CCC1CCCC(OCO)1
    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][Ring1][Branch1]"
    assert decode_eq(s, "CCC1CCCC1OCO")

    # CCC1CCCC(OCO)(F)1
    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \
        "[Branch1][C][F][Ring1][Branch1]"
    assert decode_eq(s, "CCC1CCCC1(OCO)F")


def test_ring_after_branch():
    """Tests SELFIES that have a ring following a branch, but not
    immediately after it.
    """
    # CCCCCCC1(OCO)1
    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][C][Ring1][Branch1]"
    assert decode_eq(s, "CCCCCCC(OCO)=C")

    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \
        "[Branch1][C][F][C][C][Ring1][=Branch2]"
    assert decode_eq(s, "CCCCC1CC(OCO)(F)CC1")


def test_ring_on_top_of_existing_bond():
    """Tests SELFIES with rings between two atoms that are already bonded
    in the main scaffold.
    """
    # C1C1, C1C=1, C1C#1, ...
    assert decode_eq("[C][C][Ring1][C]", "C=C")
    assert decode_eq("[C][/C][Ring1][C]", "C=C")
    assert decode_eq("[C][C][=Ring1][C]", "C#C")
    assert decode_eq("[C][C][#Ring1][C]", "C#C")


def test_consecutive_rings():
    """Tests SELFIES with multiple consecutive rings.
    """
    s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C=1CCC=1")  # 1 + 1

    s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 1 + 1 + 1

    s = "[C][C][C][C][=Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 2 + 1

    s = "[C][C][C][C][Ring1][Ring2][=Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 1 + 2

    # consecutive rings that exceed bond constraints
    s = "[C][C][C][C][#Ring1][Ring2][=Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 3 + 2
    s = "[C][C][C][C][=Ring1][Ring2][#Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 2 + 3
    s = "[C][C][C][C][=Ring1][Ring2][=Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 2 + 2

    # consecutive rings with a stereochemical single bond
    s = "[C][C][C][C][\\/Ring1][Ring2]"
    assert decode_eq(s, "C\\1CCC/1")
    s = "[C][C][C][C][\\/Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C=1CCC=1")


def test_unconstrained_symbols():
    """Tests SELFIES with symbols that are not semantically constrained.
    """
    f_branch = "[Branch1][C][F]"
    s = "[Xe-2]" + (f_branch * 8)
    assert decode_eq(s, "[Xe-2](F)(F)(F)(F)(F)(F)(F)CF")

    # change the default semantic constraints
    constraints = sf.get_semantic_constraints()
    constraints["?"] = 2
    sf.set_semantic_constraints(constraints)

    assert decode_eq(s, "[Xe-2](F)CF")

    sf.set_semantic_constraints()


def test_isotope_symbols():
    """Tests that SELFIES symbols with isotope specifications are
    constrained properly.
    """
    s = "[13C][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]"
    assert decode_eq(s, "[13C](Cl)(F)(Br)CI")

    assert decode_eq("[C][36Cl][C]", "C[36Cl]")


def test_chiral_symbols():
    """Tests that SELFIES symbols with chirality specifications are
    constrained properly.
    """
    s = "[C@@][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]"
    assert decode_eq(s, "[C@@](Cl)(F)(Br)CI")

    s = "[C@H1][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br]"
    assert decode_eq(s, "[C@H1](Cl)(F)CBr")


def test_explicit_hydrogen_symbols():
    """Tests that SELFIES symbols with explicit hydrogen specifications
    are constrained properly.
    """
    assert decode_eq("[CH1][Branch1][C][Cl][#C]", "[CH1](Cl)=C")
    assert decode_eq("[CH3][=C]", "[CH3]C")
    assert decode_eq("[CH4][C][C]", "[CH4]")
    assert decode_eq("[C][C][C][CH4]", "CCC")
    assert decode_eq("[C][Branch1][Ring2][C][=CH4][C][=C]", "C(C)=C")

    with pytest.raises(sf.DecoderError):
        sf.decoder("[C][C][CH5]")
    with pytest.raises(sf.DecoderError):
        sf.decoder("[C][C][C][OH9]")


def test_charged_symbols():
    """Tests that SELFIES symbols with charges are constrained properly.
    """
    constraints = sf.get_semantic_constraints()
    constraints["Sn+4"] = 1
    constraints["O-2"] = 2
    sf.set_semantic_constraints(constraints)

    # the following molecules don't make chemical sense, but they are used
    # to test selfies; hence, they cannot be verified with RDKit
    assert decode_eq("[Sn+4][=C]", "[Sn+4]C")
    assert decode_eq("[O-2][#C]", "[O-2]=C")

    # mixing many symbol types
    assert decode_eq("[17O@@H1-2][#C]", "[17O@@H1-2]C")

    sf.set_semantic_constraints()


def test_standardized_alphabet():
    """Tests that equivalent SMILES atom symbols are translated into the
    same SELFIES atom symbol.
    """
    assert sf.encoder("[C][O][N][P][F]") == "[CH0][OH0][NH0][PH0][FH0]"
    assert sf.encoder("[Fe][Si]") == "[Fe][Si]"
    assert sf.encoder("[Fe++][Fe+2]") == "[Fe+2][Fe+2]"
    assert sf.encoder("[CH][CH1]") == "[CH1][CH1]"


def test_old_symbols():
    """Tests backward compatibility of SELFIES with old (< v2) symbols.
    """
    s = "[C@@Hexpl][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F]"
    assert sf.decoder(s, compatible=True) == "[C@@H1](C)(Cl)F"

    s = "[C][C][C][C][Expl=Ring1][Ring2][Expl#Ring1][Ring2]"
    assert sf.decoder(s, compatible=True) == "C#1CCC#1"

    long_s = "[C@@Hexpl][=C][C@@Hexpl][N+expl][=C][C+expl][N+expl][O+expl]" \
             "[Fe++expl][C@@Hexpl][C][N+expl][Branch1_2][Fe++expl][S+expl]" \
             "[=C][Expl=Ring1][Fe++expl][S+expl][Expl=Ring1][O+expl]" \
             "[C@@Hexpl][Expl=Ring1][C@@Hexpl][C@@Hexpl][N+expl][Expl=Ring1]" \
             "[Expl=Ring1][S+expl][=C]"
    try:
        sf.decoder(long_s, compatible=True)
    except Exception:
        pytest.fail("Decoding old-style SELFIES raised an exception")


def test_large_selfies_decoding():
    """Tests that extremely large SELFIES strings can be decoded
    (this used to cause a RecursionError).
    """
    large_selfies = "[C]" * 1024
    expected_smiles = "C" * 1024
    assert decode_eq(large_selfies, expected_smiles)
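Several of the special cases above hinge on how ring and branch index symbols build the number Q, e.g. the "Q = 9 * 16 + 0" comment. The sketch below reproduces only that arithmetic; the partial index table is inferred from the comments in these tests and is not the full SELFIES index alphabet.

```python
# Sketch of the Q arithmetic used by ring/branch symbols: each index
# symbol contributes one base-16 digit. The table is a partial, inferred
# mapping (e.g. [O] -> 9 follows from the "Q = 9 * 16 + 0" comment in
# test_oversized_ring); missing or empty symbols count as 0.
INDEX = {"[C]": 0, "[Ring1]": 1, "[Ring2]": 2, "[Branch1]": 3, "[O]": 9}

def q_value(index_symbols):
    q = 0
    for symbol in index_symbols:
        q = q * 16 + INDEX.get(symbol, 0)
    return q

print(q_value(["[O]"]))      # 9: ring back to the 10th previous atom
print(q_value(["[O]", ""]))  # 144: oversized for a 4-atom chain
```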
Metadata-Version: 2.1
Name: selfies
-Version: 2.1.1
+Version: 2.1.2
Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs.

@@ -8,245 +8,2 @@ Home-page: https://github.com/aspuru-guzik-group/selfies

Author-email: mario.krenn@utoronto.ca, alan@aspuru.com
License: UNKNOWN
Description: # SELFIES
[![GitHub release](https://img.shields.io/github/release/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/releases/)
![versions](https://img.shields.io/pypi/pyversions/selfies.svg)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/commit-activity)
[![GitHub issues](https://img.shields.io/github/issues/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/issues/)
[![Documentation Status](https://readthedocs.org/projects/selfiesv2/badge/?version=latest)](http://selfiesv2.readthedocs.io/?badge=latest)
[![GitHub contributors](https://img.shields.io/github/contributors/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/contributors/)
**Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation**\
_Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik_\
[*Machine Learning: Science and Technology* **1**, 045024 (2020)](https://iopscience.iop.org/article/10.1088/2632-2153/aba947), [extensive blog post January 2021](https://aspuru.substack.com/p/molecular-graph-representations-and).\
[Talk on youtube about SELFIES](https://www.youtube.com/watch?v=CaIyUmfGXDk).\
[A community paper with 31 authors on SELFIES and the future of molecular string representations](https://arxiv.org/abs/2204.00056).\
[Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\
Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\
Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\
Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ)
---
A main objective of SELFIES is to serve as direct input to machine learning models,
in particular generative models, so that the molecular graphs they generate
are syntactically and semantically valid.
<p align="center">
<img src="https://github.com/aspuru-guzik-group/selfies/blob/master/examples/VAE_LS_Validity.png" alt="SELFIES validity in a VAE latent space" width="666px">
</p>
## Installation
Use pip to install ``selfies``.
```bash
pip install selfies
```
To check if the correct version of ``selfies`` is installed, use
the following pip command.
```bash
pip show selfies
```
To upgrade to the latest release of ``selfies`` from an older version, use the
following pip command. Before upgrading, please review the changes between
versions in the
[CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md):
```bash
pip install selfies --upgrade
```
## Usage
### Overview
Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/),
which contains a thorough tutorial for getting started with ``selfies``
and detailed descriptions of the functions
that ``selfies`` provides. We summarize some key functions below.
| Function | Description |
| ------------------------------------- | ----------------------------------------------------------------- |
| ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. |
| ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. |
| ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. |
| ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. |
| ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. |
| ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. |
| ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. |
| ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. |
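Since a SELFIES string is just a sequence of bracketed symbols, tokenization is simple. The following is only an illustrative sketch of the symbol grammar, not the library's implementation (in practice, use ``selfies.split_selfies``, which also handles details such as ``.``-separated fragments):

```python
import re

def split_selfies_sketch(selfies: str):
    # Each SELFIES symbol is a bracketed token like [C], [=C], or [=Branch1].
    return re.findall(r"\[[^\]]*\]", selfies)

tokens = split_selfies_sketch("[C][=C][C][=C][C][=C][Ring1][=Branch1]")
print(tokens)    # ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
print(len(tokens))  # 8, matching selfies.len_selfies on the same string
```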
### Examples
#### Translation between SELFIES and SMILES representations:
```python
import selfies as sf
benzene = "c1ccccc1"
# SMILES -> SELFIES -> SMILES translation
try:
benzene_sf = sf.encoder(benzene) # [C][=C][C][=C][C][=C][Ring1][=Branch1]
benzene_smi = sf.decoder(benzene_sf) # C1=CC=CC=C1
except sf.EncoderError:
pass # sf.encoder error!
except sf.DecoderError:
pass # sf.decoder error!
len_benzene = sf.len_selfies(benzene_sf) # 8
symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```
#### Very simple creation of random valid molecules:
A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by [Rajarshi Guha](https://twitter.com/rguha/status/1543601839983284224):
```python
import selfies as sf
import random
alphabet = sf.get_semantic_robust_alphabet()  # the alphabet of semantically robust symbols
rnd_selfies = "".join(random.sample(list(alphabet), 9))
rnd_smiles = sf.decoder(rnd_selfies)
print(rnd_smiles)
```
These few lines produce exotic molecules, but all of them are valid. They can serve as a starting point for more advanced filtering techniques or for machine learning models.
#### Integer and one-hot encoding SELFIES:
In this example, we first build an alphabet from a dataset of SELFIES strings,
and then convert a SELFIES string into its padded encoding. Note that we use the
``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_%28code%29))
symbol to pad our SELFIES, which is a special SELFIES symbol that is always
ignored and skipped over by ``selfies.decoder``, making it a useful
padding character.
```python
import selfies as sf
dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"]
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add("[nop]") # [nop] is a special padding symbol
alphabet = sorted(alphabet)  # ['[=O]', '[C]', '[F]', '[O]', '[nop]']
pad_to_len = max(sf.len_selfies(s) for s in dataset) # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}
dimethyl_ether = dataset[0] # [C][O][C]
label, one_hot = sf.selfies_to_encoding(
selfies=dimethyl_ether,
vocab_stoi=symbol_to_idx,
pad_to_len=pad_to_len,
enc_type="both"
)
# label = [1, 3, 1, 4, 4]
# one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
```
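The table above also lists ``selfies.encoding_to_selfies`` for the reverse direction. Conceptually, a label encoding is just a sequence of indices into the alphabet; the following stdlib-only sketch shows that mapping (in practice, call ``sf.encoding_to_selfies`` with the inverse vocabulary):

```python
alphabet = ['[=O]', '[C]', '[F]', '[O]', '[nop]']  # same alphabet as above
idx_to_symbol = dict(enumerate(alphabet))

label = [1, 3, 1, 4, 4]  # padded label encoding of dimethyl ether
recovered = "".join(idx_to_symbol[i] for i in label)
print(recovered)  # [C][O][C][nop][nop] -- the decoder skips the [nop] padding
```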
#### Customizing SELFIES:
In this example, we relax the semantic constraints of ``selfies`` to allow
for hypervalences (caution: hypervalence rules are much less understood
than octet rules. Some molecules containing hypervalences are important,
but generally, it is not known which molecules are stable and reasonable).
```python
import selfies as sf
hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid
standard_derived_smi = sf.decoder(hypervalent_sf)
# OI (the default constraints for I allow only 1 bond)
sf.set_semantic_constraints("hypervalent")
relaxed_derived_smi = sf.decoder(hypervalent_sf)
# O=I(O)(O)(O)(O)O (the hypervalent constraints for I allow 7 bonds)
```
#### Explaining Translation:
You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens.
```python
import selfies as sf

selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]"
smiles, attr = sf.decoder(selfies, attribute=True)
print('SELFIES', selfies)
print('SMILES', smiles)
print('Attribution:')
for smiles_token in attr:
print(smiles_token)
# Output:
# SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
# SMILES C1NC(P)CC1
# Attribution:
# AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
# AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
# AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
# AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
# AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
# AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])
```
``attr`` is a list of `AttributionMap` objects, each containing an output token, its index, and the input tokens that led to it. For example, the ``P`` appearing in the output SMILES is the result of both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control over tracking the translation (such as tracking rings), you can access attributions on the underlying molecular graph with ``get_attribution``.
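Because the attribution records are plain (index, token) structures, they are easy to post-process. The sketch below uses stand-in namedtuples shaped like the records printed above (the real objects come from ``selfies``) to collect, for each output SMILES token, the SELFIES symbols responsible for it:

```python
from collections import namedtuple

# Stand-ins shaped like the attribution records shown above.
Attribution = namedtuple("Attribution", ["index", "token"])
AttributionMap = namedtuple("AttributionMap", ["index", "token", "attribution"])

attr = [
    AttributionMap(0, "C", [Attribution(0, "[C]")]),
    AttributionMap(5, "P", [Attribution(3, "[Branch1]"), Attribution(5, "[P]")]),
]

# Map each output SMILES token (by position) to its source SELFIES symbols.
sources = {(m.index, m.token): [a.token for a in m.attribution] for m in attr}
print(sources[(5, "P")])  # ['[Branch1]', '[P]']
```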
### More Usages and Examples
* More examples can be found in the ``examples/`` directory, including a
[variational autoencoder that runs on the SELFIES](https://github.com/aspuru-guzik-group/selfies/tree/master/examples/vae_example) language.
* This [ICLR2020 paper](https://arxiv.org/abs/1909.11655) used SELFIES in a
genetic algorithm to achieve state-of-the-art performance for inverse design,
with the [code here](https://github.com/aspuru-guzik-group/GA).
* SELFIES allows for [highly efficient exploration and interpolation of the chemical space](https://chemrxiv.org/articles/preprint/Beyond_Generative_Models_Superfast_Traversal_Optimization_Novelty_Exploration_and_Discovery_STONED_Algorithm_for_Molecules_using_SELFIES/13383266), with a deterministic algorithm ([see code](https://github.com/aspuru-guzik-group/stoned-selfies)).
* We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea).
* Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck showed in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks; see the code for [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator).
* Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling.
* As an improvement over the earlier genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling.
## Tests
`selfies` uses `pytest` with `tox` as its testing framework.
All tests can be found in the `tests/` directory. To run the test suite for
SELFIES, install ``tox`` and run:
```bash
tox -- --trials=10000 --dataset_samples=10000
```
By default, `selfies` is tested against a random subset
(of size ``dataset_samples=10000``) of various datasets:
* 130K molecules from [QM9](https://www.nature.com/articles/sdata201422)
* 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database)
* 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307)
* 160K+ molecules from various [MoleculeNet](http://moleculenet.ai/datasets-1) datasets
* 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html).
Due to its large size, this dataset is not included in the repository. To run tests
on it, please download the dataset into the ``tests/test_sets`` directory
and run the ``tests/run_on_large_dataset.py`` script.
## Version History
See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md).
## Credits
We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin,
HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan),
Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports,
and Robert Pollice for chemistry advice.
## License
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
Metadata-Version: 2.1
Name: selfies
Version: 2.1.1
Version: 2.1.2
Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs.

Home-page: https://github.com/aspuru-guzik-group/selfies
Author-email: mario.krenn@utoronto.ca, alan@aspuru.com
* We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea).
* Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator).
* Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling.
* An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling.
## Tests
`selfies` uses `pytest` with `tox` as its testing framework.
All tests can be found in the `tests/` directory. To run the test suite for
SELFIES, install ``tox`` and run:
```bash
tox -- --trials=10000 --dataset_samples=10000
```
By default, `selfies` is tested against a random subset
(of size ``dataset_samples=10000``) on various datasets:
* 130K molecules from [QM9](https://www.nature.com/articles/sdata201422)
* 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database)
* 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307)
* 160K+ molecules from various [MoleculeNet](http://moleculenet.ai/datasets-1) datasets
* 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html).
Due to its large size, this dataset is not included on the repository. To run tests
on it, please download the dataset into the ``tests/test_sets`` directory
and run the ``tests/run_on_large_dataset.py`` script.
## Version History
See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md).
## Credits
We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin,
HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan),
Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports,
and Robert Pollice for chemistry advices.
## License
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3

@@ -262,1 +19,245 @@ Classifier: Programming Language :: Python :: 3.7

Description-Content-Type: text/markdown
License-File: LICENSE
# SELFIES
[![GitHub release](https://img.shields.io/github/release/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/releases/)
![versions](https://img.shields.io/pypi/pyversions/selfies.svg)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/commit-activity)
[![GitHub issues](https://img.shields.io/github/issues/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/issues/)
[![Documentation Status](https://readthedocs.org/projects/selfiesv2/badge/?version=latest)](http://selfiesv2.readthedocs.io/?badge=latest)
[![GitHub contributors](https://img.shields.io/github/contributors/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/contributors/)
**Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation**\
_Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik_\
[*Machine Learning: Science and Technology* **1**, 045024 (2020)](https://iopscience.iop.org/article/10.1088/2632-2153/aba947), [extensive blog post January 2021](https://aspuru.substack.com/p/molecular-graph-representations-and).\
[Talk on youtube about SELFIES](https://www.youtube.com/watch?v=CaIyUmfGXDk).\
[A community paper with 31 authors on SELFIES and the future of molecular string representations](https://arxiv.org/abs/2204.00056).\
[Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\
**[Code-Paper in February 2023](https://pubs.rsc.org/en/content/articlelanding/2023/DD/D3DD00044C)**\
[SELFIES in Wolfram Mathematica](https://resources.wolframcloud.com/PacletRepository/resources/WolframChemistry/Selfies/) (since Dec 2023)\
Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\
Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\
Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ)
---
A main objective is to use SELFIES as direct input into machine learning models,
in particular in generative models, for the generation of molecular graphs
which are syntactically and semantically valid.
<p align="center">
<img src="https://github.com/aspuru-guzik-group/selfies/blob/master/examples/VAE_LS_Validity.png" alt="SELFIES validity in a VAE latent space" width="666px">
</p>
## Installation
Use pip to install ``selfies``.
```bash
pip install selfies
```
To check if the correct version of ``selfies`` is installed, use
the following pip command.
```bash
pip show selfies
```
To upgrade to the latest release of ``selfies`` from an older version, use the
following pip command. Please review the
[CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md)
for the changes between versions before upgrading:
```bash
pip install selfies --upgrade
```
## Usage
### Overview
Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/),
which contains a thorough tutorial for getting started with ``selfies``
and detailed descriptions of the functions
that ``selfies`` provides. We summarize some key functions below.
| Function | Description |
| ------------------------------------- | ----------------------------------------------------------------- |
| ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. |
| ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. |
| ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. |
| ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. |
| ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. |
| ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. |
| ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. |
| ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. |
### Examples
#### Translation between SELFIES and SMILES representations:
```python
import selfies as sf
benzene = "c1ccccc1"
# SMILES -> SELFIES -> SMILES translation
try:
benzene_sf = sf.encoder(benzene) # [C][=C][C][=C][C][=C][Ring1][=Branch1]
benzene_smi = sf.decoder(benzene_sf) # C1=CC=CC=C1
except sf.EncoderError:
pass # sf.encoder error!
except sf.DecoderError:
pass # sf.decoder error!
len_benzene = sf.len_selfies(benzene_sf) # 8
symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```
#### Very simple creation of random valid molecules:
A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by [Rajarshi Guha](https://twitter.com/rguha/status/1543601839983284224):
```python
import selfies as sf
import random
alphabet = sf.get_semantic_robust_alphabet()  # the alphabet of semantically robust symbols
rnd_selfies = "".join(random.sample(list(alphabet), 9))
rnd_smiles = sf.decoder(rnd_selfies)
print(rnd_smiles)
```
These few lines produce wild-looking molecules, but all of them are valid. They can serve as a starting point for more advanced filtering techniques or for machine learning models.
#### Integer and one-hot encoding SELFIES:
In this example, we first build an alphabet from a dataset of SELFIES strings,
and then convert a SELFIES string into its padded encoding. Note that we pad with the
``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code))) symbol,
a special SELFIES symbol that is always ignored and skipped over by
``selfies.decoder``, making it a useful padding character.
```python
import selfies as sf
dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"]
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add("[nop]") # [nop] is a special padding symbol
alphabet = list(sorted(alphabet)) # ['[=O]', '[C]', '[F]', '[O]', '[nop]']
pad_to_len = max(sf.len_selfies(s) for s in dataset) # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}
dimethyl_ether = dataset[0] # [C][O][C]
label, one_hot = sf.selfies_to_encoding(
selfies=dimethyl_ether,
vocab_stoi=symbol_to_idx,
pad_to_len=pad_to_len,
enc_type="both"
)
# label = [1, 3, 1, 4, 4]
# one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
```
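The function table above also lists ``selfies.encoding_to_selfies`` for the reverse direction. The core of that inverse mapping can be sketched in plain Python (a minimal illustration of the idea, not the library's implementation): invert the vocabulary, look up each label, and drop the ``[nop]`` padding.

```python
# Minimal sketch of the inverse of the encoding above
# (illustrative only; the library provides selfies.encoding_to_selfies).
label = [1, 3, 1, 4, 4]
alphabet = ['[=O]', '[C]', '[F]', '[O]', '[nop]']

idx_to_symbol = {i: s for i, s in enumerate(alphabet)}     # inverse vocabulary
symbols = [idx_to_symbol[i] for i in label]                # labels -> symbols
selfies_str = "".join(s for s in symbols if s != "[nop]")  # strip padding
print(selfies_str)  # [C][O][C]
```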
#### Customizing SELFIES:
In this example, we relax the semantic constraints of ``selfies`` to allow
for hypervalences (caution: hypervalence rules are much less understood
than octet rules. Some molecules containing hypervalences are important,
but generally, it is not known which molecules are stable and reasonable).
```python
import selfies as sf
hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid
standard_derived_smi = sf.decoder(hypervalent_sf)
# OI (the default constraints for I allow only 1 bond)
sf.set_semantic_constraints("hypervalent")
relaxed_derived_smi = sf.decoder(hypervalent_sf)
# O=I(O)(O)(O)(O)O (the hypervalent constraints for I allow up to 7 bonds)
```
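Conceptually, a semantic-constraints setting is just a mapping from atom symbol to its maximum bond count, and the decoder caps any requested bond order at that limit. A toy sketch of the capping idea (the helper function and constraint dictionaries are hypothetical illustrations, not ``selfies`` internals):

```python
# Toy illustration of bond capping under semantic constraints.
# The values mirror common defaults (e.g. C: 4, O: 2, I: 1), but
# cap_bonds is a hypothetical helper, not the library's code.
default_constraints = {"C": 4, "N": 3, "O": 2, "F": 1, "I": 1}
hypervalent_constraints = {**default_constraints, "I": 7}

def cap_bonds(atom, requested_bonds, constraints):
    """Clamp a requested bond count to the atom's allowed maximum."""
    return min(requested_bonds, constraints.get(atom, 8))

print(cap_bonds("I", 7, default_constraints))      # 1: excess bonds dropped
print(cap_bonds("I", 7, hypervalent_constraints))  # 7: hypervalence allowed
```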
#### Explaining Translation:
You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens.
```python
import selfies as sf

selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]"
smiles, attr = sf.decoder(selfies, attribute=True)
print('SELFIES', selfies)
print('SMILES', smiles)
print('Attribution:')
for smiles_token in attr:
print(smiles_token)
# Output:
# SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
# SMILES C1NC(P)CC1
# Attribution:
# AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
# AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
# AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
# AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
# AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
# AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])
```
``attr`` is a list of ``AttributionMap``s containing the output token, its index, and the input tokens that led to it. For example, the ``P`` appearing in the output SMILES results from both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer-grained tracking of the translation (such as following rings), you can access attributions on the underlying molecular graph with ``get_attribution``.
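To post-process the attribution list, you might group the contributing input tokens by the output token they produced. A small sketch using plain tuples in place of the ``AttributionMap``/``Attribution`` objects (the field layout mirrors the printed output above; the tuples themselves are illustrative stand-ins):

```python
# Group input SELFIES tokens by the output SMILES token they produced.
# Tuples stand in for AttributionMap/Attribution objects, using the
# same (index, token, attribution) layout as the printed output above.
attr = [
    (0, 'C', [(0, '[C]')]),
    (2, 'N', [(1, '[N]')]),
    (5, 'P', [(3, '[Branch1]'), (5, '[P]')]),
]

contributions = {
    (out_idx, out_tok): [tok for _, tok in sources]
    for out_idx, out_tok, sources in attr
}
print(contributions[(5, 'P')])  # ['[Branch1]', '[P]']
```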
### More Usages and Examples
* More examples can be found in the ``examples/`` directory, including a
[variational autoencoder that runs on the SELFIES language](https://github.com/aspuru-guzik-group/selfies/tree/master/examples/vae_example).
* This [ICLR2020 paper](https://arxiv.org/abs/1909.11655) used SELFIES in a
genetic algorithm to achieve state-of-the-art performance for inverse design,
with the [code here](https://github.com/aspuru-guzik-group/GA).
* SELFIES allows for [highly efficient exploration and interpolation of the chemical space](https://chemrxiv.org/articles/preprint/Beyond_Generative_Models_Superfast_Traversal_Optimization_Novelty_Exploration_and_Discovery_STONED_Algorithm_for_Molecules_using_SELFIES/13383266) with deterministic algorithms ([see the code](https://github.com/aspuru-guzik-group/stoned-selfies)).
* We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea).
* Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks; see the code of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator).
* Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling.
* As an improvement on the earlier genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling.
## Tests
`selfies` uses `pytest` with `tox` as its testing framework.
All tests can be found in the `tests/` directory. To run the test suite for
SELFIES, install ``tox`` and run:
```bash
tox -- --trials=10000 --dataset_samples=10000
```
By default, `selfies` is tested against a random subset
(of size ``dataset_samples=10000``) of each of the following datasets:
* 130K molecules from [QM9](https://www.nature.com/articles/sdata201422)
* 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database)
* 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307)
* 160K+ molecules from various [MoleculeNet](http://moleculenet.ai/datasets-1) datasets
* 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html).
Due to its large size, the eMolecules dataset is not included in the repository. To run tests
on it, please download the dataset into the ``tests/test_sets`` directory
and run the ``tests/run_on_large_dataset.py`` script.
## Version History
See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md).
## Credits
We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin,
HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan),
Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports,
and Robert Pollice for chemistry advice.
## License
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)

@@ -0,1 +1,2 @@

LICENSE
README.md

@@ -20,2 +21,6 @@ setup.py

selfies/utils/selfies_utils.py
selfies/utils/smiles_utils.py
tests/test_on_datasets.py
tests/test_selfies.py
tests/test_selfies_utils.py
tests/test_specific_cases.py

@@ -28,3 +28,3 @@ #!/usr/bin/env python

__version__ = "2.1.0"
__version__ = "2.1.1"

@@ -31,0 +31,0 @@ __all__ = [

@@ -50,5 +50,14 @@ from typing import Dict, List, Tuple, Union

# integer encode
char_list = split_selfies(selfies)
integer_encoded = [vocab_stoi[char] for char in char_list]
integer_encoded = []
for char in split_selfies(selfies):
if (char == ".") and ("." not in vocab_stoi):
raise KeyError(
"The SELFIES string contains two unconnected molecules "
"(given by the '.' character), but vocab_stoi does not "
"contain the '.' key. Please add it to the vocabulary "
"or separate the molecules."
)
integer_encoded.append(vocab_stoi[char])
if enc_type == "label":

@@ -55,0 +64,0 @@ return integer_encoded

@@ -445,37 +445,47 @@ import enum

attribution_maps, attribution_index=0):
curr_atom, curr = mol.get_atom(root), root
token = atom_to_smiles(curr_atom)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, mol.get_attribution(curr_atom)))
stack = [(root, 0, len(mol.get_out_dirbonds(root)), False)]
out_bonds = mol.get_out_dirbonds(curr)
for i, bond in enumerate(out_bonds):
if bond.ring_bond:
token = bond_to_smiles(bond)
while stack:
curr, bond_index, total_bonds, needs_closing = stack[-1]
curr_atom = mol.get_atom(curr)
if bond_index == 0:
token = atom_to_smiles(curr_atom)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, mol.get_attribution(bond)))
ends = (min(bond.src, bond.dst), max(bond.src, bond.dst))
rnum = ring_log.setdefault(ends, len(ring_log) + 1)
if rnum >= 10:
derived.append("%")
derived.append(str(rnum))
token, mol.get_attribution(curr_atom)))
out_bonds = mol.get_out_dirbonds(curr)
if bond_index < total_bonds:
bond = out_bonds[bond_index]
bond_attribution = mol.get_attribution(bond)
stack[-1] = (curr, bond_index + 1, total_bonds, needs_closing)
if bond.ring_bond:
token = bond_to_smiles(bond)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, bond_attribution))
ends = (min(bond.src, bond.dst), max(bond.src, bond.dst))
rnum = ring_log.setdefault(ends, len(ring_log) + 1)
if rnum >= 10:
derived.append("%")
derived.append(str(rnum))
else:
if bond_index < total_bonds - 1:
derived.append("(")
token = bond_to_smiles(bond)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, bond_attribution))
stack.append((bond.dst, 0, len(mol.get_out_dirbonds(bond.dst)), bond_index < total_bonds - 1))
else:
if i < len(out_bonds) - 1:
derived.append("(")
token = bond_to_smiles(bond)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, mol.get_attribution(bond)))
_derive_smiles_from_fragment(
derived, mol, bond.dst, ring_log,
attribution_maps, attribution_index)
if i < len(out_bonds) - 1:
stack.pop()
if needs_closing:
derived.append(")")
return attribution_maps

@@ -10,3 +10,3 @@ #!/usr/bin/env python

name="selfies",
version="2.1.1",
version="2.1.2",
author="Mario Krenn, Alston Lo, and many other contributors",

@@ -13,0 +13,0 @@ author_email="mario.krenn@utoronto.ca, alan@aspuru.com",