selfies - npm Package Compare versions

Comparing version 2.1.1 to 2.1.2
LICENSE
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2019 Mario Krenn
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
import faulthandler
import pathlib
import random

import pandas as pd
import pytest
from rdkit import Chem

import selfies as sf

faulthandler.enable()

TEST_SET_DIR = pathlib.Path(__file__).parent / "test_sets"
ERROR_LOG_DIR = pathlib.Path(__file__).parent / "error_logs"
ERROR_LOG_DIR.mkdir(exist_ok=True, parents=True)

datasets = list(TEST_SET_DIR.glob("**/*.csv"))


@pytest.mark.parametrize("test_path", datasets)
def test_roundtrip_translation(test_path, dataset_samples):
    """Tests SMILES -> SELFIES -> SMILES translation on various datasets.
    """
    # very relaxed constraints
    constraints = sf.get_preset_constraints("hypervalent")
    constraints.update({"P": 7, "P-1": 8, "P+1": 6, "?": 12})
    sf.set_semantic_constraints(constraints)

    error_path = ERROR_LOG_DIR / "{}.csv".format(test_path.stem)
    with open(error_path, "w+") as error_log:
        error_log.write("In, Out\n")

    error_data = []
    error_found = False

    n_lines = sum(1 for _ in open(test_path)) - 1
    n_keep = dataset_samples if (0 < dataset_samples <= n_lines) else n_lines
    skip = random.sample(range(1, n_lines + 1), n_lines - n_keep)
    reader = pd.read_csv(test_path, chunksize=10000, header=0, skiprows=skip)

    for chunk in reader:
        for in_smiles in chunk["smiles"]:
            in_smiles = in_smiles.strip()

            mol = Chem.MolFromSmiles(in_smiles, sanitize=True)
            if (mol is None) or ("*" in in_smiles):
                continue

            try:
                selfies = sf.encoder(in_smiles, strict=True)
                out_smiles = sf.decoder(selfies)
            except (sf.EncoderError, sf.DecoderError):
                error_data.append((in_smiles, ""))
                continue

            if not is_same_mol(in_smiles, out_smiles):
                error_data.append((in_smiles, out_smiles))

        with open(error_path, "a") as error_log:
            for entry in error_data:
                error_log.write(",".join(entry) + "\n")

        error_found = error_found or error_data
        error_data = []

    sf.set_semantic_constraints()  # restore constraints

    assert not error_found


def is_same_mol(smiles1, smiles2):
    try:
        can_smiles1 = Chem.CanonSmiles(smiles1)
        can_smiles2 = Chem.CanonSmiles(smiles2)
        return can_smiles1 == can_smiles2
    except Exception:
        return False
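The subsampling step in the roundtrip test can be sketched on its own: given the number of data rows in a CSV and a target sample size, compute the 1-based row indices that pandas' `skiprows` should drop so that exactly `n_keep` rows survive. The helper name `sample_skiprows` and the seeded RNG below are illustrative, not part of the test suite.

```python
import random

# Illustrative sketch of the subsampling step (not part of the test
# suite): pick which of the n_lines data rows (1-based, header row
# excluded) pandas should skip so that exactly n_keep rows remain.
def sample_skiprows(n_lines, n_keep, seed=0):
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_lines + 1), n_lines - n_keep))

skip = sample_skiprows(n_lines=10, n_keep=3)
print(len(skip))  # 7 rows are skipped, 3 are kept
```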
import pytest

import selfies as sf


class Entry:

    def __init__(self, selfies, symbols, label, one_hot):
        self.selfies = selfies
        self.symbols = symbols
        self.label = label
        self.one_hot = one_hot


@pytest.fixture()
def dataset():
    stoi = {"[nop]": 0, "[O]": 1, ".": 2, "[C]": 3, "[F]": 4}
    itos = {i: c for c, i in stoi.items()}
    pad_to_len = 4

    entries = [
        Entry(selfies="",
              symbols=[],
              label=[0, 0, 0, 0],
              one_hot=[[1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0]]),
        Entry(selfies="[C][C][C]",
              symbols=["[C]", "[C]", "[C]"],
              label=[3, 3, 3, 0],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 0, 0, 1, 0],
                       [0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0]]),
        Entry(selfies="[C].[C]",
              symbols=["[C]", ".", "[C]"],
              label=[3, 2, 3, 0],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 0, 1, 0, 0],
                       [0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0]]),
        Entry(selfies="[C][O][C][F]",
              symbols=["[C]", "[O]", "[C]", "[F]"],
              label=[3, 1, 3, 4],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 1, 0, 0, 0],
                       [0, 0, 0, 1, 0],
                       [0, 0, 0, 0, 1]]),
        Entry(selfies="[C][O][C]",
              symbols=["[C]", "[O]", "[C]"],
              label=[3, 1, 3, 0],
              one_hot=[[0, 0, 0, 1, 0],
                       [0, 1, 0, 0, 0],
                       [0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0]])
    ]

    return entries, (stoi, itos, pad_to_len)


@pytest.fixture()
def dataset_flat_hots(dataset):
    flat_hots = []
    for entry in dataset[0]:
        hot = [elm for vec in entry.one_hot for elm in vec]
        flat_hots.append(hot)
    return flat_hots


def test_len_selfies(dataset):
    for entry in dataset[0]:
        assert sf.len_selfies(entry.selfies) == len(entry.symbols)


def test_split_selfies(dataset):
    for entry in dataset[0]:
        assert list(sf.split_selfies(entry.selfies)) == entry.symbols


def test_get_alphabet_from_selfies(dataset):
    entries, (vocab_stoi, _, _) = dataset

    selfies = [entry.selfies for entry in entries]
    alphabet = sf.get_alphabet_from_selfies(selfies)
    alphabet.add("[nop]")
    alphabet.add(".")

    assert alphabet == set(vocab_stoi.keys())


def test_selfies_to_encoding(dataset):
    entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset

    for entry in entries:
        label, one_hot = sf.selfies_to_encoding(
            entry.selfies, vocab_stoi, pad_to_len, "both"
        )

        assert label == entry.label
        assert one_hot == entry.one_hot

        # recover the original selfies
        selfies = sf.encoding_to_selfies(label, vocab_itos, "label")
        selfies = selfies.replace("[nop]", "")
        assert selfies == entry.selfies

        selfies = sf.encoding_to_selfies(one_hot, vocab_itos, "one_hot")
        selfies = selfies.replace("[nop]", "")
        assert selfies == entry.selfies


def test_selfies_to_flat_hot(dataset, dataset_flat_hots):
    entries, (vocab_stoi, vocab_itos, pad_to_len) = dataset

    batch = [entry.selfies for entry in entries]
    flat_hots = sf.batch_selfies_to_flat_hot(batch, vocab_stoi, pad_to_len)
    assert flat_hots == dataset_flat_hots

    # recover the original selfies
    recovered = sf.batch_flat_hot_to_selfies(flat_hots, vocab_itos)
    assert batch == [s.replace("[nop]", "") for s in recovered]
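The padding and encoding behavior these tests check can be mirrored by a short standalone helper. `encode_symbols` below is a hypothetical sketch of what `selfies_to_encoding` does for an already tokenized symbol list, not the library's implementation.

```python
# Hypothetical sketch (not the selfies API): label- and one-hot-encode
# an already tokenized symbol list, padding with "[nop]" up to a fixed
# length, mirroring the dataset fixture above.
def encode_symbols(symbols, stoi, pad_to_len):
    padded = symbols + ["[nop]"] * (pad_to_len - len(symbols))
    label = [stoi[s] for s in padded]
    one_hot = [[1 if i == idx else 0 for i in range(len(stoi))]
               for idx in label]
    return label, one_hot

stoi = {"[nop]": 0, "[O]": 1, ".": 2, "[C]": 3, "[F]": 4}
label, one_hot = encode_symbols(["[C]", "[O]", "[C]"], stoi, pad_to_len=4)
print(label)  # [3, 1, 3, 0], matching the "[C][O][C]" entry above
```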
import faulthandler
import random

import pytest
from rdkit.Chem import MolFromSmiles

import selfies as sf

faulthandler.enable()


@pytest.fixture()
def max_selfies_len():
    return 1000


@pytest.fixture()
def large_alphabet():
    alphabet = sf.get_semantic_robust_alphabet()
    alphabet.update([
        "[#Br]", "[#Branch1]", "[#Branch2]", "[#Branch3]", "[#C@@H1]",
        "[#C@@]", "[#C@H1]", "[#C@]", "[#C]", "[#Cl]", "[#F]", "[#H]", "[#I]",
        "[#NH1]", "[#N]", "[#O]", "[#P]", "[#Ring1]", "[#Ring2]", "[#Ring3]",
        "[#S]", "[/Br]", "[/C@@H1]", "[/C@@]", "[/C@H1]", "[/C@]", "[/C]",
        "[/Cl]", "[/F]", "[/H]", "[/I]", "[/NH1]", "[/N]", "[/O]", "[/P]",
        "[/S]", "[=Br]", "[=Branch1]", "[=Branch2]", "[=Branch3]", "[=C@@H1]",
        "[=C@@]", "[=C@H1]", "[=C@]", "[=C]", "[=Cl]", "[=F]", "[=H]", "[=I]",
        "[=NH1]", "[=N]", "[=O]", "[=P]", "[=Ring1]", "[=Ring2]", "[=Ring3]",
        "[=S]", "[Br]", "[Branch1]", "[Branch2]", "[Branch3]", "[C@@H1]",
        "[C@@]", "[C@H1]", "[C@]", "[C]", "[Cl]", "[F]", "[H]", "[I]", "[NH1]",
        "[N]", "[O]", "[P]", "[Ring1]", "[Ring2]", "[Ring3]", "[S]", "[\\Br]",
        "[\\C@@H1]", "[\\C@@]", "[\\C@H1]", "[\\C@]", "[\\C]", "[\\Cl]",
        "[\\F]", "[\\H]", "[\\I]", "[\\NH1]", "[\\N]", "[\\O]", "[\\P]",
        "[\\S]", "[nop]"
    ])
    return list(alphabet)


def test_random_selfies_decoder(trials, max_selfies_len, large_alphabet):
    """Tests that SELFIES strings generated by randomly stringing together
    symbols from the SELFIES alphabet are decoded into valid SMILES.
    """
    alphabet = tuple(large_alphabet)

    for _ in range(trials):
        # create a random SELFIES and decode it
        rand_len = random.randint(1, max_selfies_len)
        rand_selfies = "".join(random_choices(alphabet, k=rand_len))
        smiles = sf.decoder(rand_selfies)

        # check that the resulting SMILES is valid
        try:
            is_valid = MolFromSmiles(smiles, sanitize=True) is not None
        except Exception:
            is_valid = False

        err_msg = "SMILES: {}\n\t SELFIES: {}".format(smiles, rand_selfies)
        assert is_valid, err_msg


def test_nop_symbol_decoder(max_selfies_len, large_alphabet):
    """Tests that the '[nop]' symbol is always skipped over.
    """
    alphabet = list(large_alphabet)
    alphabet.remove("[nop]")

    for _ in range(100):
        # create random SELFIES with and without [nop]
        rand_len = random.randint(1, max_selfies_len)
        rand_mol = random_choices(alphabet, k=rand_len)
        rand_mol.extend(["[nop]"] * (max_selfies_len - rand_len))
        random.shuffle(rand_mol)

        with_nops = "".join(rand_mol)
        without_nops = with_nops.replace("[nop]", "")

        assert sf.decoder(with_nops) == sf.decoder(without_nops)


def test_get_semantic_constraints():
    constraints = sf.get_semantic_constraints()
    assert constraints is not sf.get_semantic_constraints()  # not an alias
    assert "?" in constraints


def test_change_constraints_cache_clear():
    alphabet = sf.get_semantic_robust_alphabet()
    assert alphabet == sf.get_semantic_robust_alphabet()
    assert sf.decoder("[C][#C]") == "C#C"

    new_constraints = sf.get_semantic_constraints()
    new_constraints["C"] = 1
    sf.set_semantic_constraints(new_constraints)

    new_alphabet = sf.get_semantic_robust_alphabet()
    assert new_alphabet != alphabet
    assert sf.decoder("[C][#C]") == "CC"

    sf.set_semantic_constraints()  # restore the default constraints


def test_invalid_or_unsupported_smiles_encoder():
    malformed_smiles = [
        "",
        "(",
        "C(Cl)(Cl)CC[13C",
        "C(CCCOC",
        "C=(CCOC",
        "CCCC)",
        "C1CCCCC",
        "C(F)(F)(F)(F)(F)F",  # violates bond constraints
        "C=C1=CCCCCC1",  # violates bond constraints
        "CC*CC",  # uses wildcard
        "C$C",  # uses $ bond
        "S[As@TB1](F)(Cl)(Br)N",  # unrecognized chirality
        "SOMETHINGWRONGHERE",
        "1243124124",
    ]

    for smiles in malformed_smiles:
        with pytest.raises(sf.EncoderError):
            sf.encoder(smiles)


def test_malformed_selfies_decoder():
    with pytest.raises(sf.DecoderError):
        sf.decoder("[O][=C][O][C][C][C][C][O][N][Branch2_3")


def random_choices(population, k):  # random.choices was added in Python 3.6
    return [random.choice(population) for _ in range(k)]


def test_decoder_attribution():
    sm, am = sf.decoder(
        "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]", attribute=True)

    # check that the P atom is attributed to the [P] symbol
    for ta in am:
        if ta.token == "P":
            for a in ta.attribution:
                if a.token == "[P]":
                    return
    raise ValueError("Failed to find P in attribution map")


def test_encoder_attribution():
    smiles = "C1([O-])C=CC=C1Cl"
    indices = [0, 3, 3, 3, 5, 7, 8, 10, None, None, 12]
    s, am = sf.encoder(smiles, attribute=True)

    for i, ta in enumerate(am):
        if ta.attribution:
            assert indices[i] == ta.attribution[0].index, \
                f"found {ta[1]}; should be {indices[i]}"
        if ta.token == "[Cl]":
            assert "Cl" in [a.token for a in ta.attribution], \
                "Failed to find Cl in attribution map"
import pytest

import selfies as sf


def decode_eq(selfies, smiles):
    s = sf.decoder(selfies)
    return s == smiles


def test_branch_and_ring_at_state_X0():
    """Tests SELFIES with branches and rings at state X0 (i.e., at the
    very beginning of a SELFIES). These symbols should be skipped.
    """
    assert decode_eq("[Branch3][C][S][C][O]", "CSCO")
    assert decode_eq("[Ring3][C][S][C][O]", "CSCO")
    assert decode_eq("[Branch1][Ring1][Ring3][C][S][C][O]", "CSCO")


def test_branch_at_state_X1():
    """Tests SELFIES with branches at state X1 (i.e., at an atom that can
    only make one bond). In this case, the branch symbol should be skipped.
    """
    assert decode_eq("[C][C][O][Branch1][C][I]", "CCOCI")
    assert decode_eq("[C][C][C][O][#Branch3][C][I]", "CCCOCI")


def test_branch_and_ring_decrement_state():
    """Tests that the branch and ring symbols properly decrement the
    derivation state.
    """
    assert decode_eq("[C][C][C][Ring1][Ring1][#C]", "C1CC1=C")
    assert decode_eq("[C][=C][C][C][#Ring1][Ring1][#C]", "C=C1CC1")
    assert decode_eq("[C][O][C][C][=Ring1][Ring1][#C]", "COCCC")
    assert decode_eq("[C][=C][Branch1][C][=C][#C]", "C=C(C)C")


def test_branch_at_end_of_selfies():
    """Tests SELFIES that have a branch symbol as their very last symbol.
    """
    assert decode_eq("[C][C][C][C][Branch1]", "CCCC")
    assert decode_eq("[C][C][C][C][#Branch3]", "CCCC")


def test_ring_at_end_of_selfies():
    """Tests SELFIES that have a ring symbol as their very last symbol.
    """
    assert decode_eq("[C][C][C][C][C][Ring1]", "CCCC=C")
    assert decode_eq("[C][C][C][C][C][Ring3]", "CCCC=C")


def test_branch_with_no_atoms():
    """Tests SELFIES that have a branch with no atoms in it. Such branches
    should not appear in the output SMILES.
    """
    s = "[C][Branch1][Ring2][Branch1][Branch1][Branch1][F]"
    assert decode_eq(s, "CF")
    s = "[C][Branch1][Ring2][Ring1][Ring1][Branch1][F]"
    assert decode_eq(s, "CF")
    s = "[C][=Branch1][Ring2][Branch1][C][Cl][F]"
    assert decode_eq(s, "C(Cl)F")

    # special case: [#Branch3] takes Q_1, Q_2 = [O] and Q_3 = ''.
    # However, there are no more symbols in the branch.
    assert decode_eq("[C][C][C][C][#Branch3][O][O]", "CCCC")


def test_oversized_branch():
    """Tests SELFIES that have a branch with Q larger than the length
    of the SELFIES.
    """
    assert decode_eq("[C][Branch2][O][O][C][C][S][F][C]", "CCCSF")
    assert decode_eq("[C][#Branch2][O][O][#C][C][S][F]", "C#CCSF")


def test_oversized_ring():
    """Tests SELFIES that have a ring with Q so large that the (Q + 1)-th
    previously derived atom does not exist.
    """
    assert decode_eq("[C][C][C][C][Ring1][O]", "C1CCC1")
    assert decode_eq("[C][C][C][C][Ring2][O][C]", "C1CCC1")

    # special case: [Ring2] takes Q_1 = [O] and Q_2 = '', leading to
    # Q = 9 * 16 + 0 (i.e., an oversized ring)
    assert decode_eq("[C][C][C][C][Ring2][O]", "C1CCC1")

    # special case: a ring from the 1st atom to itself should not be formed
    assert decode_eq("[C][Ring1][O]", "C")


def test_branch_at_beginning_of_branch():
    """Tests SELFIES that have a branch immediately at the start of a branch.
    """
    # [C@]((Br)Cl)F
    s = "[C@][=Branch1][Branch1][Branch1][C][Br][Cl][F]"
    assert decode_eq(s, "[C@](Br)(Cl)F")

    # [C@](((Br)Cl)I)F
    s = "[C@][#Branch1][Branch2][=Branch1][Branch1][Branch1][C][Br][Cl][I][F]"
    assert decode_eq(s, "[C@](Br)(Cl)(I)F")

    # [C@]((Br)(Cl)I)F
    s = "[C@][#Branch1][Branch2][Branch1][C][Br][Branch1][C][Cl][I][F]"
    assert decode_eq(s, "[C@](Br)(Cl)(I)F")


def test_ring_at_beginning_of_branch():
    """Tests SELFIES that have a ring immediately at the start of a branch.
    """
    # CC1CCC(1CCl)F
    s = "[C][C][C][C][C][=Branch1][Branch1][Ring1][Ring2][C][Cl][F]"
    assert decode_eq(s, "CC1CCC1(CCl)F")

    # CC1CCS(Br)(1CCl)F
    s = "[C][C][C][C][S][Branch1][C][Br]" \
        "[=Branch1][Branch1][Ring1][Ring2][C][Cl][F]"
    assert decode_eq(s, "CC1CCS1(Br)(CCl)F")


def test_branch_and_ring_at_beginning_of_branch():
    """Tests SELFIES that have a branch and a ring immediately at the start
    of a branch.
    """
    # CC1CCCS((Br)1Cl)F
    s = "[C][C][C][C][C][S][#Branch1][#Branch1][Branch1][C][Br]" \
        "[Ring1][Branch1][Cl][F]"
    assert decode_eq(s, "CC1CCCS1(Br)(Cl)F")

    # CC1CCCS(1(Br)Cl)F
    s = "[C][C][C][C][C][S][#Branch1][#Branch1][Ring1][Branch1]" \
        "[Branch1][C][Br][Cl][F]"
    assert decode_eq(s, "CC1CCCS1(Br)(Cl)F")


def test_ring_immediately_following_branch():
    """Tests SELFIES that have a ring immediately following a branch.
    """
    # CCC1CCCC(OCO)1
    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][Ring1][Branch1]"
    assert decode_eq(s, "CCC1CCCC1OCO")

    # CCC1CCCC(OCO)(F)1
    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \
        "[Branch1][C][F][Ring1][Branch1]"
    assert decode_eq(s, "CCC1CCCC1(OCO)F")


def test_ring_after_branch():
    """Tests SELFIES that have a ring following a branch, but not
    immediately after it.
    """
    # CCCCCCC1(OCO)1
    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O][C][Ring1][Branch1]"
    assert decode_eq(s, "CCCCCCC(OCO)=C")

    s = "[C][C][C][C][C][C][C][Branch1][Ring2][O][C][O]" \
        "[Branch1][C][F][C][C][Ring1][=Branch2]"
    assert decode_eq(s, "CCCCC1CC(OCO)(F)CC1")


def test_ring_on_top_of_existing_bond():
    """Tests SELFIES with rings between two atoms that are already bonded
    in the main scaffold.
    """
    # C1C1, C1C=1, C1C#1, ...
    assert decode_eq("[C][C][Ring1][C]", "C=C")
    assert decode_eq("[C][/C][Ring1][C]", "C=C")
    assert decode_eq("[C][C][=Ring1][C]", "C#C")
    assert decode_eq("[C][C][#Ring1][C]", "C#C")


def test_consecutive_rings():
    """Tests SELFIES with multiple consecutive rings.
    """
    s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C=1CCC=1")  # 1 + 1

    s = "[C][C][C][C][Ring1][Ring2][Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 1 + 1 + 1

    s = "[C][C][C][C][=Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 2 + 1

    s = "[C][C][C][C][Ring1][Ring2][=Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 1 + 2

    # consecutive rings that exceed bond constraints
    s = "[C][C][C][C][#Ring1][Ring2][=Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 3 + 2
    s = "[C][C][C][C][=Ring1][Ring2][#Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 2 + 3
    s = "[C][C][C][C][=Ring1][Ring2][=Ring1][Ring2]"
    assert decode_eq(s, "C#1CCC#1")  # 2 + 2

    # consecutive rings with a stereochemical single bond
    s = "[C][C][C][C][\\/Ring1][Ring2]"
    assert decode_eq(s, "C\\1CCC/1")
    s = "[C][C][C][C][\\/Ring1][Ring2][Ring1][Ring2]"
    assert decode_eq(s, "C=1CCC=1")


def test_unconstrained_symbols():
    """Tests SELFIES with symbols that are not semantically constrained.
    """
    f_branch = "[Branch1][C][F]"
    s = "[Xe-2]" + (f_branch * 8)
    assert decode_eq(s, "[Xe-2](F)(F)(F)(F)(F)(F)(F)CF")

    # change the default semantic constraints
    constraints = sf.get_semantic_constraints()
    constraints["?"] = 2
    sf.set_semantic_constraints(constraints)

    assert decode_eq(s, "[Xe-2](F)CF")

    sf.set_semantic_constraints()


def test_isotope_symbols():
    """Tests that SELFIES symbols with isotope specifications are
    constrained properly.
    """
    s = "[13C][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]"
    assert decode_eq(s, "[13C](Cl)(F)(Br)CI")

    assert decode_eq("[C][36Cl][C]", "C[36Cl]")


def test_chiral_symbols():
    """Tests that SELFIES symbols with chirality specifications are
    constrained properly.
    """
    s = "[C@@][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br][Branch1][C][I]"
    assert decode_eq(s, "[C@@](Cl)(F)(Br)CI")

    s = "[C@H1][Branch1][C][Cl][Branch1][C][F][Branch1][C][Br]"
    assert decode_eq(s, "[C@H1](Cl)(F)CBr")


def test_explicit_hydrogen_symbols():
    """Tests that SELFIES symbols with explicit hydrogen specifications
    are constrained properly.
    """
    assert decode_eq("[CH1][Branch1][C][Cl][#C]", "[CH1](Cl)=C")
    assert decode_eq("[CH3][=C]", "[CH3]C")
    assert decode_eq("[CH4][C][C]", "[CH4]")
    assert decode_eq("[C][C][C][CH4]", "CCC")
    assert decode_eq("[C][Branch1][Ring2][C][=CH4][C][=C]", "C(C)=C")

    with pytest.raises(sf.DecoderError):
        sf.decoder("[C][C][CH5]")
    with pytest.raises(sf.DecoderError):
        sf.decoder("[C][C][C][OH9]")


def test_charged_symbols():
    """Tests that SELFIES symbols with charges are constrained properly.
    """
    constraints = sf.get_semantic_constraints()
    constraints["Sn+4"] = 1
    constraints["O-2"] = 2
    sf.set_semantic_constraints(constraints)

    # the following molecules don't make chemical sense, but they are used
    # to test selfies; hence, they cannot be verified with RDKit
    assert decode_eq("[Sn+4][=C]", "[Sn+4]C")
    assert decode_eq("[O-2][#C]", "[O-2]=C")

    # mixing many symbol types
    assert decode_eq("[17O@@H1-2][#C]", "[17O@@H1-2]C")

    sf.set_semantic_constraints()


def test_standardized_alphabet():
    """Tests that equivalent SMILES atom symbols are translated into the
    same SELFIES atom symbol.
    """
    assert sf.encoder("[C][O][N][P][F]") == "[CH0][OH0][NH0][PH0][FH0]"
    assert sf.encoder("[Fe][Si]") == "[Fe][Si]"
    assert sf.encoder("[Fe++][Fe+2]") == "[Fe+2][Fe+2]"
    assert sf.encoder("[CH][CH1]") == "[CH1][CH1]"


def test_old_symbols():
    """Tests backward compatibility of SELFIES with old (< v2) symbols.
    """
    s = "[C@@Hexpl][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F]"
    assert sf.decoder(s, compatible=True) == "[C@@H1](C)(Cl)F"

    s = "[C][C][C][C][Expl=Ring1][Ring2][Expl#Ring1][Ring2]"
    assert sf.decoder(s, compatible=True) == "C#1CCC#1"

    long_s = "[C@@Hexpl][=C][C@@Hexpl][N+expl][=C][C+expl][N+expl][O+expl]" \
             "[Fe++expl][C@@Hexpl][C][N+expl][Branch1_2][Fe++expl][S+expl]" \
             "[=C][Expl=Ring1][Fe++expl][S+expl][Expl=Ring1][O+expl]" \
             "[C@@Hexpl][Expl=Ring1][C@@Hexpl][C@@Hexpl][N+expl][Expl=Ring1]" \
             "[Expl=Ring1][S+expl][=C]"
    try:
        sf.decoder(long_s, compatible=True)
    except Exception:
        pytest.fail("Decoding old-style SELFIES raised an exception")


def test_large_selfies_decoding():
    """Tests that extremely large SELFIES strings can be decoded
    (this used to cause a RecursionError).
    """
    large_selfies = "[C]" * 1024
    expected_smiles = "C" * 1024
    assert decode_eq(large_selfies, expected_smiles)
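Several of the special cases above hinge on how ring and branch index symbols build the number Q, e.g. the "Q = 9 * 16 + 0" comment. The sketch below reproduces only that arithmetic; the partial index table is inferred from the comments in these tests and is not the full SELFIES index alphabet.

```python
# Sketch of the Q arithmetic used by ring/branch symbols: each index
# symbol contributes one base-16 digit. The table is a partial, inferred
# mapping (e.g. [O] -> 9 follows from the "Q = 9 * 16 + 0" comment in
# test_oversized_ring); missing or empty symbols count as 0.
INDEX = {"[C]": 0, "[Ring1]": 1, "[Ring2]": 2, "[Branch1]": 3, "[O]": 9}

def q_value(index_symbols):
    q = 0
    for symbol in index_symbols:
        q = q * 16 + INDEX.get(symbol, 0)
    return q

print(q_value(["[O]"]))      # 9: ring back to the 10th previous atom
print(q_value(["[O]", ""]))  # 144: oversized for a 4-atom chain
```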
Metadata-Version: 2.1
Name: selfies
-Version: 2.1.1
+Version: 2.1.2
Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs.

@@ -8,245 +8,2 @@ Home-page: https://github.com/aspuru-guzik-group/selfies

Author-email: mario.krenn@utoronto.ca, alan@aspuru.com
License: UNKNOWN
Description: # SELFIES
[![GitHub release](https://img.shields.io/github/release/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/releases/)
![versions](https://img.shields.io/pypi/pyversions/selfies.svg)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/commit-activity)
[![GitHub issues](https://img.shields.io/github/issues/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/issues/)
[![Documentation Status](https://readthedocs.org/projects/selfiesv2/badge/?version=latest)](http://selfiesv2.readthedocs.io/?badge=latest)
[![GitHub contributors](https://img.shields.io/github/contributors/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/contributors/)
**Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation**\
_Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik_\
[*Machine Learning: Science and Technology* **1**, 045024 (2020)](https://iopscience.iop.org/article/10.1088/2632-2153/aba947), [extensive blog post January 2021](https://aspuru.substack.com/p/molecular-graph-representations-and).\
[Talk on youtube about SELFIES](https://www.youtube.com/watch?v=CaIyUmfGXDk).\
[A community paper with 31 authors on SELFIES and the future of molecular string representations](https://arxiv.org/abs/2204.00056).\
[Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\
Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\
Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\
Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ)
---
A main objective of SELFIES is to serve as direct input to machine learning models,
in particular generative models, so that the molecular graphs they generate
are syntactically and semantically valid.
<p align="center">
<img src="https://github.com/aspuru-guzik-group/selfies/blob/master/examples/VAE_LS_Validity.png" alt="SELFIES validity in a VAE latent space" width="666px">
</p>
## Installation
Use pip to install ``selfies``.
```bash
pip install selfies
```
To check if the correct version of ``selfies`` is installed, use
the following pip command.
```bash
pip show selfies
```
To upgrade to the latest release of ``selfies`` from an older version, use the
following pip command. Before upgrading, please review the changes between
versions in the
[CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md):
```bash
pip install selfies --upgrade
```
## Usage
### Overview
Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/),
which contains a thorough tutorial for getting started with ``selfies``
and detailed descriptions of the functions
that ``selfies`` provides. We summarize some key functions below.
| Function | Description |
| ------------------------------------- | ----------------------------------------------------------------- |
| ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. |
| ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. |
| ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. |
| ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. |
| ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. |
| ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. |
| ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. |
| ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. |
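Since a SELFIES string is just a sequence of bracketed symbols, tokenization is simple. The following is only an illustrative sketch of the symbol grammar, not the library's implementation (in practice, use ``selfies.split_selfies``, which also handles details such as ``.``-separated fragments):

```python
import re

def split_selfies_sketch(selfies: str):
    # Each SELFIES symbol is a bracketed token like [C], [=C], or [=Branch1].
    return re.findall(r"\[[^\]]*\]", selfies)

tokens = split_selfies_sketch("[C][=C][C][=C][C][=C][Ring1][=Branch1]")
print(tokens)    # ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
print(len(tokens))  # 8, matching selfies.len_selfies on the same string
```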
### Examples
#### Translation between SELFIES and SMILES representations:
```python
import selfies as sf
benzene = "c1ccccc1"
# SMILES -> SELFIES -> SMILES translation
try:
benzene_sf = sf.encoder(benzene) # [C][=C][C][=C][C][=C][Ring1][=Branch1]
benzene_smi = sf.decoder(benzene_sf) # C1=CC=CC=C1
except sf.EncoderError:
pass # sf.encoder error!
except sf.DecoderError:
pass # sf.decoder error!
len_benzene = sf.len_selfies(benzene_sf) # 8
symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```
#### Very simple creation of random valid molecules:
A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by [Rajarshi Guha](https://twitter.com/rguha/status/1543601839983284224):
```python
import selfies as sf
import random
alphabet = sf.get_semantic_robust_alphabet()  # the alphabet of semantically robust symbols
rnd_selfies = "".join(random.sample(list(alphabet), 9))
rnd_smiles = sf.decoder(rnd_selfies)
print(rnd_smiles)
```
These few lines produce exotic molecules, but all of them are valid. They can serve as a starting point for more advanced filtering techniques or for machine learning models.
#### Integer and one-hot encoding SELFIES:
In this example, we first build an alphabet from a dataset of SELFIES strings,
and then convert a SELFIES string into its padded encoding. Note that we use the
``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_%28code%29))
symbol to pad our SELFIES, which is a special SELFIES symbol that is always
ignored and skipped over by ``selfies.decoder``, making it a useful
padding character.
```python
import selfies as sf
dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"]
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add("[nop]") # [nop] is a special padding symbol
alphabet = sorted(alphabet)  # ['[=O]', '[C]', '[F]', '[O]', '[nop]']
pad_to_len = max(sf.len_selfies(s) for s in dataset) # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}
dimethyl_ether = dataset[0] # [C][O][C]
label, one_hot = sf.selfies_to_encoding(
selfies=dimethyl_ether,
vocab_stoi=symbol_to_idx,
pad_to_len=pad_to_len,
enc_type="both"
)
# label = [1, 3, 1, 4, 4]
# one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
```
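The table above also lists ``selfies.encoding_to_selfies`` for the reverse direction. Conceptually, a label encoding is just a sequence of indices into the alphabet; the following stdlib-only sketch shows that mapping (in practice, call ``sf.encoding_to_selfies`` with the inverse vocabulary):

```python
alphabet = ['[=O]', '[C]', '[F]', '[O]', '[nop]']  # same alphabet as above
idx_to_symbol = dict(enumerate(alphabet))

label = [1, 3, 1, 4, 4]  # padded label encoding of dimethyl ether
recovered = "".join(idx_to_symbol[i] for i in label)
print(recovered)  # [C][O][C][nop][nop] -- the decoder skips the [nop] padding
```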
#### Customizing SELFIES:
In this example, we relax the semantic constraints of ``selfies`` to allow
for hypervalences (caution: hypervalence rules are much less understood
than octet rules. Some molecules containing hypervalences are important,
but generally, it is not known which molecules are stable and reasonable).
```python
import selfies as sf
hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid
standard_derived_smi = sf.decoder(hypervalent_sf)
# OI (the default constraints for I allow only 1 bond)
sf.set_semantic_constraints("hypervalent")
relaxed_derived_smi = sf.decoder(hypervalent_sf)
# O=I(O)(O)(O)(O)O (the hypervalent constraints for I allow 7 bonds)
```
#### Explaining Translation:
You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens.
```python
import selfies as sf

selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]"
smiles, attr = sf.decoder(selfies, attribute=True)
print('SELFIES', selfies)
print('SMILES', smiles)
print('Attribution:')
for smiles_token in attr:
print(smiles_token)
# Output:
# SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
# SMILES C1NC(P)CC1
# Attribution:
# AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
# AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
# AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
# AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
# AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
# AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])
```
``attr`` is a list of `AttributionMap` objects, each containing an output token, its index, and the input tokens that led to it. For example, the ``P`` appearing in the output SMILES is the result of both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control over tracking the translation (such as tracking rings), you can access attributions on the underlying molecular graph with ``get_attribution``.
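Because the attribution records are plain (index, token) structures, they are easy to post-process. The sketch below uses stand-in namedtuples shaped like the records printed above (the real objects come from ``selfies``) to collect, for each output SMILES token, the SELFIES symbols responsible for it:

```python
from collections import namedtuple

# Stand-ins shaped like the attribution records shown above.
Attribution = namedtuple("Attribution", ["index", "token"])
AttributionMap = namedtuple("AttributionMap", ["index", "token", "attribution"])

attr = [
    AttributionMap(0, "C", [Attribution(0, "[C]")]),
    AttributionMap(5, "P", [Attribution(3, "[Branch1]"), Attribution(5, "[P]")]),
]

# Map each output SMILES token (by position) to its source SELFIES symbols.
sources = {(m.index, m.token): [a.token for a in m.attribution] for m in attr}
print(sources[(5, "P")])  # ['[Branch1]', '[P]']
```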
### More Usages and Examples
* More examples can be found in the ``examples/`` directory, including a
[variational autoencoder that runs on the SELFIES](https://github.com/aspuru-guzik-group/selfies/tree/master/examples/vae_example) language.
* This [ICLR2020 paper](https://arxiv.org/abs/1909.11655) used SELFIES in a
genetic algorithm to achieve state-of-the-art performance for inverse design,
with the [code here](https://github.com/aspuru-guzik-group/GA).
* SELFIES allows for [highly efficient exploration and interpolation of the chemical space](https://chemrxiv.org/articles/preprint/Beyond_Generative_Models_Superfast_Traversal_Optimization_Novelty_Exploration_and_Discovery_STONED_Algorithm_for_Molecules_using_SELFIES/13383266), with a deterministic algorithm ([see code](https://github.com/aspuru-guzik-group/stoned-selfies)).
* We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea).
* Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck showed in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks; see the code for [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator).
* Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling.
* As an improvement over the earlier genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling.
## Tests
`selfies` uses `pytest` with `tox` as its testing framework.
All tests can be found in the `tests/` directory. To run the test suite for
SELFIES, install ``tox`` and run:
```bash
tox -- --trials=10000 --dataset_samples=10000
```
By default, `selfies` is tested against a random subset
(of size ``dataset_samples=10000``) of various datasets:
* 130K molecules from [QM9](https://www.nature.com/articles/sdata201422)
* 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database)
* 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307)
* 160K+ molecules from various [MoleculeNet](http://moleculenet.ai/datasets-1) datasets
* 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html).
Due to its large size, this dataset is not included in the repository. To run tests
on it, please download the dataset into the ``tests/test_sets`` directory
and run the ``tests/run_on_large_dataset.py`` script.
## Version History
See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md).
## Credits
We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin,
HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan),
Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports,
and Robert Pollice for chemistry advice.
## License
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
Metadata-Version: 2.1
Name: selfies
Version: 2.1.1
Version: 2.1.2
Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs.

Home-page: https://github.com/aspuru-guzik-group/selfies
Author-email: mario.krenn@utoronto.ca, alan@aspuru.com
* We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea).
* Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator).
* Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling.
* An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling.
## Tests
`selfies` uses `pytest` with `tox` as its testing framework.
All tests can be found in the `tests/` directory. To run the test suite for
SELFIES, install ``tox`` and run:
```bash
tox -- --trials=10000 --dataset_samples=10000
```
By default, `selfies` is tested against a random subset
(of size ``dataset_samples=10000``) on various datasets:
* 130K molecules from [QM9](https://www.nature.com/articles/sdata201422)
* 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database)
* 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307)
* 160K+ molecules from various [MoleculeNet](http://moleculenet.ai/datasets-1) datasets
* 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html).
Due to its large size, this dataset is not included on the repository. To run tests
on it, please download the dataset into the ``tests/test_sets`` directory
and run the ``tests/run_on_large_dataset.py`` script.
## Version History
See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md).
## Credits
We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin,
HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan),
Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports,
and Robert Pollice for chemistry advices.
## License
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3

@@ -262,1 +19,245 @@ Classifier: Programming Language :: Python :: 3.7

Description-Content-Type: text/markdown
License-File: LICENSE
# SELFIES
[![GitHub release](https://img.shields.io/github/release/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/releases/)
![versions](https://img.shields.io/pypi/pyversions/selfies.svg)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/commit-activity)
[![GitHub issues](https://img.shields.io/github/issues/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/issues/)
[![Documentation Status](https://readthedocs.org/projects/selfiesv2/badge/?version=latest)](http://selfiesv2.readthedocs.io/?badge=latest)
[![GitHub contributors](https://img.shields.io/github/contributors/aspuru-guzik-group/selfies.svg)](https://GitHub.com/aspuru-guzik-group/selfies/graphs/contributors/)
**Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation**\
_Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik_\
[*Machine Learning: Science and Technology* **1**, 045024 (2020)](https://iopscience.iop.org/article/10.1088/2632-2153/aba947), [extensive blog post January 2021](https://aspuru.substack.com/p/molecular-graph-representations-and).\
[Talk on youtube about SELFIES](https://www.youtube.com/watch?v=CaIyUmfGXDk).\
[A community paper with 31 authors on SELFIES and the future of molecular string representations](https://arxiv.org/abs/2204.00056).\
[Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\
**[Code-Paper in February 2023](https://pubs.rsc.org/en/content/articlelanding/2023/DD/D3DD00044C)**\
[SELFIES in Wolfram Mathematica](https://resources.wolframcloud.com/PacletRepository/resources/WolframChemistry/Selfies/) (since Dec 2023)\
Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\
Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\
Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ)
---
A main objective is to use SELFIES as direct input into machine learning models,
in particular in generative models, for the generation of molecular graphs
which are syntactically and semantically valid.
<p align="center">
<img src="https://github.com/aspuru-guzik-group/selfies/blob/master/examples/VAE_LS_Validity.png" alt="SELFIES validity in a VAE latent space" width="666px">
</p>
## Installation
Use pip to install ``selfies``.
```bash
pip install selfies
```
To check if the correct version of ``selfies`` is installed, use
the following pip command.
```bash
pip show selfies
```
To upgrade to the latest release of ``selfies`` from an older version, use the
following pip command. Please review the
[CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md)
for the changes between versions before upgrading:
```bash
pip install selfies --upgrade
```
## Usage
### Overview
Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/),
which contains a thorough tutorial for getting started with ``selfies``
and detailed descriptions of the functions
that ``selfies`` provides. We summarize some key functions below.
| Function | Description |
| ------------------------------------- | ----------------------------------------------------------------- |
| ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. |
| ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. |
| ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. |
| ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. |
| ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. |
| ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. |
| ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. |
| ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. |
### Examples
#### Translation between SELFIES and SMILES representations:
```python
import selfies as sf
benzene = "c1ccccc1"
# SMILES -> SELFIES -> SMILES translation
try:
benzene_sf = sf.encoder(benzene) # [C][=C][C][=C][C][=C][Ring1][=Branch1]
benzene_smi = sf.decoder(benzene_sf) # C1=CC=CC=C1
except sf.EncoderError:
pass # sf.encoder error!
except sf.DecoderError:
pass # sf.decoder error!
len_benzene = sf.len_selfies(benzene_sf) # 8
symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```
#### Very simple creation of random valid molecules:
A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by [Rajarshi Guha](https://twitter.com/rguha/status/1543601839983284224):
```python
import selfies as sf
import random
alphabet = sf.get_semantic_robust_alphabet()  # the alphabet of semantically robust symbols
rnd_selfies = "".join(random.sample(list(alphabet), 9))
rnd_smiles = sf.decoder(rnd_selfies)
print(rnd_smiles)
```
These few lines produce wild-looking molecules, but all of them are valid. They can serve as a starting point for more advanced filtering techniques or for machine learning models.
#### Integer and one-hot encoding SELFIES:
In this example, we first build an alphabet from a dataset of SELFIES strings,
and then convert a SELFIES string into its padded encoding. Note that we pad with the
``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code))) symbol,
a special SELFIES symbol that is always ignored and skipped over by
``selfies.decoder``, making it a useful padding character.
```python
import selfies as sf
dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"]
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add("[nop]") # [nop] is a special padding symbol
alphabet = list(sorted(alphabet)) # ['[=O]', '[C]', '[F]', '[O]', '[nop]']
pad_to_len = max(sf.len_selfies(s) for s in dataset) # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}
dimethyl_ether = dataset[0] # [C][O][C]
label, one_hot = sf.selfies_to_encoding(
selfies=dimethyl_ether,
vocab_stoi=symbol_to_idx,
pad_to_len=pad_to_len,
enc_type="both"
)
# label = [1, 3, 1, 4, 4]
# one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
```
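The function table above also lists ``selfies.encoding_to_selfies`` for the reverse direction. The core of that inverse mapping can be sketched in plain Python (a minimal illustration of the idea, not the library's implementation): invert the vocabulary, look up each label, and drop the ``[nop]`` padding.

```python
# Minimal sketch of the inverse of the encoding above
# (illustrative only; the library provides selfies.encoding_to_selfies).
label = [1, 3, 1, 4, 4]
alphabet = ['[=O]', '[C]', '[F]', '[O]', '[nop]']

idx_to_symbol = {i: s for i, s in enumerate(alphabet)}     # inverse vocabulary
symbols = [idx_to_symbol[i] for i in label]                # labels -> symbols
selfies_str = "".join(s for s in symbols if s != "[nop]")  # strip padding
print(selfies_str)  # [C][O][C]
```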
#### Customizing SELFIES:
In this example, we relax the semantic constraints of ``selfies`` to allow
for hypervalences (caution: hypervalence rules are much less understood
than octet rules. Some molecules containing hypervalences are important,
but generally, it is not known which molecules are stable and reasonable).
```python
import selfies as sf
hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid
standard_derived_smi = sf.decoder(hypervalent_sf)
# OI (the default constraints for I allow only 1 bond)
sf.set_semantic_constraints("hypervalent")
relaxed_derived_smi = sf.decoder(hypervalent_sf)
# O=I(O)(O)(O)(O)O (the hypervalent constraints for I allow up to 7 bonds)
```
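Conceptually, a semantic-constraints setting is just a mapping from atom symbol to its maximum bond count, and the decoder caps any requested bond order at that limit. A toy sketch of the capping idea (the helper function and constraint dictionaries are hypothetical illustrations, not ``selfies`` internals):

```python
# Toy illustration of bond capping under semantic constraints.
# The values mirror common defaults (e.g. C: 4, O: 2, I: 1), but
# cap_bonds is a hypothetical helper, not the library's code.
default_constraints = {"C": 4, "N": 3, "O": 2, "F": 1, "I": 1}
hypervalent_constraints = {**default_constraints, "I": 7}

def cap_bonds(atom, requested_bonds, constraints):
    """Clamp a requested bond count to the atom's allowed maximum."""
    return min(requested_bonds, constraints.get(atom, 8))

print(cap_bonds("I", 7, default_constraints))      # 1: excess bonds dropped
print(cap_bonds("I", 7, hypervalent_constraints))  # 7: hypervalence allowed
```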
#### Explaining Translation:
You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens.
```python
import selfies as sf

selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]"
smiles, attr = sf.decoder(selfies, attribute=True)
print('SELFIES', selfies)
print('SMILES', smiles)
print('Attribution:')
for smiles_token in attr:
print(smiles_token)
# Output:
# SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
# SMILES C1NC(P)CC1
# Attribution:
# AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
# AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
# AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
# AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
# AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
# AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])
```
``attr`` is a list of ``AttributionMap``s containing the output token, its index, and the input tokens that led to it. For example, the ``P`` appearing in the output SMILES results from both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer-grained tracking of the translation (such as following rings), you can access attributions on the underlying molecular graph with ``get_attribution``.
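To post-process the attribution list, you might group the contributing input tokens by the output token they produced. A small sketch using plain tuples in place of the ``AttributionMap``/``Attribution`` objects (the field layout mirrors the printed output above; the tuples themselves are illustrative stand-ins):

```python
# Group input SELFIES tokens by the output SMILES token they produced.
# Tuples stand in for AttributionMap/Attribution objects, using the
# same (index, token, attribution) layout as the printed output above.
attr = [
    (0, 'C', [(0, '[C]')]),
    (2, 'N', [(1, '[N]')]),
    (5, 'P', [(3, '[Branch1]'), (5, '[P]')]),
]

contributions = {
    (out_idx, out_tok): [tok for _, tok in sources]
    for out_idx, out_tok, sources in attr
}
print(contributions[(5, 'P')])  # ['[Branch1]', '[P]']
```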
### More Usages and Examples
* More examples can be found in the ``examples/`` directory, including a
[variational autoencoder that runs on the SELFIES language](https://github.com/aspuru-guzik-group/selfies/tree/master/examples/vae_example).
* This [ICLR2020 paper](https://arxiv.org/abs/1909.11655) used SELFIES in a
genetic algorithm to achieve state-of-the-art performance for inverse design,
with the [code here](https://github.com/aspuru-guzik-group/GA).
* SELFIES allows for [highly efficient exploration and interpolation of the chemical space](https://chemrxiv.org/articles/preprint/Beyond_Generative_Models_Superfast_Traversal_Optimization_Novelty_Exploration_and_Discovery_STONED_Algorithm_for_Molecules_using_SELFIES/13383266) with deterministic algorithms ([see the code](https://github.com/aspuru-guzik-group/stoned-selfies)).
* We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea).
* Kohulan Rajan, Achim Zielesny, and Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks; see the code of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator).
* Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling.
* As an improvement on the earlier genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling.
## Tests
`selfies` uses `pytest` with `tox` as its testing framework.
All tests can be found in the `tests/` directory. To run the test suite for
SELFIES, install ``tox`` and run:
```bash
tox -- --trials=10000 --dataset_samples=10000
```
By default, `selfies` is tested against a random subset
(of size ``dataset_samples=10000``) of each of the following datasets:
* 130K molecules from [QM9](https://www.nature.com/articles/sdata201422)
* 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database)
* 50K molecules from a dataset of [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307)
* 160K+ molecules from various [MoleculeNet](http://moleculenet.ai/datasets-1) datasets
* 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html).
Due to its large size, the eMolecules dataset is not included in the repository. To run tests
on it, please download the dataset into the ``tests/test_sets`` directory
and run the ``tests/run_on_large_dataset.py`` script.
## Version History
See [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md).
## Credits
We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin,
HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan),
Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports,
and Robert Pollice for chemistry advice.
## License
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)

@@ -0,1 +1,2 @@

LICENSE
README.md

@@ -20,2 +21,6 @@ setup.py

selfies/utils/selfies_utils.py
selfies/utils/smiles_utils.py
tests/test_on_datasets.py
tests/test_selfies.py
tests/test_selfies_utils.py
tests/test_specific_cases.py

@@ -28,3 +28,3 @@ #!/usr/bin/env python

__version__ = "2.1.0"
__version__ = "2.1.1"

@@ -31,0 +31,0 @@ __all__ = [

@@ -50,5 +50,14 @@ from typing import Dict, List, Tuple, Union

# integer encode
char_list = split_selfies(selfies)
integer_encoded = [vocab_stoi[char] for char in char_list]
integer_encoded = []
for char in split_selfies(selfies):
if (char == ".") and ("." not in vocab_stoi):
raise KeyError(
"The SELFIES string contains two unconnected molecules "
"(given by the '.' character), but vocab_stoi does not "
"contain the '.' key. Please add it to the vocabulary "
"or separate the molecules."
)
integer_encoded.append(vocab_stoi[char])
if enc_type == "label":

@@ -55,0 +64,0 @@ return integer_encoded

@@ -445,37 +445,47 @@ import enum

attribution_maps, attribution_index=0):
curr_atom, curr = mol.get_atom(root), root
token = atom_to_smiles(curr_atom)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, mol.get_attribution(curr_atom)))
stack = [(root, 0, len(mol.get_out_dirbonds(root)), False)]
out_bonds = mol.get_out_dirbonds(curr)
for i, bond in enumerate(out_bonds):
if bond.ring_bond:
token = bond_to_smiles(bond)
while stack:
curr, bond_index, total_bonds, needs_closing = stack[-1]
curr_atom = mol.get_atom(curr)
if bond_index == 0:
token = atom_to_smiles(curr_atom)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, mol.get_attribution(bond)))
ends = (min(bond.src, bond.dst), max(bond.src, bond.dst))
rnum = ring_log.setdefault(ends, len(ring_log) + 1)
if rnum >= 10:
derived.append("%")
derived.append(str(rnum))
token, mol.get_attribution(curr_atom)))
out_bonds = mol.get_out_dirbonds(curr)
if bond_index < total_bonds:
bond = out_bonds[bond_index]
bond_attribution = mol.get_attribution(bond)
stack[-1] = (curr, bond_index + 1, total_bonds, needs_closing)
if bond.ring_bond:
token = bond_to_smiles(bond)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, bond_attribution))
ends = (min(bond.src, bond.dst), max(bond.src, bond.dst))
rnum = ring_log.setdefault(ends, len(ring_log) + 1)
if rnum >= 10:
derived.append("%")
derived.append(str(rnum))
else:
if bond_index < total_bonds - 1:
derived.append("(")
token = bond_to_smiles(bond)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, bond_attribution))
stack.append((bond.dst, 0, len(mol.get_out_dirbonds(bond.dst)), bond_index < total_bonds - 1))
else:
if i < len(out_bonds) - 1:
derived.append("(")
token = bond_to_smiles(bond)
derived.append(token)
attribution_maps.append(AttributionMap(
_strlen(derived) - 1 + attribution_index,
token, mol.get_attribution(bond)))
_derive_smiles_from_fragment(
derived, mol, bond.dst, ring_log,
attribution_maps, attribution_index)
if i < len(out_bonds) - 1:
stack.pop()
if needs_closing:
derived.append(")")
return attribution_maps

@@ -10,3 +10,3 @@ #!/usr/bin/env python

name="selfies",
version="2.1.1",
version="2.1.2",
author="Mario Krenn, Alston Lo, and many other contributors",

@@ -13,0 +13,0 @@ author_email="mario.krenn@utoronto.ca, alan@aspuru.com",