selfies
+81
-47
| Metadata-Version: 2.1 | ||
| Name: selfies | ||
| Version: 2.0.0 | ||
| Version: 2.1.0 | ||
| Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs. | ||
@@ -25,3 +25,4 @@ Home-page: https://github.com/aspuru-guzik-group/selfies | ||
| [Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\ | ||
| Major contributors since v1.0.0: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ | ||
| Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ | ||
| Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\ | ||
| Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ) | ||
@@ -47,3 +48,3 @@ | ||
| To check if the correct version of ``selfies`` is installed, use | ||
| the following pip command. | ||
| the following pip command. | ||
@@ -54,9 +55,9 @@ ```bash | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`, before upgrading: | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`, before upgrading: | ||
| ```bash | ||
| pip install selfies --upgrade | ||
| pip install selfies --upgrade | ||
| ``` | ||
@@ -70,16 +71,16 @@ | ||
| Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/), | ||
| which contains a thorough tutorial for getting started with ``selfies`` | ||
| which contains a thorough tutorial for getting started with ``selfies`` | ||
| and detailed descriptions of the functions | ||
| that ``selfies`` provides. We summarize some key functions below. | ||
| | Function | Description | | ||
| | -------- | ----------- | | ||
| | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | | ||
| | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | | ||
| | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | | ||
| | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | | ||
| | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | | ||
| | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | | ||
| | Function | Description | | ||
| | ------------------------------------- | ----------------------------------------------------------------- | | ||
| | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | | ||
| | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | | ||
| | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | | ||
| | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | | ||
| | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | | ||
| | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | | ||
@@ -111,24 +112,6 @@ | ||
| #### Customizing SELFIES: | ||
| In this example, we relax the semantic constraints of ``selfies`` to allow | ||
| for hypervalences (caution: hypervalence rules are much less understood | ||
| than octet rules. Some molecules containing hypervalences are important, | ||
| but generally, it is not known which molecules are stable and reasonable). | ||
| ```python | ||
| import selfies as sf | ||
| hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid | ||
| standard_derived_smi = sf.decoder(hypervalent_sf) | ||
| # OI (the default constraints for I allows for only 1 bond) | ||
| sf.set_semantic_constraints("hypervalent") | ||
| relaxed_derived_smi = sf.decoder(hypervalent_sf) | ||
| # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) | ||
| ``` | ||
| #### Integer and one-hot encoding SELFIES: | ||
| In this example, we first build an alphabet from a dataset of SELFIES strings, | ||
| In this example, we first build an alphabet from a dataset of SELFIES strings, | ||
| and then convert a SELFIES string into its padded encoding. Note that we use the | ||
@@ -163,2 +146,52 @@ ``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) | ||
| #### Customizing SELFIES: | ||
| In this example, we relax the semantic constraints of ``selfies`` to allow | ||
| for hypervalences (caution: hypervalence rules are much less understood | ||
| than octet rules. Some molecules containing hypervalences are important, | ||
| but generally, it is not known which molecules are stable and reasonable). | ||
| ```python | ||
| import selfies as sf | ||
| hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid | ||
| standard_derived_smi = sf.decoder(hypervalent_sf) | ||
| # OI (the default constraints for I allows for only 1 bond) | ||
| sf.set_semantic_constraints("hypervalent") | ||
| relaxed_derived_smi = sf.decoder(hypervalent_sf) | ||
| # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) | ||
| ``` | ||
| #### Explaining Translation: | ||
| You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens. | ||
| ```python | ||
| import selfies as sf | ||
| selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]" | ||
| smiles, attr = sf.decoder( | ||
|     selfies, attribute=True) | ||
| print('SELFIES', selfies) | ||
| print('SMILES', smiles) | ||
| print('Attribution:') | ||
| # each entry of attr is an AttributionMap; print it as a whole | ||
| for smiles_token in attr: | ||
|     print(smiles_token) | ||
| # output | ||
| # SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] | ||
| # SMILES C1NC(P)CC1 | ||
| # Attribution: | ||
| # AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')]) | ||
| # AttributionMap(index=4, token='N', attribution=[Attribution(index=1, token='[N]')]) | ||
| # AttributionMap(index=6, token='C', attribution=[Attribution(index=2, token='[C]')]) | ||
| # AttributionMap(index=9, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')]) | ||
| # AttributionMap(index=12, token='C', attribution=[Attribution(index=6, token='[C]')]) | ||
| # AttributionMap(index=14, token='C', attribution=[Attribution(index=7, token='[C]')]) | ||
| ``` | ||
| ``attr`` is a list of ``AttributionMap`` objects, each containing an output token, its index, and the input tokens that led to it. For example, the ``P`` in the output SMILES results from both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control over tracking the translation (such as tracking rings), you can access attributions in the underlying molecular graph with ``get_attribution``. | ||
| ### More Usages and Examples | ||
@@ -173,4 +206,5 @@ | ||
| * We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea). | ||
| * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). | ||
| * An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. | ||
| * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). | ||
| * Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling. | ||
| * An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. | ||
@@ -180,3 +214,3 @@ ## Tests | ||
| All tests can be found in the `tests/` directory. To run the test suite for | ||
| SELFIES, install ``tox`` and run: | ||
| SELFIES, install ``tox`` and run: | ||
@@ -195,5 +229,5 @@ ```bash | ||
| * 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html). | ||
| Due to its large size, this dataset is not included on the repository. To run tests | ||
| on it, please download the dataset into the ``tests/test_sets`` directory | ||
| and run the ``tests/run_on_large_dataset.py`` script. | ||
| Due to its large size, this dataset is not included on the repository. To run tests | ||
| on it, please download the dataset into the ``tests/test_sets`` directory | ||
| and run the ``tests/run_on_large_dataset.py`` script. | ||
@@ -216,10 +250,10 @@ ## Version History | ||
| Classifier: Programming Language :: Python :: 3 | ||
| Classifier: Programming Language :: Python :: 3.5 | ||
| Classifier: Programming Language :: Python :: 3.6 | ||
| Classifier: Programming Language :: Python :: 3.7 | ||
| Classifier: Programming Language :: Python :: 3.8 | ||
| Classifier: Programming Language :: Python :: 3.9 | ||
| Classifier: Programming Language :: Python :: 3.10 | ||
| Classifier: Programming Language :: Python :: 3 :: Only | ||
| Classifier: License :: OSI Approved :: Apache Software License | ||
| Classifier: Operating System :: OS Independent | ||
| Requires-Python: >=3.5 | ||
| Requires-Python: >=3.7 | ||
| Description-Content-Type: text/markdown |
+77
-43
@@ -17,3 +17,4 @@ # SELFIES | ||
| [Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\ | ||
| Major contributors since v1.0.0: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ | ||
| Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ | ||
| Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\ | ||
| Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ) | ||
@@ -39,3 +40,3 @@ | ||
| To check if the correct version of ``selfies`` is installed, use | ||
| the following pip command. | ||
| the following pip command. | ||
@@ -46,9 +47,9 @@ ```bash | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`, before upgrading: | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`, before upgrading: | ||
| ```bash | ||
| pip install selfies --upgrade | ||
| pip install selfies --upgrade | ||
| ``` | ||
@@ -62,16 +63,16 @@ | ||
| Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/), | ||
| which contains a thorough tutorial for getting started with ``selfies`` | ||
| which contains a thorough tutorial for getting started with ``selfies`` | ||
| and detailed descriptions of the functions | ||
| that ``selfies`` provides. We summarize some key functions below. | ||
| | Function | Description | | ||
| | -------- | ----------- | | ||
| | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | | ||
| | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | | ||
| | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | | ||
| | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | | ||
| | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | | ||
| | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | | ||
| | Function | Description | | ||
| | ------------------------------------- | ----------------------------------------------------------------- | | ||
| | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | | ||
| | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | | ||
| | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | | ||
| | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | | ||
| | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | | ||
| | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | | ||
@@ -103,24 +104,6 @@ | ||
| #### Customizing SELFIES: | ||
| In this example, we relax the semantic constraints of ``selfies`` to allow | ||
| for hypervalences (caution: hypervalence rules are much less understood | ||
| than octet rules. Some molecules containing hypervalences are important, | ||
| but generally, it is not known which molecules are stable and reasonable). | ||
| ```python | ||
| import selfies as sf | ||
| hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid | ||
| standard_derived_smi = sf.decoder(hypervalent_sf) | ||
| # OI (the default constraints for I allows for only 1 bond) | ||
| sf.set_semantic_constraints("hypervalent") | ||
| relaxed_derived_smi = sf.decoder(hypervalent_sf) | ||
| # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) | ||
| ``` | ||
| #### Integer and one-hot encoding SELFIES: | ||
| In this example, we first build an alphabet from a dataset of SELFIES strings, | ||
| In this example, we first build an alphabet from a dataset of SELFIES strings, | ||
| and then convert a SELFIES string into its padded encoding. Note that we use the | ||
@@ -155,2 +138,52 @@ ``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) | ||
| #### Customizing SELFIES: | ||
| In this example, we relax the semantic constraints of ``selfies`` to allow | ||
| for hypervalences (caution: hypervalence rules are much less understood | ||
| than octet rules. Some molecules containing hypervalences are important, | ||
| but generally, it is not known which molecules are stable and reasonable). | ||
| ```python | ||
| import selfies as sf | ||
| hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid | ||
| standard_derived_smi = sf.decoder(hypervalent_sf) | ||
| # OI (the default constraints for I allows for only 1 bond) | ||
| sf.set_semantic_constraints("hypervalent") | ||
| relaxed_derived_smi = sf.decoder(hypervalent_sf) | ||
| # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) | ||
| ``` | ||
| #### Explaining Translation: | ||
| You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens. | ||
| ```python | ||
| import selfies as sf | ||
| selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]" | ||
| smiles, attr = sf.decoder( | ||
|     selfies, attribute=True) | ||
| print('SELFIES', selfies) | ||
| print('SMILES', smiles) | ||
| print('Attribution:') | ||
| # each entry of attr is an AttributionMap; print it as a whole | ||
| for smiles_token in attr: | ||
|     print(smiles_token) | ||
| # output | ||
| # SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] | ||
| # SMILES C1NC(P)CC1 | ||
| # Attribution: | ||
| # AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')]) | ||
| # AttributionMap(index=4, token='N', attribution=[Attribution(index=1, token='[N]')]) | ||
| # AttributionMap(index=6, token='C', attribution=[Attribution(index=2, token='[C]')]) | ||
| # AttributionMap(index=9, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')]) | ||
| # AttributionMap(index=12, token='C', attribution=[Attribution(index=6, token='[C]')]) | ||
| # AttributionMap(index=14, token='C', attribution=[Attribution(index=7, token='[C]')]) | ||
| ``` | ||
| ``attr`` is a list of ``AttributionMap`` objects, each containing an output token, its index, and the input tokens that led to it. For example, the ``P`` in the output SMILES results from both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control over tracking the translation (such as tracking rings), you can access attributions in the underlying molecular graph with ``get_attribution``. | ||
| ### More Usages and Examples | ||
@@ -165,4 +198,5 @@ | ||
| * We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea). | ||
| * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). | ||
| * An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. | ||
| * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). | ||
| * Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling. | ||
| * An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. | ||
@@ -172,3 +206,3 @@ ## Tests | ||
| All tests can be found in the `tests/` directory. To run the test suite for | ||
| SELFIES, install ``tox`` and run: | ||
| SELFIES, install ``tox`` and run: | ||
@@ -187,5 +221,5 @@ ```bash | ||
| * 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html). | ||
| Due to its large size, this dataset is not included on the repository. To run tests | ||
| on it, please download the dataset into the ``tests/test_sets`` directory | ||
| and run the ``tests/run_on_large_dataset.py`` script. | ||
| Due to its large size, this dataset is not included on the repository. To run tests | ||
| on it, please download the dataset into the ``tests/test_sets`` directory | ||
| and run the ``tests/run_on_large_dataset.py`` script. | ||
@@ -192,0 +226,0 @@ ## Version History |
| Metadata-Version: 2.1 | ||
| Name: selfies | ||
| Version: 2.0.0 | ||
| Version: 2.1.0 | ||
| Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs. | ||
@@ -25,3 +25,4 @@ Home-page: https://github.com/aspuru-guzik-group/selfies | ||
| [Blog explaining SELFIES in Japanese language](https://blacktanktop.hatenablog.com/entry/2021/08/12/115613)\ | ||
| Major contributors since v1.0.0: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ | ||
| Major contributors of v1.0.n: _[Alston Lo](https://github.com/alstonlo) and [Seyone Chithrananda](https://github.com/seyonechithrananda)_\ | ||
| Main developer of v2.0.0: _[Alston Lo](https://github.com/alstonlo)_\ | ||
| Chemistry Advisor: [Robert Pollice](https://scholar.google.at/citations?user=JR2N3JIAAAAJ) | ||
@@ -47,3 +48,3 @@ | ||
| To check if the correct version of ``selfies`` is installed, use | ||
| the following pip command. | ||
| the following pip command. | ||
@@ -54,9 +55,9 @@ ```bash | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`, before upgrading: | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`, before upgrading: | ||
| ```bash | ||
| pip install selfies --upgrade | ||
| pip install selfies --upgrade | ||
| ``` | ||
@@ -70,16 +71,16 @@ | ||
| Please refer to the [documentation](https://selfiesv2.readthedocs.io/en/latest/), | ||
| which contains a thorough tutorial for getting started with ``selfies`` | ||
| which contains a thorough tutorial for getting started with ``selfies`` | ||
| and detailed descriptions of the functions | ||
| that ``selfies`` provides. We summarize some key functions below. | ||
| | Function | Description | | ||
| | -------- | ----------- | | ||
| | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | | ||
| | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | | ||
| | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | | ||
| | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | | ||
| | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | | ||
| | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | | ||
| | Function | Description | | ||
| | ------------------------------------- | ----------------------------------------------------------------- | | ||
| | ``selfies.encoder`` | Translates a SMILES string into its corresponding SELFIES string. | | ||
| | ``selfies.decoder`` | Translates a SELFIES string into its corresponding SMILES string. | | ||
| | ``selfies.set_semantic_constraints`` | Configures the semantic constraints that ``selfies`` operates on. | | ||
| | ``selfies.len_selfies`` | Returns the number of symbols in a SELFIES string. | | ||
| | ``selfies.split_selfies`` | Tokenizes a SELFIES string into its individual symbols. | | ||
| | ``selfies.get_alphabet_from_selfies`` | Constructs an alphabet from an iterable of SELFIES strings. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES string into its label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES string. | | ||
@@ -111,24 +112,6 @@ | ||
| #### Customizing SELFIES: | ||
| In this example, we relax the semantic constraints of ``selfies`` to allow | ||
| for hypervalences (caution: hypervalence rules are much less understood | ||
| than octet rules. Some molecules containing hypervalences are important, | ||
| but generally, it is not known which molecules are stable and reasonable). | ||
| ```python | ||
| import selfies as sf | ||
| hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid | ||
| standard_derived_smi = sf.decoder(hypervalent_sf) | ||
| # OI (the default constraints for I allows for only 1 bond) | ||
| sf.set_semantic_constraints("hypervalent") | ||
| relaxed_derived_smi = sf.decoder(hypervalent_sf) | ||
| # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) | ||
| ``` | ||
| #### Integer and one-hot encoding SELFIES: | ||
| In this example, we first build an alphabet from a dataset of SELFIES strings, | ||
| In this example, we first build an alphabet from a dataset of SELFIES strings, | ||
| and then convert a SELFIES string into its padded encoding. Note that we use the | ||
@@ -163,2 +146,52 @@ ``[nop]`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) | ||
| #### Customizing SELFIES: | ||
| In this example, we relax the semantic constraints of ``selfies`` to allow | ||
| for hypervalences (caution: hypervalence rules are much less understood | ||
| than octet rules. Some molecules containing hypervalences are important, | ||
| but generally, it is not known which molecules are stable and reasonable). | ||
| ```python | ||
| import selfies as sf | ||
| hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False) # orthoperiodic acid | ||
| standard_derived_smi = sf.decoder(hypervalent_sf) | ||
| # OI (the default constraints for I allows for only 1 bond) | ||
| sf.set_semantic_constraints("hypervalent") | ||
| relaxed_derived_smi = sf.decoder(hypervalent_sf) | ||
| # O=I(O)(O)(O)(O)O (the hypervalent constraints for I allows for 7 bonds) | ||
| ``` | ||
| #### Explaining Translation: | ||
| You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string ``[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]`` are responsible for the output SMILES tokens. | ||
| ```python | ||
| import selfies as sf | ||
| selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]" | ||
| smiles, attr = sf.decoder( | ||
|     selfies, attribute=True) | ||
| print('SELFIES', selfies) | ||
| print('SMILES', smiles) | ||
| print('Attribution:') | ||
| # each entry of attr is an AttributionMap; print it as a whole | ||
| for smiles_token in attr: | ||
|     print(smiles_token) | ||
| # output | ||
| # SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] | ||
| # SMILES C1NC(P)CC1 | ||
| # Attribution: | ||
| # AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')]) | ||
| # AttributionMap(index=4, token='N', attribution=[Attribution(index=1, token='[N]')]) | ||
| # AttributionMap(index=6, token='C', attribution=[Attribution(index=2, token='[C]')]) | ||
| # AttributionMap(index=9, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')]) | ||
| # AttributionMap(index=12, token='C', attribution=[Attribution(index=6, token='[C]')]) | ||
| # AttributionMap(index=14, token='C', attribution=[Attribution(index=7, token='[C]')]) | ||
| ``` | ||
| ``attr`` is a list of ``AttributionMap`` objects, each containing an output token, its index, and the input tokens that led to it. For example, the ``P`` in the output SMILES results from both the ``[Branch1]`` token at index 3 and the ``[P]`` token at index 5. This works for both encoding and decoding. For finer control over tracking the translation (such as tracking rings), you can access attributions in the underlying molecular graph with ``get_attribution``. | ||
| ### More Usages and Examples | ||
@@ -173,4 +206,5 @@ | ||
| * We use SELFIES for [Deep Molecular dreaming](https://arxiv.org/abs/2012.09712), a new generative model inspired by interpretable neural networks in computational vision. See the [code of PASITHEA here](https://github.com/aspuru-guzik-group/Pasithea). | ||
| * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). | ||
| * An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. | ||
| * Kohulan Rajan, Achim Zielesny, Christoph Steinbeck show in two papers that SELFIES outperforms other representations in [img2string](https://link.springer.com/article/10.1186/s13321-020-00469-w) and [string2string](https://chemrxiv.org/articles/preprint/STOUT_SMILES_to_IUPAC_Names_Using_Neural_Machine_Translation/13469202/1) translation tasks, see the codes of [DECIMER](https://github.com/Kohulan/DECIMER-Image-to-SMILES) and [STOUT](https://github.com/Kohulan/Smiles-TO-iUpac-Translator). | ||
| * Nathan Frey, Vijay Gadepally, and Bharath Ramsundar used SELFIES with normalizing flows to develop the [FastFlows](https://arxiv.org/abs/2201.12419) framework for deep chemical generative modeling. | ||
| * An improvement to the old genetic algorithm, the authors have also released [JANUS](https://arxiv.org/abs/2106.04011), which allows for more efficient optimization in the chemical space. JANUS makes use of [STONED-SELFIES](https://pubs.rsc.org/en/content/articlepdf/2021/sc/d1sc00231g) and a neural network for efficient sampling. | ||
@@ -180,3 +214,3 @@ ## Tests | ||
| All tests can be found in the `tests/` directory. To run the test suite for | ||
| SELFIES, install ``tox`` and run: | ||
| SELFIES, install ``tox`` and run: | ||
@@ -195,5 +229,5 @@ ```bash | ||
| * 36M+ molecules from the [eMolecules Database](https://www.emolecules.com/info/products-data-downloads.html). | ||
| Due to its large size, this dataset is not included on the repository. To run tests | ||
| on it, please download the dataset into the ``tests/test_sets`` directory | ||
| and run the ``tests/run_on_large_dataset.py`` script. | ||
| Due to its large size, this dataset is not included on the repository. To run tests | ||
| on it, please download the dataset into the ``tests/test_sets`` directory | ||
| and run the ``tests/run_on_large_dataset.py`` script. | ||
@@ -216,10 +250,10 @@ ## Version History | ||
| Classifier: Programming Language :: Python :: 3 | ||
| Classifier: Programming Language :: Python :: 3.5 | ||
| Classifier: Programming Language :: Python :: 3.6 | ||
| Classifier: Programming Language :: Python :: 3.7 | ||
| Classifier: Programming Language :: Python :: 3.8 | ||
| Classifier: Programming Language :: Python :: 3.9 | ||
| Classifier: Programming Language :: Python :: 3.10 | ||
| Classifier: Programming Language :: Python :: 3 :: Only | ||
| Classifier: License :: OSI Approved :: Apache Software License | ||
| Classifier: Operating System :: OS Independent | ||
| Requires-Python: >=3.5 | ||
| Requires-Python: >=3.7 | ||
| Description-Content-Type: text/markdown |
@@ -18,5 +18,4 @@ README.md | ||
| selfies/utils/encoding_utils.py | ||
| selfies/utils/linked_list.py | ||
| selfies/utils/matching_utils.py | ||
| selfies/utils/selfies_utils.py | ||
| selfies/utils/smiles_utils.py |
+42
-14
| import warnings | ||
| from typing import List, Union, Tuple | ||
@@ -14,3 +15,3 @@ from selfies.compatibility import modernize_symbol | ||
| ) | ||
| from selfies.mol_graph import MolecularGraph | ||
| from selfies.mol_graph import MolecularGraph, Attribution | ||
| from selfies.utils.selfies_utils import split_selfies | ||
@@ -20,3 +21,7 @@ from selfies.utils.smiles_utils import mol_to_smiles | ||
| def decoder(selfies: str, compatible: bool = False) -> str: | ||
| def decoder( | ||
| selfies: str, | ||
| compatible: bool = False, | ||
| attribute: bool = False) ->\ | ||
| Union[str, Tuple[str, List[Tuple[str, List[Tuple[int, str]]]]]]: | ||
| """Translates a SELFIES string into its corresponding SMILES string. | ||
@@ -35,2 +40,4 @@ | ||
| Defaults to ``False``. | ||
| :param attribute: if ``True``, an attribution map connecting selfies | ||
| tokens to smiles tokens is output. | ||
| :return: a SMILES string derived from the input SELFIES string. | ||
@@ -51,8 +58,9 @@ :raises DecoderError: if the input SELFIES string is malformed. | ||
| mol = MolecularGraph() | ||
| mol = MolecularGraph(attributable=attribute) | ||
| rings = [] | ||
| attribution_index = 0 | ||
| for s in selfies.split("."): | ||
| _derive_mol_from_symbols( | ||
| symbol_iter=_tokenize_selfies(s, compatible), | ||
| n = _derive_mol_from_symbols( | ||
| symbol_iter=enumerate(_tokenize_selfies(s, compatible)), | ||
| mol=mol, | ||
@@ -63,6 +71,9 @@ selfies=selfies, | ||
| root_atom=None, | ||
| rings=rings | ||
| rings=rings, | ||
| attribute_stack=[] if attribute else None, | ||
| attribution_index=attribution_index | ||
| ) | ||
| attribution_index += n | ||
| _form_rings_bilocally(mol, rings) | ||
| return mol_to_smiles(mol) | ||
| return mol_to_smiles(mol, attribute) | ||
@@ -91,3 +102,3 @@ | ||
| symbol_iter, mol, selfies, max_derive, | ||
| init_state, root_atom, rings | ||
| init_state, root_atom, rings, attribute_stack, attribution_index | ||
| ): | ||
@@ -101,3 +112,3 @@ n_derived = 0 | ||
| try: # retrieve next symbol | ||
| symbol = next(symbol_iter) | ||
| index, symbol = next(symbol_iter) | ||
| n_derived += 1 | ||
@@ -123,3 +134,7 @@ except StopIteration: | ||
| symbol_iter, mol, selfies, (Q + 1), | ||
| init_state=binit_state, root_atom=prev_atom, rings=rings | ||
| init_state=binit_state, root_atom=prev_atom, rings=rings, | ||
| attribute_stack=attribute_stack + | ||
| [Attribution(index + attribution_index, symbol) | ||
| ] if attribute_stack is not None else None, | ||
| attribution_index=attribution_index | ||
| ) | ||
@@ -162,7 +177,20 @@ | ||
| if state == 0: | ||
| mol.add_atom(atom, True) | ||
| o = mol.add_atom(atom, True) | ||
| mol.add_attribution( | ||
| o, attribute_stack + | ||
| [Attribution(index + attribution_index, symbol)] | ||
| if attribute_stack is not None else None) | ||
| else: | ||
| mol.add_atom(atom) | ||
| o = mol.add_atom(atom) | ||
| mol.add_attribution( | ||
| o, attribute_stack + | ||
| [Attribution(index + attribution_index, symbol)] | ||
| if attribute_stack is not None else None) | ||
| src, dst = prev_atom.index, atom.index | ||
| mol.add_bond(src=src, dst=dst, order=bond_order, stereo=stereo) | ||
| o = mol.add_bond(src=src, dst=dst, | ||
| order=bond_order, stereo=stereo) | ||
| mol.add_attribution( | ||
| o, attribute_stack + | ||
| [Attribution(index + attribution_index, symbol)] | ||
| if attribute_stack is not None else None) | ||
| prev_atom = atom | ||
@@ -195,3 +223,3 @@ | ||
| try: | ||
| index_symbols.append(next(symbol_iter)) | ||
| index_symbols.append(next(symbol_iter)[-1]) | ||
| except StopIteration: | ||
@@ -198,0 +226,0 @@ index_symbols.append(None) |
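The decoder hunk above enumerates the tokens of each `.`-separated fragment and adds a running `attribution_index` so that token indices stay global across fragments. A minimal sketch of that offset bookkeeping follows, assuming a simplified bracket-matching tokenizer (`simple_tokenize` is a stand-in, not the library's `_tokenize_selfies`), and `global_token_indices` is a hypothetical name.

```python
import re
from typing import Iterator, List, Tuple


def simple_tokenize(fragment: str) -> Iterator[str]:
    """Stand-in tokenizer (assumption): yields bracketed symbols like [C]."""
    return iter(re.findall(r"\[[^\]]*\]", fragment))


def global_token_indices(selfies: str) -> List[Tuple[int, str]]:
    """Pair every token with an index that keeps counting across fragments."""
    out = []
    offset = 0  # plays the role of attribution_index in the diff
    for fragment in selfies.split("."):
        n = 0
        for local_index, symbol in enumerate(simple_tokenize(fragment)):
            out.append((local_index + offset, symbol))
            n += 1
        offset += n  # the next fragment continues where this one stopped
    return out


print(global_token_indices("[C][O].[N]"))
```

In this sketch the `.` separator itself does not consume an index, which matches how the hunk advances `attribution_index` only by the number of derived symbols.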
+40
-10
| from selfies.exceptions import EncoderError, SMILESParserError | ||
| from selfies.grammar_rules import get_selfies_from_index | ||
| from selfies.utils.linked_list import SinglyLinkedList | ||
| from selfies.utils.smiles_utils import ( | ||
@@ -10,4 +9,6 @@ atom_to_smiles, | ||
| from selfies.mol_graph import AttributionMap | ||
| def encoder(smiles: str, strict: bool = True) -> str: | ||
| def encoder(smiles: str, strict: bool = True, attribute: bool = False) -> str: | ||
| """Translates a SMILES string into its corresponding SELFIES string. | ||
@@ -37,3 +38,6 @@ | ||
| Defaults to ``True``. | ||
| :return: a SELFIES string translated from the input SMILES string. | ||
| :param attribute: if an attribution should be returned | ||
| :return: a SELFIES string translated from the input SMILES string if | ||
| attribute is ``False``, otherwise a tuple is returned of | ||
| SELFIES string and attribution list. | ||
| :raises EncoderError: if the input SMILES string is invalid, | ||
@@ -63,3 +67,3 @@ cannot be kekulized, or violates the semantic constraints with | ||
| try: | ||
| mol = smiles_to_mol(smiles) | ||
| mol = smiles_to_mol(smiles, attributable=attribute) | ||
| except SMILESParserError as err: | ||
@@ -85,6 +89,13 @@ err_msg = "failed to parse input\n\tSMILES: {}".format(smiles) | ||
| fragments = [] | ||
| attribution_maps = [] | ||
| attribution_index = 0 | ||
| for root in mol.get_roots(): | ||
| derived = list(_fragment_to_selfies(mol, None, root)) | ||
| derived = list(_fragment_to_selfies( | ||
| mol, None, root, attribution_maps, attribution_index)) | ||
| attribution_index += len(derived) | ||
| fragments.append("".join(derived)) | ||
| return ".".join(fragments) | ||
| # trim attribution map of empty tokens | ||
| attribution_maps = [a for a in attribution_maps if a.token] | ||
| result = ".".join(fragments), attribution_maps | ||
| return result if attribute else result[0] | ||
@@ -137,4 +148,5 @@ | ||
| def _fragment_to_selfies(mol, bond_into_root, root): | ||
| derived = SinglyLinkedList() | ||
| def _fragment_to_selfies(mol, bond_into_root, root, | ||
| attribution_maps, attribution_index=0): | ||
| derived = [] | ||
@@ -144,4 +156,9 @@ bond_into_curr, curr = bond_into_root, root | ||
| curr_atom = mol.get_atom(curr) | ||
| derived.append(_atom_to_selfies(bond_into_curr, curr_atom)) | ||
| token = _atom_to_selfies(bond_into_curr, curr_atom) | ||
| derived.append(token) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| token, mol.get_attribution(curr_atom))) | ||
| out_bonds = mol.get_out_dirbonds(curr) | ||
@@ -163,4 +180,10 @@ for i, bond in enumerate(out_bonds): | ||
| derived.append(ring_symbol) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| ring_symbol, mol.get_attribution(bond))) | ||
| for symbol in Q_as_symbols: | ||
| derived.append(symbol) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| symbol, mol.get_attribution(bond))) | ||
@@ -171,3 +194,4 @@ elif i == len(out_bonds) - 1: | ||
| else: | ||
| branch = _fragment_to_selfies(mol, bond, bond.dst) | ||
| branch = _fragment_to_selfies( | ||
| mol, bond, bond.dst, attribution_maps, len(derived)) | ||
| Q_as_symbols = get_selfies_from_index(len(branch) - 1) | ||
@@ -180,4 +204,10 @@ branch_symbol = "[{}Branch{}]".format( | ||
| derived.append(branch_symbol) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| branch_symbol, mol.get_attribution(bond))) | ||
| for symbol in Q_as_symbols: | ||
| derived.append(symbol) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| symbol, mol.get_attribution(bond))) | ||
| derived.extend(branch) | ||
@@ -184,0 +214,0 @@ |
+50
-3
| import functools | ||
| import itertools | ||
| from typing import List, Optional, Union | ||
| from dataclasses import dataclass, field | ||
@@ -10,2 +11,25 @@ from selfies.bond_constraints import get_bonding_capacity | ||
| @dataclass | ||
| class Attribution: | ||
| """A dataclass that contains token string and its index. | ||
| """ | ||
| #: token index | ||
| index: int | ||
| #: token string | ||
| token: str | ||
| @dataclass | ||
| class AttributionMap: | ||
| """A mapping from input to single output token showing which | ||
| input tokens created the output token. | ||
| """ | ||
| #: Index of output token | ||
| index: int | ||
| #: Output token | ||
| token: str | ||
| #: List of input tokens that created the output token | ||
| attribution: List[Attribution] = field(default_factory=list) | ||
| class Atom: | ||
@@ -74,3 +98,3 @@ """An atom with associated specifications (e.g. charge, chirality). | ||
| def __init__(self): | ||
| def __init__(self, attributable=False): | ||
| self._roots = list() # stores root atoms, where traversal begins | ||
@@ -83,2 +107,4 @@ self._atoms = list() # stores atoms in this graph | ||
| self._delocal_subgraph = dict() # delocalization subgraph | ||
| self._attribution = dict() # attribution of each atom/bond | ||
| self._attributable = attributable | ||
@@ -96,2 +122,10 @@ def __len__(self): | ||
| def get_attribution( | ||
| self, | ||
| o: Union[DirectedBond, Atom] | ||
| ) -> List[Attribution]: | ||
| if self._attributable and o in self._attribution: | ||
| return self._attribution[o] | ||
| return None | ||
| def get_roots(self) -> List[int]: | ||
@@ -115,3 +149,3 @@ return self._roots | ||
| def add_atom(self, atom: Atom, mark_root: bool = False) -> None: | ||
| def add_atom(self, atom: Atom, mark_root: bool = False) -> Atom: | ||
| atom.index = len(self) | ||
@@ -127,7 +161,19 @@ | ||
| self._delocal_subgraph[atom.index] = list() | ||
| return atom | ||
| def add_attribution( | ||
| self, | ||
| o: Union[DirectedBond, Atom], | ||
| attr: List[Attribution] | ||
| ) -> None: | ||
| if self._attributable: | ||
| if o in self._attribution: | ||
| self._attribution[o].extend(attr) | ||
| else: | ||
| self._attribution[o] = attr | ||
| def add_bond( | ||
| self, src: int, dst: int, | ||
| order: Union[int, float], stereo: str | ||
| ) -> None: | ||
| ) -> DirectedBond: | ||
| assert src < dst | ||
@@ -143,2 +189,3 @@ | ||
| self._delocal_subgraph.setdefault(dst, []).append(src) | ||
| return bond | ||
@@ -145,0 +192,0 @@ def add_placeholder_bond(self, src: int) -> int: |
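The `add_attribution` and `get_attribution` methods added to `MolecularGraph` above can be exercised in isolation. The sketch below copies their logic into a small stand-in class (`AttributionStore` is a hypothetical name, and plain strings and tuples stand in for atoms, bonds, and `Attribution` objects) to show the two behaviours visible in the diff: repeated additions for the same object accumulate, and a graph built without `attributable=True` records nothing.

```python
from typing import Dict, Hashable, Optional


class AttributionStore:
    """Stand-in for the attribution bookkeeping on MolecularGraph."""

    def __init__(self, attributable: bool = False):
        self._attribution: Dict[Hashable, list] = {}
        self._attributable = attributable

    def add_attribution(self, o, attr) -> None:
        if self._attributable:
            if o in self._attribution:
                self._attribution[o].extend(attr)
            else:
                # As in the diff, the caller's list is stored by reference.
                self._attribution[o] = attr

    def get_attribution(self, o) -> Optional[list]:
        if self._attributable and o in self._attribution:
            return self._attribution[o]
        return None


on = AttributionStore(attributable=True)
on.add_attribution("atom0", [(0, "[C]")])
on.add_attribution("atom0", [(1, "[Branch1]")])
print(on.get_attribution("atom0"))

off = AttributionStore(attributable=False)
off.add_attribution("atom0", [(0, "[C]")])
print(off.get_attribution("atom0"))
```

Because the first list is stored by reference, a caller that reuses that list afterwards would see later `extend` calls reflected in it; passing a fresh list per call, as the decoder hunk does, avoids that.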
| import enum | ||
| import re | ||
| from collections import deque | ||
| from typing import Iterator, Optional, Tuple, Union | ||
| from typing import Iterator, Optional, Tuple, Union, List | ||
| from selfies.constants import AROMATIC_SUBSET, ELEMENTS, ORGANIC_SUBSET | ||
| from selfies.exceptions import SMILESParserError | ||
| from selfies.mol_graph import Atom, DirectedBond, MolecularGraph | ||
| from selfies.mol_graph import Atom, Attribution, \ | ||
| AttributionMap, DirectedBond, MolecularGraph | ||
@@ -40,3 +41,4 @@ SMILES_BRACKETED_ATOM_PATTERN = re.compile( | ||
| bond_idx: Optional[int], | ||
| start_idx: int, end_idx: int, token_type: SMILESTokenTypes | ||
| start_idx: int, end_idx: int, token_type: SMILESTokenTypes, | ||
| token: str | ||
| ): | ||
@@ -47,2 +49,3 @@ self.bond_idx = bond_idx | ||
| self.token_type = token_type | ||
| self.token = token | ||
@@ -55,3 +58,6 @@ def extract_bond_char(self, smiles): | ||
| def __str__(self): | ||
| return self.token | ||
| def tokenize_smiles(smiles: str) -> Iterator[SMILESToken]: | ||
@@ -68,3 +74,3 @@ """Splits a SMILES string into its tokens. | ||
| if smiles[i] == ".": | ||
| yield SMILESToken(None, i, i + 1, SMILESTokenTypes.DOT) | ||
| yield SMILESToken(None, i, i + 1, SMILESTokenTypes.DOT, smiles[i]) | ||
| i += 1 | ||
@@ -84,5 +90,7 @@ continue | ||
| if smiles[i: i + 2] in ("Br", "Cl"): # two-letter elements | ||
| token = SMILESToken(bond_idx, i, i + 2, SMILESTokenTypes.ATOM) | ||
| token = SMILESToken(bond_idx, i, i + 2, | ||
| SMILESTokenTypes.ATOM, smiles[i: i + 2]) | ||
| else: # one-letter elements (e.g. C, N, ...) | ||
| token = SMILESToken(bond_idx, i, i + 1, SMILESTokenTypes.ATOM) | ||
| token = SMILESToken(bond_idx, i, i + 1, | ||
| SMILESTokenTypes.ATOM, smiles[i:i + 1]) | ||
@@ -93,3 +101,4 @@ elif smiles[i] == "[": # atoms encased in brackets (e.g. [NH]) | ||
| raise SMILESParserError(smiles, "hanging bracket [", i) | ||
| token = SMILESToken(bond_idx, i, r_idx + 1, SMILESTokenTypes.ATOM) | ||
| token = SMILESToken(bond_idx, i, r_idx + 1, | ||
| SMILESTokenTypes.ATOM, smiles[i:r_idx + 1]) | ||
@@ -99,6 +108,8 @@ elif smiles[i] in ("(", ")"): # open and closed branch brackets | ||
| raise SMILESParserError(smiles, "hanging_bond", bond_idx) | ||
| token = SMILESToken(None, i, i + 1, SMILESTokenTypes.BRANCH) | ||
| token = SMILESToken( | ||
| None, i, i + 1, SMILESTokenTypes.BRANCH, smiles[i:i+1]) | ||
| elif smiles[i].isdigit(): # one-digit ring number | ||
| token = SMILESToken(bond_idx, i, i + 1, SMILESTokenTypes.RING) | ||
| token = SMILESToken(bond_idx, i, i + 1, | ||
| SMILESTokenTypes.RING, smiles[i:i+1]) | ||
@@ -110,3 +121,4 @@ elif smiles[i] == "%": # two-digit ring number (e.g. %12) | ||
| raise SMILESParserError(smiles, err_msg, i) | ||
| token = SMILESToken(bond_idx, i, i + 3, SMILESTokenTypes.RING) | ||
| token = SMILESToken(bond_idx, i, i + 3, | ||
| SMILESTokenTypes.RING, smiles[i:i+3]) | ||
@@ -197,6 +209,7 @@ else: | ||
| def smiles_to_mol(smiles: str) -> MolecularGraph: | ||
| def smiles_to_mol(smiles: str, attributable: bool) -> MolecularGraph: | ||
| """Reads a molecular graph from a SMILES string. | ||
| :param smiles: the input SMILES string. | ||
| :param attributable: if molecular graph needs to include attributions | ||
| :return: a molecular graph that the input SMILES string represents. | ||
@@ -209,10 +222,11 @@ :raises SMILESParserError: if the input SMILES is invalid. | ||
| mol = MolecularGraph() | ||
| mol = MolecularGraph(attributable=attributable) | ||
| tokens = deque(tokenize_smiles(smiles)) | ||
| i = 0 | ||
| while tokens: | ||
| _derive_mol_from_tokens(mol, smiles, tokens) | ||
| i = _derive_mol_from_tokens(mol, smiles, tokens, i) | ||
| return mol | ||
| def _derive_mol_from_tokens(mol, smiles, tokens): | ||
| def _derive_mol_from_tokens(mol, smiles, tokens, i): | ||
| tok = None | ||
@@ -240,3 +254,3 @@ prev_stack = deque() # keep track of previous atom on the current chain | ||
| curr = _attach_atom(mol, bond_char, curr, prev_atom) | ||
| curr, i = _attach_atom(mol, bond_char, curr, prev_atom, i, tok) | ||
| prev_stack.pop() | ||
@@ -277,2 +291,3 @@ prev_stack.append(curr) | ||
| raise Exception("invalid symbol type") | ||
| i += 1 | ||
@@ -291,8 +306,11 @@ if len(mol) == 0: | ||
| raise SMILESParserError(smiles, err_msg, tok.start_idx) | ||
| return i | ||
| def _attach_atom(mol, bond_char, atom, prev_atom): | ||
| def _attach_atom(mol, bond_char, atom, prev_atom, i, tok): | ||
| is_root = (prev_atom is None) | ||
| mol.add_atom(atom, mark_root=is_root) | ||
| if bond_char: | ||
| i += 1 | ||
| o = mol.add_atom(atom, mark_root=is_root) | ||
| mol.add_attribution(o, [Attribution(i, str(tok))]) | ||
| if not is_root: | ||
@@ -303,4 +321,5 @@ src, dst = prev_atom.index, atom.index | ||
| order = 1.5 # handle implicit aromatic bonds, e.g. cc | ||
| mol.add_bond(src=src, dst=dst, order=order, stereo=stereo) | ||
| return atom | ||
| o = mol.add_bond(src=src, dst=dst, order=order, stereo=stereo) | ||
| mol.add_attribution(o, [Attribution(i, str(tok))]) | ||
| return atom, i | ||
@@ -399,3 +418,6 @@ | ||
| def mol_to_smiles(mol: MolecularGraph) -> str: | ||
| def mol_to_smiles( | ||
| mol: MolecularGraph, | ||
| attribute: bool = False | ||
| ) -> Union[str, Tuple[str, List[Tuple[str, List[Tuple[int, str]]]]]]: | ||
| """Converts a molecular graph into its SMILES representation, maintaining | ||
@@ -405,3 +427,6 @@ the traversal order indicated by the input graph. | ||
| :param mol: the input molecule. | ||
| :return: a SMILES string representing the input molecule. | ||
| :param attribute: if an attribution should be returned | ||
| :return: a SMILES string representing the input molecule if | ||
| attribute is ``False``, otherwise a tuple is returned of | ||
| SMILES string and attribution list. | ||
| """ | ||
@@ -411,13 +436,29 @@ assert mol.is_kekulized() | ||
| fragments = [] | ||
| attribution_maps = [] | ||
| attribution_index = 0 | ||
| ring_log = dict() | ||
| for root in mol.get_roots(): | ||
| derived = [] | ||
| _derive_smiles_from_fragment(derived, mol, root, ring_log) | ||
| _derive_smiles_from_fragment( | ||
| derived, mol, root, ring_log, attribution_maps, attribution_index) | ||
| attribution_index += len(derived) | ||
| fragments.append("".join(derived)) | ||
| return ".".join(fragments) | ||
| # trim attribution map of empty tokens | ||
| attribution_maps = [a for a in attribution_maps if a.token] | ||
| result = ".".join(fragments), attribution_maps | ||
| return result if attribute else result[0] | ||
| def _derive_smiles_from_fragment(derived, mol, root, ring_log): | ||
| def _derive_smiles_from_fragment( | ||
| derived, | ||
| mol, | ||
| root, | ||
| ring_log, | ||
| attribution_maps, attribution_index=0): | ||
| curr_atom, curr = mol.get_atom(root), root | ||
| derived.append(atom_to_smiles(curr_atom)) | ||
| token = atom_to_smiles(curr_atom) | ||
| derived.append(token) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| token, mol.get_attribution(curr_atom))) | ||
@@ -427,3 +468,7 @@ out_bonds = mol.get_out_dirbonds(curr) | ||
| if bond.ring_bond: | ||
| derived.append(bond_to_smiles(bond)) | ||
| token = bond_to_smiles(bond) | ||
| derived.append(token) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| token, mol.get_attribution(bond))) | ||
| ends = (min(bond.src, bond.dst), max(bond.src, bond.dst)) | ||
@@ -439,6 +484,12 @@ rnum = ring_log.setdefault(ends, len(ring_log) + 1) | ||
| derived.append(bond_to_smiles(bond)) | ||
| _derive_smiles_from_fragment(derived, mol, bond.dst, ring_log) | ||
| token = bond_to_smiles(bond) | ||
| derived.append(token) | ||
| attribution_maps.append(AttributionMap( | ||
| len(derived) - 1 + attribution_index, | ||
| token, mol.get_attribution(bond))) | ||
| _derive_smiles_from_fragment( | ||
| derived, mol, bond.dst, ring_log, | ||
| attribution_maps, attribution_index) | ||
| if i < len(out_bonds) - 1: | ||
| derived.append(")") | ||
| return attribution_maps |
+4
-4
@@ -10,3 +10,3 @@ #!/usr/bin/env python | ||
| name="selfies", | ||
| version="2.0.0", | ||
| version="2.1.0", | ||
| author="Mario Krenn, Alston Lo, and many other contributors", | ||
@@ -23,6 +23,6 @@ author_email="mario.krenn@utoronto.ca, alan@aspuru.com", | ||
| "Programming Language :: Python :: 3", | ||
| "Programming Language :: Python :: 3.5", | ||
| "Programming Language :: Python :: 3.6", | ||
| "Programming Language :: Python :: 3.7", | ||
| "Programming Language :: Python :: 3.8", | ||
| "Programming Language :: Python :: 3.9", | ||
| "Programming Language :: Python :: 3.10", | ||
| "Programming Language :: Python :: 3 :: Only", | ||
@@ -32,3 +32,3 @@ "License :: OSI Approved :: Apache Software License", | ||
| ], | ||
| python_requires='>=3.5' | ||
| python_requires='>=3.7' | ||
| ) |
| from typing import Any | ||
| class SinglyLinkedList: | ||
| """A simple singly linked list that supports O(1) append and O(1) extend. | ||
| """ | ||
| def __init__(self): | ||
| self._head = None | ||
| self._tail = None | ||
| self._count = 0 | ||
| def __len__(self): | ||
| return self._count | ||
| def __iter__(self): | ||
| return SinglyLinkedListIterator(self) | ||
| @property | ||
| def head(self): | ||
| return self._head | ||
| def append(self, item: Any) -> None: | ||
| node = [item, None] | ||
| if self._head is None: | ||
| self._head = node | ||
| self._tail = node | ||
| else: | ||
| self._tail[1] = node | ||
| self._tail = node | ||
| self._count += 1 | ||
| def extend(self, other) -> None: | ||
| assert isinstance(other, SinglyLinkedList) | ||
| if other._head is None: | ||
| return | ||
| if self._head is None: | ||
| self._head = other._head | ||
| self._tail = other._tail | ||
| else: | ||
| self._tail[1] = other._head | ||
| self._tail = other._tail | ||
| self._count += len(other) | ||
| class SinglyLinkedListIterator: | ||
| def __init__(self, linked_list): | ||
| self._curr = linked_list.head | ||
| def __iter__(self): | ||
| return self | ||
| def __next__(self): | ||
| if self._curr is None: | ||
| raise StopIteration | ||
| else: | ||
| item = self._curr[0] | ||
| self._curr = self._curr[1] | ||
| return item |