selfies
+21
-22
| Metadata-Version: 2.1 | ||
| Name: selfies | ||
| Version: 1.0.1 | ||
| Version: 1.0.2 | ||
| Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs. | ||
@@ -39,5 +39,4 @@ Home-page: https://github.com/aspuru-guzik-group/selfies | ||
| To check if the correct version of ``selfies`` is installed | ||
| (see [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to verify the latest version), use the following pip command: | ||
| To check if the correct version of ``selfies`` is installed, use | ||
| the following pip command. | ||
@@ -49,3 +48,5 @@ ```bash | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command: | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`: | ||
@@ -66,3 +67,3 @@ ```bash | ||
| The ``selfies`` library has six standard functions: | ||
| The ``selfies`` library has eight standard functions: | ||
@@ -77,2 +78,4 @@ | Function | Description | | ||
| | ``selfies.get_semantic_robust_alphabet`` | Returns a subset of all SELFIES symbols that are semantically constrained. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES into a label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES. | | ||
@@ -104,6 +107,6 @@ Please read the documentation for more detailed descriptions of these | ||
| #### Integer encoding SELFIES: | ||
| #### Label (Integer) encoding SELFIES: | ||
| In this example we first build an alphabet | ||
| from a dataset of SELFIES, and then convert a SELFIES into a | ||
| padded, integer-encoded representation. Note that we use the | ||
| padded, label-encoded representation. Note that we use the | ||
| ``'[nop]'`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) | ||
@@ -126,14 +129,10 @@ symbol to pad our SELFIES, which is a special SELFIES symbol that is always | ||
| # SELFIES to integer encode | ||
| # SELFIES to label encode | ||
| dimethyl_ether = dataset[0] # '[C][O][C]' | ||
| # pad the SELFIES | ||
| dimethyl_ether += '[nop]' * (pad_to_len - sf.len_selfies(dimethyl_ether)) | ||
| # integer encode the SELFIES | ||
| int_encoded = [] | ||
| for symbol in sf.split_selfies(dimethyl_ether): | ||
| int_encoded.append(symbol_to_idx[symbol]) | ||
| print(int_encoded) # [1, 3, 1, 4, 4] | ||
| # [1, 3, 1, 4, 4] | ||
| print(sf.selfies_to_encoding(dimethyl_ether, | ||
| vocab_stoi=symbol_to_idx, | ||
| pad_to_len=pad_to_len, | ||
| enc_type='label')) | ||
| ``` | ||
@@ -163,3 +162,3 @@ | ||
| * 130K molecules from [QM9](https://www.nature.com/articles/sdata201422) | ||
| * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database), | ||
| * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database) | ||
| * 50K molecules from [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307) | ||
@@ -189,5 +188,5 @@ * 8K molecules from [Tox21](http://moleculenet.ai/datasets-1) in MoleculeNet | ||
| We thank Kevin Ryan (LeanAndMean@github), Theophile Gaudin, Andrew Brereton, | ||
| Benjamin Sanchez-Lengeling, and Zhenpeng Yao for their suggestions and | ||
| bug reports. | ||
| We thank Jacques Boitreaud, Andrew Brereton, Matthew Carbone (x94carbone), Nathan Frey (ncfrey), Theophile Gaudin, | ||
| Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, | ||
| and Zhenpeng Yao for their suggestions and bug reports, and Robert Pollice for chemistry advice. | ||
@@ -194,0 +193,0 @@ ## License |
+20
-21
@@ -31,5 +31,4 @@ # SELFIES | ||
| To check if the correct version of ``selfies`` is installed | ||
| (see [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to verify the latest version), use the following pip command: | ||
| To check if the correct version of ``selfies`` is installed, use | ||
| the following pip command. | ||
@@ -41,3 +40,5 @@ ```bash | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command: | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`: | ||
@@ -58,3 +59,3 @@ ```bash | ||
| The ``selfies`` library has six standard functions: | ||
| The ``selfies`` library has eight standard functions: | ||
@@ -69,2 +70,4 @@ | Function | Description | | ||
| | ``selfies.get_semantic_robust_alphabet`` | Returns a subset of all SELFIES symbols that are semantically constrained. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES into a label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES. | | ||
@@ -96,6 +99,6 @@ Please read the documentation for more detailed descriptions of these | ||
| #### Integer encoding SELFIES: | ||
| #### Label (Integer) encoding SELFIES: | ||
| In this example we first build an alphabet | ||
| from a dataset of SELFIES, and then convert a SELFIES into a | ||
| padded, integer-encoded representation. Note that we use the | ||
| padded, label-encoded representation. Note that we use the | ||
| ``'[nop]'`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) | ||
@@ -118,14 +121,10 @@ symbol to pad our SELFIES, which is a special SELFIES symbol that is always | ||
| # SELFIES to integer encode | ||
| # SELFIES to label encode | ||
| dimethyl_ether = dataset[0] # '[C][O][C]' | ||
| # pad the SELFIES | ||
| dimethyl_ether += '[nop]' * (pad_to_len - sf.len_selfies(dimethyl_ether)) | ||
| # integer encode the SELFIES | ||
| int_encoded = [] | ||
| for symbol in sf.split_selfies(dimethyl_ether): | ||
| int_encoded.append(symbol_to_idx[symbol]) | ||
| print(int_encoded) # [1, 3, 1, 4, 4] | ||
| # [1, 3, 1, 4, 4] | ||
| print(sf.selfies_to_encoding(dimethyl_ether, | ||
| vocab_stoi=symbol_to_idx, | ||
| pad_to_len=pad_to_len, | ||
| enc_type='label')) | ||
| ``` | ||
@@ -155,3 +154,3 @@ | ||
| * 130K molecules from [QM9](https://www.nature.com/articles/sdata201422) | ||
| * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database), | ||
| * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database) | ||
| * 50K molecules from [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307) | ||
@@ -181,5 +180,5 @@ * 8K molecules from [Tox21](http://moleculenet.ai/datasets-1) in MoleculeNet | ||
| We thank Kevin Ryan (LeanAndMean@github), Theophile Gaudin, Andrew Brereton, | ||
| Benjamin Sanchez-Lengeling, and Zhenpeng Yao for their suggestions and | ||
| bug reports. | ||
| We thank Jacques Boitreaud, Andrew Brereton, Matthew Carbone (x94carbone), Nathan Frey (ncfrey), Theophile Gaudin, | ||
| Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, | ||
| and Zhenpeng Yao for their suggestions and bug reports, and Robert Pollice for chemistry advice. | ||
@@ -186,0 +185,0 @@ ## License |
| Metadata-Version: 2.1 | ||
| Name: selfies | ||
| Version: 1.0.1 | ||
| Version: 1.0.2 | ||
| Summary: SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs. | ||
@@ -39,5 +39,4 @@ Home-page: https://github.com/aspuru-guzik-group/selfies | ||
| To check if the correct version of ``selfies`` is installed | ||
| (see [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to verify the latest version), use the following pip command: | ||
| To check if the correct version of ``selfies`` is installed, use | ||
| the following pip command. | ||
@@ -49,3 +48,5 @@ ```bash | ||
| To upgrade to the latest release of ``selfies`` if you are using an | ||
| older version, use the following pip command: | ||
| older version, use the following pip command. Please see the | ||
| [CHANGELOG](https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md) | ||
| to review the changes between versions of `selfies`: | ||
@@ -66,3 +67,3 @@ ```bash | ||
| The ``selfies`` library has six standard functions: | ||
| The ``selfies`` library has eight standard functions: | ||
@@ -77,2 +78,4 @@ | Function | Description | | ||
| | ``selfies.get_semantic_robust_alphabet`` | Returns a subset of all SELFIES symbols that are semantically constrained. | | ||
| | ``selfies.selfies_to_encoding`` | Converts a SELFIES into a label and/or one-hot encoding. | | ||
| | ``selfies.encoding_to_selfies`` | Converts a label or one-hot encoding into a SELFIES. | | ||
@@ -104,6 +107,6 @@ Please read the documentation for more detailed descriptions of these | ||
| #### Integer encoding SELFIES: | ||
| #### Label (Integer) encoding SELFIES: | ||
| In this example we first build an alphabet | ||
| from a dataset of SELFIES, and then convert a SELFIES into a | ||
| padded, integer-encoded representation. Note that we use the | ||
| padded, label-encoded representation. Note that we use the | ||
| ``'[nop]'`` ([no operation](https://en.wikipedia.org/wiki/NOP_(code) )) | ||
@@ -126,14 +129,10 @@ symbol to pad our SELFIES, which is a special SELFIES symbol that is always | ||
| # SELFIES to integer encode | ||
| # SELFIES to label encode | ||
| dimethyl_ether = dataset[0] # '[C][O][C]' | ||
| # pad the SELFIES | ||
| dimethyl_ether += '[nop]' * (pad_to_len - sf.len_selfies(dimethyl_ether)) | ||
| # integer encode the SELFIES | ||
| int_encoded = [] | ||
| for symbol in sf.split_selfies(dimethyl_ether): | ||
| int_encoded.append(symbol_to_idx[symbol]) | ||
| print(int_encoded) # [1, 3, 1, 4, 4] | ||
| # [1, 3, 1, 4, 4] | ||
| print(sf.selfies_to_encoding(dimethyl_ether, | ||
| vocab_stoi=symbol_to_idx, | ||
| pad_to_len=pad_to_len, | ||
| enc_type='label')) | ||
| ``` | ||
@@ -163,3 +162,3 @@ | ||
| * 130K molecules from [QM9](https://www.nature.com/articles/sdata201422) | ||
| * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database), | ||
| * 250K molecules from [ZINC](https://en.wikipedia.org/wiki/ZINC_database) | ||
| * 50K molecules from [non-fullerene acceptors for organic solar cells](https://www.sciencedirect.com/science/article/pii/S2542435117301307) | ||
@@ -189,5 +188,5 @@ * 8K molecules from [Tox21](http://moleculenet.ai/datasets-1) in MoleculeNet | ||
| We thank Kevin Ryan (LeanAndMean@github), Theophile Gaudin, Andrew Brereton, | ||
| Benjamin Sanchez-Lengeling, and Zhenpeng Yao for their suggestions and | ||
| bug reports. | ||
| We thank Jacques Boitreaud, Andrew Brereton, Matthew Carbone (x94carbone), Nathan Frey (ncfrey), Theophile Gaudin, | ||
| Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, | ||
| and Zhenpeng Yao for their suggestions and bug reports, and Robert Pollice for chemistry advice. | ||
@@ -194,0 +193,0 @@ ## License |
+28
-7
@@ -30,11 +30,32 @@ #!/usr/bin/env python | ||
| __all__ = ['encoder', 'decoder', | ||
| 'get_semantic_robust_alphabet', 'get_semantic_constraints', | ||
| 'set_semantic_constraints', | ||
| 'len_selfies', 'split_selfies', 'get_alphabet_from_selfies'] | ||
| __all__ = [ | ||
| "encoder", | ||
| "decoder", | ||
| "get_semantic_robust_alphabet", | ||
| "get_semantic_constraints", | ||
| "set_semantic_constraints", | ||
| "len_selfies", | ||
| "split_selfies", | ||
| "get_alphabet_from_selfies", | ||
| "selfies_to_encoding", | ||
| "batch_selfies_to_flat_hot", | ||
| "encoding_to_selfies", | ||
| "batch_flat_hot_to_selfies", | ||
| ] | ||
| from .decoder import decoder | ||
| from .encoder import encoder | ||
| from .grammar_rules import get_semantic_robust_alphabet, \ | ||
| get_semantic_constraints, set_semantic_constraints | ||
| from .utils import get_alphabet_from_selfies, len_selfies, split_selfies | ||
| from .grammar_rules import ( | ||
| get_semantic_robust_alphabet, | ||
| get_semantic_constraints, | ||
| set_semantic_constraints, | ||
| ) | ||
| from .utils import ( | ||
| get_alphabet_from_selfies, | ||
| len_selfies, | ||
| split_selfies, | ||
| selfies_to_encoding, | ||
| batch_selfies_to_flat_hot, | ||
| encoding_to_selfies, | ||
| batch_flat_hot_to_selfies, | ||
| ) |
@@ -19,4 +19,4 @@ from collections import OrderedDict | ||
| :param selfies: The SELFIES to be translated. | ||
| :param print_error: If True, error messages will be printed to console. | ||
| :param selfies: the SELFIES to be translated. | ||
| :param print_error: if True, error messages will be printed to console. | ||
| Defaults to False. | ||
@@ -66,2 +66,7 @@ :return: the SMILES translation of ``selfies``. If an error occurs, | ||
| right_idx = selfies.find(']', left_idx + 1) | ||
| if (selfies[left_idx] != '[') or (right_idx == -1): | ||
| raise ValueError("malformed SELFIES, " | ||
| "misplaced or missing brackets") | ||
| next_symbol = selfies[left_idx: right_idx + 1] | ||
@@ -68,0 +73,0 @@ left_idx = right_idx + 1 |
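The bracket check added to the decoder above can be illustrated in isolation. The following is a minimal standalone sketch, not the library's decoder: `iter_symbols_strict` is a hypothetical helper name, and the surrounding loop is an assumption modelled on the `split_selfies` loop shown elsewhere in this diff. It only shows how testing `selfies[left_idx] != '['` together with `right_idx == -1` catches misplaced or missing brackets.

```python
from typing import Iterator


def iter_symbols_strict(selfies: str) -> Iterator[str]:
    """Yield the symbols of a SELFIES string, rejecting malformed brackets.

    Hypothetical helper for illustration; it is not part of selfies.
    """
    left_idx = 0
    while left_idx < len(selfies):
        # A dot separates disconnected fragments and is its own symbol.
        if selfies[left_idx] == '.':
            yield '.'
            left_idx += 1
            continue

        right_idx = selfies.find(']', left_idx + 1)

        # Same condition as the check added in the diff: every symbol must
        # start with '[' and must have a closing ']'.
        if (selfies[left_idx] != '[') or (right_idx == -1):
            raise ValueError("malformed SELFIES, "
                             "misplaced or missing brackets")

        yield selfies[left_idx: right_idx + 1]
        left_idx = right_idx + 1


print(list(iter_symbols_strict('[C][O][C]')))

try:
    list(iter_symbols_strict('[C][O'))
except ValueError as err:
    print("rejected:", err)
```

Without the check, a missing `]` makes `find` return `-1`, so the slice `selfies[left_idx: right_idx + 1]` becomes empty instead of signalling an error.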
@@ -35,4 +35,4 @@ from typing import Dict, Iterable, List, Optional, Tuple | ||
| :param smiles: The SMILES to be translated. | ||
| :param print_error: If True, error messages will be printed to console. | ||
| :param smiles: the SMILES to be translated. | ||
| :param print_error: if True, error messages will be printed to console. | ||
| Defaults to False. | ||
@@ -130,2 +130,5 @@ :return: the SELFIES translation of ``smiles``. If an error occurs, | ||
| if r_idx == -1: | ||
| raise ValueError("malformed SMILES, missing ']'") | ||
| # quick chirality specification check | ||
@@ -132,0 +135,0 @@ chiral_i = symbol.find('@') |
@@ -6,7 +6,7 @@ from itertools import product | ||
| 'H': 1, 'F': 1, 'Cl': 1, 'Br': 1, 'I': 1, | ||
| 'O': 2, | ||
| 'N': 3, | ||
| 'C': 4, | ||
| 'P': 5, | ||
| 'S': 6, | ||
| 'O': 2, 'O+1': 3, 'O-1': 1, | ||
| 'N': 3, 'N+1': 4, 'N-1': 2, | ||
| 'C': 4, 'C+1': 5, 'C-1': 3, | ||
| 'S': 6, 'S+1': 7, 'S-1': 5, | ||
| 'P': 7, 'P+1': 8, 'P-1': 6, | ||
| '?': 8, | ||
@@ -67,3 +67,3 @@ } | ||
| :return: The bond constraints :mod:`selfies` is currently operating on. | ||
| :return: the bond constraints :mod:`selfies` is currently operating on. | ||
| """ | ||
@@ -166,3 +166,3 @@ | ||
| if h_count >= max_bonds: | ||
| if (h_count > max_bonds) or (h_count == max_bonds and state > 0): | ||
| raise ValueError("too many Hs in symbol '{}'; consider " | ||
@@ -169,0 +169,0 @@ "adjusting bond constraints".format(symbol)) |
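The effect of the relaxed hydrogen-count condition above can be seen by comparing the old and new predicates side by side. This is a sketch under an assumption that the diff does not confirm: that `state > 0` means the atom still has to form a bond to a preceding atom, while `state == 0` marks an atom with no preceding bond. The function names are hypothetical and exist only for this comparison.

```python
def too_many_hs_old(h_count: int, max_bonds: int) -> bool:
    """Condition before the change: a fully saturated atom is rejected."""
    return h_count >= max_bonds


def too_many_hs_new(h_count: int, max_bonds: int, state: int) -> bool:
    """Condition after the change, copied from the diff.

    A fully saturated atom is rejected only when state > 0 (assumed here
    to mean that a bond to a preceding atom is still required).
    """
    return (h_count > max_bonds) or (h_count == max_bonds and state > 0)


# A carbon with four explicit hydrogens and a bond capacity of 4,
# appearing with state == 0 (for example as the first atom).
print(too_many_hs_old(4, 4))           # old rule flags it
print(too_many_hs_new(4, 4, state=0))  # new rule lets it through
print(too_many_hs_new(4, 4, state=1))  # still flagged when a bond is needed
```

Under that reading, the change lets a standalone saturated atom through while still rejecting one that has no bond capacity left for its neighbour.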
@@ -130,3 +130,3 @@ from typing import Dict, Iterable, List, Set, Tuple, Union | ||
| 'b': 3, 'c': 4, 'n': 5, 'p': 5, 'as': 5, | ||
| 'o': 6, 's': 6, 'se': 6 | ||
| 'o': 6, 's': 6, 'se': 6, 'te': 6 | ||
| } | ||
@@ -197,3 +197,7 @@ | ||
| # e.g. c1ccccc1 | ||
| if (atom == 'c') and (h_count == charge == 0) and (len(bonds) == 2): | ||
| # this also covers the neutral carbon radical case (e.g. C1=[C]NC=C1), | ||
| # which is treated equivalently to a 1-H carbon (e.g. C1=[CH]NC=C1) | ||
| if (atom == 'c') and (h_count == charge == 0) \ | ||
| and (len(bonds) == 2) and ('#' not in bonds): | ||
| h_count += 1 # implied bonded hydrogen | ||
@@ -200,0 +204,0 @@ |
+213
-13
@@ -1,2 +0,2 @@ | ||
| from typing import Iterable, Set | ||
| from typing import Dict, Iterable, List, Set, Tuple, Union | ||
@@ -10,4 +10,4 @@ | ||
| :param selfies: A SELFIES. | ||
| :return: The symbol length of ``selfies``. | ||
| :param selfies: a SELFIES. | ||
| :return: the symbol length of ``selfies``. | ||
@@ -23,3 +23,3 @@ :Example: | ||
| return selfies.count('[') + selfies.count('.') | ||
| return selfies.count("[") + selfies.count(".") | ||
@@ -35,4 +35,4 @@ | ||
| :param selfies: The SELFIES to be read. | ||
| :return: An iterable of the symbols of ``selfies`` in the same order | ||
| :param selfies: the SELFIES to be read. | ||
| :return: an iterable of the symbols of ``selfies`` in the same order | ||
| they appear in the string. | ||
@@ -49,6 +49,6 @@ | ||
| left_idx = selfies.find('[') | ||
| left_idx = selfies.find("[") | ||
| while 0 <= left_idx < len(selfies): | ||
| right_idx = selfies.find(']', left_idx + 1) | ||
| right_idx = selfies.find("]", left_idx + 1) | ||
| next_symbol = selfies[left_idx: right_idx + 1] | ||
@@ -58,4 +58,4 @@ yield next_symbol | ||
| left_idx = right_idx + 1 | ||
| if selfies[left_idx: left_idx + 1] == '.': | ||
| yield '.' | ||
| if selfies[left_idx: left_idx + 1] == ".": | ||
| yield "." | ||
| left_idx += 1 | ||
@@ -73,4 +73,4 @@ | ||
| :param selfies_iter: An iterable of SELFIES. | ||
| :return: The SElFIES alphabet built from the SELFIES in ``selfies_iter``. | ||
| :param selfies_iter: an iterable of SELFIES. | ||
| :return: the SELFIES alphabet built from the SELFIES in ``selfies_iter``. | ||
@@ -92,4 +92,204 @@ :Example: | ||
| alphabet.discard('.') | ||
| alphabet.discard(".") | ||
| return alphabet | ||
| def selfies_to_encoding( | ||
| selfies: str, | ||
| vocab_stoi: Dict[str, int], | ||
| pad_to_len: int = -1, | ||
| enc_type: str = 'both' | ||
| ) -> Union[List[int], List[List[int]], Tuple[List[int], List[List[int]]]]: | ||
| """Converts a SELFIES into its label (integer) and/or one-hot encoding. | ||
| A label encoded output will be a list of size ``(N,)`` and a | ||
| one-hot encoded output will be a list of size ``(N, len(vocab_stoi))``; | ||
| where ``N`` is the symbol length of the (potentially padded) SELFIES. | ||
| Note that SELFIES uses the special padding symbol ``[nop]``. | ||
| :param selfies: the SELFIES to be encoded. | ||
| :param vocab_stoi: a dictionary that maps SELFIES symbols (the keys) | ||
| to a non-negative index. The indices of the dictionary | ||
| must be contiguous, starting from 0. | ||
| :param pad_to_len: the length the SELFIES is to be padded to. | ||
| If ``pad_to_len`` is less than or equal to the symbol | ||
| length of the SELFIES, then no padding is added. Defaults to ``-1``. | ||
| :param enc_type: the type of encoding of the output: | ||
| ``label`` or ``one_hot`` or ``both``. | ||
| If the value is ``both``, then a tuple of the label and one-hot | ||
| encoding is returned (in that order). Defaults to ``both``. | ||
| :return: the label encoded and/or one-hot encoded SELFIES. | ||
| :Example: | ||
| >>> import selfies as sf | ||
| >>> sf.selfies_to_encoding('[C][F]', {'[C]': 0, '[F]': 1}) | ||
| ([0, 1], [[1, 0], [0, 1]]) | ||
| """ | ||
| # some error checking | ||
| if enc_type not in ('label', 'one_hot', 'both'): | ||
| raise ValueError("enc_type must be in ('label', 'one_hot', 'both')") | ||
| # pad with [nop] | ||
| if pad_to_len > len_selfies(selfies): | ||
| selfies += "[nop]" * (pad_to_len - len_selfies(selfies)) | ||
| # integer encode | ||
| char_list = split_selfies(selfies) | ||
| integer_encoded = [vocab_stoi[char] for char in char_list] | ||
| if enc_type == 'label': | ||
| return integer_encoded | ||
| # one-hot encode | ||
| onehot_encoded = list() | ||
| for index in integer_encoded: | ||
| letter = [0] * len(vocab_stoi) | ||
| letter[index] = 1 | ||
| onehot_encoded.append(letter) | ||
| if enc_type == 'one_hot': | ||
| return onehot_encoded | ||
| return integer_encoded, onehot_encoded | ||
| def encoding_to_selfies( | ||
| encoded: Union[List[int], List[List[int]]], | ||
| vocab_itos: Dict[int, str], | ||
| enc_type: str, | ||
| ) -> str: | ||
| """Converts a label (integer) or one-hot encoded list into | ||
| a SELFIES string. | ||
| If the input is label encoded, then a list of size ``(N,)`` is | ||
| expected; and if the input is one-hot encoded, then a 2D list of | ||
| size ``(N, len(vocab_itos))`` is expected. | ||
| :param encoded: a label or one-hot encoded list. | ||
| :param vocab_itos: a dictionary that maps non-negative indices (the keys) | ||
| to SELFIES symbols. The indices of the dictionary | ||
| must be contiguous, starting from 0. | ||
| :param enc_type: the type of encoding of the input: | ||
| ``label`` or ``one_hot``. | ||
| :return: the SELFIES string represented by the encoded input. | ||
| :Example: | ||
| >>> import selfies as sf | ||
| >>> one_hot = [[0, 1, 0], [0, 0, 1], [1, 0, 0]] | ||
| >>> vocab_itos = {0: '[nop]', 1: '[C]', 2: '[F]'} | ||
| >>> sf.encoding_to_selfies(one_hot, vocab_itos, enc_type='one_hot') | ||
| '[C][F][nop]' | ||
| """ | ||
| if enc_type not in ('label', 'one_hot'): | ||
| raise ValueError("enc_type must be in ('label', 'one_hot')") | ||
| if enc_type == 'one_hot': # Get integer encoding | ||
| integer_encoded = [] | ||
| for row in encoded: | ||
| integer_encoded.append(row.index(1)) | ||
| else: | ||
| integer_encoded = encoded | ||
| # Integer encoding -> SELFIES | ||
| char_list = [vocab_itos[i] for i in integer_encoded] | ||
| selfies = "".join(char_list) | ||
| return selfies | ||
| def batch_selfies_to_flat_hot( | ||
| selfies_batch: List[str], | ||
| vocab_stoi: Dict[str, int], | ||
| pad_to_len: int = -1, | ||
| ) -> List[List[int]]: | ||
| """Converts a list of SELFIES into a list of | ||
| flattened one-hot encodings. | ||
| Returned is a list of size ``(batch_size, N * len(vocab_stoi))``; | ||
| where ``N`` is the symbol length of the (potentially padded) SELFIES. | ||
| Note that SELFIES uses the special padding symbol ``[nop]``. | ||
| :param selfies_batch: a list of SELFIES to be converted. | ||
| :param vocab_stoi: a dictionary that maps SELFIES symbols (the keys) | ||
| to a non-negative index. The indices of the dictionary | ||
| must be contiguous, starting from 0. | ||
| :param pad_to_len: the length that each SELFIES is to be padded to. | ||
| If ``pad_to_len`` is less than or equal to the symbol | ||
| length of the SELFIES, then no padding is added. Defaults to ``-1``. | ||
| :return: the flattened one-hot encoded representations of the SELFIES | ||
| from the batch. This is a 2D list of size | ||
| ``(batch_size, N * len(vocab_stoi))``. | ||
| :Example: | ||
| >>> import selfies as sf | ||
| >>> batch = ["[C]", "[C][C]"] | ||
| >>> vocab_stoi = {'[nop]': 0, '[C]': 1} | ||
| >>> sf.batch_selfies_to_flat_hot(batch, vocab_stoi, 2) | ||
| [[0, 1, 1, 0], [0, 1, 0, 1]] | ||
| """ | ||
| hot_list = list() | ||
| for selfies in selfies_batch: | ||
| one_hot = selfies_to_encoding(selfies, vocab_stoi, pad_to_len, | ||
| enc_type='one_hot') | ||
| flattened = [elem for vec in one_hot for elem in vec] | ||
| hot_list.append(flattened) | ||
| return hot_list | ||
| def batch_flat_hot_to_selfies( | ||
| one_hot_batch: List[List[int]], | ||
| vocab_itos: Dict[int, str], | ||
| ) -> List[str]: | ||
| """Convert a batch of flattened one-hot encodings into | ||
| a list of SELFIES. | ||
| We expect ``one_hot_batch`` to be a list of size ``(batch_size, S)``, | ||
| where ``S`` is divisible by the length of the vocabulary. | ||
| :param one_hot_batch: a list of flattened one-hot encoded representations. | ||
| :param vocab_itos: a dictionary that maps non-negative indices (the keys) | ||
| to SELFIES symbols. We expect the indices of the dictionary | ||
| to be contiguous and starting from 0. | ||
| :return: a list of SELFIES strings. | ||
| :Example: | ||
| >>> import selfies as sf | ||
| >>> batch = [[0, 1, 1, 0], [0, 1, 0, 1]] | ||
| >>> vocab_itos = {0: '[nop]', 1: '[C]'} | ||
| >>> sf.batch_flat_hot_to_selfies(batch, vocab_itos) | ||
| ['[C][nop]', '[C][C]'] | ||
| """ | ||
| selfies_list = [] | ||
| for flat_one_hot in one_hot_batch: | ||
| # Reshape to an N x M array where each column represents an alphabet | ||
| # entry and each row is a position in the selfies | ||
| one_hot = [] | ||
| M = len(vocab_itos) | ||
| if len(flat_one_hot) % M != 0: | ||
| raise ValueError("size of vector in one_hot_batch not divisible " | ||
| "by the length of the vocabulary.") | ||
| N = len(flat_one_hot) // M | ||
| for i in range(N): | ||
| one_hot.append(flat_one_hot[M * i: M * (i + 1)]) | ||
| selfies = encoding_to_selfies(one_hot, vocab_itos, enc_type='one_hot') | ||
| selfies_list.append(selfies) | ||
| return selfies_list |
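The four encoding helpers added above are designed to be inverses of each other. The sketch below walks through one round trip (pad, label encode, one-hot encode, flatten, unflatten, decode) in plain Python, without importing `selfies`. All function names are hypothetical stand-ins that mirror the logic in the diff; the simplified `split_symbols` assumes a well-formed, single-fragment SELFIES with no `.` separators.

```python
from typing import Dict, List


def split_symbols(selfies: str) -> List[str]:
    """Split a bracketed SELFIES into symbols (assumes no '.' and no errors)."""
    return [part + ']' for part in selfies.split(']') if part]


def label_encode(selfies: str, vocab_stoi: Dict[str, int],
                 pad_to_len: int = -1) -> List[int]:
    """Pad with [nop] up to pad_to_len, then map each symbol to its index."""
    symbols = split_symbols(selfies)
    symbols += ['[nop]'] * max(0, pad_to_len - len(symbols))
    return [vocab_stoi[symbol] for symbol in symbols]


def one_hot_encode(labels: List[int], vocab_size: int) -> List[List[int]]:
    """Turn each label into a row with a single 1 at the label's index."""
    return [[1 if i == label else 0 for i in range(vocab_size)]
            for label in labels]


def flatten(rows: List[List[int]]) -> List[int]:
    """Concatenate the one-hot rows into a single flat vector."""
    return [elem for row in rows for elem in row]


def unflatten(flat: List[int], vocab_size: int) -> List[List[int]]:
    """Cut a flat vector back into rows of length vocab_size."""
    if len(flat) % vocab_size != 0:
        raise ValueError("vector length not divisible by vocabulary size")
    return [flat[i: i + vocab_size] for i in range(0, len(flat), vocab_size)]


def decode(rows: List[List[int]], vocab_itos: Dict[int, str]) -> str:
    """Map each one-hot row back to its symbol and join the symbols."""
    return "".join(vocab_itos[row.index(1)] for row in rows)


vocab_stoi = {'[nop]': 0, '[C]': 1, '[O]': 2}
vocab_itos = {idx: symbol for symbol, idx in vocab_stoi.items()}

labels = label_encode('[C][O][C]', vocab_stoi, pad_to_len=5)
flat = flatten(one_hot_encode(labels, len(vocab_stoi)))
restored = decode(unflatten(flat, len(vocab_stoi)), vocab_itos)

print(labels)    # [1, 2, 1, 0, 0]
print(restored)  # [C][O][C][nop][nop]
```

The padding symbols survive the round trip, so the restored string is the padded SELFIES rather than the original input.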
+0
-0
@@ -0,0 +0,0 @@ [egg_info] |
+1
-1
@@ -10,3 +10,3 @@ #!/usr/bin/env python | ||
| name="selfies", | ||
| version="1.0.1", | ||
| version="1.0.2", | ||
| author="Mario Krenn", | ||
@@ -13,0 +13,0 @@ author_email="mario.krenn@utoronto.ca, alan@aspuru.com", |