--- a +++ b/docs/source/tutorial.ipynb @@ -0,0 +1,670 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Tutorial" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tM3wFk1e_COd", + "tags": [] + }, + "source": [ + "## The Basics\n", + "We begin by importing `selfies`. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 89 + }, + "colab_type": "code", + "id": "GH0DQxBN_Fei", + "outputId": "56aa043e-df48-4081-f938-49711a166d33" + }, + "outputs": [], + "source": [ + "import selfies as sf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's try translating between SMILES and SELFIES - as an example, we will use benzaldehyde. To translate from SMILES to SELFIES, use the `selfies.encoder` function, and to translate from SMILES back to SELFIES, use the `selfies.decoder` function." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "original_smiles = \"O=Cc1ccccc1\" # benzaldehyde\n", + "\n", + "try:\n", + " \n", + " encoded_selfies = sf.encoder(original_smiles) # SMILES -> SELFIES\n", + " decoded_smiles = sf.decoder(encoded_selfies) # SELFIES -> SMILES\n", + " \n", + "except sf.EncoderError as err: \n", + " pass # sf.encoder error...\n", + "except sf.DecoderError as err: \n", + " pass # sf.decoder error..." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "encoded_selfies" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'O=CC1=CC=CC=C1'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "decoded_smiles" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Ng8PmMiB_RvJ" + }, + "source": [ + "Note that `original_smiles` and `decoded_smiles` are different strings, but they both represent benzaldehyde. Thus, when comparing the two SMILES strings, string equality should _not_ be used. Insead, use RDKit to check whether the SMILES strings represent the same molecule." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "colab_type": "code", + "id": "iAc5FVrP_XV6", + "outputId": "b503f896-a2a0-46a6-fc5b-9c474c01ba62" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from rdkit import Chem\n", + "\n", + "Chem.CanonSmiles(original_smiles) == Chem.CanonSmiles(decoded_smiles)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "IKfNr5m6_h4f", + "tags": [] + }, + "source": [ + "## Customizing SELFIES\n", + "The SELFIES grammar is derived dynamically from a set of semantic constraints, which assign bonding capacities to various atoms. Let's customize the semantic constraints that `selfies` operates on. By default, the following constraints are used:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 200 + }, + "colab_type": "code", + "id": "Xmce7wvV_t4Y", + "outputId": "8b10af2f-486e-4910-8a71-055b59a09746" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'H': 1,\n", + " 'F': 1,\n", + " 'Cl': 1,\n", + " 'Br': 1,\n", + " 'I': 1,\n", + " 'O': 2,\n", + " 'O+1': 3,\n", + " 'O-1': 1,\n", + " 'N': 3,\n", + " 'N+1': 4,\n", + " 'N-1': 2,\n", + " 'C': 4,\n", + " 'C+1': 5,\n", + " 'C-1': 3,\n", + " 'P': 5,\n", + " 'P+1': 6,\n", + " 'P-1': 4,\n", + " 'S': 6,\n", + " 'S+1': 7,\n", + " 'S-1': 5,\n", + " '?': 8}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sf.get_preset_constraints(\"default\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These constraints map atoms (they keys) to their bonding capacities (the values). The special `?` key maps to the bonding capacity for all atoms that are not explicitly listed in the constraints. For example, S and Li are constrained to a maximum of 6 and 8 bonds, respectively. Every SELFIES string can be decoded into a molecule that obeys the current constraints." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'[Li]=CCS=CC#S'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sf.decoder(\"[Li][=C][C][S][=C][C][#S]\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "KevVGyIEAVlu" + }, + "source": [ + "But suppose that we instead wanted to constrain S and Li to a maximum of 2 and 1 bond(s), respectively. To do so, we create a new set of constraints, and tell `selfies` to operate on them using `selfies.set_semantic_constraints`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "y5EbmzkKATkD" + }, + "outputs": [], + "source": [ + "new_constraints = sf.get_preset_constraints(\"default\")\n", + "new_constraints['Li'] = 1\n", + "new_constraints['S'] = 2\n", + "\n", + "sf.set_semantic_constraints(new_constraints)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To check that the update was succesful, we can use `selfies.get_semantic_constraints`, which returns the semantic constraints that `selfies` is currently operating on." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'H': 1,\n", + " 'F': 1,\n", + " 'Cl': 1,\n", + " 'Br': 1,\n", + " 'I': 1,\n", + " 'O': 2,\n", + " 'O+1': 3,\n", + " 'O-1': 1,\n", + " 'N': 3,\n", + " 'N+1': 4,\n", + " 'N-1': 2,\n", + " 'C': 4,\n", + " 'C+1': 5,\n", + " 'C-1': 3,\n", + " 'P': 5,\n", + " 'P+1': 6,\n", + " 'P-1': 4,\n", + " 'S': 2,\n", + " 'S+1': 7,\n", + " 'S-1': 5,\n", + " '?': 8,\n", + " 'Li': 1}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sf.get_semantic_constraints()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "e_djGmr_AvM7" + }, + "source": [ + "Our previous SELFIES string is now decoded like so. Notice that the specified bonding capacities are met, with every S and Li making only 2 and 1 bonds, respectively." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "TCzbjZMAAxpo" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'[Li]CCSCC=S'" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sf.decoder(\"[Li][=C][C][S][=C][C][#S]\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Ng1Lr_e6A3cB" + }, + "source": [ + "Finally, to revert back to the default constraints, simply call: " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "zwC00Rx5A6eQ" + }, + "outputs": [], + "source": [ + "sf.set_semantic_constraints()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " Please refer to the API reference for more details and more preset constraints.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SELFIES in Practice \n", + "\n", + "Let's use a simple example to show how `selfies` can be used in practice, as well as highlight some convenient utility functions from the library. We start with a toy dataset of SMILES strings. As before, we can use `selfies.encoder` to convert the dataset into SELFIES form." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['[C][O][C]',\n", + " '[F][C][F]',\n", + " '[O][=O]',\n", + " '[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]']" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "smiles_dataset = [\"COC\", \"FCF\", \"O=O\", \"O=Cc1ccccc1\"]\n", + "selfies_dataset = list(map(sf.encoder, smiles_dataset))\n", + "\n", + "selfies_dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The function `selfies.len_selfies` computes the symbol length of a SELFIES string. We can use it to find the maximum symbol length of the SELFIES strings in the dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "max_len = max(sf.len_selfies(s) for s in selfies_dataset)\n", + "max_len" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To extract the SELFIES symbols that form the dataset, use `selfies.get_alphabet_from_selfies`. Here, we add `[nop]` to the alphabet, which is a special padding character that `selfies` recognizes." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['[=Branch1]', '[=C]', '[=O]', '[C]', '[F]', '[O]', '[Ring1]', '[nop]']" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "alphabet = sf.get_alphabet_from_selfies(selfies_dataset)\n", + "alphabet.add(\"[nop]\")\n", + "\n", + "alphabet = list(sorted(alphabet))\n", + "alphabet" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, create a mapping between the alphabet SELFIES symbols and indices." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'[=Branch1]': 0,\n", + " '[=C]': 1,\n", + " '[=O]': 2,\n", + " '[C]': 3,\n", + " '[F]': 4,\n", + " '[O]': 5,\n", + " '[Ring1]': 6,\n", + " '[nop]': 7}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vocab_stoi = {symbol: idx for idx, symbol in enumerate(alphabet)}\n", + "vocab_itos = {idx: symbol for symbol, idx in vocab_stoi.items()}\n", + "\n", + "vocab_stoi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "SELFIES provides some convenience methods to convert between SELFIES strings and label (integer) and one-hot encodings. Using the first entry of the dataset (dimethyl ether) as an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "dimethyl_ether = selfies_dataset[0]\n", + "label, one_hot = sf.selfies_to_encoding(dimethyl_ether, vocab_stoi, pad_to_len=max_len)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[3, 5, 3, 7, 7, 7, 7, 7, 7, 7]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "label" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[0, 0, 0, 1, 0, 0, 0, 0],\n", + " [0, 0, 0, 0, 0, 1, 0, 0],\n", + " [0, 0, 0, 1, 0, 0, 0, 0],\n", + " [0, 0, 0, 0, 0, 0, 0, 1],\n", + " [0, 0, 0, 0, 0, 0, 0, 1],\n", + " [0, 0, 0, 0, 0, 0, 0, 1],\n", + " [0, 0, 0, 0, 0, 0, 0, 1],\n", + " [0, 0, 0, 0, 0, 0, 0, 1],\n", + " [0, 0, 0, 0, 0, 0, 0, 1],\n", + " [0, 0, 0, 0, 0, 0, 0, 1]]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "one_hot" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'[C][O][C][nop][nop][nop][nop][nop][nop][nop]'" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dimethyl_ether = sf.encoding_to_selfies(one_hot, vocab_itos, enc_type=\"one_hot\")\n", + "dimethyl_ether" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'COC'" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sf.decoder(dimethyl_ether) # sf.decoder ignores [nop]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If different encoding strategies are desired, `selfies.split_selfies` can be used to tokenize a SELFIES string into its individual symbols." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['[C]', '[O]', '[C]']" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list(sf.split_selfies(\"[C][O][C]\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please refer to the API reference for more details and utility functions." + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "selfies_example.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}