Derivation
==========

This section is an informal tutorial on how molecules are derived
from a SELFIES. The SELFIES grammar has the following non-terminal
symbols, or states:

.. math::

    X_0, \ldots, X_7, Q

Derivation starts with state :math:`X_0`. The SELFIES is read symbol-by-symbol,
with each symbol specifying a grammar rule. SELFIES derivation terminates
when no non-terminal symbols remain or when every symbol has been read.
In each subsection, we describe a type of SELFIES symbol and the grammar
rules associated with it.

Atomic Symbols
##############

19
Atomic symbols are of the general form ``[<B><A>]``, where
20
``<B> in {'', '/', '\\', '=', '#'}`` is a prefix representing a bond,
21
and ``<A>`` is a SMILES symbol representing an atom or ion.
22
If the SMILES symbol is enclosed by square brackets (e.g. ``[13C]``),
23
then the square brackets are dropped and ``expl`` (for "explicit brackets")
24
is appended to obtain ``<A>``. For example:
25
26
.. table::
27
    :align: center
28
29
    +---------+---------------+--------------+----------------+
30
    | ``<B>`` | SMILES symbol | ``<A>``      | SELFIES symbol |
31
    +=========+===============+==============+================+
32
    | ``'='`` | ``N``         | ``N``        | ``[=N]``       |
33
    +---------+---------------+--------------+----------------+
34
    | ``''``  | ``[C@@H]``    | ``C@@Hexpl`` | ``[C@@Hexpl]`` |
35
    +---------+---------------+--------------+----------------+
36
    | ``'/'`` | ``[O+]``      | ``O+expl``   | ``[/O+expl]``  |
37
    +---------+---------------+--------------+----------------+
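The construction of an atomic symbol can be sketched in a few lines of Python. This is only an illustration of the naming rule described above; the helper name ``atomic_symbol`` is hypothetical and is not part of the ``selfies`` API.

```python
def atomic_symbol(bond_prefix: str, smiles_symbol: str) -> str:
    """Build the atomic symbol [<B><A>] from a bond prefix and a SMILES atom symbol.

    Hypothetical helper for illustration; not a function of the selfies package.
    """
    if bond_prefix not in {"", "/", "\\", "=", "#"}:
        raise ValueError(f"unknown bond prefix: {bond_prefix!r}")
    if smiles_symbol.startswith("[") and smiles_symbol.endswith("]"):
        # Bracketed SMILES atom: drop the brackets and append "expl".
        atom = smiles_symbol[1:-1] + "expl"
    else:
        atom = smiles_symbol
    return f"[{bond_prefix}{atom}]"


print(atomic_symbol("=", "N"))      # [=N]
print(atomic_symbol("", "[C@@H]"))  # [C@@Hexpl]
print(atomic_symbol("/", "[O+]"))   # [/O+expl]
```

The three calls reproduce the rows of the table above.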

Let atomic symbol ``[<B><A>]`` be given, where ``<B>`` is a prefix
representing a bond with multiplicity :math:`\beta` and ``<A>`` is an atom
that can make at most :math:`\alpha` bonds. The atomic symbol maps:

.. math::

    X_i \to \begin{cases}
        \texttt{<B'><A>}  & \alpha - \mu = 0 \\
        \texttt{<B'><A>} X_{\alpha - \mu}  & \alpha - \mu \neq 0
    \end{cases}

where ``<B'>`` is a prefix representing a bond with multiplicity
:math:`\mu = \min(\beta, \alpha, i)`, or the empty string if :math:`\mu = 0`.
Note that non-terminal states :math:`X_i` effectively restrict the subsequent
bond to a multiplicity of at most :math:`i`. We provide an example of
the derivation of the SELFIES ``[F][=C][=C][#N]``:

.. math::

    X_0 \to \texttt{F}X_1 \to \texttt{FC}X_3 \to \texttt{FC=C}X_2 \to \texttt{FC=C=N}X_1

No symbols remain after ``[#N]``, so the leftover state :math:`X_1` is
dropped and the derived SMILES is ``FC=C=N``.

**Discussion:** Intuitively, the formal grammar has the following behaviour.
An atomic symbol ``[<B><A>]`` connects atom ``<A>`` to the previously derived
atom through bond type ``<B>``. If creating this bond would violate the bond
constraints of the previous or current atom, the bond multiplicity is reduced
(minimally) such that all bond constraints are fulfilled.

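The atomic-symbol rule above can be sketched as a small Python loop. This is a minimal illustration, not the ``selfies`` implementation: it handles SELFIES made only of atomic symbols, and the function name ``derive_chain`` and the bond-capacity table ``MAX_BONDS`` are assumptions introduced for this example.

```python
# Assumed bond capacities (alpha) for the few atoms used here; illustrative only.
MAX_BONDS = {"C": 4, "N": 3, "O": 2, "F": 1, "13Cexpl": 4}
BOND_ORDER = {"=": 2, "#": 3}            # any other prefix counts as a single bond
BOND_CHAR = {0: "", 1: "", 2: "=", 3: "#"}


def derive_chain(selfies: str) -> str:
    """Derive a SMILES string from a SELFIES that contains only atomic symbols."""
    symbols = selfies.strip("[]").split("][")
    smiles, state = "", 0                # derivation starts in X_0
    for symbol in symbols:
        prefix = symbol[0] if symbol[0] in "=#/\\" else ""
        atom = symbol[len(prefix):]
        beta = BOND_ORDER.get(prefix, 1)
        alpha = MAX_BONDS[atom]
        mu = min(beta, alpha, state)     # bond actually formed
        # Bracketed atoms get their square brackets back in the SMILES output.
        text = f"[{atom[:-4]}]" if atom.endswith("expl") else atom
        # Stereo prefixes are treated as plain single bonds in this sketch.
        smiles += BOND_CHAR[mu] + text
        state = alpha - mu               # next state X_(alpha - mu)
        if state == 0:                   # no non-terminal remains: stop reading
            break
    return smiles


print(derive_chain("[F][=C][=C][#N]"))   # FC=C=N
```

The loop stops either when the state reaches 0 or when the symbols run out, which is why trailing symbols after a fully bonded atom have no effect.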

**Examples:**

.. table::
    :align: center

    +---------+-----------------------------+-----------------+
    | Example | SELFIES                     | SMILES          |
    +=========+=============================+=================+
    | 1       | ``[C][=C][C][#C][13Cexpl]`` | ``C=CC#C[13C]`` |
    +---------+-----------------------------+-----------------+
    | 2       | ``[C][F][C][C][C][C]``      | ``CF``          |
    +---------+-----------------------------+-----------------+
    | 3       | ``[C][O][=C][#O][C][F]``    | ``COC=O``       |
    +---------+-----------------------------+-----------------+

Index Symbols
#############

The state :math:`Q` is used to derive the size of branches and
the location of ring bonds. After a ring or branch symbol, the subsequent
one or more SELFIES symbols are used to derive an integer from :math:`Q`.
The branch or ring symbol itself specifies exactly how many symbols are
used in the derivation (e.g. ``[Ring3]`` indicates that the subsequent
three symbols are used).

First, each subsequent symbol :math:`s_i` is converted to an
index :math:`\text{idx}(s_i)`, according to the following assignment:

.. table::
    :align: center

    +-------+-----------------+-------+-----------------+
    | Index | Symbol          | Index | Symbol          |
    +=======+=================+=======+=================+
    | 0     | ``[C]``         | 8     | ``[Branch2_3]`` |
    +-------+-----------------+-------+-----------------+
    | 1     | ``[Ring1]``     | 9     | ``[O]``         |
    +-------+-----------------+-------+-----------------+
    | 2     | ``[Ring2]``     | 10    | ``[N]``         |
    +-------+-----------------+-------+-----------------+
    | 3     | ``[Branch1_1]`` | 11    | ``[=N]``        |
    +-------+-----------------+-------+-----------------+
    | 4     | ``[Branch1_2]`` | 12    | ``[=C]``        |
    +-------+-----------------+-------+-----------------+
    | 5     | ``[Branch1_3]`` | 13    | ``[#C]``        |
    +-------+-----------------+-------+-----------------+
    | 6     | ``[Branch2_1]`` | 14    | ``[S]``         |
    +-------+-----------------+-------+-----------------+
    | 7     | ``[Branch2_2]`` | 15    | ``[P]``         |
    +-------+-----------------+-------+-----------------+
    | All other symbols are assigned index 0.           |
    +-------+-----------------+-------+-----------------+

Then :math:`Q` is mapped to the integer whose hexadecimal (base 16) digits
are the indices. For example, if three symbols :math:`s_1, s_2, s_3` are
used in the derivation, then :math:`Q` is mapped to:

.. math::

    Q \to (\text{idx}(s_1) \times 16^2) + (\text{idx}(s_2) \times 16) + \text{idx}(s_3)

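The index lookup and base-16 conversion can be sketched as follows. The names ``INDEX_SYMBOLS`` and ``derive_q`` are hypothetical and chosen for this illustration; the symbol order is copied from the table above.

```python
# Symbols in index order 0..15, as listed in the table above.
INDEX_SYMBOLS = [
    "[C]", "[Ring1]", "[Ring2]", "[Branch1_1]", "[Branch1_2]", "[Branch1_3]",
    "[Branch2_1]", "[Branch2_2]", "[Branch2_3]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
]
INDEX = {symbol: i for i, symbol in enumerate(INDEX_SYMBOLS)}


def derive_q(symbols):
    """Read the given symbols as base-16 digits, most significant digit first."""
    q = 0
    for symbol in symbols:
        # Symbols that are not in the table are assigned index 0.
        q = 16 * q + INDEX.get(symbol, 0)
    return q


print(derive_q(["[Ring1]", "[Branch1_2]"]))  # 1 * 16 + 4 = 20
```

Accumulating with ``16 * q + idx`` is equivalent to the explicit sum of powers of 16 and works for any number of index symbols.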

For example, in ``[Ring3][C][Branch1_1][O]`` the three symbols after
``[Ring3]`` have indices 0, 3 and 9, so the derived number is
:math:`(039)_{16} = 57`.

Branch Symbols
##############

Branch symbols are of the general form ``[Branch<L>_<M>]``, where
``<L>, <M> in {1, 2, 3}``. A branch symbol specifies a branch from the
main chain, analogous to the opening and closing parentheses in SMILES.
In SELFIES, a branch is derived by a recursive call to the SELFIES
derivation.

A branch symbol ``[Branch<L>_<M>]`` maps:

.. math::

    X_i \to \begin{cases}
        X_i & i \leq 1 \\
        B(Q, X_{n})X_j & i > 1
    \end{cases}

where :math:`n = \min(i - 1, \texttt{<M>})` is the derivation state of a new branch,
and :math:`j = i - n` is the new derivation state of the main chain. In the :math:`i > 1`
case, the ``<L>`` subsequent symbols are used to derive an integer from the
state :math:`Q`. Then :math:`B(Q, X_{n})` takes the next :math:`Q + 1` symbols,
and recursively derives them with initial derivation state :math:`X_{n}`.
The resulting fragment is taken to be the derived branch, and derivation
proceeds with the next derivation state :math:`X_j`.

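The state bookkeeping of the branch rule can be sketched in Python. Both helpers below are hypothetical names introduced for this illustration: ``branch_states`` computes :math:`n` and :math:`j`, and ``split_after_branch`` slices the symbols that follow a branch symbol, assuming the integer :math:`Q` has already been derived.

```python
def branch_states(i: int, m: int):
    """Return (n, j) for [Branch<L>_<M>] read in state X_i, or None if it is skipped."""
    if i <= 1:
        return None                  # X_i -> X_i: the branch symbol has no effect
    n = min(i - 1, m)                # initial state of the branch
    j = i - n                        # state of the main chain after the branch
    return n, j


def split_after_branch(symbols, l: int, q: int):
    """Split the symbols after a branch symbol into (index, branch, rest).

    The first l symbols encode Q, the next q + 1 symbols form the branch,
    and the remainder continues the main chain.
    """
    index_symbols = symbols[:l]
    branch_symbols = symbols[l:l + q + 1]
    rest = symbols[l + q + 1:]
    return index_symbols, branch_symbols, rest


# [C][Branch1_1][C][F][Cl]: after [C] the state is X_4, and <M> = 1.
print(branch_states(4, 1))                                # (1, 3)
print(split_after_branch(["[C]", "[F]", "[Cl]"], 1, 0))   # (['[C]'], ['[F]'], ['[Cl]'])
```

Here the branch ``[F]`` is derived starting from :math:`X_1`, and ``[Cl]`` continues the main chain from :math:`X_3`.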

**Discussion:** Intuitively, branch symbols are skipped for states
:math:`X_0` and :math:`X_1` because the previous atom can make at most
one bond (a branch requires at least two bonds to be free). It is possible
that a branch is nested at the start of another branch; in SELFIES, both
branches will be connected to the same main chain atom (see Example 5 below).

**Examples:**

+---------+-------------------------------------------------------+---------------+-------------------------+
| Example | SELFIES                                               | :math:`Q + 1` | SMILES                  |
+=========+=======================================================+===============+=========================+
| 1       | ``[C][Branch1_1][C][F][Cl]``                          | 1             | ``C(F)Cl``              |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 2       | ``[C][Branch1_2][Ring2][=C][C][C][Cl]``               | 3             | ``C(=CCC)Cl``           |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 3       | ``[S][Branch1_2][C][=O][Branch1_2][C]``               | 1, 1, 1       | ``S(=O)(=O)([O-])[O-]`` |
|         |                                                       |               |                         |
|         | ``[=O][Branch1_1][C][O-expl][O-expl]``                |               |                         |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 4       | ``[C][Branch2_1][Ring1][Branch1_2][C]``               | 21            | ``C(CC...CC)F``         |
|         |                                                       |               |                         |
|         | ``[C][C][C][C][C][C][C][C][C][C][C][C]``              |               |                         |
|         |                                                       |               |                         |
|         | ``[C][C][C][C][C][C][C][C][F]``                       |               |                         |
|         +-------------------------------------------------------+---------------+-------------------------+
|         | Example 4 has a single branch of 21 carbon atoms.                                               |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 5       | ``[C][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F]`` | 4, 1          | ``C(C)(Cl)F``           |
+---------+-------------------------------------------------------+---------------+-------------------------+

Ring Symbols
############

Ring symbols are of the general form ``[Ring<L>]`` or ``[Expl<B>Ring<L>]``,
where ``<L> in {1, 2, 3}`` and ``<B> in {'/', '\\', '=', '#'}`` is a
prefix representing a bond. A ring symbol specifies a ring bond between two
atoms, analogous to the ring numbering digits in SMILES.

A ring symbol ``[Ring<L>]`` maps:

.. math::

    X_i \to \begin{cases}
        X_i & i = 0 \\
        R(Q)X_i & i \neq 0
    \end{cases}

In the :math:`i \neq 0` case, the ``<L>`` subsequent symbols are used to
derive an integer from the state :math:`Q`. Then :math:`R(Q)` connects the
*current* atom to the :math:`(Q + 1)`-th preceding atom through a
single bond. More specifically, the *current* atom is the most recently
derived atom within the current derivation instance (see Example 5 below).
If the *current* atom is the :math:`m`-th derived atom, then
a bond is made between the :math:`m`-th derived atom and the :math:`n`-th
derived atom, where :math:`n = \max(1, m - (Q + 1))`.

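As a quick illustration, the position of the ring partner follows directly from the formula above. The function name ``ring_target`` is hypothetical; atoms are numbered from 1 in derivation order.

```python
def ring_target(m: int, q: int) -> int:
    """Return n, the 1-based position of the atom that atom m is bonded to."""
    # Going back q + 1 atoms from atom m, but never before the first atom.
    return max(1, m - (q + 1))


print(ring_target(6, 4))   # 1: the sixth atom closes a six-membered ring
print(ring_target(3, 10))  # 1: an oversized Q is clamped to the first atom
```

The clamp to 1 means that a ring symbol whose index points before the start of the molecule still yields a valid bond partner.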

The ring symbol ``[Expl<B>Ring<L>]`` functions like ``[Ring<L>]``,
except that it connects the current atom and the :math:`(Q + 1)`-th
preceding atom through a bond of type ``<B>``.

**Discussion:** In practice, ring bonds are created during a second pass,
after all atoms and branches have been derived. The candidate ring
bonds are temporarily stored in a queue, and then made in
the order that they appear in the SELFIES. A ring bond will be made if
its connected atoms can make the ring bond without violating any
bond constraints. This is the only non-local rule in SELFIES, but it is
efficiently implemented, as this number can be determined by looking
at only one location.

It is also possible that the current atom is already bonded to the
:math:`(Q + 1)`-th preceding atom, e.g. if :math:`Q = 0`. In this case,
the multiplicity of the existing bond is increased by the multiplicity of
the ring bond candidate. Then the multiplicity of the resulting bond is reduced
(minimally) such that no bond constraints are violated, and the multiplicity
is at most 3 (see Examples 3 and 6 below).

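The merging of a ring bond into an existing bond can be sketched as a single ``min``. This is an assumption-laden illustration rather than the library's code: the function name ``merged_bond_order`` and the parameters ``free_a`` and ``free_b`` (how many additional bonds each atom can still make) are introduced here for the example.

```python
def merged_bond_order(existing: int, candidate: int, free_a: int, free_b: int) -> int:
    """Multiplicity of the bond after adding a ring bond to an existing bond.

    existing  -- multiplicity of the bond already joining the two atoms
    candidate -- multiplicity requested by the ring symbol
    free_a/b  -- additional bonds each atom can still make (hypothetical inputs)
    """
    return min(
        existing + candidate,   # the requested increase
        existing + free_a,      # limited by the first atom's bond constraint
        existing + free_b,      # limited by the second atom's bond constraint
        3,                      # a bond is at most a triple bond
    )


# Two bonded carbons (single bond, three free bonds each) plus a double ring bond.
print(merged_bond_order(1, 2, 3, 3))  # 3
```

Taking the minimum reduces the multiplicity only as far as the tightest constraint requires.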

**Examples:**

+---------+------------------------------------------------------------+---------------+------------------+
| Example | SELFIES                                                    | :math:`Q + 1` | SMILES           |
+=========+============================================================+===============+==================+
| 1       | ``[C][=C][C][=C][C][=C][Ring1][Branch1_2]``                | 5             | ``C1=CC=CC=C1``  |
+---------+------------------------------------------------------------+---------------+------------------+
| 2       | ``[C][C][=C][C][=C][C][Expl=Ring1][Branch1_2]``            | 5             | ``C=1C=CC=CC=1`` |
+---------+------------------------------------------------------------+---------------+------------------+
| 3       | ``[C][C][Expl=Ring1][C]``                                  | 1             | ``C#C``          |
+---------+------------------------------------------------------------+---------------+------------------+
| 4       | ``[C][C][C][C][C][C][C][C][C][C][C]``                      | 21            | ``C1CC...CC1``   |
|         |                                                            |               |                  |
|         | ``[C][C][C][C][C][C][C][C][C][C][C]``                      |               |                  |
|         |                                                            |               |                  |
|         | ``[Ring2][Ring1][Branch1_2]``                              |               |                  |
|         +------------------------------------------------------------+---------------+------------------+
|         | Example 4 is a single carbon ring of 22 carbon atoms.                                         |
+---------+------------------------------------------------------------+---------------+------------------+
| 5       | ``[C][C][C][C][Branch1_1][C][C][Ring1][Ring2][C][C]``      | 3             | ``C1CCC1(C)CC``  |
|         +------------------------------------------------------------+---------------+------------------+
|         | Note that the SMILES ``CC1CC(C1)CC`` is not outputted.                                        |
+---------+------------------------------------------------------------+---------------+------------------+
| 6       | ``[C][C][C][C][Expl=Ring1][Ring2][Expl#Ring1][Ring2]``     | 3, 3          | ``C#1CCC#1``     |
+---------+------------------------------------------------------------+---------------+------------------+

Special Symbols
###############

The following symbols have a special meaning in SELFIES:

.. _no operation: https://en.wikipedia.org/wiki/NOP_(code)

+---------------+-------------------------------------------------------------------------------------------------+
| Symbol        | Description                                                                                     |
+===============+=================================================================================================+
| ``[epsilon]`` | The ``[epsilon]`` symbol maps :math:`X_0 \to X_0` and :math:`X_i \to \epsilon` (the empty       |
|               | string) for all :math:`i \geq 1`.                                                               |
+---------------+-------------------------------------------------------------------------------------------------+
| ``[nop]``     | The nop (`no operation`_) symbol is always ignored and skipped over by :func:`selfies.decoder`. |
|               |                                                                                                 |
|               | Thus, it can be used as a padding symbol for SELFIES.                                           |
+---------------+-------------------------------------------------------------------------------------------------+
| ``.``         | The dot symbol is used to indicate disconnected or ionic compounds, similar to how it is        |
|               | used in SMILES.                                                                                 |
+---------------+-------------------------------------------------------------------------------------------------+