Derivation
============

This section is an informal tutorial on how a molecule is derived
from a SELFIES string. The SELFIES grammar has the following non-terminal
symbols, or states:

.. math::

    X_0, \ldots, X_7, Q

Derivation starts with state :math:`X_0`. The SELFIES string is read symbol
by symbol, with each symbol specifying a grammar rule. Derivation terminates
when no non-terminal symbols remain. In each subsection, we describe a type of
SELFIES symbol and the grammar rules associated with it.
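
Reading a SELFIES symbol by symbol presupposes that the string has been split
into its bracketed symbols. A minimal sketch of that step is given here. The
helper name ``split_symbols`` is an assumption made for illustration and is
not part of the ``selfies`` API; it treats every ``[...]`` group as one symbol
and the dot ``.`` as a symbol of its own.

```python
import re

# Hypothetical helper (not part of the selfies library): split a SELFIES
# string into its symbols. Each "[...]" group is one symbol, and the dot
# separator "." is treated as a symbol of its own.
_SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def split_symbols(selfies_string):
    """Return the list of symbols in reading order."""
    symbols = _SYMBOL.findall(selfies_string)
    # Reject input containing anything the pattern above does not cover.
    if "".join(symbols) != selfies_string:
        raise ValueError("malformed SELFIES: " + repr(selfies_string))
    return symbols


print(split_symbols("[F][=C][=C][#N]"))
```

The later rules all operate on such a list, consuming one or more symbols per
grammar step.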

Atomic Symbols
##############

Atomic symbols are of the general form ``[<B><A>]``, where
``<B> in {'', '/', '\\', '=', '#'}`` is a prefix representing a bond,
and ``<A>`` is a SMILES symbol representing an atom or ion.
If the SMILES symbol is enclosed by square brackets (e.g. ``[13C]``),
then the square brackets are dropped and ``expl`` (for "explicit brackets")
is appended to obtain ``<A>``. For example:

.. table::
    :align: center

    +---------+---------------+--------------+----------------+
    | ``<B>`` | SMILES symbol | ``<A>``      | SELFIES symbol |
    +=========+===============+==============+================+
    | ``'='`` | ``N``         | ``N``        | ``[=N]``       |
    +---------+---------------+--------------+----------------+
    | ``''``  | ``[C@@H]``    | ``C@@Hexpl`` | ``[C@@Hexpl]`` |
    +---------+---------------+--------------+----------------+
    | ``'/'`` | ``[O+]``      | ``O+expl``   | ``[/O+expl]``  |
    +---------+---------------+--------------+----------------+

Let atomic symbol ``[<B><A>]`` be given, where ``<B>`` is a prefix
representing a bond with multiplicity :math:`\beta` and ``<A>`` is an atom
that can make at most :math:`\alpha` bonds. The atomic symbol maps:

.. math::

    X_i \to \begin{cases}
        \texttt{<B'><A>}                  & \alpha - \mu = 0 \\
        \texttt{<B'><A>} X_{\alpha - \mu} & \alpha - \mu \neq 0
    \end{cases}

where ``<B'>`` is a prefix representing a bond with multiplicity
:math:`\mu = \min(\beta, \alpha, i)`, or the empty string if :math:`\mu = 0`.
Note that non-terminal states :math:`X_i` effectively restrict the subsequent
bond to a multiplicity of at most :math:`i`. We provide an example of
the derivation of the SELFIES ``[F][=C][=C][#N]``:

.. math::

    X_0 \to \texttt{F}X_1 \to \texttt{FC}X_3 \to \texttt{FC=C}X_2 \to \texttt{FC=C=N}

**Discussion:** Intuitively, the formal grammar has the following behaviour.
An atomic symbol ``[<B><A>]`` connects atom ``<A>`` to the previously derived
atom through bond type ``<B>``. If creating this bond would violate the bond
constraints of the previous or current atom, the bond multiplicity is reduced
(minimally) such that all bond constraints are fulfilled.

**Examples:**

.. table::
    :align: center

    +---------+-----------------------------+-----------------+
    | Example | SELFIES                     | SMILES          |
    +=========+=============================+=================+
    | 1       | ``[C][=C][C][#C][13Cexpl]`` | ``C=CC#C[13C]`` |
    +---------+-----------------------------+-----------------+
    | 2       | ``[C][F][C][C][C][C]``      | ``CF``          |
    +---------+-----------------------------+-----------------+
    | 3       | ``[C][O][=C][#O][C][F]``    | ``COC=O``       |
    +---------+-----------------------------+-----------------+
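
The atomic-symbol rule can be traced with a few lines of Python. The sketch
below is illustrative only: the function ``derive_atoms`` and the small lookup
tables are assumptions made for this example (they cover just the atoms used
here and ignore the stereo prefixes ``/`` and ``\``), and they are not the
constraints or the code used by ``selfies``. If the symbols run out before the
state reaches zero, the sketch returns the remaining state next to the SMILES.

```python
# Illustrative sketch only: tiny lookup tables, assumed for this example.
MAX_BONDS = {"C": 4, "N": 3, "O": 2, "F": 1, "13Cexpl": 4}   # alpha
MULTIPLICITY = {"": 1, "=": 2, "#": 3}                       # beta
BOND_PREFIX = {0: "", 1: "", 2: "=", 3: "#"}                 # <B'> for mu


def to_smiles_atom(atom):
    """Undo the 'expl' convention: '13Cexpl' -> '[13C]'."""
    return "[" + atom[:-4] + "]" if atom.endswith("expl") else atom


def derive_atoms(symbols):
    """Apply the atomic-symbol rule to a list such as ['[C]', '[=C]']."""
    smiles, state = "", 0              # derivation starts in X_0
    for symbol in symbols:
        body = symbol[1:-1]            # strip the enclosing brackets
        prefix = body[0] if body[0] in "=#" else ""
        atom = body[len(prefix):]
        alpha, beta = MAX_BONDS[atom], MULTIPLICITY[prefix]
        mu = min(beta, alpha, state)   # multiplicity actually used
        smiles += BOND_PREFIX[mu] + to_smiles_atom(atom)
        state = alpha - mu             # next state X_{alpha - mu}
        if state == 0:                 # no non-terminal remains: stop
            break
    return smiles, state


print(derive_atoms(["[C]", "[O]", "[=C]", "[#O]", "[C]", "[F]"]))
```

Under these assumptions, Example 3 stops at ``[#O]``: the requested triple bond
is reduced to a double bond, oxygen has no bonds left, and the trailing
``[C][F]`` is never read.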

Index Symbols
#############

The state :math:`Q` is used to derive the size of branches and
the location of ring bonds. After a ring or branch symbol, the subsequent
one or more SELFIES symbols are used to derive an integer from :math:`Q`.
Note that the specific branch or ring symbol itself specifies exactly
how many symbols are used in the derivation (e.g. ``[Ring3]`` indicates
that the subsequent three symbols are used).

First, each subsequent symbol :math:`s_i` is converted to an
index :math:`\text{idx}(s_i)`, according to the following assignment:

.. table::
    :align: center

    +-------+-----------------+-------+-----------------+
    | Index | Symbol          | Index | Symbol          |
    +=======+=================+=======+=================+
    | 0     | ``[C]``         | 8     | ``[Branch2_3]`` |
    +-------+-----------------+-------+-----------------+
    | 1     | ``[Ring1]``     | 9     | ``[O]``         |
    +-------+-----------------+-------+-----------------+
    | 2     | ``[Ring2]``     | 10    | ``[N]``         |
    +-------+-----------------+-------+-----------------+
    | 3     | ``[Branch1_1]`` | 11    | ``[=N]``        |
    +-------+-----------------+-------+-----------------+
    | 4     | ``[Branch1_2]`` | 12    | ``[=C]``        |
    +-------+-----------------+-------+-----------------+
    | 5     | ``[Branch1_3]`` | 13    | ``[#C]``        |
    +-------+-----------------+-------+-----------------+
    | 6     | ``[Branch2_1]`` | 14    | ``[S]``         |
    +-------+-----------------+-------+-----------------+
    | 7     | ``[Branch2_2]`` | 15    | ``[P]``         |
    +-------+-----------------+-------+-----------------+
    | All other symbols are assigned index 0.           |
    +-------+-----------------+-------+-----------------+

Then :math:`Q` is mapped to the hexadecimal (base 16) integer specified
by the indices. For example, if three symbols :math:`s_1, s_2, s_3` are
used in the derivation, then :math:`Q` is mapped to:

.. math::

    Q \to (\text{idx}(s_1) \times 16^2) + (\text{idx}(s_2) \times 16) + \text{idx}(s_3)

For instance, ``[Ring3][C][Branch1_1][O]`` derives the number :math:`(039)_{16} = 57`.
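
The index arithmetic is a plain base-16 conversion, which can be sketched as
follows. The dictionary reproduces the assignment table above; the names
``INDEX`` and ``derive_q`` are assumptions made for this example.

```python
# Index assignment copied from the table above; every other symbol maps to 0.
INDEX = {
    "[C]": 0, "[Ring1]": 1, "[Ring2]": 2,
    "[Branch1_1]": 3, "[Branch1_2]": 4, "[Branch1_3]": 5,
    "[Branch2_1]": 6, "[Branch2_2]": 7, "[Branch2_3]": 8,
    "[O]": 9, "[N]": 10, "[=N]": 11, "[=C]": 12, "[#C]": 13,
    "[S]": 14, "[P]": 15,
}


def derive_q(index_symbols):
    """Read the symbols as base-16 digits, most significant digit first."""
    q = 0
    for symbol in index_symbols:
        q = 16 * q + INDEX.get(symbol, 0)
    return q


# The three symbols that follow [Ring3] in the example above.
print(derive_q(["[C]", "[Branch1_1]", "[O]"]))  # 0*256 + 3*16 + 9 = 57
```

The same helper works for any number of index symbols, so it covers the one-,
two- and three-symbol cases selected by ``<L>``.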

Branch Symbols
##############

Branch symbols are of the general form ``[Branch<L>_<M>]``, where
``<L>, <M> in {1, 2, 3}``. A branch symbol specifies a branch from the
main chain, analogous to the opening and closing parentheses in SMILES.
In SELFIES, a branch is derived by a recursive call to the SELFIES
derivation.

A branch symbol ``[Branch<L>_<M>]`` maps:

.. math::

    X_i \to \begin{cases}
        X_i            & i \leq 1 \\
        B(Q, X_{n})X_j & i > 1
    \end{cases}

where :math:`n = \min(i - 1, \texttt{<M>})` is the derivation state of a new branch,
and :math:`j = i - n` is the new derivation state of the main chain. In the :math:`i > 1`
case, the ``<L>`` subsequent symbols are used to derive an integer from the
state :math:`Q`. Then :math:`B(Q, X_{n})` takes the next :math:`Q + 1` symbols,
and recursively derives them with initial derivation state :math:`X_{n}`.
The resulting fragment is taken to be the derived branch, and derivation
proceeds with the next derivation state :math:`X_j`.

**Discussion:** Intuitively, branch symbols are skipped for states
:math:`X_0` and :math:`X_1` because the previous atom can make at most one
more bond (a branch requires at least two bonds to be free). It is possible
that a branch is nested at the start of another branch; in SELFIES, both
branches will be connected to the same main chain atom (see Example 5 below).

**Examples:**

+---------+-------------------------------------------------------+---------------+-------------------------+
| Example | SELFIES                                               | :math:`Q + 1` | SMILES                  |
+=========+=======================================================+===============+=========================+
| 1       | ``[C][Branch1_1][C][F][Cl]``                          | 1             | ``C(F)Cl``              |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 2       | ``[C][Branch1_2][Ring2][=C][C][C][Cl]``               | 3             | ``C(=CCC)Cl``           |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 3       | ``[S][Branch1_2][C][=O][Branch1_2][C]``               | 1, 1, 1       | ``S(=O)(=O)([O-])[O-]`` |
|         |                                                       |               |                         |
|         | ``[=O][Branch1_1][C][O-expl][O-expl]``                |               |                         |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 4       | ``[C][Branch2_1][Ring1][Branch1_2][C]``               | 21            | ``C(CC...CC)F``         |
|         |                                                       |               |                         |
|         | ``[C][C][C][C][C][C][C][C][C][C][C][C]``              |               |                         |
|         |                                                       |               |                         |
|         | ``[C][C][C][C][C][C][C][C][F]``                       |               |                         |
|         +-------------------------------------------------------+---------------+-------------------------+
|         | Example 4 has a single branch of 21 carbon atoms.                                               |
+---------+-------------------------------------------------------+---------------+-------------------------+
| 5       | ``[C][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F]`` | 4, 1          | ``C(C)(Cl)F``           |
+---------+-------------------------------------------------------+---------------+-------------------------+
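
The recursion behind branch symbols can be sketched in Python. Everything in
this block is an assumption made for illustration: the function names, the
small atom table, and the restriction to atomic and branch symbols (no rings,
stereo prefixes or special symbols). It is a reading of the rules in this
section, not the ``selfies`` implementation.

```python
# Illustrative sketch only: tiny tables, assumed for this example.
MAX_BONDS = {"C": 4, "O": 2, "F": 1, "Cl": 1, "S": 6, "O-expl": 1}
MULTIPLICITY = {"": 1, "=": 2, "#": 3}
BOND_PREFIX = {0: "", 1: "", 2: "=", 3: "#"}
INDEX = {"[Ring1]": 1, "[Ring2]": 2, "[Branch1_1]": 3, "[Branch1_2]": 4}


def to_smiles_atom(atom):
    """Undo the 'expl' convention: 'O-expl' -> '[O-]'."""
    return "[" + atom[:-4] + "]" if atom.endswith("expl") else atom


def attach(fragment):
    """Wrap a branch in parentheses. Branches nested at its very start are
    emitted as siblings, since they bond to the same main chain atom."""
    out = ""
    while fragment.startswith("("):
        depth = 0
        for end, char in enumerate(fragment):
            depth += (char == "(") - (char == ")")
            if depth == 0:
                break
        out += fragment[:end + 1]
        fragment = fragment[end + 1:]
    return out + ("(" + fragment + ")" if fragment else "")


def derive(symbols, state=0):
    """Derive a SMILES fragment from `symbols`, starting in X_state."""
    smiles, pos = "", 0
    while pos < len(symbols):
        body = symbols[pos][1:-1]
        pos += 1
        if body.startswith("Branch"):
            if state <= 1:                       # X_i -> X_i: skipped
                continue
            n_index, m = int(body[6]), int(body[8])   # <L> and <M>
            n = min(state - 1, m)                # initial state of branch
            q = 0
            for symbol in symbols[pos:pos + n_index]:
                q = 16 * q + INDEX.get(symbol, 0)
            pos += n_index
            smiles += attach(derive(symbols[pos:pos + q + 1], n))
            pos += q + 1
            state -= n                           # j = i - n
            continue
        prefix = body[0] if body[0] in "=#" else ""
        atom = body[len(prefix):]
        mu = min(MULTIPLICITY[prefix], MAX_BONDS[atom], state)
        smiles += BOND_PREFIX[mu] + to_smiles_atom(atom)
        state = MAX_BONDS[atom] - mu
        if state == 0:                           # no non-terminal remains
            break
    return smiles


print(derive(["[C]", "[Branch1_2]", "[Ring2]", "[=C]", "[C]", "[C]", "[Cl]"]))
```

The ``attach`` helper exists only to reproduce the behaviour of Example 5,
where a branch nested at the start of another branch is written as a sibling
branch on the same atom.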

Ring Symbols
############

Ring symbols are of the general form ``[Ring<L>]`` or ``[Expl<B>Ring<L>]``,
where ``<L> in {1, 2, 3}`` and ``<B> in {'/', '\\', '=', '#'}`` is a
prefix representing a bond. A ring symbol specifies a ring bond between two
atoms, analogous to the ring-closure digits in SMILES.

A ring symbol ``[Ring<L>]`` maps:

.. math::

    X_i \to \begin{cases}
        X_i     & i = 0 \\
        R(Q)X_i & i \neq 0
    \end{cases}

In the :math:`i \neq 0` case, the ``<L>`` subsequent symbols are used to
derive an integer from the state :math:`Q`. Then :math:`R(Q)` connects the
*current* atom to the :math:`(Q + 1)`-th preceding atom through a
single bond. More specifically, the *current* atom is the most recently
derived atom within the current derivation instance (see Example 5 below).
If the *current* atom is the :math:`m`-th derived atom, then
a bond is made between the :math:`m`-th derived atom and the :math:`n`-th
derived atom, where :math:`n = \max(1, m - (Q + 1))`.

The ring symbol ``[Expl<B>Ring<L>]`` functions like
``[Ring<L>]``, except that it connects the current and :math:`(Q + 1)`-th
preceding atom through a bond of type ``<B>``.

**Discussion:** In practice, ring bonds are created during a second pass,
after all atoms and branches have been derived. The candidate ring
bonds are temporarily stored in a queue, and then made in
the order that they appear in the SELFIES. A ring bond is made only if
its connected atoms can make the ring bond without violating any
bond constraints. This is the only non-local rule in SELFIES, but it can be
implemented efficiently, because this number can be determined by looking
at only one location.

It is also possible that the current atom is already bonded to the
:math:`(Q + 1)`-th preceding atom, e.g. if :math:`Q = 0`. In this case,
the multiplicity of the existing bond is increased by the multiplicity of
the ring bond candidate. Then the multiplicity of the resulting bond is reduced
(minimally) such that no bond constraints are violated, and the multiplicity
is at most 3 (see Examples 3 and 6 below).

**Examples:**

+---------+------------------------------------------------------------+---------------+------------------+
| Example | SELFIES                                                    | :math:`Q + 1` | SMILES           |
+=========+============================================================+===============+==================+
| 1       | ``[C][=C][C][=C][C][=C][Ring1][Branch1_2]``                | 5             | ``C1=CC=CC=C1``  |
+---------+------------------------------------------------------------+---------------+------------------+
| 2       | ``[C][C][=C][C][=C][C][Expl=Ring1][Branch1_2]``            | 5             | ``C=1C=CC=CC=1`` |
+---------+------------------------------------------------------------+---------------+------------------+
| 3       | ``[C][C][Expl=Ring1][C]``                                  | 1             | ``C#C``          |
+---------+------------------------------------------------------------+---------------+------------------+
| 4       | ``[C][C][C][C][C][C][C][C][C][C][C]``                      | 21            | ``C1CC...CC1``   |
|         |                                                            |               |                  |
|         | ``[C][C][C][C][C][C][C][C][C][C][C]``                      |               |                  |
|         |                                                            |               |                  |
|         | ``[Ring2][Ring1][Branch1_2]``                              |               |                  |
|         +------------------------------------------------------------+---------------+------------------+
|         | Example 4 is a single carbon ring of 22 carbon atoms.                                         |
+---------+------------------------------------------------------------+---------------+------------------+
| 5       | ``[C][C][C][C][Branch1_1][C][C][Ring1][Ring2][C][C]``      | 3             | ``C1CCC1(C)CC``  |
|         +------------------------------------------------------------+---------------+------------------+
|         | Note that the SMILES ``CC1CC(C1)CC`` is not produced.                                         |
+---------+------------------------------------------------------------+---------------+------------------+
| 6       | ``[C][C][C][C][Expl=Ring1][Ring2][Expl#Ring1][Ring2]``     | 3, 3          | ``C#1CCC#1``     |
+---------+------------------------------------------------------------+---------------+------------------+
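
The second pass described in the discussion can be sketched as follows. The
data layout (a list of free valences, a dictionary of bonds, and a queue of
candidate tuples) and both function names are assumptions made for this
example. Skipping a candidate whose two endpoints coincide is likewise an
assumption, since the text does not cover that case.

```python
# Illustrative sketch only; the data layout is assumed for this example.

def ring_target(m, q):
    """1-based index n of the atom that the m-th derived atom bonds to."""
    return max(1, m - (q + 1))


def make_ring_bonds(free, bonds, candidates):
    """Second pass: apply queued ring bonds in order of appearance.

    free       -- free[k] is how many more bonds atom k+1 can make
    bonds      -- {(n, m): multiplicity} for existing bonds, with n < m
    candidates -- queue of (m, q, multiplicity) tuples
    """
    for m, q, order in candidates:
        n = ring_target(m, q)
        if n == m:
            continue                   # assumed: no bond from an atom to itself
        limit = min(free[n - 1], free[m - 1])
        if (n, m) in bonds:
            # Already bonded: add multiplicities, reduced so that the bond
            # is at most triple and no bond constraint is violated.
            added = min(order, 3 - bonds[(n, m)], limit)
        else:
            # New ring bond: made only if both atoms can afford it.
            added = order if order <= limit else 0
        if added > 0:
            bonds[(n, m)] = bonds.get((n, m), 0) + added
            free[n - 1] -= added
            free[m - 1] -= added
    return bonds


# Example 6: four chained carbons, then [Expl=Ring1][Ring2][Expl#Ring1][Ring2].
free = [3, 2, 2, 3]
bonds = {(1, 2): 1, (2, 3): 1, (3, 4): 1}
print(make_ring_bonds(free, bonds, [(4, 2, 2), (4, 2, 3)]))
```

In this run the first candidate creates a double bond between atoms 1 and 4,
and the second raises it to a triple bond, which matches ``C#1CCC#1``.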

Special Symbols
###############

The following symbols have a special meaning in SELFIES:

.. _no operation: https://en.wikipedia.org/wiki/NOP_(code)

+---------------+-------------------------------------------------------------------------------------------------+
| Symbol        | Description                                                                                     |
+===============+=================================================================================================+
| ``[epsilon]`` | The ``[epsilon]`` symbol maps :math:`X_0 \to X_0` and :math:`X_i \to \epsilon` (the empty       |
|               | string) for all :math:`i \geq 1`.                                                               |
+---------------+-------------------------------------------------------------------------------------------------+
| ``[nop]``     | The nop (`no operation`_) symbol is always ignored and skipped over by :func:`selfies.decoder`. |
|               |                                                                                                 |
|               | Thus, it can be used as a padding symbol for SELFIES.                                           |
+---------------+-------------------------------------------------------------------------------------------------+
| ``.``         | The dot symbol is used to indicate disconnected or ionic compounds, similar to how it is        |
|               | used in SMILES.                                                                                 |
+---------------+-------------------------------------------------------------------------------------------------+
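
A small sketch shows how the special symbols might be handled around a
derivation. The helpers ``pad``, ``preprocess`` and ``epsilon_rule`` are
assumptions made for this example and are not part of the ``selfies`` API.

```python
import re

# Illustrative sketch only; these helpers are assumed for this example.

def pad(symbols, length):
    """Right-pad a symbol list with [nop] up to a fixed length."""
    return symbols + ["[nop]"] * (length - len(symbols))


def preprocess(selfies_string):
    """Split on '.' into fragments and drop every [nop] symbol."""
    fragments = []
    for part in selfies_string.split("."):
        symbols = re.findall(r"\[[^\[\]]*\]", part)
        fragments.append([s for s in symbols if s != "[nop]"])
    return fragments


def epsilon_rule(state):
    """[epsilon]: X_0 -> X_0, and X_i -> empty string for i >= 1.

    Returns the next state, or None when no non-terminal remains.
    """
    return 0 if state == 0 else None


padded = pad(["[C]", "[O]"], 4)
print(padded)
print(preprocess("".join(padded)))
```

Because ``[nop]`` is dropped before any rule is applied, padding a SELFIES to a
fixed length does not change the derived molecule in this sketch.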