Switch to side-by-side view

--- a
+++ b/docs/source/derivation.rst
@@ -0,0 +1,281 @@
+Derivation
+============
+
+This section is an informal tutorial on how molecules are derived
+from a SELFIES. The SELFIES grammar has non-terminal symbols or states
+
+.. math::
+
+    X_0, \ldots, X_7, Q
+
+Derivation starts with state :math:`X_0`. The SELFIES is read symbol-by-symbol,
+with each symbol specifying a grammar rule. SELFIES derivation terminates
+when no non-terminal symbols remain. In each subsection, we describe a type of
+SELFIES symbol and the grammar rules associated with it.
+
+Atomic Symbols
+##############
+
+Atomic symbols are of the general form ``[<B><A>]``, where
+``<B> in {'', '/', '\\', '=', '#'}`` is a prefix representing a bond,
+and ``<A>`` is a SMILES symbol representing an atom or ion.
+If the SMILES symbol is enclosed by square brackets (e.g. ``[13C]``),
+then the square brackets are dropped and ``expl`` (for "explicit brackets")
+is appended to obtain ``<A>``. For example:
+
+.. table::
+    :align: center
+
+    +---------+---------------+--------------+----------------+
+    | ``<B>`` | SMILES symbol | ``<A>``      | SELFIES symbol |
+    +=========+===============+==============+================+
+    | ``'='`` | ``N``         | ``N``        | ``[=N]``       |
+    +---------+---------------+--------------+----------------+
+    | ``''``  | ``[C@@H]``    | ``C@@Hexpl`` | ``[C@@Hexpl]`` |
+    +---------+---------------+--------------+----------------+
+    | ``'/'`` | ``[O+]``      | ``O+expl``   | ``[/O+expl]``  |
+    +---------+---------------+--------------+----------------+
+
+Let atomic symbol ``[<B><A>]`` be given, where ``<B>`` is a prefix
+representing a bond with multiplicity :math:`\beta` and ``<A>`` is an atom
+that can make :math:`\alpha` bonds maximally. The atomic symbol maps:
+
+.. math::
+
+    X_i \to \begin{cases}
+        \texttt{<B'><A>}  & \alpha - \mu = 0 \\
+        \texttt{<B'><A>} X_{\alpha - \mu}  & \alpha - \mu \neq 0
+    \end{cases}
+
+where ``<B'>`` is a prefix representing a bond with multiplicity
+:math:`\mu = \min(\beta, \alpha, i)`, or the empty string if :math:`\mu = 0`.
+Note that non-terminal states :math:`X_i` effectively restrict the subsequent
+bond to a multiplicity of at most :math:`i`. We provide an example of
+the derivation of the SELFIES ``[F][=C][=C][#N]``:
+
+.. math::
+
+    X_0 \to \texttt{F}X_1 \to \texttt{FC}X_3 \to \texttt{FC=C}X_2 \to \texttt{FC=C=N}
+
+
+**Discussion:** Intuitively, the formal grammar has the following behaviour.
+An atomic symbol ``[<B><A>]`` connects atom ``<A>`` to the previously derived
+atom through bond type ``<B>``. If creating this bond would violate the bond
+constraints of the previous or current atom, the bond multiplicity is reduced
+(minimally) such that all bond constraints are fulfilled.
+
+**Examples:**
+
+.. table::
+    :align: center
+
+    +---------+-----------------------------+-----------------+
+    | Example | SELFIES                     | SMILES          |
+    +=========+=============================+=================+
+    | 1       | ``[C][=C][C][#C][13Cexpl]`` | ``C=CC#C[13C]`` |
+    +---------+-----------------------------+-----------------+
+    | 2       | ``[C][F][C][C][C][C]``      | ``CF``          |
+    +---------+-----------------------------+-----------------+
+    | 3       | ``[C][O][=C][#O][C][F]``    | ``COC=O``       |
+    +---------+-----------------------------+-----------------+
+
+Index Symbols
+#############
+
+The state :math:`Q` is used to derive the size of branches and
+the location of ring bonds. After a ring or branch symbol, the subsequent
+one or more SELFIES symbols are used to derive an integer from :math:`Q`.
+Note that the specific branch and ring symbol itself will specify exactly
+how many symbols are used in the derivation (e.g. ``[Ring3]`` indicates
+that the subsequent three symbols are used).
+
+First, each subsequent symbol :math:`s_i` is converted to an
+index :math:`\text{idx}(s_i)`, according to the following assignment:
+
+.. table::
+    :align: center
+
+    +-------+-----------------+-------+-----------------+
+    | Index | Symbol          | Index | Symbol          |
+    +=======+=================+=======+=================+
+    | 0     | ``[C]``         | 8     | ``[Branch2_3]`` |
+    +-------+-----------------+-------+-----------------+
+    | 1     | ``[Ring1]``     | 9     | ``[O]``         |
+    +-------+-----------------+-------+-----------------+
+    | 2     | ``[Ring2]``     | 10    | ``[N]``         |
+    +-------+-----------------+-------+-----------------+
+    | 3     | ``[Branch1_1]`` | 11    | ``[=N]``        |
+    +-------+-----------------+-------+-----------------+
+    | 4     | ``[Branch1_2]`` | 12    | ``[=C]``        |
+    +-------+-----------------+-------+-----------------+
+    | 5     | ``[Branch1_3]`` | 13    | ``[#C]``        |
+    +-------+-----------------+-------+-----------------+
+    | 6     | ``[Branch2_1]`` | 14    | ``[S]``         |
+    +-------+-----------------+-------+-----------------+
+    | 7     | ``[Branch2_2]`` | 15    | ``[P]``         |
+    +-------+-----------------+-------+-----------------+
+    | All other symbols assigned index 0.               |
+    +-------+-----------------+-------+-----------------+
+
+Then :math:`Q` is mapped to the hexadecimal (base 16) integer specified
+by the indices. For example, if three symbols :math:`s_1, s_2, s_3` are
+used in the derivation, then :math:`Q` is mapped to:
+
+.. math::
+
+    Q \to (\text{idx}(s_1) \times 16^2) + (\text{idx}(s_2) \times 16) + \text{idx}(s_3)
+
+For example, ``[Ring3][C][Branch1_1][O]`` will derive the number :math:`(039)_{16}=57`.
+
+Branch Symbols
+##############
+
+Branch symbols are of the general form ``[Branch<L>_<M>]``, where
+``<L>, <M> in {1, 2, 3}``. A branch symbol specifies a branch from the
+main chain, analogous to the open and closed curved brackets in SMILES.
+In SELFIES, a branch is derived by a recursive call to the SELFIES
+derivation.
+
+A Branch symbol ``[Branch<L>_<M>]`` maps:
+
+.. math::
+
+    X_i \to \begin{cases}
+        X_i & i \leq 1 \\
+        B(Q, X_{n})X_j & i > 1
+    \end{cases}
+
+where :math:`n = \min(i - 1, \texttt{<M>})` is the derivation state of a new branch,
+and :math:`j = i - n` is the new derivation state of the main chain. In the :math:`i > 1`
+case, the ``<L>`` subsequent symbols are used to derive an integer from the
+state :math:`Q`. Then :math:`B(Q, X_{n})` takes the next :math:`Q + 1` symbols,
+and recursively derives them with initial derivation state :math:`X_{n}`.
+The resulting fragment is taken to be the derived branch, and derivation
+proceeds with the next derivation state :math:`X_j`.
+
+**Discussion:**  Intuitively, branch symbols are skipped for states
+:math:`X_{0-1}` because the previous atom can make at most one bond
+(branches require at least two bonds to be free). It is possible
+that a branch is nested at the start of another branch; in SELFIES, both
+branches will be connected to the same main chain atom (see Example 5 below).
+
+**Examples:**
+
++---------+-------------------------------------------------------+---------------+-------------------------+
+| Example | SELFIES                                               | :math:`Q + 1` | SMILES                  |
++=========+=======================================================+===============+=========================+
+| 1       | ``[C][Branch1_1][C][F][Cl]``                          | 1             | ``C(F)Cl``              |
++---------+-------------------------------------------------------+---------------+-------------------------+
+| 2       | ``[C][Branch1_2][Ring2][=C][C][C][Cl]``               | 3             | ``C(=CCC)Cl``           |
++---------+-------------------------------------------------------+---------------+-------------------------+
+| 3       | ``[S][Branch1_2][C][=O][Branch1_2][C]``               | 1, 1, 1       | ``S(=O)(=O)([O-])[O-]`` |
+|         |                                                       |               |                         |
+|         | ``[=O][Branch1_1][C][O-expl][O-expl]``                |               |                         |
++---------+-------------------------------------------------------+---------------+-------------------------+
+| 4       | ``[C][Branch2_1][Ring1][Branch1_2][C]``               | 21            | ``C(CC...CC)F``         |
+|         |                                                       |               |                         |
+|         | ``[C][C][C][C][C][C][C][C][C][C][C][C]``              |               |                         |
+|         |                                                       |               |                         |
+|         | ``[C][C][C][C][C][C][C][C][F]``                       |               |                         |
+|         +-------------------------------------------------------+---------------+-------------------------+
+|         | Example 4 has a single branch of 21 carbon atoms.                                               |
++---------+-------------------------------------------------------+---------------+-------------------------+
+| 5       | ``[C][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F]`` | 4, 1          | ``C(C)(Cl)F``           |
++---------+-------------------------------------------------------+---------------+-------------------------+
+
+
+Ring Symbols
+############
+
+Ring symbols are of the general form ``[Ring<L>]`` or ``[Expl<B>Ring<L>]``,
+where ``<L> in {1, 2, 3}`` and ``<B> in {'/', '\\', '=', '#'}`` is a
+prefix representing a bond. A ring symbol specifies a ring bond between two
+atoms, analogous to the ring numbering digits in SMILES.
+
+A Ring symbol ``[Ring<L>]`` maps:
+
+.. math::
+
+    X_i \to \begin{cases}
+        X_i & i = 0 \\
+        R(Q)X_i & i \neq 0
+    \end{cases}
+
+In the :math:`i \neq 0` case, the ``<L>`` subsequent symbols are used to
+derive an integer from the state :math:`Q`. Then :math:`R(Q)` connects the
+*current* atom to the :math:`(Q + 1)`-th preceding atom through a
+single bond. More specifically, the *current* atom is the most recently
+derived atom within the current derivation instance (see Example 5 below).
+If the *current* atom is the :math:`m`-th derived atom, then
+a bond is made between the :math:`m`-th derived atom and the :math:`n`-th
+derived atom, where :math:`n = \max(1, m - (Q + 1))`.
+
+The Ring symbol ``[Expl<B>Ring<L>]`` has an equivalent function to
+``[Ring<L>]``, except that it connects the current and :math:`(Q + 1)`-th
+preceding atom through a bond of type ``<B>``.
+
+**Discussion**: In practice, ring bonds are created during a second pass,
+after all atoms and branches have been derived. The candidate ring
+bonds are temporarily stored in a queue, and then made in
+the order that they appear in the SELFIES. A ring bond will be made if
+its connected atoms can make the ring bond without violating any
+bond constraints. This is the only non-local rule in SELFIES, but is
+efficiently implemented as this number can be determined only by looking
+at one location.
+
+It is also possible that the current atom is already bonded to the
+:math:`(Q + 1)`-th preceding atom, e.g. if :math:`Q = 0`. In this case,
+the multiplicity of the existing bond is increased by the multiplicity of
+the ring bond candidate. Then the multiplicity of the resulting bond is reduced
+(minimally) such that no bond constraints are violated, and the multiplicity
+is at most 3 (see Example 6 below).
+
+**Examples:**
+
++---------+------------------------------------------------------------+---------------+------------------+
+| Example | SELFIES                                                    | :math:`Q + 1` | SMILES           |
++=========+============================================================+===============+==================+
+| 1       | ``[C][=C][C][=C][C][=C][Ring1][Branch1_2]``                | 5             | ``C1=CC=CC=C1``  |
++---------+------------------------------------------------------------+---------------+------------------+
+| 2       | ``[C][C][=C][C][=C][C][Expl=Ring1][Branch1_2]``            | 5             | ``C=1C=CC=CC=1`` |
++---------+------------------------------------------------------------+---------------+------------------+
+| 3       | ``[C][C][Expl=Ring1][C]``                                  | 1             | ``C#C``          |
++---------+------------------------------------------------------------+---------------+------------------+
+| 4       | ``[C][C][C][C][C][C][C][C][C][C][C]``                      | 21            | ``C1CC...CC1``   |
+|         |                                                            |               |                  |
+|         | ``[C][C][C][C][C][C][C][C][C][C][C]``                      |               |                  |
+|         |                                                            |               |                  |
+|         | ``[Ring2][Ring1][Branch1_2]``                              |               |                  |
+|         +------------------------------------------------------------+---------------+------------------+
+|         | Example 4 is a single carbon ring of 22 carbon atoms.                                         |
++---------+------------------------------------------------------------+---------------+------------------+
+| 5       | ``[C][C][C][C][Branch1_1][C][C][Ring1][Ring2][C][C]``      | 3             | ``C1CCC1(C)CC``  |
+|         +------------------------------------------------------------+---------------+------------------+
+|         | Note that the SMILES ``CC1CC(C1)CC`` is not outputted.                                        |
++---------+------------------------------------------------------------+---------------+------------------+
+| 6       | ``[C][C][C][C][Expl=Ring1][Ring2][Expl#Ring1][Ring2]``     | 3, 3          | ``C#1CCC#1``     |
++---------+------------------------------------------------------------+---------------+------------------+
+
+
+
+Special Symbols
+###############
+
+The following are symbols that have a special meaning for SELFIES:
+
+.. _no operation: https://en.wikipedia.org/wiki/NOP_(code)
+
++---------------+-------------------------------------------------------------------------------------------------+
+| Character     | Description                                                                                     |
++===============+=================================================================================================+
+| ``[epsilon]`` | The ``[epsilon]`` symbol maps :math:`X_0 \to X_0` and :math:`X_i \to \epsilon` (the empty       |
+|               | string) for all :math:`i \geq 1`.                                                               |
++---------------+-------------------------------------------------------------------------------------------------+
+| ``[nop]``     | The nop (`no operation`_) symbol is always ignored and skipped over by :func:`selfies.decoder`. |
+|               |                                                                                                 |
+|               | Thus, it can be used as a padding symbol for SELFIES.                                           |
++---------------+-------------------------------------------------------------------------------------------------+
+| ``.``         | The dot symbol is used to indicate disconnected or ionic compounds, similar to how it is        |
+|               |                                                                                                 |
+|               | used in SMILES.                                                                                 |
++---------------+-------------------------------------------------------------------------------------------------+