bionemo-framework / Git / [b9e282] /sub-packages/bionemo-scdl/examples/example

Models:
Amanda-D/
bionemo-framework
Downloads: 1
[b9e282]: / sub-packages / bionemo-scdl / examples / example_notebook.ipynb
History
Download this file
244 lines (243 with data), 7.4 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![ Click here to deploy.](https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdeploynavy.svg)](https://console.brev.dev/launchable/deploy?launchableID=env-2pDvxgSkyyfh2QPeHZ4aGvtuqBY)\n",
    "\n",
    "<div class=\"alert alert-block alert-info\"> <b>NOTE</b> It takes about 10 minutes to deploy this notebook as a Launchable. As of this writing, we are working on a free tier so a credit card may be required. You can reach out to your NVIDIA rep for credits.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "\n",
    "import pooch\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "from bionemo.core import BIONEMO_CACHE_DIR\n",
    "from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset\n",
    "from bionemo.scdl.util.torch_dataloader_utils import collate_sparse_matrix_batch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, copy the input data. This can be done by copying https://datasets.cellxgene.cziscience.com/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad to a directory named `hdf5s`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data = pooch.retrieve(\n",
    "    \"https://datasets.cellxgene.cziscience.com/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad\",\n",
    "    path=BIONEMO_CACHE_DIR / \"hdf5s\",\n",
    "    known_hash=\"a0728e13a421bbcd6b2718e1d32f88d0d5c7cb92289331e3f14a59b7c513b3bc\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a SingleCellMemMapDataset\n",
    "dataset_temp_dir = tempfile.TemporaryDirectory()\n",
    "dataset_dir = os.path.join(dataset_temp_dir.name, \"97e_scmm\")\n",
    "\n",
    "data = SingleCellMemMapDataset(dataset_dir, input_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Save the dataset to the disk.\n",
    "data.save()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload the data\n",
    "reloaded_data = SingleCellMemMapDataset(dataset_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are various numbers of columns per observation. However, for a batch size of 1\n",
    "the data does not need to be collated. It will then be outputted in a torch tensor of shape\n",
    "(1, 2, num_obs) The first row of lengh num_obs contains the column pointers, and the second row contains the corresponding values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/pbinder/bionemo-framework/sub-packages/bionemo-scdl/src/bionemo/scdl/util/torch_dataloader_utils.py:39: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.)\n",
      "  batch_sparse_tensor = torch.sparse_csr_tensor(batch_rows, batch_cols, batch_values, size=(len(batch), max_pointer))\n"
     ]
    }
   ],
   "source": [
    "model = lambda x: x  # noqa: E731\n",
    "\n",
    "dataloader = DataLoader(data, batch_size=1, shuffle=True, collate_fn=collate_sparse_matrix_batch)\n",
    "n_epochs = 1\n",
    "for e in range(n_epochs):\n",
    "    for batch in dataloader:\n",
    "        model(batch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data can be collated with a batch size of 1 and must be collated with larger batch sizes. \n",
    "This will collate several sparse matrices into the CSR (Compressed Sparse Row) torch tensor\n",
    "format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = lambda x: x  # noqa: E731\n",
    "\n",
    "dataloader = DataLoader(data, batch_size=8, shuffle=True, collate_fn=collate_sparse_matrix_batch)\n",
    "n_epochs = 1\n",
    "for e in range(n_epochs):\n",
    "    for batch in dataloader:\n",
    "        model(batch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For some applications, we might want to also use the features. These can be specified with get_row(index, return_features = True). By default, all features are returned, but the features can be specified with the feature_vars argument in get_row, which corresponds to a list of the feature names to return. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "for index in range(len(data)):\n",
    "    model(data.get_row(index, return_features=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, if there are multiple AnnData files, they can be converted into a single SingleCellMemMapDataset. If the \n",
    "hdf5 directory has one or more AnnData files, the SingleCellCollection class crawls the filesystem to recursively find \n",
    "AnnData files (with the h5ad extension). The code below is in scripts/convert_h5ad_to_scdl.py. It will create a new\n",
    "dataset at example_dataset. This can also be called with the convert_h5ad_to_scdl command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# path to dir holding hdf5s data\n",
    "hdf5s = BIONEMO_CACHE_DIR / \"hdf5s\"\n",
    "\n",
    "# path to output dir where SCDataset will be stored\n",
    "output_dir = os.path.join(\"scdataset_output\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bionemo.scdl.io.single_cell_collection import SingleCellCollection\n",
    "\n",
    "\n",
    "with tempfile.TemporaryDirectory() as temp_dir:\n",
    "    coll = SingleCellCollection(temp_dir)\n",
    "    coll.load_h5ad_multi(hdf5s, max_workers=4, use_processes=True)\n",
    "    coll.flatten(output_dir, destroy_on_copy=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_temp_dir.cleanup()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}