{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some operation\n",
"\n",
"`h5py.File` work like a dictionary\n",
"\n",
"`Group` work like a dictionary\n",
"\n",
"`dataset` work like a numpy array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create file and store data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"k = 10\n",
"n = 100\n",
"some_data = np.random.rand(k, n, n)\n",
"with h5py.File('myh5.hdf5', 'w') as f:\n",
" dset = f.create_dataset('my_dataset', (k, n, n))\n",
" dset[:] = some_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load file and read data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(10, 100, 100)\n",
"(5, 100, 100)\n",
"0.27462065\n"
]
}
],
"source": [
"with h5py.File('myh5.hdf5', 'r') as f:\n",
" print(f['my_dataset'][:].shape) # Read all the data\n",
" print(f['my_dataset'][:5].shape) # Read first 5 of (n,n)\n",
" print(f['my_dataset'][0, 1, 2]) # Read specific number"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Group organization\n",
"\n",
"Create groups in the following archtecture\n",
"\n",
"```\n",
"f -\n",
" |-Group1\n",
" |-dataset1\n",
" |-Group2\n",
" |-dataset2\n",
" |-Subgroup1\n",
" |-dataset3\n",
" |-Subgroup2\n",
" |-\n",
" |-Group3\n",
" |-dataset4\n",
"```\n",
"“HDF” stands for “Hierarchical Data Format”. Every object in an HDF5 file has a name, and they’re arranged in a POSIX-style hierarchy with `/`-separators\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"f = h5py.File('myh5.hdf5', 'w')\n",
"# Create group\n",
"grp1 = f.create_group('Group1')\n",
"# Create group and sub group\n",
"grp2 = f.create_group('Group2')\n",
"grp21 = grp2.create_group('Subgroup1')\n",
"# Create subgroup directly using POSIX style hierarchy\n",
"grp22 = f.create_group('/Group2/Subgroup2')\n",
"# Create datasdet use group object\n",
"dset1 = grp1.create_dataset(\"dataset1\", (50,), dtype='f')\n",
"# Create use POSIX style\n",
"dset2 = f.create_dataset(\"/Group2/dataset2\", (20,), dtype='i')\n",
"dset3 = grp2.create_dataset(\"./Subgroup1/dataset3\", (10,), dtype='f')\n",
"# Create a dataset along with all parent group\n",
"dset4 = f.create_dataset('/Group3/dataset4', (10,), dtype='f')\n",
"f.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access group"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"f.name = '/'\n",
"[key for key in f.keys()] = ['Group1', 'Group2', 'Group3']\n",
"f[\"/Group1/dataset1\"] = <HDF5 dataset \"dataset1\": shape (50,), type \"<f4\">\n",
"grp2[\"dataset2\"] = <HDF5 dataset \"dataset2\": shape (20,), type \"<i4\">\n",
"grp2[\"dataset2\"].name = '/Group2/dataset2'\n",
"[item.name for key, item in grp2.items()] = ['/Group2/Subgroup1', '/Group2/Subgroup2', '/Group2/dataset2']\n",
"grp2.get(\"Subgroup2\") = <HDF5 group \"/Group2/Subgroup2\" (0 members)>\n"
]
}
],
"source": [
"f = h5py.File('myh5.hdf5', 'r')\n",
"# The name path of the object\n",
"print(f'{f.name = }')\n",
"# Find all the keys\n",
"print(f'{[key for key in f.keys()] = }')\n",
"# Access dierclty from POSIX like structure\n",
"print(f'{f[\"/Group1/dataset1\"] = }')\n",
"# Access through dictionary key\n",
"grp2 = f['Group2']\n",
"print(f'{grp2[\"dataset2\"] = }')\n",
"print(f'{grp2[\"dataset2\"].name = }')\n",
"# Access all object in the group\n",
"print(f'{[item.name for key, item in grp2.items()] = }')\n",
"# Use get method\n",
"print(f'{grp2.get(\"Subgroup2\") = }')\n",
"f.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iterate through groups, and do something to it"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"===visit===\n",
"Group1\n",
"Group1/dataset1\n",
"Group2\n",
"Group2/Subgroup1\n",
"Group2/Subgroup1/dataset3\n",
"Group2/Subgroup2\n",
"Group2/dataset2\n",
"Group3\n",
"Group3/dataset4\n",
"===visit_item===\n",
"Subgroup1: <HDF5 group \"/Group2/Subgroup1\" (1 members)>\n",
"Subgroup1/dataset3: <HDF5 dataset \"dataset3\": shape (10,), type \"<f4\">\n",
"Subgroup2: <HDF5 group \"/Group2/Subgroup2\" (0 members)>\n",
"dataset2: <HDF5 dataset \"dataset2\": shape (20,), type \"<i4\">\n"
]
}
],
"source": [
"func1 = lambda x: print(x)\n",
"func2 = lambda x, y: print(f'{x}: {y}')\n",
"f = h5py.File('myh5.hdf5', 'r')\n",
"print('===visit===')\n",
"f.visit(func1)\n",
"print('===visit_item===')\n",
"grp2 = f['Group2']\n",
"grp2.visititems(func2)\n",
"f.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attribute\n",
"\n",
"Attribute is useful for store metadata alongside with your dataset!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dset.attrs[\"resolution\"] = array([100, 100])\n",
"grp1.attrs[\"location\"] = 'Stanford'\n"
]
}
],
"source": [
"k = 10\n",
"n = 100\n",
"with h5py.File('myh5.hdf5', 'w') as f:\n",
" dset = f.create_dataset('my_dataset', (k, n, n))\n",
" dset.attrs['resolution'] = (n,n)\n",
" print(f'{dset.attrs[\"resolution\"] = }')\n",
" \n",
" grp1 = f.create_group('group1')\n",
" grp1.attrs['location'] = 'Stanford'\n",
" print(f'{grp1.attrs[\"location\"] = }')"
]
},
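{
"cell_type": "markdown",
"metadata": {},
"source": [
"Attributes can be read back the same way (a minimal sketch, not in the original notebook): reopen the file written above and inspect `.attrs` on the dataset and the group."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the attributes written in the previous cell back from disk\n",
"with h5py.File('myh5.hdf5', 'r') as f:\n",
"    print(f'{list(f[\"my_dataset\"].attrs.keys()) = }')\n",
"    print(f'{f[\"my_dataset\"].attrs[\"resolution\"] = }')\n",
"    print(f'{f[\"group1\"].attrs[\"location\"] = }')"
]
},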
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chunked storage\n",
"\n",
"By default, HDF5 dataset is store contiguous on disk in C order. You can enable chunk storage which will store in specific chunk on disk, this will also enable to change size of dataset on the fly. \n",
"\n",
"Note, when reading a chunked dataset, if any element in a chunk is accessed, the entire is read from disk. Therefore, it is idea to set chunk to one unit data (one image, one time step etc...)\n",
"\n",
"**Chunking has performance implication.**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"with h5py.File('myh5.hdf5', 'w') as f:\n",
" # Specific chunk\n",
" dset = f.create_dataset('chunked', (5, 100, 100), chunks = (1, 100, 100))\n",
" # Auto chunking\n",
" dset = f.create_dataset('autochunk', (5, 100, 100), chunks = True)"
]
},
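{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check (a minimal sketch, not in the original notebook): the chunk layout of each dataset is exposed through the `.chunks` property, which is `None` for contiguous datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with h5py.File('myh5.hdf5', 'r') as f:\n",
"    print(f'{f[\"chunked\"].chunks = }')    # the chunk shape we asked for\n",
"    print(f'{f[\"autochunk\"].chunks = }')  # the chunk shape picked by h5py"
]
},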
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Resizable dataset\n",
"\n",
"When create a dataset, you can pass `maxshape` to make the dataset resiable"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before resize: max20.shape = (1, 100, 100)\n",
"After resize: max20.shape = (15, 100, 100)\n"
]
}
],
"source": [
"some_data = np.random.rand(100, 100)\n",
"with h5py.File('myh5.hdf5', 'w') as f:\n",
" # Specific upper limit\n",
" max20 = f.create_dataset('max20', (1, 100, 100), maxshape = (20, 100, 100))\n",
" # No upper limit\n",
" unlimited = f.create_dataset('unlimited', (1, 100, 100), maxshape = (None, 100, 100))\n",
" # resize operation\n",
" max20[0] = some_data\n",
" print(f'Before resize: {max20.shape = }')\n",
" max20.resize(15, axis=0)\n",
" print(f'After resize: {max20.shape = }')\n",
" max20[14] = some_data\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optimize storage\n",
"\n",
"You can futher optimize storage using compression. Note using compression may bring quite significant performance impact. \n",
"\n",
"The compression handle purely through h5py, the data will compress when store into h5py and decompress when read. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WRITE with gzip=4: \t\t\t0.45760011672973633\n",
"WRITE with gzip=9: \t\t\t0.4559488296508789\n",
"WRITE with lzf: \t\t\t\t0.10457086563110352\n",
"WRITE with no compression: \t\t0.019588947296142578\n"
]
}
],
"source": [
"import time\n",
"\n",
"n = 2000\n",
"some_data = np.random.rand(n, n)\n",
"with h5py.File('myh5.hdf5', 'w') as f:\n",
" # GZIP compression: lossless, GOOD compression, MODERATE speed\n",
" # compression option: 0-9\n",
" # default compression level 4:\n",
" gzip = f.create_dataset('gzip', (n, n), compression='gzip')\n",
" start = time.time()\n",
" gzip[:] = some_data\n",
" print(f'WRITE with gzip=4: \\t\\t\\t{time.time()-start}')\n",
"\n",
" # max compression level:\n",
" gzip_max = f.create_dataset('gzip_max', (n, n), compression='gzip', compression_opts=9)\n",
" start = time.time()\n",
" gzip_max[:] = some_data\n",
" print(f'WRITE with gzip=9: \\t\\t\\t{time.time()-start}')\n",
"\n",
" # LZF compression: lossless, LOW to MODERATE compressinom, VERY FAST speed\n",
" # no compression option\n",
" lzf = f.create_dataset('lzf', (n, n), compression='lzf')\n",
" start = time.time()\n",
" lzf[:] = some_data\n",
" print(f'WRITE with lzf: \\t\\t\\t\\t{time.time()-start}')\n",
"\n",
" # write uncompress:\n",
" uncomp = f.create_dataset('uncomp', (n, n))\n",
" start = time.time()\n",
" uncomp[:] = some_data\n",
" print(f'WRITE with no compression: \\t\\t{time.time()-start}')\n",
"\n",
" # You may be able to further improve the compression ratio by shuffling data within the chunk, little speed penalty losless\n",
" gzip_shuffle = f.create_dataset('gzip_shuffle', (n, n), compression='gzip', shuffle=True)\n",
" lzf_shuffle = f.create_dataset('lzf_shuffle', (n, n), compression='lzf', shuffle=True)\n",
"\n",
" # Fletcher 32: add a chuecksum to each chunk to detect data corruption, it will fail with an error when read corrupted chunks\n",
" gzip_fl32 = f.create_dataset('gzip_fl32', (n, n), compression='gzip', fletcher32=True)\n",
" lzf_fl32 = f.create_dataset('lzf_fl32', (n, n), compression='lzf', fletcher32=True)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"READ with gzip=4: \t\t\t\t0.11252164840698242\n",
"READ with gzip=9: \t\t\t\t0.11008644104003906\n",
"READ with lzf: \t\t\t\t\t0.014460086822509766\n",
"READ with no compression: \t\t0.006185054779052734\n"
]
}
],
"source": [
"with h5py.File('myh5.hdf5', 'r') as f:\n",
" start = time.time()\n",
" _ = f['gzip'][:]\n",
" print(f'READ with gzip=4: \\t\\t\\t\\t{time.time()-start}')\n",
"\n",
" start = time.time()\n",
" _ = f['gzip_max'][:]\n",
" print(f'READ with gzip=9: \\t\\t\\t\\t{time.time()-start}')\n",
"\n",
" start = time.time()\n",
" _ = f['lzf'][:]\n",
" print(f'READ with lzf: \\t\\t\\t\\t\\t{time.time()-start}')\n",
"\n",
" start = time.time()\n",
" _ = f['uncomp'][:]\n",
" print(f'READ with no compression: \\t\\t{time.time()-start}')"
]
},
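{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what compression actually buys you on disk, one option (a sketch, not in the original notebook) is the low-level `dset.id.get_storage_size()`, which reports the allocated size in bytes. Random float data compresses poorly, so do not expect large savings in this particular example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with h5py.File('myh5.hdf5', 'r') as f:\n",
"    for name in ['gzip', 'gzip_max', 'lzf', 'uncomp']:\n",
"        # Number of bytes the dataset occupies on disk\n",
"        print(f'{name}: {f[name].id.get_storage_size()} bytes')"
]
},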
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some example\n",
"\n",
"The following code are not throughly tested, use as a reference for how you could convert your dataset into hdf5 file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### x only\n",
"\n",
"For unsupervised training, when you only have one type of data. It use dynamic loading to increase size as data is added. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ROOT = SOME/PATH/TO/DIR\n",
"\n",
"def loader (path):\n",
" # Loader should return a numpy array that fit with chunks size\n",
" return np.load(str(path))\n",
"\n",
"h5_path = Path(ROOT)/'test.h5'\n",
"data_dir = Path(ROOT)/'data'\n",
"\n",
"ext = 'npy'\n",
"init_shape = (1,12,128,128)\n",
"maxshape = (None,12,128,128) # None means no upper limit\n",
"chunks = (1,12,128,128) # Usually set the chunk to how much data you will load at a time (like 1 image)\n",
"dataset_name = 'image'\n",
"\n",
"with h5py.File(str(h5_path), 'w') as f:\n",
" dset = f.create_dataset(dataset_name, init_shape, maxshape=maxshape, chunks=chunks)\n",
" pbar = tqdm(list(data_dir.glob(f'*.{ext}')))\n",
" for i, path in enumerate(pbar):\n",
" x = loader(path)\n",
" dset.resize(i+1, axis=0)\n",
" dset[i,...] = x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (x,y)\n",
"\n",
"For supervised task, when you have both data and label. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from PIL import Image\n",
"\n",
"ROOT = SOME/PATH/TO/DIR\n",
"\n",
"def im_loader (path):\n",
" im = Image.open(path)\n",
" return np.array(im)\n",
"\n",
"def label_loader (label_file_path): # exmaple of label loader\n",
" with open(label_file_path) as f:\n",
" for line in f:\n",
" yield line.strip()\n",
"\n",
"data_dir = Path(ROOT)/'image'\n",
"label_path = Path(ROOT)/'label.txt'\n",
"\n",
"ext = 'jpg'\n",
"img_shape = (1,128,128,3)\n",
"img_maxshape = (None,128,128,3) \n",
"img_chunks = (1,128,128,3) \n",
"\n",
"label_shape = (1)\n",
"label_maxshape = (None)\n",
"\n",
"label_gen = label_loader(label_path)\n",
"\n",
"with h5py.File(str(h5_path), 'w') as f:\n",
" im_dset = f.create_dataset('image', img_shape, maxshape=img_maxshape, chunks=img_chunks)\n",
" l_dset = f.create_dataset('label', label_shape, maxshape=label_maxshape)\n",
" pbar = tqdm(list(data_dir.glob(f'*.{ext}')))\n",
" for i, path in enumerate(pbar):\n",
" x = loader(path)\n",
" y = next(label_gen)\n",
" im_dset.resize(i+1, axis=0)\n",
" im_dset[i,...] = x\n",
" l_dset.resize(i+1, axis=0)\n",
" l_dset[i] = y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataloader\n",
"\n",
"Exmaple of dataloader when using hdf5 file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class H5Dataset(torch.utils.data.Dataset):\n",
" # ... some other methods ... \n",
" \n",
" def __getitem__ (self, index):\n",
" with h5py.File(str(self.data_path), 'r') as f:\n",
" dset = f['image']\n",
" im_np = dset[index, ...]\n",
" return self.transforms(im_np)"
]
}
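,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A slightly fuller sketch (names such as `H5ImageDataset`, `data_path`, and `transforms` are illustrative, not from the original): the same idea with `__init__` and `__len__` filled in, reading the dataset length once up front and reopening the file on every access."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"class H5ImageDataset(torch.utils.data.Dataset):\n",
"    def __init__(self, data_path, transforms=None):\n",
"        self.data_path = data_path\n",
"        self.transforms = transforms\n",
"        # Read the length once; keep the file closed between accesses\n",
"        with h5py.File(str(data_path), 'r') as f:\n",
"            self.length = f['image'].shape[0]\n",
"\n",
"    def __len__(self):\n",
"        return self.length\n",
"\n",
"    def __getitem__(self, index):\n",
"        with h5py.File(str(self.data_path), 'r') as f:\n",
"            im_np = f['image'][index, ...]\n",
"        if self.transforms is not None:\n",
"            im_np = self.transforms(im_np)\n",
"        return im_np"
]
}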
],
"metadata": {
"interpreter": {
"hash": "fe50492733d7449d3e23ddde50178f9615d673c5e7dc001510615bbd620e525d"
},
"kernelspec": {
"display_name": "Python 3.9.7 ('self-sup')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}