{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "---\n",
    "title: \"Creating a Petastorm Dataset from ImageNet\"\n",
    "date: 2021-05-03\n",
    "type: technical_note\n",
    "draft: false\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "elect-companion",
   "metadata": {},
   "source": [
    "# Creating a petastorm Dataset from ImageNet\n",
    "\n",
    "## Why petastorm?\n",
    "Petastorm is an open source library for large datasets, suited for high throughput I/O applications. Petastorm uses parquet as a columnar storage format which allows for better compression than e.g. the .csv format and combines fragmented datasets consisting of many files into fewer and larger files. You should use petastorm when your DataLoader needs to read a lot of files and is slowing down your training. One drawback of petastorm datasets is that you loose random access to elements and the dataset's length. In PyTorch Dataset terms, petastorm implements an `IterativeDataset`.\n",
    "\n",
    "## The dataset\n",
    "For this example, we use the ImageNette dataset (https://github.com/fastai/imagenette), a subset of the original ImageNet dataset. It contains ten categories with training and test images for each. The images vary in their size from merely ~150x150 resolution to 4k images.\n",
    "\n",
    "## The files\n",
    "This notebook assumes that the ImageNette folder is present and extracted in _DataSets/ImageNet/imagenette/_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cultural-redhead",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting Spark application\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th></tr><tr><td>61</td><td>application_1615797295425_0002</td><td>pyspark</td><td>idle</td><td><a target=\"_blank\" href=\"http://moritzgpu-master-upgrade.internal.cloudapp.net:8088/proxy/application_1615797295425_0002/\">Link</a></td><td><a target=\"_blank\" href=\"http://moritzgpu-worker-2.internal.cloudapp.net:8042/node/containerlogs/container_e65_1615797295425_0002_01_000001/PyTorch_spark_minimal__realamac\">Link</a></td></tr></table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SparkSession available as 'spark'.\n"
     ]
    }
   ],
   "source": [
    "from hops import hdfs\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import torch\n",
    "import torchvision.transforms as T\n",
    "from PIL import Image"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "japanese-cream",
   "metadata": {},
   "source": [
    "## Creating a PyTorch Dataset\n",
    "We first create a Dataset just as we would for a normal PyTorch training script. The goal here is to produce a generator for training samples that behaves exactly as we are used from PyTorch. We could of course implement another row generator with the same functionality, but this is probably the easiest way.\n",
    "### Reading the metadata\n",
    "The metadata of our dataset is stored in a .csv file located in the root folder. It contains the labels of each image and its source path. For convenience, we relabel the classes into integers. Also, since distributed training performs best on balanced datasets (=same amount of batches per worker), we drop the last few examples. This is not strictly necessary, but can help you maximize your GPU utilization later on.\n",
    "### Loading and transforming the images\n",
    "Petastorm expects its images to be of uniform size. This is a direct consequence of the column-based storage format with strict schemas. In the `__getitem__` function, we therefore need to crop the images to a standard resolution. Luckily, PyTorch datasets enable us to do so by simply applying a transformation after reading the image and its label. For now this transform is generic, it will be defined later. Another advantage of defining our own dataset is that we have no problems performing I/O operations on our DFS, which would fail when simply calling `os.open()`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "historic-senator",
   "metadata": {},
   "outputs": [],
   "source": [
    "class ImageNetDataset(torch.utils.data.Dataset):\n",
    "    \n",
    "    def __init__(self, path, n_exec, batch_size=1, transform=None, test_set=False):\n",
    "        super().__init__()\n",
    "        self.root = path\n",
    "        self.df = pd.read_csv(path + \"noisy_imagenette.csv\")\n",
    "        self.transform = transform\n",
    "        if test_set:\n",
    "            self.df = self.df[self.df.is_valid]\n",
    "        else:\n",
    "            self.df = self.df[self.df.is_valid == False]\n",
    "        self.df.drop([\"noisy_labels_\" + str(i) for i in [1, 5, 25,50]], axis=1, inplace=True)\n",
    "        self.labels = {\"n01440764\": 0,  # \"tench\" \n",
    "                       \"n02102040\": 1,  # \"English springer\"\n",
    "                       \"n02979186\": 2,  # \"cassette player\"\n",
    "                       \"n03000684\": 3,  # \"chain saw\"\n",
    "                       \"n03028079\": 4,  # \"church\"\n",
    "                       \"n03394916\": 5,  # \"French horn\"\n",
    "                       \"n03417042\": 6,  # \"garbage truck\"\n",
    "                       \"n03425413\": 7,  # \"gas pump\"\n",
    "                       \"n03445777\": 8,  # \"golf ball\"\n",
    "                       \"n03888257\": 9,  # \"parachute\"\n",
    "                      }\n",
    "        # Drop the last samples to create a balanced dataset.\n",
    "        self.req_items = n_exec*batch_size\n",
    "        len_df = len(self.df)\n",
    "        self.df.drop(index=self.df.index[len_df-len_df%self.req_items-1:len_df-1], inplace=True)\n",
    "    \n",
    "    def __len__(self):\n",
    "        return len(self.df)\n",
    "\n",
    "    def __getitem__(self, idx):\n",
    "        row = self.df.iloc[idx]\n",
    "        label = self.labels[row[\"noisy_labels_0\"]]\n",
    "        f = hdfs.open_file(self.root + row[\"path\"])\n",
    "        try:\n",
    "            img = Image.open(f).convert(\"RGB\")\n",
    "        finally:\n",
    "            f.close()\n",
    "        if self.transform:\n",
    "            img = self.transform(img)\n",
    "        sample = {\"image\": np.array(img, dtype=np.uint8), \"label\": label}\n",
    "        return sample"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "drawn-welsh",
   "metadata": {},
   "source": [
    "### Creating the datasets\n",
    "We create both the train and the test set by setting the correct source path for their folder. Since petastorm requires uniformly sized images as previously mentioned, we add a torch transform to resize and crop the images into fitting sizes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "hairy-village",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9344\n",
      "3840"
     ]
    }
   ],
   "source": [
    "root_folder = hdfs.project_path() + \"DataSets/ImageNet/imagenette/\"\n",
    "\n",
    "train_path = hdfs.project_path() + \"DataSets/ImageNet/PetastormImageNette/train\"\n",
    "test_path = hdfs.project_path() + \"DataSets/ImageNet/PetastormImageNette/test\"\n",
    "\n",
    "transform = T.Compose(\n",
    "    [T.Resize(256),\n",
    "     T.CenterCrop(256),\n",
    "    ])\n",
    "\n",
    "train_dataset = ImageNetDataset(root_folder, 2, 64, transform=transform)\n",
    "test_dataset = ImageNetDataset(root_folder, 2, 64, transform=transform, test_set=True)\n",
    "\n",
    "print(\"Length training set: {}, Size of test set: {}\".format(len(train_dataset), len(test_dataset))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "encouraging-dealing",
   "metadata": {},
   "source": [
    "### Preparing the petastorm schema and generator\n",
    "In order to create a petastorm dataset, we need to provide two things: First, a schema of our dataset, and second, a row generator that maps indices of the dataset to a dictionary containing the data. In our case, the schema is as simple as choosing a numpy array of size (256,256,3), dtype uint8 for the images, and an int8 scalar for the labels. Note that for images other than PyTorch's tensors, the channels reside in the 3rd dimension. For more information on the schema, go to https://github.com/uber/petastorm. Since we already created our datasets for this exact purpose, we can simply return the dataset at the desired index as a row generator.\n",
    "### Configuring the dataset\n",
    "The `with materialize_dataset()` API then creates a petastorm dataset with the spark operations performed within the context. Note that you can adjust the amount of parquet files. For distributed training, it is necessary to have at least as many files as nodes. In general, it is advised to make `parquet_files_count` divisable by the number of nodes in your setup. It also controls the size of the individual files by distributing the total size over more files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "published-ending",
   "metadata": {},
   "outputs": [],
   "source": [
    "from petastorm.codecs import CompressedImageCodec, ScalarCodec\n",
    "from petastorm.etl.dataset_metadata import materialize_dataset\n",
    "from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row\n",
    "from pyspark.sql.types import IntegerType\n",
    "\n",
    "\n",
    "ImageNetSchema = Unischema('ScalarSchema', [\n",
    "   UnischemaField('image', np.uint8, (256, 256, 3), CompressedImageCodec(\"png\"), False),\n",
    "   UnischemaField('label', np.int8, (), ScalarCodec(IntegerType()), False)])\n",
    "\n",
    "\n",
    "def row_generator(idx, dataset):\n",
    "    return dataset[idx]\n",
    "\n",
    "\n",
    "def generate_ImageNet_dataset(output_url, dataset):\n",
    "    rowgroup_size_mb = 8\n",
    "    rows_count = len(dataset)\n",
    "    parquet_files_count = 16\n",
    "    \n",
    "    sc = spark.sparkContext\n",
    "    # Wrap dataset materialization portion. Will take care of setting up spark environment variables as\n",
    "    # well as save petastorm specific metadata\n",
    "    with materialize_dataset(spark, output_url, ImageNetSchema, rowgroup_size_mb):\n",
    "        rows_rdd = sc.parallelize(range(rows_count))\\\n",
    "            .map(lambda x: row_generator(x, dataset))\\\n",
    "            .map(lambda x: dict_to_spark_row(ImageNetSchema, x))\n",
    "\n",
    "        spark.createDataFrame(rows_rdd, ImageNetSchema.as_spark_schema()) \\\n",
    "            .repartition(parquet_files_count) \\\n",
    "            .write \\\n",
    "            .mode('overwrite') \\\n",
    "            .parquet(output_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "advanced-contract",
   "metadata": {},
   "source": [
    "### Write the dataset\n",
    "Once we defined all necessary functions and schemas, we can simply call the functions on the two different datasets to invoke the creation of our datasets. Note that this might take a few minutes due to the amount of data that has to be read and processed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "young-wagner",
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_ImageNet_dataset(train_path, train_dataset)\n",
    "generate_ImageNet_dataset(test_path, test_dataset)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PySpark",
   "language": "python",
   "name": "pysparkkernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "python",
    "version": 3
   },
   "mimetype": "text/x-python",
   "name": "pyspark",
   "pygments_lexer": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}