{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "---\n", "title: \"Creating a Petastorm Dataset from ImageNet\"\n", "date: 2021-05-03\n", "type: technical_note\n", "draft: false\n", "---" ] }, { "cell_type": "markdown", "id": "elect-companion", "metadata": {}, "source": [ "# Creating a petastorm Dataset from ImageNet\n", "\n", "## Why petastorm?\n", "Petastorm is an open-source library for large datasets, suited for high-throughput I/O applications. Petastorm uses Parquet as a columnar storage format, which allows for better compression than e.g. the .csv format and lets datasets that are fragmented across many files be combined into fewer and larger files. You should consider petastorm when your DataLoader needs to read a lot of files and this is slowing down your training. One drawback of petastorm datasets is that you lose random access to individual elements and can no longer query the dataset's length. In PyTorch Dataset terms, a petastorm dataset behaves like an `IterableDataset`.\n", "\n", "## The dataset\n", "For this example, we use the ImageNette dataset (https://github.com/fastai/imagenette), a subset of the original ImageNet dataset. It contains ten categories with training and test images for each. The images vary in size from roughly 150x150 pixels up to 4K resolution.\n", "\n", "## The files\n", "This notebook assumes that the ImageNette folder is present and extracted in _DataSets/ImageNet/imagenette/_." ] }, { "cell_type": "code", "execution_count": 1, "id": "cultural-redhead", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "
ID | YARN Application ID | Kind | State | Spark UI | Driver log |
---|---|---|---|---|---|
61 | application_1615797295425_0002 | pyspark | idle | Link | Link |