# Example of Saving Image Data as a Feature Group in the Feature Store

Image data can often be fed to deep learning models as raw data, and it typically requires less feature engineering than other types of data. In many cases you therefore do **not** need to store image data as a feature group in the feature store; instead, you can save it directly as a training dataset, for example in .tfrecords or .petastorm format.

However, sometimes you want to join image features with other types of features, and you may also need feature engineering steps such as *data augmentation, image scaling, and image normalization*. This notebook shows how to save image data as a feature group in the feature store.

In [1]:
from hops import featurestore
from hops import hdfs

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
9,application_1550835076939_0011,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


## Step 1: Read in the Raw Image Data

You can read the image data from HopsFS using, for example, Spark or TensorFlow. In this example we use Spark to read a batch of images stored in the folder `mnist/` inside your project.

In [7]:
image_dir = hdfs.project_path() + "mnist"

In [8]:
hdfs.ls(image_dir)

['hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_1.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_108.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_17.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_23.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_4.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_5.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_54.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_63.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_69.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_98.jpg']

In [9]:
image_df = spark.read.format("image").load(image_dir)

## Step 2: Process the Images (Feature Engineering)

After reading the images with, for example, Spark or TensorFlow, you can apply whatever feature engineering you need before saving them to the feature store.

In [10]:
# Placeholder for your own transformations, for example:
# image_df = image_df.rdd.map(...).toDF()
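
As one possible instance of such a step, the sketch below normalizes pixel values with NumPy. It is an illustration only: the helper names `scale_pixels` and `standardize` and the synthetic image are assumptions, not part of the hops API. In a Spark job you would apply functions like these to decoded image arrays, for example inside `rdd.map`.

```python
import numpy as np

def scale_pixels(pixels: np.ndarray) -> np.ndarray:
    """Scale uint8 pixel values from [0, 255] to floats in [0.0, 1.0]."""
    return pixels.astype(np.float32) / 255.0

def standardize(pixels: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Shift and scale one image to zero mean and roughly unit variance."""
    pixels = pixels.astype(np.float32)
    # eps guards against division by zero for constant images.
    return (pixels - pixels.mean()) / (pixels.std() + eps)

# Hypothetical stand-in for one decoded 28x28 grayscale image.
example = np.zeros((28, 28), dtype=np.uint8)
example[10:18, 10:18] = 255

scaled = scale_pixels(example)
standardized = standardize(example)
print(scaled.min(), scaled.max(), scaled.dtype)
```

Scaling to [0, 1] keeps the original contrast, while standardization centers each image individually; which one fits depends on the model you train.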

## Step 3: Saving the Processed Images to the Feature Store as a Feature Group

To save the images to the feature store as a feature group, you can store them using the schema that Spark automatically applies to images when it reads them:

```
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
```
Or you can set up your own custom format for storing the images (for example, flattening each image to a float array).
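
A custom format of this kind could be sketched as follows. The helper `flatten_image` is hypothetical, and the sketch assumes one byte per pixel value (as with 8-bit images such as MNIST); a plain dictionary stands in for one row of the Spark image struct.

```python
def flatten_image(data, height, width, n_channels=1):
    """Flatten raw image bytes into a list of floats in [0.0, 1.0].

    Assumes one byte per pixel value (8-bit images).
    """
    expected = height * width * n_channels
    if len(data) != expected:
        raise ValueError(
            "expected %d bytes, got %d" % (expected, len(data)))
    # Iterating over bytes yields integers in the range 0..255.
    return [value / 255.0 for value in bytes(data)]

# Hypothetical stand-in for the fields of one Spark image struct.
image = {"height": 2, "width": 2, "nChannels": 1,
         "data": bytearray([0, 51, 102, 255])}

flat = flatten_image(image["data"], image["height"],
                     image["width"], image["nChannels"])
print(flat)
```

A flat list of floats maps naturally onto a Spark array column, at the cost of losing the height and width unless you store them in separate columns.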

In [12]:
featurestore.create_featuregroup(image_df, "mnist_images_featuregroup",
                                 feature_correlation=False,
                                 cluster_analysis=False,
                                 feature_histograms=False)

computing descriptive statistics for : mnist_images_featuregroup
Running sql: use demo_featurestore_admin000_featurestore

In [13]:
image_fg = featurestore.get_featuregroup("mnist_images_featuregroup")

Running sql: use demo_featurestore_admin000_featurestore
Running sql: SELECT * FROM mnist_images_featuregroup_1

In [14]:
image_fg.show(5)

+--------------------+
|               image|
+--------------------+
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
+--------------------+
only showing top 5 rows

In [15]:
image_fg.printSchema()

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

## Example of Saving a Feature Group of Images as a Petastorm Training Dataset for Deep Learning

In [37]:
import numpy as np
import tensorflow as tf
from petastorm import make_reader
from petastorm.codecs import ScalarCodec, NdarrayCodec
from petastorm.tf_utils import tf_tensors, make_petastorm_dataset
from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField

### Read Featuregroup with Images

In [28]:
image_fg = featurestore.get_featuregroup("mnist_images_featuregroup")

Running sql: use demo_featurestore_admin000_featurestore
Running sql: SELECT * FROM mnist_images_featuregroup_1

#### Inspect the Dimensions

In [29]:
image_fg.select(["image.nChannels", "image.height", "image.width"]).first()

Row(nChannels=1, height=28, width=28)

#### Drop Metadata Columns, as Metadata Will Be Stored in the Petastorm Schema Instead

In [30]:
image_fg_binary = image_fg.withColumn("img", image_fg.image.data).drop("image")

In [31]:
image_fg_binary.printSchema()

root
 |-- img: binary (nullable = true)

#### Convert Binary Byte Arrays to Numpy Arrays (Petastorm schema uses numpy datatypes)

In [32]:
# If we had labels for the images, we could join the dataframe with the labels
# and add them as another field in the schema.
ImageSchema = Unischema('ImageSchema', [
    UnischemaField('img', np.uint8, (28, 28), NdarrayCodec(), False)
])

def create_dict(row):
    # Convert the raw byte array into a 28x28 matrix matching ImageSchema.
    return {
        "img": np.array(row.img).reshape(28, 28)
    }
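
The hard-coded `reshape` to 28x28 above only works when every image holds exactly 784 bytes. A slightly more defensive variant, sketched here with a hypothetical helper `bytes_to_image` and synthetic data, checks the size before reshaping:

```python
import numpy as np

def bytes_to_image(raw, height=28, width=28):
    """Interpret raw bytes as a (height, width) uint8 image.

    Raises ValueError if the byte count does not match the shape.
    """
    # frombuffer returns a read-only view; call .copy() if you need to write.
    pixels = np.frombuffer(bytes(raw), dtype=np.uint8)
    if pixels.size != height * width:
        raise ValueError(
            "expected %d bytes, got %d" % (height * width, pixels.size))
    return pixels.reshape(height, width)

# Synthetic 784-byte payload standing in for one value of the `img` column.
raw = bytearray(range(256)) * 3 + bytearray(16)
image = bytes_to_image(raw)
print(image.shape, image.dtype)
```

Failing early with a clear message is usually easier to debug than a reshape error raised deep inside a Spark task.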

In [33]:
df_compatible_with_petastorm_schema = spark.createDataFrame(
    image_fg_binary.rdd
        .map(create_dict)
        .map(lambda x: dict_to_spark_row(ImageSchema, x)),
    ImageSchema.as_spark_schema())

#### Save Petastorm Training Dataset

In [34]:
petastorm_args = {"schema": ImageSchema}
featurestore.create_training_dataset(df_compatible_with_petastorm_schema, "image_test_petastorm_from_fg",
                                     data_format="petastorm", petastorm_args=petastorm_args,
                                     descriptive_statistics=False, feature_correlation=False,
                                     feature_histograms=False, cluster_analysis=False)

#### Read a Petastorm Image Dataset using TensorFlow

In [38]:
def tensorflow_read_td(training_dataset):
    # Look up where the training dataset is stored in the feature store.
    OUTPUT_URL = featurestore.get_training_dataset_path(training_dataset)
    # Example: use tf.data.Dataset API
    with make_reader(OUTPUT_URL, hdfs_driver='libhdfs') as reader:
        dataset = make_petastorm_dataset(reader)
        # make_one_shot_iterator and tf.Session are TensorFlow 1.x APIs.
        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
            # Fetch and print a single sample from the dataset.
            sample = sess.run(tensor)
            print(sample)

In [39]:
tensorflow_read_td("image_test_petastorm_from_fg")

ImageSchema_view(img=array([[  0,   0,   0,   0,   0,   0,   0,   0,   4,   0,   0,   0,   0,
          0,   6,   0,   5,   0,   0,   0,   0,   0,   0,   1,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   3,  12,   0,  15,
          0,   0,   0,   0,   4,   5,   0,   0,   8,  10,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   9,   0,   0,   0,   4,
          3,   0,   6,   3,   0,   0,  10,   6,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  10,   2,   9,   0,
         12,   4,   0,   0,   7,   1,   0,   0,   6,  15,   9,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   4,   2,   0,  16,  60,
        139, 138,  76,   0,   0,   1,   6,   5,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   1,  14,  72, 255,
        255, 233, 241,   0,   0,   3,   4,   1,   0,   7,  16,   0,   0,
        