## HSFS feature exploration

In this notebook we are going to walk through how to use the HSFS library to explore feature groups and features in the Hopsworks Feature Store. 

A key goal of the Hopsworks feature store is to enable sharing and reuse of features across models and use cases. To that end, the HSFS library allows users to join features from different feature groups and use them to create training datasets.
Features can also be taken from different feature stores (projects), as long as the user running the notebook has read access to them.

![Join](./images/join.svg "Join")

As in the [feature_engineering](./feature_engineering.ipynb) notebook, the first step is to establish a connection with the feature store and retrieve the project's feature store handle.

In [1]:
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
5,application_1604957327609_0009,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

### Explore feature groups

You can interact with feature groups as if they were Spark DataFrames. A feature group object has a `show()` method to display the first `n` rows, and a `read()` method to read the content of the feature group into a Spark DataFrame.

Before doing any operation on a feature group, you need to get its handle from the feature store. The `get_feature_group` method accepts the name of the feature group and an optional parameter with the version you want to select. If you do not provide a version, the API defaults to version 1.

In [2]:
sales_fg = fs.get_feature_group("sales_fg")



In [3]:
sales_fg.show(5)

+----------------------------+-----+--------------------------+--------------------------+------------+----------+----------+---------------------+------------------------+------------------------------+-------------------------+----------------------+----+
|sales_last_quarter_store_dep|store|sales_last_month_store_dep|sales_last_six_month_store|weekly_sales|is_holiday|      date|sales_last_year_store|sales_last_quarter_store|sales_last_six_month_store_dep|sales_last_year_store_dep|sales_last_month_store|dept|
+----------------------------+-----+--------------------------+--------------------------+------------+----------+----------+---------------------+------------------------+------------------------------+-------------------------+----------------------+----+
|                         0.0|   20|                       0.0|                       0.0|    32362.95|     false|2010-02-05|                  0.0|                     0.0|                           0.0|                      0

In [4]:
sales_df = sales_fg.read()
sales_df.filter("store == 20").show(5)
print(type(sales_df))

+----------------------------+-----+--------------------------+--------------------------+------------+----------+----------+---------------------+------------------------+------------------------------+-------------------------+----------------------+----+
|sales_last_quarter_store_dep|store|sales_last_month_store_dep|sales_last_six_month_store|weekly_sales|is_holiday|      date|sales_last_year_store|sales_last_quarter_store|sales_last_six_month_store_dep|sales_last_year_store_dep|sales_last_month_store|dept|
+----------------------------+-----+--------------------------+--------------------------+------------+----------+----------+---------------------+------------------------+------------------------------+-------------------------+----------------------+----+
|                         0.0|   20|                       0.0|                       0.0|    32362.95|     false|2010-02-05|                  0.0|                     0.0|                           0.0|                      0

You can also inspect the metadata of the feature group. You can, for instance, list the features the feature group is made of and whether they are primary or partition keys:

In [5]:
print("Name: {}".format(sales_fg.name))
print("Description: {}".format(sales_fg.description))
print("Features:")
features = sales_fg.features
for feature in features:
    print("{:<60} \t Primary: {} \t Partition: {}".format(feature.name, feature.primary, feature.partition))

Name: sales_fg
Description: Sales related features
Features:
sales_last_quarter_store_dep                                 	 Primary: False 	 Partition: False
store                                                        	 Primary: True 	 Partition: False
sales_last_month_store_dep                                   	 Primary: False 	 Partition: False
sales_last_six_month_store                                   	 Primary: False 	 Partition: False
weekly_sales                                                 	 Primary: False 	 Partition: False
is_holiday                                                   	 Primary: False 	 Partition: False
date                                                         	 Primary: True 	 Partition: False
sales_last_year_store                                        	 Primary: False 	 Partition: False
sales_last_quarter_store                                     	 Primary: False 	 Partition: False
sales_last_six_month_store_dep                               	 Prima
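
The same metadata can be collected programmatically, for example to check which features are keys before joining. The snippet below is a minimal sketch: `key_columns` is a hypothetical helper, and it relies only on the `features`, `name`, `primary` and `partition` attributes used in the cell above. A stand-in object replaces the real feature group so that the sketch runs without a feature store connection.

```python
from types import SimpleNamespace


def key_columns(feature_group):
    """Return (primary_keys, partition_keys) as lists of feature names."""
    primary = [f.name for f in feature_group.features if f.primary]
    partition = [f.name for f in feature_group.features if f.partition]
    return primary, partition


# Stand-in for a feature group handle (illustrative subset of features).
demo_fg = SimpleNamespace(features=[
    SimpleNamespace(name="store", primary=True, partition=False),
    SimpleNamespace(name="weekly_sales", primary=False, partition=False),
    SimpleNamespace(name="date", primary=True, partition=False),
])

primary_keys, partition_keys = key_columns(demo_fg)
print(primary_keys, partition_keys)
```

With a real handle you would call `key_columns(sales_fg)` instead of passing the stand-in.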

If you are interested in only a subset of features, you can use the `select()` method on the feature group object to select a list of features. The object returned by `select()` behaves like a feature group, so you can call the `.show()` or `.read()` methods on it.

In [6]:
sales_fg.select(['store', 'dept', 'weekly_sales']).show(5)

+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
|   20|  55|    32362.95|
|   20|  94|    63787.83|
|   20|  22|    17597.83|
|   20|  30|     9488.37|
|   20|   2|    85812.69|
+-----+----+------------+
only showing top 5 rows

If your feature group is available both online and offline, you can use the `online` option of the `show()` and `read()` methods to specify whether you want to read your feature group from the online storage.

In [7]:
sales_fg_3 = fs.get_feature_group('sales_fg', 3)

sales_fg_3.select(['store', 'dept', 'weekly_sales']).show(5, online=True)

+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
|   24|   8|    51815.65|
|   19|  46|    22333.69|
|   19|  83|     3570.62|
|   22|  10|    17241.23|
|    8|  49|     2580.28|
+-----+----+------------+
only showing top 5 rows

### Join Features and Feature Groups

HSFS provides an API similar to Pandas to join feature groups together and to select features from different feature groups.
The simplest query you can write selects all the features from one feature group and joins them with all the features of another feature group.

You can use the `select_all()` method of a feature group to select all its features. HSFS relies on the Hopsworks feature store to identify which features of the two feature groups to use as the joining condition.
If you don't specify anything, Hopsworks will use the largest matching subset of primary keys with the same name.

In the example below, `sales_fg` has `store`, `dept` and `date` as its composite primary key, while `exogenous_fg` has only `store` and `date`. Hopsworks will therefore use `store` and `date` as the joining condition.
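
To make the default rule concrete, here is a minimal sketch of it in plain Python. It only mirrors the rule as described above (primary keys with the same name in both feature groups); the actual resolution happens inside Hopsworks, and `default_join_keys` is a hypothetical name used for illustration.

```python
def default_join_keys(left_primary_keys, right_primary_keys):
    """Primary key names present in both feature groups, in left-hand order."""
    right = set(right_primary_keys)
    return [key for key in left_primary_keys if key in right]


# Primary keys of sales_fg and exogenous_fg as stated in the text above.
join_keys = default_join_keys(["store", "dept", "date"], ["store", "date"])
print(join_keys)
```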

In [8]:
sales_fg = fs.get_feature_group('sales_fg')
exogenous_fg = fs.get_feature_group('exogenous_fg')

query = sales_fg.select_all().join(exogenous_fg.select_all())



You can use the query object to create training datasets (see training dataset notebook). You can inspect the query generated by calling the `to_string()` method on it.

In [9]:
print(query.to_string())

SELECT `fg0`.`sales_last_quarter_store_dep`, `fg0`.`store`, `fg0`.`sales_last_month_store_dep`, `fg0`.`sales_last_year_store_dep`, `fg0`.`sales_last_month_store`, `fg0`.`dept`, `fg0`.`sales_last_year_store`, `fg0`.`sales_last_quarter_store`, `fg0`.`sales_last_six_month_store_dep`, `fg0`.`sales_last_six_month_store`, `fg0`.`weekly_sales`, `fg0`.`is_holiday`, `fg0`.`date`, `fg1`.`markdown5`, `fg1`.`markdown2`, `fg1`.`fuel_price`, `fg1`.`markdown1`, `fg1`.`markdown4`, `fg1`.`cpi`, CASE WHEN `fg1`.`appended_feature` IS NULL THEN 10.0 ELSE `fg1`.`appended_feature` END `appended_feature`, `fg1`.`temperature`, `fg1`.`markdown3`, `fg1`.`unemployment`, `fg1`.`is_holiday`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg0`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg0`.`date` = `fg1`.`date` AND `fg0`.`store` = `fg1`.`store`

As with feature groups, you can call the `show()` method to inspect the data before generating a training dataset from it, or call the `read()` method to get a Spark DataFrame with the result of the query and apply additional transformations to it.

In [10]:
query.show(5)

+----------------------------+-----+--------------------------+-------------------------+----------------------+----+---------------------+------------------------+------------------------------+--------------------------+------------+----------+----------+---------+---------+----------+---------+---------+-----------+----------------+-----------+---------+------------+----------+
|sales_last_quarter_store_dep|store|sales_last_month_store_dep|sales_last_year_store_dep|sales_last_month_store|dept|sales_last_year_store|sales_last_quarter_store|sales_last_six_month_store_dep|sales_last_six_month_store|weekly_sales|is_holiday|      date|markdown5|markdown2|fuel_price|markdown1|markdown4|        cpi|appended_feature|temperature|markdown3|unemployment|is_holiday|
+----------------------------+-----+--------------------------+-------------------------+----------------------+----+---------------------+------------------------+------------------------------+--------------------------+----------

As with the `show()` and `read()` methods of a feature group, you can specify which storage a query should run against.
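
A minimal sketch of what this could look like follows. It assumes that the query's `show()` takes the same `online` flag as the feature group's `show()` shown earlier; `preview` and `RecordingQuery` are hypothetical names, and the stand-in class only records calls so that the sketch runs without a cluster.

```python
def preview(query, n=5, online=False):
    """Show the first n rows of a query from the offline or online storage."""
    # Assumption: Query.show() accepts the same `online` flag as FeatureGroup.show().
    return query.show(n, online=online)


class RecordingQuery:
    """Hypothetical stand-in that records calls instead of running a query."""

    def __init__(self):
        self.calls = []

    def show(self, n, online=False):
        self.calls.append((n, online))


demo_query = RecordingQuery()
preview(demo_query)               # offline storage (default)
preview(demo_query, online=True)  # online storage
print(demo_query.calls)
```

With a real query object you would pass `query` instead of `demo_query`. Reading from the online storage presumably requires the feature groups in the query to be available online.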

### Select only a subset of features

You can replace the `select_all()` method with the `select([])` method to select only a subset of features from a feature group you want to join:

In [11]:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']))
query.show(5)

+-----+----+------------+----------+
|store|dept|weekly_sales|fuel_price|
+-----+----+------------+----------+
|   20|  55|    32362.95|     2.784|
|   20|  94|    63787.83|     2.784|
|   20|  22|    17597.83|     2.784|
|   20|  30|     9488.37|     2.784|
|   20|   2|    85812.69|     2.784|
+-----+----+------------+----------+
only showing top 5 rows

### Overwrite the joining key

If your feature groups don't have a primary key, if their keys have different names, or if you want to overwrite the joining key, you can pass it as a parameter of the join.

As in Pandas, if the feature has the same name in both feature groups, you can use the `on=[]` parameter. If the names differ, you can use the `left_on=[]` and `right_on=[]` parameters:

In [12]:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']), on=['date'])
query.show(5)

+-----+----+------------+----------+
|store|dept|weekly_sales|fuel_price|
+-----+----+------------+----------+
|   20|  55|    32362.95|     2.784|
|   20|  55|    32362.95|     2.666|
|   20|  55|    32362.95|     2.572|
|   20|  55|    32362.95|     2.962|
|   20|  55|    32362.95|      2.58|
+-----+----+------------+----------+
only showing top 5 rows
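
Note that the same sales row now appears several times with different fuel prices. The most likely reason is that joining on `date` alone pairs each sales row with the `exogenous_fg` rows of every store for that date. The sketch below reproduces the effect with a naive join over made-up rows in plain Python; `inner_join` is a hypothetical helper, not part of HSFS, and the values for stores 21 and 22 are invented for illustration.

```python
def inner_join(left_rows, right_rows, keys):
    """Naive nested-loop inner join of two lists of dicts on `keys`."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            if all(left[k] == right[k] for k in keys):
                # Left values win for columns present on both sides.
                joined.append({**right, **left})
    return joined


# Toy data: one sales row, three exogenous rows for the same date.
sales = [{"store": 20, "date": "2010-02-05", "weekly_sales": 32362.95}]
exogenous = [
    {"store": 20, "date": "2010-02-05", "fuel_price": 2.784},
    {"store": 21, "date": "2010-02-05", "fuel_price": 2.666},
    {"store": 22, "date": "2010-02-05", "fuel_price": 2.572},
]

on_date = inner_join(sales, exogenous, ["date"])
on_store_and_date = inner_join(sales, exogenous, ["store", "date"])
print(len(on_date), len(on_store_and_date))  # 3 1
```

Joining on a subset of the keys is therefore something to do deliberately, since it can multiply the number of rows in the result.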

### Overwrite the join type

By default, the join type between two feature groups is `INNER JOIN`. You can overwrite this behavior by passing the `join_type` parameter to the join method. Valid types are: `INNER`, `LEFT`, `RIGHT`, `FULL`, `CROSS`, `LEFT_SEMI_JOIN`, `COMMA`.

In [13]:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']), join_type="left")

print(query.to_string())

SELECT `fg0`.`store`, `fg0`.`dept`, `fg0`.`weekly_sales`, `fg1`.`fuel_price`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg0`
LEFT JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg0`.`store` = `fg1`.`store` AND `fg0`.`date` = `fg1`.`date`
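
As a reminder of what changes with a left join, the sketch below runs a naive left join over made-up rows in plain Python. It is an illustration of standard SQL semantics, not of how HSFS executes the query; `left_join` is a hypothetical helper and store 99 is invented so that one row has no match.

```python
def left_join(left_rows, right_rows, keys, right_columns):
    """Naive left join: keep every left row, fill unmatched columns with None."""
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if all(left[k] == r[k] for k in keys)]
        if not matches:
            # No partner on the right: an inner join would drop this row.
            matches = [{column: None for column in right_columns}]
        for match in matches:
            row = dict(left)
            row.update({column: match[column] for column in right_columns})
            joined.append(row)
    return joined


sales = [
    {"store": 20, "date": "2010-02-05", "weekly_sales": 32362.95},
    {"store": 99, "date": "2010-02-05", "weekly_sales": 100.0},
]
exogenous = [{"store": 20, "date": "2010-02-05", "fuel_price": 2.784}]

rows = left_join(sales, exogenous, ["store", "date"], ["fuel_price"])
print(rows)
```

The row for store 99 is kept with `fuel_price` set to `None`, which Spark would display as `null`.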

### Join multiple feature groups

You can concatenate as many feature groups as you wish. In the example below, the order of execution will be:

    (`sales_fg` <> `store_fg`) <> `exogenous_fg`

The join parameters you pass in each `join()` method call apply to that specific join. This means that you can chain left and right joins.
Please be aware that HSFS currently **does not support** nested joins such as:

    `sales_fg` <> (`store_fg` <> `exogenous_fg`)

In [14]:
store_fg = fs.get_feature_group("store_fg")

query = sales_fg.select_all()\
                .join(store_fg.select_all())\
                .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))

print(query.to_string())

SELECT `fg0`.`sales_last_quarter_store_dep`, `fg0`.`store`, `fg0`.`sales_last_month_store_dep`, `fg0`.`sales_last_year_store_dep`, `fg0`.`sales_last_month_store`, `fg0`.`dept`, `fg0`.`sales_last_year_store`, `fg0`.`sales_last_quarter_store`, `fg0`.`sales_last_six_month_store_dep`, `fg0`.`sales_last_six_month_store`, `fg0`.`weekly_sales`, `fg0`.`is_holiday`, `fg0`.`date`, `fg1`.`size`, `fg1`.`type`, `fg1`.`num_depts`, `fg2`.`fuel_price`, `fg2`.`unemployment`, `fg2`.`cpi`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg0`
INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg1` ON `fg0`.`store` = `fg1`.`store`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg2` ON `fg0`.`store` = `fg2`.`store` AND `fg0`.`date` = `fg2`.`date`
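
If the feature groups to join come from a list, the chained calls can be written as a left fold, which makes the left-to-right order explicit. This is a minimal sketch: `chain_joins` is a hypothetical helper, it assumes only that each query object has a `join()` method that returns a query, and `FakeQuery` is a stand-in that records the join order so the sketch runs without a feature store.

```python
from functools import reduce


def chain_joins(base_query, sub_queries):
    """Join sub-queries onto base_query one at a time, left to right."""
    return reduce(lambda query, sub: query.join(sub), sub_queries, base_query)


class FakeQuery:
    """Stand-in for an HSFS query that only records the join order."""

    def __init__(self, label):
        self.label = label

    def join(self, other):
        return FakeQuery("({} <> {})".format(self.label, other.label))


chained = chain_joins(
    FakeQuery("sales_fg"),
    [FakeQuery("store_fg"), FakeQuery("exogenous_fg")],
)
print(chained.label)  # ((sales_fg <> store_fg) <> exogenous_fg)
```

With real handles, `chain_joins(sales_fg.select_all(), [store_fg.select_all(), exogenous_fg.select(['fuel_price', 'unemployment', 'cpi'])])` should build the same query as the cell above, since each join uses the default parameters.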

### Free hand query

With HSFS you are free to skip the Hopsworks query constructor entirely and write your own query. This functionality can be useful if you need to express more complex queries for your use case. `fs.sql` returns a Spark DataFrame.

In [15]:
fs.sql("SELECT * FROM `store_fg_1`")

DataFrame[store: int, type: string, size: int, num_depts: bigint]
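
As a sketch of a slightly more complex free hand query, the snippet below builds an aggregation over `sales_fg`. It assumes the `<name>_<version>` table naming seen in the generated SQL earlier in this notebook (for example `sales_fg_1`); `offline_table_name` and `average_sales_per_store_sql` are hypothetical helpers, and the query itself is only an illustration.

```python
def offline_table_name(feature_group_name, version=1):
    """Build the `<name>_<version>` table name seen in the generated SQL."""
    return "{}_{}".format(feature_group_name, version)


def average_sales_per_store_sql(version=1):
    """SQL string computing the average weekly sales per store."""
    table = offline_table_name("sales_fg", version)
    return (
        "SELECT `store`, AVG(`weekly_sales`) AS `avg_weekly_sales` "
        "FROM `{}` GROUP BY `store`".format(table)
    )


sql = average_sales_per_store_sql()
print(sql)
# With a feature store handle: avg_df = fs.sql(sql)
```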