# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 01: Load, Engineer & Connect</span>

<span style="font-width:bold;"> This is the first part of the Hopsworks Feature Store quick-start tutorial series.
The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store**.</span>

## **🗒️ This notebook is divided into 5 sections:**
1. Connect to the Hopsworks feature store.
2. Load the data and engineer features.
3. Create feature groups and upload them to the feature store.
4. Append additional data and features to a feature group.
5. Delete a feature group.

![tutorial-flow](images/01_featuregroups.png)

## <span style="color:#ff5f27;"> 🧑🏻‍🏫 Features and Feature Groups </span>

The Hopsworks feature store is a centralized repository, within an organization, to manage machine learning features. A feature is a measurable property of a phenomenon. It could be a simple value such as the age of a customer, or it could be an aggregated value, such as the number of transactions made by a customer in the last 30 days.
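
To make the notion of an aggregated feature concrete, here is a minimal plain-Python sketch that computes "number of transactions made by a customer in the last 30 days". The transaction records, the customer ids and the helper name `transactions_last_n_days` are hypothetical and only serve as an illustration.

```python
from datetime import date, timedelta

# Hypothetical raw data: (customer_id, transaction_date)
transactions = [
    ("c1", date(2022, 5, 25)),
    ("c1", date(2022, 6, 10)),
    ("c1", date(2022, 3, 1)),
    ("c2", date(2022, 6, 18)),
]

def transactions_last_n_days(rows, as_of, days=30):
    """Count transactions per customer in the window (as_of - days, as_of]."""
    start = as_of - timedelta(days=days)
    counts = {}
    for customer_id, tx_date in rows:
        if start < tx_date <= as_of:
            counts[customer_id] = counts.get(customer_id, 0) + 1
    return counts

# The aggregated feature value for each customer as of 2022-06-20.
feature = transactions_last_n_days(transactions, as_of=date(2022, 6, 20))
print(feature)
```

The raw rows are the data; the per-customer count is the feature that would be stored in a feature group.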

A feature is not restricted to a numeric value; it could also be a string representing an address, or an image.

![Feature Store Overview](../images/overview.svg "Feature Store Overview")

A feature store is not a pure storage service; it goes hand in hand with feature computation. Feature engineering is the process of transforming raw data into a format that is compatible with and understandable by predictive models.

In this notebook we focus on the left side of the picture above, in particular on how data engineers can create features and push them to the Hopsworks feature store so that they are available to data scientists.

### <span style="color:#ff5f27;">🧑🏻‍🏫 HSFS library</span>

The Hopsworks feature store library is called `hsfs` (**H**opswork**s** **F**eature **S**tore).
The library is Apache V2 licensed and available [here](https://github.com/logicalclocks/feature-store-api). It is currently available for Python and for JVM languages such as Scala and Java.
In this notebook, we cover the Python part.

You can find the complete documentation of the library here: 

The first step is to establish a connection with your Hopsworks feature store instance and retrieve the object that represents the feature store you'll be working with. 

By default, `connection.get_feature_store()` returns the feature store of the project you are working with. However, it also accepts a project name as a parameter to select a different feature store.

In [1]:
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

# Hops HDFS helper, used to access the project's data.
from hops import hdfs

Connected. Call `.close()` to terminate connection gracefully.


> With `hdfs` we can get the **project name** via `hdfs.project_name()` and the **project path** via `hdfs.project_path()`.

In [2]:
project_name = hdfs.project_name()
project_name

'Basics'

In [3]:
project_path = hdfs.project_path()
project_path

'hdfs://rpc.namenode.service.consul:8020/Projects/Basics/'

---

# <span style="color:#ff5f27;">🔬 🧬 Working with Data</span>

We are going to use a dataset containing information related to a chain of department stores. The dataset is taken from [Kaggle](https://www.kaggle.com/manjeetsingh/retaildataset?select=Features+data+set.csv).

We are going to create 3 feature groups:
- `stores_fg`: contains features related to the store itself, mainly the category, the number of departments and the size.
- `sales_fg`: contains sales features for each store/department over the weeks.
- `exogenous_fg`: contains features that are not related to the stores themselves but have an effect on sales, for instance the gas price, the unemployment rate and the temperature in the area.

## <span style="color:#ff5f27;"> 💽 Loading Data </span>

In [4]:
import pandas as pd

In [5]:
stores_csv = pd.read_csv(project_path + 'Jupyter/archive/stores data-set.csv')
stores_csv.head()



Unnamed: 0,store,type,size
0,1,A,151315
1,2,A,202307
2,3,B,37392
3,4,A,205863
4,5,B,34875


In [6]:
exogenous_csv = pd.read_csv(project_path + 'Jupyter/archive/Features data set.csv')
exogenous_csv.head()

Unnamed: 0,store,date,temperature,fuel_price,markdown1,markdown2,markdown3,markdown4,markdown5,cpi,unemployment,is_holiday
0,1,05/02/2010,42.31,2.572,,,,,,211.096358,8.106,False
1,1,12/02/2010,38.51,2.548,,,,,,211.24217,8.106,True
2,1,19/02/2010,39.93,2.514,,,,,,211.289143,8.106,False
3,1,26/02/2010,46.63,2.561,,,,,,211.319643,8.106,False
4,1,05/03/2010,46.5,2.625,,,,,,211.350143,8.106,False


In [7]:
sales_csv = pd.read_csv(project_path + 'Jupyter/archive/sales data-set.csv')
sales_csv.head()

Unnamed: 0,store,dept,date,weekly_sales,is_holiday
0,1,1,05/02/2010,24924.5,False
1,1,1,12/02/2010,46039.49,True
2,1,1,19/02/2010,41595.55,False
3,1,1,26/02/2010,19403.54,False
4,1,1,05/03/2010,21827.9,False


---
## <span style="color:#ff5f27;"> 🛠 🪄 Feature Engineering and Feature Group Creation </span>

In [8]:
stores_depts_count = pd.merge(stores_csv,sales_csv,on = 'store')
stores_depts_count = stores_depts_count.groupby('store').nunique('dept')['dept'].reset_index()
stores_depts_count.head()

Unnamed: 0,store,dept
0,1,77
1,2,78
2,3,72
3,4,78
4,5,72


In [9]:
stores_fg = pd.merge(stores_csv,stores_depts_count, on = 'store')
stores_fg.head()

Unnamed: 0,store,type,size,dept
0,1,A,151315,77
1,2,A,202307,78
2,3,B,37392,72
3,4,A,205863,78
4,5,B,34875,72


In [10]:
store_fg_meta = fs.get_or_create_feature_group(
    name = "store_fg",
    version = 1,
    primary_key = ['store'],
    description = "Store related features",
    online_enabled = True
)

Up to this point we have just created the metadata object representing the feature group. However, we haven't saved the feature group in the feature store yet. To do so, we can call the method `insert` on the metadata object created in the cell above.

In [11]:
store_fg_meta.insert(stores_fg)

Feature Group created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fg/1073
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/store_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3ba41c7a00>, None)

#### <span style="color:#ff5f27;">⛳️ Sales Dataset</span>

Unlike `store_fg`, `sales_fg` has a composite primary key: each entry in `sales_fg` is uniquely identified by the store, the department and the week (the `date` column). In a later version of this feature group we also specify a partition key. Partitioning is a tool for improving the performance of queries on a feature group.
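
Before choosing a primary key, it can help to check that the chosen columns really identify each row uniquely. The sketch below does this in plain Python on a few hypothetical rows shaped like the sales data; the helper `duplicate_keys` is an illustration and not part of `hsfs`.

```python
from collections import Counter

# Hypothetical rows: one row per store, department and week.
rows = [
    {"store": 1, "dept": 1, "date": 1265328000000, "weekly_sales": 100.0},
    {"store": 1, "dept": 1, "date": 1265932800000, "weekly_sales": 200.0},
    {"store": 1, "dept": 2, "date": 1265328000000, "weekly_sales": 10.0},
]

def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once for the given columns."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# The composite key is unique, while `store` alone is not.
dups_composite = duplicate_keys(rows, ["store", "dept", "date"])
dups_store_only = duplicate_keys(rows, ["store"])
print(dups_composite, dups_store_only)
```

An empty result for the composite key suggests that `['store', 'dept', 'date']` is a valid choice for these rows.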

In [12]:
from datetime import datetime

def timestamp_2_time(x):
    dt_obj = datetime.strptime(x, '%d/%m/%Y')
    dt_obj = dt_obj.timestamp() * 1000
    return int(dt_obj)
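
Note that `datetime.strptime` returns a naive datetime, and `timestamp()` interprets a naive datetime in the local time zone of the machine running the notebook. The values shown in this notebook correspond to midnight UTC. As a hedged alternative, the sketch below pins the time zone explicitly so the result does not depend on the machine; the function name `date_to_epoch_millis_utc` is an assumption made for this example.

```python
from datetime import datetime, timezone

def date_to_epoch_millis_utc(value, fmt="%d/%m/%Y"):
    """Parse a date string and return milliseconds since the Unix epoch at midnight UTC."""
    dt = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

millis = date_to_epoch_millis_utc("05/02/2010")
print(millis)
```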

In [13]:
sales_csv.date = sales_csv.date.apply(timestamp_2_time)
sales_csv.head()

Unnamed: 0,store,dept,date,weekly_sales,is_holiday
0,1,1,1265328000000,24924.5,False
1,1,1,1265932800000,46039.49,True
2,1,1,1266537600000,41595.55,False
3,1,1,1267142400000,19403.54,False
4,1,1,1267747200000,21827.9,False


In [14]:
windows = [30,90,180,365]

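for window in windows:
    # Note: `window` counts rows (weeks in this dataset), not calendar days.
    # Per store and department; assumes rows are sorted by store and dept so positions line up.
    sales_csv[f'sales_last_{window}_days_store_dep'] = sales_csv.groupby(['store', 'dept']).weekly_sales.rolling(window = window).sum().fillna(0).reset_index(drop = True)
    # Per store
    sales_csv[f'sales_last_{window}_days_store'] = sales_csv.groupby('store').weekly_sales.rolling(window = window).sum().fillna(0).reset_index(drop = True)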

sales_csv.is_holiday = sales_csv.is_holiday.apply(int)
sales_csv.head()

Unnamed: 0,store,dept,date,weekly_sales,is_holiday,sales_last_30_days_store_dep,sales_last_30_days_store,sales_last_90_days_store_dep,sales_last_90_days_store,sales_last_180_days_store_dep,sales_last_180_days_store,sales_last_365_days_store_dep,sales_last_365_days_store
0,1,1,1265328000000,24924.5,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1,1265932800000,46039.49,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,1,1266537600000,41595.55,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,1,1267142400000,19403.54,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,1,1267747200000,21827.9,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
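
The zeros in the first rows are expected: an integer `window` in pandas `rolling` counts rows, so a window of 30 needs 30 weekly rows before it produces a value, and the missing values are filled with 0. If a window measured in calendar days is wanted instead, the idea can be sketched in plain Python as follows. The sample rows and the helper `trailing_sum_by_days` are hypothetical and only illustrate the windowing logic.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical weekly sales: (store, dept, week_date, weekly_sales)
sales = [
    (1, 1, date(2010, 2, 5), 100.0),
    (1, 1, date(2010, 2, 12), 200.0),
    (1, 1, date(2010, 3, 12), 50.0),
    (1, 2, date(2010, 2, 5), 10.0),
]

def trailing_sum_by_days(rows, days):
    """For each row, sum the sales of the same (store, dept) in (date - days, date]."""
    by_key = defaultdict(list)
    for store, dept, day, amount in rows:
        by_key[(store, dept)].append((day, amount))

    result = {}
    for (store, dept), series in by_key.items():
        for day, _ in series:
            start = day - timedelta(days=days)
            result[(store, dept, day)] = sum(a for d, a in series if start < d <= day)
    return result

sums_30d = trailing_sum_by_days(sales, days=30)
print(sums_30d[(1, 1, date(2010, 3, 12))])
```

With a day-based window, every row gets a value from the first week on, because the window is defined by dates rather than by a minimum number of rows.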


In [15]:
sales_fg_meta = fs.get_or_create_feature_group(
    name = "sales_fg",
    version = 1,
    primary_key = ['store', 'dept', 'date'],
    description = "Sales related features",
    online_enabled = True
)

sales_fg_meta.insert(sales_csv)

Feature Group created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fg/1074
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/sales_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3b979b2f70>, None)

When creating a feature group we can also specify a `partition key`. Partition keys help organize the feature data on the file system and improve performance when reading the feature group data. Like the `primary key`, the `partition key` can be a composite one.
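
The effect of a partition key can be pictured with a simplified model in which rows are bucketed into directory-like paths named after the partition values. This is only a sketch under that assumption, not a description of how HSFS lays out files internally; the rows and the helper `partition_rows` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows shaped like the sales data.
rows = [
    {"store": 1, "dept": 1, "weekly_sales": 100.0},
    {"store": 1, "dept": 2, "weekly_sales": 10.0},
    {"store": 2, "dept": 1, "weekly_sales": 70.0},
]

def partition_rows(rows, partition_key):
    """Group rows into buckets keyed by a path such as 'store=1' or 'store=1/dept=2'."""
    partitions = defaultdict(list)
    for row in rows:
        path = "/".join(f"{col}={row[col]}" for col in partition_key)
        partitions[path].append(row)
    return dict(partitions)

by_store = partition_rows(rows, ["store"])
by_store_and_dept = partition_rows(rows, ["store", "dept"])

# A query filtered on store == 2 only needs to read one bucket.
store_2_rows = by_store["store=2"]
print(sorted(by_store), sorted(by_store_and_dept))
```

A composite partition key produces more, smaller buckets, which helps selective queries but can hurt if the buckets become very small.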

In [16]:
sales_part_fg_meta = fs.get_or_create_feature_group(
    name = "sales_fg",
    version = 2,
    primary_key = ['store', 'dept', 'date'],
    partition_key = ['store'],
    description = "Sales related features",
    time_travel_format = None,
    statistics_config = False
)

sales_part_fg_meta.insert(sales_csv)

Feature Group created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fg/1075
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/sales_fg_2_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3b979da910>, None)

You can make a feature group available online by setting the `online_enabled` flag to `True`.

By default, `HSFS` configures the feature group so that feature data that is saved or inserted is written to the offline feature store. If `online_enabled = True`, the data is additionally saved to the online storage of the feature store. Note that writing to both storages is not transactional.

To create a purely online feature group, save the feature group with `online_enabled = True` but with an empty dataframe, and subsequently call `insert` with `storage = "online"` to override the default and write to the online feature store only.
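
The write behaviour described above can be pictured with a toy model. The class below is a simplified stand-in written for this tutorial, not the HSFS implementation: it models the offline store as a list that keeps every inserted row and the online store as a dictionary holding the latest row per primary key. Both choices, and the class name `ToyFeatureGroup`, are assumptions of the sketch.

```python
class ToyFeatureGroup:
    """Simplified stand-in for a feature group; not the HSFS implementation."""

    def __init__(self, primary_key, online_enabled=False):
        self.primary_key = primary_key
        self.online_enabled = online_enabled
        self.offline = []  # every inserted row, in insertion order
        self.online = {}   # latest row per primary key

    def insert(self, rows, storage=None):
        # storage=None mimics the default: offline, plus online if enabled.
        if storage == "online" and not self.online_enabled:
            raise ValueError("feature group is not online enabled")
        if storage in (None, "offline"):
            self.offline.extend(rows)
        if storage == "online" or (storage is None and self.online_enabled):
            for row in rows:
                key = tuple(row[col] for col in self.primary_key)
                self.online[key] = row

fg = ToyFeatureGroup(primary_key=["store"], online_enabled=True)
fg.insert([{"store": 1, "size": 100}])                    # default: both stores
fg.insert([{"store": 1, "size": 120}], storage="online")  # online store only
print(len(fg.offline), fg.online[(1,)]["size"])
```

In this model the second insert updates the online value without touching the offline data, which mirrors the purpose of `storage = "online"`.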

In [17]:
sales_online_fg_meta = fs.get_or_create_feature_group(
    name = "sales_fg",
    version = 3,
    primary_key = ['store', 'dept', 'date'],
    online_enabled = True,
    description = "Sales related features"
)

sales_online_fg_meta.insert(sales_csv)

Feature Group created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fg/1076
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/sales_fg_3_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3b979c55b0>, None)

#### <span style="color:#ff5f27;">⛳️ Exogenous Dataset </span>

This feature group will contain exogenous features that can influence sales but are not under the control of the distribution chain, such as the unemployment rate and the consumer price index (CPI).
We are going to write these features to the feature store as they are.

In [19]:
exogenous_csv.date = exogenous_csv.date.apply(timestamp_2_time)
exogenous_csv.is_holiday = exogenous_csv.is_holiday.apply(int)
exogenous_csv.head()

Unnamed: 0,store,date,temperature,fuel_price,markdown1,markdown2,markdown3,markdown4,markdown5,cpi,unemployment,is_holiday
0,1,1265328000000,42.31,2.572,,,,,,211.096358,8.106,0
1,1,1265932800000,38.51,2.548,,,,,,211.24217,8.106,1
2,1,1266537600000,39.93,2.514,,,,,,211.289143,8.106,0
3,1,1267142400000,46.63,2.561,,,,,,211.319643,8.106,0
4,1,1267747200000,46.5,2.625,,,,,,211.350143,8.106,0


In [20]:
exogenous_fg_meta = fs.get_or_create_feature_group(
    name = "exogenous_fg",
    version = 1,
    primary_key = ['store', 'date'],
    description = "External features that influence sales, but are not under the control of the distribution chain",
    online_enabled = True,
    event_time = ['date']
)

exogenous_fg_meta.insert(exogenous_csv)

Feature Group created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fg/1077
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exogenous_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3b979c6100>, None)

---
## <span style="color:#ff5f27;"> 🔮 Append additional data </span>

You can add additional data to a feature group by calling the `insert` method. In the example below we simulate a new batch of data by shifting the dates of the existing rows forward (adding the epoch offset of 01/01/2000, roughly 30 years), and we append it to the existing `exogenous_fg`.

In [21]:
exogenous_fg_2000 = exogenous_csv.copy()
exogenous_fg_2000.date = exogenous_fg_2000.date + timestamp_2_time('01/01/2000')
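
To see what this shift does to a single value, the sketch below converts a shifted timestamp back to a calendar date. It assumes timestamps at midnight UTC; the helper names `epoch_millis_utc` and `millis_to_date` are introduced only for this illustration.

```python
from datetime import datetime, timezone

def epoch_millis_utc(value, fmt="%d/%m/%Y"):
    """Parse a date string and return milliseconds since the Unix epoch at midnight UTC."""
    dt = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

def millis_to_date(millis):
    """Convert epoch milliseconds back to a UTC calendar date."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).date()

original = epoch_millis_utc("05/02/2010")
offset = epoch_millis_utc("01/01/2000")  # milliseconds between 1970-01-01 and 2000-01-01
shifted = original + offset
print(millis_to_date(original), millis_to_date(shifted))
```

Adding the epoch value of a date moves every row forward by the time elapsed between 1970 and that date, so the shifted rows do not collide with the primary keys of the existing rows.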

In [22]:
exogenous_fg_meta = fs.get_or_create_feature_group(
    name = 'exogenous_fg',
    version = 1
)

exogenous_fg_meta.insert(exogenous_fg_2000)

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exogenous_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3b979c5c70>, None)

Inserting new data also recomputes the statistics. The new statistics are saved alongside the metadata with a new commit time.

---
## <span style="color:#ff5f27;"> 🔮 Append an additional feature </span>

Appending features to a feature group is a non-breaking schema change. Removing features, by contrast, requires creating a new version of the feature group.

You can append a feature to a feature group by specifying a data type and a default value for the new feature. The default value is necessary for the data that is already in the feature group.
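
Why the default value matters can be shown with a small plain-Python sketch: rows written before the schema change have no value for the new column, so a reader has to fill one in. The rows, the schema dictionary and the helper `read_with_schema` are hypothetical and do not reflect how HSFS stores data.

```python
# Hypothetical rows written before the new feature existed.
existing_rows = [
    {"store": 1, "cpi": 211.1},
    {"store": 2, "cpi": 210.8},
]

# A hypothetical row written after the schema change.
new_row = {"store": 3, "cpi": 212.0, "appended_feature": 42.0}

# Column name -> default used when a row has no value for that column.
schema_defaults = {"store": None, "cpi": None, "appended_feature": 10.0}

def read_with_schema(rows, defaults):
    """Return rows with every schema column present, filling gaps with the default."""
    return [{col: row.get(col, default) for col, default in defaults.items()} for row in rows]

result = read_with_schema(existing_rows + [new_row], schema_defaults)
print(result)
```

Old rows read back with the default, while rows inserted after the change keep their own value.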

In [23]:
from hsfs.feature import Feature

In [24]:
exogenous_fg_meta.append_features([Feature("appended_feature", type="double", default_value="10.0")])

2022-06-20 13:41:49,240 INFO: USE `basics_featurestore`
2022-06-20 13:41:50,163 INFO: SELECT `fg0`.`store` `store`, `fg0`.`date` `date`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`markdown1` `markdown1`, `fg0`.`markdown2` `markdown2`, `fg0`.`markdown3` `markdown3`, `fg0`.`markdown4` `markdown4`, `fg0`.`markdown5` `markdown5`, `fg0`.`cpi` `cpi`, `fg0`.`unemployment` `unemployment`, `fg0`.`is_holiday` `is_holiday`
FROM `basics_featurestore`.`exogenous_fg_1` `fg0`


<hsfs.feature_group.FeatureGroup at 0x7f3b979c68b0>

---
## <span style="color:#ff5f27;"> 🧬 Delete a feature group </span>

You can call the `delete` method on a feature group to delete the entire feature group.

In [25]:
exogenous_fg_meta = fs.get_or_create_feature_group(
    name = "exogenous_fg",
    version = 3,
    primary_key = ['store', 'date'],
    description = "External features that influence sales, but are not under the control of the distribution chain",
    time_travel_format = None,
    statistics_config = False
)

exogenous_fg_meta.insert(exogenous_csv)

Feature Group created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fg/1078
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exogenous_fg_3_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f3ba41b2850>, None)

In [26]:
exogenous_fg_meta = fs.get_or_create_feature_group(
    name = 'exogenous_fg',
    version = 3
)

exogenous_fg_meta.delete()

---

## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 </span>

In the following notebook we will explore feature groups.