# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 03: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the third part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on,
2. **Decide how the features should be preprocessed**,
3. **Create a Feature View**,
4. **Create a dataset split** for training and validation data.

![tutorial-flow](images/02_training-dataset.png) 

---
## <span style="color:#ff5f27;">🧑🏻‍🏫 HSFS Feature Views and Training Datasets </span>

`Feature Views` are the third building block of the Hopsworks Feature Store. A Feature View stores the metadata that describes our dataset, such as the query that selects the features and the label columns.

`Training datasets` are the fourth building block of the Hopsworks Feature Store. 

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecords, CSV, NumPy) and then be fed to a machine learning model for training.

Training datasets can also be stored on external storage systems like Amazon S3 or GCS to be read by external model training platforms.

As with the previous notebooks, the first step is to establish a connection with the Hopsworks feature store and get the feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">⚙️ Feature View Creation</span>

In the previous notebook ([feature_exploration](./feature_exploration.ipynb)) we walked through how to explore and query the Hopsworks feature store using HSFS. We can use the queries produced in the previous notebook to create a `Feature View`.

In [2]:
sales_fg = fs.get_or_create_feature_group(
    name = 'sales_fg',
    version = 1
)

exogenous_fg = fs.get_or_create_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = sales_fg.select_all()\
        .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))
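
To picture what the join in this query does, the sketch below performs a comparable operation on plain Python dictionaries. It is not HSFS code: the `inner_join` helper, the choice of `store` and `date` as join keys, and the inner-join behaviour are all assumptions made for illustration, and the rows are toy data.

```python
# Toy rows standing in for the two feature groups (illustrative values only).
sales_rows = [
    {"store": 1, "dept": 1, "date": 1265932800000, "weekly_sales": 46039.49},
    {"store": 1, "dept": 2, "date": 1265932800000, "weekly_sales": 100.0},
    {"store": 2, "dept": 1, "date": 1265932800000, "weekly_sales": 200.0},
]
exogenous_rows = [
    {"store": 1, "date": 1265932800000, "temperature": 38.51,
     "fuel_price": 2.548, "unemployment": 8.106, "cpi": 211.24217},
]


def inner_join(left, right, keys, right_columns):
    """Join two lists of dict rows on `keys`, keeping `right_columns` from the right side."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:  # inner join: drop left rows without a match
            joined.append({**row, **{c: match[c] for c in right_columns}})
    return joined


# Assumed join keys; only the three selected columns are carried over.
joined = inner_join(
    sales_rows,
    exogenous_rows,
    keys=["store", "date"],
    right_columns=["fuel_price", "unemployment", "cpi"],
)
for row in joined:
    print(row)
```

Only the rows for store 1 survive, and `temperature` is absent from the result because it was not in the selected column list.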

To create a Feature View, we can use the `FeatureStore.create_feature_view()` method.

In [3]:
feature_view = fs.create_feature_view(
    name = 'exodenous_sale',
    version = 1,
    labels = ['weekly_sales'],
    query = query
)

Feature view created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fv/exodenous_sale/version/1
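
The `labels` argument tells the Feature View which columns are prediction targets rather than inputs. As a rough illustration of that idea (plain Python, not the HSFS implementation; `split_features_and_labels` is a hypothetical helper and the rows are toy data), label columns can be separated from feature columns like this:

```python
def split_features_and_labels(rows, labels):
    """Split dict rows into feature rows (X) and label rows (y)."""
    label_set = set(labels)
    X = [{k: v for k, v in row.items() if k not in label_set} for row in rows]
    y = [{k: row[k] for k in labels} for row in rows]
    return X, y


# Toy rows with a handful of columns; weekly_sales plays the role of the label.
rows = [
    {"store": 1, "dept": 1, "fuel_price": 2.548, "weekly_sales": 46039.49},
    {"store": 1, "dept": 1, "fuel_price": 2.514, "weekly_sales": 41595.55},
]

X, y = split_features_and_labels(rows, labels=["weekly_sales"])
print(X[0])
print(y[0])
```

This mirrors the shape of what the retrieval methods return later in the notebook: one object with the features and one with the label column.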


In [4]:
feature_view

<hsfs.feature_view.FeatureView at 0x7fa053b52250>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [3]:
feature_view = fs.get_feature_view(
    name = 'exodenous_sale',
    version = 1
)

In [4]:
feature_view.version

1

In [1]:
#feature_view.get_feature_vector(entry = {'store':7,'dept':67,'date':1335484800000})

> To get a subset of the data, use `FeatureView.get_batch_data()`.

In [2]:
#df_batch = feature_view.get_batch_data()

In [9]:
#type(df_batch)

In [10]:
#df_batch.head()

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
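
The three roles listed above can be pictured as a random partition of the rows. The sketch below uses only the Python standard library and is not how Hopsworks performs the split internally. The function name and the `seed` argument are made up for illustration, while `val_size` and `test_size` follow the parameter names used later in this notebook.

```python
import random


def train_validation_test_split(rows, val_size, test_size, seed=0):
    """Randomly partition `rows` into train, validation and test lists."""
    if not 0 <= val_size + test_size < 1:
        raise ValueError("val_size + test_size must be in [0, 1)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything left over is training data
    return train, val, test


train, val, test = train_validation_test_split(range(100), val_size=0.2, test_size=0.1)
print(len(train), len(val), len(test))
```

Every row lands in exactly one split, so the test set stays unseen during training and hyperparameter tuning.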

To create a training dataset, we use the `FeatureView.create_training_data()` method.

Here are some important things to know:

- The training dataset inherits the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format with the **data_format** parameter.

- We can pass **start_time** and **end_time** to restrict the dataset to a specific time range.
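
To make the time-range filter concrete, here is a minimal sketch on plain dictionaries. It is an illustration only: `filter_by_event_time` is a hypothetical helper, timestamps are assumed to be Unix milliseconds, and treating both bounds as inclusive is an assumption rather than documented HSFS behaviour.

```python
def filter_by_event_time(rows, start_time=None, end_time=None, time_column="date"):
    """Keep rows whose timestamp lies within [start_time, end_time] (both optional)."""
    selected = []
    for row in rows:
        ts = row[time_column]
        if start_time is not None and ts < start_time:
            continue  # before the window
        if end_time is not None and ts > end_time:
            continue  # after the window
        selected.append(row)
    return selected


# Toy rows with millisecond timestamps.
rows = [
    {"date": 1296777600000, "fuel_price": 2.989},  # 2011-02-04 UTC
    {"date": 1326412800000, "fuel_price": 3.657},  # 2012-01-13 UTC
]

selected = filter_by_event_time(
    rows,
    start_time=1199145600000,  # 2008-01-01 00:00 UTC
    end_time=1325376000000,    # 2012-01-01 00:00 UTC
)
print(selected)
```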

In [11]:
feature_view.create_training_data(
    description = 'training_dataset',
    data_format = 'csv'
)

Training dataset job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exodenous_sale_1_1_create_fv_td_20062022143924/executions




(1, <hsfs.core.job.Job at 0x7fa053a4cd30>)

### <span style="color:#ff5f27;">🧑🏻‍🔬 Dataset with splits</span>

We can also create a dataset with **train and test** splits, or with **train, validation and test** splits.

Use `feature_view.create_train_test_split()` or `feature_view.create_train_validation_test_splits()` and specify `test_size` and, for the three-way split, `val_size`.

In [12]:
feature_view.create_train_test_split(
    test_size = 0.2
)

Training dataset job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exodenous_sale_1_2_create_fv_td_20062022144029/executions




(2, <hsfs.core.job.Job at 0x7fa0539d8220>)

In [13]:
feature_view.create_train_validation_test_splits(
    val_size = 0.2,
    test_size = 0.1
)

Training dataset job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exodenous_sale_1_3_create_fv_td_20062022144139/executions




(3, <hsfs.core.job.Job at 0x7fa0539e0d30>)

---

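## <span style="color:#ff5f27;"> 🪝 Training Dataset retrieval </span>

The training dataset version to retrieve is the first element of the tuple returned by the creation call. Version 2 below is the train/test split created above.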

In [14]:
X_train, y_train, X_test, y_test = feature_view.get_train_test_split(
    training_dataset_version = 2
)



In [15]:
X_train.head()

| | store | dept | date | is_holiday | sales_last_30_days_store_dep | sales_last_30_days_store | sales_last_90_days_store_dep | sales_last_90_days_store | sales_last_180_days_store_dep | sales_last_180_days_store | sales_last_365_days_store_dep | sales_last_365_days_store | fuel_price | unemployment | cpi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1265932800000 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.548 | 8.106 | 211.24217 |
| 1 | 1 | 1 | 1266537600000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.514 | 8.106 | 211.289143 |
| 2 | 1 | 1 | 1267142400000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.561 | 8.106 | 211.319643 |
| 3 | 1 | 1 | 1268352000000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.667 | 8.106 | 211.380643 |
| 4 | 1 | 1 | 1274400000000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.826 | 7.808 | 210.617093 |
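
The `date` values in this output appear to be Unix timestamps in milliseconds, which are hard to read at a glance. A small helper like the one below turns them into calendar dates. The name `epoch_ms_to_date` is made up for this sketch, and interpreting the values in UTC is an assumption.

```python
from datetime import datetime, timezone


def epoch_ms_to_date(ms):
    """Format a Unix timestamp in milliseconds as a 'YYYY-MM-DD' date in UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")


# A few of the date values from the output above.
for ms in (1265932800000, 1266537600000, 1274400000000):
    print(ms, "->", epoch_ms_to_date(ms))
```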


In [16]:
y_train.head()

| | weekly_sales |
|---|---|
| 0 | 46039.49 |
| 1 | 41595.55 |
| 2 | 19403.54 |
| 3 | 21043.39 |
| 4 | 14773.04 |


---

## <span style="color:#ff5f27;">🔮 Creating Training Datasets with Event Time filter </span>

First, let's import **datetime** from the datetime library and define a helper that converts a date string into a timestamp in milliseconds.

Then we can define the start_time and end_time points.

Finally, we can create a training dataset that contains only data within these time boundaries. 

In [17]:
from datetime import datetime

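def timestamp_2_time(x):
    """Convert a 'YYYY-MM-DD' date string to a Unix timestamp in milliseconds."""
    # strptime returns a naive datetime, which .timestamp() interprets in local time
    dt_obj = datetime.strptime(x, '%Y-%m-%d')
    # .timestamp() is in seconds, so scale to milliseconds
    dt_obj = dt_obj.timestamp() * 1000
    return int(dt_obj)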
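
Because a naive `datetime` is interpreted in the local time zone, this helper can return different values on machines in different time zones. If the event times in the feature group are meant to be UTC (an assumption, not something this notebook states), a variant that pins the time zone gives reproducible bounds. The name `date_to_epoch_ms_utc` is hypothetical.

```python
from datetime import datetime, timezone


def date_to_epoch_ms_utc(date_string, fmt="%Y-%m-%d"):
    """Convert a date string to a Unix timestamp in milliseconds, read as UTC midnight."""
    dt = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


start_time_utc = date_to_epoch_ms_utc("2008-01-01")
end_time_utc = date_to_epoch_ms_utc("2012-01-01")
print(start_time_utc, end_time_utc)
```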

In [18]:
start_time = timestamp_2_time('2008-01-01')
end_time = timestamp_2_time('2012-01-01')

In [19]:
exogenous_fg = fs.get_or_create_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = exogenous_fg.select(['date','temperature','fuel_price'])

In [20]:
query.show(5)

2022-06-20 14:43:04,776 INFO: USE `basics_featurestore`
2022-06-20 14:43:05,677 INFO: SELECT `fg0`.`date` `date`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`
FROM `basics_featurestore`.`exogenous_fg_1` `fg0`


| | date | temperature | fuel_price |
|---|---|---|---|
| 0 | 1326412800000 | 48.07 | 3.657 |
| 1 | 1296777600000 | 36.33 | 2.989 |
| 2 | 1312502400000 | 86.09 | 3.662 |
| 3 | 1310688000000 | 91.05 | 3.575 |
| 4 | 1330646400000 | 52.27 | 4.178 |


In [21]:
exogenous_fv = fs.create_feature_view(
    name = 'exogenous_fg_2008_2012',
    version = 1,
    labels = ['fuel_price'],
    query = query
)

Feature view created successfully, explore it at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/fs/1095/fv/exogenous_fg_2008_2012/version/1


In [22]:
exogenous_fv.create_training_data(
    data_format = 'csv',
    start_time = start_time,
    end_time = end_time
)

Training dataset job started successfully, you can follow the progress at https://e457b610-ee84-11ec-826a-b9f03258546b.cloud.hopsworks.ai/p/1147/jobs/named/exogenous_fg_2008_2012_1_1_create_fv_td_20062022144309/executions




(1, <hsfs.core.job.Job at 0x7fa0357d56d0>)

---
## <span style="color:#ff5f27;"> 🪝 Training Dataset retrieval </span>

To retrieve a training dataset from the Feature Store, we can use the `get_training_data()` method. 

If no version is provided, a new one will be created.
If a version is provided and it exists, the training dataset is retrieved and returned as DataFrames (features and labels).

In [26]:
X_train_lim, y_train_lim = exogenous_fv.get_training_data(
    training_dataset_version = 1
)

In [27]:
X_train_lim.head()

| | date | temperature |
|---|---|---|
| 0 | 1296777600000 | 36.33 |
| 1 | 1312502400000 | 86.09 |
| 2 | 1310688000000 | 91.05 |
| 3 | 1306454400000 | 77.72 |
| 4 | 1287100800000 | 71.57 |


In [28]:
y_train_lim.head()

| | fuel_price |
|---|---|
| 0 | 2.989 |
| 1 | 3.662 |
| 2 | 3.575 |
| 3 | 3.786 |
| 4 | 2.72 |


---