 # Feature Store Tour - Python API
 
This notebook contains a tour/reference for the Hopsworks feature store Python API on Amazon SageMaker. The notebook is meant to be run on Amazon SageMaker after setting up Hopsworks to work with AWS [(see here)](https://hopsworks.readthedocs.io/en/latest/user_guide/hopsworks/featurestore.html#connecting-from-amazon-sagemaker).

The notebook is designed to be used in combination with the Feature Store Tour on Hopsworks. It assumes that you have run the following feature engineering job: [job](https://github.com/logicalclocks/hops-examples/tree/master/featurestore_tour) (**the job is added automatically when you start the feature store tour in Hopsworks. You can run the job by going to the 'Jobs' tab to the left in the Hopsworks project home page**).

This job will produce the following model of feature groups in your project's feature store:

![Feature Store Model](../images/model.png "Feature Store Model")

In this notebook we will run queries over this feature store model. We will also create new feature groups and training datasets.

We will go from (1) features to (2) training datasets to (3) a trained model.

## Installation

To access the feature store from SageMaker the hopsworks-cloud-sdk needs to be installed:

In [None]:
!pip install hopsworks-cloud-sdk

## Imports

In [3]:
# Write the Hopsworks API key to a local file that connect() reads later on.
with open("key.txt", "w") as f:
    f.write("ALdTp4ClHxMdFs5R.p3KdVzLMp6VUjoAPwhukTwtZ8hIDj7mQ1ZqdmGf1VcrgiEx4CWa0egupbIb7h24a")

In [4]:
from hops import featurestore

## Connecting

This assumes that Hopsworks and AWS were already configured correctly. See [AWS SageMaker Integration](https://docs.hopsworks.ai/integrations/sagemaker/).

In [5]:
featurestore.connect('50d40db0-27f4-11eb-a729-b3a0388357f2.cloud.hopsworks.ai', 'demo_fs_meb10179', secrets_store = 'local', api_key_file = 'key.txt', hostname_verification=False)

## Get The Name of The Project's Feature Store

Each project with the feature store service enabled automatically gets its own feature store created. This feature store is only accessible within the project unless you decide to share it with other projects. The name of the feature store is `<project_name>_featurestore`, and you can get the name with the API method `project_featurestore()`. 

In [6]:
featurestore.project_featurestore()

'demo_fs_meb10179_featurestore'

## Get a List of All Feature Stores Accessible in the Current Project 

Feature Stores can be shared across projects in a multi-tenant manner, just like any Hopsworks-dataset can. You can read more about sharing datasets at [hops.io](hops.io), but in essence to share a dataset you just have to right click on it in your project. The feature groups in the feature store are located in a dataset called `<project_name>_featurestore.db` in your project.

![Share Feature Store](./images/share_featurestore.png "Share Feature Store")

In [7]:
featurestore.get_project_featurestores()

['demo_fs_meb10179_featurestore']
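
The two naming conventions mentioned above (`<project_name>_featurestore` for the feature store and `<project_name>_featurestore.db` for the dataset that holds the feature groups) can be written down as plain string helpers. This is only an illustrative sketch: `featurestore_name` and `featurestore_dataset_name` are hypothetical helpers, not part of the hops library. In practice `featurestore.project_featurestore()` returns the name for you.

```python
def featurestore_name(project_name: str) -> str:
    """Name of a project's own feature store: <project_name>_featurestore."""
    return f"{project_name}_featurestore"


def featurestore_dataset_name(project_name: str) -> str:
    """Name of the dataset holding the feature groups: <project_name>_featurestore.db."""
    return f"{featurestore_name(project_name)}.db"


project = "demo_fs_meb10179"  # project name used in this notebook
print(featurestore_name(project))
print(featurestore_dataset_name(project))
```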

## Querying The Feature Store

The feature store can be queried programmatically and with raw SQL. When you query the feature store programmatically, the library will infer how to fetch the different features using a **query planner**. 

![Feature Store Query Planner](../images/query_optimizer.png "Feature Store Query Planner")

When interacting with the feature store it is sufficient to be familiar with three concepts:

- The **feature**: this refers to an individual versioned and documented feature in the feature store, e.g. the age of a person.
- The **feature group**: this refers to a documented and versioned group of features stored as a Hive table that is linked to a specific Spark/Numpy/Pandas job that takes in raw data and outputs the computed features.
- The **training dataset**: this refers to a versioned and managed dataset of features, stored in HopsFS as tfrecords, .csv, .tsv, or parquet.

A feature group contains a group of features and a training dataset contains a set of features, potentially from many different feature groups.

![Feature Store Concepts](../images/concepts.png "Feature Store Contents")

When you query the feature store with this SDK you always get the results back as a pandas dataframe, so the data is held in memory. If you need a numpy array instead, you can convert the dataframe with one line of code.
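
Converting a pandas dataframe into a numpy array takes a single call to `DataFrame.to_numpy()`. The sketch below assumes pandas is installed and uses a hand-built dataframe as a stand-in for a query result; its rows mirror the first rows of the `teams_features` feature group shown in this notebook.

```python
import pandas as pd

# Stand-in for a query result; in the notebook this would come from
# a call such as featurestore.get_featuregroup("teams_features").
df = pd.DataFrame(
    {
        "team_budget": [12957.076, 2403.3704, 3390.3755, 13547.429, 9678.333],
        "team_id": [1, 2, 3, 4, 5],
        "team_position": [1, 2, 3, 4, 5],
    }
)

# One line converts the dataframe into a numpy array (rows x columns).
matrix = df.to_numpy()
print(matrix.shape)
```

Because the columns mix floats and integers, numpy stores the whole array with a common floating-point dtype.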

### Fetch an Individual Feature

When retrieving a single feature from the featurestore, the hops-util-py library will infer which feature group the feature belongs to by querying the metastore, but you can also explicitly specify which featuregroup and version to query.

If there are multiple features of the same name in the featurestore, you must specify enough information to uniquely identify the feature (e.g. the feature group and version). If no featurestore is provided it will default to the project's featurestore.

To read an individual feature, use the method `get_feature(feature_name)`.

Without specifying the feature store, feature group and version, the library will infer it:

In [8]:
featurestore.get_feature("team_budget").head(5)

Logical query plan for getting 1 feature from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT team_budget FROM teams_features_1 against the offline feature store


Unnamed: 0,team_budget
0,12957.076
1,2403.3704
2,3390.3755
3,13547.429
4,9678.333


You can also explicitly specify the feature store, the feature group, and the version:

In [9]:
featurestore.get_feature(
    "team_budget", 
    featurestore=featurestore.project_featurestore(), 
    featuregroup="teams_features", 
    featuregroup_version = 1
).head(5)

Logical query plan for getting 1 feature from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT team_budget FROM teams_features_1 against the offline feature store


Unnamed: 0,team_budget
0,12957.076
1,2403.3704
2,3390.3755
3,13547.429
4,9678.333


### Fetch an Entire Feature Group

You can get an entire featuregroup from the API. If no feature store is provided the API will default to the project's feature store, and if no version is provided it will default to version 1 of the feature group. The result is returned as a pandas dataframe.

In [10]:
featurestore.get_featuregroup("teams_features").head(5)

SQL string for the query created successfully
Running sql: SELECT * FROM teams_features_1 against the offline feature store


Unnamed: 0,team_budget,team_id,team_position
0,12957.076,1,1
1,2403.3704,2,2
2,3390.3755,3,3
3,13547.429,4,4
4,9678.333,5,5


The default parameters can be overridden:

In [11]:
featurestore.get_featuregroup(
    "teams_features", 
    featurestore=featurestore.project_featurestore(), 
    featuregroup_version = 1
).head(5)

SQL string for the query created successfully
Running sql: SELECT * FROM teams_features_1 against the offline feature store


Unnamed: 0,team_budget,team_id,team_position
0,12957.076,1,1
1,2403.3704,2,2
2,3390.3755,3,3
3,13547.429,4,4
4,9678.333,5,5


### Fetch A Set of Features

When retrieving a list of features from the featurestore, the hops-util-py library will infer which featuregroup each feature belongs to by querying the metastore. If the features reside in different featuregroups, the library will also try to infer how to join the features together based on common columns. If the JOIN query cannot be inferred, due to the existence of multiple features with the same name or a non-obvious JOIN query, the user needs to supply enough information in the API call to be able to query the featurestore. If the user already knows the JOIN query, they can also run `featurestore.sql(joinQuery)` directly (an example of this is shown further down in this notebook). If no featurestore is provided the API will default to the project's featurestore.

Example of querying the feature store for a list of features without specifying the feature groups and feature store:

In [12]:
featurestore.get_features(
    ["team_budget", "average_attendance", "average_player_age"]
).head(5)

Logical query plan for getting 3 features from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT team_budget, average_player_age, average_attendance FROM teams_features_1 JOIN players_features_1 JOIN attendances_features_1 ON teams_features_1.`team_id`=players_features_1.`team_id` AND teams_features_1.`team_id`=attendances_features_1.`team_id` against the offline feature store


Unnamed: 0,team_budget,average_player_age,average_attendance
0,16758.066,25.65,3271.934
1,9290.638,25.67,2701.0522
2,4134.0903,26.18,2823.996
3,6907.2817,24.34,3473.2007
4,3839.0754,25.63,3397.8066


We can also explicitly specify the feature groups where the features reside. The feature groups and versions can be specified either by prepending feature names with `<feature group name>_<feature group version>.`, or by providing a dict with entries of `<feature group name> -> <feature group version>`:

In [13]:
featurestore.get_features(
    ["teams_features_1.team_budget", 
     "attendances_features_1.average_attendance", 
     "players_features_1.average_player_age"]
).head(5)

Logical query plan for getting 3 features from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT teams_features_1.team_budget, attendances_features_1.average_attendance, players_features_1.average_player_age FROM teams_features_1 JOIN attendances_features_1 JOIN players_features_1 ON teams_features_1.`team_id`=attendances_features_1.`team_id` AND teams_features_1.`team_id`=players_features_1.`team_id` against the offline feature store


Unnamed: 0,team_budget,average_attendance,average_player_age
0,12957.076,92301.086,25.88
1,4134.0903,2823.996,26.18
2,12514.562,3587.5015,24.63
3,6907.2817,3473.2007,24.34
4,11169.979,1940.3131,25.75


In [14]:
featurestore.get_features(
    ["team_budget", "average_attendance", "average_player_age"],
    featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
        "teams_features": 1, 
        "attendances_features": 1,
        "players_features": 1
    }
).head(5)

Logical query plan for getting 3 features from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT team_budget, average_player_age, average_attendance FROM teams_features_1 JOIN players_features_1 JOIN attendances_features_1 ON teams_features_1.`team_id`=players_features_1.`team_id` AND teams_features_1.`team_id`=attendances_features_1.`team_id` against the offline feature store


Unnamed: 0,team_budget,average_player_age,average_attendance
0,12957.076,25.88,92301.086
1,4134.0903,26.18,2823.996
2,12514.562,24.63,3587.5015
3,15072.062,25.35,1995.5691
4,7326.092,25.45,6462.462
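
The two ways of naming feature groups used above carry the same information. As a rough illustration, the `<feature group name>_<feature group version>.<feature name>` form can be taken apart with plain string operations and turned into the dict form. `parse_qualified_feature` and `FeatureRef` are hypothetical names for this sketch, not part of the hops library, and the sketch assumes the version is the integer after the last underscore of the table name.

```python
from typing import NamedTuple, Optional


class FeatureRef(NamedTuple):
    featuregroup: Optional[str]
    version: Optional[int]
    feature: str


def parse_qualified_feature(name: str) -> FeatureRef:
    """Split 'teams_features_1.team_budget' into group, version and feature."""
    if "." not in name:
        # Unqualified name: the feature group has to be inferred elsewhere.
        return FeatureRef(None, None, name)
    table, feature = name.split(".", 1)
    featuregroup, version = table.rsplit("_", 1)
    return FeatureRef(featuregroup, int(version), feature)


refs = [
    parse_qualified_feature(n)
    for n in [
        "teams_features_1.team_budget",
        "attendances_features_1.average_attendance",
        "players_features_1.average_player_age",
    ]
]
# Equivalent dict form: feature group name -> version.
featuregroups_version_dict = {r.featuregroup: r.version for r in refs}
print(featuregroups_version_dict)
```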


If you have a lot of name collisions and it is not obvious how to infer the JOIN query to get the features from the feature store, you can explicitly specify the argument `join_key` to the API (or you can provide the entire SQL query using the API method `.sql`, as we will demonstrate later on in the notebook):

In [15]:
featurestore.get_features(
    ["team_budget", "average_attendance", "average_player_age"],
    featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
        "teams_features": 1, 
        "attendances_features": 1,
        "players_features": 1
    },
    join_key = "team_id"
).head(5)

Logical query plan for getting 3 features from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT team_budget, average_player_age, average_attendance FROM teams_features_1 JOIN attendances_features_1 JOIN players_features_1 ON teams_features_1.`team_id`=attendances_features_1.`team_id` AND teams_features_1.`team_id`=players_features_1.`team_id` against the offline feature store


Unnamed: 0,team_budget,average_player_age,average_attendance
0,14580.948,25.67,2695.4463
1,9290.638,25.67,2701.0522
2,12957.076,25.88,92301.086
3,6907.2817,24.34,3473.2007
4,1621.1936,26.01,7118.376
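
To make the role of `join_key` concrete, the following sketch assembles a JOIN query of the same shape as the one logged above from a feature list, a version dict and a join key. `build_join_query` is a hypothetical helper written for this illustration; the query planner inside the library may well work differently.

```python
from typing import Dict, List


def build_join_query(
    features: List[str],
    featuregroups_version_dict: Dict[str, int],
    join_key: str,
) -> str:
    """Build a JOIN query over versioned feature group tables."""
    # Table names follow the <featuregroup>_<version> convention.
    tables = [
        f"{name}_{version}" for name, version in featuregroups_version_dict.items()
    ]
    first, rest = tables[0], tables[1:]
    # Every other table is joined to the first one on the shared key.
    conditions = [f"{first}.`{join_key}`={t}.`{join_key}`" for t in rest]
    sql = f"SELECT {', '.join(features)} FROM {' JOIN '.join(tables)}"
    if conditions:
        sql += " ON " + " AND ".join(conditions)
    return sql


query = build_join_query(
    ["team_budget", "average_player_age", "average_attendance"],
    {"teams_features": 1, "attendances_features": 1, "players_features": 1},
    join_key="team_id",
)
print(query)
```

A string built this way could also be passed to `featurestore.sql()` if you prefer to control the query yourself.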


#### Advanced Examples of Fetching Sets of Features and Common Pitfalls

Getting 10 features from 3 different feature groups:

In [18]:
featurestore.get_features(
    ["team_budget", "average_attendance", "average_player_age",
    "team_position", "sum_attendance", 
     "average_player_rating", "average_player_worth", "sum_player_age",
     "sum_player_rating", "sum_player_worth"
    ]
).head(5)

Logical query plan for getting 10 features from the featurestore created successfully
SQL string for the query created successfully
Running sql: SELECT team_budget, average_player_rating, sum_attendance, average_player_worth, sum_player_worth, average_attendance, team_position, sum_player_age, average_player_age, sum_player_rating FROM teams_features_1 JOIN players_features_1 JOIN attendances_features_1 ON teams_features_1.`team_id`=players_features_1.`team_id` AND teams_features_1.`team_id`=attendances_features_1.`team_id` against the offline feature store


Unnamed: 0,team_budget,average_player_rating,sum_attendance,average_player_worth,sum_player_worth,average_attendance,team_position,sum_player_age,average_player_age,sum_player_rating
0,9290.638,227.61397,54021.043,184.85991,18485.99,2701.0522,37,2567.0,25.67,22761.396
1,4134.0903,181.49428,56479.92,179.36293,17936.293,2823.996,43,2618.0,26.18,18149.428
2,6907.2817,262.44653,69464.016,252.60298,25260.299,3473.2007,30,2434.0,24.34,26244.654
3,1621.1936,467.7938,142367.52,490.94702,49094.703,7118.376,17,2601.0,26.01,46779.38
4,11169.979,178.7692,38806.26,156.81357,15681.356,1940.3131,46,2575.0,25.75,17876.92


### Create a training dataset from the Feature Store

The feature store has an abstraction of a **training dataset**, which is a dataset with a set of features (potentially from many different feature groups) and labels (in case of supervised learning).

When you train a machine learning model, you want to use all features that have predictive power and that the model can learn from. At this point, we can create a training dataset of features from several different feature groups and use that for training. That is the purpose of the training dataset abstraction.

Of course you can always just save a group of features anywhere inside your project, e.g. as a .csv or .tfrecords file. However, by using the feature store you can create managed training datasets. Managed training datasets will show up in the feature registry UI and will automatically be versioned, documented and reproducible.

Let's create a dataset called *team_position_prediction* by using the previous set of 10 features from the featurestore. We will combine features from three different feature groups to form this training dataset:

- teams_features
- attendances_features
- players_features

In [19]:
feature_list = ["team_budget", "average_attendance", "average_player_age",
    "team_position", "sum_attendance", 
     "average_player_rating", "average_player_worth", "sum_player_age",
     "sum_player_rating", "sum_player_worth",
    ]

Now we can create a training dataset from the list of features, with some extended metadata such as the schema (automatically inferred). By default, when you create a training dataset it will be in "tfrecords" format and statistics will be computed for all features. After the dataset has been created you can view and/or update the metadata about the training dataset from the Hopsworks featurestore UI.

First you should check if a training dataset with the same name has been created before:

In [21]:
latest_version = featurestore.get_latest_training_dataset_version("team_position_prediction")
print(latest_version)

1


Now you can use the `featurestore.create_training_dataset()` API to create and launch a job which will create your training dataset. You can either pass in a list of feature names from different feature groups to be joined on `join_key`, or a plain SQL query string for more complex training datasets. The `join_key` is optional and will be inferred from the primary key of the feature groups if it is not provided. The job will have the same name as your training dataset, in case you want to rerun the creation from the Hopsworks UI. Good practice is to increase the version by 1, but you can also decide to overwrite an existing version by setting the `overwrite` argument to `True`.

By default the dataset is written as "tfrecords" to HopsFS, but you can specify an alternative `sink` by passing your storage connector name. Please note that the storage connector has to be created in the Hopsworks featurestore UI beforehand.

The `featurestore.create_training_dataset()` API offers additional parameters to modify its behaviour. For the full range of possible arguments please refer to the docs. Most arguments will be familiar to you from the `featurestore.get_features()` API.

In [22]:
featurestore.create_training_dataset(
    training_dataset = "team_position_prediction",
    features = feature_list,
    training_dataset_version = latest_version + 1,
    overwrite=True
)

Training Dataset job successfully started


If you want to utilize the SQL query functionality, you have to know that you can query a feature group by assembling its name with the desired version: **[featuregroup]_[version]**.

In the following example we create a training dataset by selecting all features from the 'games_features' feature group with version 1.

In [23]:
latest_version = featurestore.get_latest_training_dataset_version("games_features_all")
print(latest_version)

0


In [24]:
featurestore.create_training_dataset(
    training_dataset = "games_features_all",
    sql_query = "SELECT * FROM games_features_1",
    training_dataset_version = latest_version + 1,
    overwrite=True
)

Training Dataset job successfully started
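
The "latest version plus one" pattern used in the cells above can be illustrated without a feature store. How the library determines the latest version is not shown in this notebook; the sketch below only mimics the arithmetic on names of the form `<name>_<version>`. `latest_version` is a hypothetical helper, and returning 0 for an unknown dataset is an assumption that mirrors the `0` printed above for `games_features_all` before it was created.

```python
from typing import Iterable


def latest_version(training_datasets: Iterable[str], name: str) -> int:
    """Highest version of `name` among '<name>_<version>' entries, or 0 if none."""
    versions = []
    for entry in training_datasets:
        base, _, version = entry.rpartition("_")
        if base == name and version.isdigit():
            versions.append(int(version))
    return max(versions, default=0)


# Names in the form listed by get_training_datasets() in this notebook.
existing = [
    "team_position_prediction_2",
    "tour_training_dataset_test_1",
    "team_position_prediction_1",
    "games_features_all_1",
]
next_version = latest_version(existing, "team_position_prediction") + 1
print(next_version)
```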


### Free Text SQL Query from the Feature Store

For complex queries that cannot be inferred by the helper functions, pass the SQL directly to the method `featurestore.sql()`. It defaults to the project-specific feature store, but you can also specify the feature store explicitly. If you are proficient in SQL, this is the most efficient and preferred way to query the feature store.

Without specifying the feature store the query will by default be run against the project's feature store:

In [25]:
featurestore.sql("SELECT * FROM teams_features_1 WHERE team_position < 5").head(5)

Running sql: SELECT * FROM teams_features_1 WHERE team_position < 5 against the offline feature store


Unnamed: 0,team_budget,team_id,team_position
0,12957.076,1,1
1,2403.3704,2,2
2,3390.3755,3,3
3,13547.429,4,4


You can also specify the featurestore to query explicitly:

In [26]:
featurestore.sql("SELECT * FROM teams_features_1 WHERE team_position < 5",
                featurestore=featurestore.project_featurestore()).head(5)

Running sql: SELECT * FROM teams_features_1 WHERE team_position < 5 against the offline feature store


Unnamed: 0,team_budget,team_id,team_position
0,12957.076,1,1
1,2403.3704,2,2
2,3390.3755,3,3
3,13547.429,4,4
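
If you want to try out a query before running it against the feature store, the same SQL can be tested locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in; the offline feature store is Hive, not SQLite, so only simple queries like this one carry over unchanged. The rows are copied from the `teams_features` output shown earlier in this notebook.

```python
import sqlite3

# In-memory stand-in for the offline feature store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE teams_features_1 "
    "(team_budget REAL, team_id INTEGER, team_position INTEGER)"
)
conn.executemany(
    "INSERT INTO teams_features_1 VALUES (?, ?, ?)",
    [
        (12957.076, 1, 1),
        (2403.3704, 2, 2),
        (3390.3755, 3, 3),
        (13547.429, 4, 4),
        (9678.333, 5, 5),
    ],
)

# Same query string as passed to featurestore.sql() above.
rows = conn.execute(
    "SELECT * FROM teams_features_1 WHERE team_position < 5"
).fetchall()
for row in rows:
    print(row)
conn.close()
```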


## Training Datasets

To group data in the feature store we use three concepts:

- Feature
- Feature group
- Training Dataset

Typically, during the feature engineering phase of a machine learning project, you compute a set of features for each type of data that you have. These features are naturally grouped into a documented and versioned **feature group**.

In practice, it is common that organizations have many different types of datasets that they can extract features from. For example, if you are building a recommendation system you might have demographic data about each user as well as user-activity data.

When you train a machine learning model, you want to use all features that have predictive power and that the model can learn from. At this point, we can create a training dataset of features from several different feature groups and use that for training. That is the purpose of the training dataset abstraction.

Of course you can always just save a group of features anywhere inside your project, e.g. as a .csv or .tfrecords file. However, by using the feature store you can create **managed** training datasets. Managed training datasets will show up in the feature registry UI and will automatically be versioned, documented and reproducible.

Once a training dataset has been created you can find it in the featurestore UI in Hopsworks under the tab `Training datasets`. From there you can also edit the metadata if necessary.

### Get Training Dataset Path

After a **managed dataset** has been created, it is easy to share it and re-use it for training various models. For example, if the dataset has been materialized in tfrecords format you can call the method `get_training_dataset_path(training_dataset)` to get the HopsFS path and read it directly in your TensorFlow code.

In [27]:
featurestore.get_training_dataset_path("tour_training_dataset_test")

'hopsfs://10.0.0.247:8020/Projects/demo_fs_meb10179/demo_fs_meb10179_Training_Datasets/tour_training_dataset_test_1/tour_training_dataset_test'

By default the library will look for the training dataset in the project's featurestore and use version 1, but this can be overridden if required:

In [28]:
featurestore.get_training_dataset_path(
    "tour_training_dataset_test", 
    featurestore=featurestore.project_featurestore(),
    training_dataset_version=featurestore.get_latest_training_dataset_version("tour_training_dataset_test")
)

'hopsfs://10.0.0.247:8020/Projects/demo_fs_meb10179/demo_fs_meb10179_Training_Datasets/tour_training_dataset_test_1/tour_training_dataset_test'
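
The returned path encodes the project, the training dataset name and its version. As a small illustration, it can be split with the standard library. The directory layout assumed in the comments is only what can be read off the path printed above, not a documented guarantee.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Path as returned by get_training_dataset_path() above.
path = (
    "hopsfs://10.0.0.247:8020/Projects/demo_fs_meb10179/"
    "demo_fs_meb10179_Training_Datasets/tour_training_dataset_test_1/"
    "tour_training_dataset_test"
)

parts = urlsplit(path)
segments = PurePosixPath(parts.path).parts
# Layout observed in the path above (an assumption, not a specification):
# /Projects/<project>/<project>_Training_Datasets/<name>_<version>/<name>
project = segments[2]
versioned_dir = segments[4]
name, _, version = versioned_dir.rpartition("_")

print(parts.scheme, parts.hostname, parts.port)
print(project, name, int(version))
```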

## Get Featurestore Metadata
To explore the contents of the featurestore we recommend using the featurestore page in the Hopsworks UI, but you can also get the metadata programmatically from the REST API.

### Update Metadata Cache

In [29]:
featurestore.get_featurestore_metadata(update_cache=True)

<hops.featurestore_impl.dao.common.featurestore_metadata.FeaturestoreMetadata at 0x7fd2137bdef0>

### List all Feature Stores Accessible In the Project

In [30]:
featurestore.get_project_featurestores()

['demo_fs_meb10179_featurestore']

### List all Feature Groups in a Feature Store

In [31]:
featurestore.get_featuregroups()

['teams_features_1',
 'games_features_hudi_tour_1',
 'players_features_1',
 'attendances_features_1',
 'season_scores_features_1',
 'games_features_1',
 'season_features_on_demand_1']

By default `get_featuregroups()` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [32]:
featurestore.get_featuregroups(featurestore=featurestore.project_featurestore())

['teams_features_1',
 'games_features_hudi_tour_1',
 'players_features_1',
 'attendances_features_1',
 'season_scores_features_1',
 'games_features_1',
 'season_features_on_demand_1']

### List all Features in a Feature Store

In [33]:
featurestore.get_features_list()

['team_budget',
 'team_id',
 'team_position',
 'away_team_id',
 '_hoodie_record_key',
 '_hoodie_file_name',
 'home_team_id',
 '_hoodie_partition_path',
 '_hoodie_commit_time',
 '_hoodie_commit_seqno',
 'score',
 'team_id',
 'average_player_rating',
 'sum_player_rating',
 'average_player_age',
 'average_player_worth',
 'sum_player_age',
 'sum_player_worth',
 'team_id',
 'average_attendance',
 'sum_attendance',
 'team_id',
 'average_position',
 'sum_position',
 'away_team_id',
 'home_team_id',
 'score',
 'team_id',
 'average_position',
 'sum_position']

By default `get_features_list()` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [34]:
featurestore.get_features_list(featurestore=featurestore.project_featurestore())

['team_budget',
 'team_id',
 'team_position',
 'home_team_id',
 '_hoodie_partition_path',
 '_hoodie_commit_time',
 '_hoodie_commit_seqno',
 'away_team_id',
 '_hoodie_record_key',
 '_hoodie_file_name',
 'score',
 'team_id',
 'average_player_rating',
 'sum_player_rating',
 'average_player_age',
 'average_player_worth',
 'sum_player_age',
 'sum_player_worth',
 'team_id',
 'average_attendance',
 'sum_attendance',
 'team_id',
 'average_position',
 'sum_position',
 'away_team_id',
 'home_team_id',
 'score',
 'team_id',
 'average_position',
 'sum_position']
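
The list above contains several names more than once, because the same feature name exists in more than one feature group. Those are the names for which the feature group may need to be specified explicitly when fetching features. A quick way to spot them is to count occurrences, sketched here with the list returned by the first `get_features_list()` call copied in by hand:

```python
from collections import Counter

# Output of featurestore.get_features_list() as shown above.
features = [
    "team_budget", "team_id", "team_position", "away_team_id",
    "_hoodie_record_key", "_hoodie_file_name", "home_team_id",
    "_hoodie_partition_path", "_hoodie_commit_time", "_hoodie_commit_seqno",
    "score", "team_id", "average_player_rating", "sum_player_rating",
    "average_player_age", "average_player_worth", "sum_player_age",
    "sum_player_worth", "team_id", "average_attendance", "sum_attendance",
    "team_id", "average_position", "sum_position", "away_team_id",
    "home_team_id", "score", "team_id", "average_position", "sum_position",
]

counts = Counter(features)
# Names that occur more than once are ambiguous without a feature group.
ambiguous = sorted(name for name, n in counts.items() if n > 1)
print(ambiguous)
```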

### List all Training Datasets in a Feature Store

In [35]:
featurestore.get_training_datasets()

['team_position_prediction_2',
 'tour_training_dataset_test_1',
 'team_position_prediction_1',
 'games_features_all_1']

By default `get_training_datasets()` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [36]:
featurestore.get_training_datasets(featurestore=featurestore.project_featurestore())

['team_position_prediction_2',
 'tour_training_dataset_test_1',
 'team_position_prediction_1',
 'games_features_all_1']

### List all Storage Connectors in a Feature Store

In [37]:
featurestore.get_storage_connectors()

[('demo_fs_meb10179', 'JDBC'),
 ('demo_fs_meb10179_featurestore', 'JDBC'),
 ('demo_fs_meb10179_meb10179_onlinefeaturestore', 'JDBC'),
 ('demo_fs_meb10179_Training_Datasets', 'HOPSFS')]

By default `get_storage_connectors()` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [38]:
featurestore.get_storage_connectors(featurestore=featurestore.project_featurestore())

[('demo_fs_meb10179', 'JDBC'),
 ('demo_fs_meb10179_featurestore', 'JDBC'),
 ('demo_fs_meb10179_meb10179_onlinefeaturestore', 'JDBC'),
 ('demo_fs_meb10179_Training_Datasets', 'HOPSFS')]
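
Since the connectors come back as `(name, type)` pairs, grouping them by type is a convenient way to find, for example, a connector name to pass as the `sink` of a training dataset. A minimal sketch, with the output above copied in by hand:

```python
from collections import defaultdict

# (name, type) pairs as returned by get_storage_connectors() above.
connectors = [
    ("demo_fs_meb10179", "JDBC"),
    ("demo_fs_meb10179_featurestore", "JDBC"),
    ("demo_fs_meb10179_meb10179_onlinefeaturestore", "JDBC"),
    ("demo_fs_meb10179_Training_Datasets", "HOPSFS"),
]

# Group connector names by their type.
by_type = defaultdict(list)
for name, connector_type in connectors:
    by_type[connector_type].append(name)

print(by_type["HOPSFS"])
```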