---
title: "Colab Hopsworks Feature Store Tour"
date: 2021-02-24
type: technical_note
draft: false
---

## Prerequisites

### Step 1: Register an account on [hopsworks.ai](https://hopsworks.ai)
Click on the "Demo" button to access a demo cluster. 
Copy the URL to the cluster in the form "[UUID].cloud.hopsworks.ai". You will need it to connect to Hopsworks later.

### Step 2: Open the Demo Cluster and run the "Feature Store Tour"
Note the "project-name" that is created when you run the Feature Store Tour. You will need it to connect to Hopsworks later.

### Step 3: Configure a Hopsworks API Key
You need to set up a Feature Store API key for authentication.
In Hopsworks:
(1) Click on your username in the top-right corner and select Settings to open the user settings, then select API keys.
(2) Give the key a name and select the job, featurestore, dataset.create and project scopes.
(3) Create the key.

Copy the key into your clipboard for the next step.

In [None]:
!pip3 uninstall hsfs -y
!pip3 install hsfs[hive]


In [None]:
import hsfs

# TODO: replace the values below: [UUID], [project-name], [api-key]
connection = hsfs.connection(host="[UUID].cloud.hopsworks.ai",   # UUID is from Step 1, above
    project="[project-name]",                                    # project-name is from Step 2, above
    engine="hive",
    api_key_value="[api-key]")                                   # the API key comes from Step 3, above

fs = connection.get_feature_store()
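
The connection only succeeds once all three placeholders have been replaced. The following is a minimal sketch of a local pre-flight check, not part of the hsfs API: `check_connection_settings` is a hypothetical helper, and it assumes, as in the cell above, that `host` is passed as a bare hostname without a scheme.

```python
def check_connection_settings(host, project, api_key_value):
    """Return a list of problems found in the connection settings (hypothetical helper)."""
    problems = []
    settings = {"host": host, "project": project, "api_key_value": api_key_value}
    for name, value in settings.items():
        if not value:
            problems.append(f"{name} is empty")
        elif "[" in value or "]" in value:
            # Square brackets mark the placeholders used in this notebook.
            problems.append(f"{name} still contains a placeholder: {value}")
    if host and "://" in host:
        # Assumption: the host is given as "<UUID>.cloud.hopsworks.ai", with no scheme.
        problems.append("host should be a bare hostname without a scheme such as https://")
    return problems


# The unedited values from the cell above are all flagged.
for problem in check_connection_settings("[UUID].cloud.hopsworks.ai", "[project-name]", "[api-key]"):
    print(problem)
```

Running the check before calling `hsfs.connection` turns a forgotten placeholder into a readable message rather than a failed connection attempt.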

## Show the first 5 rows in the Demo Feature Group

First run the "Feature Store Tour" in Hopsworks to create the demo Feature Store project.

A feature group is a set of related features. A feature is a measurable property of the data that helps a model make predictions. A feature value is typically either numeric (a scalar, a vector, etc.) or categorical (a boolean, enum, or string). If you are a data engineer, think of the features in a feature group as columns in a database table. If you are a data scientist, think of them as columns in a dataframe.
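
To make the numeric versus categorical distinction concrete, here is a small sketch in plain Python that classifies a column of values the way the paragraph above describes. The helper `feature_kind` and the example columns are hypothetical and exist only for illustration; Hopsworks infers feature types itself.

```python
from numbers import Number


def feature_kind(values):
    """Classify a column of feature values as 'numeric' or 'categorical'."""

    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly:
        # booleans are treated as categorical here.
        return isinstance(v, Number) and not isinstance(v, bool)

    def is_number_or_vector(v):
        # A non-empty list or tuple of numbers counts as a numeric vector.
        if isinstance(v, (list, tuple)):
            return len(v) > 0 and all(is_number(x) for x in v)
        return is_number(v)

    return "numeric" if all(is_number_or_vector(v) for v in values) else "categorical"


# Hypothetical example columns, one list of values per feature.
example_columns = {
    "age": [31, 45, 27],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
    "is_member": [True, False, True],
    "country": ["UK", "USA", "UK"],
}
for name, values in example_columns.items():
    print(name, "->", feature_kind(values))
```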

In [None]:
teams_features = fs.get_feature_group("teams_features", version=1)
teams_features.show(5)

## Ingest some features into the Feature Store as a Feature Group
The data we will ingest looks as follows:

 * first_name : string (categorical value)
 * last_name : string (categorical value)
 * country : string (categorical value)
 
We want to use these features later to predict the country that a first_name, last_name pair comes from.

In [None]:
import pandas as pd

try:
    # Reuse the feature group if an earlier run already created it.
    name_country_fg = fs.get_feature_group(name="name_country_fg", version=1)
    print("name_country_fg found in feature store")
except Exception:
    # Otherwise load the training data and create the feature group from it.
    url = "https://repo.hops.works/dev/jdowling/data_cleaned_train.csv"
    df = pd.read_csv(url, sep=";")
    name_country_fg = fs.create_feature_group(
        name="name_country_fg",
        version=1,
        primary_key=["first_name", "last_name"],
        description="Name - Country prediction",
        # validation_type="STRICT",
        time_travel_format="HUDI",
        online_enabled=True,
        statistics_config=True,
    )
    print("Created name_country_fg in the feature store")
    name_country_fg.save(df)
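
The `primary_key` argument in the cell above names the columns that together identify a row. A minimal sketch of a local check for repeated (first_name, last_name) pairs before ingestion follows. `find_duplicate_keys` and the sample rows are hypothetical, and the sketch makes no claim about how Hopsworks itself handles duplicate keys.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in rows (a list of dicts)."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample rows using the same three columns as the feature group.
sample_rows = [
    {"first_name": "tom", "last_name": "johnson", "country": "UK"},
    {"first_name": "harry", "last_name": "windsor", "country": "USA"},
    {"first_name": "tom", "last_name": "johnson", "country": "USA"},
]
print(find_duplicate_keys(sample_rows, ["first_name", "last_name"]))
```

With a pandas DataFrame, `df.duplicated(subset=["first_name", "last_name"], keep=False)` gives the same information as a boolean mask.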

In [None]:
print("Name: {}".format(name_country_fg.name))
print("Description: {}".format(name_country_fg.description))
print("Features:")
features = name_country_fg.features
for feature in features:
    print("{:<60} \t Primary: {} \t Partition: {}".format(feature.name, feature.primary, feature.partition))

## Feature Data Validation

Garbage in, garbage out.

Let's check for garbage in. The world has about 195 countries, so if a batch of names claims to come from more than 195 distinct countries, it contains garbage.
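
The idea behind the check can be sketched locally in plain Python. This is only an illustration of a distinct-value bound, not how Hopsworks implements its validation rules. The function name and the `min_count` and `max_count` parameters are assumptions made for this example.

```python
def has_number_of_distinct_values(values, min_count=None, max_count=None):
    """Return True if the number of distinct values lies within the given bounds."""
    distinct = len(set(values))
    if min_count is not None and distinct < min_count:
        return False
    if max_count is not None and distinct > max_count:
        return False
    return True


few_countries = ["UK", "UK", "USA"]                         # 2 distinct values
too_many_countries = [f"country_{i}" for i in range(200)]   # 200 distinct values

print(has_number_of_distinct_values(few_countries, min_count=1, max_count=195))       # True
print(has_number_of_distinct_values(too_many_countries, min_count=1, max_count=195))  # False
```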

In [None]:
from hsfs.rule import Rule
rules = connection.get_rules()
for rule in rules:
    print(rule.to_dict())

In [None]:
expectation_countries = fs.create_expectation("countries",
                                          description="min and max number of countries",
                                          features=["country"], 
                                          rules=[Rule(name="HAS_NUMBER_OF_DISTINCT_VALUES", level="ERROR", min=1), 
                                                 Rule(name="HAS_NUMBER_OF_DISTINCT_VALUES", level="ERROR", max=195)])
expectation_countries.save()

In [None]:
name_country_fg.attach_expectation(expectation_countries)

In [None]:
# Create a Pandas DataFrame and insert its rows into the existing name_country_fg feature group.
import pandas as pd

columns = ["first_name", "last_name", "country"]
data = [["tom", "johnson", "UK"], ["penelope", "charles", "UK"], ["harry", "windsor", "USA"]]
df = pd.DataFrame(data, columns=columns)
name_country_fg.insert(df)

In [None]:
exps = name_country_fg.get_expectations()
for exp in exps:
    print(exp.description)

In [None]:
fg_validations = name_country_fg.get_validations()
for validation in fg_validations:
    print(validation.to_dict())

In [5]:
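import random
import string

import numpy as np

def id_generator(size=1500, chars=string.ascii_uppercase + string.digits):
    # Build a random string of `size` uppercase letters and digits.
    return "".join(random.choice(chars) for _ in range(size))

# 600 random strings fill 200 rows x 3 columns, so the country column holds
# 200 distinct values (barring collisions), more than the 195 the expectation allows.
num_rows = 200
data = np.array([id_generator() for _ in range(num_rows * len(columns))]).reshape(num_rows, len(columns))
df2 = pd.DataFrame(data, columns=columns)
print(df2)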


In [None]:
name_country_fg.insert(df2)

In [None]:
fg_validations = name_country_fg.get_validations()
for validation in fg_validations:
    print(validation.to_dict())