{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"---\n",
"title: \"Data Validation with Scala\"\n",
"date: 2021-02-24\n",
"type: technical_note\n",
"draft: false\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Validation with the Hopsworks Feature Store\n",
"\n",
"In this notebook we introduce feature validation operations with the Hopsworks Feature Store and its client API, hsfs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Background"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Motivation\n",
"\n",
"Data ingested into the Feature Store form the basis for the data fed as input to algorithms that develope machine learning models. The Feature store is a place where curated feature data is stored, therefore it is important that this data is validated against different rules to it adheres to business requirements. \n",
"\n",
"For example, ingested features might be expected to never be empty or to lie within a certain range, for example a feature `age` should always be a non-negative number.\n",
"\n",
"The Hopsworks Feature Store provides users with an API to create `Expectations` on ingested feature data by utilizing the `Deequ` https://github.com/awslabs/deequ open source library. Feature validation is part of the HSFS Java/Scala and Python API for working with Feature Groups. Users work with the abstractions:\n",
"\n",
"- Rules: A set of validation rules applied on a Spark/PySpark dataframe that is inserted into a Feature Group. \n",
"- Expectations: A set of rules that is applied on a set of features as provided by the user. Expecations are created at the feature store level and can be attached to multiple feature groups.\n",
"- Validations: The results of expectations against the ingested dataframe are assigned a ValidationTime and are persisted within the Feature Store. Users can then retrieve validation results by validation time and by commit time for time-travel enabled feature groups.\n",
"\n",
"Feature Validation is disabled by default, by having the `validation_type` feature group attribute set to `NONE`. The list of allowed validation types are:\n",
"- STRICT: Data validation is performed and feature group is updated only if validation status is \"Success\"\n",
"- WARNING: Data validation is performed and feature group is updated only if validation status is \"Warning\" or lower\n",
"- ALL: Data validation is performed and feature group is updated only if validation status is \"Failure\" or lower\n",
"- NONE: Data validation not performed on feature group"
]
},
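  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make these abstractions concrete, here is a minimal sketch of creating an expectation from rules and attaching it to a feature group. It assumes a feature store handle `fs` and a feature group `economyFg` obtained beforehand; exact builder and method names may differ between hsfs versions, so treat it as illustrative rather than definitive:\n",
    "\n",
    "```scala\n",
    "import com.logicalclocks.hsfs.metadata.validation._\n",
    "import scala.collection.JavaConverters._\n",
    "\n",
    "// An expectation on the `salary` feature: values must lie within [0, 1000000]\n",
    "val salaryExpectation = fs.createExpectation()\n",
    "                          .name(\"salary_range\")\n",
    "                          .description(\"salary must be non-negative and bounded\")\n",
    "                          .features(Seq(\"salary\").asJava)\n",
    "                          .rules(Seq(\n",
    "                              Rule.createRule(RuleName.HAS_MIN).min(0).level(Level.ERROR).build(),\n",
    "                              Rule.createRule(RuleName.HAS_MAX).max(1000000).level(Level.WARNING).build()).asJava)\n",
    "                          .build()\n",
    "salaryExpectation.save()\n",
    "\n",
    "// Attach the expectation; with validation type STRICT, inserts are\n",
    "// committed only when the validation status is \"Success\"\n",
    "economyFg.attachExpectation(salaryExpectation)\n",
    "```"
   ]
  },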
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create time travel enabled feature group and Bulk Insert Sample Dataset\n",
"\n",
"For this demo we will use small sample of the Agarwal Generator that is a widely used dataset. It contains the hypothetical data of people applying for a loan. `Rakesh Agrawal, Tomasz Imielinksi, and Arun Swami, \"Database Mining: A Performance Perspective\", IEEE Transactions on Knowledge and Data Engineering, 5(6), December 1993.
`\n",
"\n",
"##### For simplicity of demo purposes we split Agarwal dataset into 3 freature groups and demostrate feature validaton on the economy_fg feature group: \n",
"* `economy_fg` with customer id, salary, loan, value of house, age of house, commission and type of car features; "
]
},
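  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what this setup looks like in code (assuming a feature store handle `fs` and a sample dataframe `economyDf`; method names and import paths may vary between hsfs versions):\n",
    "\n",
    "```scala\n",
    "import com.logicalclocks.hsfs.TimeTravelFormat\n",
    "import com.logicalclocks.hsfs.metadata.validation.ValidationType\n",
    "import scala.collection.JavaConverters._\n",
    "\n",
    "// A time-travel enabled (HUDI) feature group with strict validation\n",
    "val economyFg = fs.createFeatureGroup()\n",
    "                  .name(\"economy_fg\")\n",
    "                  .version(1)\n",
    "                  .description(\"Customer economy features\")\n",
    "                  .primaryKeys(Seq(\"id\").asJava)\n",
    "                  .timeTravelFormat(TimeTravelFormat.HUDI)\n",
    "                  .validationType(ValidationType.STRICT)\n",
    "                  .build()\n",
    "\n",
    "// Bulk insert the sample dataframe; attached expectations are validated\n",
    "// before the data is committed to the feature group\n",
    "economyFg.save(economyDf)\n",
    "```"
   ]
  },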
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing necessary libraries "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Starting Spark application\n"
]
},
{
"data": {
"text/html": [
"
ID | YARN Application ID | Kind | State | Spark UI | Driver log |
---|---|---|---|---|---|
7 | application_1612535100309_0043 | spark | idle | Link | Link |