{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Time travel operations in Hopsworks Feature Store\n", "\n", "In this notebook we will introduce time travel operations in Hopsworks Feature Store (HSFS). Currently HSFS supports Apache Hudi (http://hudi.apache.org/) a storage abstraction/library for doing **incremental** data ingestion to a Hopsworks Feature Store." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Motivation\n", "\n", "Traditional ETL typically involves taking a snapshot of a production database and doing a full load into a data lake (typically stored on a distributed file system). Using the snapshot approach for ETL is simple since the snapshot is immutable and can be loaded as an atomic unit into the data lake. However, the con of taking this approach to doing data ingestion is that it is *slow*. Even if just a single record have been updated since the last data ingestion, the entire table has to be re-written. If you are working with Big Data (TB or PB size datasets) then this will introduce significant *data latency* and *wasted resources* (majority of the writes when ingesting the snapshot is redundant as most of the records have not been updated since the last ETL step). \n", "\n", "This motivates the use-case for **incremental** data ingestion. Incremental data ingestion means that only deltas/changelogs since the last ingestion are inserted. With incremental processing, you process data in *mini-batches* and run the spark job frequently. The incremental model makes better use of resources and makes it easier to do complex processing and joins.\n", "\n", "In addition data is rarely immutable in practice. A bank transaction might be reverted, a customer might change his or her home adress, and a customer review might be updated, to give a few examples. This is where Hudi comes into the picture. Hudi stands for `Hadoop Upserts anD Incrementals` and brings two new primitives for data engineering on distributed file systems (in addition to append/read):\n", "\n", "- `Upsert`: the ability to do insertions (appends) and updates efficiently. \n", "- `Incremental reads`: the ability to read datasets incrementally using the notion of \"commits\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How Hopsworks Feature Store time travel operations can be used for ML and Feature Pipelines\n", "\n", "Hudi is integrated in the Hopsworks Feature Store for doing incremental feature computation and for point-in-time correctness and backfilling of feature data.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "
ID | YARN Application ID | Kind | State | Spark UI | Driver log |
---|---|---|---|---|---|
18 | application_1609813430371_0019 | spark | idle | Link | Link |