{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "title: \"Feature Engineering/Ingestion\"\n", "date: 2021-02-24\n", "type: technical_note\n", "draft: false\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Store Tour - Python API\n", "\n", "This set of notebooks contains a tour/reference for the Hopsworks feature store Python API. The notebooks are meant to be run from feature store demo projects on Hopsworks. We will go over best practices for using the API as well as common pitfalls.\n", "\n", "There are 3 notebooks:\n", "- Feature groups: Discover how to work with features and feature groups, both offline and online\n", "- [Feature Exploration](./feature_exploration.ipynb): Discover how to join features from different feature groups\n", "- [Training datasets](./training_datasets.ipynb): Discover how to save training datasets to be used by ML models\n", "\n", "The data required to run this tour is located in a zip file called `archive.zip` in the same directory as the notebooks. Head to the Dataset browser on Hopsworks and unzip it.\n", "\n", "## Features and Feature Groups\n", "\n", "The Hopsworks feature store is a centralized repository, within an organization, to manage machine learning features. A feature is a measurable property of a phenomenon. It could be a simple value such as the age of a customer, or it could be an aggregated value, such as the number of transactions made by a customer in the last 30 days.\n", "\n", "A feature is not restricted to a numeric value; it could also be a string representing an address, or an image.\n", "\n", "\n", "\n", "A feature store is not a pure storage service; it goes hand in hand with feature computation. Feature engineering is the process of transforming raw data into a format that is compatible and understandable for predictive models.\n", "\n", "In this notebook we are going to focus on the left side of the picture above. 
In particular, we will see how data engineers can create features and push them to the Hopsworks feature store so that they are available to data scientists.\n", "\n", "### HSFS library\n", "\n", "The Hopsworks feature store library is called `hsfs` (**H**opswork**s** **F**eature **S**tore). \n", "The library is Apache V2 licensed and available [here](https://github.com/logicalclocks/feature-store-api). The library is currently available for Python and JVM languages such as Scala and Java.\n", "In this notebook we are going to cover the Python part.\n", "\n", "You can find the complete documentation of the library here: \n", "\n", "The first step is to establish a connection with your Hopsworks feature store instance and retrieve the object that represents the feature store you'll be working with. \n", "\n", "By default `connection.get_feature_store()` returns the feature store of the project you are working with. However, it also accepts a project name as a parameter to select a different feature store. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "
ID | YARN Application ID | Kind | State | Spark UI | Driver log |
---|---|---|---|---|---|
7 | application_1605286520909_0013 | pyspark | idle | Link | Link |