{ "cells": [ { "cell_type": "markdown", "id": "79fffb06", "metadata": {}, "source": [ "---\n", "\n", "## 🧑🏻‍🏫 HSFS feature exploration \n", "\n", "In this notebook we are going to walk through how to use the HSFS library to explore feature groups and features in the Hopsworks Feature Store. \n", "\n", "A key component of the Hopsworks feature store is to enable sharing and re-using of features across models and use cases. As such, the HSFS libraries allows user to join features from different feature groups and use them to create training datasets.\n", "Features can be taken also from different feature stores (projects) as long as the user running the notebook has the read access to those.\n", "\n", "![Join](../images/join.svg \"Join\")\n", "\n", "As for the [feature_engineering](./feature_engineering.ipynb) notebook, the first step is to establish a connection with the feature store and retrieve the project feature store handle." ] }, { "cell_type": "code", "execution_count": 1, "id": "5b7f9f96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected. Call `.close()` to terminate connection gracefully.\n" ] } ], "source": [ "import hsfs\n", "\n", "# Create a connection\n", "connection = hsfs.connection()\n", "\n", "# Get the feature store handle for the project's feature store\n", "fs = connection.get_feature_store()" ] }, { "cell_type": "markdown", "id": "81ebbac5", "metadata": {}, "source": [ "---\n", "\n", "## 🕵🏻‍♂️ Explore feature groups \n", "\n", "You can interact with the feature groups as if they were Spark dataframe. A feature group object has a `show()` method, to show `n` number of lines, and a `read()` method to read the content of the feature group in a Spark dataframe.\n", "\n", "The first step to do any operation on a feature group is to get its handle from the feature store. 
The `get_or_create_feature_group()` method accepts the name of the feature group and an optional parameter with the version you want to select. If you do not provide a version, the APIs will default to version 1" ] }, { "cell_type": "code", "execution_count": 2, "id": "54829188", "metadata": {}, "outputs": [], "source": [ "sales_fg = fs.get_or_create_feature_group(\n", " name = \"sales_fg\",\n", " version = 1\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "id": "0c68db20", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:51:09,946 INFO: USE `basics_featurestore`\n", "2022-06-20 13:51:10,805 INFO: SELECT `fg0`.`date` `date`, `fg0`.`weekly_sales` `weekly_sales`, `fg0`.`is_holiday` `is_holiday`, `fg0`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg0`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg0`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateweekly_salesis_holidaysales_last_30_days_store_depsales_last_365_days_store_dep
013040352000003659.650149274.401802200.44
112967776000002934.610154996.76471965.34
2132883200000018208.241631611.496589537.79
3134637120000010230.280416341.080.00
4131129280000017140.020607928.453963133.37
\n", "
" ], "text/plain": [ " date weekly_sales is_holiday sales_last_30_days_store_dep \\\n", "0 1304035200000 3659.65 0 149274.40 \n", "1 1296777600000 2934.61 0 154996.76 \n", "2 1328832000000 18208.24 1 631611.49 \n", "3 1346371200000 10230.28 0 416341.08 \n", "4 1311292800000 17140.02 0 607928.45 \n", "\n", " sales_last_365_days_store_dep \n", "0 1802200.44 \n", "1 471965.34 \n", "2 6589537.79 \n", "3 0.00 \n", "4 3963133.37 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_fg.select(['date','weekly_sales','is_holiday', 'sales_last_30_days_store_dep', 'sales_last_365_days_store_dep']).show(5)" ] }, { "cell_type": "code", "execution_count": 4, "id": "7a6d6d88", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:51:18,948 INFO: USE `basics_featurestore`\n", "2022-06-20 13:51:19,823 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`date` `date`, `fg0`.`weekly_sales` `weekly_sales`, `fg0`.`is_holiday` `is_holiday`, `fg0`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg0`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg0`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg0`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg0`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg0`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg0`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg0`.`sales_last_365_days_store` `sales_last_365_days_store`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg0`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptdateweekly_salesis_holidaysales_last_30_days_store_depsales_last_30_days_storesales_last_90_days_store_depsales_last_90_days_storesales_last_180_days_store_depsales_last_180_days_storesales_last_365_days_store_depsales_last_365_days_store
702094134274240000068495.6901982312.561982312.565677882.275677882.2711014154.5111014154.5131409255.2031409255.20
712012130826880000010485.090191750.50191750.50977080.59977080.593319720.123319720.1210552557.4410552557.44
85203134818560000011109.080486562.46486562.461417725.911417725.915472437.625472437.6216621469.8116621469.81
1452080127923840000041604.1301173650.611173650.613499311.123499311.126490119.606490119.6012913043.3612913043.36
2112079129980160000039220.6301137264.431137264.432683499.332683499.334715535.394715535.3918667704.9118667704.91
\n", "
" ], "text/plain": [ " store dept date weekly_sales is_holiday \\\n", "70 20 94 1342742400000 68495.69 0 \n", "71 20 12 1308268800000 10485.09 0 \n", "85 20 3 1348185600000 11109.08 0 \n", "145 20 80 1279238400000 41604.13 0 \n", "211 20 79 1299801600000 39220.63 0 \n", "\n", " sales_last_30_days_store_dep sales_last_30_days_store \\\n", "70 1982312.56 1982312.56 \n", "71 191750.50 191750.50 \n", "85 486562.46 486562.46 \n", "145 1173650.61 1173650.61 \n", "211 1137264.43 1137264.43 \n", "\n", " sales_last_90_days_store_dep sales_last_90_days_store \\\n", "70 5677882.27 5677882.27 \n", "71 977080.59 977080.59 \n", "85 1417725.91 1417725.91 \n", "145 3499311.12 3499311.12 \n", "211 2683499.33 2683499.33 \n", "\n", " sales_last_180_days_store_dep sales_last_180_days_store \\\n", "70 11014154.51 11014154.51 \n", "71 3319720.12 3319720.12 \n", "85 5472437.62 5472437.62 \n", "145 6490119.60 6490119.60 \n", "211 4715535.39 4715535.39 \n", "\n", " sales_last_365_days_store_dep sales_last_365_days_store \n", "70 31409255.20 31409255.20 \n", "71 10552557.44 10552557.44 \n", "85 16621469.81 16621469.81 \n", "145 12913043.36 12913043.36 \n", "211 18667704.91 18667704.91 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_df = sales_fg.read()\n", "sales_df[sales_df.store == 20].head()" ] }, { "cell_type": "code", "execution_count": 5, "id": "11e4114c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "⛳️ Type: \n" ] } ], "source": [ "print(f'⛳️ Type: {type(sales_df)}')" ] }, { "cell_type": "markdown", "id": "e8b9e37b", "metadata": {}, "source": [ "You can also inspect the metadata of the feature group. 
You can, for instance, show the features the feature group is made of and whether they are primary or partition keys:" ] }, { "cell_type": "code", "execution_count": 6, "id": "ea5d3a00", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: sales_fg\n", "Description: Sales related features\n", "Features:\n", "store \t Primary: True \t Partition: False\n", "dept \t Primary: True \t Partition: False\n", "date \t Primary: True \t Partition: False\n", "weekly_sales \t Primary: False \t Partition: False\n", "is_holiday \t Primary: False \t Partition: False\n", "sales_last_30_days_store_dep \t Primary: False \t Partition: False\n", "sales_last_30_days_store \t Primary: False \t Partition: False\n", "sales_last_90_days_store_dep \t Primary: False \t Partition: False\n", "sales_last_90_days_store \t Primary: False \t Partition: False\n", "sales_last_180_days_store_dep \t Primary: False \t Partition: False\n", "sales_last_180_days_store \t Primary: False \t Partition: False\n", "sales_last_365_days_store_dep \t Primary: False \t Partition: False\n", "sales_last_365_days_store \t Primary: False \t Partition: False\n" ] } ], "source": [ "print(\"Name: {}\".format(sales_fg.name))\n", "print(\"Description: {}\".format(sales_fg.description))\n", "print(\"Features:\")\n", "features = sales_fg.features\n", "for feature in features:\n", "    print(\"{:<60} \\t Primary: {} \\t Partition: {}\".format(feature.name, feature.primary, feature.partition))" ] }, { "cell_type": "markdown", "id": "57d7cbb1", "metadata": {}, "source": [ "If you are interested only in a subset of features, you can use the `select()` method on the feature group object to select a list of features. The result of `select()` behaves like a feature group, so you can call the `.show()` or `.read()` methods on it."
] }, { "cell_type": "code", "execution_count": 7, "id": "7f734f1d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:51:34,057 INFO: USE `basics_featurestore`\n", "2022-06-20 13:51:34,925 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`weekly_sales` `weekly_sales`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg0`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptweekly_sales
032713659.65
121672934.61
2311018208.24
38110230.28
412317140.02
\n", "
" ], "text/plain": [ " store dept weekly_sales\n", "0 32 71 3659.65\n", "1 21 67 2934.61\n", "2 31 10 18208.24\n", "3 8 1 10230.28\n", "4 1 23 17140.02" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_fg.select(['store', 'dept', 'weekly_sales']).show(5)" ] }, { "cell_type": "markdown", "id": "56436cd6", "metadata": {}, "source": [ "---\n", "\n", "## 👮🏼‍♀️ Filter Feature Groups \n", "\n", "If you do not want to read your feature group into a dataframe, you can also filter directly on a `FeatureGroup` object. Applying a filter to a feature group returns a `Query` object which can subsequently be used to be further joined with other feature groups or queries, or it can be materialized as a training dataset." ] }, { "cell_type": "code", "execution_count": 8, "id": "8f0bb1c7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:51:40,887 INFO: USE `basics_featurestore`\n", "2022-06-20 13:51:41,804 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`weekly_sales` `weekly_sales`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg0`\n", "WHERE `fg0`.`weekly_sales` >= 50000\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptweekly_sales
0303860223.39
1191656018.16
2139355090.05
393883424.12
4277267459.34
\n", "
" ], "text/plain": [ " store dept weekly_sales\n", "0 30 38 60223.39\n", "1 19 16 56018.16\n", "2 13 93 55090.05\n", "3 9 38 83424.12\n", "4 27 72 67459.34" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_fg.select(['store', 'dept', 'weekly_sales']).filter(sales_fg.weekly_sales >= 50000).show(5)" ] }, { "cell_type": "markdown", "id": "61896925", "metadata": {}, "source": [ "Conjunctions of filters can be constructed using the Python Bitwise Operators `|` and `&` which replace the regular binary operators when working with feature groups and filters." ] }, { "cell_type": "code", "execution_count": 9, "id": "08d2e9b9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:51:45,228 INFO: USE `basics_featurestore`\n", "2022-06-20 13:51:46,205 INFO: SELECT `fg0`.`store` `store`, `fg0`.`dept` `dept`, `fg0`.`weekly_sales` `weekly_sales`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg0`\n", "WHERE `fg0`.`weekly_sales` >= 50000 AND `fg0`.`dept` = 2\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptweekly_sales
019261912.39
114282661.34
211254084.66
34287181.89
419250399.44
\n", "
" ], "text/plain": [ " store dept weekly_sales\n", "0 19 2 61912.39\n", "1 14 2 82661.34\n", "2 11 2 54084.66\n", "3 4 2 87181.89\n", "4 19 2 50399.44" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_fg.select(['store', 'dept', 'weekly_sales']).filter((sales_fg.weekly_sales >= 50000) & (sales_fg.dept == 2)).show(5)" ] }, { "cell_type": "markdown", "id": "4c004ed3", "metadata": {}, "source": [ "---\n", "\n", "## 👷🏼‍♂️ Join Features and Feature Groups \n", "\n", "HSFS provides an API similar to Pandas to join feature groups together and to select features from different feature groups.\n", "The easies query you can write is by selecting all the features from a feature group and join them with all the features of another feature group.\n", "\n", "You can use the `select_all()` method of a feature group to select all its features. HSFS relies on the Hopsworks feature store to identify which features of the two feature groups to use as joining condition. \n", "If you don't specify anything, Hopsworks will use the largest matching subset of primary keys with the same name.\n", "\n", "In the example below, `sales_fg` has `store`, `dept` and `date` as composite primary key while `exogenous_fg` has only `store` and `date`. So Hopsworks will set as joining condition `store` and `date`." ] }, { "cell_type": "code", "execution_count": 10, "id": "87c0d1b2", "metadata": {}, "outputs": [], "source": [ "sales_fg = fs.get_or_create_feature_group(\n", " name = 'sales_fg',\n", " version = 1\n", ")\n", "\n", "exogenous_fg = fs.get_or_create_feature_group(\n", " name = 'exogenous_fg',\n", " version = 1\n", ")\n", "\n", "query = sales_fg.select_all().join(exogenous_fg.select_all())" ] }, { "cell_type": "markdown", "id": "c84ce075", "metadata": {}, "source": [ "You can use the query object to create training datasets (see training dataset notebook). You can inspect the query generated by calling the `to_string()` method on it." 
] }, { "cell_type": "code", "execution_count": 11, "id": "9cf095b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`date` `date`, `fg1`.`weekly_sales` `weekly_sales`, `fg1`.`is_holiday` `is_holiday`, `fg1`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg1`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg1`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg1`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg1`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg1`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg1`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg1`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`markdown1` `markdown1`, `fg0`.`markdown2` `markdown2`, `fg0`.`markdown3` `markdown3`, `fg0`.`markdown4` `markdown4`, `fg0`.`markdown5` `markdown5`, `fg0`.`cpi` `cpi`, `fg0`.`unemployment` `unemployment`, `fg0`.`is_holiday` `is_holiday`, CASE WHEN `fg0`.`appended_feature` IS NULL THEN 10.0 ELSE `fg0`.`appended_feature` END `appended_feature`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg1`\n", "INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`\n" ] } ], "source": [ "print(query.to_string())" ] }, { "cell_type": "markdown", "id": "9775982d", "metadata": {}, "source": [ "As for the feature groups, you can call the `show()` method to inspect the data before generating a training dataset from it. Or you can call the `read()` method to get a Spark DataFrame with the result of the query and apply additional transformations to it." 
] }, { "cell_type": "code", "execution_count": 12, "id": "02d11b6b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:51:50,028 INFO: USE `basics_featurestore`\n", "2022-06-20 13:51:50,849 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`date` `date`, `fg1`.`weekly_sales` `weekly_sales`, `fg1`.`is_holiday` `is_holiday`, `fg1`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg1`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg1`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg1`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg1`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg1`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg1`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg1`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`temperature` `temperature`, `fg0`.`fuel_price` `fuel_price`, `fg0`.`markdown1` `markdown1`, `fg0`.`markdown2` `markdown2`, `fg0`.`markdown3` `markdown3`, `fg0`.`markdown4` `markdown4`, `fg0`.`markdown5` `markdown5`, `fg0`.`cpi` `cpi`, `fg0`.`unemployment` `unemployment`, `fg0`.`is_holiday` `is_holiday`, CASE WHEN `fg0`.`appended_feature` IS NULL THEN 10.0 ELSE `fg0`.`appended_feature` END `appended_feature`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg1`\n", "INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptdateweekly_salesis_holidaysales_last_30_days_store_depsales_last_30_days_storesales_last_90_days_store_depsales_last_90_days_storesales_last_180_days_store_dep...fuel_pricemarkdown1markdown2markdown3markdown4markdown5cpiunemploymentis_holidayappended_feature
076713354848000003100.230128353.04128353.04368801.46368801.46469529.79...3.927819.48NaN45.9449.821209.89197.6645468.090010.0
1414133125120000029320.670955751.92955751.922749310.672749310.674651795.12...3.28615255.04285.897.743203.613889.43196.9631606.589010.0
2274412725856000004235.890230722.72230722.72846650.03846650.031567698.75...3.042NaNNaNNaNNaNNaN135.9170838.058010.0
31181133911360000014540.750392018.44392018.441150766.821150766.822402243.25...3.45211984.13191.4078.652768.316358.77225.2521466.664010.0
4696131915520000026781.650960483.58960483.582709414.122709414.1210796825.64...3.353NaNNaNNaNNaNNaN219.0740176.551010.0
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " store dept date weekly_sales is_holiday \\\n", "0 7 67 1335484800000 3100.23 0 \n", "1 41 4 1331251200000 29320.67 0 \n", "2 27 44 1272585600000 4235.89 0 \n", "3 11 81 1339113600000 14540.75 0 \n", "4 6 96 1319155200000 26781.65 0 \n", "\n", " sales_last_30_days_store_dep sales_last_30_days_store \\\n", "0 128353.04 128353.04 \n", "1 955751.92 955751.92 \n", "2 230722.72 230722.72 \n", "3 392018.44 392018.44 \n", "4 960483.58 960483.58 \n", "\n", " sales_last_90_days_store_dep sales_last_90_days_store \\\n", "0 368801.46 368801.46 \n", "1 2749310.67 2749310.67 \n", "2 846650.03 846650.03 \n", "3 1150766.82 1150766.82 \n", "4 2709414.12 2709414.12 \n", "\n", " sales_last_180_days_store_dep ... fuel_price markdown1 markdown2 \\\n", "0 469529.79 ... 3.927 819.48 NaN \n", "1 4651795.12 ... 3.286 15255.04 285.89 \n", "2 1567698.75 ... 3.042 NaN NaN \n", "3 2402243.25 ... 3.452 11984.13 191.40 \n", "4 10796825.64 ... 3.353 NaN NaN \n", "\n", " markdown3 markdown4 markdown5 cpi unemployment is_holiday \\\n", "0 45.94 49.82 1209.89 197.664546 8.090 0 \n", "1 7.74 3203.61 3889.43 196.963160 6.589 0 \n", "2 NaN NaN NaN 135.917083 8.058 0 \n", "3 78.65 2768.31 6358.77 225.252146 6.664 0 \n", "4 NaN NaN NaN 219.074017 6.551 0 \n", "\n", " appended_feature \n", "0 10.0 \n", "1 10.0 \n", "2 10.0 \n", "3 10.0 \n", "4 10.0 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.show(5)" ] }, { "cell_type": "markdown", "id": "8d89ca21", "metadata": {}, "source": [ "As for the `show()` and `read()` method of the feature group, even in the case of a query you can specify against which storage to run the query." 
] }, { "cell_type": "markdown", "id": "1b534476", "metadata": {}, "source": [ "---\n", "\n", "## 💼 Select only a subset of features \n", "\n", "You can replace the `select_all()` method with the `select([])` method to be able to select only a subset of features from a feature group you want to join:" ] }, { "cell_type": "code", "execution_count": 13, "id": "dc42d825", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:52:46,041 INFO: USE `basics_featurestore`\n", "2022-06-20 13:52:46,947 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`weekly_sales` `weekly_sales`, `fg0`.`fuel_price` `fuel_price`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg1`\n", "INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptweekly_salesfuel_price
07673100.233.927
141429320.673.286
227444235.893.042
3118114540.753.452
469626781.653.353
\n", "
" ], "text/plain": [ " store dept weekly_sales fuel_price\n", "0 7 67 3100.23 3.927\n", "1 41 4 29320.67 3.286\n", "2 27 44 4235.89 3.042\n", "3 11 81 14540.75 3.452\n", "4 6 96 26781.65 3.353" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = sales_fg.select(['store', 'dept', 'weekly_sales'])\\\n", " .join(exogenous_fg.select(['fuel_price']))\n", "query.show(5)" ] }, { "cell_type": "markdown", "id": "8aa6a2b5", "metadata": {}, "source": [ "---\n", "\n", "## 💈 Overwrite the joining key \n", "\n", "If your feature groups don't have a primary key, or if they have different names or if you want to overwrite the joining key, you can pass it as a parameter of the join.\n", "\n", "As in Pandas, if the feature has the same name on both feature groups, then you can use the `on=[]` paramter. If they have different names, then you can use the `left_on=[]` and `right_on=[]` paramters:" ] }, { "cell_type": "code", "execution_count": 14, "id": "56838a4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:53:16,607 INFO: USE `basics_featurestore`\n", "2022-06-20 13:53:17,440 INFO: SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`weekly_sales` `weekly_sales`, `fg0`.`fuel_price` `fuel_price`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg1`\n", "INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`date` = `fg0`.`date`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storedeptweekly_salesfuel_price
011851823.72.735
111851823.73.070
211851823.73.162
311851823.72.830
411851823.72.735
\n", "
" ], "text/plain": [ " store dept weekly_sales fuel_price\n", "0 11 85 1823.7 2.735\n", "1 11 85 1823.7 3.070\n", "2 11 85 1823.7 3.162\n", "3 11 85 1823.7 2.830\n", "4 11 85 1823.7 2.735" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = sales_fg.select(['store', 'dept', 'weekly_sales'])\\\n", " .join(exogenous_fg.select(['fuel_price']), on=['date'])\n", "query.show(5)" ] }, { "cell_type": "markdown", "id": "36247901", "metadata": {}, "source": [ "### ⛓ Overwriting the join type \n", "\n", "\n", "\n", "By default, the join type between two feature groups is `INNER JOIN`. You can overwrite this behavior by passing the `join_type` parameter to the join method. Valid types are: `INNER, LEFT, RIGHT, FULL, CROSS, LEFT_SEMI_JOIN, COMMA`" ] }, { "cell_type": "code", "execution_count": 15, "id": "8b4ff79c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg1`.`store` `store`, `fg1`.`dept` `dept`, `fg1`.`weekly_sales` `weekly_sales`, `fg0`.`fuel_price` `fuel_price`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg1`\n", "LEFT JOIN `basics_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`\n" ] } ], "source": [ "query = sales_fg.select(['store', 'dept', 'weekly_sales'])\\\n", " .join(exogenous_fg.select(['fuel_price']), join_type=\"left\")\n", "\n", "print(query.to_string())" ] }, { "cell_type": "markdown", "id": "38516ca4", "metadata": {}, "source": [ "---\n", "\n", "## 🧲 Join mulitple feature groups \n", "\n", "You can concatenate as many feature gropus as you wish. In the example below the order of execution will be:\n", "\n", " (`sales_fg` <> `store_fg`) <> `exogenous_fg`\n", "\n", "The join paramers you pass in each `join()` method call apply to that specific join. 
This means that you can concatenate left and right joins.\n", "Please be aware that currently HSFS **does not support** nested join such as: \n", "\n", " `sales_fg` <> (`store_fg` <> `exogenous_fg`)" ] }, { "cell_type": "code", "execution_count": 16, "id": "5d78eba2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg2`.`store` `store`, `fg2`.`dept` `dept`, `fg2`.`date` `date`, `fg2`.`weekly_sales` `weekly_sales`, `fg2`.`is_holiday` `is_holiday`, `fg2`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg2`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg2`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg2`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg2`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg2`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg2`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg2`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`type` `type`, `fg0`.`size` `size`, `fg0`.`dept` `dept`, `fg1`.`fuel_price` `fuel_price`, `fg1`.`unemployment` `unemployment`, `fg1`.`cpi` `cpi`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg2`\n", "INNER JOIN `basics_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`\n", "INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`store` = `fg1`.`store` AND `fg2`.`date` = `fg1`.`date`\n" ] } ], "source": [ "store_fg = fs.get_or_create_feature_group(\n", " name = \"store_fg\",\n", " version = 1\n", ")\n", "\n", "query = sales_fg.select_all()\\\n", " .join(store_fg.select_all())\\\n", " .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))\n", "\n", "print(query.to_string())" ] }, { "cell_type": "markdown", "id": "94812892", "metadata": {}, "source": [ "### 🔱 Use Joins together with Filters \n", "\n", "It is possible to use filters in any of the subqueries of a joined query." 
] }, { "cell_type": "code", "execution_count": 17, "id": "98d28d1b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg2`.`store` `store`, `fg2`.`dept` `dept`, `fg2`.`date` `date`, `fg2`.`weekly_sales` `weekly_sales`, `fg2`.`is_holiday` `is_holiday`, `fg2`.`sales_last_30_days_store_dep` `sales_last_30_days_store_dep`, `fg2`.`sales_last_30_days_store` `sales_last_30_days_store`, `fg2`.`sales_last_90_days_store_dep` `sales_last_90_days_store_dep`, `fg2`.`sales_last_90_days_store` `sales_last_90_days_store`, `fg2`.`sales_last_180_days_store_dep` `sales_last_180_days_store_dep`, `fg2`.`sales_last_180_days_store` `sales_last_180_days_store`, `fg2`.`sales_last_365_days_store_dep` `sales_last_365_days_store_dep`, `fg2`.`sales_last_365_days_store` `sales_last_365_days_store`, `fg0`.`type` `type`, `fg0`.`size` `size`, `fg0`.`dept` `dept`, `fg1`.`fuel_price` `fuel_price`, `fg1`.`unemployment` `unemployment`, `fg1`.`cpi` `cpi`\n", "FROM `basics_featurestore`.`sales_fg_1` `fg2`\n", "INNER JOIN `basics_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`\n", "INNER JOIN `basics_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`store` = `fg1`.`store` AND `fg2`.`date` = `fg1`.`date`\n", "WHERE `fg2`.`weekly_sales` >= 50000 AND `fg1`.`fuel_price` <= 2.7\n" ] } ], "source": [ "query = sales_fg.select_all()\\\n", " .join(store_fg.select_all())\\\n", " .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']).filter(exogenous_fg.fuel_price <= 2.7)) \\\n", " .filter(sales_fg.weekly_sales >= 50000)\n", "\n", "print(query.to_string())" ] }, { "cell_type": "markdown", "id": "61550062", "metadata": {}, "source": [ "---\n", "\n", "## 🔮 Free hand query \n", "\n", "With HSFS you are free of writing skipping entirely the Hopsworks query constructor and write your own query. This functionality can be useful if you need to express more complex queries for your use case. `fs.sql` returns a Spark Dataframe." 
] }, { "cell_type": "code", "execution_count": 18, "id": "a96a02de", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-20 13:58:08,689 INFO: USE `basics_featurestore`\n", "2022-06-20 13:58:09,381 INFO: SELECT * FROM `store_fg_1`\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_fg_1.storestore_fg_1.typestore_fg_1.sizestore_fg_1.deptstore_fg_1._hoodie_record_keystore_fg_1._hoodie_partition_pathstore_fg_1._hoodie_commit_timestore_fg_1._hoodie_file_namestore_fg_1._hoodie_commit_seqno
026A1525137626202206201300349982daa2664-4907-4932-af25-c089474b316b-0_0-29-66...20220620130034998_0_1
11A151315771202206201300349982daa2664-4907-4932-af25-c089474b316b-0_0-29-66...20220620130034998_0_2
221B1401677721202206201300349982daa2664-4907-4932-af25-c089474b316b-0_0-29-66...20220620130034998_0_3
317B931887617202206201300349982daa2664-4907-4932-af25-c089474b316b-0_0-29-66...20220620130034998_0_4
430C429886430202206201300349982daa2664-4907-4932-af25-c089474b316b-0_0-29-66...20220620130034998_0_5
\n", "
" ], "text/plain": [ " store_fg_1.store store_fg_1.type store_fg_1.size store_fg_1.dept \\\n", "0 26 A 152513 76 \n", "1 1 A 151315 77 \n", "2 21 B 140167 77 \n", "3 17 B 93188 76 \n", "4 30 C 42988 64 \n", "\n", " store_fg_1._hoodie_record_key store_fg_1._hoodie_partition_path \\\n", "0 26 \n", "1 1 \n", "2 21 \n", "3 17 \n", "4 30 \n", "\n", " store_fg_1._hoodie_commit_time \\\n", "0 20220620130034998 \n", "1 20220620130034998 \n", "2 20220620130034998 \n", "3 20220620130034998 \n", "4 20220620130034998 \n", "\n", " store_fg_1._hoodie_file_name \\\n", "0 2daa2664-4907-4932-af25-c089474b316b-0_0-29-66... \n", "1 2daa2664-4907-4932-af25-c089474b316b-0_0-29-66... \n", "2 2daa2664-4907-4932-af25-c089474b316b-0_0-29-66... \n", "3 2daa2664-4907-4932-af25-c089474b316b-0_0-29-66... \n", "4 2daa2664-4907-4932-af25-c089474b316b-0_0-29-66... \n", "\n", " store_fg_1._hoodie_commit_seqno \n", "0 20220620130034998_0_1 \n", "1 20220620130034998_0_2 \n", "2 20220620130034998_0_3 \n", "3 20220620130034998_0_4 \n", "4 20220620130034998_0_5 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.sql(\"SELECT * FROM `store_fg_1`\").head()" ] }, { "cell_type": "markdown", "id": "e5f4e509", "metadata": {}, "source": [ "---\n", "## ⏭️ **Next:** Part 03 \n", "\n", "In the following notebook we will use our feature groups to create a dataset we can train a model on." ] } ], "metadata": { "kernelspec": { "display_name": "Python", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 5 }