{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: \"Databricks Azure Feature Store Quickstart\"\n", "date: 2021-02-24\n", "type: technical_note\n", "draft: false\n", "---" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "26dbe322-f2e0-4b58-96c8-ec0de1564127", "showTitle": false, "title": "" } }, "source": [ "# Databricks Azure Feature Store Quick Start\n", "\n", "This notebook gives you a quick overview of how to integrate the Hopsworks Feature Store with Databricks and Azure Data Lake Storage (ADLS). We'll go over four steps:\n", "\n", "- Generate some sample data and store it on ADLS\n", "- Do some feature engineering with Databricks on the data from ADLS\n", "- Save the engineered features to the Feature Store\n", "- Select a group of the features from the Feature Store and create a training dataset\n", "\n", "This requires configuring the Databricks cluster to interact with the Hopsworks Feature Store; see [Databricks Quick Start](https://docs.hopsworks.ai/feature-store-api/latest/integrations/databricks/configuration/)." ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "d72a3a79-1c6e-41e2-bda7-73ce32bb3a0d", "showTitle": false, "title": "" } }, "source": [ "### Imports\n", "\n", "We'll use numpy and pandas for data generation, pyspark for feature engineering, and the `hsfs` library to interact with the Hopsworks Feature Store." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "8b87abd1-30c0-4240-8460-bdf7d26024ea", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
/databricks/python/lib/python3.7/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working\n", " from collections import Mapping, MutableMapping\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
/databricks/python/lib/python3.7/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working\n from collections import Mapping, MutableMapping\n
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "import hsfs\n", "import random\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from pyspark.sql import SQLContext\n", "from pyspark.sql import Row\n", "sqlContext = SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "6f552032-13e0-4bf0-bfb2-712c4d43abbc", "showTitle": false, "title": "" } }, "source": [ "### Connecting to the Feature Store" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "57f369fb-09d7-42b2-b08d-f818e4d890a8", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
Connected. Call `.close()` to terminate connection gracefully.\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Connected. Call `.close()` to terminate connection gracefully.\n
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "# Connect to the feature store, see https://docs.hopsworks.ai/feature-store-api/latest/generated/api/connection_api/#connection_1 for more information\n", "# The API key can also be saved as a secret in Databricks and retrieved using dbutils\n", "connection = hsfs.connection(\n", " host=\"10.0.0.4\",\n", " project=\"dataai\",\n", " port=\"443\",\n", " api_key_value=\"<API_KEY>\",\n", " hostname_verification=False\n", ")\n", "\n", "fs = connection.get_feature_store()" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "0d7512de-8f2f-4ed9-8749-17ea219e53cd", "showTitle": false, "title": "" } }, "source": [ "#### Configure Databricks to write to ADLS Gen2\n", "\n", "Follow the steps in the [Databricks Documentation](https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "273c2e38-9b01-4830-a0bc-e25564dde4f0", "showTitle": false, "title": "" } }, "source": [ "#### Generate Sample Data\n", "\n", "Let's generate two sample datasets and store them on ADLS:\n", "\n", "1. `houses_for_sale_data`:\n", "\n", "```bash\n", "+-------+--------+------------------+------------------+------------------+\n", "|area_id|house_id| house_worth| house_age| house_size|\n", "+-------+--------+------------------+------------------+------------------+\n", "| 1| 0| 11678.15482418699|133.88670106643886|366.80067322738535|\n", "| 1| 1| 2290.436167500643|15994.969706808222|195.84014889823976|\n", "| 1| 2| 8380.774578431328|1994.8576926471007|1544.5164614303735|\n", "| 1| 3|11641.224696102923|23104.501275562343|1673.7222604337876|\n", "| 1| 4| 5382.089422436954| 13903.43637058141| 274.2912104765028|\n", "+-------+--------+------------------+------------------+------------------+\n", "\n", " |-- area_id: long (nullable = true)\n", " |-- house_id: long (nullable = true)\n", " |-- house_worth: double (nullable = true)\n", " |-- house_age: double (nullable = true)\n", " |-- house_size: double (nullable = true)\n", "```\n", "2. `houses_sold_data`:\n", "\n", "```bash\n", "+-------+-----------------+-----------------+------------------+\n", "|area_id|house_purchase_id|number_of_bidders| sold_for_amount|\n", "+-------+-----------------+-----------------+------------------+\n", "| 1| 0| 0| 70073.06059070028|\n", "| 1| 1| 15| 146.9198329740602|\n", "| 1| 2| 6| 594.802165433149|\n", "| 1| 3| 10| 77187.84123130841|\n", "| 1| 4| 1|110627.48922722359|\n", "+-------+-----------------+-----------------+------------------+\n", "\n", " |-- area_id: long (nullable = true)\n", " |-- house_purchase_id: long (nullable = true)\n", " |-- number_of_bidders: long (nullable = true)\n", " |-- sold_for_amount: double (nullable = true)\n", "```\n", "\n", "We'll use this data to predict the amount a house is sold for, based on features of the **area** where the house is located."
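, "\n",
"\n",
"The features engineered below are per-area aggregates of these two datasets. As a minimal sketch of that idea (a toy pandas example with made-up values; the column names follow the schemas above, and this cell is illustrative rather than part of the pipeline):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for houses_for_sale_data (values are made up)\n",
"for_sale = pd.DataFrame({\n",
"    'area_id': [1, 1, 2],\n",
"    'house_worth': [10000.0, 12000.0, 8000.0],\n",
"    'house_age': [100.0, 200.0, 50.0],\n",
"    'house_size': [300.0, 400.0, 250.0],\n",
"})\n",
"\n",
"# Per-area averages -- the same kind of feature computed later with Spark\n",
"area_features = for_sale.groupby('area_id').mean().add_prefix('avg_')\n",
"print(area_features)\n",
"```"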
] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "16fea98c-a098-4c88-85f0-f2746a058b2a", "showTitle": false, "title": "" } }, "source": [ "##### Generation of `houses_for_sale_data`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "338f220c-5f53-40e5-8f07-4ea9423a3fe4", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
", "datasetInfos": [ { "name": "houses_for_sale_data_spark_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "house_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "house_worth", "nullable": true, "type": "double" }, { "metadata": {}, "name": "house_age", "nullable": true, "type": "double" }, { "metadata": {}, "name": "house_size", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" } ], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "area_ids = list(range(1,51))\n", "house_sizes = []\n", "house_worths = []\n", "house_ages = []\n", "house_area_ids = []\n", "for i in area_ids:\n", " for j in list(range(1,100)):\n", " house_sizes.append(abs(np.random.normal()*1000)/i)\n", " house_worths.append(abs(np.random.normal()*10000)/i)\n", " house_ages.append(abs(np.random.normal()*10000)/i)\n", " house_area_ids.append(i)\n", "house_ids = list(range(len(house_area_ids)))\n", "houses_for_sale_data = pd.DataFrame({\n", " 'area_id':house_area_ids,\n", " 'house_id':house_ids,\n", " 'house_worth': house_worths,\n", " 'house_age': house_ages,\n", " 'house_size': house_sizes\n", " })\n", "houses_for_sale_data_spark_df = sqlContext.createDataFrame(houses_for_sale_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "7a2906f0-ee42-485f-abf5-8f0dd6fe8f3f", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
+-------+--------+------------------+------------------+------------------+\n", "|area_id|house_id| house_worth| house_age| house_size|\n", "+-------+--------+------------------+------------------+------------------+\n", "| 1| 0|1991.9888943412495| 7167.871511762735|2403.4083622753215|\n", "| 1| 1|14264.364158433278| 2050.858854537419|12.544630598354674|\n", "| 1| 2|17842.405873372376|427.54596016089846| 346.6902449005049|\n", "| 1| 3| 9505.131108244657|1881.7501969939058| 273.8686208277227|\n", "| 1| 4|1252.5398136957444|2242.2149219552875| 367.5512280204664|\n", "+-------+--------+------------------+------------------+------------------+\n", "only showing top 5 rows\n", "\n", "root\n", " |-- area_id: long (nullable = true)\n", " |-- house_id: long (nullable = true)\n", " |-- house_worth: double (nullable = true)\n", " |-- house_age: double (nullable = true)\n", " |-- house_size: double (nullable = true)\n", "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
+-------+--------+------------------+------------------+------------------+\n|area_id|house_id| house_worth| house_age| house_size|\n+-------+--------+------------------+------------------+------------------+\n| 1| 0|1991.9888943412495| 7167.871511762735|2403.4083622753215|\n| 1| 1|14264.364158433278| 2050.858854537419|12.544630598354674|\n| 1| 2|17842.405873372376|427.54596016089846| 346.6902449005049|\n| 1| 3| 9505.131108244657|1881.7501969939058| 273.8686208277227|\n| 1| 4|1252.5398136957444|2242.2149219552875| 367.5512280204664|\n+-------+--------+------------------+------------------+------------------+\nonly showing top 5 rows\n\nroot\n |-- area_id: long (nullable = true)\n |-- house_id: long (nullable = true)\n |-- house_worth: double (nullable = true)\n |-- house_age: double (nullable = true)\n |-- house_size: double (nullable = true)\n\n
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_for_sale_data_spark_df.show(5)\n", "houses_for_sale_data_spark_df.printSchema()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "1c369e7b-54f3-4c1d-b0dd-16627bc25182", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_for_sale_data_spark_df.write.parquet(\"abfss://dbexample@hopsworksdbexample.dfs.core.windows.net/house_sales_data\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "da86cd76-9357-41a8-b2d2-414382367531", "showTitle": false, "title": "" } }, "source": [ "##### Generation of `houses_sold_data`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "ca60040b-059c-42b5-a1ca-acb033f982e7", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
", "datasetInfos": [ { "name": "houses_sold_data_spark_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "house_purchase_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "number_of_bidders", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sold_for_amount", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" } ], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "house_purchased_amounts = []\n", "house_purchases_bidders = []\n", "house_purchases_area_ids = []\n", "for i in area_ids:\n", " for j in list(range(1,1000)):\n", " house_purchased_amounts.append(abs(np.random.exponential()*100000)/i)\n", " house_purchases_bidders.append(int(abs(np.random.exponential()*10)/i))\n", " house_purchases_area_ids.append(i)\n", "house_purchase_ids = list(range(len(house_purchases_bidders)))\n", "houses_sold_data = pd.DataFrame({\n", " 'area_id':house_purchases_area_ids,\n", " 'house_purchase_id':house_purchase_ids,\n", " 'number_of_bidders': house_purchases_bidders,\n", " 'sold_for_amount': house_purchased_amounts\n", " })\n", "houses_sold_data_spark_df = sqlContext.createDataFrame(houses_sold_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "e7c6dc46-9bf2-4b25-8016-2096ce369516", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
+-------+-----------------+-----------------+------------------+\n", "|area_id|house_purchase_id|number_of_bidders| sold_for_amount|\n", "+-------+-----------------+-----------------+------------------+\n", "| 1| 0| 5| 71267.36761467403|\n", "| 1| 1| 3| 39689.03803464887|\n", "| 1| 2| 0|33332.809984440915|\n", "| 1| 3| 17|37183.624558190655|\n", "| 1| 4| 1| 99465.23505460238|\n", "+-------+-----------------+-----------------+------------------+\n", "only showing top 5 rows\n", "\n", "root\n", " |-- area_id: long (nullable = true)\n", " |-- house_purchase_id: long (nullable = true)\n", " |-- number_of_bidders: long (nullable = true)\n", " |-- sold_for_amount: double (nullable = true)\n", "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
+-------+-----------------+-----------------+------------------+\n|area_id|house_purchase_id|number_of_bidders| sold_for_amount|\n+-------+-----------------+-----------------+------------------+\n| 1| 0| 5| 71267.36761467403|\n| 1| 1| 3| 39689.03803464887|\n| 1| 2| 0|33332.809984440915|\n| 1| 3| 17|37183.624558190655|\n| 1| 4| 1| 99465.23505460238|\n+-------+-----------------+-----------------+------------------+\nonly showing top 5 rows\n\nroot\n |-- area_id: long (nullable = true)\n |-- house_purchase_id: long (nullable = true)\n |-- number_of_bidders: long (nullable = true)\n |-- sold_for_amount: double (nullable = true)\n\n
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_sold_data_spark_df.show(5)\n", "houses_sold_data_spark_df.printSchema()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "9ded21f8-be1e-45d6-85ce-25decd78675f", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_sold_data_spark_df.write.parquet(\"abfss://dbexample@hopsworksdbexample.dfs.core.windows.net/house_sold_data\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "306eb091-ec3b-4e27-9861-f5e179e34867", "showTitle": false, "title": "" } }, "source": [ "### Generate Features From `houses_for_sale_data`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "d74e2b81-932f-4eff-85b7-4be730c88956", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
", "datasetInfos": [ { "name": "houses_for_sale_data_spark_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "house_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "house_worth", "nullable": true, "type": "double" }, { "metadata": {}, "name": "house_age", "nullable": true, "type": "double" }, { "metadata": {}, "name": "house_size", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "sum_houses_for_sale_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(area_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(house_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(house_worth)", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum(house_age)", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum(house_size)", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "count_houses_for_sale_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "count", "nullable": false, "type": "long" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "sum_count_houses_for_sale_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(area_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(house_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum_house_worth", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum_house_age", "nullable": true, "type": "double" }, { "metadata": {}, "name": 
"sum_house_size", "nullable": true, "type": "double" }, { "metadata": {}, "name": "num_rows", "nullable": false, "type": "long" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "houses_for_sale_features_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "avg_house_age", "nullable": true, "type": "double" }, { "metadata": {}, "name": "avg_house_size", "nullable": true, "type": "double" }, { "metadata": {}, "name": "avg_house_worth", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum_house_age", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum_house_size", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum_house_worth", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" } ], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_for_sale_data_spark_df = spark.read.parquet(\"abfss://dbexample@hopsworksdbexample.dfs.core.windows.net/house_sales_data\")\n", "sum_houses_for_sale_df = houses_for_sale_data_spark_df.groupBy(\"area_id\").sum()\n", "count_houses_for_sale_df = houses_for_sale_data_spark_df.groupBy(\"area_id\").count()\n", "sum_count_houses_for_sale_df = sum_houses_for_sale_df.join(count_houses_for_sale_df, \"area_id\")\n", "sum_count_houses_for_sale_df = sum_count_houses_for_sale_df \\\n", " .withColumnRenamed(\"sum(house_age)\", \"sum_house_age\") \\\n", " .withColumnRenamed(\"sum(house_worth)\", \"sum_house_worth\") \\\n", " .withColumnRenamed(\"sum(house_size)\", \"sum_house_size\") \\\n", " .withColumnRenamed(\"count\", \"num_rows\")\n", "def compute_average_features_house_for_sale(row):\n", " avg_house_worth = row.sum_house_worth/float(row.num_rows)\n", " avg_house_size = row.sum_house_size/float(row.num_rows)\n", " avg_house_age = 
row.sum_house_age/float(row.num_rows)\n", " return Row(\n", " sum_house_worth=row.sum_house_worth, \n", " sum_house_age=row.sum_house_age,\n", " sum_house_size=row.sum_house_size,\n", " area_id = row.area_id,\n", " avg_house_worth = avg_house_worth,\n", " avg_house_size = avg_house_size,\n", " avg_house_age = avg_house_age\n", " )\n", "houses_for_sale_features_df = sum_count_houses_for_sale_df.rdd.map(\n", " lambda row: compute_average_features_house_for_sale(row)\n", ").toDF()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "77afff52-b47f-4813-8e75-4985b0507fc1", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
root\n", " |-- area_id: long (nullable = true)\n", " |-- avg_house_age: double (nullable = true)\n", " |-- avg_house_size: double (nullable = true)\n", " |-- avg_house_worth: double (nullable = true)\n", " |-- sum_house_age: double (nullable = true)\n", " |-- sum_house_size: double (nullable = true)\n", " |-- sum_house_worth: double (nullable = true)\n", "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
root\n |-- area_id: long (nullable = true)\n |-- avg_house_age: double (nullable = true)\n |-- avg_house_size: double (nullable = true)\n |-- avg_house_worth: double (nullable = true)\n |-- sum_house_age: double (nullable = true)\n |-- sum_house_size: double (nullable = true)\n |-- sum_house_worth: double (nullable = true)\n\n
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_for_sale_features_df.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "062d1f98-a2d1-450a-b651-800fdef3cb38", "showTitle": false, "title": "" } }, "source": [ "### Generate Features from `houses_sold_data`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "93696782-4eee-4345-ae39-1cec676053ae", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
", "datasetInfos": [ { "name": "houses_sold_data_spark_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "house_purchase_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "number_of_bidders", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sold_for_amount", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "sum_houses_sold_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(area_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(house_purchase_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(number_of_bidders)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(sold_for_amount)", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "count_houses_sold_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "count", "nullable": false, "type": "long" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" }, { "name": "sum_count_houses_sold_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(area_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum(house_purchase_id)", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum_number_of_bidders", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum_sold_for_amount", "nullable": true, "type": "double" }, { "metadata": {}, "name": "num_rows", "nullable": false, "type": "long" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": 
"pyspark.sql.dataframe.DataFrame" }, { "name": "houses_sold_features_df", "schema": { "fields": [ { "metadata": {}, "name": "area_id", "nullable": true, "type": "long" }, { "metadata": {}, "name": "avg_num_bidders", "nullable": true, "type": "double" }, { "metadata": {}, "name": "avg_sold_for", "nullable": true, "type": "double" }, { "metadata": {}, "name": "sum_number_of_bidders", "nullable": true, "type": "long" }, { "metadata": {}, "name": "sum_sold_for_amount", "nullable": true, "type": "double" } ], "type": "struct" }, "tableIdentifier": null, "typeStr": "pyspark.sql.dataframe.DataFrame" } ], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_sold_data_spark_df = spark.read.parquet(\"abfss://dbexample@hopsworksdbexample.dfs.core.windows.net/house_sold_data\")\n", "sum_houses_sold_df = houses_sold_data_spark_df.groupBy(\"area_id\").sum()\n", "count_houses_sold_df = houses_sold_data_spark_df.groupBy(\"area_id\").count()\n", "sum_count_houses_sold_df = sum_houses_sold_df.join(count_houses_sold_df, \"area_id\")\n", "sum_count_houses_sold_df = sum_count_houses_sold_df \\\n", " .withColumnRenamed(\"sum(number_of_bidders)\", \"sum_number_of_bidders\") \\\n", " .withColumnRenamed(\"sum(sold_for_amount)\", \"sum_sold_for_amount\") \\\n", " .withColumnRenamed(\"count\", \"num_rows\")\n", "def compute_average_features_houses_sold(row):\n", " avg_num_bidders = row.sum_number_of_bidders/float(row.num_rows)\n", " avg_sold_for = row.sum_sold_for_amount/float(row.num_rows)\n", " return Row(\n", " sum_number_of_bidders=row.sum_number_of_bidders, \n", " sum_sold_for_amount=row.sum_sold_for_amount,\n", " area_id = row.area_id,\n", " avg_num_bidders = avg_num_bidders,\n", " avg_sold_for = avg_sold_for\n", " )\n", "houses_sold_features_df = sum_count_houses_sold_df.rdd.map(\n", " lambda row: compute_average_features_houses_sold(row)\n", ").toDF()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "88b15a62-f5d6-4da1-a74d-d10c9ebc422f", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
root\n", " |-- area_id: long (nullable = true)\n", " |-- avg_num_bidders: double (nullable = true)\n", " |-- avg_sold_for: double (nullable = true)\n", " |-- sum_number_of_bidders: long (nullable = true)\n", " |-- sum_sold_for_amount: double (nullable = true)\n", "\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
root\n |-- area_id: long (nullable = true)\n |-- avg_num_bidders: double (nullable = true)\n |-- avg_sold_for: double (nullable = true)\n |-- sum_number_of_bidders: long (nullable = true)\n |-- sum_sold_for_amount: double (nullable = true)\n\n
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "houses_sold_features_df.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "54f6be40-2bde-4063-895c-9db28d3ccc1c", "showTitle": false, "title": "" } }, "source": [ "### Save Features to the Feature Store\n", "\n", "The feature store has an abstraction called a **feature group**: a set of features that naturally belong together and are computed by the same feature engineering job.\n", "\n", "Let's create two feature groups:\n", "\n", "1. `houses_for_sale_featuregroup`\n", "\n", "2. `houses_sold_featuregroup`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "6f8cd3aa-ce28-4a4a-bf89-617c43e3b2e3", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
Out[20]: <hsfs.feature_group.FeatureGroup at 0x7f0e346870f0>
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Out[20]: <hsfs.feature_group.FeatureGroup at 0x7f0e346870f0>
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "# Refer to https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group for the different parameters for creating a feature group\n", "house_sale_fg = fs.create_feature_group(\"houses_for_sale_featuregroup\",\n", " version=1,\n", " description=\"aggregate features of houses for sale per area\",\n", " primary_key=['area_id'],\n", " online_enabled=False,\n", " time_travel_format=None,\n", " statistics_config={\"histograms\": True, \"correlations\": True, \"exact_uniqueness\": True})\n", "\n", "house_sale_fg.save(houses_for_sale_features_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "b1d01f43-e9fd-4546-ab4a-9a450838177a", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
Out[21]: <hsfs.feature_group.FeatureGroup at 0x7f0e34685240>
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Out[21]: <hsfs.feature_group.FeatureGroup at 0x7f0e34685240>
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "house_sold_fg = fs.create_feature_group(\"houses_sold_featuregroup\",\n", " version=1,\n", " description=\"aggregate features of sold houses per area\",\n", " primary_key=['area_id'],\n", " online_enabled=False,\n", " time_travel_format=None,\n", " statistics_config={\"histograms\": True, \"correlations\": True, \"exact_uniqueness\": True})\n", "\n", "house_sold_fg.save(houses_sold_features_df)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "b691790b-9b48-41f9-87c9-de5a2500edf4", "showTitle": false, "title": "" } }, "source": [ "## Create a Training Dataset\n", "\n", "The feature store has an abstraction of a **training dataset**: a set of features (potentially from many different feature groups) and, for supervised learning, labels, stored in an ML-framework-friendly format (CSV, TFRecords, ...).\n", "\n", "Let's create a training dataset called *predict_house_sold_for_dataset* using the following features:\n", "\n", "- `avg_house_age`\n", "- `avg_house_size`\n", "- `avg_house_worth`\n", "- `avg_num_bidders`\n", "\n", "and the target variable is:\n", "\n", "- `avg_sold_for`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "inputWidgets": {}, "nuid": "b1d0d861-028c-4103-8d41-ce6e473fdc1a", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", "
Out[22]: <hsfs.training_dataset.TrainingDataset at 0x7f0e37442d30>
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Out[22]: <hsfs.training_dataset.TrainingDataset at 0x7f0e37442d30>
", "datasetInfos": [], "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "# Join features and feature groups to create the training dataset\n", "feature_query = house_sale_fg.select([\"avg_house_age\", \"avg_house_size\", \"avg_house_worth\"])\\\n", " .join(house_sold_fg.select([\"avg_num_bidders\", \"avg_sold_for\"]))\n", " \n", "# Create the training dataset metadata\n", "td = fs.create_training_dataset(name=\"predict_house_sold_for_dataset\",\n", " version=1,\n", " data_format=\"csv\",\n", " label=['avg_sold_for'],\n", " statistics_config={\"histograms\": True, \"correlations\": True, \"exact_uniqueness\": True})\n", "\n", "# Save the training dataset\n", "td.save(feature_query)" ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookName": "DatabricksSample", "notebookOrigID": 3856973133281905, "widgets": {} }, "kernelspec": { "display_name": "PySpark", "language": "python", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }