{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: \"1. Feature Exploration\"\n", "date: 2021-02-24\n", "type: technical_note\n", "draft: false\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HSFS feature exploration\n", "\n", "In this notebook we are going to walk through how to use the HSFS library to explore feature groups and features in the Hopsworks Feature Store. \n", "\n", "A key component of the Hopsworks feature store is to enable sharing and re-using of features across models and use cases. As such, the HSFS libraries allows user to join features from different feature groups and use them to create training datasets.\n", "Features can be taken also from different feature stores (projects) as long as the user running the notebook has the read access to those.\n", "\n", "![Join](../images/join.svg \"Join\")\n", "\n", "As for the [feature_engineering](./feature_engineering.ipynb) notebook, the first step is to establish a connection with the feature store and retrieve the project feature store handle." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "\n", "
IDYARN Application IDKindStateSpark UIDriver log
28application_1607613578055_0015pysparkidleLinkLink
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "SparkSession available as 'spark'.\n", "Connected. Call `.close()` to terminate connection gracefully." ] } ], "source": [ "import hsfs\n", "# Create a connection\n", "connection = hsfs.connection()\n", "# Get the feature store handle for the project's feature store\n", "fs = connection.get_feature_store()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore feature groups\n", "\n", "You can interact with the feature groups as if they were Spark dataframe. A feature group object has a `show()` method, to show `n` number of lines, and a `read()` method to read the content of the feature group in a Spark dataframe.\n", "\n", "The first step to do any operation on a feature group is to get its handle from the feature store. The `get_feature_group` method accepts the name of the feature group and an optional parameter with the version you want to select. If you do not provide a version, the APIs will default to version 1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VersionWarning: No version provided for getting feature group `sales_fg`, defaulting to `1`." ] } ], "source": [ "sales_fg = fs.get_feature_group(\"sales_fg\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+\n", "|store|sales_last_six_month_store_dep|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store| date|sales_last_month_store_dep|\n", "+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+\n", "| 20| 0.0| 55| false| 0.0| 0.0| 32362.95| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 94| false| 0.0| 0.0| 63787.83| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 2| false| 0.0| 0.0| 85812.69| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 92| false| 0.0| 0.0| 195223.84| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 12| false| 0.0| 0.0| 6527.56| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+\n", "only showing top 5 rows" ] } ], "source": [ "sales_fg.show(5)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+\n", "|store|sales_last_six_month_store_dep|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store| date|sales_last_month_store_dep|\n", "+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+\n", "| 20| 0.0| 55| false| 0.0| 0.0| 32362.95| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 94| false| 0.0| 0.0| 63787.83| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 2| false| 0.0| 0.0| 85812.69| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 92| false| 0.0| 0.0| 195223.84| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "| 20| 0.0| 12| false| 0.0| 0.0| 6527.56| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|\n", "+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+\n", "only showing top 5 rows\n", "\n", "" ] } ], "source": [ "sales_df = sales_fg.read()\n", "sales_df.filter(\"store == 20\").show(5)\n", "print(type(sales_df))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also inspect the metadata of the feature group. You can, for instance, show the features the feature group is made of and if they are primary or partition keys:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: sales_fg\n", "Description: Sales related features\n", "Features:\n", "store \t Primary: True \t Partition: False\n", "sales_last_six_month_store_dep \t Primary: False \t Partition: False\n", "dept \t Primary: True \t Partition: False\n", "is_holiday \t Primary: False \t Partition: False\n", "sales_last_year_store_dep \t Primary: False \t Partition: False\n", "sales_last_six_month_store \t Primary: False \t Partition: False\n", "weekly_sales \t Primary: False \t Partition: False\n", "sales_last_quarter_store_dep \t Primary: False \t Partition: False\n", "sales_last_month_store \t Primary: False \t Partition: False\n", "sales_last_quarter_store \t Primary: False \t Partition: False\n", "sales_last_year_store \t Primary: False \t Partition: False\n", "date \t Primary: True \t Partition: False\n", "sales_last_month_store_dep \t Primary: False \t Partition: False" ] } ], "source": [ "print(\"Name: {}\".format(sales_fg.name))\n", "print(\"Description: {}\".format(sales_fg.description))\n", "print(\"Features:\")\n", "features = sales_fg.features\n", "for feature in features:\n", " print(\"{:<60} \\t Primary: {} \\t Partition: {}\".format(feature.name, feature.primary, feature.partition))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are interested only in a subset of features, you can use the `select()` method on the feature group object to select a list of features. The `select()` behaves like a feature group, as such, you can call the `.show()` or `.read()` methods on it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----+------------+\n", "|store|dept|weekly_sales|\n", "+-----+----+------------+\n", "| 20| 55| 32362.95|\n", "| 20| 94| 63787.83|\n", "| 20| 2| 85812.69|\n", "| 20| 92| 195223.84|\n", "| 20| 12| 6527.56|\n", "+-----+----+------------+\n", "only showing top 5 rows" ] } ], "source": [ "sales_fg.select(['store', 'dept', 'weekly_sales']).show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your feature group is available both online and offline, you can use the `online` option of the `show()` and `read()` methods to specify if you want to read your feature group from online storage." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----+------------+\n", "|store|dept|weekly_sales|\n", "+-----+----+------------+\n", "| 41| 97| 19861.01|\n", "| 25| 52| 1542.31|\n", "| 40| 42| 5279.49|\n", "| 28| 21| 4933.9|\n", "| 33| 25| 60.5|\n", "+-----+----+------------+\n", "only showing top 5 rows" ] } ], "source": [ "sales_fg_3 = fs.get_feature_group('sales_fg', 3)\n", "\n", "sales_fg_3.select(['store', 'dept', 'weekly_sales']).show(5, online=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter Feature Groups\n", "\n", "If you do not want to read your feature group into a dataframe, you can also filter directly on a `FeatureGroup` object. Applying a filter to a feature group returns a `Query` object which can subsequently be used to be further joined with other feature groups or queries, or it can be materialized as a training dataset." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----+------------+\n", "|store|dept|weekly_sales|\n", "+-----+----+------------+\n", "| 20| 94| 63787.83|\n", "| 20| 2| 85812.69|\n", "| 20| 92| 195223.84|\n", "| 20| 95| 162621.8|\n", "| 20| 8| 89319.61|\n", "+-----+----+------------+\n", "only showing top 5 rows" ] } ], "source": [ "sales_fg.select(['store', 'dept', 'weekly_sales']).filter(sales_fg.weekly_sales >= 50000).show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conjunctions of filters can be constructed using the Python Bitwise Operators `|` and `&` which replace the regular binary operators when working with feature groups and filters." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----+------------+\n", "|store|dept|weekly_sales|\n", "+-----+----+------------+\n", "| 20| 2| 85812.69|\n", "| 20| 2| 67951.33|\n", "| 20| 2| 78321.16|\n", "| 20| 2| 72410.12|\n", "| 20| 2| 81139.43|\n", "+-----+----+------------+\n", "only showing top 5 rows" ] } ], "source": [ "sales_fg.select(['store', 'dept', 'weekly_sales']).filter((sales_fg.weekly_sales >= 50000) & (sales_fg.dept == 2)).show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Join Features and Feature Groups\n", "\n", "HSFS provides an API similar to Pandas to join feature groups together and to select features from different feature groups.\n", "The easies query you can write is by selecting all the features from a feature group and join them with all the features of another feature group.\n", "\n", "You can use the `select_all()` method of a feature group to select all its features. HSFS relies on the Hopsworks feature store to identify which features of the two feature groups to use as joining condition. \n", "If you don't specify anything, Hopsworks will use the largest matching subset of primary keys with the same name.\n", "\n", "In the example below, `sales_fg` has `store`, `dept` and `date` as composite primary key while `exogenous_fg` has only `store` and `date`. So Hopsworks will set as joining condition `store` and `date`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VersionWarning: No version provided for getting feature group `sales_fg`, defaulting to `1`.\n", "VersionWarning: No version provided for getting feature group `exogenous_fg`, defaulting to `1`." ] } ], "source": [ "sales_fg = fs.get_feature_group('sales_fg')\n", "exogenous_fg = fs.get_feature_group('exogenous_fg')\n", "\n", "query = sales_fg.select_all().join(exogenous_fg.select_all())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the query object to create training datasets (see training dataset notebook). You can inspect the query generated by calling the `to_string()` method on it." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg1`.`dept`, `fg1`.`is_holiday`, `fg1`.`sales_last_year_store_dep`, `fg1`.`sales_last_six_month_store`, `fg1`.`store`, `fg1`.`sales_last_six_month_store_dep`, `fg1`.`weekly_sales`, `fg1`.`sales_last_quarter_store_dep`, `fg1`.`sales_last_month_store`, `fg1`.`sales_last_quarter_store`, `fg1`.`sales_last_year_store`, `fg1`.`date`, `fg1`.`sales_last_month_store_dep`, `fg0`.`cpi`, `fg0`.`is_holiday`, `fg0`.`markdown4`, `fg0`.`markdown5`, `fg0`.`unemployment`, `fg0`.`fuel_price`, `fg0`.`markdown1`, `fg0`.`markdown2`, `fg0`.`markdown3`, CASE WHEN `fg0`.`appended_feature` IS NULL THEN 10.0 ELSE `fg0`.`appended_feature` END `appended_feature`, `fg0`.`temperature`\n", "FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg1`\n", "INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`" ] } ], "source": [ "print(query.to_string())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As for the feature groups, you can call the `show()` method to inspect the data before generating a training dataset from it. Or you can call the `read()` method to get a Spark DataFrame with the result of the query and apply additional transformations to it." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+\n", "|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|store|sales_last_six_month_store_dep|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store| date|sales_last_month_store_dep| cpi|is_holiday|markdown4|markdown5|unemployment|fuel_price|markdown1|markdown2|markdown3|appended_feature|temperature|\n", "+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+\n", "| 55| false| 0.0| 0.0| 20| 0.0| 32362.95| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|\n", "| 94| false| 0.0| 0.0| 20| 0.0| 63787.83| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|\n", "| 2| false| 0.0| 0.0| 20| 0.0| 85812.69| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|\n", "| 92| false| 0.0| 0.0| 20| 0.0| 195223.84| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|\n", "| 12| false| 0.0| 0.0| 20| 0.0| 6527.56| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|\n", "+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+\n", "only showing top 5 rows" ] } ], "source": [ "query.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As for the `show()` and `read()` method of the feature group, even in the case of a query you can specify against which storage to run the query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select only a subset of features\n", "\n", "You can replace the `select_all()` method with the `select([])` method to be able to select only a subset of features from a feature group you want to join:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----+------------+----------+\n", "|store|dept|weekly_sales|fuel_price|\n", "+-----+----+------------+----------+\n", "| 20| 55| 32362.95| 2.784|\n", "| 20| 94| 63787.83| 2.784|\n", "| 20| 2| 85812.69| 2.784|\n", "| 20| 92| 195223.84| 2.784|\n", "| 20| 12| 6527.56| 2.784|\n", "+-----+----+------------+----------+\n", "only showing top 5 rows" ] } ], "source": [ "query = sales_fg.select(['store', 'dept', 'weekly_sales'])\\\n", " .join(exogenous_fg.select(['fuel_price']))\n", "query.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overwrite the joining key\n", "\n", "If your feature groups don't have a primary key, or if they have different names or if you want to overwrite the joining key, you can pass it as a parameter of the join.\n", "\n", "As in Pandas, if the feature has the same name on both feature groups, then you can use the `on=[]` paramter. If they have different names, then you can use the `left_on=[]` and `right_on=[]` paramters:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----+------------+----------+\n", "|store|dept|weekly_sales|fuel_price|\n", "+-----+----+------------+----------+\n", "| 20| 55| 32362.95| 2.784|\n", "| 20| 55| 32362.95| 2.666|\n", "| 20| 55| 32362.95| 2.572|\n", "| 20| 55| 32362.95| 2.962|\n", "| 20| 55| 32362.95| 2.58|\n", "+-----+----+------------+----------+\n", "only showing top 5 rows" ] } ], "source": [ "query = sales_fg.select(['store', 'dept', 'weekly_sales'])\\\n", " .join(exogenous_fg.select(['fuel_price']), on=['date'])\n", "query.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overwriting the join type\n", "\n", "By default, the join type between two feature groups is `INNER JOIN`. You can overwrite this behavior by passing the `join_type` parameter to the join method. Valid types are: `INNER, LEFT, RIGHT, FULL, CROSS, LEFT_SEMI_JOIN, COMMA`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg1`.`store`, `fg1`.`dept`, `fg1`.`weekly_sales`, `fg0`.`fuel_price`\n", "FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg1`\n", "LEFT JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`" ] } ], "source": [ "query = sales_fg.select(['store', 'dept', 'weekly_sales'])\\\n", " .join(exogenous_fg.select(['fuel_price']), join_type=\"left\")\n", "\n", "print(query.to_string())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Join mulitple feature groups\n", "\n", "You can concatenate as many feature gropus as you wish. In the example below the order of execution will be:\n", "\n", " (`sales_fg` <> `store_fg`) <> `exogenous_fg`\n", "\n", "The join paramers you pass in each `join()` method call apply to that specific join. This means that you can concatenate left and right joins.\n", "Please be aware that currently HSFS **does not support** nested join such as: \n", "\n", " `sales_fg` <> (`store_fg` <> `exogenous_fg`)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg2`.`dept`, `fg2`.`is_holiday`, `fg2`.`sales_last_year_store_dep`, `fg2`.`sales_last_six_month_store`, `fg2`.`store`, `fg2`.`sales_last_six_month_store_dep`, `fg2`.`weekly_sales`, `fg2`.`sales_last_quarter_store_dep`, `fg2`.`sales_last_month_store`, `fg2`.`sales_last_quarter_store`, `fg2`.`sales_last_year_store`, `fg2`.`date`, `fg2`.`sales_last_month_store_dep`, `fg0`.`type`, `fg0`.`size`, `fg0`.`num_depts`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`\n", "FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg2`\n", "INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`\n", "INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`date` = `fg1`.`date` AND `fg2`.`store` = `fg1`.`store`\n", "VersionWarning: No version provided for getting feature group `store_fg`, defaulting to `1`." ] } ], "source": [ "store_fg = fs.get_feature_group(\"store_fg\")\n", "\n", "query = sales_fg.select_all()\\\n", " .join(store_fg.select_all())\\\n", " .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))\n", "\n", "print(query.to_string())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Joins together with Filters\n", "\n", "It is possible to use filters in any of the subqueries of a joined query." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT `fg2`.`dept`, `fg2`.`is_holiday`, `fg2`.`sales_last_year_store_dep`, `fg2`.`sales_last_six_month_store`, `fg2`.`store`, `fg2`.`sales_last_six_month_store_dep`, `fg2`.`weekly_sales`, `fg2`.`sales_last_quarter_store_dep`, `fg2`.`sales_last_month_store`, `fg2`.`sales_last_quarter_store`, `fg2`.`sales_last_year_store`, `fg2`.`date`, `fg2`.`sales_last_month_store_dep`, `fg0`.`type`, `fg0`.`size`, `fg0`.`num_depts`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`\n", "FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg2`\n", "INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`\n", "INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`date` = `fg1`.`date` AND `fg2`.`store` = `fg1`.`store`\n", "WHERE `fg2`.`weekly_sales` >= 50000 AND `fg1`.`fuel_price` <= 2.7" ] } ], "source": [ "query = sales_fg.select_all()\\\n", " .join(store_fg.select_all())\\\n", " .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']).filter(exogenous_fg.fuel_price <= 2.7)) \\\n", " .filter(sales_fg.weekly_sales >= 50000)\n", "\n", "print(query.to_string())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Free hand query\n", "\n", "With HSFS you are free of writing skipping entirely the Hopsworks query constructor and write your own query. This functionality can be useful if you need to express more complex queries for your use case. `fs.sql` returns a Spark Dataframe." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame[store: int, type: string, size: int, num_depts: bigint]" ] } ], "source": [ "fs.sql(\"SELECT * FROM `store_fg_1`\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "PySpark", "language": "python", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }