# Get started with S3 and the Feature Store

This tutorial notebook helps you get started with the Hopsworks feature store and S3.

To run this tutorial, upload the [sample data](./data/Sacramentorealestatetransactions.csv) to an S3 bucket.

Before you start, create an S3 storage connector pointing to the bucket where you uploaded the data. The [Hopsworks documentation](https://hopsworks.readthedocs.io/en/latest/featurestore/featurestore.html#configuring-storage-connectors-for-the-feature-store) explains how to create the storage connector from the feature store UI.

The tutorial is divided into three parts:
* [Import already feature engineered data from S3](#already_eng)
* [Import raw data, do feature engineering and create a feature group](#raw)
* [Export training dataset to S3](#training)

## Import already feature engineered data from S3<a name="already_eng"></a>

In this section we assume that the feature engineering process has already happened outside Hopsworks. In other words, the data in S3 is already feature engineered and we only want to import it into the feature store to make it available to data scientists.

To do that we can use the `featurestore` module of the hops Python library. The library is already available in the environment, so you can simply import it. You can find the documentation of the library [here](http://hops-py.logicalclocks.com/hops.html#module-hops.featurestore).

In [None]:
from hops import featurestore

To import the feature data into the feature store we are going to use the following method: `featurestore.import_featuregroup_s3`. 

I called my connector `house-bucket` and I located the file in the `fg` subdirectory. The sample data is in CSV format. The method will infer the schema and the feature names from the file itself. In this case, the first line of the `csv` file contains the feature names.
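
To make the idea of inferring names and types from the file concrete, here is a minimal standalone sketch. It is not how `import_featuregroup_s3` is implemented; it only mimics header-based naming and a simple integer/double/string type guess using the Python standard library. The sample text is a subset of the columns and rows shown later in this notebook, and `infer_column_type` and `infer_schema` are hypothetical helper names.

```python
import csv
import io

# A subset of the columns and rows of the sample file (illustration only).
SAMPLE = """street,city,zip,beds,baths,sq__ft,type,price
3526 HIGH ST,SACRAMENTO,95838,2,1,836,Residential,59222
51 OMAHA CT,SACRAMENTO,95823,3,1,1167,Residential,68212
"""


def infer_column_type(values):
    """Return 'integer', 'double' or 'string' for a list of raw CSV strings."""
    for type_name, cast in (("integer", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Use the first line as feature names and guess a type per column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    return {name: infer_column_type(col) for name, col in zip(header, columns)}


schema = infer_schema(SAMPLE)
for name, type_name in schema.items():
    print(f"{name}: {type_name}")
```

The real import runs on the cluster against the full file, so the types it infers may differ from this simplified guess.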

We are going to store this feature group in the feature store of the project we are currently working in, and it is going to be the first version of the feature group. 

The call below will also compute statistics which will be available from the Hopsworks UI or through the `get_featuregroup_statistics` method.

In [2]:
featurestore.import_featuregroup_s3("house-bucket", "fg", "sacramento_houses_raw", 
                                    description="House sale transactions in Sacramento",
                                    featurestore=featurestore.project_featurestore(),
                                    featuregroup_version=1,
                                    data_format="csv")

computing descriptive statistics for : sacramento_houses_raw, version: 1
computing feature correlation for: sacramento_houses_raw, version: 1
computing feature histograms for: sacramento_houses_raw, version: 1
computing cluster analysis for: sacramento_houses_raw, version: 1
Registering feature metadata...
Registering feature metadata... [COMPLETE]
Writing feature data to offline feature group (Hive)...
Running sql: use demo_featurestore_admin000_featurestore against offline feature store
Writing feature data to offline feature group (Hive)... [COMPLETE]
Feature group created successfully
Feature group imported successfully

In the feature store UI you should now be able to see that the feature group has been created and browse its schema and statistics. You can now use it to [build training datasets](#training).

## Import raw data, do feature engineering and create a feature group<a name="raw"></a>

In this section we assume that the data in the S3 bucket is raw data that needs to be feature engineered before data scientists can use it to build models.

The Hopsworks feature store relies on Apache Spark to provide a scalable framework for feature engineering. Hopsworks allows users to write both PySpark and Scala code. To learn more about working with Spark code in Hopsworks, have a look at the [Apache Spark documentation](https://spark.apache.org/docs/latest/index.html) and the [Hopsworks Jupyter documentation](https://hopsworks.readthedocs.io/en/1.1/user_guide/hopsworks/jupyter.html).

For the sake of the tutorial, in this section we are going to read the CSV file into a dataframe, convert the `type` feature from a string to a categorical numerical feature, and write the new feature group to the feature store.

To instruct Spark to read from S3 we build the path to the file in the bucket. Please note the file system - `s3a://`.

In [3]:
import os

raw_data_path = os.path.join("s3a://", featurestore.get_storage_connector("house-bucket").bucket, 'fg')
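
`os.path.join` joins components with the separator of the operating system it runs on, which is `/` in the Hopsworks environment. If the same code might run elsewhere, a small helper that always joins with `/` avoids surprises. The sketch below is one possible way to do that; `s3a_path` is a hypothetical helper and `my-example-bucket` is a placeholder bucket name, not something provided by the hops library.

```python
def s3a_path(bucket, *parts):
    """Build an s3a:// URI, joining components with '/' regardless of the host OS."""
    segments = [bucket.strip("/")]
    # Drop empty components and stray slashes so the result has no '//' inside.
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "s3a://" + "/".join(segments)


# "my-example-bucket" is a placeholder; in the notebook the name would come from
# featurestore.get_storage_connector("house-bucket").bucket
print(s3a_path("my-example-bucket", "fg"))
```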

In [4]:
raw_data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(raw_data_path)
raw_data.show(5)

+----------------+----------+-----+-----+----+-----+------+-----------+--------------------+-----+---------+-----------+
|          street|      city|  zip|state|beds|baths|sq__ft|       type|           sale_date|price| latitude|  longitude|
+----------------+----------+-----+-----+----+-----+------+-----------+--------------------+-----+---------+-----------+
|    3526 HIGH ST|SACRAMENTO|95838|   CA|   2|    1|   836|Residential|Wed May 21 00:00:...|59222|38.631913|-121.434879|
|     51 OMAHA CT|SACRAMENTO|95823|   CA|   3|    1|  1167|Residential|Wed May 21 00:00:...|68212|38.478902|-121.431028|
|  2796 BRANCH ST|SACRAMENTO|95815|   CA|   2|    1|   796|Residential|Wed May 21 00:00:...|68880|38.618305|-121.443839|
|2805 JANETTE WAY|SACRAMENTO|95815|   CA|   2|    1|   852|Residential|Wed May 21 00:00:...|69307|38.616835|-121.439146|
| 6001 MCMAHON DR|SACRAMENTO|95824|   CA|   2|    1|   797|Residential|Wed May 21 00:00:...|81900| 38.51947|-121.435768|
+----------------+----------+-----+-----+----+-----+------+-----------+--------------------+-----+---------+-----------+

In [5]:
raw_data.printSchema()

root
 |-- street: string (nullable = true)
 |-- city: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- state: string (nullable = true)
 |-- beds: integer (nullable = true)
 |-- baths: integer (nullable = true)
 |-- sq__ft: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- sale_date: string (nullable = true)
 |-- price: integer (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)

In [6]:
from pyspark.sql.functions import monotonically_increasing_id

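# Assign a unique numeric id to each distinct value of the `type` column
index_table = raw_data.select("type").distinct()\
                    .withColumn('type_class', monotonically_increasing_id())

# Attach `type_class` to every row, then drop the original string column
fg_data = raw_data.join(index_table, raw_data.type == index_table.type).drop("type")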
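
The cell above builds a lookup table of distinct `type` values, gives each one a numeric id, and joins it back onto the data. Note that `monotonically_increasing_id` guarantees unique, increasing ids, but not consecutive ones. The following plain-Python sketch shows the same lookup-then-replace idea on a few rows without needing Spark. The helper names `build_index` and `encode` are hypothetical, the ids here are consecutive only because of how this sketch assigns them, and the third row is made up for illustration.

```python
def build_index(values):
    """Map each distinct value to an integer id, in order of first appearance."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


def encode(rows, column, index, new_column):
    """Replace `column` with its integer id stored under `new_column`."""
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != column}  # drop the string column
        out[new_column] = index[row[column]]                  # attach the numeric id
        encoded.append(out)
    return encoded


rows = [
    {"street": "3526 HIGH ST", "type": "Residential"},
    {"street": "51 OMAHA CT", "type": "Residential"},
    {"street": "HYPOTHETICAL AVE", "type": "Condo"},  # made-up row for illustration
]

index = build_index(row["type"] for row in rows)
encoded = encode(rows, "type", index, "type_class")
for row in encoded:
    print(row)
```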

In the next cell we are passing `fg_data` to the `create_featuregroup` method of the `featurestore` module. This is going to create a new feature group based on the schema of the dataframe, insert the data in the feature group itself and compute the statistics.

At the end of the execution, the feature group will be available in the Feature Store UI.

In [7]:
featurestore.create_featuregroup(fg_data, "sacramento_houses_fgeng",
                                 featuregroup_version=1,
                                 description="House sale transactions in Sacramento")

computing descriptive statistics for : sacramento_houses_fgeng, version: 1
computing feature correlation for: sacramento_houses_fgeng, version: 1
computing feature histograms for: sacramento_houses_fgeng, version: 1
computing cluster analysis for: sacramento_houses_fgeng, version: 1
Registering feature metadata...
Registering feature metadata... [COMPLETE]
Writing feature data to offline feature group (Hive)...
Running sql: use demo_featurestore_admin000_featurestore against offline feature store
Writing feature data to offline feature group (Hive)... [COMPLETE]
Feature group created successfully

## Export training dataset to S3<a name="training"></a>

Once the feature groups have been created, you can join them together to build a training dataset to train a machine learning model.

While Hopsworks provides [capabilities](https://hopsworks.readthedocs.io/en/latest/hopsml/index.html) to train and serve machine learning models, training datasets can also be exported to S3 to be used from SageMaker or other ML systems in AWS.

To export the training dataset we are going to use the `create_training_dataset` method which accepts a Spark dataframe.
In this tutorial we are going to create a training dataset containing features from a single feature group. In real-world use cases, features can be extracted from different feature groups by joining them. You can have a look at [this notebook](../FeaturestoreTourPython.ipynb) for some examples.

The data can be exported in multiple formats. In this tutorial we are going to export it in CSV format, but tfrecords, parquet and other formats are available as well.

As with feature groups, statistics are also computed and recorded for training datasets. They will be available in the Feature Store UI at the end of the execution.

In [8]:
td = featurestore.get_featuregroup("sacramento_houses_fgeng", featuregroup_version=1)
featurestore.create_training_dataset(td, "house_price_model_training_data", 
                                     data_format="csv", sink="house-bucket",
                                     path="house_price_model_training_data")

Running sql: use demo_featurestore_admin000_featurestore against offline feature store
SQL string for the query created successfully
Running sql: SELECT * FROM sacramento_houses_fgeng_1 against offline feature store
computing descriptive statistics for : house_price_model_training_data, version: 1
computing feature correlation for: house_price_model_training_data, version: 1
computing feature histograms for: house_price_model_training_data, version: 1
computing cluster analysis for: house_price_model_training_data, version: 1
Training Dataset created successfully