{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: \"Visualizations - PySpark examples\"\n", "date: 2021-02-24\n", "type: technical_note\n", "draft: false\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Visualization\n", "\n", "In this notebook we will visualize the feature statistics stored in the featurestore for featuregroups and training datasets. This notebook assumes that you have already run the featurestore tour and the notebook `FeaturestoreTourPython.ipynb`. \n", "\n", "The following featuregroups should exist in the featurestore:\n", "\n", "- `games_features`\n", "- `attendances_features`\n", "- `players_features`\n", "- `season_scores_features`\n", "- `teams_features`\n", "\n", "And the following training dataset should exist in the featurestore:\n", "\n", "- `team_position_prediction`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting using the PySpark Kernel Background\n", "\n", "When using Jupyter on Hopsworks, a library called sparkmagic is used to interact with the Hops cluster. When you create a Jupyter notebook on Hopsworks, you first select a kernel. A kernel is simply a program that executes the code that you have in the Jupyter cells, you can think of it as a REPL-backend to your jupyter notebook that acts as a frontend.\n", "\n", "Sparkmagic works with a remote REST server for Spark, called livy, running inside the Hops cluster. Livy is an interface that Jupyter-on-Hopsworks uses to interact with the Hops cluster. When you run Jupyter cells using the pyspark kernel, the kernel will automatically send commands to livy in the background for executing the commands on the cluster. \n", "\n", "Since the code in a pyspark notebook is being executed remotely, in the spark cluster, regular python plotting will not work. What you can do however is to use the magic `%%local` to access the local python kernel, or save figures as pngs to HopsFS and plot them locally later, we will go over both approaches in this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "<table><tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>19</td><td>application_1560074891958_0021</td><td>pyspark</td><td>idle</td><td>Link</td><td>Link</td><td>✔</td></tr>