# A Guide for Llama3.1 8B-Instruct on Hopsworks

For details about this Large Language Model (LLM), visit the model page in the HuggingFace repository ➡️ [link](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

## 1️⃣ Download Llama3.1 8B-Instruct using the huggingface_hub library

First, we download the Llama3.1 model files (e.g., weights, configuration files) directly from the HuggingFace repository.
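The download call in this step passes `ignore_patterns="original/*"` to `snapshot_download`, so files under the repository's `original/` folder are skipped. As a rough illustration of how a glob pattern of that form selects files, the sketch below applies `fnmatch` from the standard library to a short list of names. The helper `filter_files` and the `original/...` entry are hypothetical and used only for illustration; the real filtering happens inside `huggingface_hub`.

```python
from fnmatch import fnmatch


def filter_files(names, ignore_patterns):
    """Return the names that match none of the ignore patterns."""
    return [n for n in names if not any(fnmatch(n, p) for p in ignore_patterns)]


repo_files = [
    "config.json",
    "model-00001-of-00004.safetensors",
    "tokenizer.json",
    "original/consolidated.00.pth",  # hypothetical name, for illustration only
]

kept = filter_files(repo_files, ["original/*"])
print(kept)
```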


In [3]:
!pip install huggingface_hub --quiet

In [1]:
# Place your HuggingFace token in the HF_TOKEN environment variable

import os
os.environ["HF_TOKEN"] = "<HF_TOKEN>"  # paste your own token here; never commit a real token

# os.environ['http_proxy'] = "http://www-proxy.seli.gic.ericsson.se:8080"
# os.environ['https_proxy'] = "http://www-proxy.seli.gic.ericsson.se:8080"
# os.environ['no_proxy'] = "http://www-proxy.seli.gic.ericsson.se:8080"
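A token written directly into a notebook cell is easy to leak when the notebook is shared. One possible alternative, sketched below, reads the token from the environment and only prompts for it when the variable is missing. The helper name `resolve_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os
from getpass import getpass


def resolve_hf_token(env_var="HF_TOKEN", prompt=getpass):
    """Return the token from the environment, or ask for it interactively."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value, so it does not end up in cell output
        token = prompt(f"Enter value for {env_var}: ").strip()
    if not token:
        raise ValueError(f"{env_var} is empty")
    os.environ[env_var] = token  # expose it to libraries that read this variable
    return token
```

Calling `resolve_hf_token()` once before the download cell would replace the hard-coded assignment.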

In [2]:
from huggingface_hub import snapshot_download

llama31_local_dir = snapshot_download("meta-llama/Llama-3.1-8B-Instruct", ignore_patterns="original/*")

Fetching 14 files: model-00001-of-00004.safetensors (4.98G), model-00002-of-00004.safetensors (5.00G), model-00003-of-00004.safetensors (4.92G), model-00004-of-00004.safetensors (1.17G), model.safetensors.index.json, config.json, generation_config.json, tokenizer.json, tokenizer_config.json, special_tokens_map.json, LICENSE, README.md, USE_POLICY.md, .gitattributes

## 2️⃣ Register Llama3.1 8B-Instruct into Hopsworks Model Registry

Once the model files are downloaded from the HuggingFace repository, we can register them in the Hopsworks Model Registry.
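Before registering, it can be worth checking that the download is complete. The sketch below reads `model.safetensors.index.json`, whose `weight_map` entry maps each tensor name to the shard file that stores it, and reports any shard missing from the local directory. The helper `missing_shards` is a hypothetical name; this is one way to do the check, not a step Hopsworks requires.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """List weight shards named in the index that are absent from model_dir."""
    model_dir = Path(model_dir)
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    # "weight_map" maps each tensor name to the shard file that stores it
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (model_dir / name).exists()]
```

With the variables from this guide, `missing_shards(llama31_local_dir)` returning an empty list would suggest that all weight shards are present.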

In [3]:
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

2025-01-22 12:33:35,229 INFO: Python Engine initialized.

Logged in to project, explore it here https://hopsworks.ai.local/p/119


In [4]:
# The following instantiates a Hopsworks LLM model, not yet saved in the Model Registry

llama31 = mr.llm.create_model(
    name="llama31_instruct",
    description="Llama3.1 8B-Instruct model (via HF)"
)

In [None]:
# Register the Llama model pointing to the local model files

llama31.save(llama31_local_dir)


## 3️⃣ Deploy Llama3.1 8B-Instruct

After registering the LLM in the Model Registry, we can create a deployment that serves it using the vLLM engine.

In [6]:
# Get a reference to the Llama model if not obtained yet

llama31 = mr.get_model("llama31_instruct")




### 🟨 Using vLLM OpenAI server

In [9]:
# vLLM engine arguments can be found here: https://docs.vllm.ai/en/latest/serving/engine_args.html#engine-args
path_to_config_file = f"/Projects/{project.name}/Jupyter/llama31-8b-instruct/llama_vllmconfig.yaml"
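The contents of `llama_vllmconfig.yaml` are not shown in this guide. As a sketch only, the snippet below renders a few vLLM engine arguments as flat `key: value` lines. The option names `max_model_len`, `gpu_memory_utilization`, and `dtype` are vLLM engine arguments, but the values, the key spelling expected in the file, and the helper `render_yaml` are all assumptions to be checked against the engine arguments page linked above.

```python
def render_yaml(options):
    """Render a flat dict of scalar options as simple 'key: value' lines."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only; not taken from the original notebook
engine_args = {
    "max_model_len": 4096,
    "gpu_memory_utilization": 0.9,
    "dtype": "bfloat16",
}

config_text = render_yaml(engine_args)
print(config_text)
```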

In [10]:
llama31_depl = llama31.deploy(
    name="llama31v3",
    description="Llama3.1 8B-Instruct from HuggingFace",
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*12, "gpus": 1}},
)

Deployment created, explore it at https://hopsworks.ai.local/p/119/deployments/1025
Before making predictions, start the deployment by using `.start()`
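As a rough sanity check on the requested resources, the arithmetic below adds up the four weight shard sizes reported during the download (4.98, 5.00, 4.92, and 1.17 GB) and evaluates the `memory` expression used in `resources`. The weights are served from GPU memory, so the GPU needs room for roughly this total plus the KV cache. Treating the `memory` value as megabytes is an assumption, not something stated in this guide.

```python
# Shard sizes in GB, as reported when the model files were downloaded
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]
total_weights_gb = round(sum(shard_sizes_gb), 2)

# Value passed as "memory" in the deployment resources (unit assumed to be MB)
requested_memory = 1024 * 12

print(f"Total weight files: {total_weights_gb} GB")
print(f"Requested memory value: {requested_memory}")
```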


---

In [11]:
# Retrieve one of the deployments created above

ms = project.get_model_serving()
llama31_depl = ms.get_deployment("llama31v3")

In [None]:
llama31_depl.start(await_running=60*15) # wait for 15 minutes maximum


In [None]:
# llama31_depl.stop()

In [None]:
llama31_depl.get_state().describe()
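The `start(await_running=60*15)` call above blocks for up to 15 minutes. If you prefer to poll on your own schedule, a generic wait loop like the one below is one option. `wait_until` is a hypothetical helper, and the predicate you pass in would have to inspect the object returned by `llama31_depl.get_state()`; the exact attribute to check is not shown in this guide.

```python
import time


def wait_until(predicate, timeout_s, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False  # timed out without the condition becoming true
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.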

## 4️⃣ Prompting Llama3.1 8B-Instruct

Once the Llama3.1 deployment is up and running, we can start sending user prompts to the LLM. You can use either an OpenAI API-compatible client (e.g., the openai library) or any other HTTP client.

In [10]:
import os

# Get the istio endpoint from the Llama deployment page in the Hopsworks UI.
istio_endpoint = "<ISTIO_ENDPOINT>" # with format "http://<ip-address>"
    
# Resolve base uri. NOTE: KServe's vLLM server prepends the URIs with /openai
base_uri = "/openai" if llama31_depl.predictor is not None else ""

openai_v1_uri = istio_endpoint + base_uri + "/v1"
completions_url = openai_v1_uri + "/completions" 
chat_completions_url = openai_v1_uri + "/chat/completions"

# Resolve API key for request authentication
if "SERVING_API_KEY" in os.environ:
    # if running inside Hopsworks
    api_key_value = os.environ["SERVING_API_KEY"]
else:
    # Create an API KEY using the Hopsworks UI and place the value below
    api_key_value = "<API_KEY>"
    
# Prepare request headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'ApiKey ' + api_key_value,
    'Host': f"{llama31_depl.name}.{project.name.lower().replace('_', '-')}.hopsworks.ai", # also provided in the Hopsworks UI
}
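The `Host` header and the endpoint URLs in the cell above are built by string concatenation. The sketch below wraps the same expressions in two small functions so the result can be inspected without a running deployment. The function names, the project name `My_Project`, and the address `http://10.0.0.1` are placeholders for illustration.

```python
def build_host_header(deployment_name, project_name, domain="hopsworks.ai"):
    """Mirror the Host header expression used in the headers dict above."""
    project_slug = project_name.lower().replace("_", "-")
    return f"{deployment_name}.{project_slug}.{domain}"


def build_openai_urls(istio_endpoint, base_uri="/openai"):
    """Return the completions and chat completions URLs for an endpoint."""
    v1 = istio_endpoint.rstrip("/") + base_uri + "/v1"
    return {"completions": v1 + "/completions", "chat": v1 + "/chat/completions"}


print(build_host_header("llama31v3", "My_Project"))
print(build_openai_urls("http://10.0.0.1"))
```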

### 🟨 Using httpx

In [11]:
import httpx

In [12]:
#
# Chat Completion for a user message
#

user_message = "Who is the best French painter. Answer with detailed explanations."

completion_request = {
    "model": llama31_depl.name,
    "messages": [
        {
            "role": "user",
            "content": user_message
        }
    ]
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)

print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'llama31v2', 'messages': [{'role': 'user', 'content': 'Who is the best French painter. Answer with detailed explanations.'}]}
2025-01-21 11:46:46,388 INFO: HTTP Request: POST http://57.129.34.241/openai/v1/chat/completions "HTTP/1.1 200 OK"
Choosing the "best" French painter is subjective, as it depends on personal taste and historical context. However, I can provide you with some of the most renowned French painters and highlight their significant contributions to the world of art.

1.  **Claude Monet (1840-1926)**: Monet is arguably one of the most famous French painters of all time. He is a leading figure in the Impressionist movement, which sought to capture the fleeting effects of light and color. Monet's paintings, such as "Impression, Sunrise" (1872), "Water Lilies" (1919), and "The Japanese Footbridge" (1899), are iconic examples of his ability to convey the beauty of everyday life and the effects of light.

2.  **Pierre-Auguste Renoir (1841-1919)
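The cell above indexes straight into `response.json()["choices"]`, which fails with a `KeyError` if the server returns an error body instead of a completion. A slightly more defensive variant is sketched below; `extract_reply` is a hypothetical helper that works on the parsed JSON, and calling `response.raise_for_status()` before parsing is a complementary check that `httpx` provides.

```python
def extract_reply(payload):
    """Return the assistant text from a chat completion payload, or raise."""
    choices = payload.get("choices") or []
    if not choices:
        # Error bodies usually carry no choices, so surface the whole payload
        raise ValueError(f"No choices in response: {payload}")
    return choices[0]["message"]["content"]
```

With the request from this section, that would read `print(extract_reply(response.json()))`.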

In [13]:
#
# Chat Completion for list of messages
#

messages = [{
    "role": "user",
    "content": "Hi! How are you doing today?"
}, {
    "role": "assistant",
    "content": "I'm doing well! How can I help you?",
}, {
    "role": "user",
     "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
}]


completion_request = {
    "model": llama31_depl.name,
    "messages": messages
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)

print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'llama31v2', 'messages': [{'role': 'user', 'content': 'Hi! How are you doing today?'}, {'role': 'assistant', 'content': "I'm doing well! How can I help you?"}, {'role': 'user', 'content': 'Can you tell me what the temperate will be in Dallas, in fahrenheit?'}]}
2025-01-21 11:46:51,036 INFO: HTTP Request: POST http://57.129.34.241/openai/v1/chat/completions "HTTP/1.1 200 OK"
However, I'm a large language model, I don't have real-time access to current or future weather conditions. But I can suggest a few options to help you find the current temperature in Dallas:

1. Check online weather websites: You can check websites like weather.com, accuweather.com, or wunderground.com to get the current temperature in Dallas.
2. Mobile apps: You can download mobile apps like Dark Sky, Weather Underground, or The Weather Channel to get real-time weather information.
3. Check local sources: Visit the official website of the National Weather Service (weather.gov) for th
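To continue a conversation, the assistant's reply is appended to the message list before the next user turn is sent. The sketch below shows that bookkeeping together with two optional request fields, `max_tokens` and `temperature`, which are part of the OpenAI chat completions request format. The helper names are hypothetical.

```python
def build_chat_request(model, messages, max_tokens=None, temperature=None):
    """Build a chat completion request body, adding optional sampling fields."""
    request = {"model": model, "messages": list(messages)}
    if max_tokens is not None:
        request["max_tokens"] = max_tokens
    if temperature is not None:
        request["temperature"] = temperature
    return request


def with_reply(messages, assistant_text, next_user_text):
    """Return a new history extended with the reply and a follow-up question."""
    return list(messages) + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": next_user_text},
    ]
```

The dictionary returned by `build_chat_request` can be passed as the `json=` argument of `httpx.post`, in the same way as `completion_request` above.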

### 🟨 Using OpenAI client

In [16]:
!pip install openai --quiet

In [17]:
from openai import OpenAI

In [18]:
client = OpenAI(
    base_url=openai_v1_uri,
    api_key="X",  # dummy value required by the client; the request is authenticated through `headers`
    default_headers=headers
)
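With the client configured above, responses can also be streamed by passing `stream=True` to `client.chat.completions.create`, in which case the call returns an iterator of chunks whose text lives in `chunk.choices[0].delta.content`. The helper below, a hypothetical name, joins those deltas into one string and is written so it can be tried with stand-in objects.

```python
def collect_stream(chunks, on_text=None):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_text is not None:
                on_text(delta)  # e.g. print each piece as it arrives
    return "".join(parts)
```

A possible usage is `collect_stream(client.chat.completions.create(model=llama31_depl.name, messages=messages, stream=True), on_text=lambda t: print(t, end=""))`, where `messages` is a list of chat messages like the ones used earlier.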

In [22]:
#
# Chat Completion for a user message
#

chat_response = client.chat.completions.create(
    model=llama31_depl.name,
    messages=[
        {"role": "user", "content": "Who is the best French painter. Answer with a short explanations."},
    ]
)

print(chat_response.choices[0].message.content)

2025-01-21 11:49:58,644 INFO: HTTP Request: POST http://57.129.34.241/openai/v1/chat/completions "HTTP/1.1 200 OK"
While this is subjective, one of the most widely regarded as the best French painter is Claude Monet. Monet was a prominent figure in the Impressionist movement, emphasizing capturing light and color in his landscapes, gardens, and water scenes. His works, such as "Impression, Sunrise" (1872) and "Water Lilies" (1919), continue to inspire art enthusiasts worldwide.

However, other notable French painters include:

1. **Pierre-Auguste Renoir**: A fellow Impressionist, known for his vibrant and evocative portraits and landscapes.
2. **Paul Cézanne**: A post-Impressionist painter who redefined the modern art world with his pioneering work on Cubism and an emphasis on structure.
3. **Henri de Toulouse-Lautrec**: A Post-Impressionist master of expressive, detailed, and often whimsical prints and paintings of Parisian city life.
4. **Jean-Honoré Fragonard**: An Rococo painter 