Deploy an LLM for AI Inferencing with App Platform for LKE
LLMs (large language models) are deep-learning models that are pre-trained on vast amounts of information. AI inferencing is the process by which a trained AI model (such as an LLM) applies what it learned during training to new input in order to “infer” and deliver accurate information. The LLM used in this deployment, Meta AI’s Llama 3, is an open-source, pre-trained LLM often used for tasks like responding to questions in multiple languages, coding, and advanced reasoning.
KServe is a standard Model Inference Platform for Kubernetes, built for highly-scalable use cases. KServe comes with multiple Model Serving Runtimes, including the Hugging Face serving runtime. The Hugging Face runtime supports the following machine learning (ML) tasks: text generation, Text2Text generation, token classification, sequence and text classification, and fill mask.
Akamai App Platform for LKE comes with a set of preconfigured and integrated open source Kubernetes applications like Istio and Knative, both of which are prerequisites for using KServe. App Platform automates the provisioning process of these applications.
This guide describes the steps required to: install KServe with Akamai App Platform for LKE, deploy Meta AI’s Llama 3 model using the Hugging Face service runtime, and deploy a chatbot using Open WebUI. Once functional, use our Deploy a RAG Pipeline and Chatbot with App Platform for LKE guide to configure an additional LLM trained on a custom data set.
If you prefer to manually install an LLM and RAG Pipeline on LKE rather than using Akamai App Platform, see our Deploy a Chatbot and RAG Pipeline for AI Inferencing on LKE guide.
Diagram
Components
Infrastructure
Linode GPUs (NVIDIA RTX 4000): Akamai has several GPU virtual machines available, including NVIDIA RTX 4000 (used in this tutorial) and Quadro RTX 6000. NVIDIA’s Ada Lovelace architecture in the RTX 4000 VMs is well suited to many AI tasks, including inferencing and image generation.
Linode Kubernetes Engine (LKE): LKE is Akamai’s managed Kubernetes service, enabling you to deploy containerized applications without needing to build out and maintain your own Kubernetes cluster.
App Platform for LKE: A Kubernetes-based platform that combines developer and operations-centric tools, automation, self-service, and management of containerized application workloads. App Platform for LKE streamlines the application lifecycle from development to delivery and connects numerous CNCF (Cloud Native Computing Foundation) technologies in a single environment, allowing you to construct a bespoke Kubernetes architecture.
Software
Open WebUI: A self-hosted AI chatbot application that’s compatible with LLMs like Llama 3 and includes a built-in inference engine for RAG (Retrieval-Augmented Generation) solutions. Users interact with this interface to query the LLM.
Hugging Face: A data science platform and open-source library of data sets and pre-trained AI models. A Hugging Face account and access key is required to access the Llama 3 large language model (LLM) used in this deployment.
Meta AI’s Llama 3: The meta-llama/Meta-Llama-3-8B model is used as the LLM in this guide. You must review and agree to the licensing agreement before deploying.
KServe: Serves machine learning models. This tutorial installs the Llama 3 LLM to KServe, which then serves it to other applications, such as the chatbot UI.
Istio: An open source service mesh used for securing, connecting, and monitoring microservices.
Knative: Used for deploying and managing serverless workloads on the Kubernetes platform.
Kyverno: A comprehensive toolset used for managing the Policy-as-Code (PaC) lifecycle for Kubernetes.
Prerequisites
A Cloud Manager account is required to use Akamai’s cloud computing services, including LKE.
A Hugging Face account is used for pulling Meta AI’s Llama 3 model.
Access granted to Meta AI’s Llama 3 model is required. To request access, navigate to Hugging Face’s Llama 3-8B Instruct LLM link, read and accept the license agreement, and submit your information.
Enrollment in the Akamai App Platform beta program.
A provisioned and configured LKE cluster with App Platform enabled. We recommend an LKE cluster consisting of at least 3 RTX4000 Ada x1 Medium GPU plans.
To learn more about provisioning an LKE cluster with App Platform, see our Getting Started with App Platform for LKE guide.
Set Up Infrastructure
Once your LKE cluster is provisioned and the App Platform web UI is available, complete the following steps to continue setting up your infrastructure.
Sign into the App Platform web UI using the platform-admin account, or another account that uses the platform-admin role. Instructions for signing into App Platform for the first time can be found in our Getting Started with Akamai App Platform guide.
Enable Knative
Select view > platform in the top bar.
Select Apps in the left menu.
Enable the Knative and Kyverno apps by hovering over each app icon and clicking the power on button. It may take a few minutes for the apps to enable.
Enabled apps move up and appear in color towards the top of the available app list.
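To optionally verify from the command line that both apps finished installing, you can list their pods from a shell session (the Shell feature used later in this guide). The namespaces below are the upstream defaults for Knative Serving and Kyverno and are an assumption; App Platform may install these components into different namespaces.
# Both commands should show pods in a Running state once the apps are enabled (namespaces assumed)
kubectl get pods -n knative-serving
kubectl get pods -n kyverno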
Create a New Team
Teams are isolated tenants on the platform to support Development/DevOps teams, projects or even DTAP. A Team gets access to the Console, including access to self-service features and all shared apps available on the platform.
Select view > platform.
Select Teams in the left menu.
Click Create Team.
Provide a Name for the Team. Keep all other default values, and click Submit. This guide uses the Team name demo.
Install the NVIDIA GPU Operator
The NVIDIA GPU Operator automates the management of NVIDIA software components needed for provisioning the GPUs, including drivers, the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, and others.
Select view > team and team > admin in the top bar.
Select Shell in the left menu. Wait for the shell session to load.
In the provided shell session, install the NVIDIA GPU operator using Helm:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v24.9.1
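Optionally, verify that the GPU Operator finished deploying and that the cluster’s GPUs are advertised to Kubernetes. This is a minimal check using standard NVIDIA GPU Operator resource names:
# Confirm the GPU Operator pods are running
kubectl get pods -n gpu-operator
# Each GPU node should now report an allocatable nvidia.com/gpu resource
kubectl describe nodes | grep -i "nvidia.com/gpu"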
Add the kserve-crd Helm Chart to the Catalog
Helm charts provide information for defining, installing, and managing resources on a Kubernetes cluster. Custom Helm charts can be added to the App Platform Catalog using the Add Helm Chart feature.
Click on Catalog in the left menu.
Select Add Helm Chart.
Under Git Repository URL, add the URL to the kserve-crd Helm chart: https://github.com/kserve/kserve/blob/v0.14.1/charts/kserve-crd/Chart.yaml
Click Get Details to populate the kserve-crd Helm chart details.
Optional: To add a Catalog icon, use an image URL in the Icon URL field to add an icon to your custom Helm chart in the Catalog.
Deselect Allow teams to use this chart.
Click Add Chart.
Create a Workload for the kserve-crd Helm Chart
A Workload is a self-service feature for creating Kubernetes resources using Helm charts from the Catalog.
Select view > team and team > admin in the top bar.
Select Workloads.
Click on Create Workload.
Select the Kserve-Crd Helm chart from the Catalog.
Click on Values.
Provide a name for the Workload. This guide uses the Workload name kserve-crd.
Add kserve as the namespace.
Select Create a new namespace.
Continue with the rest of the default values, and click Submit.
After the Workload is submitted, App Platform creates an Argo CD application to install the kserve-crd Helm chart. Wait for the Status of the Workload to become healthy, as represented by a green check mark. This may take a few minutes.
Click on the ArgoCD Application link once the Workload is ready. You should be brought to the Argo CD screen in a separate window.
Confirm the App Health is marked “Healthy”, and return to the App Platform UI.
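As an optional check, the kserve-crd Workload installs the KServe custom resource definitions, including InferenceService. You can confirm they exist from a shell session:
# Lists the CRDs in the serving.kserve.io API group installed by the kserve-crd chart
kubectl get crd | grep serving.kserve.io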
Add the kserve-resources Helm Chart to the Catalog
Click on Catalog in the left menu.
Select Add Helm Chart.
Under Git Repository URL, add the URL to the kserve-resources Helm chart: https://github.com/kserve/kserve/blob/v0.14.1/charts/kserve-resources/Chart.yaml
Click Get Details to populate the kserve-resources Helm chart details. Note that the name of the Helm chart populates as Kserve rather than Kserve-Resources. Edit Target Directory Name to read Kserve-Resources so that it can be identified later.
Deselect Allow teams to use this chart.
Click Add Chart.
Create a Workload for the kserve-resources Helm Chart
Select view > team and team > admin in the top bar.
Select Workloads.
Click on Create Workload.
Select the Kserve-Resources Helm chart from the Catalog.
Click on Values.
Provide a name for the Workload. This guide uses the Workload name kserve-resources.
Add kserve as the namespace.
Select Create a new namespace.
Continue with the default values, and click Submit. The Workload may take a few minutes to become ready.
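Optionally, confirm from a shell session that the KServe controller is running in the kserve namespace created by the Workload:
# The kserve-resources chart deploys the KServe controller manager into the kserve namespace
kubectl get pods -n kserve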
Add the open-webui Helm Chart to the Catalog
Click on Catalog in the left menu.
Select Add Helm Chart.
Under Git Repository URL, add the URL to the open-webui Helm chart: https://github.com/open-webui/helm-charts/blob/open-webui-5.20.0/charts/open-webui/Chart.yaml
Click Get Details to populate the open-webui Helm chart details.
Leave the Allow teams to use this chart option selected.
Click Add Chart.
Add the inferencing-service Helm Chart to the Catalog
Click on Catalog in the left menu.
Select Add Helm Chart.
Under Git Repository URL, add the URL to the inferencing-service Helm chart: https://github.com/linode/apl-examples/blob/main/inferencing-service/Chart.yaml
Click Get Details to populate the inferencing-service Helm chart details.
Leave the Allow teams to use this chart option selected.
Click Add Chart.
Create a Hugging Face Access Token
Navigate to the Hugging Face Access Tokens page.
Click Create new token.
Under Token type, select “Write” access.
Enter a name for your token, and click Create token.
Save your Access Token information.
See the Hugging Face user documentation on User access tokens for additional information.
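If you want to confirm the token works before adding it to the platform, you can query Hugging Face’s whoami endpoint from any machine with curl, substituting your token for HUGGING_FACE_TOKEN:
# Returns your Hugging Face account details if the token is valid
curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2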
Request Access to Llama 3
If you haven’t done so already, request access to the Llama 3 model. To do this, go to Hugging Face’s Llama 3-8B Instruct LLM link, read and agree to the license agreement, and submit your information. You must wait for access to be granted in order to proceed.
Deploy and Expose the Model
Create a Sealed Secret
Sealed Secrets are encrypted Kubernetes Secrets stored in the Values Git repository. When a Sealed Secret is created in the Console, the Kubernetes Secret will appear in the Team’s namespace.
Select view > team and team > demo in the top bar.
Select Sealed Secrets from the menu.
Click Create SealedSecret.
Add the name hf-secret.
Select type kubernetes.io/opaque from the type dropdown menu.
Add Key: HF_TOKEN.
Add your Hugging Face Access Token in the Value field: HUGGING_FACE_TOKEN
Click Submit. The Sealed Secret may take a few minutes to become ready.
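Once the Sealed Secret is ready, the corresponding Kubernetes Secret should exist in the Team’s namespace. You can optionally verify this from a shell session; the team-demo namespace below is an assumption based on the Team name demo used in this guide:
# The unsealed Secret is created from the SealedSecret by the sealed-secrets controller
kubectl get secret hf-secret -n team-demo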
Create a Workload to Deploy the Model
Select view > team and team > demo in the top bar.
Select Catalog from the menu.
Select the Kserve-Ai-Inferencing-Service chart.
Click on Values.
Provide a name for the Workload. This guide uses the Workload name llama3-model.
Set the following values to disable sidecar injection, define your Hugging Face token, and specify resource limits:
labels:
  sidecar.istio.io/inject: "false"
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: HF_TOKEN
        optional: "false"
args:
  - --model_name=llama3
  - --model_id=meta-llama/meta-llama-3-8b-instruct
resources:
  limits:
    cpu: "12"
    memory: 24Gi
    nvidia.com/gpu: "1"
  requests:
    cpu: "6"
    memory: 12Gi
    nvidia.com/gpu: "1"
Click Submit.
Check the Status of Your Workload
It may take a few minutes for the Kserve-Ai-Inferencing-Service Workload to become ready. To check the status of the Workload build, open a shell session by selecting Shell in the left menu, and use the following command to check the status of the pods with kubectl:
kubectl get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
llama3-model-predictor-00001-deployment-86f5fc5d5d-7299c   0/2     Pending   0          4m22s
tekton-dashboard-5f57787b8c-gswc2                           2/2     Running   0          19h
To gather more information about a pod in a Pending state, run the kubectl describe pod command below, replacing POD_NAME with the name of your pod. In the output above, llama3-model-predictor-00001-deployment-86f5fc5d5d-7299c is the name of the pending pod:
kubectl describe pod POD_NAME
Scroll to the bottom of the output and look for Events. If there is an event with Reason FailedScheduling, the resources.requests values in your Kserve-Ai-Inferencing-Service Workload may need to be adjusted.
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
Based on the output above, the Insufficient cpu warning denotes that the CPU resources.requests value is set too high. If this is the case, edit the resources.requests values for your Kserve-Ai-Inferencing-Service Workload:
Navigate to Workloads.
Select your llama3-model Workload.
Click the Values tab.
Adjust the necessary resources.requests value. In the example above, the number of CPUs should be lowered.
Click Submit when you have finished adjusting your resources values.
Wait for the Workload to be ready again, and proceed to the following steps for exposing the model.
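Because this Workload deploys the model as a KServe InferenceService, you can also check readiness directly from a shell session. The namespace below is an assumption based on the Team name demo used in this guide:
# READY should show True once the predictor pod has started and loaded the model
kubectl get inferenceservices -n team-demo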
Expose the Model
Select Services from the menu.
Click Create Service.
In the Name dropdown list, select the llama3-model-predictor service.
Under Exposure (ingress), select External.
Click Submit.
Once the Service is ready, copy the URL for the llama3-model-predictor service, and add it to your clipboard.
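Optionally, send a test request to the exposed model before deploying the chatbot. The KServe Hugging Face runtime serves an OpenAI-compatible API, so a chat completion request against the /openai/v1 path should return a generated reply. Replace the placeholder hostname with the llama3-model-predictor URL you just copied; the exact paths and payload fields below reflect the KServe v0.14 Hugging Face runtime and may differ in other versions:
curl -s https://LLAMA3_MODEL_PREDICTOR_URL/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'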
Deploy and Expose the AI Interface
The publicly-exposed LLM in this guide uses a wide range of ports, and as a result, all pods in a Team are automatically injected with an Istio sidecar. Sidecar injection is a means of adding additional containers and their configurations to a pod template.
The Istio sidecar in this case prevents the open-webui pod from connecting to the llama3-model service, because all egress traffic for pods in the Team namespace is blocked by an Istio ServiceEntry by default. This means that prior to deploying the AI interface using the open-webui Helm chart, the open-webui pod must be prevented from getting the Istio sidecar.
Since the open-webui Helm chart does not allow for the addition of extra labels, there are two workarounds:
Adjust the open-webui Helm chart in the chart’s Git repository. This is the Git repository where the open-webui Helm chart was stored when it was added to the Catalog.
Add a Kyverno Policy that mutates the open-webui pod so that it has the sidecar.istio.io/inject: "false" label.
Follow the steps below to implement the second option and add the Kyverno policy.
In the Apps section, select the Gitea app.
Navigate to the team-demo-argocd repository.
Click the Add File dropdown, and select New File. Create a file named open-webui-policy.yaml with the following contents:
apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: disable-sidecar-injection
  annotations:
    policies.kyverno.io/title: Disable Istio sidecar injection
spec:
  rules:
    - name: disable-sidecar-injection
      match:
        any:
          - resources:
              kinds:
                - StatefulSet
                - Deployment
              selector:
                matchLabels:
                  ## change the value to match the name of the Workload
                  app.kubernetes.io/instance: "llama3-ui"
      mutate:
        patchStrategicMerge:
          spec:
            template:
              metadata:
                labels:
                  sidecar.istio.io/inject: "false"
Optionally add a title and any notes to the change history, and click Commit Changes.
Check to see if the policy has been created in Argo CD:
Go to Apps, and open the Argocd application.
Using the search feature, go to the team-demo application to see if the policy has been created. If it isn’t there yet, view the team-demo application in the list of Applications, and click Refresh as needed.
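You can also confirm from a shell session that the Policy was synced to the cluster; the namespace below is an assumption based on the Team name demo:
# The disable-sidecar-injection policy should appear once Argo CD has synced the file
kubectl get policies.kyverno.io -n team-demo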
Create a Workload to Deploy the AI Interface
Select view > team and team > demo in the top bar.
Select Catalog from the menu.
Select the Open-Webui chart.
Click on Values.
Provide a name for the Workload. This guide uses the Workload name llama3-ui.
Add the following values, and change the openaiBaseApiUrl to the host and domain name you added to your clipboard when exposing the model (the URL for the llama3-model-predictor service). Make sure to append /openai/v1 to your URL as shown below. Remember to change the nameOverride value to the name of your Workload, llama3-ui:
# Change the nameOverride to match the name of the Workload
nameOverride: llama3-ui
ollama:
  enabled: false
pipelines:
  enabled: false
replicaCount: 1
persistence:
  enabled: false
openaiBaseApiUrl: https://llama3-model--predictor-team-demo.<cluster-domain>/openai/v1
extraEnvVars:
  - name: WEBUI_AUTH
    value: "false"
Click Submit.
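After the llama3-ui pod starts, you can optionally confirm that the Kyverno policy prevented sidecar injection: the pod should report one container (READY 1/1) rather than an extra istio-proxy container. The label selector and namespace below assume the Workload name llama3-ui and the Team name demo:
# A READY value of 1/1 indicates the Istio sidecar was not injected
kubectl get pods -n team-demo -l app.kubernetes.io/instance=llama3-ui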
Expose the AI Interface
Select Services from the menu.
Click Create Service.
In the Name dropdown menu, select the llama3-ui service.
Under Exposure (ingress), select External.
Click Submit.
Access the Open Web User Interface
Once the AI user interface is ready, you should be able to access the web UI for the Open WebUI chatbot.
Click on Services in the menu.
In the list of available services, click on the URL for the llama3-ui service. This should bring you to the chatbot user interface.
Next Steps
See our Deploy a RAG Pipeline and Chatbot with App Platform for LKE guide to expand on the architecture built in this guide. This tutorial deploys a RAG (Retrieval-Augmented Generation) pipeline that indexes a custom data set and attaches relevant data as context when users send the LLM queries.