Deploy an LLM for AI Inferencing with App Platform for LKE

Beta Notice
The Akamai App Platform is now available as a limited beta. It is not recommended for production workloads. To register for the beta, visit the Betas page in the Cloud Manager and click the Sign Up button next to the Akamai App Platform Beta.

LLMs (large language models) are deep-learning models that are pre-trained on vast amounts of data. AI inferencing is the process by which a trained model (such as an LLM) applies what it has learned to new input in order to “infer” and deliver an accurate response. The LLM used in this deployment, Meta AI’s Llama 3, is an open-source, pre-trained LLM often used for tasks like responding to questions in multiple languages, coding, and advanced reasoning.

KServe is a standard Model Inference Platform for Kubernetes, built for highly-scalable use cases. KServe comes with multiple Model Serving Runtimes, including the Hugging Face serving runtime. The Hugging Face runtime supports the following machine learning (ML) tasks: text generation, Text2Text generation, token classification, sequence and text classification, and fill mask.

Akamai App Platform for LKE comes with a set of preconfigured and integrated open source Kubernetes applications like Istio and Knative, both of which are prerequisites for using KServe. App Platform automates the provisioning process of these applications.

This guide describes the steps required to: install KServe with Akamai App Platform for LKE, deploy Meta AI’s Llama 3 model using the Hugging Face service runtime, and deploy a chatbot using Open WebUI. Once functional, use our Deploy a RAG Pipeline and Chatbot with App Platform for LKE guide to configure an additional LLM trained on a custom data set.

If you prefer to manually install an LLM and RAG Pipeline on LKE rather than using Akamai App Platform, see our Deploy a Chatbot and RAG Pipeline for AI Inferencing on LKE guide.

Diagram

Components

Infrastructure

  • Linode GPUs (NVIDIA RTX 4000): Akamai has several GPU virtual machines available, including NVIDIA RTX 4000 (used in this tutorial) and Quadro RTX 6000. The NVIDIA Ada Lovelace architecture in the RTX 4000 VMs is adept at many AI tasks, including inferencing and image generation.

  • Linode Kubernetes Engine (LKE): LKE is Akamai’s managed Kubernetes service, enabling you to deploy containerized applications without needing to build out and maintain your own Kubernetes cluster.

  • App Platform for LKE: A Kubernetes-based platform that combines developer and operations-centric tools, automation, self-service, and management of containerized application workloads. App Platform for LKE streamlines the application lifecycle from development to delivery and connects numerous CNCF (Cloud Native Computing Foundation) technologies in a single environment, allowing you to construct a bespoke Kubernetes architecture.

Software

  • Open WebUI: A self-hosted AI chatbot application that’s compatible with LLMs like Llama 3 and includes a built-in inference engine for RAG (Retrieval-Augmented Generation) solutions. Users interact with this interface to query the LLM.

  • Hugging Face: A data science platform and open-source library of data sets and pre-trained AI models. A Hugging Face account and access key is required to access the Llama 3 large language model (LLM) used in this deployment.

  • Meta AI’s Llama 3: The meta-llama/Meta-Llama-3-8B model is used as the LLM in this guide. You must review and agree to the licensing agreement before deploying.

  • KServe: Serves machine learning models. This tutorial installs the Llama 3 LLM to KServe, which then serves it to other applications, such as the chatbot UI.

  • Istio: An open source service mesh used for securing, connecting, and monitoring microservices.

  • Knative: Used for deploying and managing serverless workloads on the Kubernetes platform.

  • Kyverno: A comprehensive toolset used for managing the Policy-as-Code (PaC) lifecycle for Kubernetes.

Prerequisites

  • A Cloud Manager account is required to use Akamai’s cloud computing services, including LKE.

  • A Hugging Face account is used for pulling Meta AI’s Llama 3 model.

  • Access granted to Meta AI’s Llama 3 model is required. To request access, navigate to Hugging Face’s Llama 3-8B Instruct LLM link, read and accept the license agreement, and submit your information.

  • Enrollment into the Akamai App Platform’s beta program.

  • A provisioned and configured LKE cluster with App Platform enabled. We recommend an LKE cluster consisting of at least 3 RTX4000 Ada x1 Medium GPU plans.

To learn more about provisioning an LKE cluster with App Platform, see our Getting Started with App Platform for LKE guide.
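
Before continuing, you can optionally confirm that the cluster provisioned with the expected worker nodes. This is a quick sketch that assumes you have the cluster's kubeconfig available locally (downloadable from Cloud Manager); the same check also works later from the App Platform shell.

    # List the cluster's worker nodes; this guide expects three GPU nodes.
    kubectl get nodes -o wide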

Set Up Infrastructure

Once your LKE cluster is provisioned and the App Platform web UI is available, complete the following steps to continue setting up your infrastructure.

Sign into the App Platform web UI using the platform-admin account, or another account that uses the platform-admin role. Instructions for signing into App Platform for the first time can be found in our Getting Started with Akamai App Platform guide.

Enable Knative

  1. Select view > platform in the top bar.

  2. Select Apps in the left menu.

  3. Enable the Knative and Kyverno apps by hovering over each app icon and clicking the power on button. It may take a few minutes for the apps to enable.

    Enabled apps move up and appear in color towards the top of the available app list.
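
To confirm from the command line that both apps finished rolling out, you can check their pods from the App Platform shell (opened later in this guide) or any terminal with the cluster's kubeconfig. The namespace names below are assumptions based on the upstream defaults for Knative Serving and Kyverno and may differ on your cluster:

    # Check that the Knative Serving and Kyverno control planes are running.
    # Namespaces are assumed to follow the upstream defaults.
    kubectl get pods -n knative-serving
    kubectl get pods -n kyverno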

Create a New Team

Teams are isolated tenants on the platform that support development/DevOps teams, projects, or even DTAP (Development, Testing, Acceptance, Production) environments. A Team gets access to the Console, including access to self-service features and all shared apps available on the platform.

  1. Select view > platform.

  2. Select Teams in the left menu.

  3. Click Create Team.

  4. Provide a Name for the Team. Keep all other default values, and click Submit. This guide uses the Team name demo.
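
Creating a Team also provisions resources for it on the cluster. As an optional check, the Team's namespace should now exist; the name team-demo below is an assumption based on the Team name demo used in this guide and the team- naming convention referenced later in this guide:

    # Confirm the Team's namespace was created (name assumed from the Team name "demo").
    kubectl get namespace team-demo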

Install the NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of NVIDIA software components needed for provisioning the GPUs, including drivers, the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, and others.

  1. Select view > team and team > admin in the top bar.

  2. Select Shell in the left menu. Wait for the shell session to load.

  3. In the provided shell session, install the NVIDIA GPU operator using Helm:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v24.9.1
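
Once the install completes, you can optionally verify that the operator's components are running and that the GPU nodes advertise a schedulable GPU resource:

    # Verify the GPU operator components are running.
    kubectl get pods -n gpu-operator

    # Each GPU worker node should now report an allocatable nvidia.com/gpu resource.
    kubectl describe nodes | grep nvidia.com/gpu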

Add the kserve-crd Helm Chart to the Catalog

Helm charts provide information for defining, installing, and managing resources on a Kubernetes cluster. Custom Helm charts can be added to the App Platform Catalog using the Add Helm Chart feature.

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the kserve-crd Helm chart:

    https://github.com/kserve/kserve/blob/v0.14.1/charts/kserve-crd/Chart.yaml
  4. Click Get Details to populate the kserve-crd Helm chart details.

    Optional: Add a Catalog Icon
    Use an image URL in the Icon URL field to optionally add an icon to your custom Helm chart in the Catalog.
  5. Deselect Allow teams to use this chart.

  6. Click Add Chart.

Create a Workload for the kserve-crd Helm Chart

A Workload is a self-service feature for creating Kubernetes resources using Helm charts from the Catalog.

  1. Select view > team and team > admin in the top bar.

  2. Select Workloads.

  3. Click on Create Workload.

  4. Select the Kserve-Crd Helm chart from the Catalog.

  5. Click on Values.

  6. Provide a name for the Workload. This guide uses the Workload name kserve-crd.

  7. Add kserve as the namespace.

  8. Select Create a new namespace.

  9. Continue with the rest of the default values, and click Submit.

After the Workload is submitted, App Platform creates an Argo CD application to install the kserve-crd Helm chart. Wait for the Status of the Workload to become healthy as represented by a green check mark. This may take a few minutes.

Click on the ArgoCD Application link once the Workload is ready. You should be brought to the Argo CD screen in a separate window.

Confirm the App Health is marked “Healthy”, and return to the App Platform UI.
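
As an optional check from the Shell, confirm that the KServe CustomResourceDefinitions were installed; you should see entries such as inferenceservices.serving.kserve.io:

    # List the CRDs installed by the kserve-crd chart.
    kubectl get crd | grep serving.kserve.io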

Add the kserve-resources Helm Chart to the Catalog

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the kserve-resources Helm chart:

    https://github.com/kserve/kserve/blob/v0.14.1/charts/kserve-resources/Chart.yaml
  4. Click Get Details to populate the kserve-resources Helm chart details.

  5. Note that the name of the Helm chart populates as Kserve rather than Kserve-Resources. Edit the Target Directory Name to read Kserve-Resources so that the chart can be identified later.

  6. Deselect Allow teams to use this chart.

  7. Click Add Chart.

Create a Workload for the kserve-resources Helm Chart

  1. Select view > team and team > admin in the top bar.

  2. Select Workloads.

  3. Click on Create Workload.

  4. Select the Kserve-Resources Helm chart from the Catalog.

  5. Click on Values.

  6. Provide a name for the Workload. This guide uses the Workload name kserve-resources.

  7. Add kserve as the namespace.

  8. Select Create a new namespace.

  9. Continue with the default values, and click Submit. The Workload may take a few minutes to become ready.
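
Once the Workload is healthy, the KServe controller should be running in the kserve namespace specified above. An optional check from the Shell:

    # The kserve-resources chart deploys the KServe controller into the kserve namespace.
    kubectl get pods -n kserve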

Add the open-webui Helm Chart to the Catalog

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the open-webui Helm chart:

    https://github.com/open-webui/helm-charts/blob/open-webui-5.20.0/charts/open-webui/Chart.yaml
  4. Click Get Details to populate the open-webui Helm chart details.

  5. Leave the Allow teams to use this chart option selected.

  6. Click Add Chart.

Add the inferencing-service Helm Chart to the Catalog

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the inferencing-service Helm chart:

    https://github.com/linode/apl-examples/blob/main/inferencing-service/Chart.yaml
  4. Click Get Details to populate the inferencing-service Helm chart details.

  5. Leave the Allow teams to use this chart option selected.

  6. Click Add Chart.

Create a Hugging Face Access Token

  1. Navigate to the Hugging Face Access Tokens page.

  2. Click Create new token.

  3. Under Token type, select “Write” access.

  4. Enter a name for your token, and click Create token.

  5. Save your Access Token information.

See the Hugging Face user documentation on User access tokens for additional information.
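
If you want to verify the token before using it, you can call the Hugging Face whoami endpoint. Replace HUGGING_FACE_TOKEN with the token you just created; this is an optional sanity check:

    # Confirm the access token is valid; the response includes your account name.
    curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2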

Request Access to Llama 3

If you haven’t already done so, request access to the Llama 3 model. To do this, go to Hugging Face’s Llama 3-8B Instruct LLM link, read and agree to the license agreement, and submit your information. You must wait for access to be granted before proceeding.

Deploy and Expose the Model

Create a Sealed Secret

Sealed Secrets are encrypted Kubernetes Secrets stored in the Values Git repository. When a Sealed Secret is created in the Console, the Kubernetes Secret will appear in the Team’s namespace.

  1. Select view > team and team > demo in the top bar.

  2. Select Sealed Secrets from the menu.

  3. Click Create SealedSecret.

  4. Add the name hf-secret.

  5. Select type kubernetes.io/opaque from the type dropdown menu.

  6. Add Key: HF_TOKEN.

  7. In the Value field, add your Hugging Face access token (shown in this guide as HUGGING_FACE_TOKEN).

  8. Click Submit. The Sealed Secret may take a few minutes to become ready.
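
Once the Sealed Secret shows as ready, the unsealed Kubernetes Secret should exist in the Team's namespace. As an optional check, open a Shell session for the demo Team and run:

    # The SealedSecret controller creates a regular Secret named hf-secret
    # in the Team's namespace once it has been unsealed.
    kubectl get secret hf-secret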

Create a Workload to Deploy the Model

  1. Select view > team and team > demo in the top bar.

  2. Select Catalog from the menu.

  3. Select the Kserve-Ai-Inferencing-Service chart.

  4. Click on Values.

  5. Provide a name for the Workload. This guide uses the Workload name llama3-model.

  6. Set the following values to disable sidecar injection, define your Hugging Face token, and specify resource limits:

    labels:
      sidecar.istio.io/inject: "false"
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HF_TOKEN
            optional: "false"
    args:
      - --model_name=llama3
      - --model_id=meta-llama/meta-llama-3-8b-instruct
    resources:
      limits:
        cpu: "12"
        memory: 24Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "6"
        memory: 12Gi
        nvidia.com/gpu: "1"
    
  7. Click Submit.

Check the Status of Your Workload

  1. It may take a few minutes for the Kserve-Ai-Inferencing-Service Workload to become ready. To check the status of the Workload build, open a shell session by selecting Shell in the left menu, and use the following command to check the status of the pods with kubectl:

    kubectl get pods
    NAME                                                       READY   STATUS    RESTARTS   AGE
    llama3-model-predictor-00001-deployment-86f5fc5d5d-7299c   0/2     Pending   0          4m22s
    tekton-dashboard-5f57787b8c-gswc2                          2/2     Running   0          19h
  2. To gather more information about a pod in a Pending state, run the kubectl describe pod command below, replacing POD_NAME with the name of your pod. In the output above, llama3-model-predictor-00001-deployment-86f5fc5d5d-7299c is the name of the pending pod:

    kubectl describe pod POD_NAME

    Scroll to the bottom of the output and look for Events. If there is an event with Reason FailedScheduling, the resources.requests values in your Kserve-Ai-Inferencing-Service Workload may need to be adjusted.

    Events:
    Type     Reason            Age                From               Message
    ----     ------            ----               ----               -------
    Warning  FailedScheduling  12s                default-scheduler  0/3 nodes are available: 3 Insufficient cpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

    Based on the output above, the Insufficient cpu warning indicates that the CPU resources.requests value is set too high.

  3. If this is the case, edit the resources.requests values for your Kserve-Ai-Inferencing-Service Workload:

    1. Navigate to Workloads.

    2. Select your llama3-model Workload.

    3. Click the Values tab.

    4. Adjust the necessary resources.requests value. In the example above, the requested number of CPUs should be lowered.

    5. Click Submit when you have finished adjusting your resources values.

Wait for the Workload to be ready again, and proceed to the following steps for exposing the model.
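
Because the chart deploys the model as a KServe InferenceService, you can also watch its readiness directly from the same shell session (which is assumed to default to the Team's namespace, as in the kubectl get pods command above):

    # List the InferenceService created for the llama3-model Workload.
    # READY=True means the predictor is up and serving.
    kubectl get inferenceservice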

Expose the Model

  1. Select Services from the menu.

  2. Click Create Service.

  3. In the Name dropdown list, select the llama3-model-predictor service.

  4. Under Exposure (ingress), select External.

  5. Click Submit.

Once the Service is ready, copy the URL for the llama3-model-predictor service, and add it to your clipboard.
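
Before wiring up the chatbot, you can optionally verify that the model answers requests. The Hugging Face serving runtime exposes an OpenAI-compatible API under the /openai path, which is why /openai/v1 is used as the base URL later in this guide. Replace the placeholder below with the URL you just copied; the prompt and max_tokens values are arbitrary examples:

    # Send a test completion request to the exposed model.
    # Replace <llama3-model-predictor-URL> with the Service URL you copied.
    curl -s https://<llama3-model-predictor-URL>/openai/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama3", "prompt": "Hello, world", "max_tokens": 20}'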

Deploy and Expose the AI Interface

The publicly-exposed LLM in this guide uses a wide range of ports, and as a result, all pods in a Team are automatically injected with an Istio sidecar. Sidecar injection is a means of adding additional containers and their configurations to a pod template.

The Istio sidecar in this case prevents the open-webui pod from connecting to the llama3-model service, because all egress traffic for pods in the Team namespace is blocked by an Istio ServiceEntry by default. This means that before deploying the AI interface using the open-webui Helm chart, the open-webui pod must be prevented from receiving the Istio sidecar.

Since the open-webui Helm chart does not allow for the addition of extra labels, there are two workarounds:

  1. Adjust the open-webui Helm chart in the chart’s Git repository. This is the Git repository where the open-webui Helm chart was stored when it was added to the Catalog.

  2. Add a Kyverno Policy that mutates the open-webui pod so that it will have the sidecar.istio.io/inject: "false" label.

Follow the steps below to implement the second option and add the Kyverno policy.

  1. In the Apps section, select the Gitea app.

  2. Navigate to the team-demo-argocd repository.

  3. Click the Add File dropdown, and select New File. Create a file named open-webui-policy.yaml with the following contents:

    apiVersion: kyverno.io/v1
    kind: Policy
    metadata:
      name: disable-sidecar-injection
      annotations:
        policies.kyverno.io/title: Disable Istio sidecar injection
    spec:
      rules:
      - name: disable-sidecar-injection
        match:
          any:
          - resources:
              kinds:
              - StatefulSet
              - Deployment
              selector:
                matchLabels:
                  ## change the value to match the name of the Workload
                  app.kubernetes.io/instance: "llama3-ui"
        mutate:
          patchStrategicMerge:
            spec:
              template:
                metadata:
                  labels:
                    sidecar.istio.io/inject: "false"
  4. Optionally add a title and any notes to the change history, and click Commit Changes.

  5. Check to see if the policy has been created in Argo CD:

    1. Go to Apps, and open the Argocd application.

    2. Using the search feature, go to the team-demo application to see if the policy has been created. If it isn’t there yet, view the team-demo application in the list of Applications, and click Refresh as needed.
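
Once Argo CD has synced the file, you can also confirm the policy from a Shell session for the demo Team (assumed to be scoped to the Team's namespace):

    # List Kyverno Policy resources in the Team's namespace.
    kubectl get policies.kyverno.io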

Create a Workload to Deploy the AI Interface

  1. Select view > team and team > demo in the top bar.

  2. Select Catalog from the menu.

  3. Select the Open-Webui chart.

  4. Click on Values.

  5. Provide a name for the Workload. This guide uses the Workload name llama3-ui.

  6. Add the following values, and change the openaiBaseApiUrl to the host and domain name you added to your clipboard when exposing the model (the URL for the llama3-model-predictor service). Make sure to append /openai/v1 to your URL as shown below.

    Remember to change the nameOverride value to the name of your Workload, llama3-ui:

    # Change the nameOverride to match the name of the Workload
    nameOverride: llama3-ui
    ollama:
      enabled: false
    pipelines:
      enabled: false
    replicaCount: 1
    persistence:
      enabled: false
    openaiBaseApiUrl: https://llama3-model--predictor-team-demo.<cluster-domain>/openai/v1
    extraEnvVars:
      - name: WEBUI_AUTH
        value: "false"
    
  7. Click Submit.
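
After the Workload becomes ready, you can optionally confirm that the Kyverno policy prevented sidecar injection. If it worked, the llama3-ui pod reports a single container (1/1) rather than two; the label selector below assumes the chart applies the standard app.kubernetes.io/instance label targeted by the policy:

    # The pod should show 1/1 containers, indicating no Istio sidecar was injected.
    kubectl get pods -l app.kubernetes.io/instance=llama3-ui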

Expose the AI Interface

  1. Select Services from the menu.

  2. Click Create Service.

  3. In the Name dropdown menu, select the llama3-ui service.

  4. Under Exposure (ingress), select External.

  5. Click Submit.

Access the Open Web User Interface

Once the AI user interface is ready, you should be able to access the web UI for the Open WebUI chatbot.

  1. Click on Services in the menu.

  2. In the list of available services, click on the URL for the llama3-ui service. This should bring you to the chatbot user interface.

Next Steps

See our Deploy a RAG Pipeline and Chatbot with App Platform for LKE guide to expand on the architecture built in this guide. This tutorial deploys a RAG (Retrieval-Augmented Generation) pipeline that indexes a custom data set and attaches relevant data as context when users send the LLM queries.

