
Running Open Source LLM Models on CentOS 7

This week I was given access to a new VM in our datacenter dedicated to exploring LLMs and different AI applications. I wrote about one of our first PoCs a week ago - so it was only natural to use it as my guinea pig. I’m no dev-ops wizard - so I stuck to simple infrastructure for now. It’s important to note - this VM’s purpose is to demo and document PoCs for internal audiences - so we don’t need to get too fancy with a production-grade tech-stack.

Overview

  1. Install Necessary Tools
    1. Install Docker
    2. Download an open-source model with huggingface-cli
  2. Spin up a docker container to interact with the model
  3. Add Nginx to make it web-accessible

Prerequisites

Install Docker

Probably the easiest, most cookie-cutter part of the whole process and possibly even the most obvious one. Installing Docker is a breeze - and so many folks have already written about it. I followed a guide I found and it just worked - so I didn't dig too deep into it (reminder: I'm not a daily CentOS user).

Install Docker

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io

Start and Enable Docker

sudo systemctl start docker
sudo systemctl enable docker

Verify Docker Installation

sudo docker run hello-world

…and install docker-compose, create a docker user, etc.
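
For completeness, those follow-up steps looked roughly like the commands below (a sketch, not a hardening guide - I went with the compose plugin from the same Docker repo, though the standalone docker-compose binary works just as well, and docker-user matches the home directory used later for the model volume):

sudo yum install -y docker-compose-plugin   # provides the "docker compose" subcommand
sudo useradd docker-user                    # dedicated user for running containers
sudo usermod -aG docker docker-user         # docker group membership takes effect on next login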

Install huggingface-hub and Download a Model

This is where things got interesting and really might be the crux of the whole post. CentOS 7 comes with Python 2.7 by default, and you’ll need Python 3 to use the huggingface-hub package.

Install Python 3

sudo yum install -y python3

Install huggingface-hub

pip3 install huggingface-hub

Things break…

$ huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
Traceback (most recent call last):
  File "/usr/local/bin/huggingface-cli", line 7, in <module>
    from huggingface_hub.commands.huggingface_cli import main
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/__init__.py", line 21, in <module>
    from .commands.user import notebook_login
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/commands/user.py", line 26, in <module>
    from huggingface_hub.hf_api import HfApi, HfFolder
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/hf_api.py", line 35, in <module>
    from .utils.endpoint_helpers import (
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/utils/endpoint_helpers.py", line 17, in <module>
    from dataclasses import dataclass
ModuleNotFoundError: No module named 'dataclasses'

This is where I spent the bulk of my time - fighting my Python installation to get all the pieces I needed. Ultimately it came down to two options: pip3 install dataclasses (a backport of the module), or upgrade to Python 3.7 or later. In the moment, I opted for Python 3.7 and set about building it from source - with lots of trial and error.

Realizing I need Python 3.8+

I spent a lot of time fighting Python errors and trying to fix them one by one - but all of that could have been avoided if I had just checked the package requirements on PyPI - https://pypi.org/project/huggingface-hub/ - which shows that current releases need Python 3.8+.

For documentation's sake, here's the error that led me there - my huggingface-cli was out of date and didn't support the download command.

$ huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
usage: huggingface-cli <command> [<args>]
huggingface-cli: error: invalid choice: 'download' (choose from 'login', 'whoami', 'logout', 'repo', 'lfs-enable-largefiles', 'lfs-multipart-upload')

Ultimately I landed on Python 3.8, installed huggingface-hub and was finally able to download the model I wanted.
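
For reference, the rough sequence that finally worked - building Python 3.8 from source (the exact point release and paths may differ on your box):

sudo yum install -y gcc make wget openssl-devel bzip2-devel libffi-devel zlib-devel
wget https://www.python.org/ftp/python/3.8.18/Python-3.8.18.tgz
tar xzf Python-3.8.18.tgz && cd Python-3.8.18
./configure --enable-optimizations
sudo make altinstall                     # installs python3.8/pip3.8 without touching the system python3

pip3.8 install --user huggingface-hub
export PATH="$HOME/.local/bin:$PATH"     # so the huggingface-cli script is on the PATH
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False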

Comparing Docker Containers

I spent some time researching the various options for hosting an LLM on CPU-only hardware - a term/keyword I learned in the process. The entire premise is "How can I host an LLM on a VM without a GPU?" - with the added requirement of using a Docker container to isolate and standardize the deployment process.

1. llama.cpp

llama.cpp is a C/C++ port of inference for Facebook's LLaMA models, and right now it seems to be the most advanced option for CPU-only infrastructure. Originally I was exploring Dockerfiles that I could host - but I ultimately landed on llama-cpp-python, which drops straight into my LangChain example from last week. llama.cpp was the easiest to spin up, and for now it will serve as my primary mechanism for hosting the LLM.

Finding GGUF file

llama.cpp specifically requires a GGUF file. It ships tooling to convert a model to this format - but I found that TheBloke on Hugging Face already provides pre-converted GGUF files, so I could skip that step entirely.
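
With the GGUF downloaded, llama-cpp-python plugs straight into LangChain. A minimal sketch of that hand-off (assuming a recent langchain-community - older releases import LlamaCpp from langchain.llms instead - and the models directory used later in docker-compose):

# load the downloaded GGUF with llama-cpp-python via LangChain; llama.cpp runs it on CPU
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/home/docker-user/llama/models/mistral-7b-v0.1.Q5_K_M.gguf",
    n_ctx=2048,        # context window size
    temperature=0.7,
)

print(llm.invoke("Explain what a GGUF file is in one sentence."))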

2. Ollama

Prior to researching this - I thought Ollama was specifically built for Apple silicon. I guess I completely missed the “Linux” and “Windows” options when I first downloaded it. I was blind to the fact that it had a Dockerfile. At the time of writing, I have not explored Ollama on the VM - but I certainly will.

3. vLLM

vLLM comes up constantly across the communities discussing how to host open-source models. It has all the bells and whistles - describing itself as "A high-throughput and memory-efficient inference and serving engine for LLMs." However, right now it is GPU-only and doesn't support CPU-only infrastructure.

Next Steps

With an LLM running in a Docker container - I opted to wrap things up in docker-compose and expose the LangChain playground via Nginx. This is where I could finally validate whether things were working (and ultimately answer: "Can I run an LLM on this VM?").

Now, this is simply an interim step in the larger process of hosting the PoC - so take the following resources with a grain of salt. They just illustrate how simple LangChain is to host and expose (a rough sketch of the ./server image follows the compose file).

docker-compose.yml

version: '3'
services:
  langchain:
    build: ./server
    ports:
      - "8080:8080"
    volumes:
      - /home/docker-user/llama/models:/models
    networks:
      - webnet

  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - langchain
    networks:
      - webnet

networks:
  webnet:
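
The langchain service builds from a ./server directory that I'm not including here. As a rough, hypothetical sketch of what that Dockerfile could look like - assuming llama-cpp-python plus a small LangServe app in app.py listening on 8080, with the GGUF model arriving via the /models volume above:

FROM python:3.11-slim

# llama-cpp-python compiles llama.cpp from source, so it needs a C/C++ toolchain
RUN apt-get update && apt-get install -y --no-install-recommends build-essential cmake \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir llama-cpp-python langchain langchain-community "langserve[all]" uvicorn

WORKDIR /app
# app.py is the (hypothetical) LangServe app exposing the playground
COPY app.py .

EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]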

nginx.conf

events {}

http {
    server {
        listen 80;

        location / {
            proxy_pass http://langchain:8080/;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection 'upgrade';
            proxy_set_header Host $host;
            proxy_cache_bypass $http_upgrade;
        }
    }
}
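
Bringing everything up and sanity-checking from the VM itself looks roughly like this (if firewalld is running, port 80 also needs to be opened before anyone outside the box can reach it):

sudo docker compose up -d --build        # or docker-compose, if you use the standalone binary
curl http://localhost/                   # any response here means Nginx is proxying to the langchain container

sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --reload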