Running Open Source LLM Models on CentOS 7
This week I was given access to a new VM in our datacenter dedicated to exploring LLMs and different AI applications. I wrote about one of our first PoCs a week ago - so it was only natural to use that project as my guinea pig. I'm no DevOps wizard - so I stuck to simple infrastructure for now. It's worth noting that this VM's purpose is to demo and document PoCs for internal audiences - so we don't need to get too fancy with a production-grade tech stack.
Overview
- Install Necessary Tools
- Install Docker
- Download an open-source model with `huggingface-cli`
- Spin up a Docker container to interact with the model
- Add Nginx to make it web-accessible
Prerequisites
Install Docker
Probably the easiest, most cookie-cutter part of the whole process - and possibly the most obvious one. Installing Docker is a breeze, and plenty of folks have already written about it. I found this and it just worked - so I didn't dig too deep (reminder: I'm not a daily CentOS user).
Install Docker
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io
Start and Enable Docker
sudo systemctl start docker
sudo systemctl enable docker
Verify Docker Installation
sudo docker run hello-world
…and install docker-compose, create a docker user, etc.
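For reference, that step looked roughly like the following - a sketch rather than gospel, since the compose release version is just an example and docker-user is simply the account name referenced later in the compose file:
Install docker-compose and Create a Docker User
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
sudo useradd docker-user
sudo usermod -aG docker docker-user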
Install huggingface-hub and Download a Model
This is where things got interesting and really might be the crux of the whole post. CentOS 7 comes with Python 2.7 by default, and you'll need Python 3 to use the `huggingface-hub` package.
Install Python 3
sudo yum install -y python3
Install huggingface-hub
pip3 install huggingface-hub
Things break…
$ huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
Traceback (most recent call last):
  File "/usr/local/bin/huggingface-cli", line 7, in <module>
    from huggingface_hub.commands.huggingface_cli import main
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/__init__.py", line 21, in <module>
    from .commands.user import notebook_login
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/commands/user.py", line 26, in <module>
    from huggingface_hub.hf_api import HfApi, HfFolder
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/hf_api.py", line 35, in <module>
    from .utils.endpoint_helpers import (
  File "/usr/local/lib/python3.6/site-packages/huggingface_hub/utils/endpoint_helpers.py", line 17, in <module>
    from dataclasses import dataclass
ModuleNotFoundError: No module named 'dataclasses'
This is where I spent the bulk of my time - fighting my Python installation to get all the pieces I needed. Ultimately it came down to a choice: run pip3 install dataclasses (a backport of the standard-library module for Python 3.6) or upgrade to Python 3.7 or later. In the moment, I opted for Python 3.7 and went about building it from source - with lots of trial and error.
Realizing I need Python 3.8+
I spent a lot of time fighting Python errors and trying to patch around them - but all of that could have been avoided if I had just checked the package's Python requirement on PyPI - https://pypi.org/project/huggingface-hub/
The error that led me there (documented for posterity): my huggingface-cli was out of date and didn't support the `download` command.
$ huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
usage: huggingface-cli <command> [<args>]
huggingface-cli: error: invalid choice: 'download' (choose from 'login', 'whoami', 'logout', 'repo', 'lfs-enable-largefiles', 'lfs-multipart-upload')
Ultimately I landed on Python 3.8, installed `huggingface-hub`, and was finally able to download the model I wanted.
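For anyone retracing those steps, building 3.8 from source and re-running the download looked roughly like this - treat it as a sketch, and note the 3.8.x release pinned here is just an example:
Build Python 3.8 from Source
sudo yum install -y gcc make openssl-devel bzip2-devel libffi-devel zlib-devel
curl -LO https://www.python.org/ftp/python/3.8.18/Python-3.8.18.tgz
tar xzf Python-3.8.18.tgz && cd Python-3.8.18
./configure --enable-optimizations
sudo make altinstall   # altinstall avoids clobbering the system python3
Reinstall huggingface-hub and Retry the Download
sudo pip3.8 install huggingface-hub
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False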
Comparing Docker Containers
I spent some time researching the various means of hosting an LLM on CPU-only hardware - a term/keyword I learned in the process. The entire premise is "How can I host an LLM on a VM without a GPU?" - with the added requirement of using a Docker container to isolate and standardize the deployment process.
1. llama.cpp
llama.cpp is a port of Facebook's LLaMA model to C/C++, and right now it seems to be the most advanced option in terms of CPU-only infrastructure. Originally I was exploring Dockerfiles that I could host - but ultimately I landed on llama-cpp-python, which can be dropped straight into my LangChain example from last week. llama.cpp was the easiest to spin up, and for now it will serve as my primary mechanism for hosting the LLM.
Finding a GGUF file
llama.cpp specifically requires a GGUF file. It exposes an API/CLI to convert a model to this format - but I found that TheBloke on Hugging Face provided all I needed to skip this step.
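As a quick illustration of how little is involved: llama-cpp-python ships a standalone OpenAI-compatible server module, which is a handy way to smoke-test a GGUF file. This isn't exactly what my ./server image ends up running (that wraps the model in LangChain), and the flags shift a bit between versions - the model path here assumes the download location used later in the compose file:
Smoke-Test the Model with llama-cpp-python
sudo pip3.8 install 'llama-cpp-python[server]'
python3.8 -m llama_cpp.server --model /home/docker-user/llama/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080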
2. Ollama
Prior to researching this, I thought Ollama was built specifically for Apple silicon. I guess I completely missed the "Linux" and "Windows" options when I first downloaded it, and I was blind to the fact that it ships an official Docker image. At the time of writing I have not explored Ollama on the VM - but I certainly will.
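For future reference, the CPU-only setup per the image's documentation looks something like this (untested on this VM so far):
Run Ollama in Docker (CPU-only)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run mistral   # pulls the model and opens an interactive prompt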
3. vLLM
vLLM is highly discussed and referred to across the various communities talking about hosting open-source models. It has all the bells and whistles - describing itself as “A high-throughput and memory-efficient inference and serving engine for LLMs.” However, right now it is GPU-only and doesn’t support CPU-only infrastructure.
Next Steps
With an LLM running in a Docker container, I opted to wrap things up in a docker-compose file and expose the LangChain playground via Nginx. This is where I could finally validate whether or not things were working (and ultimately answer: "Can I run an LLM on this VM?").
Now, this is simply an interim step in the larger process of hosting the PoC - so take the following resources with a grain of salt. They just illustrate how simple LangChain is to host and expose.
docker-compose.yml
version: '3'
services:
  langchain:
    build: ./server
    ports:
      - "8080:8080"
    volumes:
      - /home/docker-user/llama/models:/models
    networks:
      - webnet
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - langchain
    networks:
      - webnet
networks:
  webnet:
nginx.conf
events {}

http {
    server {
        listen 80;

        location / {
            proxy_pass http://langchain:8080/;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection 'upgrade';
            proxy_set_header Host $host;
            proxy_cache_bypass $http_upgrade;
        }
    }
}
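With both files in place (and the LangChain app under ./server), bringing everything up is the standard compose workflow - the curl is just a quick sanity check that Nginx is proxying through to the langchain container:
Bring It Up
docker-compose up -d --build
docker-compose ps
curl -I http://localhost/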