
Running an EEVE Korean Instruct GGUF Model with llama.cpp on M1

a_mnesia 2024. 4. 14. 00:30

 

In the previous post, I used Ollama to run the EEVE and Gemma models on an M1 laptop.

This time, let's run a model with llama.cpp and see how it compares to using Ollama.

 

One thing to watch out for: for my setup, using llama.cpp on M1 also requires TensorFlow, and TensorFlow only installs cleanly on Python 3.8, 3.9, or 3.10. On Python 3.11 and 3.12, the TensorFlow install failed for me.

% conda create -n llm python=3.10
% conda install -c apple tensorflow-deps
% python -m pip install tensorflow
% python -m pip install tensorflow-macos
% python -m pip install tensorflow-metal

# Check the installed libraries
% pip list | grep tensor
tensorboard                  2.16.2
tensorboard-data-server      0.7.2
tensorflow                   2.16.1
tensorflow-io-gcs-filesystem 0.36.0
tensorflow-macos             2.16.1
tensorflow-metal             1.1.0

# Verify the installation
% python
>>> import tensorflow as tf
>>> import keras
>>> print(tf.__version__)
2.16.1
>>> print(keras.__version__)
3.2.1
>>>

 

 

Now follow the installation steps from the llama-cpp-python page.

 

Usage

# GPU model
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

# CPU
pip install llama-cpp-python

pip install huggingface_hub

 

The install command above errored out for me; on M1, the build only succeeds if you pass the macOS arm64 architecture options.

(Thanks to 고석현 Noah from the LLM RAG Langchain group chat for the help.)

% CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
Using pip 23.3.1 from /Users/dongsik/miniconda/envs/llm/lib/python3.10/site-packages/pip (python 3.10)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.61.tar.gz (37.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.4/37.4 MB 10.1 MB/s eta 0:00:00
  Running command pip subprocess to install build dependencies
...
(omitted)
...
      Successfully uninstalled Jinja2-3.1.3
Successfully installed MarkupSafe-2.1.5 diskcache-5.6.3 jinja2-3.1.3 llama-cpp-python-0.2.61 numpy-1.26.4 typing-extensions-4.11.0

 

% pip install huggingface_hub
Collecting huggingface_hub
  Using cached huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
...
(omitted)
...
Installing collected packages: tqdm, pyyaml, fsspec, filelock, huggingface_hub
Successfully installed filelock-3.13.4 fsspec-2024.3.1 huggingface_hub-0.22.2 pyyaml-6.0.1 tqdm-4.66.2

 

Verify the installation.

% python -V
Python 3.10.14
% pip list
Package                      Version
---------------------------- --------------
...
huggingface-hub              0.22.2
llama_cpp_python             0.2.61
...
tensorboard                  2.16.2
tensorboard-data-server      0.7.2
tensorflow                   2.16.1
tensorflow-io-gcs-filesystem 0.36.0
tensorflow-macos             2.16.1
tensorflow-metal             1.1.0
...

 

Additionally, install JupyterLab so we can work in a notebook.

% pip install jupyterlab
...

 

Let's start JupyterLab and run the example code below in a notebook.
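To launch it, run the following from the activated environment (the notebook UI opens in your browser):

% jupyter lab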

 

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

import time
from pprint import pprint

print(Llama)
<class 'llama_cpp.llama.Llama'>

 

# download model
model_name_or_path = "heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF" # repo id
# 4bit
model_basename = "ggml-model-Q4_K_M.gguf" # file name

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
print(model_path)
/Users/dongsik/.cache/huggingface/hub/models--heegyu--EEVE-Korean-Instruct-10.8B-v1.0-GGUF/snapshots/9bf4892cf2017362dbadf99bd9a3523387135362/ggml-model-Q4_K_M.gguf

 

# To use the GPU, run the code below
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=43, # Change this value based on your model and your GPU VRAM pool.
    n_ctx=4096, # Context window
)
llama_model_loader: loaded meta data with 24 key-value pairs and 435 tensors from /Users/dongsik/.cache/huggingface/hub/models--heegyu--EEVE-Korean-Instruct-10.8B-v1.0-GGUF/snapshots/9bf4892cf2017362dbadf99bd9a3523387135362/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
...
(omitted)
...
Model metadata: {'general.quantization_version': '2', 'tokenizer.chat_template': "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful assistant.' %}{% endif %}{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 %}{{'<|im_start|>system\n' + system_message + '<|im_end|>\n'}}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.model': 'llama', 'llama.attention.head_count_kv': '8', 'llama.context_length': '4096', 'llama.attention.head_count': '32', 'llama.rope.freq_base': '10000.000000', 'llama.rope.dimension_count': '128', 'general.file_type': '15', 'llama.feed_forward_length': '14336', 'llama.embedding_length': '4096', 'llama.block_count': '48', 'general.architecture': 'llama', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'general.name': 'LLaMA v2'}
Using gguf chat template: {% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful assistant.' %}{% endif %}{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 %}{{'<|im_start|>system
' + system_message + '<|im_end|>
'}}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Using chat eos_token: <|im_end|>
Using chat bos_token: <s>

 

prompt_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: {prompt}\nAssistant:\n"
text = '한국의 수도는 어디인가요? 아래 선택지 중 골라주세요.\n\n(A) 경성\n(B) 부산\n(C) 평양\n(D) 서울\n(E) 전주'

prompt = prompt_template.format(prompt=text)

start = time.time()
response = lcpp_llm(
    prompt=prompt,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    top_k=50,
    stop = ['</s>'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)
pprint(response)
print(time.time() - start)
llama_print_timings:        load time =    7595.99 ms
llama_print_timings:      sample time =      49.12 ms /   159 runs   (    0.31 ms per token,  3236.90 tokens per second)
llama_print_timings: prompt eval time =    7595.51 ms /    83 tokens (   91.51 ms per token,    10.93 tokens per second)
llama_print_timings:        eval time =   24649.90 ms /   158 runs   (  156.01 ms per token,     6.41 tokens per second)
llama_print_timings:       total time =   33079.70 ms /   241 tokens
{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'text': 'A chat between a curious user and an artificial '
                      'intelligence assistant. The assistant gives helpful, '
                      "detailed, and polite answers to the user's questions.\n"
                      'Human: 한국의 수도는 어디인가요? 아래 선택지 중 골라주세요.\n'
                      '\n'
                      '(A) 경성\n'
                      '(B) 부산\n'
                      '(C) 평양\n'
                      '(D) 서울\n'
                      '(E) 전주\n'
                      'Assistant:\n'
                      '한국은 동아시아에 위치한 국가로, 공식적으로 대한민국이라고 불립니다. 한국의 수도는 (D) '
                      '서울입니다. 서울은 나라의 북동부에 위치해 있으며 가장 큰 도시이자 정치, 경제, 문화의 '
                      '중심지입니다. 1948년 대한민국이 설립된 이래로 수도 역할을 해오고 있습니다.\n'
                      '\n'
                      '다른 선택지들은 다음과 같습니다:\n'
                      '(A) 경성 - 이 용어는 구식으로, 지금은 서울이라고 불립니다.\n'
                      '(B) 부산 - 한국의 중요한 도시지만 수도는 아닙니다.\n'
                      '(C) 평양 - 북한을 구성하는 도시 중 하나이지만 대한민국의 수도가 아닙니다.\n'
                      '(D) 전주 - 한국의 역사적인 도시로 중요하지만 수도는 아닙니다.'}],
 'created': 1713092665,
 'id': 'cmpl-c3bd8b09-3a89-4364-8c8d-41d60891160f',
 'model': '/Users/dongsik/.cache/huggingface/hub/models--heegyu--EEVE-Korean-Instruct-10.8B-v1.0-GGUF/snapshots/9bf4892cf2017362dbadf99bd9a3523387135362/ggml-model-Q4_K_M.gguf',
 'object': 'text_completion',
 'usage': {'completion_tokens': 158, 'prompt_tokens': 83, 'total_tokens': 241}}
33.08910894393921

 

A response takes roughly 12 to 33 seconds.

The 33 seconds was for the first run; on the second and third tries it comes down to about 11-12 seconds.
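If you want to reproduce the warm-run numbers, you can simply call the model a few more times with the same prompt; a minimal sketch reusing the lcpp_llm and prompt objects from the cells above:

# Time several runs of the same prompt: the first run is slower
# (warm-up), the later runs are faster.
for i in range(3):
    start = time.time()
    lcpp_llm(prompt=prompt, max_tokens=256, temperature=0.5,
             top_p=0.95, top_k=50, stop=['</s>'])
    print(f"run {i + 1}: {time.time() - start:.1f}s")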

For reference, the initial hf_hub_download of ggml-model-Q4_K_M.gguf (6.51 GB) took about 11 minutes at roughly 10.6 MB/s.
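Incidentally, the GGUF metadata above includes a chat template (the <|im_start|> format). Instead of hand-building the prompt string, you could let llama-cpp-python apply that template through its chat-completion API; this is just a sketch I did not run here:

# Alternative: use the chat-completion API so the model's built-in
# chat template is applied automatically (untested sketch).
chat_response = lcpp_llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "한국의 수도는 어디인가요?"},
    ],
    max_tokens=256,
    temperature=0.5,
)
print(chat_response["choices"][0]["message"]["content"])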

 

 

Next up: vLLM...

 

References


GGUF documentation

https://huggingface.co/docs/hub/gguf

 


 

Hugging Face GGUF Library

https://huggingface.co/models?library=gguf

 
