.NET Example of Calling Ollama and the CPU Utilization Question

2024-04-14 10:52 PM

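A few days ago I wrote up a lazy way to run a local LLM with Ollama + Open WebUI; on CPU alone, the speed was underwhelming.

Playing with on-prem LLMs without a GPU, being painfully slow is exactly what I deserve.

Even so, out of curiosity I wanted to see whether doubling the CPU core count, and then doubling it again, would noticeably improve performance, so I spun up a 48 vCPU VM on Azure for a brief test.

Along the way I noticed something: CPU utilization across the 48 cores topped out at 50%, which is not the all-cores-pegged picture I had imagined. Isn't that wasting resources instead of going full throttle?

Searching around, I found plenty of people reporting the same thing, Ollama/llama.cpp using only 50% of the CPU. (Note: Ollama is built on top of llama.cpp.)

After some digging: llama.cpp has an n_threads parameter that controls how many threads are used for LLM generation, and Ollama controls the same thing through the num_thread parameter in the Modelfile. I also found a discussion about a Python project for llama.cpp that used self.n_threads = n_threads or max(multiprocessing.cpu_count() // 2, 1), which sets the thread count to half the CPU count. It was challenged for leaving part of the CPU idle and falling short of optimal, and the discussion said: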

Most physical systems are hyperthreaded

Hyperthreading doesn't seem to improve performance due to the memory I/O bound nature of llama.cpp

Might be invalid for VMs

This was tested in original llama. C++ implementation already utilities full available compute of physical cores, increasing amount of threads beyond would only to lower performance due to reason mentioned above.

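In short: testing showed that the llama.cpp implementation already makes full use of the compute of the physical cores. Because most systems are hyperthreaded and llama.cpp is memory I/O bound, raising the thread count beyond the physical core count does not help and can even lower performance. The discussion also notes that this reasoning might not hold for VMs.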
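
The default quoted above can be sketched in a few lines of Python. This only illustrates the quoted expression; it is not llama.cpp or Ollama code, and the function name default_thread_count and its logical_cpus parameter are hypothetical, added just to make the rule easy to try with different CPU counts.

```python
import multiprocessing


def default_thread_count(n_threads=None, logical_cpus=None):
    """Return n_threads if given, else half the logical CPU count (at least 1)."""
    if logical_cpus is None:
        # cpu_count() reports logical CPUs, so with hyperthreading it is
        # typically twice the number of physical cores.
        logical_cpus = multiprocessing.cpu_count()
    # Same rule as the quoted expression: n_threads or max(cpu_count() // 2, 1)
    return n_threads or max(logical_cpus // 2, 1)


if __name__ == "__main__":
    print(default_thread_count(logical_cpus=48))                # 24
    print(default_thread_count(logical_cpus=1))                 # 1
    print(default_thread_count(n_threads=32, logical_cpus=48))  # 32
```

If the runtime applies a similar default, 24 busy threads on 48 vCPUs would show up as roughly 50% utilization, which would be consistent with the ceiling observed on the VM.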

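Let's run an experiment to verify.

To get concrete performance numbers, I decided to write a test program that calls the Ollama API and times it. It is also a chance to practice integrating Ollama from C# and pick up a skill needed for RAG with a local LLM. For .NET there is an open-source project, OllamaSharp; reference the library with dotnet add package OllamaSharp and a few lines of code will do.

I wrote a simple keyword-extraction task. The program is below. (I had ChatGPT help polish the prompt, and it came out far more professional than my first draft.)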

using System.Diagnostics;
using OllamaSharp;

var ollama = new OllamaApiClient(new Uri("http://localhost:11434"));
var prompt = @"As an AI, carefully read the given article and determine the 10 most crucial keywords. 
These should be listed in a string separated by comma as follows: '1. keyword1, 2. keyword2, 3. keyword3...'. 
Don't give more than 10 keywords and no explanations and additional words are permitted.

Here is the article:
""""""
Understand the Transformer architecture and explore large language models in Azure Machine Learning 

Introduction

Foundation models, such as GPT-3, are state-of-the-art natural language processing models ... (content omitted)
""""""
";

string[] models = ["mistral", "mistral-16t", "mistral-32t", "mistral-47t"];
foreach (var modelName in models)
{
    await RunTest(modelName);
    await Task.Delay(10000);
}


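// Runs the prompt three times against the given model and prints the elapsed time of each run.
async Task RunTest(string modelName)
{
    ollama.SelectedModel = modelName;
    // No prior conversation state; every run sends the same prompt with a null context.
    ConversationContext context = null!;
    Console.WriteLine($"Testing model {modelName}...");
    for (int i = 0; i < 3; i++)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            var resp = await ollama.GetCompletion(prompt, context);
            Console.WriteLine(resp.Response);
        }
        catch (Exception ex)
        {
            // A failed run is reported and skipped, so it produces no timing line.
            Console.WriteLine($"Error: {ex.Message}");
            continue;
        }
        sw.Stop();
        Console.WriteLine($" Run {i} - Time: {sw.ElapsedMilliseconds:n0}ms");
    }
}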
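
The model names in the loop above (mistral-16t, mistral-32t, and so on) have to exist in Ollama before the program runs. Here is a minimal sketch of preparing them, assuming each variant is just the base mistral model plus a PARAMETER num_thread line. The helper names and the thread counts passed in are illustrative, and the script only writes the files and prints the ollama create commands rather than running them.

```python
from pathlib import Path


def modelfile_text(base_model, num_thread):
    """Build Modelfile content that pins num_thread for a base model."""
    return f"FROM {base_model}\nPARAMETER num_thread {num_thread}\n"


def write_modelfiles(base_model, thread_counts, out_dir="."):
    """Write one Modelfile per thread count; return (model name, path) pairs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    created = []
    for n in thread_counts:
        path = out / f"Modelfile-{base_model}-{n}"
        path.write_text(modelfile_text(base_model, n), encoding="utf-8")
        created.append((f"{base_model}-{n}t", path))
    return created


if __name__ == "__main__":
    # Thread counts here are examples; pick the ones you want to compare.
    for name, path in write_modelfiles("mistral", [16, 32]):
        # Printed for review; run each line in a shell to register the model.
        print(f'ollama create "{name}" -f {path}')
```

Keeping the base model and everything else identical means any timing difference between variants can be attributed to num_thread alone.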

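On the Ollama side, each thread count needs its own model, created from a Modelfile that specifies num_thread. I prepared a Modelfile that sets num_thread to 32, saved it as Modelfile-mistral-32, and ran ollama create "mistral-32t" -f ./Modelfile-mistral-32 to create a model named mistral-32t. Setting ollama.SelectedModel = modelName before calling ollama.GetCompletion() then selects which model, and therefore which thread count, is being tested.

On the 48-core VM I tested four thread settings: unspecified (decided automatically by the system), 16, 32, and 48, with three runs each. On the Linux server, sar -u 5 recorded CPU utilization every five seconds. There is a 10-second pause when switching between models, so there should be two dips to zero CPU that serve as separators. The measured results are as follows: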