
Streaming local LLM with FastAPI, Llama.cpp and Langchain


Question:

How to stream a local large language model with FastAPI, llama.cpp and LangChain.

Problem background:

I have set up FastAPI with llama.cpp and LangChain. Now I want to enable streaming in the FastAPI responses. Streaming works with llama.cpp in my terminal, but I wasn't able to implement it with a FastAPI response.

Most tutorials focus on enabling streaming with an OpenAI model, but I am using a local LLM (a quantized Mistral) with llama.cpp. I think I have to modify the CallbackHandler, but no tutorial worked. Here is my code:

from functools import lru_cache
import copy

from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from langchain_community.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

model_path = "../modelle/mixtral-8x7b-instruct-v0.1.q5_k_m.gguf"

prompt = """
<s> [INST] Im Folgenden bekommst du eine Aufgabe. Erledige diese anhand des User-Inputs.

### Hier die Aufgabe: ###
{typescript_string}

### Hier der User-Input: ###
{input}

Antwort: [/INST]
"""

def model_response_prompt():
    return PromptTemplate(template=prompt, input_variables=['input', 'typescript_string'])

def build_llm(model_path, callback=None):
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    # callback_manager = CallbackManager(callback)

    n_gpu_layers = 1  # Metal: setting this to 1 is enough (tried several values).
    n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM of your Apple Silicon chip.

    llm = LlamaCpp(
        max_tokens=1000,
        n_threads=6,
        model_path=model_path,
        temperature=0.8,
        f16_kv=True,
        n_ctx=28000,
        n_gpu_layers=n_gpu_layers,
        n_batch=n_batch,
        callback_manager=callback_manager,
        verbose=True,
        top_p=0.75,
        top_k=40,
        repeat_penalty=1.1,
        streaming=True,
        model_kwargs={
            'mirostat': 2,
        },
    )

    return llm

# Caching the LLM
@lru_cache(maxsize=100)
def get_cached_llm():
    chat = build_llm(model_path)
    return chat

chat = get_cached_llm()

app = FastAPI(
    title="Inference API for Mistral and Mixtral",
    description="A simple API that uses Mistral or Mixtral",
    version="1.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

def bullet_point_model():
    llm = build_llm(model_path=model_path)
    llm_chain = LLMChain(
        llm=llm,
        prompt=model_response_prompt(),
        verbose=True,
    )
    return llm_chain

@app.get('/model_response')
async def model(question: str, prompt: str):
    model = bullet_point_model()
    res = model({"typescript_string": prompt, "input": question})
    result = copy.deepcopy(res)
    return result

In an example notebook, I am calling FastAPI like this:

import subprocess
import urllib.parse
import shlex

query = input("Insert your bullet points here: ")
task = input("Insert the task here: ")

# Safely URL-encode the query strings
encoded_query = urllib.parse.quote(query)
encoded_task = urllib.parse.quote(task)

# Build the curl command text
command = f"curl -X 'GET' 'http://127.0.0.1:8000/model_response?question={encoded_query}&prompt={encoded_task}' -H 'accept: application/json'"
print(command)

args = shlex.split(command)
process = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
print(stdout)

So with this code, getting responses from the API works. But I only see streaming in my terminal (I think this is because of the StreamingStdOutCallbackHandler). Only after the streaming in the terminal is complete do I get my FastAPI response.

What do I have to change so that I can stream token by token with FastAPI and a local llama.cpp model?

Solution:

I was doing the same thing and hit a similar issue: FastAPI was not streaming the response even though I was using the StreamingResponse API. Eventually I got the following code to work. There are three important parts:

  • Make sure to use StreamingResponse to wrap an iterator.

  • Make sure the iterator sends a newline character \n in each streaming response.

  • Make sure to use streaming APIs to connect to your LLMs. For example, the _client.chat function in my example uses httpx to connect to the LLM's REST API. If you use the requests package, it won't work, as it doesn't support streaming. (A minimal sketch of what such an httpx-based client might look like follows this list.)
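For reference, here is a minimal sketch of what such an httpx-based streaming client could look like. It is not part of the original answer: the StreamingLLMClient class name, the base URL and the POST /api/chat path are assumptions, modelled on the Ollama-style chat format mentioned in the comments of the code further below.

import json
import httpx

class StreamingLLMClient:
    # Hypothetical helper that streams chat completions from an
    # Ollama-style REST endpoint (assumed to be POST /api/chat).

    def __init__(self, base_url: str = "http://127.0.0.1:11434"):
        self._base_url = base_url

    def chat(self, **params):
        # Ask the backend for newline-delimited JSON chunks.
        params.setdefault("stream", True)
        with httpx.Client(timeout=None) as client:
            # client.stream() keeps the connection open so the response body
            # can be read incrementally, line by line, as it arrives.
            with client.stream("POST", f"{self._base_url}/api/chat", json=params) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines():
                    if line:
                        yield json.loads(line)

The FastAPI endpoint from the answer below then simply iterates over whatever this chat() call returns.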

# Note: this is a method on an API wrapper class; it additionally needs
#   import json
#   from collections.abc import Iterator
#   from fastapi import Request
#   from fastapi.responses import StreamingResponse
async def chat(self, request: Request):
    """
    Generate a chat response using the requested model.
    """

    # Pass the request body JSON on as parameters of the _client.chat call.
    # The request body follows the Ollama API's chat request format for now.
    params = await request.json()
    self.logger.debug("Request data: %s", params)

    chat_response = self._client.chat(**params)

    # Always return as streaming
    if isinstance(chat_response, Iterator):
        def generate_response():
            for response in chat_response:
                yield json.dumps(response) + "\n"
        return StreamingResponse(generate_response(), media_type="application/x-ndjson")
    elif chat_response is not None:
        return json.dumps(chat_response)
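Applied to the question's setup, the same pattern could look roughly like the sketch below. This is an untested adaptation, not the answerer's code: it assumes the app, build_llm, model_path and model_response_prompt objects from the question are in scope, replaces the blocking LLMChain call with LangChain's llm.stream() interface, and wraps the resulting token iterator in a StreamingResponse. The /model_response_stream path is a made-up name for illustration.

from fastapi.responses import StreamingResponse

# Assumption: app, build_llm, model_path and model_response_prompt are the
# objects defined in the question's server code above.
llm = build_llm(model_path)
prompt_template = model_response_prompt()

@app.get('/model_response_stream')  # hypothetical endpoint, for illustration only
async def model_response_stream(question: str, prompt: str):
    # Render the prompt with the same variables the LLMChain used.
    rendered = prompt_template.format(typescript_string=prompt, input=question)

    def token_generator():
        # With streaming=True, LangChain's llm.stream() yields the completion
        # token by token instead of returning one final string.
        for token in llm.stream(rendered):
            yield token

    # FastAPI iterates the synchronous generator in a threadpool and flushes
    # each token to the client as it arrives.
    return StreamingResponse(token_generator(), media_type="text/plain")

On the client side, the stream can then be consumed incrementally, for example with curl -N or httpx's iter_text(), instead of waiting for the complete JSON response.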
