题意:
使用fastapi、llama.cpp和langchain流式传输本地大型语言模型
问题背景:
i have setup fastapi with llama.cpp and langchain. now i want to enable streaming in the fastapi responses. streaming works with llama.cpp in my terminal, but i wasn't able to implement it with a fastapi response.
我已经使用llama.cpp和langchain设置了fastapi。现在我想在fastapi响应中启用流式传输。在我的终端中,流式传输与llama.cpp一起工作正常,但我无法将其与fastapi响应一起实现。
most tutorials focused on enabling streaming with an openai model, but i am using a local llm (quantized mistral) with llama.cpp. i think i have to modify the callbackhandler, but no tutorial worked. here is my code:
大多数教程都集中在如何使用openai模型启用流式传输,但我正在使用带有llama.cpp的本地大型语言模型(量化的mistral)。我认为我需要修改callbackhandler,但我没有找到任何可行的教程。以下是我的代码:
from fastapi import fastapi, request, response
from langchain_community.llms import llamacpp
from langchain.callbacks.manager import callbackmanager
from langchain.callbacks.streaming_stdout import streamingstdoutcallbackhandler
import copy
from langchain.chains import llmchain
from langchain.prompts import prompttemplate
model_path = "../modelle/mixtral-8x7b-instruct-v0.1.q5_k_m.gguf"
prompt= """
<s> [inst] im folgenden bekommst du eine aufgabe. erledige diese anhand des user inputs.
### hier die aufgabe: ###
{typescript_string}
### hier der user input: ###
{input}
antwort: [/inst]
"""
def model_response_prompt():
return prompttemplate(template=prompt, input_variables=['input', 'typescript_string'])
def build_llm(model_path, callback=none):
callback_manager = callbackmanager([streamingstdoutcallbackhandler()])
#callback_manager = callbackmanager(callback)
n_gpu_layers = 1 # metal set to 1 is enough. # ausprobiert mit mehreren
n_batch = 512#1024 # should be between 1 and n_ctx, consider the amount of ram of your apple silicon chip.
llm = llamacpp(
max_tokens =1000,
n_threads = 6,
model_path=model_path,
temperature= 0.8,
f16_kv=true,
n_ctx=28000,
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
callback_manager=callback_manager,
verbose=true,
top_p=0.75,
top_k=40,
repeat_penalty = 1.1,
streaming=true,
model_kwargs={
'mirostat': 2,
},
)
return llm
# caching llm
@lru_cache(maxsize=100)
def get_cached_llm():
chat = build_llm(model_path)
return chat
chat = get_cached_llm()
app = fastapi(
title="inference api for mistral and mixtral",
description="a simple api that use mistral or mixtral",
version="1.0",
)
app.add_middleware(
corsmiddleware,
allow_origins=["*"],
allow_credentials=true,
allow_methods=["*"],
allow_headers=["*"],
)
def bullet_point_model():
llm = build_llm(model_path=model_path)
llm_chain = llmchain(
llm=llm,
prompt=model_response_prompt(),
verbose=true,
)
return llm_chain
@app.get('/model_response')
async def model(question : str, prompt: str):
model = bullet_point_model()
res = model({"typescript_string": prompt, "input": question})
result = copy.deepcopy(res)
return result
in a example notebook, i am calling fastapi like this:
在一个示例笔记本中,我像这样调用fastapi:
import subprocess
import urllib.parse
import shlex
query = input("insert your bullet points here: ")
task = input("insert the task here: ")
#safe encode url string
encodedquery = urllib.parse.quote(query)
encodedtask = urllib.parse.quote(task)
#join the curl command textx
command = f"curl -x 'get' 'http://127.0.0.1:8000/model_response?question={encodedquery}&prompt={encodedtask}' -h 'accept: application/json'"
print(command)
args = shlex.split(command)
process = subprocess.popen(args, shell=false, stdout=subprocess.pipe, stderr=subprocess.pipe)
stdout, stderr = process.communicate()
print(stdout)
so with this code, getting responses from the api works. but i only see streaming in my terminal (i think this is because of the streamingstdoutcallbackhandler
. after the streaming in the terminal is complete, i am getting my fastapi response.
所以,使用这段代码,从api获取响应是可行的。但我只能在终端中看到流式传输(我认为这是因为使用了streamingstdoutcallbackhandler)。在终端中的流式传输完成后,我才能收到fastapi的响应。
what do i have to change now that i can stream token by token with fastapi and a local llama.cpp model?
我现在可以使用fastapi和本地的llama.cpp
模型逐令牌(token-by-token)地进行流式传输,那么我还需要改变什么?
问题解决:
i was doing the same and hit similar issue that fastapi was not streaming the response even i am using the streamingresponse
api and eventually i got the following code work. there are three important part:
我之前也做了同样的事情,并遇到了类似的问题,即即使我使用了streamingresponse
api,fastapi也没有流式传输响应。但最终我得到了以下可以工作的代码。这里有三个重要的部分:
-
make sure using
streamingresponse
to wrap aniterator
.
确保使用streamingresponse
来包装一个迭代器
-
make sure the iterator sends newline character
\n
in each streaming response.
确保迭代器在每个流式响应中发送换行符 \n
。
-
make sure using streaming apis to connect to your llms. for example,
_client.chat
function in my example is usinghttpx
to connect to rest apis for llms. if you userequests
package, it won't work as it doesn't support streaming.
确保使用流式api来连接您的大型语言模型(llms)。例如,在我的示例中,_client.chat
函数使用 httpx
来连接到llms的rest api。如果您使用 requests
包,那么它将无法工作,因为 requests
不支持流式传输。
async def chat(self, request: request):
"""
generate a chat response using the requested model.
"""
# passing request body json to parameters of function _chat
# request body follows ollama api's chat request format for now.
params = await request.json()
self.logger.debug("request data: %s", params)
chat_response = self._client.chat(**params)
# always return as streaming
if isinstance(chat_response, iterator):
def generate_response():
for response in chat_response:
yield json.dumps(response) + "\n"
return streamingresponse(generate_response(), media_type="application/x-ndjson")
elif chat_response is not none:
return json.dumps(chat_response)
发表评论