About chatgpt: ChatGPT Token Optimization and Breaking the Length Limit

The original is on my WeChat official account, in the article "ChatGPT Token Optimization and Breaking the Length Limit".

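Counting tokens shows that Chinese is at a real disadvantage: a single Chinese character can even be split apart before it is counted as tokens (radicals and components …), so ask the LLM in English whenever possible.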

Try it online: https://platform.openai.com/tokenizer

To count tokens in code:

NodeJS: gpt-3-encoder

Python: tiktoken

Reference links:
https://www.npmjs.com/package/gpt-3-encoder
https://community.openai.com/t/gpt-3-encoder-now-available-in…
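A likely reason for the gap (an interpretation, not something stated in the original article) is that the GPT tokenizers are byte-level BPE: they start from UTF-8 bytes, and a Chinese character takes three bytes where an English letter takes one. The sketch below uses only the standard library to compare byte counts. It does not count tokens; the byte count is only an upper bound on the token count, and the sample strings are made up for illustration.

```python
def utf8_profile(text: str) -> dict:
    """Return character and UTF-8 byte counts for a string."""
    data = text.encode("utf-8")
    return {
        "chars": len(text),
        "bytes": len(data),
        "bytes_per_char": len(data) / len(text) if text else 0.0,
    }


# Hypothetical sample sentences with roughly the same meaning.
english = "Hello, how are you?"
chinese = "你好，你怎么样？"

for label, sample in (("english", english), ("chinese", chinese)):
    print(label, utf8_profile(sample))

# For real token counts, use tiktoken instead, for example:
#   encoding = tiktoken.get_encoding("cl100k_base")
#   n_tokens = len(encoding.encode(sample))
```

A byte-level tokenizer can merge frequent byte sequences into one token, so the actual ratio depends on the encoding in use; the online tokenizer above shows the real split.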

There are three ways to optimize tokens:

1. Optimize the prompt
2. Summarize the context
3. text-embedding

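Ask in English and get the answer in English, then translate the answer into the language you need to display.
Trim the prompt and remove unnecessary symbols.
…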

Prompt optimization relies on experience combined with prompting techniques, but the gains are modest and it does not suit long conversations.

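In a long chat, the context carried with each request grows with every round of questions and answers, so it easily hits the model's per-request token limit. I usually use one of two operations:

1. Recent context: once the history exceeds N records, keep only the first record and the last M records as the conversation context;

2. Compressed context: once the history exceeds N records, send the LLM a "summarize the chat" request; its reply is the summary of the conversation history.

Here is the context-summarization prompt excerpted from LangChain: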

Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.
EXAMPLE
Current summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good.
New lines of conversation:
Human: Why do you think artificial intelligence is a force for good?
AI: Because artificial intelligence will help humans reach their full potential.
New summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good because it will help humans reach their full potential.
END OF EXAMPLE
Current summary:
{summary}
New lines of conversation:
{new_lines}
New summary:
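
The two operations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code or LangChain's implementation: `recent_context` and `summary_prompt` are hypothetical helper names, the history is a plain list of strings, and the template is shortened by leaving out the EXAMPLE section of the prompt.

```python
# Shortened version of the summarization prompt (EXAMPLE section omitted).
SUMMARY_TEMPLATE = (
    "Progressively summarize the lines of conversation provided, "
    "adding onto the previous summary returning a new summary.\n"
    "Current summary:\n{summary}\n"
    "New lines of conversation:\n{new_lines}\n"
    "New summary:"
)


def recent_context(history, n, m):
    """Operation 1: once history exceeds n records, keep the first and the last m."""
    if len(history) <= n:
        return list(history)
    # Start no earlier than index 1 so the first record is never duplicated.
    start = max(1, len(history) - m)
    return [history[0]] + list(history[start:])


def summary_prompt(summary, new_lines):
    """Operation 2: fill the template; the LLM's reply becomes the new summary."""
    return SUMMARY_TEMPLATE.format(summary=summary, new_lines="\n".join(new_lines))


history = [f"turn {i}" for i in range(1, 9)]  # 8 hypothetical records
print(recent_context(history, n=6, m=3))
print(summary_prompt("(none yet)", history[:2]))
```

Sending the filled prompt to the model and storing its reply as the next `summary` is left out, since that depends on the client library in use.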

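The text-embedding approach suits asking questions about specific documents or existing text, such as internal company documents and all kinds of manuals. You can do it through OpenAI's official text-embedding API, or with llama-index (both require an OpenAI key).

The main flow is:

1. text-to-vec: use embedding to convert the target document or knowledge base into vectors, which can be stored on local disk or in a dedicated vector database (e.g. Qdrant).

2. prompt-to-vec: again with text-embedding, convert the prompt sent by the user into a vector, call it "VP". Search the local disk or the vector database with VP, using a technique called "cosine similarity" to match the relevant content (similarity ranges from 0 to 1, and you can set a similarity value to filter the results).

3. newprompt-to-gpt: the matches from step 2 are records sorted by similarity in descending order, and each record holds a mapping of "original text - vector". We only need the "original text" of each record, which we inject into a new prompt that becomes the final prompt sent to the LLM. This gives us "natural-language questions over the relevant content only" and saves tokens.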

newPromptLike = `We have the opportunity to refine the above answer 
(only if needed) with some more context below.
------------
{similar_context_msg}
------------
Given the new context, refine the original answer to better 
answer the question. 
If the context isn't useful, output the original answer again.`;
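
Steps 2 and 3 can be illustrated without any API call. The sketch below is a toy, assuming hand-written three-dimensional vectors in place of real embeddings; `top_matches` and `build_prompt` are hypothetical helper names, and the 0.7 threshold simply mirrors the value used in the Query script later in this article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, records, threshold=0.7, limit=3):
    """Score every (text, vector) record, filter by threshold, sort descending."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in records]
    kept = [item for item in scored if item[0] > threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:limit]


def build_prompt(question, matches):
    """Inject only the original text of the matches into the new prompt."""
    context = "\n".join(text for _, text in matches)
    return f"Context:\n'''{context}'''\n\nQuestion: {question}"


# Made-up records: (original text, toy vector).
records = [
    ("Refunds are processed within 7 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on weekends.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]
vp = [1.0, 0.0, 0.0]  # toy stand-in for the embedded user prompt "VP"
matches = top_matches(vp, records)
print(build_prompt("How do refunds work?", matches))
```

Only the two refund records pass the threshold here, so the unrelated record never reaches the prompt, which is where the token saving comes from.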

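The text-embedding process has some details that need care. For example, to control the number of tokens in each embedding call and the amount of text returned by a match, the document has to be split into segments, and the splitting strategy also affects the quality of the matches.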
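
One simple splitting strategy, sketched under assumptions: pack whole paragraphs greedily into segments and hard-split any paragraph that is too long on its own. `split_into_chunks` is a hypothetical helper, and it budgets by character count as a rough stand-in for tokens; a real pipeline would measure segments with a tokenizer such as tiktoken.

```python
def split_into_chunks(text, max_chars=200):
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Hard-split a paragraph that alone exceeds the budget.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The current chunk is full: emit it and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


# Made-up document: two short paragraphs and one long one.
doc = "A" * 50 + "\n\n" + "B" * 50 + "\n\n" + "C" * 120
chunks = split_into_chunks(doc, max_chars=110)
print([len(c) for c in chunks])
```

Splitting on paragraph boundaries keeps related sentences together, which tends to matter more for match quality than hitting the size budget exactly.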

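Below is a diagram I found online that lays out the flow clearly.

What follows is the text-embedding process for a QA document that I implemented with OpenAI's official Python package. It has two parts, "Build" and "Query".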

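# 1. build: vectorize the document
import pandas as pd
import tiktoken

# openai.embeddings_utils ships only with openai-python releases before 1.0
from openai.embeddings_utils import get_embedding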

# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

readfilename = 'qawiki'  # name of the source document
savefilename = 'qawiki_with_embeddings'  # name of the file saved after embedding

# load & inspect dataset
# to save space, we provide a pre-filtered dataset
input_datapath = f"data/{readfilename}.csv"
# read the csv and use the first column as the index
df = pd.read_csv(input_datapath, index_col=0)

# keep only the columns we need
df = df[["Question", "Answer"]]
df = df.dropna()
df["combined"] = ("Q: " + df.Question.str.strip() + " A: " + df.Answer.str.strip())
print(df["combined"])
df.head(2)

# keep the last 1k rows and remove rows that are too long
top_n = 1000

encoding = tiktoken.get_encoding(embedding_encoding)

# drop rows that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)


# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv(f"data/{savefilename}.csv")
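
A note on the hand-off between the two scripts: saving a column of lists to CSV stores each list as its string form, which is why the Query script below parses the `embedding` column with `ast.literal_eval`. The sketch below reproduces that round trip with the standard-library `csv` module as a stand-in for pandas (an assumption made to keep it dependency-free); the row content is made up.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["combined", "embedding"])
writer.writerow(["Q: hi A: hello", [0.1, -0.2, 0.3]])  # the list is written via str()

# Read it back: the list comes back as a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["embedding"]

# literal_eval safely turns the string back into a list of floats.
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__, parsed)
```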

# 2. query: search the vectors
from openai.embeddings_utils import get_embedding
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import ast
import numpy as np

input_datapath = "data/qawiki_with_embeddings.csv"
df = pd.read_csv(input_datapath)

# convert the strings in the embedding column into float arrays
df.embedding = df.embedding.apply(lambda x: np.array(ast.literal_eval(x)))

def search_reviews(df, product_description, n=3, pprint=True):
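    # get the embedding of the question
    embedding = get_embedding(product_description, engine='text-embedding-ada-002')

    # compute the cosine similarity between every stored embedding and the question;
    # ravel() flattens the (rows, 1) result into one value per row
    df['similarities'] = cosine_similarity(np.array(df.embedding.tolist()), np.array(embedding).reshape(1, -1)).ravel()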
    res = df.sort_values('similarities', ascending=False).head(n)
    return res

msg = input("What can I help you with?\n")

# type "exit" to quit the program
while msg != 'exit':
    # get the query results
    res = search_reviews(df, msg, n=4)

    print('\n')
    print(res)
    print('\n')

    # concatenate the combined field of every record that meets the similarity threshold into relativecontent
    relativecontent = ''
    for row in res.itertuples():
        # print('similarities:' + str(row.similarities))
        if row.similarities > 0.7:
            relativecontent += row.combined + '\n'
        else:
            print('drop a row of low similarities:' + str(row.similarities))

    # print(relativecontent)

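    newprompt = f"You are the company's AI customer service agent. Answer my question using the content inside ''', without explanation. If the content inside ''' is not useful, output the original answer.\n Content:\n'''{relativecontent}'''\n\n Question: {msg}"

    print('New prompt \n')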
    print(newprompt)

    # TODO
    # answer = callGPT(newprompt)
    # ...

    # wait for the next input
    msg = input("\n\n Any other questions?")

Query embedding result:

GPT-3.5 Q&A:
