We previously covered integrating Pandas with ChatGPT, which lets you work with a DataFrame without knowing Pandas. Now Scikit-LLM has been open-sourced: it combines powerful language models, such as ChatGPT, with scikit-learn. Note that it is not meant to automate scikit-learn itself; rather, it integrates language models into scikit-learn so that scikit-learn can handle text data as well.
Installation
pip install scikit-llm
Since we are integrating with OpenAI's models, we need an API key. Import the SKLLMConfig module from the Scikit-LLM library and add your OpenAI key:
# importing SKLLMConfig to configure the OpenAI API (key and organization)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
ZeroShotGPTClassifier
By integrating ChatGPT, text can be classified without any dedicated training. ZeroShotGPTClassifier is used just like any other scikit-learn classifier and is very simple to work with.
# importing ZeroShotGPTClassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# get classification dataset from sklearn
X, y = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data
clf.fit(X, y)

# predicting the data
labels = clf.predict(X)
Scikit-LLM post-processes the model's response to ensure it contains only a valid label. If the response is missing a label, Scikit-LLM can fill one in for you, choosing a label based on how frequently it appears in the training data.
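As a rough illustration of that fallback behavior (a hypothetical sketch, not Scikit-LLM's actual implementation), the idea looks something like this:

import random

# Hypothetical sketch of the label-fallback idea described above;
# resolve_label and its arguments are illustrative names only.
def resolve_label(response, candidate_labels, training_labels):
    # keep the model's answer if it is already a valid label
    if response in candidate_labels:
        return response
    # otherwise pick a label weighted by its frequency in the training data
    weights = [training_labels.count(label) for label in candidate_labels]
    return random.choices(candidate_labels, weights=weights, k=1)[0]

# e.g. a misspelled response falls back to a frequency-weighted choice
print(resolve_label("postive", ["positive", "negative"],
                    ["positive", "positive", "negative"]))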
For our own labeled data, we only need to provide the list of candidate labels; the code looks like this:
# importing ZeroShotGPTClassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# get classification dataset from sklearn for prediction only
X, _ = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier()

# since there is no training, pass only the candidate labels
clf.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels
labels = clf.predict(X)
MultiLabelZeroShotGPTClassifier
Multi-label classification works in much the same way:
# importing Multi-Label zeroshot module and classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# get classification dataset from sklearn
X, y = get_multilabel_classification_dataset()

# defining the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the model
clf.fit(X, y)

# making predictions
labels = clf.predict(X)
When creating an instance of the MultiLabelZeroShotGPTClassifier class, specify the maximum number of labels to assign to each sample (here: max_labels=3).
What if the data has no labels? You can train a classifier on unlabeled data by providing a list of candidate labels; y should be of type List[List[str]]. Here is a training example without labeled data:
# getting classification dataset for prediction only
X, _ = get_multilabel_classification_dataset()

# defining all the labels that need to be predicted
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

# creating the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting with the candidate labels only
clf.fit(None, [candidate_labels])

# predicting the data
labels = clf.predict(X)
Text Vectorization
Text vectorization is the process of converting text into numbers. The GPTVectorizer module in Scikit-LLM converts a piece of text (no matter how long) into a fixed-size set of vectors.
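A minimal, direct use of GPTVectorizer might look like this (a sketch; the exact embedding dimension depends on the OpenAI embedding model being used):

from skllm.preprocessing import GPTVectorizer

# embed a list of texts into fixed-size numeric vectors
vectorizer = GPTVectorizer()
vectors = vectorizer.fit_transform(["scikit-learn can now work with raw text"])
print(vectors.shape)  # (1, embedding_dim)

A more typical setup chains GPTVectorizer with a downstream classifier in a scikit-learn Pipeline, as in the following example: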
# Importing the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer

# X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split

# Creating an instance of LabelEncoder class
le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps
clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' and the encoded labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' using the trained pipeline
yh = clf.predict(X_test)
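To sanity-check the pipeline, you can score the predictions against the encoded test labels (a short sketch using scikit-learn's accuracy_score):

from sklearn.metrics import accuracy_score

# compare the pipeline's predictions against the encoded ground truth
print(accuracy_score(y_test_encoded, yh))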
Text Summarization
GPT is very good at summarizing text. Scikit-LLM provides a module for this called GPTSummarizer.
# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function
X = get_summarization_dataset()

# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)
Note that the max_words hyperparameter is a soft limit on the number of words in the generated summary. While max_words sets a rough target for summary length, the summarizer may occasionally produce slightly longer summaries depending on the context and content of the input text.
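If you want to see how closely the summaries respect that soft limit, a quick inspection loop (hypothetical, for illustration) is enough:

# count the words of each generated summary; some may slightly exceed max_words
for summary in summaries:
    print(len(summary.split()), "-", summary)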
Conclusion
The popularity of ChatGPT has brought rapid progress to general-purpose models, and that progress is changing how we work day to day. Scikit-LLM brings LLMs into the scikit-learn workflow. If you are interested, the source is here:
https://avoid.overfit.cn/post/9ba131a01d374926b6b7efff97f61c45
Author: Fareed Khan