Artificial Intelligence: Scikit-LLM Integrates Large Language Models into the scikit-learn Workflow

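We previously covered integrating Pandas with ChatGPT, which lets you operate on a DataFrame without knowing Pandas. Now someone has open-sourced Scikit-LLM, which combines powerful language models such as ChatGPT with scikit-learn. The goal is not to automate scikit-learn; it is to integrate language models into scikit-learn so that scikit-learn workflows can also handle text data.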

Installation

    pip install scikit-llm

Because Scikit-LLM works with OpenAI's models, you need an OpenAI API key. Import SKLLMConfig from the Scikit-LLM library and set your OpenAI key:

    # Import SKLLMConfig to configure the OpenAI API (key and organization)
    from skllm.config import SKLLMConfig

    # Set your OpenAI API key
    SKLLMConfig.set_openai_key("<YOUR_KEY>")

    # Set your OpenAI organization (optional)
    SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
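
Hard-coding the key in a script makes it easy to leak. The sketch below shows one way to read it from an environment variable. The helper `load_openai_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration and are not part of Scikit-LLM.

```python
import os


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Both this helper and the default variable name are illustrative
    assumptions, not part of Scikit-LLM.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Usage (requires scikit-llm to be installed):
# from skllm.config import SKLLMConfig
# SKLLMConfig.set_openai_key(load_openai_key())
```

Raising an error when the variable is missing surfaces the configuration problem before any request is sent.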

ZeroShotGPTClassifier

By integrating ChatGPT, you can classify text without any dedicated training. ZeroShotGPTClassifier is as simple to use as any other scikit-learn classifier.

    # Import the ZeroShotGPTClassifier class and the sample classification dataset
    from skllm import ZeroShotGPTClassifier
    from skllm.datasets import get_classification_dataset

    # Load the demo classification dataset that ships with skllm
    X, y = get_classification_dataset()

    # Define the model
    clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

    # Fit the model
    clf.fit(X, y)

    # Predict the labels
    labels = clf.predict(X)

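Scikit-LLM post-processes the model's response to ensure that it contains only a valid label. If the response is missing a label, Scikit-LLM fills one in, choosing a label according to how frequently it appears in the training data.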
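
The fallback described above can be sketched roughly as follows. This is a minimal illustration of the idea, not Scikit-LLM's actual code: the function name `validate_label` and its signature are assumptions, and it expects a non-empty list of training labels.

```python
import random
from collections import Counter
from typing import List, Optional


def validate_label(response: str, training_labels: List[str],
                   rng: Optional[random.Random] = None) -> str:
    """Return `response` if it is a known label, else a frequency-weighted fallback.

    Hypothetical helper: assumes `training_labels` is non-empty.
    """
    rng = rng or random.Random()
    counts = Counter(training_labels)
    cleaned = response.strip()
    if cleaned in counts:
        return cleaned
    # Unknown or empty response: sample a label in proportion to how
    # often each label occurs in the training data.
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]
```

With this approach a frequent label is more likely to be chosen as the fallback than a rare one, and the caller always gets back a label from the known set.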

If you have no labeled data of your own, you only need to provide a list of candidate labels. The code looks like this:

    # Import the ZeroShotGPTClassifier class and the sample classification dataset
    from skllm import ZeroShotGPTClassifier
    from skllm.datasets import get_classification_dataset

    # Load the demo dataset from skllm; only the texts are used, for prediction
    X, _ = get_classification_dataset()

    # Define the model
    clf = ZeroShotGPTClassifier()

    # There is no training data, so pass only the candidate labels
    clf.fit(None, ['positive', 'negative', 'neutral'])

    # Predict the labels
    labels = clf.predict(X)
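
To make the zero-shot idea concrete, here is a sketch of how a classification prompt could be assembled from a text and the candidate labels. The wording of the prompt and the helper `build_zero_shot_prompt` are assumptions for illustration; the prompt that Scikit-LLM actually sends may differ.

```python
from typing import List


def build_zero_shot_prompt(text: str, candidate_labels: List[str]) -> str:
    """Assemble a classification prompt from a text and its candidate labels.

    Hypothetical helper: the prompt wording is an assumption.
    """
    if not candidate_labels:
        raise ValueError("At least one candidate label is required")
    label_list = ", ".join(repr(label) for label in candidate_labels)
    return (
        "Classify the text below into exactly one of the following "
        f"categories: {label_list}.\n"
        "Answer with the category name only.\n"
        f"Text: {text}"
    )


prompt = build_zero_shot_prompt(
    "The phone arrived quickly and works great.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

Because the labels are part of the prompt itself, no training examples are needed, which is why passing only the label list to `fit` is enough.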

MultiLabelZeroShotGPTClassifier

Multi-label classification works similarly:

    # Import the multi-label zero-shot classifier and the sample dataset
    from skllm import MultiLabelZeroShotGPTClassifier
    from skllm.datasets import get_multilabel_classification_dataset

    # Load the demo multi-label dataset from skllm
    X, y = get_multilabel_classification_dataset()

    # Define the model
    clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

    # Fit the model
    clf.fit(X, y)

    # Make predictions
    labels = clf.predict(X)

When you create an instance of the MultiLabelZeroShotGPTClassifier class, specify the maximum number of labels to assign to each sample (here, max_labels=3).
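
The effect of such a cap can be sketched as a small post-processing step: keep only known labels, drop duplicates, and return at most `max_labels` of them. The helper `limit_labels` is hypothetical and only illustrates the concept; it is not taken from Scikit-LLM.

```python
from typing import List


def limit_labels(predicted: List[str], candidate_labels: List[str],
                 max_labels: int = 3) -> List[str]:
    """Keep known labels, drop duplicates, and cap the result at max_labels.

    Hypothetical helper illustrating what a max_labels cap implies.
    """
    allowed = set(candidate_labels)
    kept: List[str] = []
    for label in predicted:
        if label in allowed and label not in kept:
            kept.append(label)
    return kept[:max_labels]


candidates = ["Quality", "Price", "Delivery", "Service", "Product Variety"]
raw = ["Price", "Speed", "Price", "Quality", "Service", "Delivery"]
print(limit_labels(raw, candidates, max_labels=3))
```

Here "Speed" is dropped because it is not a candidate, the repeated "Price" is counted once, and the remaining labels are cut off after the third.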

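What if the data has no labels? You can train the classifier without labeled data by providing a list of candidate labels. In this case y should be of type List[List[str]]. Here is a training example without labeled data: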

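    # Import the multi-label zero-shot classifier and the sample dataset
    from skllm import MultiLabelZeroShotGPTClassifier
    from skllm.datasets import get_multilabel_classification_dataset

    # Load the demo dataset; only the texts are used, for prediction
    X, _ = get_multilabel_classification_dataset()

    # Define all the labels that may be predicted
    candidate_labels = [
        "Quality",
        "Price",
        "Delivery",
        "Service",
        "Product Variety"
    ]

    # Create the model
    clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

    # Fit with the candidate labels only (note the extra list: y is List[List[str]])
    clf.fit(None, [candidate_labels])

    # Predict the labels
    labels = clf.predict(X)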
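
Wrapping the candidates in an outer list gives y the same shape as real multi-label targets, where each sample carries its own list of labels. The sketch below shows how a set of candidate labels could be derived from either form. The helper `extract_candidate_labels` is an assumption for illustration and is not Scikit-LLM's implementation.

```python
from typing import List


def extract_candidate_labels(y: List[List[str]]) -> List[str]:
    """Flatten y into a list of unique labels, preserving first-seen order.

    Hypothetical helper: works for labeled targets (one list per sample)
    and for a single wrapped list of candidate labels.
    """
    seen: List[str] = []
    for sample_labels in y:
        for label in sample_labels:
            if label not in seen:
                seen.append(label)
    return seen


# Labeled data: one list of labels per sample
print(extract_candidate_labels([["Quality", "Price"], ["Price", "Service"]]))

# No labeled data: a single list of candidates wrapped in an outer list
print(extract_candidate_labels([["Quality", "Price", "Delivery"]]))
```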

Text Vectorization

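Text vectorization is the process of converting text into numbers. The GPTVectorizer module in Scikit-LLM converts a piece of text, no matter how long it is, into a fixed-size vector.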

    # Import the necessary modules and classes
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier
    from skllm.preprocessing import GPTVectorizer

    # X_train, X_test, y_train and y_test are assumed to be defined already

    # Create an instance of the LabelEncoder class
    le = LabelEncoder()

    # Encode the training labels 'y_train' using LabelEncoder
    y_train_encoded = le.fit_transform(y_train)

    # Encode the test labels 'y_test' using LabelEncoder
    y_test_encoded = le.transform(y_test)

    # Define the steps of the pipeline as a list of tuples
    steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

    # Create a pipeline with the defined steps
    clf = Pipeline(steps)

    # Fit the pipeline on the training data and the encoded training labels
    clf.fit(X_train, y_train_encoded)

    # Predict the labels for the test data using the trained pipeline
    yh = clf.predict(X_test)
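
The key property a downstream classifier relies on is that every text, short or long, maps to a vector of the same length. The toy function below demonstrates only that property, using a simple hashing trick from the standard library. It is a stand-in chosen for illustration and has nothing to do with the embeddings GPTVectorizer produces; `toy_embed` and `dim` are hypothetical names.

```python
import hashlib
from typing import List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Map text of any length to a vector with exactly `dim` components.

    Toy stand-in for a real embedding model (hashing trick), used only
    to show the fixed-size property.
    """
    vector = [0.0] * dim
    tokens = text.lower().split()
    for token in tokens:
        # Hash each token to a stable bucket index in [0, dim).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    # Average over tokens so long and short texts stay on a similar scale.
    count = max(len(tokens), 1)
    return [value / count for value in vector]


short_vec = toy_embed("good")
long_vec = toy_embed("the delivery was late but the quality was excellent " * 20)
print(len(short_vec), len(long_vec))
```

Because the output length never changes, vectors like these can be stacked into a feature matrix and passed to any scikit-learn estimator.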

Text Summarization

GPT is very good at summarizing text, and Scikit-LLM provides a module called GPTSummarizer for this purpose.

    # Import the GPTSummarizer class from the skllm.preprocessing module
    from skllm.preprocessing import GPTSummarizer

    # Import the get_summarization_dataset function
    from skllm.datasets import get_summarization_dataset

    # Load the sample summarization dataset
    X = get_summarization_dataset()

    # Create an instance of GPTSummarizer
    s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

    # Fit the summarizer to the input data 'X' and generate the summaries,
    # which are assigned to the variable 'summaries'
    summaries = s.fit_transform(X)

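Note that the max_words hyperparameter is a flexible limit on the number of words in the generated summary. Although max_words sets a rough target for the summary length, the summarizer may occasionally produce a slightly longer summary depending on the context and content of the input text.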
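
If a strict cap is required, one option is to post-process the generated summaries. The sketch below truncates a summary to a hard word limit. The helper `truncate_to_max_words` is hypothetical and is not part of Scikit-LLM.

```python
def truncate_to_max_words(summary: str, max_words: int) -> str:
    """Return the summary cut to at most max_words whitespace-separated words.

    Hypothetical post-processing helper; appends "..." when text is cut.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = summary.split()
    if len(words) <= max_words:
        return summary.strip()
    return " ".join(words[:max_words]) + "..."


print(truncate_to_max_words("The product arrived late but works well overall", 5))
```

Cutting on word boundaries can end a summary mid-thought, so this trade-off is only worth making when the length limit really is strict.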

Summary

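The popularity of ChatGPT has driven further progress in general-purpose models, and that progress has brought major changes to our everyday work. Scikit-LLM brings LLMs into the scikit-learn workflow. If you are interested, here is the source: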

https://avoid.overfit.cn/post/9ba131a01d374926b6b7efff97f61c45

Author: Fareed Khan