How Creative Are Large Language Models? Exploring the Evolution from GPT-2 to GPT-4

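Editor's note: Large language models can produce many kinds of creative content, such as poems and fiction. But how should we evaluate how creative that content actually is?

This article examines the creativity of GPT models from GPT-2 to GPT-4 and analyzes how they perform on different creativity tests. The author uses three tests to measure the models: the Remote Associates Test, the Alternate Uses Task, and the Divergent Association Task. Newer models generally perform better across all of these tasks.

There are exceptions, however: on some test items that have only one “correct” answer, the simpler GPT-2 model outperforms the latest GPT-4 model. The author stresses that these experiments measure the creativity of the content the models generate, not the creativity of the models themselves.

The translated article follows. Enjoy!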

Author | YENNIE JUN

Translator | 岳扬

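Recently, people have used large language models (LLMs) to create all kinds of creative content, such as books[1], flash fiction[2], rap battles[3], and music chords[4]. But can we measure the creative abilities of these models more comprehensively?

Human creativity has long fascinated psychologists and researchers. Since the 1950s[5], researchers have developed a series of creativity tests to compare individuals' creative performance and potential. No single test can fully quantify human creativity, but each one measures a different aspect of it.

This article analyzes how GPT models from 2019 to 2023 perform on tests that measure two kinds of creativity: convergent (there is a single correct solution) and divergent (open-ended; multiple solutions may exist). The tests are: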

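Connecting three seemingly unrelated words with a fourth word (Remote Associates Test[6])
Coming up with as many alternative uses as possible for everyday objects (Alternate Uses Task[7])
Naming ten nouns that are as different from each other as possible (Divergent Association Task[8])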

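In analyzing the results, I focus on how performance changes from the early GPT-2 model to the latest GPT-4 model, using both objective and subjective lenses to understand the models' behavior. In general, later models tend to perform better on all of the creativity tasks. However, I found a few exceptions: when a test has only one “correct” answer, the simpler GPT-2 model sometimes outperforms the latest GPT-4 model. I also compare the GPT models against human performance on the same creativity tests.

Note: These experiments aim to measure the creativity of the content the models generate, not the creativity of the models themselves. I do not claim that GPT models, or LLMs in general, possess creative thinking the way humans do. Rather, my goal is to show how these models respond to particular creativity tests.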

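The evolution of GPT models

Generative Pre-trained Transformer (GPT) models are language models trained on large amounts of text from the internet to predict the next token given the preceding tokens. Although ChatGPT has captured public attention only in recent months, OpenAI's GPT models (and the technology behind them) have been around for several years.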

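GPT-1 (2018, 117M parameters). The original GPT model. I used the open-source version on HuggingFace[9]. However, its outputs were so poor that I did not record any GPT-1 results in this article.

Paper: Improving Language Understanding by Generative Pre-Training[10]

GPT-2 (2019, 1.5B parameters). I used the open-source version on HuggingFace[11]. For some experiments, GPT-2's outputs were nonsensical and of no use for understanding its behavior, so I left them out of the corresponding analyses.

Paper: Language Models are Unsupervised Multitask Learners[12]

GPT-3 (2020, 175B parameters). More than 100 times larger than GPT-2. I used text-davinci-002 via the OpenAI APIs[13].

Paper: Language Models are Few-Shot Learners[14]

GPT-3.5 (2022). There are several versions of GPT-3.5; I used the same model that powered the original ChatGPT[15], which was trained with reinforcement learning from human feedback (RLHF). I used gpt-3.5-turbo[16] via the OpenAI APIs.

GPT-4 (2023; OpenAI has not disclosed the parameter count, and the rumored figure of about 100T is unconfirmed). I used gpt-4 via the OpenAI APIs.

Paper: GPT-4 Technical Report[17]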

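In all experiments in this article, unless otherwise noted, I called each GPT model five times for each unique prompt, to account for the randomness in its generations.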
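As a rough illustration of this repeated-sampling setup, the sketch below calls a model several times on the same prompt and collects the responses. The `generate` argument is a hypothetical stand-in for whatever function actually calls the model (it is not part of any real API), and `fake_model` exists only so that the example runs offline.

```python
import random
from typing import Callable, List


def sample_responses(generate: Callable[[str], str], prompt: str, n: int = 5) -> List[str]:
    """Call the model n times on the same prompt and return every response."""
    return [generate(prompt) for _ in range(n)]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a random canned answer."""
    return random.choice(["day", "night", "sun"])


responses = sample_responses(fake_model, "Words: dream, break, light", n=5)
print(len(responses))  # 5
```

Keeping all five responses, rather than only the first, is what makes it possible to report averages and error bars later on.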

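01 Remote Associates Test: connecting unrelated words

The Remote Associates Test (RAT) is a creativity test developed by Martha Mednick in 1962. It mainly measures convergent thinking, meaning that each question has only one “correct” answer.

Given three seemingly unrelated words, the test taker must find a fourth word that connects all three. For example, “dream”, “break”, and “light” can be connected by the word “day” (daydream, daybreak, daylight). The words “sense”, “courtesy”, and “place” can be connected by the word “common”.

I collected word triplets from the Remote Associates Test website, 312 questions in total. Each question is rated on a difficulty scale that runs from very easy to very hard.

Breakdown of Remote Associates Test questions by difficulty, from the Remote Associates Test website

For each model and each word triplet, I used the following prompt template. To account for randomness in the responses, I called each model 5 times.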

    You are given three words. Think of a fourth word that is somehow related to each of the first three words. First give the answer ("Answer") then explain your reasoning ("Reason").

    Words: {Word1}, {Word2}, {Word3}
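A minimal sketch of how this template might be filled in and how a reply might be scored. The parsing step assumes the model follows the requested “Answer” / “Reason” format; the helper names (`build_prompt`, `extract_answer`, `is_correct`) are illustrative and do not come from the article.

```python
import re
from typing import Optional

# Prompt template quoted from the article, with its original placeholder names.
RAT_PROMPT = (
    "You are given three words. Think of a fourth word that is somehow related "
    "to each of the first three words. First give the answer (\"Answer\") then "
    "explain your reasoning (\"Reason\").\n\n"
    "Words: {Word1}, {Word2}, {Word3}"
)


def build_prompt(w1: str, w2: str, w3: str) -> str:
    """Fill the three words into the template."""
    return RAT_PROMPT.format(Word1=w1, Word2=w2, Word3=w3)


def extract_answer(response: str) -> Optional[str]:
    """Return the lowercased word after 'Answer:', or None if the format is missing."""
    match = re.search(r"Answer\s*:\s*\"?([A-Za-z-]+)", response)
    return match.group(1).lower() if match else None


def is_correct(response: str, expected: str) -> bool:
    """Exact-match scoring against the single designated answer."""
    return extract_answer(response) == expected.lower()


reply = "Answer: Day\nReason: daydream, daybreak, daylight."
print(is_correct(reply, "day"))  # True
```

Exact-match scoring is the simplest choice, and it is also what makes an answer such as “layer” count as wrong even when it is defensible.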

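Newer GPT models score higher on the Remote Associates Test

I calculated the proportion of correct answers generated by each GPT model. Unsurprisingly, newer models produced more correct answers.

Breaking performance down by question difficulty shows that newer GPT models outperform older ones at every difficulty level. However, the improvement on hard questions is much smaller than the improvement on easy ones. In other words, even GPT-4 struggles with hard questions.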
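The per-difficulty breakdown can be sketched as a simple group-by. The records below are made up for illustration and are not the article's data; `accuracy_by_difficulty` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_difficulty(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each difficulty level to the fraction of correct answers at that level."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, ok in results:
        totals[level] += 1
        correct[level] += int(ok)
    return {level: correct[level] / totals[level] for level in totals}


# Made-up (difficulty, was_correct) records, for illustration only.
toy_results = [
    ("easy", True), ("easy", True), ("easy", False), ("easy", True),
    ("hard", False), ("hard", True), ("hard", False), ("hard", False),
]
print(accuracy_by_difficulty(toy_results))  # {'easy': 0.75, 'hard': 0.25}
```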

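According to the Remote Associates Test website, “a typical person can solve most of the items marked as easy, about half of the medium ones, and few of the hard ones.” The website does not publish official statistics, but a quick look shows that GPT-4 did solve about half of the “medium” questions and a few of the “hard” ones, making it the only model that comes close to human level. The other GPT models performed worse than humans.

A few exceptions: GPT-2 answers correctly where GPT-4 does not

Although it did not happen often, there were some questions that the 2019 GPT-2 model answered correctly and the 2023 GPT-4 model did not. There were 8 such questions in my tests (the full list is in the appendix at the end of the article).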

An easy question

Words: cottage / brick / cake

GPT-2: “cheese” [correct]

GPT-4: “layer”

A medium question

Words: cat / number / phone

GPT-2: “call” [correct]

GPT-4: “dial”

A hard question

Words: through / disk / way

GPT-2: “drive” [correct]

GPT-4: “circle”

A very hard question

Words: land / hand / house

GPT-2: “farm” [correct]

GPT-4: “property”

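Note that GPT-4's answers to these questions are not unreasonable. In fact, without knowing the correct answer, you might well conclude that GPT-4's answer plausibly connects the three words. If so, you are not wrong. I address this in the next section.

Is GPT-4 wrong, or just thinking outside the box?

GPT-4 answered more than half of the questions incorrectly, but perhaps with good reason. I asked GPT-4 to explain each of its answers, and then looked at its explanations for some of the “Very Hard” questions that it got wrong.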

Words: artist / hatch / route

Correct answer: “escape”

GPT-4's answer: “map”

    The word "artist" might bring to mind the idea of creating something, perhaps even a visual representation of something. Maps are a common form of visual representation that require a degree of artistic skill to create.

    The word "hatch" could refer to the act of opening something or bringing something into existence. A map is a type of blueprint or plan, often used to navigate new territory or chart out a course of action. 

    The word "route" is related to direction and movement, which can be plotted and tracked on a map. A map can help you determine the best route to take when traveling, and can also provide information about the routes others have taken in the past.

    All in all, "Map" seems to be a word that fits well with each of the three initial words, as it encapsulates the ideas of artistic creation, planning and direction, and movement and navigation.

I found GPT-4's explanation of why “map” fits “artist / hatch / route” convincing.

Here is another example:

Words: fork / dark / man

Correct answer: “pitch”

GPT-4's answer: “shadow”

    The word "fork" can be associated with division or branching, such as a fork in the road. Shadows can also be seen as a type of division, as they represent the absence of light.

    The word "dark" refers to the absence of light. Shadows are created when light is blocked by an object, and can often appear darker than the surrounding area.

    The word "man" can be associated with the concept of identity or individuality. Shadows can also be used to represent the identity of a person, as they can provide clues about their shape and size.

    Therefore, "Shadow" seems to be a word that fits well with each of the initial three words, as it encapsulates the ideas of division, absence, and identity.

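These examples expose a weakness of the Remote Associates Test, and of convergent creativity tests in general: the “correct” answer may not be the only right answer. In these cases, GPT-4 produced a “wrong” answer only in the sense that it did not match the designated “correct” one.

In fact, when I asked GPT-4 to come up with several candidate fourth words for a triplet, it could give me multiple answers, all of which seemed workable.

How GPT-4 responds to arbitrarily generated word triplets

The GPT models may have seen the Remote Associates Test questions above in their training data. So how would GPT-4 respond to three arbitrarily chosen words? In this section, I construct my own “remote association” test and evaluate the outputs subjectively.

I took a list of the 10,000 most common English words[18]. I then randomly sampled three words from this list and asked GPT-4 to provide a fourth word that ties them together. Some of the three-word combinations were very strange and unrelated, yet GPT-4 always found a way to link them. I encourage readers to try these examples themselves before reading what GPT-4 generated!

After looking at the next few examples, you can judge for yourself: is GPT-4 thinking outside the box, or just forcing its answers to fit? (A longer list of examples appears in the appendix at the end of this article.)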
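The random-triplet construction described above can be sketched as follows. The small vocabulary is a stand-in for the 10,000-word list, and the fixed seed is an assumption added here so that the draw is reproducible; the article does not say how its sampling was seeded.

```python
import random
from typing import List, Tuple


def sample_triplets(words: List[str], k: int, seed: int = 0) -> List[Tuple[str, ...]]:
    """Draw k triplets, each made of three distinct words chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the same triplets come back each time
    return [tuple(rng.sample(words, 3)) for _ in range(k)]


# Tiny stand-in vocabulary; the article samples from a 10,000-word list instead.
vocab = ["dietary", "surveillance", "born", "lucky", "field", "twelve",
         "grande", "cognitive", "patrick"]
for triplet in sample_triplets(vocab, k=3):
    print(" / ".join(triplet))
```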

⛑️ Health

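This answer sounds quite convincing. Interestingly, though, its definition of “health surveillance” is copied almost word for word from the CDC website's introduction to public health surveillance[19].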

Public health surveillance is “the ongoing, systematic collection, analysis, and interpretation of health-related data essential to planning, implementation, and evaluation of public health practice.” — Field Epidemiology

(from the CDC website)

    Words: dietary / surveillance / born

    Answer: Health

    Reason: 
    1. Dietary - A person's diet is directly related to their health, as it can affect their overall well-being, energy levels, and risk for certain diseases.
    2. Surveillance - Health surveillance is the ongoing, systematic collection, analysis, and interpretation of data essential to the planning, implementation, and evaluation of health practice.
    3. Born - When a person is born, their health is of utmost importance, as it can have long-lasting effects on their development, growth, and overall quality of life.

🍀 Clovers

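At first I was persuaded by GPT-4's answer; in fact, it sounded very clever. But when I searched Google and found that there is no such thing as Trifolium duodecim, I was less convinced. This is likely an example of a “hallucination”, where an LLM makes up an answer that is not factually accurate.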

    Words: lucky / field / twelve

    Answer: Clover

    Reason: 
    1. Lucky - A four-leaf clover is considered a symbol of good luck.
    2. Field - Clovers are plants that can be commonly found growing in fields.
    3. Twelve - There is a variety of clover known as "Trifolium duodecim" (Twelve-leaf clover), which has 12 leaflets.

☕️ Starbucks

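This example is my favorite. It has an outside-the-box feel, but it is still a bit of a stretch. It is even a little funny (especially the part about Patrick).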

    Words: grande / cognitive / patrick

    Answer: Starbucks

    Reason: 
    1. Grande: Starbucks uses the term "grande" to refer to one of its medium-sized drinks.
    2. Cognitive: Starbucks is a popular place for people to go to relax, study, and work, which are all activities that require cognitive function.
    3. Patrick: Patrick Dempsey, the actor who played Dr. Derek Shepherd on the television show Grey's Anatomy, is known for his love of coffee and was nicknamed "McDreamy." Starbucks is a well-known coffee shop, so it's possible that his character would have frequented Starbucks.

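These examples and analyses show that creativity tests with only one correct answer can be limiting. The next sections explore other tests that do not share this limitation.

02 Alternate Uses Task: finding other uses for ordinary objects

Another popular way to measure creativity is the Alternate Uses Task (AUT), which asks people to come up with as many uses as possible for an everyday object. Created by J. P. Guilford in 1967, the test differs from the Remote Associates Test in that it measures divergent thinking, meaning there is no correct answer.

Test takers must come up with as many uses as possible for an everyday object. For example, a “laptop” can be used as a “plate”, as a “body shield”, or as “wall art”. Responses are scored on metrics such as originality (how different the uses are) and flexibility (how varied the ideas are). The aim is to uncover less obvious uses for ordinary objects.

Scoring responses on these measures still requires human annotators, and an automated pipeline cannot fully replace that judgment. I therefore used a proxy for creativity, the same one used by researchers at the University of Amsterdam in Putting GPT-3’s Creativity to the (Alternative Uses) Test[20].

In that paper, the researchers asked both humans and GPT-3 (davinci-002) to come up with alternative uses for three common objects: fork, tin can, and book. As a proxy for creativity, they measured the semantic distance between each object and the generated uses; that is, how different is a generated use from the object itself?

I obtained the human responses from the paper's project website[21] so that I could compare against human performance. For the GPT models I used the following prompt (based on a similar prompt used in the paper).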

    Come up with 10 creative uses for the following item that might strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different: {item}

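Embedding cosine distance as a proxy for creativity

In natural language processing, semantic distance (or semantic similarity[22]) is a measure of how related two words are, based on how often they are used together in similar contexts. For example, “computer” and “screen” probably have a small semantic distance because they often appear in the same contexts, whereas “computer” and “lentils” probably have a larger semantic distance because they rarely do.

To compute the semantic distance between two words, I followed the method used in the paper above[23]. I used spaCy (en_core_web_lg), a natural language processing library, to compute word embeddings for the object (for example, “laptop”) and the use (for example, “plate”). I then took the cosine distance between the two embeddings as their semantic distance.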
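Cosine distance is one minus the cosine similarity of two vectors. The sketch below uses made-up three-dimensional vectors so that it runs without any model download; real embeddings from spaCy have far more dimensions, and the numbers here say nothing about actual word distances.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings", for illustration only. With spaCy installed,
# real vectors could come from spacy.load("en_core_web_lg")("laptop").vector.
laptop = [1.0, 0.0, 0.0]
screen = [1.0, 1.0, 0.0]
lentils = [0.0, 0.0, 1.0]

print(round(cosine_distance(laptop, screen), 3))   # 0.293
print(round(cosine_distance(laptop, lentils), 3))  # 1.0
```

A distance near 0 means the vectors point the same way, and a distance of 1 means they are orthogonal, which is the sense in which a use is “far” from its object.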

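For the Alternate Uses Task, a larger semantic distance is taken to indicate a more “creative” use. The results show that for all three objects (fork, tin can, and book), GPT-3's uses had lower cosine distances than the human responses, so the human responses were semantically more distant. In some cases, however, GPT-3.5 or GPT-4 matched or exceeded human semantic distance scores.

Humans tended to come up with more creative uses than GPT-3 (using cosine-based semantic distance as the stand-in for creativity), but GPT-4 came close to human level: its responses were semantically far from the original item, at a level similar to human responses. Still, without human annotation of subjective categories such as originality, flexibility, and usefulness, a cosine-distance score captures only a small part of the creative output of GPT models and is limited accordingly.

Can you tell human-generated uses from AI-generated ones?

One of the lists below shows alternative uses for a tin can given by humans; the other shows uses generated by GPT-4. Can you tell the difference? The answer is below.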

List 1: Uses for tin can

as a mirror
to create toys
as art
as a reminder of Andy Warhol
as a key ring with the clip from the can
as jewelry

List 2: Uses for tin can

miniature drum set
quirky plant pot
impromptu cookie cutter
homemade camp stove
whimsical wind chimes
miniature herb garden

The test in the next section also measures divergent thinking.

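03 Divergent Association Task: generating ten unrelated words

The third creativity test is the Divergent Association Task (DAT), created by researchers in 2021. As its name suggests, the test is divergent and has no correct answer.

A version of the test is available on the DAT website. It takes only a few minutes, since it simply involves listing ten nouns that are as unrelated to each other as possible. Here is my result from the website:

The DAT Score

Readers may wonder how the score is calculated; the test's authors have published a description of their scoring procedure. The DAT score is computed using embeddings and cosine distance. I adapted their procedure as follows:

Compute the GloVe embedding of each word
Compute the cosine distance between each pair of words
Take the average of these values

This process is similar to the creativity proxy score used for the Alternate Uses Task; the only difference is the word embedding model.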
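The three steps above can be sketched as an average over all word pairs. The two-dimensional vectors are made up so that the example runs without downloading GloVe; `mean_pairwise_distance` is a hypothetical helper name. The function returns the plain average; the published DAT score is this average multiplied by 100.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(words: List[str], embeddings: Dict[str, Sequence[float]]) -> float:
    """Average the cosine distance over every unordered pair of words."""
    pairs = list(combinations(words, 2))
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


# Made-up 2-dimensional vectors, for illustration only (real GloVe vectors are much larger).
toy_embeddings = {"dawn": [1.0, 0.0], "dusk": [1.0, 0.0], "galaxy": [0.0, 1.0]}
score = mean_pairwise_distance(["dawn", "dusk", "galaxy"], toy_embeddings)
print(round(score, 3))  # 0.667
```

In the toy data, “dawn” and “dusk” share a vector and contribute a distance of 0, which is why a list of near-synonyms drags the average down.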

Comparing divergence across models

The following prompt, taken from the DAT website[24], was used for the GPT models.

    Please enter 10 words that are as different from each other as possible, in all meanings and uses of the words.

    Rules
    - Only single words in English.
    - Only nouns (e.g., things, objects, concepts).
    - No proper nouns (e.g., no specific people or places).
    - No specialized vocabulary (e.g., no technical terms).
    - Think of the words on your own (e.g., do not just look at objects in your surroundings).

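The results below do not include GPT-2, because its output was gibberish. (An example is in the appendix at the end of the article.)

The results show that newer models score higher on the DAT. In other words, newer models are better at generating 10 words that are semantically different from each other.

For example, the following word list generated by GPT-3 has a DAT score of 0.36: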

    dusk, night, day, dawn, morning, afternoon, evening, midnight, twilight

By contrast, the following word list generated by GPT-4 has a score of 0.95:

    elephant, microscope, sandcastle, galaxy, harmony, spatula, glacier, philosophy, tambourine, drought

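According to the paper that introduced the DAT, scores typically fall between 65 and 90, with an average between 75 and 80, on a scale where the average cosine distance is multiplied by 100 (so the 0.95 above corresponds to 95). A score of 95 is considered very high.

Note that in the chart above, the error bars for GPT-4 are much smaller than those for GPT-3. This becomes even clearer in the next experiment, where I vary the “temperature” parameter to change the amount of randomness in each GPT model's responses.

Temperature controls the randomness of sampling: values close to 0 give the most predictable output, and values close to 1 are considerably more random (the OpenAI API also accepts values above 1). I tried temperature values from 0.5 to 0.9 in increments of 0.1 (in the previous experiments, I kept the temperature at 0.7). I expected higher temperatures to produce more variation and lower temperatures to produce more consistent output.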
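To make the role of temperature concrete, here is a sketch of the textbook formulation, in which logits are divided by the temperature before the softmax. This is a general illustration and an assumption about how sampling works in principle, not a description of how the OpenAI API implements it; the logits are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 0.7, 0.9):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, and higher temperatures spread it across the alternatives, which is why one would expect more varied word lists as the temperature rises.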

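However, the results show that temperature matters far less than the choice of model when it comes to variance in semantic distance. For GPT-3, DAT scores varied widely from one 10-word list to the next, regardless of temperature. For GPT-4, DAT scores stayed within a narrow range with very little variation, regardless of temperature.

This became even clearer when I looked at the text generated by the three GPT models. GPT-4's output varied far less than GPT-3's. In a way, it felt as though GPT-4 already knew the “right answer” to a task that is not supposed to have one.

Aggregating over all temperature settings, I had each GPT model generate a list of 10 words 50 times, and then counted how often each word appeared across the 50 runs. The most common word generated by GPT-3 was book, which appeared in 19 of 50 runs (38%). The most common word generated by GPT-4, on the other hand, was microscope, which appeared in 48 of 50 runs (96%).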
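The frequency count can be sketched with `collections.Counter`. The runs below are made up for illustration (four short lists rather than 50 runs of 10 words), and `word_run_frequency` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Tuple


def word_run_frequency(runs: List[List[str]]) -> List[Tuple[str, float]]:
    """For each word, the fraction of runs whose list contains it, most frequent first."""
    counts = Counter()
    for words in runs:
        counts.update(set(w.lower() for w in words))  # count each word once per run
    return [(word, n / len(runs)) for word, n in counts.most_common()]


# Made-up runs, for illustration only (the article uses 50 runs of 10 words each).
toy_runs = [
    ["microscope", "galaxy", "spatula"],
    ["microscope", "volcano", "harmony"],
    ["elephant", "microscope", "galaxy"],
    ["microscope", "glacier", "drought"],
]
top_word, share = word_run_frequency(toy_runs)[0]
print(top_word, share)  # microscope 1.0
```

Applied to the article's numbers, the same ratio gives 19 / 50 = 0.38 for “book” and 48 / 50 = 0.96 for “microscope”.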

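This confirms that GPT-3 is in fact more variable in the words it generates, while GPT-4 plays it safe by predicting many of the same words that it knows will score well on a “diverse words” task, such as microscope, volcano, elephant, and galaxy.

This finding suggests that GPT-4 seems to have “cracked the code” for generating what it considers diverse words. If so, I am not sure I can confidently say that GPT-4 is being creative, rather than having reached a local optimum for “how best to play this particular game”.

04 Conclusion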

“We can imagine a future in which GPT-3 and other generative LLMs responses cannot be distinguished from humans, although the creative process will be different.”

— from Putting GPT-3’s Creativity to the (Alternative Uses) Test

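Accurately measuring something as complex and multifaceted as creativity with a limited set of text-based tests is challenging. Although the Visual Remote Associates Test tries to address this by adding a visual element, such assessments still have inherent limitations. Convergent tests (such as the Remote Associates Test) have a “right answer” that may not always be the most creative or even the most correct solution. Divergent tests (such as the Divergent Association Task), on the other hand, have no specific right answer and rely on imperfect word embeddings that may contain biases.

In addition, GPT models' performance on these tests may be affected by exposure to the tasks during training, which can lead to data overlap. For example, GPT-4's performance on the Divergent Association Task showed a peculiar repetition of words, suggesting that it may have had prior knowledge of the “right” answers. Developing alternative testing methods is crucial, as there are rumors that GPT-4 was trained on test data, particularly for coding tasks.

Despite these limitations, creativity tests can still serve as benchmarks for comparing and tracking the performance of large language models. Future work could repeat these experiments in other languages to assess the creativity of GPT models across linguistic domains. Doing so would give us a fuller picture of AI-generated content and help us further explore the capabilities and limitations of these advanced language models.

05 Appendix

Remote Associates Test: comparison of questions answered correctly by each GPT model

We can look at the distribution of difficulty levels among the questions that each model answered correctly. GPT-3, GPT-3.5, and GPT-4 are similar in the share of their correct answers that come from hard questions.

Remote Associates Test questions that GPT-2 answered correctly and GPT-4 answered incorrectly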

    Words:  cat / number / phone
    Level:  Medium
    Answer: call
    GPT2:   call 
    GPT4:   dial

    Words:  land / hand / house
    Level:  Very Hard
    Answer: farm
    GPT2:   farm 
    GPT4:   property

    Words:  water / pen / soda
    Level:  Unknown
    Answer: fountain
    GPT2:   fountain 
    GPT4:   liquid

    Words:  measure / worm / video
    Level:  Very Easy
    Answer: tape
    GPT2:   tape 
    GPT4:   length

    Words:  cottage / brick / cake
    Level:  Very Easy
    Answer: cheese
    GPT2:   cheese 
    GPT4:   layer

    Words:  through / disk / way
    Level:  Hard
    Answer: drive
    GPT2:   drive 
    GPT4:   circle

    Words:  skunk / kings / boiled
    Level:  Very Easy
    Answer: cabbage
    GPT2:   cabbage 
    GPT4:   smell

    Words:  jump / kill / bliss
    Level:  Hard
    Answer: joy
    GPT2:   joy 
    GPT4:   height

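GPT-4's answers to randomly generated Remote Associates Test questions

Below are GPT-4's answers to some randomly generated Remote Associates Test questions.

Rings

This one feels like a bit of a stretch. If you are very jaded about your love life or something, I can accept the part about rings being “a symbol of complications, especially in relationships”, but the last part feels forced.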

    Words: complications | saturn | drops
    Answer: Rings
    Reason: 
    1. Complications: Rings can be a symbol of complications, especially in relationships, as they represent commitment and the complexities that come with it.
    2. Saturn: Saturn is famously known for its beautiful and prominent rings, made primarily of ice particles and some rocky debris.
    3. Drops: A ring can be seen as a drop or loop of material, such as metal or plastic, that forms a complete circle.

Revolution

Monte Cassino[26] is indeed a real place, and it was indeed bombed during World War II.

    Words: belfast | movement | monte
    Answer: Revolution
    Reason: 
    1. Belfast is a city in Northern Ireland that has experienced significant political and social changes due to the Irish Republican movement. 
    2. Movement is a general term that can describe political and social changes, often caused by revolutions. 
    3. Monte is likely referring to Monte Cassino, a historic site in Italy where a significant battle occurred during World War II, which was a time of global revolution and change. 

    Therefore, the word "revolution" is related to each of the first three words as it represents significant change and upheaval.

Help

An interesting way to connect the city of “Southampton” to “help” via help centers…

    Words: ensures | assistant | southampton
    Answer: Help
    Reason: 
    1. Ensures: "Help" is related to "ensures" as it is an action that can ensure the success or completion of a task.
    2. Assistant: "Help" is related to "assistant" because an assistant provides help in various tasks and activities.
    3. Southampton: "Help" is related to "Southampton" as Southampton is a city in England which has various help centers and services, such as the Southampton Voluntary Services, providing assistance to the community.

Internet

I think it makes a very valid point.

    Words: uruguay | accessible | requirement
    Answer: Internet
    Reason: Uruguay has a government initiative called Plan Ceibal, which aims to make the Internet accessible to all its citizens, fulfilling a requirement for digital inclusion and reducing the digital divide.

Language

Perhaps a bit abstract, but I think it works.

    Words: compensation | atlas | prefix
    Answer: Language
    Reason: 
    1. Compensation - In linguistics, compensation is a phenomenon where a sound change in one part of a word is "compensated" by a change in another part of the word to maintain the overall structure of the word. 
    2. Atlas - An atlas is a collection of maps, which often include labels and descriptions in a specific language.
    3. Prefix - A prefix is a morpheme or a linguistic element that is added to the beginning of a word to create a new word with a different meaning, which is a concept in language and grammar.

Health

The health surveillance part is a bit dark, but I think the answer works.

    Words: dietary | surveillance | born
    Answer: Health
    Reason: 
    1. Dietary - A person's diet is directly related to their health, as it can affect their overall well-being, energy levels, and risk for certain diseases.
    2. Surveillance - Health surveillance is the ongoing, systematic collection, analysis, and interpretation of data essential to the planning, implementation, and evaluation of health practice.
    3. Born - When a person is born, their health is of utmost importance, as it can have long-lasting effects on their development, growth, and overall quality of life.

ERIC

This answer is actually quite funny.

    Words: eric | figures | purchasing
    Answer: accountant
    Reason: An accountant is a professional who can be named Eric, works with figures (numbers), and is often involved in purchasing decisions for businesses.

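GPT-2's output on the Divergent Association Task

I excluded GPT-2 from the divergent-test analysis because its output was complete nonsense. Even when I simplified the prompt as much as possible, its output was not useful. I include a sample here because I think it is actually quite funny.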

    List 10 words that are as different from each other as possible:

    "Merry Christmas"

    "I'm not a robot, we're the robots"

    "It's a good thing we're all robots"

    "I'm just a robot"

    "I'm just a robot"

    "I'm just a robot"

    "I'm just a robot"

    "I'm not a robot"

    "I'm just a robot"

    "I'm just a robot"

END

References

1.https://www.reuters.com/technology/chatgpt-launches-boom-ai-w…

2.https://blog.yenniejun.com/p/creative-writing-with-gpt-3-from

3.https://twitter.com/mehran__jalali/status/1639846978850021377…

4.https://towardsdatascience.com/using-chatgpt-as-a-creative-wr…

5.https://www.ideatovalue.com/podc/nickskillicorn/2021/04/the-1…

6.https://en.wikipedia.org/wiki/Remote_Associates_Test

7.https://en.wikipedia.org/wiki/Guilford%27s_Alternate_Uses

8.https://en.wikipedia.org/wiki/Divergent_Association_Task

9.https://huggingface.co/openai-gpt

10.https://cdn.openai.com/research-covers/language-unsupervised/…

11.https://huggingface.co/gpt2

12.https://d4mucfpksywv.cloudfront.net/better-language-models/la…

13.https://platform.openai.com/docs/models/gpt-3-5

14.https://arxiv.org/pdf/2005.14165.pdf

15.https://en.wikipedia.org/wiki/ChatGPT

16.https://platform.openai.com/docs/models/gpt-3-5

17.https://arxiv.org/pdf/2303.08774.pdf

18.https://github.com/first20hours/google-10000-english

19.https://www.cdc.gov/training/publichealth101/surveillance.htm…

20.https://arxiv.org/pdf/2206.08932.pdf

21.https://osf.io/vmk3c/?view_only=

22.https://en.wikipedia.org/wiki/Semantic_similarity

23.https://arxiv.org/pdf/2206.08932.pdf

24.https://www.datcreativity.com/task

25.https://www.pnas.org/doi/10.1073/pnas.2022340118

26.https://en.wikipedia.org/wiki/Monte_Cassino

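This article was translated by Baihai IDP with the original author's permission. To republish the translation, please contact us for authorization.

Original article: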

https://blog.yenniejun.com/p/exploring-creativity-in-large-la…
