How to Integrate AI Speech Recognition into a Unity Game

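Introduction

Speech recognition is a technology that converts speech into text. How might it be useful in a game? You could issue commands to operate a control panel or a game character, talk directly with NPCs, make the game more interactive, and more. This article shows how to use the Hugging Face Unity API to integrate state-of-the-art (SOTA) speech recognition into a Unity game.

You can download the sample Unity game from the itch.io website and try the speech recognition feature yourself.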

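Prerequisites

This article assumes some familiarity with basic Unity concepts. You also need to install the Hugging Face Unity API; see the previous blog post for installation instructions.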

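Steps

1. Set up the scene

In this tutorial we set up a very simple scene. The player clicks buttons to start or stop recording, and the recorded audio is recognized and converted to text.

First, create a new Unity project, then create a Canvas containing three UI components:

  1. Start button: press to start recording.
  2. Stop button: press to stop recording.
  3. Text component (TextMeshPro): where the speech recognition result is displayed.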

2. Create the script

Create a script named SpeechRecognitionTest and attach it to an empty GameObject.

In the script, first define references to the UI components:

[SerializeField] private Button startButton;
[SerializeField] private Button stopButton;
[SerializeField] private TextMeshProUGUI text;

Assign the corresponding components in the Inspector window.

Then use the Start() method to set up listeners for the start and stop buttons:

private void Start() {
    startButton.onClick.AddListener(StartRecording);
    stopButton.onClick.AddListener(StopRecording);
}

At this point, the script should look like this:

using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
    }

    private void StartRecording() {}

    private void StopRecording() {}
}

3. Record microphone input

Now let's record microphone input and encode it as WAV. First define the member variables:

private AudioClip clip;
private byte[] bytes;
private bool recording;

Then, in StartRecording(), use the Microphone.Start() method to begin recording:

private void StartRecording() {
    clip = Microphone.Start(null, false, 10, 44100);
    recording = true;
}

The code above records up to 10 seconds of audio at 44100 Hz from the default microphone (the null device name), without looping.
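To get a feel for what these parameters mean for memory and upload size, the arithmetic can be sketched as below. The class and method names (RecordingSize, PcmBytes, WavBytes) are made up for illustration and are not part of Unity or the Hugging Face Unity API. The sketch assumes 16-bit samples and a 44-byte header, matching the WAV encoding used later in this tutorial.

```csharp
// Hypothetical helper for illustration only.
public static class RecordingSize
{
    // Size in bytes of the raw 16-bit PCM data for a recording.
    public static int PcmBytes(int seconds, int frequency, int channels)
    {
        const int bytesPerSample = 2; // 16-bit PCM
        return seconds * frequency * channels * bytesPerSample;
    }

    // Size in bytes of the complete WAV file: 44-byte header plus PCM data.
    public static int WavBytes(int seconds, int frequency, int channels)
    {
        const int headerBytes = 44;
        return headerBytes + PcmBytes(seconds, frequency, channels);
    }
}
```

For a full 10-second mono recording at 44100 Hz, RecordingSize.WavBytes(10, 44100, 1) gives 882044 bytes, so each request uploads a little under 1 MB at most.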

When the recording reaches the 10-second limit, we want it to stop automatically. To do that, add the following to the Update() method:

private void Update() {
    if (recording && Microphone.GetPosition(null) >= clip.samples) {
        StopRecording();
    }
}

Next, in StopRecording(), truncate the recording to the captured length and encode it as WAV:

private void StopRecording() {
    var position = Microphone.GetPosition(null);
    Microphone.End(null);
    var samples = new float[position * clip.channels];
    clip.GetData(samples, 0);
    bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
    recording = false;
}
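The truncation step is what keeps a 3-second recording from being sent as a full 10-second buffer padded with silence. A minimal sketch of the same idea on a plain float array is shown below. SampleTrimmer and its methods are hypothetical names, and the position is assumed to be counted in sample frames (one sample per channel), which is what the position * clip.channels expression in StopRecording() implies.

```csharp
using System;

// Hypothetical helper; not part of Unity or the Hugging Face Unity API.
public static class SampleTrimmer
{
    // Copies only the interleaved samples recorded so far.
    // position is assumed to be in sample frames (one frame = one sample per channel).
    public static float[] Trim(float[] buffer, int position, int channels)
    {
        int count = Math.Max(0, Math.Min(position * channels, buffer.Length));
        var trimmed = new float[count];
        Array.Copy(buffer, trimmed, count);
        return trimmed;
    }

    // Length of the recorded audio in seconds.
    public static double DurationSeconds(int position, int frequency)
    {
        return (double)position / frequency;
    }
}
```

For example, a position of 22050 frames at 44100 Hz corresponds to half a second of audio.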

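Finally, we need to implement the EncodeAsWAV() method, which prepares the audio data for the Hugging Face API. It writes a 44-byte WAV header followed by the samples as 16-bit PCM: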

private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
    using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
        using (var writer = new BinaryWriter(memoryStream)) {
            writer.Write("RIFF".ToCharArray());
            writer.Write(36 + samples.Length * 2);      // RIFF chunk size (file size - 8)
            writer.Write("WAVE".ToCharArray());
            writer.Write("fmt ".ToCharArray());         // chunk ID is 4 bytes, including the trailing space
            writer.Write(16);                           // fmt chunk size for PCM
            writer.Write((ushort)1);                    // audio format: 1 = PCM
            writer.Write((ushort)channels);
            writer.Write(frequency);                    // sample rate
            writer.Write(frequency * channels * 2);     // byte rate
            writer.Write((ushort)(channels * 2));       // block align
            writer.Write((ushort)16);                   // bits per sample
            writer.Write("data".ToCharArray());
            writer.Write(samples.Length * 2);           // data chunk size in bytes

            // Convert each float sample in [-1, 1] to a 16-bit integer.
            foreach (var sample in samples) {
                writer.Write((short)(sample * short.MaxValue));
            }
        }
        return memoryStream.ToArray();
    }
}
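One way to sanity-check the encoder's output is to read the header back and compare the fields with what was written. The sketch below is a hypothetical helper (WavHeaderCheck is not part of Unity or the Hugging Face Unity API) and assumes the canonical 44-byte PCM layout described above, with no extra chunks between "fmt " and "data". Note that the "fmt " chunk ID is four bytes long, including a trailing space.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for checking the output of an EncodeAsWAV()-style encoder.
public static class WavHeaderCheck
{
    public sealed class Info
    {
        public int Channels;
        public int SampleRate;
        public int BitsPerSample;
        public int DataBytes;
    }

    // Parses a canonical 44-byte PCM WAV header and throws if a chunk ID is wrong.
    public static Info Read(byte[] wav)
    {
        using (var reader = new BinaryReader(new MemoryStream(wav)))
        {
            Expect(reader, "RIFF");
            reader.ReadInt32();                 // RIFF chunk size (file size - 8)
            Expect(reader, "WAVE");
            Expect(reader, "fmt ");             // note the trailing space
            reader.ReadInt32();                 // fmt chunk size (16 for PCM)
            reader.ReadUInt16();                // audio format (1 = PCM)
            var info = new Info();
            info.Channels = reader.ReadUInt16();
            info.SampleRate = reader.ReadInt32();
            reader.ReadInt32();                 // byte rate
            reader.ReadUInt16();                // block align
            info.BitsPerSample = reader.ReadUInt16();
            Expect(reader, "data");
            info.DataBytes = reader.ReadInt32();
            return info;
        }
    }

    // Reads a 4-byte ASCII chunk ID and compares it with the expected value.
    private static void Expect(BinaryReader reader, string id)
    {
        string actual = Encoding.ASCII.GetString(reader.ReadBytes(4));
        if (actual != id)
        {
            throw new InvalidDataException($"Expected '{id}' but found '{actual}'.");
        }
    }
}
```

Calling WavHeaderCheck.Read(bytes) on the encoder's output should report 16 bits per sample and a data size of twice the sample count. If the "fmt " ID is written without its trailing space, every later field shifts by one byte and the check fails.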

The complete script looks like this:

using System.IO;
using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private AudioClip clip;
    private byte[] bytes;
    private bool recording;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
    }

    private void Update() {
        // Stop automatically once the 10-second clip is full.
        if (recording && Microphone.GetPosition(null) >= clip.samples) {
            StopRecording();
        }
    }

    private void StartRecording() {
        clip = Microphone.Start(null, false, 10, 44100);
        recording = true;
    }

    private void StopRecording() {
        var position = Microphone.GetPosition(null);
        Microphone.End(null);
        var samples = new float[position * clip.channels];
        clip.GetData(samples, 0);
        bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
        recording = false;
    }

    private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
        using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
            using (var writer = new BinaryWriter(memoryStream)) {
                writer.Write("RIFF".ToCharArray());
                writer.Write(36 + samples.Length * 2);
                writer.Write("WAVE".ToCharArray());
                writer.Write("fmt ".ToCharArray());     // 4-byte chunk ID, trailing space included
                writer.Write(16);
                writer.Write((ushort)1);
                writer.Write((ushort)channels);
                writer.Write(frequency);
                writer.Write(frequency * channels * 2);
                writer.Write((ushort)(channels * 2));
                writer.Write((ushort)16);
                writer.Write("data".ToCharArray());
                writer.Write(samples.Length * 2);

                foreach (var sample in samples) {
                    writer.Write((short)(sample * short.MaxValue));
                }
            }
            return memoryStream.ToArray();
        }
    }
}

To test whether the script works, add the following line to the end of the StopRecording() method:

File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);

Now click the Start button, speak into the microphone, and then click the Stop button. The recording is saved as test.wav in the Unity Assets folder of your project.
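Writing into the Assets folder is convenient in the editor, but Application.dataPath is generally not a good place to write files in a built player. If you want to keep this debugging step outside the editor, one option is to write to Application.persistentDataPath instead. The sketch below assumes that approach; DebugWavWriter is a hypothetical helper name, not something provided by Unity or the Hugging Face Unity API.

```csharp
using System.IO;
using UnityEngine;

// Hypothetical debugging helper for saving the encoded recording.
public static class DebugWavWriter
{
    // Writes the encoded WAV bytes to a per-user writable folder and returns the full path.
    public static string Save(byte[] wavBytes, string fileName = "test.wav")
    {
        string path = Path.Combine(Application.persistentDataPath, fileName);
        File.WriteAllBytes(path, wavBytes);
        Debug.Log($"Saved recording to {path}");
        return path;
    }
}
```

At the end of StopRecording(), calling DebugWavWriter.Save(bytes) would log where the file was written so you can open it in any audio player.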

4. Speech recognition

Next, we use the Hugging Face Unity API to run speech recognition on the encoded audio. To do so, create a SendRecording() method:

using HuggingFace.API;

private void SendRecording() {
    HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
        text.color = Color.white;
        text.text = response;
    }, error => {
        text.color = Color.red;
        text.text = error;
    });
}

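This method sends the encoded audio to the speech recognition API. On success it displays the response in white; otherwise it displays the error message in red.

Don't forget to call SendRecording() at the end of the StopRecording() method: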

private void StopRecording() {
    /* other code */
    SendRecording();
}
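Once the transcript arrives in the success callback, a game will usually want to act on it rather than only display it, for example to drive the voice commands mentioned in the introduction. The following is a minimal sketch of one way to do that with simple phrase matching. VoiceCommands, the phrases, and the command names are all invented for illustration, and plain substring matching is a deliberately naive choice (it would also match "stop" inside "unstoppable").

```csharp
using System;

// Hypothetical command matcher; the phrases and command names are examples only.
public static class VoiceCommands
{
    // Checked in order, so longer or more specific phrases should come first.
    private static readonly (string Phrase, string Command)[] Phrases =
    {
        ("open the door", "OpenDoor"),
        ("jump", "Jump"),
        ("stop", "Stop"),
    };

    // Returns the first command whose phrase appears in the transcript, or null if none matches.
    public static string Match(string transcript)
    {
        if (string.IsNullOrWhiteSpace(transcript))
        {
            return null;
        }

        // Normalize case and surrounding whitespace before comparing.
        string normalized = transcript.Trim().ToLowerInvariant();
        foreach (var entry in Phrases)
        {
            if (normalized.Contains(entry.Phrase))
            {
                return entry.Command;
            }
        }
        return null;
    }
}
```

Inside the success callback of SendRecording(), you could call VoiceCommands.Match(response) and trigger the matching game action when the result is not null.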

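5. Final touches

Finally, let's improve the user experience with button interactability and status messages.

The start and stop buttons should only be interactable when appropriate: when ready to record, while recording, and after recording has stopped.

While recording or waiting for the API to return a result, we can set the response text to a simple status message.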
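The rules described above can be summarized as a small state table. The sketch below is one possible way to express them; RecorderState and ButtonRules are hypothetical names used only for illustration, and the tutorial's script sets the same values inline rather than through a helper like this.

```csharp
// Hypothetical summary of the UI rules; not part of the tutorial's script.
public enum RecorderState { Ready, Recording, Sending }

public static class ButtonRules
{
    // The start button is usable only when nothing is in progress.
    public static bool StartEnabled(RecorderState state)
    {
        return state == RecorderState.Ready;
    }

    // The stop button is usable only while recording.
    public static bool StopEnabled(RecorderState state)
    {
        return state == RecorderState.Recording;
    }

    // Status message shown while work is in progress; empty when ready.
    public static string StatusText(RecorderState state)
    {
        switch (state)
        {
            case RecorderState.Recording: return "Recording...";
            case RecorderState.Sending: return "Sending...";
            default: return "";
        }
    }
}
```

While a request is in flight (the Sending state), both buttons are disabled, which prevents a second recording from starting before the first response arrives.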

The complete script looks like this:

using System.IO;
using HuggingFace.API;
using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private AudioClip clip;
    private byte[] bytes;
    private bool recording;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
        stopButton.interactable = false;
    }

    private void Update() {
        // Stop automatically once the 10-second clip is full.
        if (recording && Microphone.GetPosition(null) >= clip.samples) {
            StopRecording();
        }
    }

    private void StartRecording() {
        text.color = Color.white;
        text.text = "Recording...";
        startButton.interactable = false;
        stopButton.interactable = true;
        clip = Microphone.Start(null, false, 10, 44100);
        recording = true;
    }

    private void StopRecording() {
        var position = Microphone.GetPosition(null);
        Microphone.End(null);
        var samples = new float[position * clip.channels];
        clip.GetData(samples, 0);
        bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
        recording = false;
        SendRecording();
    }

    private void SendRecording() {
        text.color = Color.yellow;
        text.text = "Sending...";
        stopButton.interactable = false;
        HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
            text.color = Color.white;
            text.text = response;
            startButton.interactable = true;
        }, error => {
            text.color = Color.red;
            text.text = error;
            startButton.interactable = true;
        });
    }

    private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
        using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
            using (var writer = new BinaryWriter(memoryStream)) {
                writer.Write("RIFF".ToCharArray());
                writer.Write(36 + samples.Length * 2);
                writer.Write("WAVE".ToCharArray());
                writer.Write("fmt ".ToCharArray());     // 4-byte chunk ID, trailing space included
                writer.Write(16);
                writer.Write((ushort)1);
                writer.Write((ushort)channels);
                writer.Write(frequency);
                writer.Write(frequency * channels * 2);
                writer.Write((ushort)(channels * 2));
                writer.Write((ushort)16);
                writer.Write("data".ToCharArray());
                writer.Write(samples.Length * 2);

                foreach (var sample in samples) {
                    writer.Write((short)(sample * short.MaxValue));
                }
            }
            return memoryStream.ToArray();
        }
    }
}

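Congratulations! You can now integrate SOTA speech recognition into your Unity game!

If you have any questions, or would like to get more involved in the Hugging Face for Games series, you can join the Hugging Face Discord!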


Original English article: https://hf.co/blog/unity-asr

Author: Dylan Ebert

Translator: SuSung-boy

Proofreading / layout: zhongdongy (阿东)
