AIAvatarKit
🥰 Building AI-based conversational avatars lightning fast ⚡️💬
✨ Features
- Live anywhere: VRChat, cluster, and any other metaverse platform, and even devices in the real world.
- Extensible: Unlimited capabilities that depend on you.
- Easy to start: Ready to start a conversation right out of the box.
🍩 Requirements
- VOICEVOX API in your computer or network reachable machine (Text-to-Speech)
- API key for Speech Services of Google or Azure (Speech-to-Text)
- API key for OpenAI API (ChatGPT)
- Python 3.10 (Runtime)
Quick start
Install AIAvatarKit.
$ pip install aiavatar
Create a script named run.py.
from aiavatar import AIAvatar
app = AIAvatar(
openai_api_key="YOUR_OPENAI_API_KEY",
google_api_key="YOUR_GOOGLE_API_KEY"
)
app.start_listening_wakeword()
Start AIAvatar. Also, don't forget to launch VOICEVOX beforehand.
$ python run.py
Conversation will start when you say the wake word "こんにちは" (or "Hello" when the language is not ja-JP).
Feel free to enjoy the conversation afterwards!
Configuration Guide
Here is the configuration for each component.
Generative AI
You can set the model and system message content when instantiating AIAvatar.
app = AIAvatar(
openai_api_key="YOUR_OPENAI_API_KEY",
google_api_key="YOUR_GOOGLE_API_KEY",
model="gpt-4-turbo",
system_message_content="You are my cat."
)
ChatGPT
To configure it in detail, create an instance of ChatGPTProcessor with custom parameters and set it to AIAvatar.
from aiavatar.processors.chatgpt import ChatGPTProcessor
chat_processor = ChatGPTProcessor(
api_key=OPENAI_API_KEY,
model="gpt-4-turbo",
temperature=0.0,
max_tokens=200,
system_message_content="You are my cat.",
history_count=20,
history_timeout=120.0
)
app.chat_processor = chat_processor
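The history_count and history_timeout parameters above suggest that the processor keeps a bounded, expiring conversation history. The following is only a sketch of one plausible behavior, assuming history_count caps the number of retained messages and history_timeout is an idle period in seconds after which the history is dropped. ConversationHistory is a hypothetical class for illustration, not AIAvatarKit's implementation.

```python
import time
from typing import Dict, List, Optional


class ConversationHistory:
    """Hypothetical history buffer (assumption, not the library's code)."""

    def __init__(self, history_count: int = 20, history_timeout: float = 120.0):
        self.history_count = history_count
        self.history_timeout = history_timeout
        self.messages: List[Dict[str, str]] = []
        self.last_updated: float = 0.0

    def add(self, role: str, content: str, now: Optional[float] = None) -> None:
        """Append a message and remember when the history was last touched."""
        now = time.time() if now is None else now
        self.messages.append({"role": role, "content": content})
        self.last_updated = now

    def get(self, now: Optional[float] = None) -> List[Dict[str, str]]:
        """Return the messages that would accompany the next request."""
        now = time.time() if now is None else now
        # Drop everything when the conversation has been idle for too long.
        if self.messages and now - self.last_updated > self.history_timeout:
            self.messages = []
        if self.history_count <= 0:
            return []
        # Keep only the most recent history_count messages.
        return self.messages[-self.history_count:]


history = ConversationHistory(history_count=2, history_timeout=10.0)
history.add("user", "a", now=0.0)
history.add("assistant", "b", now=1.0)
history.add("user", "c", now=2.0)
recent = history.get(now=3.0)     # within the timeout: the last two messages
expired = history.get(now=20.0)   # idle longer than the timeout: empty
```

Under these assumptions, a small history_count keeps prompts short, while a short history_timeout makes the avatar start fresh after a pause in the conversation.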
Claude
Create an instance of ClaudeProcessor with custom parameters and set it to AIAvatar. The default model is claude-3-sonnet-20240229.
from aiavatar.processors.claude import ClaudeProcessor
claude_processor = ClaudeProcessor(
api_key="ANTHROPIC_API_KEY"
)
app = AIAvatar(
google_api_key=GOOGLE_API_KEY,
chat_processor=claude_processor
)
NOTE: We support Claude 3 on Anthropic API, not Amazon Bedrock for now.
Gemini
Create an instance of GeminiProcessor with custom parameters and set it to AIAvatar. The default model is gemini-pro.
from aiavatar.processors.gemini import GeminiProcessor
gemini_processor = GeminiProcessor(
api_key="YOUR_GOOGLE_API_KEY"
)
app = AIAvatar(
google_api_key=GOOGLE_API_KEY,
chat_processor=gemini_processor
)
NOTE: We support Gemini on Google AI Studio, not Vertex AI for now.
Dify
You can use the Dify API instead of a specific LLM's API. This eliminates the need to manage code for tools or RAG locally.
from aiavatar import AIAvatar
from aiavatar.processors.dify import DifyProcessor
chat_processor_dify = DifyProcessor(
api_key=DIFY_API_KEY,
user=DIFY_USER
)
app = AIAvatar(
google_api_key=GOOGLE_API_KEY,
chat_processor=chat_processor_dify
)
app.start_listening_wakeword()
Other LLMs
You can make a custom processor that uses other generative AIs such as Llama3 by implementing the ChatProcessor interface. We provide an example later.
🗣️ Voice
You can set the speaker id and the base URL of the VOICEVOX server when instantiating AIAvatar.
app = AIAvatar(
openai_api_key="YOUR_OPENAI_API_KEY",
google_api_key="YOUR_GOOGLE_API_KEY",
voicevox_speaker_id=46
)
To configure it in detail, create an instance of VoicevoxSpeechController with custom parameters and set it to AIAvatar.
from aiavatar.speech.voicevox import VoicevoxSpeechController
speech_controller = VoicevoxSpeechController(
base_url="https",
speaker_id=46,
device_index=app.audio_devices.output_device
)
app.avatar_controller.speech_controller = speech_controller
Speech is handled in a separate subprocess to improve audio quality and reduce noises such as popping, caused by thread blocking during parallel processing of AI responses and speech output. For systems with limited resources, setting use_subprocess=False allows speech processing within the main process, potentially reintroducing some noise.
app.avatar_controller.speech_controller = VoicevoxSpeechController(
base_url="http://127.0.0.1:50021",
speaker_id=46,
device_index=app.audio_devices.output_device,
use_subprocess=False
)
You can also set a speech controller that uses an alternative Text-to-Speech service. We provide AzureSpeechController for now.
from aiavatar.speech.azurespeech import AzureSpeechController
AzureSpeechController(
AZURE_SUBSCRIPTION_KEY, AZURE_REGION,
device_index=app.audio_devices.output_device,
)
The default speaker is en-US-JennyMultilingualNeural, which supports multiple languages.
https://learn.microsoft.com/ja-jp/azure/ai-services/speech-service/language-support?tabs=tts
You can make a custom speech controller by implementing the SpeechController interface or extending SpeechControllerBase.
Wakeword listener
Set wakewords when instantiating AIAvatar. Conversation will start when AIAvatar recognizes one of the words in this list.
app = AIAvatar(
openai_api_key=OPENAI_API_KEY,
google_api_key=GOOGLE_API_KEY,
wakewords=["Hello", "こんにちは"],
)
To configure it in detail, create an instance of WakewordListener with custom parameters and set it to AIAvatar.
from aiavatar.listeners.wakeword import WakewordListener
wakeword_listener = WakewordListener(
api_key=GOOGLE_API_KEY,
wakewords=["Hello", "こんにちは"],
device_index=app.audio_devices.input_device,
timeout=0.2,
max_duration=1.5
)
app.wakeword_listener = wakeword_listener
Request listener
To configure it in detail, create an instance of VoiceRequestListener with custom parameters and set it to AIAvatar.
from aiavatar.listeners.voicerequest import VoiceRequestListener
request_listener = VoiceRequestListener(
api_key=GOOGLE_API_KEY,
device_index=app.audio_devices.input_device,
detection_timeout=15.0,
timeout=0.5,
max_duration=20.0,
min_duration=0.2,
)
app.request_listener = request_listener
✨ Using Azure Listeners
We strongly recommend using AzureWakewordListener and AzureVoiceRequestListener, which are more stable than the default listeners. Check examples/run_azure.py, which works out of the box.
Install Azure SpeechSDK.
$ pip install azure-cognitiveservices-speech
Change the script to use AzureVoiceRequestListener and AzureWakewordListener.
from aiavatar.listeners.azurevoicerequest import AzureVoiceRequestListener
from aiavatar.listeners.azurewakeword import AzureWakewordListener
YOUR_SUBSCRIPTION_KEY = "YOUR_SUBSCRIPTION_KEY"
YOUR_REGION_NAME = "YOUR_REGION_NAME"
azure_request_listener = AzureVoiceRequestListener(
YOUR_SUBSCRIPTION_KEY,
YOUR_REGION_NAME
)
import logging
logger = logging.getLogger(__name__)
async def on_wakeword(text):
    logger.info(f"Wakeword: {text}")
    await app.start_chat()
azure_wakeword_listener = AzureWakewordListener(
    YOUR_SUBSCRIPTION_KEY,
    YOUR_REGION_NAME,
    on_wakeword=on_wakeword,
    wakewords=["こんにちは"]
)
app = AIAvatar(
    openai_api_key=OPENAI_API_KEY,
    request_listener=azure_request_listener,
    wakeword_listener=azure_wakeword_listener
)
To specify the microphone device by setting device_name
argument.
See Microsoft Learn to know how to check the device UID on each platform.
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-select-audio-input-devices
We provide a script for MacOS. Just run it on Xcode.
Device UID: BuiltInMicrophoneDevice, Name: MacBook Proใฎใใคใฏ
Device UID: com.vbaudio.vbcableA:XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, Name: VB-Cable A
Device UID: com.vbaudio.vbcableB:XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, Name: VB-Cable B
For example, the UID for the built-in microphone on MacOS is BuiltInMicrophoneDevice
.
Then, set it as the value of device_name
.
azure_request_listener = AzureVoiceRequestListener(
YOUR_SUBSCRIPTION_KEY,
YOUR_REGION_NAME,
device_name="BuiltInMicrophoneDevice"
)
azure_wakeword_listener = AzureWakewordListener(
YOUR_SUBSCRIPTION_KEY,
YOUR_REGION_NAME,
on_wakeword=on_wakeword,
wakewords=["Hello", "こんにちは"],
device_name="BuiltInMicrophoneDevice"
)
Using OpenAI's audio APIs
OpenAI's Speech-to-Text and Text-to-Speech capabilities provide dynamic speech recognition and voice output across multiple languages, without the need for fixed language settings.
from aiavatar import AIAvatar
from aiavatar.device import AudioDevice
from aiavatar.listeners.openailisteners import (
OpenAIWakewordListener,
OpenAIVoiceRequestListener
)
from aiavatar.speech.openaispeech import OpenAISpeechController
devices = AudioDevice()
speech_controller = OpenAISpeechController(
api_key=OPENAI_API_KEY,
device_index=devices.output_device
)
async def on_wakeword(text):
    await app.start_chat(request_on_start=text, skip_start_voice=True)
wakeword_listener = OpenAIWakewordListener(
api_key=OPENAI_API_KEY,
device_index=devices.input_device,
wakewords=["こんにちは"],
on_wakeword=on_wakeword
)
request_listener = OpenAIVoiceRequestListener(
api_key=OPENAI_API_KEY,
device_index=devices.input_device
)
app = AIAvatar(
openai_api_key=OPENAI_API_KEY,
wakeword_listener=wakeword_listener,
request_listener=request_listener,
speech_controller=speech_controller,
noise_margin=10.0,
verbose=True
)
app.start_listening_wakeword()
Audio device
You can specify the audio devices to be used in components by name or index.
from aiavatar.device import AudioDevice
audio_device = AudioDevice(
    input_device="マイク",
    output_device="スピーカー"
)
Set the devices to the components.
speech_controller = VoicevoxSpeechControllerSubProcess(
device_index=audio_device.output_device,
base_url="http://127.0.0.1:50021",
speaker_id=46,
)
request_listener = VoiceRequestListener(
device_index=audio_device.input_device
)
wakeword_listener = WakewordListener(
device_index=audio_device.input_device,
wakewords=["Hello", "こんにちは"]
)
app = AIAvatar(
openai_api_key=OPENAI_API_KEY,
speech_controller=speech_controller,
request_listener=request_listener,
wakeword_listener=wakeword_listener
)
🥰 Face expression
To control facial expressions within conversations, set the facial expression names and values in FaceController.faces as shown below, and then include these expression keys in the response message by adding instructions to the prompt.
app.avatar_controller.face_controller.faces = {
    "neutral": "🙂",
    "joy": "😀",
    "angry": "😠",
    "sorrow": "😞",
    "fun": "🥳"
}
app.chat_processor.system_message_content = """# Face Expression
* You have the following expressions:
- joy
- angry
- sorrow
- fun
* If you want to express a particular emotion, please insert it at the beginning of the sentence like [face:joy].
Example
[face:joy]Hey, you can see the ocean! [face:fun]Let's go swimming.
"""
This allows emojis like 🥳 to be autonomously displayed in the terminal during conversations. To actually control the avatar's facial expressions in a metaverse platform, instead of displaying emojis like 🥳, you will need to use custom implementations tailored to the integration mechanisms of each platform. Please refer to our VRChatFaceController as an example.
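To make the [face:...] convention concrete, here is a minimal sketch of how a response could be split into sentences paired with the expression that precedes them. The function name split_by_face_tags and the regular expression are assumptions made for illustration; AIAvatarKit's own parsing may differ.

```python
import re
from typing import List, Optional, Tuple

# Matches tags such as [face:joy]; the tag format follows the prompt above.
FACE_TAG = re.compile(r"\[face:(\w+)\]")


def split_by_face_tags(text: str) -> List[Tuple[Optional[str], str]]:
    """Split text into (face_name, sentence) pairs.

    face_name is None for text that appears before the first tag.
    """
    segments: List[Tuple[Optional[str], str]] = []
    current_face: Optional[str] = None
    position = 0
    for match in FACE_TAG.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append((current_face, chunk))
        current_face = match.group(1)
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((current_face, tail))
    return segments


response = "[face:joy]Hey, you can see the ocean! [face:fun]Let's go swimming."
segments = split_by_face_tags(response)
for face, sentence in segments:
    print(face, "->", sentence)
```

Each pair can then be handed to a face controller and a speech controller in turn, so the expression changes just before the matching sentence is spoken.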
Animation
Now writing... ✍️
Vision
AIAvatarKit captures and sends an image to the AI dynamically when the AI determines that vision is required to process the request from the user. This gives "eyes" to your AIAvatar in metaverse platforms like VRChat.
To use vision, describe the vision tag in the system message and implement ChatGPTProcessor.get_image.
import io
import pyautogui
from aiavatar.processors.chatgpt import ChatGPTProcessor
from aiavatar.device.video import VideoDevice
system_message_content = """
### Using Vision
If you need an image to process a user's request, you can obtain it using the following methods:
- screenshot
- camera
If an image is needed to process the request, add an instruction like [vision:screenshot] to your response to request an image from the user.
By adding this instruction, the user will provide an image in their next utterance. No comments about the image itself are necessary.
Example:
user: Look! This is the sushi I had today.
assistant: [vision:screenshot] Let me take a look.
"""
default_camera = VideoDevice(device_index=0, width=960, height=540)
async def get_image(source: str = None) -> bytes:
    if source == "camera":
        return await default_camera.capture_image("camera.jpg")
    else:
        buffered = io.BytesIO()
        image = pyautogui.screenshot(region=(0, 0, 1280, 720))
        image.save(buffered, format="PNG")
        image.save("screenshot.png")
        return buffered.getvalue()
chat_processor = ChatGPTProcessor(
    api_key=OPENAI_API_KEY,
    model="gpt-4o",
    system_message_content=system_message_content,
    use_vision=True
)
chat_processor.get_image = get_image
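The flow implied by the [vision:...] tag can be sketched as follows: look for the tag in the assistant's response, and if it is present, call get_image with the requested source. The helper names extract_vision_source, fetch_image_if_requested, and fake_get_image are hypothetical and only illustrate the idea under these assumptions; they are not part of AIAvatarKit.

```python
import asyncio
import re
from typing import Awaitable, Callable, Optional, Tuple

# Matches tags such as [vision:screenshot] or [vision:camera].
VISION_TAG = re.compile(r"\[vision:(\w+)\]")


def extract_vision_source(response_text: str) -> Tuple[Optional[str], str]:
    """Return (source, text_without_tag); source is None if no tag is found."""
    match = VISION_TAG.search(response_text)
    if match is None:
        return None, response_text.strip()
    cleaned = VISION_TAG.sub("", response_text).strip()
    return match.group(1), cleaned


async def fetch_image_if_requested(
    response_text: str,
    get_image: Callable[[str], Awaitable[bytes]],
) -> Optional[bytes]:
    """Call get_image only when the response asks for an image."""
    source, _ = extract_vision_source(response_text)
    if source is None:
        return None
    return await get_image(source)


async def fake_get_image(source: str) -> bytes:
    # Stand-in for a real capture function such as a screenshot or camera grab.
    return f"image-from-{source}".encode()


parsed = extract_vision_source("[vision:screenshot] Let me take a look.")
image = asyncio.run(fetch_image_if_requested("[vision:camera] One moment.", fake_get_image))
```

In a real setup the returned bytes would be attached to the next request to the model; how that is done depends on the processor.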
NOTE
- Only the latest image will be sent to ChatGPT to avoid performance issues.
- Gemini and Claude can also use vision in the same way. Simply replace ChatGPTProcessor with ClaudeProcessor or GeminiProcessor.
🎭 Custom Behavior
You can invoke custom implementations when listening to requests from the user, when processing those requests, or when a wake word is recognized to start a conversation.
In the following example, changing face expressions at each timing aims to enhance the interaction experience with the AI avatar.
async def set_listening_face():
    await app.avatar_controller.face_controller.set_face("listening", 3.0)
app.request_listener.on_start_listening = set_listening_face
async def set_thinking_face():
    await app.avatar_controller.face_controller.set_face("thinking", 3.0)
app.chat_processor.on_start_processing = set_thinking_face
async def on_wakeword(text):
    logger.info(f"Wakeword: {text}")
    await app.avatar_controller.face_controller.set_face("smile", 2.0)
    await app.start_chat(request_on_start=text, skip_start_voice=True)
Platform Guide
AIAvatarKit is capable of operating on any platform that allows applications to hook into audio input and output. The platforms that have been tested include:
In addition to running on PCs to operate AI avatars on these platforms, you can also create a communication robot by connecting speakers, a microphone, and, if possible, a display to a Raspberry Pi.
VRChat
- Two virtual audio devices (e.g. VB-CABLE) are required.
- Multiple VRChat accounts are required to chat with your AIAvatar.
Get started
First, run the commands below in the Python interpreter to check the audio devices.
$ python
>>> from aiavatar import AudioDevice
>>> AudioDevice.list_audio_devices()
Available audio devices:
0: Headset Microphone (Oculus Virt
:
6: CABLE-B Output (VB-Audio Cable
7: Microsoft サウンド マッパー - Output
8: SONY TV (NVIDIA High Definition
:
13: CABLE-A Input (VB-Audio Cable A
:
In this example,
- To use VB-Cable-A as the microphone for VRChat, the index for output_device is 13 (CABLE-A Input).
- To use VB-Cable-B as the speaker for VRChat, the index for input_device is 6 (CABLE-B Output). Don't forget to set VB-Cable-B Input as the default output device of Windows OS.
Then edit run.py like below.
app = AIAvatar(
    openai_api_key=OPENAI_API_KEY,
    google_api_key=GOOGLE_API_KEY,
    model="gpt-3.5-turbo",
    system_message_content=system_message_content,
    input_device=6,
    output_device=13,
)
You can also set the names of the audio devices instead of their indexes (partial match, case-insensitive).
input_device="CABLE-B Out",
output_device="cable-a input",
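As an illustration of what "partial match, case-insensitive" means for device names, the sketch below resolves either an index or a name fragment against a device table. The function resolve_device_index and the dictionary layout are assumptions for this example; they are not AIAvatarKit's API.

```python
from typing import Dict, Optional, Union


def resolve_device_index(
    devices: Dict[int, str], device: Union[int, str]
) -> Optional[int]:
    """Resolve a device given by index or by (partial, case-insensitive) name."""
    if isinstance(device, int):
        return device if device in devices else None
    needle = device.lower()
    # Return the lowest index whose name contains the given fragment.
    for index, name in sorted(devices.items()):
        if needle in name.lower():
            return index
    return None


# Device names copied from the listing shown earlier in this section.
devices = {
    0: "Headset Microphone (Oculus Virt",
    6: "CABLE-B Output (VB-Audio Cable",
    13: "CABLE-A Input (VB-Audio Cable A",
}

input_index = resolve_device_index(devices, "CABLE-B Out")
output_index = resolve_device_index(devices, "cable-a input")
```

Because the first matching device wins in this sketch, a fragment that is too short (for example "cable") could select the wrong device, so it is safer to include enough of the name to make it unique.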
Run it.
$ python run.py
Launch VRChat in desktop mode on the machine that runs run.py and log in with the account for AIAvatar. Then set VB-Cable-A as the microphone in the VRChat settings window.
That's all! Let's chat with the AIAvatar. Log in to VRChat on another machine (or Quest) and go to the world the AIAvatar is in.
Face Expression
AIAvatarKit controls the face expression via Avatar OSC.
LLM(ChatGPT/Claude/Gemini)
↓ response with face tag [face:joy]Hello!
AIAvatarKit(VRCFaceExpressionController)
↓ osc FaceOSC=1
VRChat(FX AnimatorController)
↓
😀
First, set up your avatar with the following steps:
- Add the avatar parameter FaceOSC (type: int, default value: 0, saved: false, synced: true).
- Add the FaceOSC parameter to the FX animator controller.
- Add a layer and put states and transitions for face expressions in the FX animator controller.
- (option) If you use an avatar that is already used in VRChat, add the input parameter configuration to the avatar json.
Next, use VRChatFaceController.
from aiavatar.face.vrchat import VRChatFaceController
vrc_face_controller = VRChatFaceController(
faces={
"neutral": 0,
"joy": 1,
"angry": 2,
"sorrow": 3,
"fun": 4
}
)
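The mapping above pairs each face name with the integer that is sent as the FaceOSC avatar parameter. As a rough illustration of what such an OSC message looks like on the wire, the sketch below encodes an integer message with the standard library only. It assumes VRChat's usual OSC address form /avatar/parameters/<name> and its default receiving port 9000; build_osc_int_message and send_face are hypothetical helpers, not the implementation used by VRChatFaceController.

```python
import socket
import struct


def _osc_string(value: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4."""
    data = value.encode("ascii") + b"\x00"
    padding = (-len(data)) % 4
    return data + b"\x00" * padding


def build_osc_int_message(address: str, value: int) -> bytes:
    """Build an OSC message carrying one 32-bit big-endian integer."""
    return _osc_string(address) + _osc_string(",i") + struct.pack(">i", value)


def send_face(value: int, host: str = "127.0.0.1", port: int = 9000) -> None:
    """Send the FaceOSC parameter over UDP (host and port are assumptions)."""
    message = build_osc_int_message("/avatar/parameters/FaceOSC", value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))


# "joy" is mapped to 1 in the faces dictionary above.
message = build_osc_int_message("/avatar/parameters/FaceOSC", 1)
```

Calling send_face(1) while VRChat is running with OSC enabled would then set FaceOSC to 1, which the FX animator controller can use to switch to the matching state.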
Lastly, add a face expression section to the system prompt.
system_message_content = """
# Face Expression
* You have following expressions:
- joy
- angry
- sorrow
- fun
* If you want to express a particular emotion, please insert it at the beginning of the sentence like [face:joy].
Example
[face:joy]Hey, you can see the ocean! [face:fun]Let's go swimming.
"""
app = AIAvatar(
openai_api_key=OPENAI_API_KEY,
google_api_key=GOOGLE_API_KEY,
face_controller=vrc_face_controller,
system_message_content=system_message_content
)
You can test it not only through the voice conversation but also via the REST API.
Raspberry Pi
Now writing... ✍️
🧩 RESTful APIs
You can control AIAvatar via RESTful APIs. The provided functions are:
- WakewordListener
  - start: Start WakewordListener
  - stop: Stop WakewordListener
  - status: Show status of WakewordListener
- Avatar
  - speech: Speak text with face expression and animation
  - face: Set face expression
  - animation: Set animation
- System
To use the REST APIs, create an API app and set the router instead of calling app.start_listening_wakeword().
from fastapi import FastAPI
from aiavatar import AIAvatar
from aiavatar.api.router import get_router
app = AIAvatar(
openai_api_key=OPENAI_API_KEY,
google_api_key=GOOGLE_API_KEY
)
api = FastAPI()
api_router = get_router(app, "aiavatar.log")
api.include_router(api_router)
Start API with uvicorn.
$ uvicorn run:api
Call /wakeword/start to start the wakeword listener.
$ curl -X 'POST' \
'http://127.0.0.1:8000/wakeword/start' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"wakewords": []
}'
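The same request can be issued from Python. The sketch below mirrors the curl command above using only the standard library; the helper names build_wakeword_start_request and start_wakeword_listener are hypothetical, and the response format is not assumed here.

```python
import json
import urllib.request
from typing import List


def build_wakeword_start_request(
    base_url: str, wakewords: List[str]
) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    body = json.dumps({"wakewords": wakewords}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/wakeword/start",
        data=body,
        method="POST",
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
    )


def start_wakeword_listener(base_url: str = "http://127.0.0.1:8000") -> str:
    """Send the request and return the raw response body as text."""
    request = build_wakeword_start_request(base_url, [])
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


request = build_wakeword_start_request("http://127.0.0.1:8000", [])
```

Call start_wakeword_listener() only after the API has been started with uvicorn; otherwise the connection will be refused.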
See API spec and try it on http://127.0.0.1:8000/docs .
NOTE: AzureWakewordListener stops immediately, but the default WakewordListener stops after it recognizes a wakeword.
🤿 Deep dive
Advanced usages.
⚡️ Function Calling
Use chat_processor.add_function to use ChatGPT function calling. In this example, get_weather will be called autonomously.
import asyncio
async def get_weather(location: str):
    # Simulate a slow API call and return a fixed result
    await asyncio.sleep(1.0)
    return {"weather": "sunny partly cloudy", "temperature": 23.4}
app.chat_processor.add_function(
name="get_weather",
description="Get the current weather in a given location",
parameters={
"type": "object",
"properties": {
"location": {
"type": "string"
}
}
},
func=get_weather
)
After get_weather is called, a message to get the voice response will be sent to ChatGPT internally.
{
"role": "function",
"content": "{\"weather\": \"sunny partly cloudy\", \"temperature\": 23.4}",
"name": "get_weather"
}
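To show how a message like the one above can be produced, here is a minimal sketch of the dispatch step: look up the registered function by name, call it with the JSON arguments chosen by the model, and wrap the result in a "function" role message. The registry, add_function, and call_function below are hypothetical stand-ins written for this example, not the internals of ChatGPTProcessor.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict

# Hypothetical registry that maps function names to async callables.
registry: Dict[str, Callable[..., Awaitable[Any]]] = {}


def add_function(name: str, func: Callable[..., Awaitable[Any]]) -> None:
    """Register an async function under the name exposed to the model."""
    registry[name] = func


async def call_function(name: str, arguments_json: str) -> Dict[str, str]:
    """Run the named function and build the follow-up message for the model."""
    arguments = json.loads(arguments_json) if arguments_json else {}
    result = await registry[name](**arguments)
    return {"role": "function", "content": json.dumps(result), "name": name}


async def get_weather(location: str):
    await asyncio.sleep(0)  # placeholder for a real weather API call
    return {"weather": "sunny partly cloudy", "temperature": 23.4}


add_function("get_weather", get_weather)
message = asyncio.run(call_function("get_weather", '{"location": "Tokyo"}'))
```

The returned dictionary has the same shape as the message shown above, so the model can turn the function result into a spoken answer.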
Other Tips
Useful information for developing and debugging.
🎤 Testing audio I/O
Use the script below to test the audio I/O before configuring AIAvatar.
- Step-by-step audio device configuration.
- Speaks immediately after start if the output device is correctly configured.
- All recognized text will be shown in the console if the input device is correctly configured.
- Just echoes the text when a wakeword is recognized.
import asyncio
import logging
from aiavatar import (
AudioDevice,
VoicevoxSpeechController,
WakewordListener
)
GOOGLE_API_KEY = "YOUR_API_KEY"
VV_URL = "http://127.0.0.1:50021"
VV_SPEAKER = 46
INPUT_DEVICE = -1
OUTPUT_DEVICE = -1
logger = logging.getLogger()
logger.setLevel(logging.INFO)
log_format = logging.Formatter("[%(levelname)s] %(asctime)s : %(message)s")
streamHandler = logging.StreamHandler()
streamHandler.setFormatter(log_format)
logger.addHandler(streamHandler)
if INPUT_DEVICE < 0:
    input_device_info = AudioDevice.get_input_device_with_prompt()
else:
    input_device_info = AudioDevice.get_device_info(INPUT_DEVICE)
input_device = input_device_info["index"]
if OUTPUT_DEVICE < 0:
    output_device_info = AudioDevice.get_output_device_with_prompt()
else:
    output_device_info = AudioDevice.get_device_info(OUTPUT_DEVICE)
output_device = output_device_info["index"]
logger.info(f"Input device: [{input_device}] {input_device_info['name']}")
logger.info(f"Output device: [{output_device}] {output_device_info['name']}")
speaker = VoicevoxSpeechController(
VV_URL,
VV_SPEAKER,
device_index=output_device
)
# Spoken test message: "The audio device tester has started. Can you hear my voice?"
asyncio.run(speaker.speak("オーディオデバイスのテスターを起動しました。私の声が聞こえていますか?"))
wakewords = ["こんにちは"]
async def on_wakeword(text):
    logger.info(f"Wakeword: {text}")
    await speaker.speak(f"{text}")
wakeword_listener = WakewordListener(
    api_key=GOOGLE_API_KEY,
    wakewords=wakewords,
    on_wakeword=on_wakeword,
    verbose=True,
    device_index=input_device
)
ww_thread = wakeword_listener.start()
ww_thread.join()
🎚️ Noise Filter
AIAvatarKit automatically adjusts the noise filter for listeners when you instantiate an AIAvatar object. To manually set the noise filter level for voice detection, set auto_noise_filter_threshold to False and specify volume_threshold_db in decibels (dB).
app = AIAvatar(
openai_api_key=OPENAI_API_KEY,
google_api_key=GOOGLE_API_KEY,
auto_noise_filter_threshold=False,
volume_threshold_db=-40
)
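To give a feel for what a threshold such as -40 dB means, the sketch below computes the RMS level of 16-bit PCM samples relative to full scale (dBFS) and compares it with the threshold. This is an assumption about how such a filter could work, with hypothetical helper names rms_dbfs and is_voice; the library's actual calculation may differ.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[int], full_scale: int = 32768) -> float:
    """Return the RMS level of 16-bit PCM samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20.0 * math.log10(math.sqrt(mean_square) / full_scale)


def is_voice(samples: Sequence[int], volume_threshold_db: float = -40.0) -> bool:
    """Treat audio as voice when its level exceeds the threshold."""
    return rms_dbfs(samples) > volume_threshold_db


loud = [3277] * 100   # about 10% of full scale, roughly -20 dBFS
quiet = [33] * 100    # about 0.1% of full scale, roughly -60 dBFS
loud_level = rms_dbfs(loud)
```

With these definitions, raising volume_threshold_db (for example from -40 to -30) makes detection stricter, which can help in noisy rooms at the cost of missing quiet speech.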
🧪 LM Studio API
Use ChatGPTProcessor with the following arguments.
- base_url: URL for LM Studio local server
- model: Name of model
- parse_function_call_in_response: Always set False
from aiavatar import AIAvatar
from aiavatar.processors.chatgpt import ChatGPTProcessor
chat_processor = ChatGPTProcessor(
api_key=OPENAI_API_KEY,
base_url="http://127.0.0.1:1234/v1",
model="mmnga/DataPilot-ArrowPro-7B-KUJIRA-gguf",
parse_function_call_in_response=False
)
app = AIAvatar(
google_api_key=GOOGLE_API_KEY,
chat_processor=chat_processor
)
app.start_listening_wakeword()
⚡️ Use custom listener
It's very easy to add your own listeners. Just make the listener run on another thread and invoke app.start_chat() when it handles the event.
Here is an example, FileSystemListener, that invokes chat when test.txt is found on the file system.
import asyncio
import os
from threading import Thread
from time import sleep
class FileSystemListener:
    def __init__(self, on_file_found):
        self.on_file_found = on_file_found
    def start_listening(self):
        while True:
            # Check for test.txt every 3 seconds
            if os.path.isfile("test.txt"):
                asyncio.run(self.on_file_found())
            sleep(3)
    def start(self):
        th = Thread(target=self.start_listening, daemon=True)
        th.start()
        return th
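Use this listener in run.py like below.
# The listener passes the result of on_file_found() to asyncio.run(), so it must be a coroutine function
async def on_file_found():
    await app.start_chat()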
fs_listener = FileSystemListener(on_file_found)
fs_thread = fs_listener.start()
:
fs_thread.join()