Why We Built Kotonia

What We Wanted to Solve

Conversing with AI has spread rapidly, but most of that experience is text-centric. There was still a gap for immersive experiences in which you build an ongoing relationship with a character by voice. In particular, we couldn't find anything that let you talk in Japanese, English, and Chinese with a high-quality voice and a lip-synced avatar at low latency (replies in roughly 1-2 seconds). Our starting point was wanting to create an "AI character you can meet by voice" that could serve as both a language practice partner and a roleplay companion.

Why We Decided to Build It

General-purpose AI chat turns into a cost competition with the model providers (Anthropic / OpenAI / Google), which an individual can't win. B2B voice AI is also stacked against individuals, because VC-backed companies sell at dumping prices. By contrast, the immersive UX of "multilingual × high-quality lip-sync × emotional continuity" was a leftover area that ranked low in large companies' R&D priorities. Here, even a single developer could differentiate through polish. With that in mind, we built Kotonia as a service that goes all in on the core experience (voice conversation × avatar × persistent persona).

What We Focused On

・Faster responses through a low-latency voice pipeline (VAD → STT → LLM → streaming TTS)
・A lip-sync avatar whose mouth movements stay in sync with the speech
・Multilingual support for Japanese / English / Chinese
・Open models running on local GPUs, to balance quality and price
・Character memory and personality that persist, so the relationship grows the more you use it

Behind the scenes, we've also put considerable effort into optimizing perceived speed, for example by physically isolating the real-time voice path on dedicated GPUs.
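
As a rough illustration of the last stages of the pipeline listed above (LLM → streaming TTS), the sketch below groups streamed LLM output into sentences so that speech synthesis can start on the first sentence instead of waiting for the full reply. This is a generic latency technique and not necessarily how Kotonia implements it. The names `fake_llm_stream`, `split_sentences`, `fake_tts`, and `respond` are hypothetical stand-ins, and VAD and STT are assumed to have already run.

```python
from typing import Iterable, Iterator, List

# Sentence-ending punctuation for Japanese/Chinese and English (an assumption;
# a real system would likely need more careful segmentation).
SENTENCE_END = ("。", "！", "？", ".", "!", "?")


def fake_llm_stream(reply: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields one character at a time."""
    for ch in reply:
        yield ch


def split_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and emit a sentence as soon as it is complete."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.endswith(SENTENCE_END):
            yield buf.strip()
            buf = ""
    # Flush any trailing text that did not end with punctuation.
    if buf.strip():
        yield buf.strip()


def fake_tts(sentence: str) -> bytes:
    """Hypothetical stand-in for TTS: returns placeholder 'audio' bytes."""
    return sentence.encode("utf-8")


def respond(reply: str) -> List[bytes]:
    """Synthesize each sentence as it completes; a real system would play each chunk immediately."""
    return [fake_tts(s) for s in split_sentences(fake_llm_stream(reply))]
```

Because `split_sentences` is a generator, the first audio chunk can be produced as soon as the first sentence ends, which is where most of the perceived-latency gain in this kind of design would come from.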

In Closing

Technical details are also covered in the articles on the Kotonia site. → https://kotonia.ai/articles/

A demo video is available here. Please take a look at the low-latency conversation. → https://youtu.be/gubxZYryWq8

If you'd like to try talking by voice yourself, go here → https://kotonia.ai/
