Full-Stack Voice AI Platform - Khaireddine Arbouch

A Full-Stack Voice AI Platform with TTS, Voice Conversion, and Generative Audio

Client

Personal

Timeframe

March 30 - May 12

Stack

PyTorch, FastAPI, Docker, StyleTTS2, Seed-VC, Make-An-Audio, React, Next.js, Tailwind CSS, AWS, Inngest, Auth.js

Source Code

Project Breakdown

Project Overview

The goal was to create a full-stack, self-hosted alternative to ElevenLabs—a platform capable of Text-to-Speech (TTS), Voice Conversion, and Text-to-Audio generation—using open-source AI models. The motivation stemmed from the limitations of proprietary tools like ElevenLabs (e.g., closed APIs, high costs). This project successfully delivered a modular, scalable voice AI platform that brings powerful generative audio capabilities to users without vendor lock-in.

Implemented Features

The final platform includes the following components:

Text-to-Speech (TTS) with StyleTTS2 for high-quality, expressive voice generation.
Voice Conversion using Seed-VC, allowing identity-preserving voice transformations.
Text-to-Audio generation powered by Make-An-Audio, enabling non-speech soundscape creation.
Custom voice fine-tuning from just a few minutes of user-provided audio samples.
A modern, credit-based web interface for managing usage and output

A modern web interface with:

Real-time voice switching
Custom voice uploads
Audio history and downloads
A credit-based system to track and manage usage

Technical Stack

AI/Deep Learning Models

StyleTTS2 – Diffusion-based TTS with strong expressiveness and naturalness
Seed-VC – Neural voice conversion with clear speaker identity preservation
Make-An-Audio – Latent diffusion-based text-to-audio synthesis engine

Backend & Infrastructure

PyTorch for training, inference, and fine-tuning pipelines
FastAPI serving RESTful endpoints
Inngest for GPU queue management to ensure efficient task execution
Docker for model containerization and deployment
AWS S3 used for storing generated audio files

Frontend

Built using Next.js 15.3.0, React, Tailwind CSS
Auth.js for user authentication; optional integration with Clerk
Clean, modular UI designed for smooth UX

Research & Innovation Highlights

While the models are open-source, this project goes beyond simply running them:

Fine-tuned user voices with just 2–5 minutes of training data
Benchmarked audio quality and latency across models to inform trade-offs
Built a hybrid inference pipeline: Seed-VC ➜ Make-An-Audio to combine voice conversion with generative audio
Identified and mitigated bottlenecks related to memory consumption and model size (especially with diffusion models)
Designed the system with future SaaS scalability in mind

Outcome & Impact

This project demonstrates a complete full-stack solution that integrates advanced generative AI, infrastructure engineering, and user experience design. It offers the same core functionalities as ElevenLabs but remains open, customizable, and cost-efficient. The system is already usable in production scenarios like:

Custom voice assistants
Content creation workflows
Game and virtual world audio
Accessibility enhancements

This build showcases technical mastery in diffusion models, transformers, real-time inference systems, and product-ready AI engineering.