Microsoft Expands AI Stack with Multimodal Models
-

Microsoft has introduced three new foundational AI models that generate text, voice, and images, a step toward a full multimodal AI ecosystem. The models (MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2) cover speech-to-text transcription across 25 languages as well as real-time audio and visual content generation.
Developed by Microsoft’s AI division, led by Mustafa Suleyman, the models are now available through Microsoft Foundry and MAI Playground. The release reflects Microsoft’s strategy of building AI capabilities in-house while continuing to integrate them across its broader product ecosystem.