Microsoft Expands AI Stack with New Multimodal Models
-

Microsoft has released three new foundational AI models through its Microsoft AI division, signaling a deeper push into multimodal capabilities alongside its ongoing partnership with OpenAI. The models (MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2) are designed to generate text, audio, and visual content within a unified ecosystem.
MAI-Transcribe-1 supports speech-to-text across 25 languages and runs 2.5 times faster than Microsoft’s previous offering. MAI-Voice-1 can generate up to 60 seconds of audio in one second and supports custom voice creation. MAI-Image-2, an image-generation model, is now available alongside the others via Microsoft Foundry. The models were developed by the MAI Superintelligence team led by Mustafa Suleyman and are positioned as cost-competitive alternatives in the AI market. Pricing starts as low as $0.36 per hour for transcription, with token-based pricing for voice and image generation.