Desi Slice 2: Indian Language Voice AI Business Is All the Rage
Aug 30 - Seize Opportunities In India's Race to $5T! Dekho Ma! No Hands! Just speak!
Photo by George Milton: https://www.pexels.com/photo/microphone-on-tripod-attached-to-laptop-in-studio-6953871/
India has 22 officially recognized languages and 760-1360 dialects spoken by its 1.4 billion people across its 28 states and 8 union territories. So, if you are building something that connects to people through a machine that speaks and listens, it had better speak ALL the languages. Ha Ha!!
Not easy? Of course, it is not easy.
But that is what Voice AI startups are trying to do. And to some extent they are getting there - initially covering the top languages most widely spoken and then hoping to get enough traction to go the extra mile to cover the rest.
Last week in Bangalore, Sarvam AI launched its full range of Voice AI products in front of a crowd that included glitterati from the global AI industry, Big Tech leaders among them.
Sarvam AI is a pioneering technology company dedicated to developing cutting-edge AI solutions that address the unique challenges of the Indian market. The company aims to empower businesses and individuals with transformative AI technologies.
Sarvam AI introduced software for businesses that interact with customers using spoken voice rather than just text. The technology was developed with data from 10 native Indian languages and is priced at a rupee per minute ($0.012 per minute) to capture the market.
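To get a feel for what a rupee per minute means for a business, here is a small back-of-the-envelope sketch. The two prices come from the launch figures above; the call volumes are made-up assumptions for illustration, not Sarvam data, and the exchange rate is simply the one implied by the two quoted prices.

```python
# Illustrative only: the usage figures below are made up, not Sarvam data.
RUPEES_PER_MINUTE = 1.0   # launch price quoted above
USD_PER_MINUTE = 0.012    # dollar equivalent quoted above

# Exchange rate implied by the two quoted prices (rupees per US dollar).
implied_inr_per_usd = RUPEES_PER_MINUTE / USD_PER_MINUTE


def monthly_cost(calls_per_day: int, minutes_per_call: float, days: int = 30) -> tuple[float, float]:
    """Return (cost in rupees, cost in dollars) for a month of voice-agent calls."""
    minutes = calls_per_day * minutes_per_call * days
    return minutes * RUPEES_PER_MINUTE, minutes * USD_PER_MINUTE


if __name__ == "__main__":
    # Hypothetical workload: 1,000 calls a day, 3 minutes each.
    inr, usd = monthly_cost(calls_per_day=1000, minutes_per_call=3)
    print(f"Implied rate: about {implied_inr_per_usd:.1f} rupees per dollar")
    print(f"Monthly cost: Rs. {inr:,.0f} (about ${usd:,.0f})")
```

Under these assumed volumes, 90,000 minutes of conversation a month would come to Rs. 90,000, or roughly $1,080. Actual pricing tiers may differ, since the launch figure is quoted as a starting price.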
Sarvam products were launched last week in Bengaluru
Sarvam Agents: Voice-enabled, multilingual, action-oriented, custom business agents deployable via telephone, WhatsApp, or in-app. Currently available in 10 Indian languages: Hindi, Tamil, Telugu, Malayalam, Punjabi, Odia, Gujarati, Marathi, Kannada, and Bengali. The cost of these voice agents starts at Rs. 1/min.
Sarvam 2B: India’s first open-source foundational small Indic LLM, with 2B parameters. It is the first LLM trained from scratch by an Indian company, with compute in India, on an internal dataset of 4 trillion tokens, with an efficient representation for 10 Indian languages.
Shuka 1.0: India's first open-source AudioLM, an audio extension on the Llama 8B model to support Indian language voice in and text out, which the company says is more accurate than frontier models.
Sarvam Models: The best-in-class Indic models used in the creation of Sarvam Agents are now also available to be consumed as APIs. These include models for translation, speech recognition, speech synthesis, and document parsing. Sarvam announced its API platform for developers to leverage these models for building their GenAI use cases.
A1: A generative AI workbench designed for modern lawyers to enhance their capabilities with features such as regulatory chat, document drafting, redaction, and data extraction. (source: company website)
In a video at the product launch event, Vinod Khosla, a billionaire venture capitalist and investor in Sarvam, said, “These voice bots have the potential to reach a billion people.” (source: ET, Bloomberg)
Who are these Voice AI Startups?
Gnani AI, which counts Samsung among its investors, handles millions of voice conversations for Indian banks and enterprises.
CoRover AI offers voice bots in 14 Indian languages to PSU railway corporation IRCTC and others. The company has signed up with several large organizations, including Google, Amazon, IRCTC, LIC, and several banks, and its overall usage metrics are staggering.
Haloocom Technologies’ voice bot can speak in five Indian languages to handle customer service tasks and help screen job candidates. The company has executed over 6,200 projects across 5 countries from a suite of 15 products and about 18 applications.
An Indian language quirk is that everyday conversation is often a dual-language mash-up, such as Hinglish (a mix of Hindi and English). This typically makes it a bigger challenge for a pure-play AI bot that was created in the West, so these Indian startups truly have a unique opportunity to take over the space, unlike in messaging (WhatsApp) or search (Google or Bing).
The US companies do not have access to enough spoken Indian language data, including accents that vary from region to region. For example, Sri Mandir, a devotional app with over 10 million downloads on the App Store and a client of Sarvam, effortlessly manages bidirectional voice input/output - something ChatGPT or Claude would never have been able to do.
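To make the Hinglish mash-up problem concrete, here is a tiny sketch that flags a sentence as mixed when it contains words in both Devanagari and Latin script. This is only a toy illustration using Python's standard unicodedata module, with function names invented for this example. It is not how Sarvam or any other vendor does it, and it will miss Hinglish typed entirely in Roman letters, which is exactly why the real problem is hard.

```python
import unicodedata


def script_of(word: str) -> str:
    """Classify a word as 'devanagari', 'latin', or 'other' by its first letter."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("DEVANAGARI"):
                return "devanagari"
            if name.startswith("LATIN"):
                return "latin"
            return "other"
    return "other"  # no letters at all (digits, punctuation, empty string)


def is_code_mixed(sentence: str) -> bool:
    """True if the sentence contains both Devanagari and Latin script words."""
    scripts = {script_of(word) for word in sentence.split()} - {"other"}
    return len(scripts) > 1


if __name__ == "__main__":
    for text in ["मुझे कल meeting के लिए late हो जाएगा", "I will be late"]:
        print(text, "->", "mixed" if is_code_mixed(text) else "single script")
```

A production voice bot would more likely rely on a trained language-identification model working on audio or transcripts, rather than on script checks like this one.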
Bit of History
Conversational AI is a general term for AI systems designed to interact with humans in a natural, conversational manner. AI Voice chatbots are a subset of conversational AI that specifically uses voice as the primary interface. A summary of the differences between the two is included below.
Interface: Conversational AI can be text or voice-based, while AI voice chatbots are specifically voice-based.
Scope: Conversational AI is a broader category that includes AI voice chatbots.
Technology focus: AI voice chatbots require additional technologies like speech recognition and voice synthesis, which aren't necessary for text-based conversational AI.
Use cases: While there's overlap, conversational AI might be more commonly used in text-based customer service or chatbots on websites, whereas AI voice chatbots are often used in home automation, virtual assistants, and hands-free applications.
Multilingual capability: Advanced conversational AI systems can be trained to understand and respond in multiple languages, including regional and local languages.
Text-based advantage: For text-based conversational AI, supporting multiple languages is relatively easier as it primarily involves natural language processing (NLP) models trained on written text.
Cultural nuances: Good conversational AI systems can be trained to understand cultural contexts, idioms, and regional expressions.
Scalability: It's often easier to add new languages to text-based systems.
Script variations: Text-based systems can handle different scripts (e.g., Devanagari, Bengali, Tamil) more easily.
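The "technology focus" point in the list above can be pictured as a text bot with two extra stages bolted on: speech recognition in front and voice synthesis behind. The sketch below is a minimal illustration of that idea only. Every name in it is a hypothetical placeholder, not a real vendor API, and the toy stand-ins treat "audio" as plain UTF-8 bytes so the example runs without any audio libraries.

```python
from typing import Callable

# All three stage types are hypothetical placeholders, not a real vendor API.
SpeechToText = Callable[[bytes], str]   # audio in, transcript out
TextBot = Callable[[str], str]          # user text in, reply text out
TextToSpeech = Callable[[str], bytes]   # reply text in, audio out


def make_voice_bot(listen: SpeechToText, reply: TextBot, speak: TextToSpeech) -> Callable[[bytes], bytes]:
    """Wrap a text-based bot with speech recognition and voice synthesis."""
    def voice_bot(audio_in: bytes) -> bytes:
        transcript = listen(audio_in)   # speech recognition
        answer = reply(transcript)      # the same text bot a website chat could use
        return speak(answer)            # voice synthesis
    return voice_bot


# Toy stand-ins: "audio" here is just the UTF-8 bytes of the spoken words.
def fake_listen(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reply(text: str) -> str:
    return "Namaste! You said: " + text


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


bot = make_voice_bot(fake_listen, fake_reply, fake_speak)
print(bot(b"book a ticket"))
```

The design point is that the middle stage is unchanged between text and voice. Adding a new language to a voice bot means upgrading all three stages, while a text bot only needs the middle one, which is one way to read the "text-based advantage" and "scalability" items above.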
I remember that a decade ago, IBM, which was heavily into VoiceXML, claimed that the technology would eventually solve the problem for the entire planet, including the illiterate, because voice is the natural interface for humans. That claim fell flat over the years. But you can now hear the bells ringing again. And the same claim too :).
We are in a different zone today in terms of how much power Gen AI brings to the table. With LLMs guiding and training intelligence in a context-sensitive manner, and with continuous self-learning, Voice AI bots are a different beast.
So where do we go from here?
It is hard to predict the future. But some good guesswork is always welcome. Here are a few thoughts.
Short-term: Multimodal Interfaces
Situational appropriateness: Sometimes speaking isn't ideal, possible, or appropriate.
Accessibility: Caters to a wider range of users, including those with speech or hearing impairments.
Enhanced accuracy: Combining modalities can lead to a more accurate interpretation of user intent.
Natural interaction: Humans naturally communicate through a combination of speech, gestures, and expressions.
Long-term: Proactive AI and AGI
Predictive AI: Systems that anticipate needs based on patterns, context, and learned behaviors.
Ambient intelligence: Environments that are aware and responsive without explicit commands.
Consent and control: Pre-filled options that still leave ultimate control with the user, similar to current data collection consent forms.
Reduced cognitive load: By presenting options or taking actions proactively, these systems could significantly reduce the mental effort required for daily tasks.
Challenges and Likely Roadblocks:
Privacy and data use: More intelligent and predictive systems will require access to more personal data.
User autonomy: Balancing helpfulness with maintaining user control and decision-making.
Technological challenges: True AGI remains a significant challenge and may take longer than 5 years to realize.
Ethical implications: As AI becomes more proactive, questions of responsibility and decision-making become more complex.
Expect more action in this space - with more innovation, more language support, and multi-modal deployments for a variety of common everyday applications with higher accuracy.
What help do you need?
If you are looking to start in the Desi Voice AI space, find a portfolio to invest in, find a co-founder, set up an AI dev center in Bangalore, or find a development partner, drop me a line (contact info below).
Also, there will be more investments in the sector. Watch this space for more action.
Get your desi slice!!!
Owner, Publisher: Sridhar Pai Tonse - Tonse Pai Academy.
Disclaimer: This newsletter may contain AI-assisted content with human oversight. The views and opinions expressed are those of the publisher and do not necessarily reflect the official policy or position of any other agency, organization, or company.
Content: Some content may have been sourced from the public domain or external sources. No claims are made to intellectual property rights. All rights reserved.
Contact Me: stpai2001@gmail.com
Follow Me: LinkedIn: https://www.linkedin.com/in/tonsepai/ YouTube: https://www.youtube.com/@tonsepai FB: https://www.facebook.com/sridhar.pai.9/