What is our primary use case?
I built an AI chatbot application that users talk to. The use case involved converting speech to text with Microsoft Azure Speech Service.
For example, the user's voice is converted to text by Azure Speech Service, that text is sent to OpenAI's API, and the response is sent back to Azure Speech Service for text-to-speech conversion. Essentially, it enabled both the speech-to-text and text-to-speech halves of the conversation.
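The three-stage flow described here can be sketched roughly as follows. This is a minimal illustration, not the code from the actual application (which was written in C# for Unity). The function names and the stand-in stages are hypothetical; in a real app each stage would wrap a network call to Azure Speech Service or OpenAI's API.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
SpeechToText = Callable[[bytes], str]   # would wrap Azure speech-to-text
ChatModel = Callable[[str], str]        # would wrap a call to OpenAI's API
TextToSpeech = Callable[[str], bytes]   # would wrap Azure text-to-speech


def run_voice_turn(audio: bytes, stt: SpeechToText,
                   chat: ChatModel, tts: TextToSpeech) -> bytes:
    """Run one conversational turn: speech in, speech out."""
    user_text = stt(audio)        # 1. user's voice -> text
    reply_text = chat(user_text)  # 2. text -> chatbot response
    return tts(reply_text)        # 3. response -> synthesized speech


# Stand-in stages so the sketch runs without credentials or a microphone.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    print(run_voice_turn(b"hello", fake_stt, fake_chat, fake_tts))
```

Passing the stages in as plain callables keeps the turn logic independent of any particular SDK, so each stage can be swapped or tested on its own.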
How has it helped my organization?
In my workflow, speech was the input to the AI. What I spoke was converted to text, and that text was passed to the AI chatbot. The chatbot generated a response, which was then converted back to speech.
The chatbot answered in well-formed English, so converting its response to speech was reliable. Any problems came mainly from the speech service not accurately recognizing what the user said.
For example, depending on my tone or manner of speaking, the speech service might misunderstand my words and send incorrect text to the chatbot. If the text is wrong, the chatbot's output is likely to be wrong too, and the final result is not what was expected.
So the main concern is that the service converts the user's speech into text correctly.
What is most valuable?
The simplicity impressed me the most. We just needed a single API key. The documentation was also great.
I developed the AI application using Unity, a game engine that uses C#. I then searched online for instructions on how to use the Speech Service and found Microsoft's GitHub repository, which provided the code needed to integrate it into Unity with C#. The ease of use and the availability of documentation made the process smooth and impressed me the most.
Documentation and boilerplate code (template code) were available, and I incorporated the code into my application with modifications. Initially, the code worked so that clicking a button activated the microphone and recognized my speech.
One of the benefits was seeing my spoken words appear on the screen as I spoke. For example, if I said "I am Abhishek Rana," I could see the sentence appear in real time. When I stopped speaking, it automatically detected the silence and stopped listening, sending the text on for further processing. So, the real-time transcription feature helped me a lot.
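The behavior described above (partial text appearing while speaking, then a final result once silence is detected) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the service handles this internally on real audio, whereas here the input is a list of already-recognized words, with an empty string standing for a silent chunk. The function name and the `silence_limit` parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def transcribe_stream(chunks: Iterable[str],
                      silence_limit: int = 2) -> Tuple[List[str], str]:
    """Simulate interim and final recognition results.

    chunks: recognized words in order; "" marks a silent chunk (assumption).
    Returns (interim_updates, final_text).
    """
    words: List[str] = []
    interim: List[str] = []
    silent_run = 0
    for chunk in chunks:
        if chunk:
            silent_run = 0
            words.append(chunk)
            interim.append(" ".join(words))  # what the screen would show so far
        elif words:
            # Only count silence after speech has started.
            silent_run += 1
            if silent_run >= silence_limit:
                break  # enough trailing silence: stop listening
    return interim, " ".join(words)


if __name__ == "__main__":
    stream = ["", "I", "am", "", "Abhishek", "Rana", "", "", "ignored"]
    updates, final = transcribe_stream(stream)
    for text in updates:
        print("interim:", text)
    print("final:", final)
```

A short pause (one silent chunk) does not end the utterance; only `silence_limit` consecutive silent chunks do, after which later input is ignored and the final text is handed on.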
What needs improvement?
It recognized general use cases and normal, everyday vocabulary well. However, it could do better with speakers' native languages: for languages other than English, and even for complex English words, there is definitely room for improvement.
For how long have I used the solution?
I used it for around two months while developing an AI application for a particular project.
What do I think about the stability of the solution?
I didn't face any stability issues, such as bugs or breakdowns, while using it. Each time I started the application and spoke, the service recognized my speech. There were occasional delays, but apart from that, it functioned well.
I would rate the stability a nine out of ten.
What do I think about the scalability of the solution?
I was developing it for my own learning rather than as a public application, so I didn't publish the project on platforms like the Google Play Store or the Apple App Store. I was the only one testing and showcasing the application, so I didn't face any scalability issues. I think it would still perform well if scaled up, since it is a cloud service and the number of users shouldn't affect it much.
How was the initial setup?
There were certain challenges while integrating it with other technologies. To use this service in an Android application that I built, I needed to ask for user permission beforehand. My app has different screens, and before opening the screen where this service is used, I needed to ask for user permission. Since there are multiple ways to reach that screen, I had to make sure I asked for permission each time before entering it.
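One way to handle several routes into the same screen is to send all navigation through a single function that checks permission first. The sketch below illustrates that idea only; it is not the Android or Unity permission API. The `Navigator` class, the screen names, and the choice to remember a granted permission are all assumptions made for illustration, and the permission in question is assumed to be microphone access.

```python
from typing import Callable


class Navigator:
    """Routes every screen change through one permission check (hypothetical)."""

    # Screens that need the permission before they can open (assumed names).
    MIC_SCREENS = {"voice_chat"}

    def __init__(self, request_permission: Callable[[], bool]):
        # request_permission stands in for the platform's permission prompt.
        self._request_permission = request_permission
        self._granted = False
        self.current_screen = "home"

    def _ensure_permission(self) -> bool:
        # Prompt only if permission has not been granted yet.
        if not self._granted:
            self._granted = self._request_permission()
        return self._granted

    def go_to(self, screen: str) -> bool:
        """Switch screens; refuse if a required permission is denied."""
        if screen in self.MIC_SCREENS and not self._ensure_permission():
            return False
        self.current_screen = screen
        return True


if __name__ == "__main__":
    nav = Navigator(request_permission=lambda: True)
    nav.go_to("settings")
    print(nav.go_to("voice_chat"), nav.current_screen)
```

Because every entry point calls `go_to`, the check cannot be skipped no matter which screen the user comes from.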
What's my experience with pricing, setup cost, and licensing?
I'm a college student. I signed up for the Microsoft Azure portal using my college account, so I got a $100 credit. I've used it for various services from the portal.
I don't know the specific pricing for each use case or how much each use affects the budget. From what I've observed, though, the cost did not vary significantly from one use to the next. So it is cost-effective compared to other services I've used.
What other advice do I have?
A little development experience is definitely useful. A complete novice would face some challenges, but someone who has built one or two projects before would be able to navigate the process easily.
I would recommend Azure Speech Service to other people. For a student who wants to experiment with speech services, this would be a great place to start: you can create an account and experiment with the service.
Overall, I would rate it a nine out of ten.