This project provides a simple, unified client wrapper around multiple Speech-to-text (STT) providers for the basic use cases, and gives users an easy way to switch between providers and test them.
The accuracy of Speech-to-text (STT) has improved significantly over the past few years. There are many cloud STT providers on the market, including big players like Google and AWS and a few ambitious newer providers like Voicegain.ai and Assembly.ai.
As a Speech Recognition Scientist, I have reviewed many providers in the last few years, and I have noticed that each provider has its own unique features. However, the majority of users do not necessarily need those additional features, especially in the early testing stage. Their requirements are very simple and basic -- getting an accurate transcript of the provided audio.
Regarding my personal background, I am a Senior AI Scientist at Voicegain (specializing in Speech Recognition), but this repository, USTTC, is a personal project, and I intend to work on it without any bias. As mentioned, the goal of this project is to enable more people in the community to explore and test STT without the trouble of dealing with each provider's different APIs and documentation.
Please ensure that you have ffmpeg installed before installing USTTC.
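A quick way to check the ffmpeg prerequisite from Python is sketched below. It only looks for an executable named `ffmpeg` on the PATH; the helper name `ffmpeg_available` is made up for this example and is not part of USTTC.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found; install it before installing USTTC")
```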
You can install the module from the Python Package Index with the command below.
pip install usttc
Currently, USTTC supports the following 6 STT providers. We are going to include a few more providers later on.
These six providers are included because they all have comparable accuracy, reasonably complete features, and simple-to-use client SDKs. Now you need to decide which providers you want to test. This can be an overwhelming task, because there is no single right answer. Each provider has unique strengths and weaknesses, as well as its own pricing strategy. If you don't know which one is best for your application, we suggest you test all of them on your own audio samples to get a sense of how they perform. Fortunately, USTTC makes it very easy to test multiple providers using (almost) the same code, which was the original intention of USTTC.
The following table shows the price of each provider, so that you can also choose based on your budget.
Provider Price Details[1] | $ per minute[2] | Free Tier per month | Free Credits | Minimum per request charge[3] | Increments |
---|---|---|---|---|---|
Google STT | $0.0360 | 60 minutes | 8,333 minutes ($300)[4] | 15 seconds | 15 seconds |
AWS Transcribe | $0.0240 | 60 minutes[5] | No | 15 seconds | 1 second |
Voicegain.ai | $0.0095 | No | 5,263 minutes ($50) | 1 second | 1 second |
Rev.ai | $0.0350 | No | 300 minutes | 15 seconds | 15 seconds |
Assembly.ai | $0.0150 | 180 minutes | No | 1 second | 1 second |
Deepgram | $0.0125 | No | 12,000 minutes ($150) | Not clear | Not clear |
[1]: The price may change. Please check the pricing page for each provider
[2]: This is the pay-as-you-go price. All providers provide discounts for high volumes
[3]: You need to consider this if the average audio duration is shorter than 15s in your application
[4]: The Google Cloud Free credits are distributed across all cloud services and are only valid for the first 90 days
[5]: The AWS Free Tier is only available for the first 12 months
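To see how the minimum charge and the billing increment affect short clips, here is a rough cost sketch. It assumes a provider first applies the minimum and then rounds the duration up to the next increment; that rule, and the helper names `billed_seconds` and `request_cost`, are assumptions for illustration, so check each provider's pricing page for the exact behavior.

```python
import math


def billed_seconds(duration_s: float, minimum_s: int, increment_s: int) -> int:
    """Apply the minimum charge, then round up to the billing increment."""
    chargeable = max(duration_s, minimum_s)
    return math.ceil(chargeable / increment_s) * increment_s


def request_cost(duration_s: float, price_per_minute: float,
                 minimum_s: int, increment_s: int) -> float:
    """Pay-as-you-go cost in dollars for one request (assumed billing rule)."""
    return billed_seconds(duration_s, minimum_s, increment_s) / 60 * price_per_minute


if __name__ == "__main__":
    # A 5-second clip, using the table's prices, minimums, and increments.
    print("Google STT:", request_cost(5, 0.0360, minimum_s=15, increment_s=15))
    print("Voicegain.ai:", request_cost(5, 0.0095, minimum_s=1, increment_s=1))
```

Under this assumed rule, a 5-second clip is billed as 15 seconds by a provider with a 15-second minimum, so the per-minute price alone understates the gap between providers for short audio.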
Once you decide which providers to test, you can create an account with them by following the steps below.
# Google STT: authenticate with a service-account JSON key file and a storage bucket
from usttc import AsrClientFactory, AsrProvider
asr_client = AsrClientFactory.get_client_from_key_file(
    asr_provider=AsrProvider.GOOGLE,
    filename="<YOUR_GOOGLE_CLOUD_JSON_KEY_FILE_PATH>",
    google_storage_bucket="<YOUR_GOOGLE_STORAGE_BUCKET_NAME>"
)

# AWS Transcribe: authenticate with an access key pair, plus the S3 bucket and its region
from usttc import AsrClientFactory, AsrProvider
asr_client = AsrClientFactory.get_client_from_key(
    asr_provider=AsrProvider.AMAZON_AWS,
    key="<YOUR_AWS_USER_ACCESS_KEY_ID>",
    aws_secret_access_key="<YOUR_AWS_USER_SECRET_ACCESS_KEY>",
    region_name="<YOUR_S3_BUCKET_REGION>",
    s3_bucket="<YOUR_S3_BUCKET_NAME>"
)

# Voicegain.ai: authenticate with a JWT token
from usttc import AsrClientFactory, AsrProvider
asr_client = AsrClientFactory.get_client_from_key(
    asr_provider=AsrProvider.VOICEGAIN,
    key="<YOUR_VOICEGAIN_JWT_TOKEN>"
)

# Rev.ai: authenticate with an access token
from usttc import AsrClientFactory, AsrProvider
asr_client = AsrClientFactory.get_client_from_key(
    asr_provider=AsrProvider.REV,
    key="<YOUR_REV_ACCESS_TOKEN>"
)

# Assembly.ai: authenticate with an API key
from usttc import AsrClientFactory, AsrProvider
asr_client = AsrClientFactory.get_client_from_key(
    asr_provider=AsrProvider.ASSEMBLY_AI,
    key="<YOUR_ASSEMBLY_AI_API_KEY>"
)

# Deepgram: authenticate with an API key
from usttc import AsrClientFactory, AsrProvider
asr_client = AsrClientFactory.get_client_from_key(
    asr_provider=AsrProvider.DEEPGRAM,
    key="<YOUR_DEEPGRAM_API_KEY>"
)
Both pre-recorded audio files and real-time audio streams can be transcribed with USTTC.
Using USTTC, it's super easy to transcribe your audio file in (almost) any format. Here is an end-to-end example with a .wav file as the input.
from usttc.audio import AudioFile
audio = AudioFile(file_path="<YOUR_AUDIO_FILE_PATH>")
result = asr_client.recognize(audio)
print(result.transcript)
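Because every client exposes the same `recognize` call, running one audio file through several providers can be a short loop. The sketch below relies only on the `recognize(audio)` method and the `transcript` attribute shown above; the helper name `transcribe_with_all` and the choice to record `None` for a failed provider are assumptions for this example, not part of USTTC.

```python
from typing import Any, Dict, Optional


def transcribe_with_all(clients: Dict[str, Any], audio: Any) -> Dict[str, Optional[str]]:
    """Run the same audio through every client and collect the transcripts.

    Each client only needs a `recognize(audio)` method whose result has a
    `transcript` attribute. A provider that raises is recorded as None so
    one failure does not stop the comparison.
    """
    transcripts: Dict[str, Optional[str]] = {}
    for name, client in clients.items():
        try:
            transcripts[name] = client.recognize(audio).transcript
        except Exception:
            transcripts[name] = None
    return transcripts
```

To use it, build one client per provider with `AsrClientFactory` as in the account setup snippets, put them in a dictionary keyed by any label you like, and pass that dictionary together with an `AudioFile`.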
An audio file can contain multiple speakers in two ways.
Please note here:
To compare the results from multiple recognizers and know which one is more accurate for the application, I'll normally start by reviewing a few results and getting a sense of the weaknesses and strengths of each recognizer. Sometimes, after I see a few examples, I can easily tell for a specific project which recognizers work and which do not.
If you want to compare the results in a more scientific manner, you can prepare a gold-standard reference and calculate the Word Error Rate (WER) of the results from each STT provider. However, calculating WER is not trivial, because we don't want to penalize a recognizer if the only difference between its result and the gold reference is punctuation or capitalization. Moreover, a number should be accepted in either digit or spelled-out form.
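To make the normalization issue concrete, here is a minimal WER sketch that lowercases the text, strips punctuation, and spells out single digits before computing a word-level edit distance. It is an illustration only: the helper names are invented for this example, and multi-digit numbers are left unhandled.

```python
import re

# Only single digits are mapped; larger numbers are out of scope for this sketch.
_DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(text: str) -> list:
    """Lowercase, replace punctuation with spaces, and spell out single digits."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return [_DIGIT_WORDS.get(word, word) for word in text.split()]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)
```

With this normalization, "I have 2 cats." and "i have two cats" score a WER of 0, while a genuinely wrong word still counts as an error.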
Voicegain.ai provides a Python package called transcribe-compare to help you calculate WER (and do more than that). It handles many of the issues in calculating WER, including the punctuation, capitalization, and digit issues mentioned above. You can install it from the Python Package Index with the command below.
pip install transcribe-compare
We provide a simple example of using USTTC and transcribe-compare together. You can also check their GitHub page for more examples of advanced use cases.
[This feature will be available soon]
After you compare the results from multiple recognizers, you might realize that none of them is perfect (a cold, brutal reality). Different STT providers tend to make mistakes in different places. If your budget allows, you can run multiple recognizers at the same time and get higher accuracy by ensembling their results. This feature is on our roadmap.
[This feature will be available soon]
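Until that feature exists, one very simple stand-in is to pick the transcript that agrees most with the others. The sketch below does only that: it selects a whole transcript by average word-level similarity, which is much simpler than true word-level ensembling and is not the planned USTTC feature. The function name `pick_consensus` is made up for this example.

```python
from difflib import SequenceMatcher
from typing import Dict


def _similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] between two transcripts."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def pick_consensus(transcripts: Dict[str, str]) -> str:
    """Return the provider name whose transcript agrees most with the others."""
    if len(transcripts) < 2:
        raise ValueError("need at least two transcripts to compare")

    def mean_agreement(name: str) -> float:
        others = [text for other, text in transcripts.items() if other != name]
        return sum(_similarity(transcripts[name], text) for text in others) / len(others)

    return max(transcripts, key=mean_agreement)
```

This needs at least three providers to be meaningful, since with two transcripts each agrees with the other equally.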
In some applications (e.g. real-time), it's important to stream the audio to the recognizer and get the result simultaneously. All the STT providers that USTTC selected have the streaming feature. The streaming wrapper will be available soon.