It’s frustrating when you’re not understood — especially when you’re trying to speak to Siri, Alexa, or another internet-connected device.
Voice datasets that power voice recognition services are owned by a handful of major companies, and they can wildly underrepresent the voices of people with non-dominant accents; Black, Indigenous, and other people of color; disabled people; and gender marginalised people. In fact, for people speaking other global languages, there may be no datasets at all.
That’s why Mozilla launched Common Voice — the world’s largest public voice database, powered by the voices of volunteer contributors. Our goal is to teach machines how real people speak.
Today, we’re asking you to contribute to Common Voice, but we want you to choose how you’ll do it. Will you donate your voice to one of our Common Voice language datasets? Or will you make a $34 donation to Mozilla to support projects like this to reclaim the internet? (Or both!)
I’d be curious about the privacy concerns, but this might help a lot with underrepresented voice data. It might come down to whether someone wants more datasets for their particular voice/language more than they care about the other concerns.
If your language/accent is already well documented, it might not help as much?
Mozilla: “We’d like to build a dataset of underrepresented languages and accents so that voice recognition works for everyone. It’ll be under an open license.”
Most of this thread: “GIVE ME MONEY.”
Sigh. As soon as it turned out that AI training data was “worth something” everyone turned into a money-grubbing mercenary.
As long as the actual software is free and open source, I’m willing to help.
The data set is available under the Mozilla Public License v2 through the Common Voice GitHub page. I’m not sure if I’m reading the terms of the license correctly, but I believe it allows commercial use.
Why would black or gender marginalised people have a different voice?
Dude, I clicked on the link pretty excited to volunteer. I have a professional mic, a little time, and a decent voice. The first thing that greets me is “Voice datasets also underrepresent: non-English speakers, people of colour, disabled people, women and LGBTQIA+ people.”
Well, I’m none of those. So maybe they don’t want my donation, or I’d spend time and they wouldn’t use my recordings… Sort of a letdown.
Right? Like we’re all made the same on the inside.
So, wait, you’re fine with Mozilla impersonating you as long as you get a little money in the process?
Not that this is what Mozilla wants this data for, mind you, I’m just puzzled by this place you’ve jumped to.