Hacker News

Show HN: I trained a 9M speech model to fix my Mandarin tones

466 points by simedw

by dapangzi

4 subcomments

Longtime lurker, made an account specifically to give feedback here as an intermediate speaker. :)
This is a great initiative and I hope to see more come out of this; I am not criticizing, but just want to provide my user experience here so you have data points.
In short, my experience lines up with your native speakers.
I found that it loses track of the phonemes when speaking quickly, and tones don't seem to line up when speaking at normal conversational speed.
For example, if I say 他是我的朋友 at normal conversational speed, it assigns `de` to 我, and sometimes it decides I didn't produce the retroflex in `shi` and renders it as `si`. I listened back to make sure I said everything: the phonemes are there in the recording, but the UI displays the wrong phonemes and tones.
By contrast, if I speak slowly and really push each tone, the phonemes and tones all register correctly.
Also, is this taking into account tone transformation (tone sandhi)? For example, third tones (the bottom-out tone) tend to smoosh into a second tone (rising) when multiple third tones are spoken in a row. Sometimes the first tone influences the next tone slightly, etc.
Again, great initiative, but I think it needs a way to deal with speech that is conversationally spoken and maybe even slurred a bit due to the nature of conversational level speech.
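A minimal sketch of the third-tone rule mentioned in this comment, assuming tones are given as integers (1-4, with 5 for neutral) and using the simplified textbook rule that a third tone directly before another third tone surfaces as a second tone. The function name is hypothetical and nothing here reflects how the app works; real sandhi also depends on word and phrase grouping, which this ignores.

```python
def apply_third_tone_sandhi(tones):
    """Map citation tones to surface tones under the simplified 3-3 rule.

    Assumption: `tones` is a list of ints, 1-4 plus 5 for neutral tone.
    A tone 3 immediately followed by another tone 3 becomes tone 2.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        # Check against the citation tones so a run like 3-3-3 becomes 2-2-3.
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


# 我想学中文, citation tones: wo3 xiang3 xue2 zhong1 wen2
print(apply_third_tone_sandhi([3, 3, 2, 1, 2]))  # → [2, 3, 2, 1, 2]
```

A scorer that compared speech against citation tones only would flag the first syllable here as wrong even when it is pronounced naturally.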

by yunusabd

1 subcomments

Super nice, thanks for sharing!
There's one thing that gave me pause: In the phrase 我想学中文 it identified "wén" as "guó". While my pronunciation isn't perfect, there's no way that what I said is closer to "guó" than to "wén".
This indicates to me that the model learned word structures instead of tones here. "Zhōng guó" probably appears in the training data a lot, so the model has a bias towards recognizing that.
- Edit -
From the blog post:
> If my tone is wrong, I don’t want the model to guess what I meant. I want it to tell me what I actually said.
Your architecture also doesn't tell you what you actually said. It just maps what you said to the likeliest of the 1254 syllables that you allow. For example, it couldn't tell you that you said "wi" or "wr" instead of "wo", because those syllables don't exist in your setup.
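To illustrate the closed-vocabulary point, here is a sketch that assumes the model ends in a softmax over a fixed syllable list; the thread does not confirm the actual architecture. The four-item inventory and the scores are made up for illustration, standing in for the 1254 syllables mentioned above.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, inventory):
    """Return the most likely inventory item and its probability.

    Whatever was actually said, the result is always a member of
    `inventory`; a sound like "wi" can never be reported.
    """
    probs = softmax(scores)
    best = max(range(len(inventory)), key=probs.__getitem__)
    return inventory[best], probs[best]


# Hypothetical toy inventory and near-flat scores (no candidate fits well).
INVENTORY = ["wo3", "wu3", "wen2", "guo2"]
label, confidence = classify([0.2, 0.1, 0.0, 0.3], INVENTORY)
print(label, round(confidence, 2))
```

One design option under this assumption is to show the confidence alongside the label, so that a near-flat distribution reads as "unclear" rather than as a confident syllable.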

by namelosw

6 subcomments

Impressive work! The idea and the UI are very intuitive.
Though, as a guy from Beijing who speaks perfect Mandarin, I struggle to pass even the easy ones… So it could definitely use some improvements. The example 你好吃饭了吗 returns hào → hǎo, fān → fàn, le → liǎo. In the first two the model misheard my tones, and the last one should be le instead of liǎo in this context.
Also, I see people in the comment section are worried about tones. I can guarantee tones are not particularly useful: you can communicate with native speakers with all the tones messed up and that’s perfectly fine. As soon as you leave Beijing, you’ll find all the tones are shuffled because every region has its own dialect and accent, which doesn’t stop people from communicating at all. So don’t let tone stuff slow your learning process down.

by ecshafer

7 subcomments

For anyone who is a native speaker of a European language and hasn't tried to learn Chinese or some other tonal language, it's really hard to understand how hard it is. The tones can be very subtle, and your ear is not fine-tuned to them. So you think you are saying it right, but native speakers have no idea what you are saying.

by tifan

0 subcomment

Well, it only works when I speak word by word, not in full sentences or at a normal speed for daily conversation. The model thinks I am making mistakes when I speak casually (as a native Chinese speaker, I have Mandarin 2A certification, which is required for teachers and other occupations that require a very high degree of Mandarin accuracy). You wouldn’t really notice it, but pronunciation is very different between casual and formal speech…

by vunderba

5 subcomments

When I was living in Taiwan, one of the ways I forced myself to remember to pronounce the tones distinctly was by waving my hand in front of me, tracing the arc of each character’s tone.
It helped a lot even if I did look like an insane expat conducting an invisible orchestra.
One more thing: there's quite a bit of variation in how regional accents in the mainland can affect tonal pronunciation. It might be worth reaching out to some native speakers to give you some baseline figures.

by bunderbunder

2 subcomments

This is very cool, but from one Mandarin learner to another I’d caution against relying too heavily on any external feedback mechanism for improving your pronunciation.
If you can’t easily hear your pronunciation mistakes so clearly it hurts, consider putting more energy into training your ear. Adult language learners usually have brains that have become resistant to, but not incapable of, changing the parts of the brain responsible for phoneme recognition. The neuroplasticity is still there but it needs some nudging with focused exercises that make it clear to your brain exactly what the problem is. Minimal pair recognition drills, for example, are a great place to start.
It’s not the most fun task, but it’s worth it. You will tighten the pronunciation practice feedback loop much more than is possible with external feedback, so a better accent is the most obvious benefit. But beyond that, it will make a night and day difference for your listening comprehension. And that will get you access to more interesting learning materials sooner. Which hopefully increases your enjoyment and hence your time on task. Plus, more accurate and automatic phoneme recognition leaves more neurological resources free for processing other aspects of your input materials. So it may even help speed things like vocabulary and grammar acquisition.
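As a rough sketch of how material for a tone-focused minimal pair drill could be assembled, assuming syllables are written in numbered pinyin (for example `ma1`, `ma3`) so that the last character is the tone digit. The function name and the word list are hypothetical; a real drill would pair these with audio recordings.

```python
from itertools import combinations


def tone_minimal_pairs(syllables):
    """Return pairs of syllables that differ only in their tone digit.

    Assumption: each syllable is numbered pinyin, e.g. "ma1", where
    everything before the final character is the toneless syllable.
    """
    pairs = []
    for a, b in combinations(syllables, 2):
        same_base = a[:-1] == b[:-1]
        different_tone = a[-1] != b[-1]
        if same_base and different_tone:
            pairs.append((a, b))
    return pairs


print(tone_minimal_pairs(["ma1", "ma3", "mai3", "ma4"]))
# → [('ma1', 'ma3'), ('ma1', 'ma4'), ('ma3', 'ma4')]
```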

by memalign

3 subcomments

I wish this had a pinyin mode…! I am learning to speak Mandarin but I am not learning to read/write.
( I’m learning using a flashcards web app I made and continue to update with vocab I encounter or need: https://memalign.github.io/m/mandarin/cards/index.html )

by rahimnathwani

1 subcomments

This is incredible. When I was first learning Chinese (casually, ~20 years ago), my teacher used some Windows software that drew a diagram of the shape of my pronunciation, so she could illustrate what I was getting wrong in some objective way.
The thing you've built is so good, and I would have loved to have it when I was learning Mandarin.
I tried it with a couple of sentences and it did a good job of identifying which tones were off.

by alixwang

1 subcomments

As a native speaker of Mandarin, the demo doesn't work for me. It can't check my pronunciation. I don't know what's wrong with it; maybe it's too sensitive (my daughter is watching cartoons next to me).

by affogarty

1 subcomments

This is extremely cool, although I asked my wife (who is Chinese) to try it out and it said she made some mistakes.

by zelphirkalt

1 subcomments

I think this is a good time for a shameless plug. For the last 2 months or so I have been working on my own project [1] for learning more characters. I have made a tool with a powerful search function, a training mode, and other useful features, such as plots that show your progress and whether you are reaching your daily training goal, and the ability to save searches, a la Thunderbird saved filters. It is written in Python and oldschool tkinter with custom widgets for a somewhat more modern and capable feel. It is very configurable, though currently configuring it means touching a JSON file, as I have not yet bothered writing a GUI for that.
I am mostly developing this for myself, to have the perfect tool for me, but I dare say, that I have not seen anything comparable and that I let my 10y+ experience in learning Chinese influence my design decisions. Oh, and it is free /libre software of course (AGPL). It comes with an ever improving vocabulary file that has tons of metadata about words, their usage, how to memorize them, etc. under ODbL (open database license).
[1]: https://codeberg.org/ZelphirKaltstahl/xiaolong-dictionary

by holg

0 subcomment

Great idea and effort, thanks for sharing. It is even way more strict than my native chinese tryarounds :D

by frozennothing

0 subcomment

This is really cool. Thank you for sharing. Before now I had not sought to understand how this technology works under the hood, but seeing it done at this scale made me curious to see if I could do something similar.

by ChadNauseam

0 subcomment

This is amazing. I'm also working on free language learning tech. I have some SOTA NLP models on huggingface and a free app. My most recent research is a list of every phrase [0].
Pronunciation correction is an insanely underdeveloped field. Hit me up via email/twitter/discord (my bio) if you're interested in collabing.
[0]: https://gist.github.com/anchpop/acbfb6599ce8c273cc89c7d1bb36...

by stuxnet79

1 subcomments

How difficult would it be to adapt this to Cantonese? It is a surprisingly difficult language to learn. It has more tones than Mandarin plus comparatively less access to learning resources (in my experience)

by sgt

0 subcomment

How do you actually go about training specialized speech models? Let's say you have a language dialect you want to specialize on, or a pidgin English from West Africa, or a regular language but with highly specialized terminologies being used.
Just curious - would you need insane HW infrastructure to begin with, or hosted/managed. And what tooling is preferred by the industry for the "training"?

by rablackburn

0 subcomment

> And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.
There are still holdouts!
Come back to me in a couple of decades when the trove of humanity's data has been pored over and drifted further out of sync with (verifiable) reality.
Hand-tuning is the only way to make progress when you've hit a domain's limits. Go deep and have fun.

by ctkhn

0 subcomment

This is fantastic. Been looking for a way to get feedback on my pronunciation since I came back from Shanghai and haven't been seeing native speakers every day. Is there any plan to make this a download for desktop or mobile? Would be using it weekly to get back up to par on Mandarin

by erdemo

0 subcomment

This thread is like a diamond to me because I have been thinking about building almost the same thing for English tones. I need a model like this.
I'm sure there are a bunch of apps out there that claim they do the same thing, but they don't, IMO. Even if they do, as you said, where is the fun in that?
Great post, thanks for it!

by cocoa19

1 subcomments

Have you tried the Azure Speech Studio? I wonder how your custom model compares to this solution.
I played around with python scripts for the same purpose. The AI gives feedback that can be transformed to a percentage of correctness. One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.

by alexandermorgan

0 subcomment

I wish this were available for more languages! It would also be neat to estimate the native language of the speaker, given their pronunciation of the target language, and propose a prioritization of the pronunciation mistakes the language learner should work on first.

by kris_builds

0 subcomment

Super interesting project. Curious about the data collection - did you record yourself, use existing datasets, or both? I've been thinking about building something similar for Hebrew vowels (which are often omitted in writing). Would love to hear what the hardest part of the pipeline was.

by drekipus

0 subcomment

instantly awesome.
I suck at Chinese but I want to get better and I'm too embarrassed to try and talk with real people and practise.
This is a great compromise. even just practising for a few minutes I already feel way more confident based on its feedback, and I feel like I know more about the details of pronunciation.
I'm worried this might get too big and start sucking like everything else.

by arjie

0 subcomment

Very cool. As a super newbie who's only made it to Pimsleur 15 and only for the speaking, it would be cool to have a pinyin text entry and so on. In the end, I just type into ChatGPT what I want and paste it in your box so it's not a big deal.

by SequoiaHope

0 subcomment

Amazingly I just did the same thing! Only with AISHELL. It needs work. I used the encoder from the Meta MMS model.
https://github.com/sequoia-hope/mandarin-practice

by redleader55

0 subcomment

This is very cool to have! Thanks for putting in the time to build it.
For me it doesn't work very well. Even easy phrases like 他很忙 get transcribed completely randomly, as "ma he yu". Is it maybe over-fitted to some type of voice?

by tomaytotomato

0 subcomment

Can the implementation used here for tone and pronunciation be applied to music?
It would be cool if a model could tell you whether you are singing or playing a piece of music with the right intonation, among other things.

by jainaayush05

0 subcomment

Any plans on releasing the inference/training code?

by olalonde

0 subcomment

It might be a mic issue but my wife, who is a native speaker, seems to get most characters wrong. I will try again later in a quieter place to see if that helps.

by sim04ful

0 subcomment

I'm also working on a Chinese learning app (heyzima.com) and my "solution" to this was to use the TTS token/word log probabilities.

by byb

0 subcomment

Neat. A personal tone trainer. Seriously, shut up and take my money now. Of course, it needs a vocabulary trainer, and zhuyin/traditional character support.

by JCharante

0 subcomment

Cool! I'm not great at Chinese but I have to speak slowly for it to recognize the tones/words. I wonder how fast the training data is.

by jrockway

0 subcomment

Interesting application! A friend of mine built a model like this to help her make her voice more feminine, and it is neat to see a similar use case here.

by baby

0 subcomment

For people trying to say the "j" sound correctly, as in "jiu" (old), just say "dz", so in that example "dziu"

by dionian

0 subcomment

it heard wu2 but i heard wo2 from you fine. and it should sound like wo2 not wo3 if spoken quickly. not a native speaker though so i could be wrong

by namr2000

0 subcomment

Wow, I was going to make something almost exactly like this! Really cool work and thank you for sharing

by while_true_

0 subcomment

Suggestion: in addition to the microphone input, allow the user to upload an audio file.

by eudamoniac

0 subcomment

How do you know that what it tells you is correct if you can't hear it yourself?

by bytesandbits

1 subcomments

great work! I am going to try it out. Currently about to learn some Mandarin to be able to talk with hawker stand owners for a trip I am doing soon. I am trilingual and can speak a few languages on top of that, but none of them tonal. I am new to tonal languages and I find myself struggling with this... a lot!

by victorbjorklund

1 subcomments

Cool. Would love a write up about how you did it if you have time

by btrlsnqtn

1 subcomments

The article mentions the bitter lesson. I'm confused about the status of Sutton's opinion of the bitter lesson. On the one hand, he invented the concept. On the other hand, he appears to be saying that LLMs are not the correct approach to artificial intelligence, which to a naive outsider looks like a contradiction. What gives?

by irl_zebra

0 subcomment

I am a huge, huge AI hater. I hate, hate, hate all the "Show HN: My Latest AI Slop App That Sucks and Required No Creativity to Think Up or Vibe Code and is Useless and I am a Useless Void of a Person for Having Created It." I say that to give context to say that this is the first legitimately useful "Show HN" I have seen in this AI sphere. It's really great, it seems to work quite well (I am an amateur mandarin speaker, I "know" about 5,000 words, so can vaguely judge) and fulfills a legitimate use case. I would pay you to use this once the model improves a little bit. It's really fantastic. You did well.

by mentalgear

0 subcomment

Very cool ! Will you make the source available as well?

by cheonn638

0 subcomment

Unclear if it wants 媽媽 / 妈妈 as:
- māmā (incorrect)
- māma (correct)

by maximedupre

0 subcomment

This is sick... you can just do things :D

by martianlantern

0 subcomment

Nice! I need something similar for english now

by jellojello

2 subcomments

This is amazing, if you feel like opening an entire language to being learned more easily.. Farsi is a VERY overlooked language, my wife/her family speak it but it's so difficult finding great language lessons (it's also called Persian/Dari)

by somesun

0 subcomment

Is there an English or Japanese learning model like this?

by nirvanatikku

0 subcomment

talk about 30 seconds to wow. great app, UX and demo. would love to use this. kudos.

by felixbecker

0 subcomment

What a brilliant project!

by contingencies

0 subcomment

Man, get a girlfriend.

by cmuguythrow

0 subcomment

Awesome idea!

by wenjian

0 subcomment

Chinese here. Some of the tones come out wrong; maybe my environment has some noise. Good luck on learning Mandarin ;)

by kanwisher

0 subcomment

Nice now just add Thai support ;)

by iamanllm

0 subcomment

holy crap, I was literally imagining how I wanted something exactly like this yesterday! you are a hero!
