I've recently been playing with robot voice effect algorithms for an unrelated project. One of the outputs I got was a barely intelligible, very bleepy voice similar to PSG arpeggios.

This made me wonder if it was possible to imitate human speech through partial reconstitution of the formants on the SN76489.

I made a dirty program to, in order:
- Split a wave file into chunks of length 1/f (about 16.7 ms for f = 60 Hz).
- Do an FFT of each chunk to get its power spectrum.
- For each chunk, pick the 3 most powerful components.
- Record the frequency and power of those components in an asm include file.
- Optionally generate a simulation wave file using 3 square oscillators.
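
The steps listed above can be sketched in Python roughly as follows. This is not the author's actual program: the names `top_components` and `analyse` are made up for illustration, and a naive DFT stands in for the FFT so that the sketch needs nothing outside the standard library (it is slow on full-size chunks).

```python
import cmath
import math

def top_components(chunk, sample_rate, count=3):
    """Return the `count` strongest (frequency_hz, power) pairs of one chunk."""
    n = len(chunk)
    mean = sum(chunk) / n  # remove DC so it cannot leak into the low bins
    powers = []
    for k in range(1, n // 2):  # skip bin 0 (DC), stop below Nyquist
        acc = sum((chunk[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        powers.append((abs(acc) ** 2, k * sample_rate / n))
    powers.sort(reverse=True)  # strongest bins first
    return [(freq, power) for power, freq in powers[:count]]

def analyse(samples, sample_rate=44100, update_rate=60):
    """Split `samples` into 1/update_rate chunks and analyse each one."""
    chunk_len = sample_rate // update_rate  # 735 samples at 44100 Hz and 60 Hz
    return [top_components(samples[i:i + chunk_len], sample_rate)
            for i in range(0, len(samples) - chunk_len + 1, chunk_len)]
```

Each entry of the returned list is one chunk, holding up to three (frequency, power) pairs that would then be converted to PSG register values.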

What I got is this:
http://furrtek.free.fr/tmp/sega_psg.mp3

First sample is the original file at 44100Hz.
Second is synthesized, updating the PSG registers at 60Hz.
Third updates at 120Hz (twice per frame, best result I think).
Fourth updates at 240Hz (four times per frame).
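
For reference, a simulation like the synthesized samples could be rendered along these lines. This is only a sketch under assumptions: `render_squares` is a hypothetical name, amplitudes are linear, and the oscillators are ideal 50% square waves whose phase carries over between updates, which may not match how the real tool or chip behaves.

```python
def render_squares(frames, sample_rate=44100, update_rate=60):
    """Render frames of (frequency_hz, amplitude) pairs as summed square waves."""
    chunk_len = sample_rate // update_rate  # samples between register updates
    phases = {}  # oscillator index -> phase in cycles, kept across frames
    out = []
    for frame in frames:
        for _ in range(chunk_len):
            sample = 0.0
            for osc, (freq, amp) in enumerate(frame):
                phase = phases.get(osc, 0.0)
                sample += amp if phase < 0.5 else -amp  # high half, then low half
                phases[osc] = (phase + freq / sample_rate) % 1.0
            out.append(sample)
    return out
```

Passing `update_rate=120` or `update_rate=240` corresponds to the twice-per-frame and four-times-per-frame variants.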

Although intelligibility depends heavily on the original voice's clarity and tone, and on the use of common (identifiable) words, I think this approach is still interesting from a data size and CPU time point of view.

It could be used, for example, to give more depth to game characters during dialogs or cutscenes where the narrated text is displayed on screen (matching the text to the voice can work pretty well), or even for some sort of voice synthesis from syllables and a dictionary.

Regarding data size and CPU time, the PSG data can be packed/reduced to 5 bytes per chunk, giving a bitrate of 600 B/s with 2 updates per frame, compared to the 2000 B/s and 100% CPU use of the common volume-only approach.
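
The bitrate figure follows from simple arithmetic, shown here as a small sketch (`stream_bytes_per_second` is just an illustrative helper). Note that three 10-bit tone values plus three 4-bit attenuations would be 42 bits, slightly more than 5 bytes, so the packing presumably drops some precision somewhere; the post does not say how.

```python
def stream_bytes_per_second(bytes_per_update, updates_per_frame, frame_rate=60):
    """Data rate of a register stream updated a fixed number of times per frame."""
    return bytes_per_update * updates_per_frame * frame_rate

rate = stream_bytes_per_second(5, 2)  # 5 bytes, twice per frame, 60 frames/s
print(rate)                           # 600
print(rate / 2000)                    # 0.3, relative to the 2000 B/s quoted above
```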

Possible evolutions include overlapped analysis, use of the noise channel, and a frequency matching algorithm that takes the harmonics of square waveforms into account.

Should I dig more into this ? :o

Attachment: sega_psg.mp3 (212.32 KB)

I would love it if you did. Even though the samples aren't as crisp as they "could" be, I think this is a great proof of concept. I've been thinking of diving into some study of this myself, but I currently lack understanding of a good number of the topics necessary to do this. The SMS, and especially the GG, show on some occasions how clear voice samples can sound when done by someone who knows what they are doing (the Madou Monogatari series I recently worked on comes to mind).

Until recently I assumed that samples took up so much space that games were limited to a few, but seeing games with 40+ voice samples, with rather clear diction, threw that misconception out the window. It is more probable that, as with FM, most composers/sound engineers were not competent enough to bring out its best features. Compile comes to mind as a developer who worked with both very well.

I'm not sure what the intended outcome of this would be. If voices were given a melodic overlay (like auto-tune?) or some other gimmick, I'm sure it would be an appreciated resource.

It sounds as if this was something you just decided to work on one afternoon as a "what if". It would be nice to see where you could take this. It doesn't need to be anything as sophisticated as a PSGlib for voices, but I'm sure those of us who are less technically capable would appreciate your efforts. With some polishing, maybe a bit more exploration, I can imagine people might want to incorporate something like this into their projects. I know I would.

So, if it is not too much trouble, please continue.

Yes, it's a cool technique that can have some interesting effects applied to it.

Another good example: #t=257

Great demo, I was sure someone already experimented with this but couldn't find the right keywords to hear examples !

I'll be sure to release a tool to export the right data from a wave file, and also the player's source to include.
I'm far from being an audio expert or voice actor, so I can't really know in advance how to improve the sound's quality except by simply updating the PSG registers more often.

I didn't write about playback: The raster line interrupt makes a good timer.

Are there any SMS or GG games which manage to animate stuff while reading samples ? I've always seen the CPU just sit in a playback loop with multiple timing NOPs after disabling interrupts.

furrtek wrote

Are there any SMS or GG games which manage to animate stuff while reading samples ? I've always seen the CPU just sit in a playback loop with multiple timing NOPs after disabling interrupts.

The only example that comes to my mind is Space Harrier: when you "die", the whole game freezes and the only animated thing is the player's character falling.

This is very interesting. I always steered clear of even experimenting with voice samples because I thought it was necessary to stop the whole game while playing a sample. Sample playback at 60Hz, while admittedly barely intelligible, could still be used to add some character to a game while keeping it running at full speed. The quality would certainly be good enough for non-word voice sounds ("Huh", "Ah", "Mh") or death screams.

It's a nice idea. I wonder if the noise channel could be used for the sibilant consonants; I doubt it would improve things by much, though.

Here: https://github.com/furrtek/PSGTalk

For now it only takes 44100Hz 8-bit mono raw files as input and only outputs a simulation in the same format, with no PSG parameter export. I'll do that tonight along with the player asm code.

That's neat! Really chirpy sound though, much like the FM-based attempts, but a bit worse.

haroldoop wrote

It's a nice idea. I wonder if the noise channel could be used for the sibilant consonants; I doubt it would improve things by much, though.

Fricatives (like /f/) are only noise, so they would benefit from being simulated. "Ancient" discrete-logic voice simulators used a sine/square wave generator coupled with a noise source.

Cool though slightly nightmare inducing.

The first parse has a base resonance. Be nice if that could be isolated.

EDIT: "first parse" refers to the second sample ("Second is synthesized, updating the PSG registers at 60Hz"). The third sample also has it, but reduced.

Is this caused by multiple waves producing artificial square waves below native levels?

psidum wrote

The first parse has a base resonance. Be nice if that could be isolated.

EDIT: "first parse" refers to the second sample ("Second is synthesized, updating the PSG registers at 60Hz"). The third sample also has it, but reduced.

Is this caused by multiple waves producing artificial square waves below native levels?

It's the opposite: the square waves are a sum of sine waves.
In other words, the distortion you hear is a consequence of using a square wave, which is itself a sum of sine waves, to represent a single sine wave.
I guess there could be a way of producing a better approximation but, right now, I can't think of any that doesn't involve brute force. :P

Basically, it's an optimization problem where one has the sum of sine waves that compose the original voice, and three square wave generators that themselves produce sums of sine waves; the objective is to set up the frequency and volume of each generator so that the sum of their signals produces the minimum amount of error when compared to the original sum of sinusoids. I guess one could also ignore the sinusoids that are beyond the range of human hearing in order to reduce the number of variables.
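
One way to write down the error term of that optimization problem is sketched below. It relies on the fact that an ideal 50% square wave of amplitude A contains only odd harmonics k, each with amplitude 4A/(πk). Everything else is an assumption for illustration: the names `square_spectrum` and `spectral_error`, the 16 kHz cutoff, ignoring phase entirely, and treating all frequencies as lying on a shared grid so they can be compared exactly.

```python
import math

def square_spectrum(freq, amp, max_freq=16000.0):
    """Sine components (frequency -> amplitude) of an ideal 50% square wave."""
    spectrum = {}
    k = 1
    while k * freq <= max_freq:
        spectrum[k * freq] = 4.0 * amp / (math.pi * k)  # odd harmonics only
        k += 2
    return spectrum

def spectral_error(target, generators, max_freq=16000.0):
    """Sum of squared amplitude differences between target and the generators."""
    model = {}
    for freq, amp in generators:
        for f, a in square_spectrum(freq, amp, max_freq).items():
            model[f] = model.get(f, 0.0) + a  # generators add up per frequency
    freqs = set(target) | set(model)
    return sum((target.get(f, 0.0) - model.get(f, 0.0)) ** 2
               for f in freqs if f <= max_freq)
```

A brute-force search would evaluate `spectral_error` for every candidate combination of three (frequency, amplitude) pairs and keep the smallest result.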

Thanks for the run down.

It sounds quite cool when you put it through a low pass filter.

Here is a C implementation of the old C64 SAM, text to speech.
https://github.com/s-macke/SAM

This used the 3 sound channels of the C64 to do text to speech, though the SID can select between square, triangle & sawtooth for each channel plus filters. I guess there is some interesting info there.

I guess you have to account for the first few harmonics anyway, those that are audible (say up to 16kHz at least), to approximate the FFT using 3 square waves.
Can't say if 60Hz update is enough, though.
The generated VGM anyway can be converted to a PSG file and you could use PSGlib to replay it :)

FluBBa wrote

This used the 3 sound channels of the C64 to do text to speech, though the SID can select between square, triangle & sawtooth for each channel plus filters. I guess there is some interesting info there.

SAM actually used the digi method of tweaking the 4-bit volume register to produce the voice; it didn't use any of the oscillators at all.
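
As a generic illustration of the volume-register ("digi") method, 8-bit samples can be reduced to 4-bit levels and stored two per byte. This is not SAM's actual code or data format: `to_4bit` and `pack_nibbles` are hypothetical helpers, and the sketch assumes unsigned 8-bit input and a linear 4-bit volume.

```python
def to_4bit(samples_8bit):
    """Keep the top 4 bits of each unsigned 8-bit sample (0..255 -> 0..15)."""
    return [s >> 4 for s in samples_8bit]

def pack_nibbles(levels):
    """Pack 4-bit levels two per byte, first level in the high nibble."""
    if len(levels) % 2:
        levels = levels + [levels[-1]]  # pad odd lengths by repeating the last level
    return bytes((hi << 4) | lo for hi, lo in zip(levels[0::2], levels[1::2]))
```

During playback each nibble would be written to the volume register at the sample rate, which is what keeps the CPU busy.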

best regards,
- dink

Yes, you're completely right. I guess I read very late one night =)
I wonder how often it updates its values? And could it be better if using square waves instead of sines?

Old topic dig-up, I updated the code so that it's now useful (maybe).

https://github.com/furrtek/PSGTalk

The program can take multiple parameters and can output data for various clock rates. Also added example asm playback code using the raster interrupt.
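
Supporting various clock rates comes down to how frequencies map to the chip's registers. The SN76489 tone output is clock / (32 × N) for a 10-bit divider N, and its attenuation registers work in 2 dB steps with 15 meaning silence. The sketch below shows that conversion; the function names and the rounding and clamping choices are assumptions, not taken from PSGTalk, and the default clock is the NTSC Master System value.

```python
import math

def tone_register(freq_hz, clock_hz=3579545):
    """10-bit SN76489 tone divider N, where output frequency = clock / (32 * N)."""
    if freq_hz <= 0:
        return 0x3FF
    n = round(clock_hz / (32.0 * freq_hz))
    return max(1, min(0x3FF, n))  # clamp to the usable 10-bit range

def attenuation(amplitude):
    """4-bit attenuation for a linear amplitude in (0, 1]; 2 dB per step, 15 = silent."""
    if amplitude <= 0:
        return 15
    steps = round(-20.0 * math.log10(min(amplitude, 1.0)) / 2.0)
    return min(15, steps)
```

With the default clock, the largest divider (1023) gives roughly 109 Hz, so any component below that cannot be reproduced directly by a tone channel.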

It sounds nowhere near the simulation file so there's still lots of improvement to do... Todo-list in source file.

I saw something similar while browsing YouTube these days:

https://www.youtube.com/watch?v=YBxq7k45pBo

The code for the audio part is in the video description.

I was also thinking: how about using a square wave transform instead of the FFT? Wouldn't that be better suited to the PSG chip?

I just spent some time reading about the square wave transform.
I found a paper explaining the approximation method, but the amplitude equations are puzzling: how are they found for a given precision value?
I can't figure out how they're related to that value or to each other.

Author	Message
furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Talking PSG Posted: Fri May 01, 2015 9:38 pm

sherpa Joined: 28 Nov 2014 Posts: 365	Posted: Fri May 01, 2015 10:00 pm

ccovell Joined: 26 Dec 2004 Posts: 374 Location: Japan	Posted: Fri May 01, 2015 10:45 pm

furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Posted: Sat May 02, 2015 7:33 am

vingazole Joined: 20 Feb 2008 Posts: 118 Location: Saintes, France	Posted: Sat May 02, 2015 8:10 am

Kagesan Joined: 01 Feb 2014 Posts: 877	Posted: Sat May 02, 2015 9:28 am

haroldoop Joined: 25 Feb 2006 Posts: 874 Location: Belo Horizonte, MG, Brazil	Posted: Sat May 02, 2015 11:38 am

furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Posted: Sat May 02, 2015 2:23 pm

TmEE Joined: 31 Oct 2007 Posts: 853 Location: Estonia, Rapla city	Posted: Sat May 02, 2015 4:27 pm

ccovell Joined: 26 Dec 2004 Posts: 374 Location: Japan	Posted: Sat May 02, 2015 11:04 pm

psidum Joined: 01 Jan 2014 Posts: 331	Posted: Sun May 03, 2015 3:56 am

haroldoop Joined: 25 Feb 2006 Posts: 874 Location: Belo Horizonte, MG, Brazil	Posted: Sun May 03, 2015 12:16 pm

psidum Joined: 01 Jan 2014 Posts: 331	Posted: Sun May 03, 2015 1:02 pm

FluBBa Joined: 21 Jul 2005 Posts: 412 Location: GBG	Posted: Sun May 03, 2015 10:28 pm Last edited by FluBBa on Thu May 07, 2015 3:59 pm; edited 1 time in total

sverx Joined: 05 Sep 2013 Posts: 3828 Location: Stockholm, Sweden	Posted: Tue May 05, 2015 8:32 am

dink Joined: 09 Dec 2013 Posts: 228 Location: detroit	Posted: Thu May 07, 2015 1:53 pm

FluBBa Joined: 21 Jul 2005 Posts: 412 Location: GBG	Posted: Thu May 07, 2015 4:01 pm

furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Posted: Thu Oct 29, 2015 9:59 pm

gvx32 Joined: 17 Sep 2013 Posts: 128 Location: Gravataí, RS, Brazil	Posted: Thu Oct 29, 2015 10:21 pm

furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Posted: Thu Oct 29, 2015 11:21 pm
