|
ForumsSega Master System / Mark III / Game GearSG-1000 / SC-3000 / SF-7000 / OMV |
Home - Forums - Games - Scans - Maps - Cheats - Credits Music - Videos - Development - Hacks - Translations - Homebrew |
Author | Message |
---|---|
|
PDM (aka PWM) code optimisation challenge
Posted: Wed Jan 13, 2016 7:39 pm Last edited by Maxim on Fri Feb 19, 2016 8:18 pm; edited 3 times in total |
Not really serious, but here goes anyway. The intro is big, but the challenge can be seen in isolation.
I've been looking at PDM audio (which you may think of as PWM). This is where samples are played by only ever outputting one of two audio levels - maximum or minimum. Early SMS games with speech tend to use this. Normally the theory of PDM is that you emit the data so fast, the responding system is unable to actually reproduce it like that and instead it effectively low-pass filters it and you get a kind of average of the bits, so 111111... gets you a high wave, 000000... gets you a low wave, and 101010... gets you the middle. The ratio of 1s to 0s controls the speaker position. This actually works, provided the 1s and 0s are being output really fast - on the order of MHz for real-world applications. Super Audio CDs work this way, so do some DACs. However, on the SMS we are CPU bound and we can't update the PSG volume that fast. In practice, the 1-bit audio rate for SMS games is usually no more than 30kHz, typically closer to 20kHz, and they sound pretty terrible. Furthermore, the common sample player code I've seen doesn't manage to make the time between bits being emitted constant - it depends on whether the next bit is a 0 or a 1, plus there's an extra delay between each byte. So I was wondering if I could make a player which worked kind of the same, but did have constant cycle counts per bit played, and went as fast as it could. Here's my attempt: MyPDM:
; Inputs: ; hl = data pointer ; bc = byte count ; Emits data to PSG as values $9f or $90, for 1 or 0 data (need not be this way round) ; Time between outputs must be constant ld e,$90 ; Use a register instead of a constant to save 3 cycles per bit -:ld a,(hl) ; 7 .macro playbit rlca ; 4 Get a bit in carry ld d,a ; 4 Save data ld a,$f ; 7 Initial value adc a,0 ; 7 Gets us $0f or $10 or e ; 4 Gets us $9f or $90 out ($7f),a ; 11 ld a,d ; 4 .endm .rept 7 playbit ; Have 37 cycles to burn to balance the loop - except on the 8th bit push hl ; 11 pop hl ; 10 nop ; 4 x4 nop nop nop .endr playbit inc hl ; 6 ; then check byte counter dec bc ; 6 ld a,b ; 4 or c ; 4 jp nz,- ; 10 This unrolls the loop for each byte, to save some time, but also wastes some time in order to balance out the outer loop time. I think the timings are equal no matter the path, by trying to have a branchless carry check. It comes out as 78 cycles per bit, for a playback rate of 45891Hz (assuming a 3579545Hz CPU). So, here's the challenge: can you do any better? The comments at the start should be the only constraints. One option would be to use a sentinel value at the end of the data, and preprocess it to not contain it otherwise (by changing a bit). I'm interested in any other ideas on how to speed it up. |
|
|
Posted: Wed Jan 13, 2016 7:43 pm Last edited by Maxim on Wed Jan 13, 2016 9:33 pm; edited 1 time in total |
Here's a quick saving of 4 cycles, by reading and shifting in a register other than a:
MyPDM:
; Inputs: ; hl = data pointer ; bc = byte count ld e,$90 ; Constant - we actually need three constants... -:ld d,(hl) ; 7 .macro playbit rl d ; 8 Get a bit in carry ld a,$f ; 7 Initial value adc a,0 ; 7 Gets us $0f or $10 or e ; 4 Gets us $9f or $90 out ($7f),a ; 11 .endm .rept 7 playbit ; Have 37 cycles to burn to balance the loop - except on the 8th bit push hl ; 11 pop hl ; 10 nop ; 4 x4 nop nop nop .endr playbit inc hl ; 6 ; then check byte counter dec bc ; 6 ld a,b ; 4 or c ; 4 jp nz,- ; 10 71 cycles -> 50416Hz |
|
|
Posted: Wed Jan 13, 2016 9:10 pm |
You forgot to increment hl, didn't you ? | |
|
Posted: Wed Jan 13, 2016 9:22 pm |
Dammit. Edits above. I had a cycle count wrong too. | |
|
Posted: Wed Jan 13, 2016 10:51 pm |
It's late, therefore the following could be totally wrong ;)
; a 8*256 byte look-up table must be located in ROM at $4000 (&lut[0]=0x4000)
; ; lut definition : ; ; for ( n = 7; n >= 0; n--) ; for ( i = 0; i <= 255; i++) ; lut[ (7-n) * 256 + i ] = ( (i & (1<<n)) != 0 ? 0x9F : 0x90 ); ; ; Inputs: hl = data pointer ; bc = byte count -: ld e,(hl) ; 7 ld d,$40 ; 7 ld a,(de) ; 7 out ($7F),a ; 11 .rept 7 push ix ; 15 waste 40 cycles pop ix ; 14 ld a,0 ; 7 rra ; 4 inc d ; 4 ld a,(de) ; 7 out ($7F),a ; 11 .endr inc hl ; 6 dec bc ; 6 ld a,b ; 4 or c ; 4 jp nz,- ; 10 62 cycles between each output ? |
|
|
Posted: Wed Jan 13, 2016 11:12 pm |
2KB to save 9 cycles, I like it :) Would unrolled code for all 256 values go any faster? It looks like it costs 37 cycles to get a byte of data, 11 to emit the result for a bit, leaving not much space to optimise further without breaking the equal timing restriction. | |
|
Posted: Wed Jan 13, 2016 11:26 pm |
We could go even faster by directly storing the $90/$9F sequence in ROM (but it would limit the data to 65536/8=8192 8-bit samples) :D | |
|
Posted: Thu Jan 14, 2016 3:15 am |
This is on my to do list.
I can see a potential solution involving unrolled code for all 256 values intertwined with jump logic (based on next byte) + byte counter. Its speed would depend on how well you could intertwine these requirements while keeping equal timing of outs, the kind of thing you can't really know until you write.The code would probably be an abomination. Do you have a tool to produce the data. I would be interested in giving it a go. EDIT: Totally not what I am suposed to be doing with my time right now... Proof of concept breakdown. Create LUT with following structure; PDM_PlayByteLUT: .dw PDM_PlayByte_00000000, PDM_PlayByte_00000001, PDM_PlayByte_00000010, etc Segments of code before intertwined; Out Code Unrolled. Uses c, d, e. Use of d or e to reflect byte. ld c, $7F ; port
ld d, $9F ; value 1 ld e, $90 ; value 0 out (c), d / e ; either d or e depending on byte. out (c), d / e out (c), d / e out (c), d / e out (c), d / e out (c), d / e out (c), d / e out (c), d / e Byte Counter Uses iy. iy = count dec iy ; 10 iy = byte count
ld a, iyh ; 8 cp 0 ; 7 (25) !PADDING or iyl ; 8 jp z, PDM_End ; 10 iy == 0 > exit cp 0 ; 7 (25) !PADDING Jump Code Uses ix, a, hl; ix = data pointer dec hl ; 6 !PADDING ld l, (ix + 0) ; 19 (25) inc ix ; 10 ld h, $00 ; 7 Replace $00 with half high byte of PDM_PlayByteLUT nop ; 4 !PADDING nop ; 4 (25) !PADDING add hl, hl ; 11 ld a,(hl) ; 7 ; Indirection starts cp 0 ; 7 (25) !PADDING inc hl ; 6 ld h,(hl) ; 7 ld l,a ; 4 nop ; 4 !PADDING jp (hl) ; 4 (25) Explanation: load data byte in to low byte of hl. Load half the high byte of PDM_PlayByteLUT in to high byte. 2x HL to get appropriate point in LUT. Indirection on HL (equivilant of ld hl, (hl)) to get routine address from LUT. Jump to this address for next byte. Intertwined; PDM_PlayByte_00000000: ; 00000000 represents unrolled ; == out #1 out (c), d / e ; 12 dec iy ; 10 iy = byte count ld a, iyh ; 8 cp 0 ; 7 (25) ; total (37) ; == out #2 out (c), d / e ; 12 or iyl ; 8 jp z, PDM_End ; 10 iy == 0 > exit cp 0 ; 7 (25) ; total (37) ; == out #3 out (c), d / e ; 12 cp 0 ; 7 cp 0 ; 7 cp 0 ; 7 nop ; 4 (25) ; total (37) ; == out #4 out (c), d / e ; 12 cp 0 ; 7 cp 0 ; 7 cp 0 ; 7 nop ; 4 (25) ; total (37) ; == out #5 out (c), d / e ; 12 dec hl ; 6 ld l, (ix + 0) ; 19 (25) ; total (37) ; == out #6 out (c), d / e ; 12 inc ix ; 10 ld h, $00 ; 7 nop ; 4 nop ; 4 (25) ; total (37) ; == out #7 out (c), d / e ; 12 add hl, hl ; 11 ld a,(hl) ; 7 cp 0 ; 7 (25) ; total (37) ; == out #8 out (c), d / e ; 12 inc hl ; 6 ld h,(hl) ; 7 ld l,a ; 4 nop ; 4 jp (hl) ; 4 (25) ; total (37) There is no way there is not an error in there. Too complex to work out without testing. This is 37 cycle intervals. It might be possible to get lower with different byte counter + jump code. What makes this hard is you don't necessarily want optimal byte counter + jump code, rather you are after the code that fits the cycle counting optimally. You would need to append a dummy byte at end of data (it cuts off mid way in last byte). Abomination code indeed. |
|
|
Posted: Thu Jan 14, 2016 8:33 am |
I haven't gone so far as to make a PDM data generator but now I'm getting a bit curious... We have existing players (in games, plus the above might work) to test the data and then we can start testing the players.
I guess the extreme case would be to generate code from the sample, which would eliminate the byte retrieval, shifting and branching (or branch avoidance) at the cost of a huge amount of ROM space. This is similar to what was done for 8088 Domination. You'd still have to pad the timing for things like bank changes. The other option is to write an unbalanced player and preprocess the data to account for it, choosing each bit (and weighting the error) according to the time span for which it will have an effect. This might make the data ugly though. |
|
|
Posted: Thu Jan 14, 2016 9:04 am |
Don't say that, it's great ! Congratulations for this z80 masterpiece :) |
|
|
Posted: Thu Jan 14, 2016 9:15 am |
Self modifying code in RAM?
You could have code to read and 'decode' next byte 'interleaved' with code that OUTs the values using OUT (c),d and OUT (c),e ( d and e being the two values we want to output). Of course we should double buffer. The hard part of course is to keep output speed constant. |
|
|
Posted: Thu Jan 14, 2016 9:59 am Last edited by psidum on Thu Jan 14, 2016 10:23 am; edited 1 time in total |
I get technical terms screwed up. I have trouble explaining things like this in writing.
I originally started with idea of self modifying code to remove the need for LUT but it got lost somewhere. It would be even better if you could modify the out statements. out (c), b | ED41 | (41) 0100 0001 out (c), d | ED51 | (51) 0101 0001 Only one bit apart. If you could set a bit with contents of carry flag you could avoid the need for branch and do it in SMC nicely. I looked for bitwise solution but could not find one. Purpose being no need to unroll. ZX Spectrum community might have a PDM data generator we could re purpose. |
|
|
Posted: Thu Jan 14, 2016 10:10 am |
The outer loop requires you to waste time in the inner unrolled loop. You can put the inc hl and dec bc into the unrolled loop and save 14 cycles per bit, with the 2KB table that gets you down to 47 cycles per bit. | |
|
Posted: Thu Jan 14, 2016 10:42 am |
Is the PWM or PDM getting better output than PCM (especially when PCM uses multiple channels to get more resolution) ? | |
|
Posted: Thu Jan 14, 2016 11:49 am |
In theory, it offers a high dynamic range for the same bit rate as PCM, plus some noise shaping advantages. These only really appear at much higher rates though. Using PSG channels as you describe can only get so far in terms of quality.
Empirically, PDM samples are a lot louder, but that's more because the PCM samples are badly preprocessed. I guess the thing to do is to make the highest quality demo you can and see what is achievable, and at what data rate. |
|
|
Posted: Thu Jan 14, 2016 1:38 pm |
When I experimented with PDM in past I couldn't get satisfactory result, it was louder but also incredibly noisy and the freq range i could get was poor.
I also aimed to beat this : http://www.tmeeco.eu/BitShit/SMSPCM.SMS (4bit log compensated and offset shifted PCM at something near 16KHz) |
|
|
Posted: Thu Jan 14, 2016 2:47 pm |
4 bit PCM at 16kHz ought to map somewhat to 1 bit PDM at 64kHz. That's why we have the challenge to get the output up to higher frequencies. Then it's a matter of raising the rate further to get quality higher than that.
Did you experiment with sampling rates and dithering of the quantized data? I guess a PCM player could get quite high sampling rates, although at some point the ROM use becomes painful (and beyond emulator support). |
|
|
Posted: Thu Jan 14, 2016 3:01 pm |
Dithering made everything super noisy, I got much better "recovery" effect from noise trails on quiet sounds at higher sample rates than trying to dither low rates, the main parts sounded lot better that way due to less noise. You can also use dynamics processing on the sound to raise volumes of the quiet parts enough that they won't get lost in the quantisation noise, it won't work for everything though... | |
|
Posted: Thu Jan 14, 2016 3:02 pm |
So I spent the time to understand psidium's idea and it is ingenious :) Basically interleaving the data output into some code which determines what the next data is, using separate registers from each other, and then duplicating all the code once for every possible data byte (as the output is all hard coded) as the non output code figures out where to jump to next. Did you count bytes for the duplicated code? I guess it may end up a few KB, hopefully less than 16.
In the work I've been doing in extracting sample data, there's clearly a lack of preprocessing of the samples. Some aren't even normalised, let alone compressed, to make use of the limited dynamic range available. Many truncate the 8-bit data rather than round it, leading to amplification of noise (7f-80 becoming 7-8), and I haven't found anything preprocessed for log noise yet. The Lemmings 2 sample player is more advanced (and log compensating) than anything I've come across so far, but then I've hit a lot of early Sega games with the same PDM player code. |
|
|
Posted: Fri Jan 15, 2016 12:32 pm |
Thanks :)
Not sure about bytes used. It might be possible to reduce cycles further by avoiding ix for jump code. At some point in the future i would like to come to grips with technical details behind PDM and give it a crack but as always it is about getting the free time. |
|
|
Posted: Fri Jan 15, 2016 1:20 pm |
One more thing I remembered from my messing around with PDM is that the results were also poor because there's very little filtering on the actual sound output, and you get the fast signal change going all over the circuits and actually causing distortion due to low bandwidth handling capability of the things the sound line connects to. (I didn't try RF output as I have removed the modulator but high bit rate PDM should be able to push audio side into overmodulation and cause some level of issues with video output even lol).
I could get more positive results by sticking a heavy LPF on the sound output line, that reduced the noise a lot and made things little more smoother and less distorted sounding but also destroyed the normal sound output making it really muffled. |
|
|
Posted: Sat Jan 16, 2016 2:22 pm |
I improved my routine (37 cycles between each output) by intertwining everything "a la psidum" ;)
; a 8*256 byte look-up table must be located in ROM at $4000 (&lut[0]=0x4000)
; ; lut definition : ; ; for ( n = 7; n >= 0; n--) ; for ( i = 0; i <= 255; i++) ; lut[ (7-n) * 256 + i ] = ( (i & (1<<n)) != 0 ? 0x9F : 0x90 ); ; ; Inputs: hl = data pointer ; bc = byte count PDM_Player: push bc ld ix,loop ld e,(hl) xor a ; => zf = 1 loop: ld d,$40 ; 7 ld a,(de) ; 7 out ($7F),a ; 11 pop bc ; 10 ret nz ; 5 *** (always false) inc d ; 4 ld a,(de) ; 7 out ($7F),a ; 11 dec bc ; 6 ld a,i ; 9 *** inc d ; 4 ld a,(de) ; 7 out ($7F),a ; 11 inc d ; 4 ld a,b ; 4 or c ; 4 ld a,(de) ; 7 *** ld a,(de) ; 7 out ($7F),a ; 11 ret z ; 5 if false / 11 if true (end of data) inc hl ; 6 nop ; 4 *** inc d ; 4 ld a,(de) ; 7 out ($7F),a ; 11 ld a,ixl ; 8 *** ld a,(hl) ; 7 *** inc d ; 4 ld a,(de) ; 7 out ($7F),a ; 11 push bc ; 11 nop ; 4 *** inc d ; 4 ld a,(de) ; 7 out ($7F),a ; 11 inc d ; 4 ld a,(hl) ; 7 ld b,a ; 4 xor a ; 4 => zf = 1 ld a,(de) ; 7 out ($7F),a ; 11 ld e,b ; 4 jp (ix) ; 8 You must call PDM_Player and add an extra byte to use it, because it ends the loop with a "ret z" after the 3rd bit of the last byte of data have been processed. Look-up table attached with this post. |
|
|
Posted: Sun Jan 17, 2016 12:47 am |
All round much nicer than my solution. Structuring the data so you can iterate over the high byte is very clever.
The diversity of solutions has been really interesting. |
|
|
Posted: Sun Jan 17, 2016 10:38 pm |
I've found this to be quite true, but why does truncation amplify noise vs. rounding? Both still cut off lower bits and map them to a range of 16 values (in binary): 00-0F vs. 08-17 being the respective cutoff "ranges". |
|
|
Posted: Mon Jan 18, 2016 5:42 am |
Because the noisy silence is right on a truncation boundary, its movements by 1/256 get amplified to 1/16 (or more, due to the log output). You can't remove it from 4-bit data but you could probably clean up 8-bit data to play more clearly. | |
|
Posted: Fri Feb 19, 2016 8:09 pm Last edited by Maxim on Fri Feb 19, 2016 11:18 pm; edited 1 time in total |
Here's a quick and dirty PDM-playing ROM, playing the Sega voice (sourced from Mega Drive Sonic 1 and converted using my own tool). It sounds terrible in every emulator I tried, possibly due to my bad PDM data. However, a conversion of the PDM data back to a WAV sounds sort of OK (but noisy) to me. Would anyone like to try it on a real system?
I modified vingazole's player only to accept the LUT at any multiple of 256 bytes, as the $4000 location made it hard to play samples larger than 16KB. (The one I used came out as 20KB for 1.674s.) Source and data are included, the converter tool isn't. |
|
|
Posted: Fri Feb 19, 2016 10:29 pm |
I tried it with my Everdrive : it sounds far better than what we get with Kega (for example), but it's still noisy.
I hear a rather loud hiss (high frequency sound) over the voices. |
|
|
Posted: Sat Feb 20, 2016 2:21 am |
Here's a recording of it running on my PAL SMS 2
I guess it's not the best representation as the audio has been compressed multiple times. |
|
|
Posted: Sat Feb 20, 2016 7:53 am |
That sounds almost exactly the same as expected from a conversion back to a wave file. I guess PDM isn't a good option on the SMS... 4-bit PCM at 15kHz will be much better quality for a lower bitrate. | |