Forums

Sega Master System / Mark III / Game Gear
SG-1000 / SC-3000 / SF-7000 / OMV
Home - Forums - Games - Scans - Maps - Cheats - Credits
Music - Videos - Development - Hacks - Translations - Homebrew

View topic - PDM (aka PWM) code optimisation challenge

Reply to topic
Author Message
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
PDM (aka PWM) code optimisation challenge
Post Posted: Wed Jan 13, 2016 7:39 pm
Last edited by Maxim on Fri Feb 19, 2016 8:18 pm; edited 3 times in total
Not really serious, but here goes anyway. The intro is big, but the challenge can be seen in isolation.

I've been looking at PDM audio (which you may think of as PWM). This is where samples are played by only ever outputting one of two audio levels - maximum or minimum. Early SMS games with speech tend to use this.

Normally the theory of PDM is that you emit the data so fast, the responding system is unable to actually reproduce it like that and instead it effectively low-pass filters it and you get a kind of average of the bits, so 111111... gets you a high wave, 000000... gets you a low wave, and 101010... gets you the middle. The ratio of 1s to 0s controls the speaker position.

This actually works, provided the 1s and 0s are being output really fast - on the order of MHz for real-world applications. Super Audio CDs work this way, so do some DACs. However, on the SMS we are CPU bound and we can't update the PSG volume that fast. In practice, the 1-bit audio rate for SMS games is usually no more than 30kHz, typically closer to 20kHz, and they sound pretty terrible. Furthermore, the common sample player code I've seen doesn't manage to make the time between bits being emitted constant - it depends on whether the next bit is a 0 or a 1, plus there's an extra delay between each byte.

So I was wondering if I could make a player which worked kind of the same, but did have constant cycle counts per bit played, and went as fast as it could. Here's my attempt:

MyPDM:
  ; Inputs:
  ; hl = data pointer
  ; bc = byte count
  ; Emits data to PSG as values $9f or $90, for 1 or 0 data (need not be this way round)
  ; Time between outputs must be constant
  ld e,$90 ; Use a register instead of a constant to save 3 cycles per bit
-:ld a,(hl) ; 7

.macro playbit
  rlca      ; 4  Get a bit in carry
  ld d,a    ; 4  Save data
  ld a,$f   ; 7  Initial value
  adc a,0   ; 7  Gets us $0f or $10
  or e      ; 4  Gets us $9f or $90
  out ($7f),a ; 11
  ld a,d    ; 4
.endm

.rept 7
  playbit
  ; Have 37 cycles to burn to balance the loop - except on the 8th bit
  push hl   ; 11
  pop hl    ; 10
  nop       ; 4 x4
  nop
  nop
  nop
.endr
  playbit
  inc hl    ; 6
  ; then check byte counter
  dec bc    ; 6
  ld a,b    ; 4
  or c      ; 4
  jp nz,-   ; 10

This unrolls the loop for each byte, to save some time, but also wastes some time in order to balance out the outer loop time. I think the timings are equal no matter the path, by trying to have a branchless carry check. It comes out as 78 cycles per bit, for a playback rate of 45891Hz (assuming a 3579545Hz CPU).

So, here's the challenge: can you do any better? The comments at the start should be the only constraints.

One option would be to use a sentinel value at the end of the data, and preprocess it to not contain it otherwise (by changing a bit). I'm interested in any other ideas on how to speed it up.
  View user's profile Send private message Visit poster's website
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Wed Jan 13, 2016 7:43 pm
Last edited by Maxim on Wed Jan 13, 2016 9:33 pm; edited 1 time in total
Here's a quick saving of 4 cycles, by reading and shifting in a register other than a:

MyPDM:
  ; Inputs:
  ; hl = data pointer
  ; bc = byte count
  ld e,$90 ; Constant - we actually need three constants...
-:ld d,(hl) ; 7

.macro playbit
  rl d      ; 8  Get a bit in carry
  ld a,$f   ; 7  Initial value
  adc a,0   ; 7  Gets us $0f or $10
  or e      ; 4  Gets us $9f or $90
  out ($7f),a ; 11
.endm

.rept 7
  playbit
  ; Have 37 cycles to burn to balance the loop - except on the 8th bit
  push hl   ; 11
  pop hl    ; 10
  nop       ; 4 x4
  nop
  nop
  nop
.endr
  playbit
  inc hl    ; 6
  ; then check byte counter
  dec bc    ; 6
  ld a,b    ; 4
  or c      ; 4
  jp nz,-   ; 10

71 cycles -> 50416Hz
  View user's profile Send private message Visit poster's website
  • Joined: 20 Feb 2008
  • Posts: 114
  • Location: Les Herbiers, France
Reply with quote
Post Posted: Wed Jan 13, 2016 9:10 pm
You forgot to increment hl, didn't you ?
  View user's profile Send private message
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Wed Jan 13, 2016 9:22 pm
Dammit. Edits above. I had a cycle count wrong too.
  View user's profile Send private message Visit poster's website
  • Joined: 20 Feb 2008
  • Posts: 114
  • Location: Les Herbiers, France
Reply with quote
Post Posted: Wed Jan 13, 2016 10:51 pm
It's late, therefore the following could be totally wrong ;)

; a 8*256 byte look-up table must be located in ROM at $4000 (&lut[0]=0x4000)
;
; lut definition :
;
;   for ( n = 7; n >= 0; n--)
;     for ( i = 0;  i <= 255; i++)
;       lut[ (7-n) * 256 + i ] = ( (i & (1<<n)) != 0 ? 0x9F : 0x90 );
;
; Inputs:   hl = data pointer 
;           bc = byte count


-:  ld    e,(hl)      ;  7
    ld    d,$40       ;  7
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11

.rept 7
    push  ix          ; 15     waste 40 cycles
    pop   ix          ; 14
    ld    a,0         ;  7
    rra               ;  4

    inc   d           ;  4   
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
.endr

    inc   hl          ;  6
    dec   bc          ;  6
    ld    a,b         ;  4
    or    c           ;  4
    jp    nz,-        ; 10 


62 cycles between each output ?
  View user's profile Send private message
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Wed Jan 13, 2016 11:12 pm
2KB to save 9 cycles, I like it :) Would unrolled code for all 256 values go any faster? It looks like it costs 37 cycles to get a byte of data, 11 to emit the result for a bit, leaving not much space to optimise further without breaking the equal timing restriction.
  View user's profile Send private message Visit poster's website
  • Joined: 20 Feb 2008
  • Posts: 114
  • Location: Les Herbiers, France
Reply with quote
Post Posted: Wed Jan 13, 2016 11:26 pm
We could go even faster by directly storing the $90/$9F sequence in ROM (but it would limit the data to 65536/8=8192 8-bit samples) :D
  View user's profile Send private message
  • Joined: 01 Jan 2014
  • Posts: 326
Reply with quote
Post Posted: Thu Jan 14, 2016 3:15 am
This is on my to do list.

I can see a potential solution involving unrolled code for all 256 values intertwined with jump logic (based on next byte) + byte counter. Its speed would depend on how well you could intertwine these requirements while keeping equal timing of outs, the kind of thing you can't really know until you write.The code would probably be an abomination.

Do you have a tool to produce the data. I would be interested in giving it a go.


EDIT:

Totally not what I am suposed to be doing with my time right now... Proof of concept breakdown.

Create LUT with following structure;


PDM_PlayByteLUT:
.dw PDM_PlayByte_00000000, PDM_PlayByte_00000001, PDM_PlayByte_00000010, etc


Segments of code before intertwined;

Out Code
Unrolled. Uses c, d, e. Use of d or e to reflect byte.

ld c, $7F ; port
ld d, $9F ; value 1
ld e, $90 ; value 0

out (c), d / e ; either d or e depending on byte.
out (c), d / e
out (c), d / e
out (c), d / e
out (c), d / e
out (c), d / e
out (c), d / e
out (c), d / e


Byte Counter
Uses iy. iy = count

dec iy            ; 10      iy = byte count
ld a, iyh         ; 8
cp 0            ; 7 (25)   !PADDING
               
or iyl            ; 8
jp z, PDM_End      ; 10      iy == 0 > exit
cp 0            ; 7 (25)   !PADDING



Jump Code
Uses ix, a, hl; ix = data pointer

dec hl            ; 6       !PADDING
ld l, (ix + 0)      ; 19 (25)

inc ix            ; 10
ld h, $00         ; 7         Replace $00 with half high byte of PDM_PlayByteLUT
nop               ; 4         !PADDING
nop               ; 4 (25)   !PADDING

add hl, hl         ; 11      
ld a,(hl)           ; 7         ; Indirection starts      
cp 0            ; 7   (25)   !PADDING

inc hl              ; 6         
ld h,(hl)         ; 7
ld l,a            ; 4
nop               ; 4         !PADDING
jp (hl)            ; 4 (25)   


Explanation: load data byte in to low byte of hl. Load half the high byte of PDM_PlayByteLUT in to high byte. 2x HL to get appropriate point in LUT.

Indirection on HL (equivilant of ld hl, (hl)) to get routine address from LUT. Jump to this address for next byte.


Intertwined;



PDM_PlayByte_00000000:  ; 00000000 represents unrolled

; == out #1
out (c), d / e      ; 12

dec iy            ; 10      iy = byte count
ld a, iyh         ; 8
cp 0            ; 7 (25)
               ; total (37)

; == out #2
out (c), d / e      ; 12

or iyl            ; 8
jp z, PDM_End      ; 10      iy == 0 > exit
cp 0            ; 7 (25)
               ; total (37)


; == out #3
out (c), d / e      ; 12

cp 0            ; 7
cp 0            ; 7
cp 0            ; 7
nop               ; 4 (25)
               ; total (37)

; == out #4
out (c), d / e      ; 12

cp 0            ; 7
cp 0            ; 7
cp 0            ; 7
nop               ; 4 (25)
               ; total (37)
               
               


; == out #5
out (c), d / e      ; 12

dec hl            ; 6
ld l, (ix + 0)      ; 19 (25)
               ; total (37)


; == out #6
out (c), d / e      ; 12

inc ix            ; 10
ld h, $00         ; 7
nop               ; 4
nop               ; 4 (25)
               ; total (37)


; == out #7
out (c), d / e      ; 12   
               
add hl, hl         ; 11
ld a,(hl)           ; 7               
cp 0            ; 7   (25)
               ; total (37)


; == out #8
out (c), d / e      ; 12   

inc hl              ; 6
ld h,(hl)         ; 7
ld l,a            ; 4
nop               ; 4
jp (hl)            ; 4 (25)   
               ; total (37)   


There is no way there is not an error in there. Too complex to work out without testing.

This is 37 cycle intervals. It might be possible to get lower with different byte counter + jump code.

What makes this hard is you don't necessarily want optimal byte counter + jump code, rather you are after the code that fits the cycle counting optimally.

You would need to append a dummy byte at end of data (it cuts off mid way in last byte). Abomination code indeed.
  View user's profile Send private message
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Thu Jan 14, 2016 8:33 am
I haven't gone so far as to make a PDM data generator but now I'm getting a bit curious... We have existing players (in games, plus the above might work) to test the data and then we can start testing the players.

I guess the extreme case would be to generate code from the sample, which would eliminate the byte retrieval, shifting and branching (or branch avoidance) at the cost of a huge amount of ROM space. This is similar to what was done for 8088 Domination. You'd still have to pad the timing for things like bank changes.

The other option is to write an unbalanced player and preprocess the data to account for it, choosing each bit (and weighting the error) according to the time span for which it will have an effect. This might make the data ugly though.
  View user's profile Send private message Visit poster's website
  • Joined: 20 Feb 2008
  • Posts: 114
  • Location: Les Herbiers, France
Reply with quote
Post Posted: Thu Jan 14, 2016 9:04 am
psidum wrote
Abomination code indeed.

Don't say that, it's great !

Congratulations for this z80 masterpiece :)
  View user's profile Send private message
  • Joined: 05 Sep 2013
  • Posts: 2542
Reply with quote
Post Posted: Thu Jan 14, 2016 9:15 am
Self modifying code in RAM?
You could have code to read and 'decode' next byte 'interleaved' with code that OUTs the values using OUT (c),d and OUT (c),e ( d and e being the two values we want to output). Of course we should double buffer. The hard part of course is to keep output speed constant.
  View user's profile Send private message Visit poster's website
  • Joined: 01 Jan 2014
  • Posts: 326
Reply with quote
Post Posted: Thu Jan 14, 2016 9:59 am
Last edited by psidum on Thu Jan 14, 2016 10:23 am; edited 1 time in total
I get technical terms screwed up. I have trouble explaining things like this in writing.

I originally started with idea of self modifying code to remove the need for LUT but it got lost somewhere.

It would be even better if you could modify the out statements.


out (c), b  |  ED41  | (41) 0100 0001
out (c), d  |  ED51  | (51) 0101 0001


Only one bit apart. If you could set a bit with contents of carry flag you could avoid the need for branch and do it in SMC nicely. I looked for bitwise solution but could not find one. Purpose being no need to unroll.

ZX Spectrum community might have a PDM data generator we could re purpose.
  View user's profile Send private message
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Thu Jan 14, 2016 10:10 am
The outer loop requires you to waste time in the inner unrolled loop. You can put the inc hl and dec bc into the unrolled loop and save 14 cycles per bit, with the 2KB table that gets you down to 47 cycles per bit.
  View user's profile Send private message Visit poster's website
  • Joined: 31 Oct 2007
  • Posts: 847
  • Location: Estonia, Rapla city
Reply with quote
Post Posted: Thu Jan 14, 2016 10:42 am
Is the PWM or PDM getting better output than PCM (especially when PCM uses multiple channels to get more resolution) ?
  View user's profile Send private message Visit poster's website
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Thu Jan 14, 2016 11:49 am
In theory, it offers a high dynamic range for the same bit rate as PCM, plus some noise shaping advantages. These only really appear at much higher rates though. Using PSG channels as you describe can only get so far in terms of quality.

Empirically, PDM samples are a lot louder, but that's more because the PCM samples are badly preprocessed. I guess the thing to do is to make the highest quality demo you can and see what is achievable, and at what data rate.
  View user's profile Send private message Visit poster's website
  • Joined: 31 Oct 2007
  • Posts: 847
  • Location: Estonia, Rapla city
Reply with quote
Post Posted: Thu Jan 14, 2016 1:38 pm
When I experimented with PDM in past I couldn't get satisfactory result, it was louder but also incredibly noisy and the freq range i could get was poor.
I also aimed to beat this : http://www.tmeeco.eu/BitShit/SMSPCM.SMS
(4bit log compensated and offset shifted PCM at something near 16KHz)
  View user's profile Send private message Visit poster's website
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Thu Jan 14, 2016 2:47 pm
4 bit PCM at 16kHz ought to map somewhat to 1 bit PDM at 64kHz. That's why we have the challenge to get the output up to higher frequencies. Then it's a matter of raising the rate further to get quality higher than that.

Did you experiment with sampling rates and dithering of the quantized data? I guess a PCM player could get quite high sampling rates, although at some point the ROM use becomes painful (and beyond emulator support).
  View user's profile Send private message Visit poster's website
  • Joined: 31 Oct 2007
  • Posts: 847
  • Location: Estonia, Rapla city
Reply with quote
Post Posted: Thu Jan 14, 2016 3:01 pm
Dithering made everything super noisy, I got much better "recovery" effect from noise trails on quiet sounds at higher sample rates than trying to dither low rates, the main parts sounded lot better that way due to less noise. You can also use dynamics processing on the sound to raise volumes of the quiet parts enough that they won't get lost in the quantisation noise, it won't work for everything though...
  View user's profile Send private message Visit poster's website
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Thu Jan 14, 2016 3:02 pm
So I spent the time to understand psidium's idea and it is ingenious :) Basically interleaving the data output into some code which determines what the next data is, using separate registers from each other, and then duplicating all the code once for every possible data byte (as the output is all hard coded) as the non output code figures out where to jump to next. Did you count bytes for the duplicated code? I guess it may end up a few KB, hopefully less than 16.

In the work I've been doing in extracting sample data, there's clearly a lack of preprocessing of the samples. Some aren't even normalised, let alone compressed, to make use of the limited dynamic range available. Many truncate the 8-bit data rather than round it, leading to amplification of noise (7f-80 becoming 7-8), and I haven't found anything preprocessed for log noise yet. The Lemmings 2 sample player is more advanced (and log compensating) than anything I've come across so far, but then I've hit a lot of early Sega games with the same PDM player code.
  View user's profile Send private message Visit poster's website
  • Joined: 01 Jan 2014
  • Posts: 326
Reply with quote
Post Posted: Fri Jan 15, 2016 12:32 pm
Thanks :)

Not sure about bytes used. It might be possible to reduce cycles further by avoiding ix for jump code.

At some point in the future i would like to come to grips with technical details behind PDM and give it a crack but as always it is about getting the free time.
  View user's profile Send private message
  • Joined: 31 Oct 2007
  • Posts: 847
  • Location: Estonia, Rapla city
Reply with quote
Post Posted: Fri Jan 15, 2016 1:20 pm
One more thing I remembered from my messing around with PDM is that the results were also poor because there's very little filtering on the actual sound output, and you get the fast signal change going all over the circuits and actually causing distortion due to low bandwidth handling capability of the things the sound line connects to. (I didn't try RF output as I have removed the modulator but high bit rate PDM should be able to push audio side into overmodulation and cause some level of issues with video output even lol).
I could get more positive results by sticking a heavy LPF on the sound output line, that reduced the noise a lot and made things little more smoother and less distorted sounding but also destroyed the normal sound output making it really muffled.
  View user's profile Send private message Visit poster's website
  • Joined: 20 Feb 2008
  • Posts: 114
  • Location: Les Herbiers, France
Reply with quote
Post Posted: Sat Jan 16, 2016 2:22 pm
I improved my routine (37 cycles between each output) by intertwining everything "a la psidum" ;)

; a 8*256 byte look-up table must be located in ROM at $4000 (&lut[0]=0x4000)
;
; lut definition :
;
;   for ( n = 7; n >= 0; n--)
;     for ( i = 0;  i <= 255; i++)
;       lut[ (7-n) * 256 + i ] = ( (i & (1<<n)) != 0 ? 0x9F : 0x90 );
;
; Inputs:   hl = data pointer
;           bc = byte count

PDM_Player:
    push  bc
    ld    ix,loop
    ld    e,(hl)
    xor   a           ;  =>  zf = 1

loop:
    ld    d,$40       ;  7
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    pop   bc          ; 10
    ret   nz          ;  5 ***  (always false)

    inc   d           ;  4
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    dec   bc          ;  6
    ld    a,i         ;  9 ***

    inc   d           ;  4
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    inc   d           ;  4
    ld    a,b         ;  4
    or    c           ;  4
    ld    a,(de)      ;  7 ***

    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    ret   z           ;  5 if false / 11 if true (end of data)
    inc   hl          ;  6
    nop               ;  4 ***

    inc   d           ;  4   
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    ld    a,ixl       ;  8 ***
    ld    a,(hl)      ;  7 ***
 
    inc   d           ;  4   
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    push  bc          ; 11
    nop               ;  4 ***

    inc   d           ;  4
    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    inc   d           ;  4
    ld    a,(hl)      ;  7
    ld    b,a         ;  4
    xor   a           ;  4    =>   zf = 1

    ld    a,(de)      ;  7
    out   ($7F),a     ; 11
    ld    e,b         ;  4
    jp    (ix)        ;  8


You must call PDM_Player and add an extra byte to use it, because it ends the loop with a "ret z" after the 3rd bit of the last byte of data have been processed.

Look-up table attached with this post.

  View user's profile Send private message
  • Joined: 01 Jan 2014
  • Posts: 326
Reply with quote
Post Posted: Sun Jan 17, 2016 12:47 am
All round much nicer than my solution. Structuring the data so you can iterate over the high byte is very clever.

The diversity of solutions has been really interesting.
  View user's profile Send private message
  • Joined: 26 Dec 2004
  • Posts: 374
  • Location: Japan
Reply with quote
Post Posted: Sun Jan 17, 2016 10:38 pm
Maxim wrote
Many [games] truncate the 8-bit data rather than round it, leading to amplification of noise (7f-80 becoming 7-8)...

I've found this to be quite true, but why does truncation amplify noise vs. rounding? Both still cut off lower bits and map them to a range of 16 values (in binary): 00-0F vs. 08-17 being the respective cutoff "ranges".
  View user's profile Send private message Visit poster's website
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Mon Jan 18, 2016 5:42 am
Because the noisy silence is right on a truncation boundary, its movements by 1/256 get amplified to 1/16 (or more, due to the log output). You can't remove it from 4-bit data but you could probably clean up 8-bit data to play more clearly.
  View user's profile Send private message Visit poster's website
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Fri Feb 19, 2016 8:09 pm
Last edited by Maxim on Fri Feb 19, 2016 11:18 pm; edited 1 time in total
Here's a quick and dirty PDM-playing ROM, playing the Sega voice (sourced from Mega Drive Sonic 1 and converted using my own tool). It sounds terrible in every emulator I tried, possibly due to my bad PDM data. However, a conversion of the PDM data back to a WAV sounds sort of OK (but noisy) to me. Would anyone like to try it on a real system?

I modified vingazole's player only to accept the LUT at any multiple of 256 bytes, as the $4000 location made it hard to play samples larger than 16KB. (The one I used came out as 20KB for 1.674s.) Source and data are included, the converter tool isn't.
PDM player.zip (31.8 KB)

  View user's profile Send private message Visit poster's website
  • Joined: 20 Feb 2008
  • Posts: 114
  • Location: Les Herbiers, France
Reply with quote
Post Posted: Fri Feb 19, 2016 10:29 pm
I tried it with my Everdrive : it sounds far better than what we get with Kega (for example), but it's still noisy.

I hear a rather loud hiss (high frequency sound) over the voices.
  View user's profile Send private message
  • Joined: 09 Apr 2013
  • Posts: 98
  • Location: Sydney Australia
Reply with quote
Post Posted: Sat Feb 20, 2016 2:21 am
Here's a recording of it running on my PAL SMS 2



I guess it's not the best representation as the audio has been compressed multiple times.
  View user's profile Send private message
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 12981
  • Location: London
Reply with quote
Post Posted: Sat Feb 20, 2016 7:53 am
That sounds almost exactly the same as expected from a conversion back to a wave file. I guess PDM isn't a good option on the SMS... 4-bit PCM at 15kHz will be much better quality for a lower bitrate.
  View user's profile Send private message Visit poster's website
Reply to topic



Back to the top of this page

Back to SMS Power!