snompiler - almost sample accurate SN76489 VGM compiler/player

Inspired by a discussion on the discord with Trirosmos, Maxim, sverx and lidnariq, I've given writing an almost sample-accurate VGM player a go:

https://github.com/joffb/snompiler

Made a little example video of it playing a daft little Snoozetracker loop I made, a bit of a normal Sonic VGM (at 14 seconds) and a good Snoozetracker tune by its creator (at 40 seconds):

It works on hardware, though the Snoozetracker feature of playing samples on the noise channel doesn't actually seem to work in real life or in Emulicious (this isn't demonstrated in the YouTube video; for a tune which tries to do this, look at examples/snooz_underly.sms).

How does it work then?

The SMS runs at about 3.58MHz (NTSC) or 3.55MHz (PAL), and VGM is played back at a sample rate of 44.1kHz. If you divide 3579545 by 44100 you get about 81.2, so that means there are roughly 81 CPU cycles (T-states) per sample (about 80.4 on PAL).
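As a quick check of that arithmetic, here is a small Python sketch of the per-sample cycle budget. The clock values are the commonly quoted NTSC and PAL SMS CPU clocks; the function name is just for illustration.

```python
VGM_RATE = 44_100  # VGM sample rate in Hz

# Commonly quoted SMS Z80 clocks in Hz.
CLOCKS = {
    "NTSC": 3_579_545,
    "PAL": 3_546_893,
}

def cycles_per_sample(clock_hz: int, rate: int = VGM_RATE) -> float:
    """T-states available for each VGM sample."""
    return clock_hz / rate

for name, clock in CLOCKS.items():
    budget = cycles_per_sample(clock)
    print(f"{name}: {budget:.2f} T-states per sample (whole: {int(budget)})")
```

On NTSC this gives about 81.17, which is why a loop tuned to exactly 81 T-states stays very close to 44.1kHz.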

You can't really interpret multiple VGM commands at this sort of rate, so snompiler "compiles" the VGM's SN chip writes and sample wait commands into Z80 code and data. The snompiled VGM code uses 100% of the CPU time: interrupts are disabled, and it's either writing to the SN chip or waiting around for the next sample.

To keep things as small as possible, the snompiled code is mostly rst calls, which jump away to routines that write to the SN chip or delay for a number of samples. The executed code is followed by all the data to be written to the chip, stored as one big blob. When the code reaches the end of a bank, it switches to the next bank and carries on playing!
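A rough sketch of that code-then-data layout, assuming 16KB banks (the Sega mapper's bank size) and a hypothetical `RESERVED` byte count kept free for whatever bank-switch stub the player needs. This is not snompiler's actual packer, only an illustration of keeping each command's code and its data in the same bank.

```python
BANK_SIZE = 16 * 1024  # Sega mapper banks are 16KB
RESERVED = 4           # hypothetical: bytes kept free for a bank-switch stub

def pack_banks(events, bank_size=BANK_SIZE, reserved=RESERVED):
    """Pack (code, data) pairs into banks laid out as [all code][all data].

    A pair is never split across banks, so every rst in a bank finds its
    operands in that same bank's data blob.
    """
    banks = []
    code, data = bytearray(), bytearray()
    for ev_code, ev_data in events:
        used = len(code) + len(data)
        if used and used + len(ev_code) + len(ev_data) + reserved > bank_size:
            banks.append((bytes(code), bytes(data)))  # close the full bank
            code, data = bytearray(), bytearray()
        code += ev_code
        data += ev_data
    if code or data:
        banks.append((bytes(code), bytes(data)))
    return banks

# 10000 events of one rst 0x20 opcode (0xE7) plus one SN byte each.
banks = pack_banks([(b"\xe7", b"\x9f")] * 10_000)
print(len(banks), [len(c) for c, _ in banks])  # 2 [8190, 1810]
```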

The file size ends up being around the same size as an uncompressed VGM, so quite chunky - but that's to be expected really.

Why's it "almost" sample accurate?

Currently, writing 4 SN values in one sample takes 85 cycles, so it's a bit slower than it should be, while writing 3 SN values takes 80 cycles, so it's slightly faster than it should be.

Writing more than 4 SN values will generally take more than 81 cycles. If the VGM file tries to write, say, 6 SN values in one sample, then the generated code will:

* Write 4 SN values, using 1 sample's worth of time
* Write 2 SN values and wait for the rest of another sample
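The splitting rule above can be sketched like this, assuming (as the post says) that at most 4 writes fit in one sample's worth of time. The function name is made up for illustration.

```python
MAX_WRITES_PER_SAMPLE = 4  # 4 writes already take ~85 cycles against an ~81-cycle budget

def split_writes(writes):
    """Split one sample's SN writes into chunks that each take one sample to play."""
    return [writes[i:i + MAX_WRITES_PER_SAMPLE]
            for i in range(0, len(writes), MAX_WRITES_PER_SAMPLE)]

chunks = split_writes([0x9F, 0xBF, 0xDF, 0xFF, 0x80, 0x10])
print([len(c) for c in chunks])  # [4, 2]
# Every chunk after the first lands one more sample late.
print(len(chunks) - 1, "sample(s) late")  # 1 sample(s) late
```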

This doesn't matter for 50/60Hz VGMs, where there are roughly 882 or 735 samples between each set of writes, so you'll never hear a difference. However, VGMs like Snoozetracker ones might update every sample, and if they're writing a lot of values per sample it might cause some "jitter" or a slight pitch difference. Luckily, from the Snoozetracker files I've tried, the effects of this are minimal.
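To put numbers on that, a small illustrative sketch comparing a 4-cycle overrun (85 instead of 81 cycles) against the gap between write bursts, once for a 60Hz VGM and once for a tune that writes every sample. It assumes the NTSC clock.

```python
VGM_RATE = 44_100
CYCLES_PER_SAMPLE = 3_579_545 / VGM_RATE  # NTSC, ~81.17 T-states

def samples_per_frame(frame_hz):
    """VGM samples between updates for a tune that writes once per video frame."""
    return VGM_RATE / frame_hz

def overrun_fraction(extra_cycles, samples_between_writes):
    """How large an overrun is relative to the time between write bursts."""
    return extra_cycles / (samples_between_writes * CYCLES_PER_SAMPLE)

print(samples_per_frame(60), samples_per_frame(50))  # 735.0 882.0
print(f"60Hz VGM: {overrun_fraction(4, samples_per_frame(60)):.4%}")
print(f"every-sample writes: {overrun_fraction(4, 1):.1%}")
```

Against a 60Hz frame the overrun is a tiny fraction of a percent, while on every-sample writes it is roughly 5% of a sample period, which is where the possible jitter comes from.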

Hey @joe, that's cool - I'd been thinking about trying something along the lines of compiling PSG playback at some point, but it had been no more than an idea for an experiment. I'll definitely have a more in-depth look at what you've done!

(also I keep meaning to try Banjo - that looks very interesting too)

It's quite impressive, even if the idea behind it is simple.

This is basically data converted to code... but I wonder if instead you could just convert VGM to data - PSGlib style - and have a very simple player that just pushes data to the PSG port and waits the correct amount of cycles in between.

The data could be simply stored in tokens like
- number of bytes to send to the PSG chip in this audio frame (1 to 11)
- actual data to send to the PSG chip in this audio frame
- delay in frames (two bytes)

in the converter, the delay could be adjusted to take into account cases where the previous writes are too many...
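A minimal sketch of a converter emitting that token format, assuming a little-endian 16-bit delay (the Z80's native byte order) and a hypothetical compensation rule in which every group of 4 writes past the first uses up one extra audio frame, borrowing the 4-writes figure from the snompiler post. All names here are illustrative.

```python
import struct

MAX_WRITES = 11       # per the suggestion: 1 to 11 bytes per audio frame
WRITES_PER_FRAME = 4  # hypothetical: writes that fit in one frame's worth of cycles

def compensated_delay(delay, n_writes):
    """Shorten the delay by the extra frames that a long burst of writes uses up."""
    extra = (n_writes - 1) // WRITES_PER_FRAME
    return max(0, delay - extra)

def encode_token(writes, delay):
    """Token: count byte, the PSG bytes, then a little-endian 16-bit delay."""
    if not 1 <= len(writes) <= MAX_WRITES:
        raise ValueError("need 1 to 11 PSG bytes per token")
    delay = compensated_delay(delay, len(writes))
    if delay > 0xFFFF:
        raise ValueError("delay does not fit in two bytes")
    return bytes([len(writes), *writes]) + struct.pack("<H", delay)

print(encode_token([0x9F], 735).hex())      # 019fdf02
print(encode_token([0x9F] * 6, 1).hex())    # 069f9f9f9f9f9f0000
```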

joe wrote

The SMS runs at about 3.58MHz (NTSC) or 3.55MHz (PAL), and VGM is played back at a sample rate of 44.1kHz. If you divide 3579545 by 44100 you get about 81.2, so that means there are roughly 81 CPU cycles (T-states) per sample (about 80.4 on PAL).

maybe I'm dumb, but why not go with 22kHz? I am pretty sure the difference will be inaudible on the real device.

pcmenc does a similar job of spamming writes to the PSG, but it is data driven, and this makes it hard to meet the 81 cycle deadline for more than one write per sample. This is related to unpacking nibbles, cycling through channels and also checking for the end of the bank. With less efficient data (a byte per write instead of a nibble), less code is needed but more space is used.
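To illustrate the space side of that trade-off (this is not pcmenc's actual data format), here is a sketch of packing 4-bit values two per byte versus storing one byte per write. The nibble version halves the data, but a Z80 player typically pays for it with extra shift and mask work inside the per-sample budget.

```python
def pack_nibbles(values):
    """Pack 4-bit values two per byte, high nibble first; odd counts get a 0 pad."""
    if any(not 0 <= v <= 0xF for v in values):
        raise ValueError("values must fit in 4 bits")
    padded = list(values) + [0] * (len(values) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))

def unpack_nibbles(data, count):
    """Reverse of pack_nibbles: the per-byte shift/mask work a player must do."""
    out = []
    for b in data:
        out.extend((b >> 4, b & 0x0F))
    return out[:count]

volumes = [15, 14, 12, 9, 5, 2, 0]
packed = pack_nibbles(volumes)
print(len(volumes), "bytes as one byte per write,", len(packed), "bytes packed")
assert unpack_nibbles(packed, len(volumes)) == volumes
```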

@willbritton: thanks! It would be interesting to see what you think.

sverx wrote

The data could be simply stored in tokens like...

That definitely seems possible but I feel like it would be tricky handling a variable number of SN writes and then compensating for a variable number of cycles afterwards!

toxa wrote

maybe I'm dumb, but why not go with 22kHz? I am pretty sure the difference will be inaudible on the real device.

This was mostly a proof of concept for seeing whether Snoozetracker tracks could be played back on actual hardware at 44.1kHz! You're probably right that if it was running at 22kHz you'd have smaller files and looser timing requirements with little detectable loss in quality.

It's probably possible to "downsample" the VGM to 22kHz and do stuff like seeing if a channel has two consecutive volume writes and discarding the first one to save space.
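A rough sketch of that idea, assuming the standard SN76489 byte format: a latch byte has bit 7 set, bits 6-5 select the channel and bit 4 selects volume, while a byte with bit 7 clear is a data byte for the last latched register. It merges pairs of 44.1kHz samples and keeps only the last volume write per channel within each merged sample. The function names are hypothetical, and a real converter would also have to re-time the waits.

```python
def is_latch(b):
    return (b & 0x80) != 0

def is_volume_latch(b):
    return is_latch(b) and (b & 0x10) != 0

def channel(b):
    return (b >> 5) & 0x03

def group_writes(writes):
    """Group each latch byte with any data bytes that follow it."""
    groups = []
    for b in writes:
        if is_latch(b) or not groups:
            groups.append([b])
        else:
            groups[-1].append(b)
    return groups

def drop_overwritten_volumes(writes):
    """Keep only the last volume write for each channel; everything else is kept."""
    kept, seen = [], set()
    for group in reversed(group_writes(writes)):
        head = group[0]
        if is_volume_latch(head):
            if channel(head) in seen:
                continue  # a later volume write to this channel wins
            seen.add(channel(head))
        kept.append(group)
    return [b for group in reversed(kept) for b in group]

def downsample_by_2(samples):
    """samples: per-sample lists of SN writes at 44.1kHz -> lists at 22.05kHz."""
    return [drop_overwritten_volumes([b for s in samples[i:i + 2] for b in s])
            for i in range(0, len(samples), 2)]

print(downsample_by_2([[0x9F], [0x95, 0xB3], [0x8A, 0x12]]))
```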

Maxim wrote

pcmenc does a similar job of spamming writes to the PSG but it is data driven, and this makes it hard to meet the 81 cycle deadline for more than one write per sample.

Yeah, this is really one of those classic trade-offs of space vs CPU time!

I've rewritten some of the rst calls so they get the sample wait counts from the data blob where all the SN values are.

I've added another rst which is used when there are fewer than 256 samples to wait, so only one byte is used to store the sample wait. This has zero effect on normal VGMs, as the wait times are all > 700 samples, but it saves a lot of bytes in Snoozetracker-type ones where the sample waits are small!
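A back-of-the-envelope sketch of that saving, counting one opcode byte per rst plus its wait operand (2 bytes for rst 0x08, 1 byte for rst 0x18). The 3-sample tracker stream is a made-up example, not taken from a real file.

```python
WORD_WAIT_BYTES = 1 + 2  # rst 0x08 + 16-bit sample count
BYTE_WAIT_BYTES = 1 + 1  # rst 0x18 + 8-bit sample count

def wait_bytes(wait, allow_byte_waits=True):
    """Bytes needed to encode one wait of `wait` samples (wait <= 0xFFFF)."""
    if allow_byte_waits and wait < 256:
        return BYTE_WAIT_BYTES
    return WORD_WAIT_BYTES

def total_wait_bytes(waits, allow_byte_waits=True):
    return sum(wait_bytes(w, allow_byte_waits) for w in waits)

frame_waits = [735] * 60      # one second of a 60Hz VGM
tracker_waits = [3] * 14_700  # hypothetical: writes every 3 samples for one second

for name, waits in (("60Hz VGM", frame_waits), ("tracker", tracker_waits)):
    print(name, total_wait_bytes(waits, False), "->", total_wait_bytes(waits))
```

The 60Hz case is unchanged, while the tracker-style stream shrinks by a third, matching the point that only small waits benefit.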

I noticed in the examples that it was common to have runs where they'd write one SN value and then wait for < 256 samples, so I've also added that as an rst call using fall-through.

rst 0x08 - get a word from the data and wait for that many samples
rst 0x10 - write one SN value and then fall through to ->
rst 0x18 - get a byte from the data and wait for that many samples
rst 0x20 - write one SN value and wait for the rest of the sample
rst 0x28 - write two SN values and wait for the rest of the sample
rst 0x30 - write three SN values and wait for the rest of the sample
rst 0x38 - write four SN values (and the sample is over by the time it's done!)
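Here's a hedged sketch of how a compiler might map per-sample write bursts onto that rst table: one opcode byte per call (RST p encodes as 0xC7 | p), with operands appended to a separate data blob in the order the routines would read them. Several details are assumptions not stated in the post: each write rst uses up one sample, rst 0x10 behaves like rst 0x20 followed by rst 0x18 (saving one opcode byte), waits count the samples left after the writes, and word waits are little-endian. `compile_events` is a made-up name.

```python
RST = {p: 0xC7 | p for p in (0x08, 0x10, 0x18, 0x20, 0x28, 0x30, 0x38)}
WRITE_RST = {1: RST[0x20], 2: RST[0x28], 3: RST[0x30], 4: RST[0x38]}

def emit_wait(samples, code, data):
    """Append byte waits (rst 0x18) or word waits (rst 0x08) covering `samples`."""
    while samples > 0:
        if samples < 256:
            code.append(RST[0x18])
            data.append(samples)
            return
        chunk = min(samples, 0xFFFF)
        code.append(RST[0x08])
        data += chunk.to_bytes(2, "little")
        samples -= chunk

def compile_events(events):
    """events: (writes, gap) pairs, gap = samples until the next burst starts."""
    code, data = bytearray(), bytearray()
    for writes, gap in events:
        chunks = [writes[i:i + 4] for i in range(0, len(writes), 4)]
        # Each write rst takes one sample; with too many writes, timing slips.
        remaining = max(0, gap - len(chunks))
        last = chunks.pop() if chunks else []
        for chunk in chunks:
            code.append(WRITE_RST[len(chunk)])
            data += bytes(chunk)
        if len(last) == 1 and 0 < remaining < 256:
            code.append(RST[0x10])  # write one value, fall through to byte wait
            data += bytes([last[0], remaining])
            continue
        if last:
            code.append(WRITE_RST[len(last)])
            data += bytes(last)
        emit_wait(remaining, code, data)
    return bytes(code), bytes(data)

code, data = compile_events([([0x9F], 735), ([0x9F], 4), ([1, 2, 3, 4, 5, 6], 2)])
print(code.hex(), data.hex())  # e7cfd7ffef 9fde029f03010203040506
```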

sverx wrote

The data could be simply stored in tokens like...

I was pondering this while not being able to sleep and I managed to rework things to be closer to that - though it ends up using more cycles in certain situations (writing 4 SN values with no following wait takes 89 cycles rather than 85 for the rst version)

The code in the player for all the "commands" is between 0x1000 and 0x10FF, so only the low byte of a jump address needs to be changed. HL points at the commands and data, and DE holds the jump address. The player loads the command's low address byte into E, then exchanges DE and HL and jumps to the command:

; hl: points at data
; de: points at 0x1000, low byte will be replaced
; 21 cycles
player:
    ld e, (hl)    ; cycles: 7
    inc hl        ; cycles: 6
    ex de, hl     ; cycles: 4
    jp (hl)       ; cycles: 4
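The dispatch trick can be modelled in a few lines of Python to show the data flow: D stays at 0x10, so the low byte read from the stream picks a handler somewhere in 0x1000-0x10FF. The handler addresses 0x1000 and 0x1010 below are hypothetical, and the model ignores timing and bank switching.

```python
PAGE = 0x1000  # handlers live in 0x1000-0x10FF; only the low byte changes

def write_n(n):
    """Hypothetical handler: swap back and OUTI n bytes from the stream."""
    def handler(stream, hl, out):
        out.extend(stream[hl:hl + n])
        return hl + n
    return handler

HANDLERS = {0x1000: write_n(1), 0x1010: write_n(2)}

def run(stream, handlers=HANDLERS):
    """Model of the player loop; `hl` is an index into the command/data stream."""
    hl, out = 0, []
    while hl < len(stream):
        e = stream[hl]          # ld e, (hl)
        hl += 1                 # inc hl
        target = PAGE | e       # ex de, hl: HL now holds 0x10 << 8 | E
        hl = handlers[target](stream, hl, out)  # jp (hl); handler returns new HL
    return out

print([hex(b) for b in run(bytes([0x00, 0x9F, 0x10, 0xBF, 0xDF]))])
```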

Then the command swaps DE and HL back, does its OUTIs, wastes however much extra time needs wasting, and is followed by either no wait or a wait of a byte/word number of samples:

; 81 cycles w/ player
write_1_wait_byte:

    ex de, hl     ; cycles: 4
    outi          ; cycles: 16 (hl) -> port c, hl++, b--

    dec ix        ; cycles: 10 (padding)
    dec ix        ; cycles: 10 (padding)
    dec ix        ; cycles: 10 (padding)

    jp wait_byte  ; cycles: 10
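Summing the standard Z80 T-states is an easy way to check those budgets, and the same arithmetic shows why four writes overshoot: dispatch (21) plus the swap back (4) plus four OUTIs (64) is already 89 T-states, which lines up with the 89-cycle figure mentioned earlier in the post. The helper names are made up for illustration.

```python
# Standard Z80 T-state counts for the instructions in the two listings.
T_STATES = {
    "ld e, (hl)": 7,
    "inc hl": 6,
    "ex de, hl": 4,
    "jp (hl)": 4,
    "outi": 16,
    "dec ix": 10,
    "jp nn": 10,
}

PLAYER = ["ld e, (hl)", "inc hl", "ex de, hl", "jp (hl)"]
WRITE_1_WAIT_BYTE = ["ex de, hl", "outi", "dec ix", "dec ix", "dec ix", "jp nn"]

def cost(listing):
    return sum(T_STATES[op] for op in listing)

def min_write_cost(n_writes):
    """Dispatch + swap back + n OUTIs, before any padding or jump."""
    return cost(PLAYER) + T_STATES["ex de, hl"] + n_writes * T_STATES["outi"]

def padding_needed(n_writes, budget=81):
    """T-states left to waste (negative = over budget) before the final jp."""
    return budget - min_write_cost(n_writes) - T_STATES["jp nn"]

print(cost(PLAYER), cost(PLAYER) + cost(WRITE_1_WAIT_BYTE))  # 21 81
print([min_write_cost(n) for n in range(1, 5)])              # [41, 57, 73, 89]
print(padding_needed(1))                                     # 30 (three dec ix)
```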

Then when that's done it jumps back to the player.
At this point however I don't think it counts as being compiled - maybe to bytecode? haha

joe wrote

sverx wrote

The data could be simply stored in tokens like...

I was pondering this while not being able to sleep

Sorry mate! :|

joe wrote

At this point however I don't think it counts as being compiled

Yes, the idea was exactly that. It's pure data, and the player is code - which also means you can create a library... ;)

Posts in this thread:
joe Joined: 25 Mar 2023 Posts: 9	snompiler - almost sample accurate SN76489 VGM compiler/player Posted: Fri Apr 05, 2024 1:17 am

willbritton Joined: 06 Mar 2022 Posts: 689 Location: London, UK	Posted: Fri Apr 05, 2024 8:53 am

sverx Joined: 05 Sep 2013 Posts: 3865 Location: Stockholm, Sweden	Posted: Fri Apr 05, 2024 9:57 am

toxa Joined: 09 Aug 2021 Posts: 142	Posted: Fri Apr 05, 2024 10:10 am

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14763 Location: London	Posted: Fri Apr 05, 2024 11:40 am

joe Joined: 25 Mar 2023 Posts: 9	Posted: Fri Apr 05, 2024 2:07 pm

joe Joined: 25 Mar 2023 Posts: 9	Posted: Sun Apr 07, 2024 12:40 am

sverx Joined: 05 Sep 2013 Posts: 3865 Location: Stockholm, Sweden	Posted: Sun Apr 07, 2024 12:27 pm
