  • Joined: 14 Mar 2018
  • Posts: 19
Stack Move
Posted: Wed Mar 28, 2018 10:19 am
I'm trying to work out how to make the "fast move" with a stack move work efficiently.
set SP to the source position (load SP)
pop AF
pop BC
pop DE
pop HL
ex AF,AF'   ; exx alone does not swap AF
exx
pop AF
pop BC
pop DE
pop HL
save SP
set SP to dest
push HL
push DE
push BC
push AF
exx
ex AF,AF'   ; bring back the first AF
push HL
push DE
push BC
push AF

The main issue is setting and saving SP. Each transfer to or from a memory address eats 20 clocks. Moving to and from IX/IY would be better, but it seems I can go IX/IY -> SP but not SP -> IX/IY? I could load IX with 0 and add SP to it, but that is even slower.
Am I missing a trick here?
  • Joined: 14 Aug 2000
  • Posts: 741
  • Location: Adelaide, Australia
Posted: Fri Mar 30, 2018 4:18 am
Your routine takes over 200 clock cycles to execute. Why is 20 cycles breaking the bank?

PUSH and POP are the fastest ways to move bytes between memory and registers. Not including setup time, even an unrolled LDI routine would take 256 cycles to move the same 16 bytes.
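As a rough check of these figures, the sketch below adds up the documented Z80 T-state counts for the routine outlined in the first post. The function names are hypothetical, no wait states are assumed, and the setup cost assumes all three SP transfers go through memory at 20 cycles each; an immediate `ld sp,nn` for the destination would cost 10 instead.

```python
# Documented Z80 T-state counts (no wait states assumed).
T = {
    "pop rr": 10,
    "push rr": 11,
    "exx": 4,
    "ex af,af'": 4,
    "ld sp,(nn)": 20,
    "ld (nn),sp": 20,
    "ldi": 16,
}

def stack_move_16(swap_af=True):
    """Return (core, setup) T-states for moving 16 bytes via pop/push.

    swap_af adds the two ex af,af' needed to keep both AF values,
    since exx only swaps BC, DE and HL.
    """
    core = 8 * T["pop rr"] + 8 * T["push rr"] + 2 * T["exx"]
    if swap_af:
        core += 2 * T["ex af,af'"]
    # Assumed setup: load source SP, save it again, load destination SP.
    setup = T["ld sp,(nn)"] + T["ld (nn),sp"] + T["ld sp,(nn)"]
    return core, setup

def ldi_move(n_bytes):
    """T-states for an unrolled run of LDI instructions, setup excluded."""
    return n_bytes * T["ldi"]

core, setup = stack_move_16()
print(core, setup, core + setup, ldi_move(16))  # prints: 184 60 244 256
```

Under these assumptions the stack move stays ahead of unrolled LDI for a 16-byte block, but the 60 cycles of SP handling are about a quarter of the total, which is why the setup cost matters so much to the comparison.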
  • Joined: 14 Mar 2018
  • Posts: 19
Posted: Fri Mar 30, 2018 12:38 pm
Because my Z80 is not up against a Z80, it's up against a 6502 ;) Before I found the catch, the Z80 version was a good 25% faster than the 6502 code, but adding in the extra setup clocks really waters down the speed gained. That puts me in the opposite mental state to "YAY, so much faster than the old LDI code": it went from "sweet speed boost" to "not really sure it's worth the switch-over time". So every clock counts, as the more clocks eaten, the less effective it becomes ;) (At the moment I'm working on learning the Z80 before moving on to learning the SMS. I feel that taking on two new things at once would just make my life harder; the Z80 is a very different beast to the ones I'm used to.)

Being able to go IX/IY -> SP but not the other way around just seemed an odd restriction: it's almost perfect, but nope, this opcode doesn't exist, so no nice things for you ;)
But I thought that I was just being dense and had missed something ;)

I'm thinking it might be faster to ditch the save and use some maths to write the SP address for each load into the code before the main loop starts.
so

ld a,[offset]
ld c,16
ld [firstRun+1],a
add a,c
ld [secondRun+1],a
add a,c
ld [thirdRun+1],a
...
firstRun
ld SP,$20FF
pop AF
pop BC
pop DE
pop HL
exx
.....
secondRun
ld SP,$20FF
...
Then before it runs I change the SP addresses. Since the load points will all be on the same page, I only need to update the low byte, at fixed constant offsets, which means the code might work out faster in the long run. If the Z80 is too slow, I can try setting them with the 6502 before I switch over. If only you could add to 'L'. This CPU is always 95% of the way there :D
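To illustrate the patching idea, here is a small sketch of which low bytes would end up in each `ld sp,nn` operand. The function names and the example offsets are hypothetical; the step of 16 is taken from the snippet above. It also shows the one constraint of patching only the low byte: the 8-bit add wraps silently if the offsets run past the end of the page.

```python
def patched_low_bytes(offset, step=16, runs=3):
    """Low bytes written into each `ld sp,nn` operand using 8-bit adds."""
    return [(offset + i * step) & 0xFF for i in range(runs)]

def stays_on_page(offset, step=16, runs=3):
    """True if no 8-bit add wrapped, so the fixed high byte stays valid."""
    return offset + (runs - 1) * step <= 0xFF

print(patched_low_bytes(0x40))   # three load points, 16 bytes apart
print(stays_on_page(0x40))       # all on the same page
print(stays_on_page(0xF0))       # the second add would wrap past $FF
```

If the last check fails for a given buffer, the high byte would need patching as well, which costs extra setup time.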
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Fri Mar 30, 2018 1:36 pm
To be honest, memory to memory copies aren't that useful on the SMS. Some home computers have memory-mapped video RAM and it's much more useful there.

In addition, it becomes quite tricky when you have to allow for interrupts during the stack manipulation. Is it acceptable to overrun your destination if the user presses Pause?

Self-modifying code is another thing we rarely use as usually we leave the code in ROM... only some significant performance gain would make it worthwhile.
  • Joined: 14 Mar 2018
  • Posts: 19
Posted: Fri Mar 30, 2018 3:11 pm
It would still be faster to use it to pull data into registers and then pump it down the port, right? So you would get more updated per VBlank?

Looking up the Pause button: it's on NMI?? How does one handle getting it during VBlank? When you resume, your code is going to think you're still in VBlank and try to pump data, right? I guess you make your NMI handler wait until you are in VBlank again and resume then, on the off chance. On the SMS I would probably be pulling pre-built data out of ROM, so the stack push would go "to never never land", so I would set a flag somewhere in RAM. Then I would just wait for VBlank, restore the stack position, enable interrupts and jump to the VBlank handler.
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Fri Mar 30, 2018 6:34 pm
It may be faster, yes. You need to use c to output other registers, leaving only de and hl, plus ix and iy if the prefixes are not too slow. Then you need to be able to chunk your data to the right size - although VBlank pushes tend to be the palette, sprite table and tiles which are all nice multiples of 8 anyway.

I'd be interested to see the numbers for this approach compared to outi.
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Fri Mar 30, 2018 6:57 pm
Last edited by Maxim on Wed Apr 04, 2018 1:39 pm; edited 1 time in total
Well, I had a quick go:

  ; hl = src
  ; de = dest
  ; bc = count
  ; stack version
  ; Set VRAM address
  ld c,Port_VDPAddress
  out (c),e
  out (c),d
  ld c,Port_VDPData 

  ld ($c001),sp
  ld sp,hl
 
  ; Output 32 byte chunks
  ; ccccccccbbbbbbbb
  ;    ^^^^^^^^ 
  ld a,c
  .repeat 3
  sla b
  rla
  .endr
  ld b,a
-:
.repeat 16
  pop hl      ; 1b
  out (c),l   ; 1b
  out (c),h   ; 1b
.endr

  djnz -
  ld sp,($c001)
  ret

You only need one register pair to pop into, as there's no need to keep swapping sp. This only works for multiples of 32 bytes, of course. Compared to this:

  ; outi256 version
  ; save c
  ld a, c
  ; Set VRAM address
  ld c,Port_VDPAddress
  out (c),e
  out (c),d
  ld c,Port_VDPData 
  ; Output 256 byte chunks
  inc b
  jp +
-:call outiblock
+:djnz -
  ; Output remaining bits
  ; We want to jump to outiblock + (256 - a) * 2
  neg ; 256 - a
  add a,a ; *2
  ld ixl,a
  ; handle carry - sense is inverted
  ld ixh,>outiblock
  jr nc,+
  inc ixh
+:jp (ix) ; and ret

...where outiblock is a block of 256 outis. The stats say (for my test data set):

Stack: 134270 cycles
Outi: 157247 cycles

So it does save ~15%.

Edit: this is wrong, see below.
  • Joined: 14 Mar 2018
  • Posts: 19
Posted: Sat Mar 31, 2018 4:49 am

  ; hl = src
  ; de = dest
  ; bc = count
  ; stack version
  ; Set VRAM address
  ld c,Port_VDPAddress ; won't this trash BC and hence break the count?
  out (c),e
  out (c),d
  ld c,Port_VDPData

  ld ($c001),sp
  ld sp,hl
 
  ; Output 32 byte chunks
  ; ccccccccbbbbbbbb
  ;    ^^^^^^^^
  ld a,c         ; This is the Port_VDPData ?
  .repeat 3
  sla b
  rla       
  .endr
  ld b,a
-:
.repeat 16
  pop hl      ; 1b
  out (c),l   ; 1b
  out (c),h   ; 1b
.endr

  djnz -
  ld sp,($c001)
  ret

Instead of doing the shifts with the CPU, why not get the assembler to pre-shift the value loaded into B before the call?
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Sat Mar 31, 2018 6:12 am
Good points. I had considered that the shift is easy to do at assembly time, which would also take away the c register problem. I was fitting it into my compressor benchmark, which already passed the length in bytes, and the values in question happened to make it appear to work, but maybe I'm not actually copying all the data.
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Mon Apr 02, 2018 3:58 pm
Hmm... I did it again, this time taking the count divided by 64 as otherwise it could not cover my data length ($2600 bytes) - which also gives it some advantage for inlining, but some disadvantage as it can no longer use djnz.


raw_decompress_stack:
  ; hl = src
  ; de = dest
  ; b = count/64
  ; stack version, limited lengths possible
  ; Set VRAM address
  ld c,Port_VDPAddress
  out (c),e
  out (c),d
  ld c,Port_VDPData 

  ld ($c001),sp
  ld sp,hl
 
-:
.repeat 32
  pop hl      ; 1b 10c
  out (c),l   ; 2b 12c
  out (c),h   ; 2b 12c
.endr
  dec b
  jp nz,-
  ld sp,($c001)
  ret

...and it is actually slower than the outi version, 167598 cycles for stack vs 161389 for outi, which is also interrupt safe. If you get an NMI while the stack pointer is in ROM, the game may as well reset.
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Mon Apr 02, 2018 8:51 pm
Thinking about it, runtime is totally dominated by the data output. outi is 16 cycles per byte, pop/out/out is 17.
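The per-byte figures can be checked by adding up the documented T-states for the `raw_decompress_stack` listing. The sketch below is only an estimate under stated assumptions: the function name is hypothetical, no wait states are assumed, and it assumes the benchmark counts exactly the instructions in the listing for the $2600-byte data set. Under those assumptions the total comes out at the 167598 cycles reported for the stack version.

```python
# Documented Z80 T-states for the instructions used in raw_decompress_stack.
T = {
    "ld r,n": 7,
    "out (c),r": 12,
    "ld (nn),sp": 20,
    "ld sp,hl": 6,
    "pop rr": 10,
    "dec r": 4,
    "jp cc,nn": 10,   # same cost whether or not the jump is taken
    "ld sp,(nn)": 20,
    "ret": 10,
    "outi": 16,
}

def stack_routine_cycles(n_bytes, pops_per_loop=32):
    """Total T-states for the pop/out/out routine, setup and ret included."""
    chunk = 2 * pops_per_loop                 # bytes output per loop pass
    assert n_bytes % chunk == 0, "length must be a multiple of the chunk"
    loops = n_bytes // chunk
    setup = (2 * T["ld r,n"] + 2 * T["out (c),r"]
             + T["ld (nn),sp"] + T["ld sp,hl"])
    body = pops_per_loop * (T["pop rr"] + 2 * T["out (c),r"])
    per_loop = body + T["dec r"] + T["jp cc,nn"]
    teardown = T["ld sp,(nn)"] + T["ret"]
    return setup + loops * per_loop + teardown

n = 0x2600
print((T["pop rr"] + 2 * T["out (c),r"]) / 2)   # cycles per byte, stack
print(T["outi"])                                # cycles per byte, outi
print(stack_routine_cycles(n))                  # prints: 167598
print(n * T["outi"])                            # raw outi cost, no overhead
```

The last line is only the raw data-output cost for outi; the call, loop and setup overhead of the outi routine is not modelled here.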
  • Joined: 14 Mar 2018
  • Posts: 19
Posted: Tue Apr 03, 2018 8:04 am
Yeah, you really need the pop/push combo, which gets you to 10.5 clocks per byte.

At 16 bytes that is 168 clocks vs 256 for outi. It's just a question of whether the extra setup eats more than 88 clocks... But I guess you are either on a machine that needs a port or one that needs a stack push; nothing really does both. Well, actually the C128 does do both, kind of: 40 column is memory mapped, so stack push; 80 column is via a port, so outi. But it's a unique snowflake in this regard, I would think. It seems LDI/LDD is just as fast/slow as outi, at 16 clocks per byte.

so this gives an interesting table
CPU   Speed    Method    Normalised time
Z80   4.2MHz   Stack     2.5 clocks
Z80   3.58MHz  Stack     3 clocks
6502  2MHz     ZP/PHA    3 clocks
Z80   4.2MHz   Outi/ldi  3.8 clocks
6502  2MHz     LDA/STA   4 clocks
Z80   3.58MHz  Outi/ldi  4.5 clocks
Z80   2MHz     Stack     5.25 clocks
6502  1MHz     ZP/PHA    6 clocks
6502  1MHz     LDA/STA   8 clocks
Z80   2MHz     Outi/ldi  8 clocks

This ignores setup times and other shuffling needed and is just the raw "actually moves stuff" clock rates...
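The table appears to be cycles per byte divided by the clock speed in MHz, in other words microseconds per byte. The sketch below reproduces it under that reading. The 6502 figures are an assumption about what the method names stand for: `LDA zp` plus `PHA` at 3 cycles each, and `LDA abs` plus `STA abs` at 4 cycles each. The names `CYCLES_PER_BYTE` and `normalised` are hypothetical.

```python
# Cycles needed to move one byte with each method (setup ignored).
CYCLES_PER_BYTE = {
    "Stack": (10 + 11) / 2,   # Z80: pop rr + push rr move two bytes
    "Outi/ldi": 16,           # Z80: one outi or ldi per byte
    "ZP/PHA": 3 + 3,          # 6502: LDA zp + PHA (assumed meaning)
    "LDA/STA": 4 + 4,         # 6502: LDA abs + STA abs (assumed meaning)
}

def normalised(method, mhz):
    """Cycles per byte scaled to a 1 MHz clock (microseconds per byte)."""
    return CYCLES_PER_BYTE[method] / mhz

rows = [
    ("Z80", 4.2, "Stack"), ("Z80", 3.58, "Stack"), ("6502", 2, "ZP/PHA"),
    ("Z80", 4.2, "Outi/ldi"), ("6502", 2, "LDA/STA"), ("Z80", 3.58, "Outi/ldi"),
    ("Z80", 2, "Stack"), ("6502", 1, "ZP/PHA"), ("6502", 1, "LDA/STA"),
    ("Z80", 2, "Outi/ldi"),
]
for cpu, mhz, method in sorted(rows, key=lambda r: normalised(r[2], r[1])):
    print(f"{cpu:5}{mhz:<6}{method:10}{normalised(method, mhz):.2f}")
```

Sorting by the computed value gives the same ordering as the table, with the 3.58MHz stack entry landing just under the 2MHz 6502 ZP/PHA entry before rounding.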
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Tue Apr 03, 2018 9:15 am
As a rule of thumb, the Z80 is clocked 4x higher than 6502 but gets 1/4 as much work done per cycle. Direct comparisons need to then settle on specific hardware (e.g. SMS vs NES) and then the other features start to dominate (CHR ROM vs VRAM). Still, memory mapped VRAM - or even DMA - would have made a huge difference. Working with the VDP can feel like accessing a library via the letter box at times.
  • Joined: 14 Oct 2008
  • Posts: 510
Posted: Tue Apr 03, 2018 7:13 pm
The Game Boy is the only tile-based console I know of that maps VRAM into the CPU's memory (maybe its successors do as well). All the others I know of use register-based access, either memory-mapped or I/O-mapped.
The GB has its own drawback, though: I don't think mid-frame VRAM access is permitted at all, unlike the GG/SMS, where slowly writing to VRAM during active display seems to be permitted.
  • Joined: 05 Sep 2013
  • Posts: 3794
  • Location: Stockholm, Sweden
Posted: Wed Apr 04, 2018 9:06 am
I confirm, GBA (and DS) are "tile-based consoles" with VRAM mapped to CPU memory too. :)
  • Joined: 14 Mar 2018
  • Posts: 19
Posted: Wed Apr 04, 2018 10:31 am
Try the VDC sometime.. you will love the VDP after it.
You see, the main system runs at a clock derived from either 8.18 MHz or 7.88 MHz, for NTSC or PAL; that is divided by 4 to get the bus clock and then by 2 to get the CPU clock, basically. However, the VDC runs at 16 MHz in its own little clock domain... so to talk to the VDC you do this:

Hello operator Can I speak to register 21 please
...
Hi yes, please hold
now no
now no
now no
Hello this 21, be quick...
pump
pump
pump
pump
BEEEEEEEEPPPPP
Hello operator Can I speak to register 21 please
...
Hi yes, please hold...

The GB lets you access VRAM during HBlank as far as I know. The GB has the benefit of a 4.19 MHz clock, or about 8.4 MHz on the GBC :) but alas it's not a Z80; it's more of an 8080 with a tiny bit more... So no IX/IY and no second set of registers, no H'L' or EXX... but you still get the Z80 "bit" opcode extensions.
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Wed Apr 04, 2018 1:26 pm
Memory mapped video RAM (or even shared RAM for screen and general use) seems quite common in home computers, although it's not really tile based in the sense we're used to, often you can redefine the font.
  • Joined: 05 Sep 2013
  • Posts: 3794
  • Location: Stockholm, Sweden
Posted: Wed Apr 04, 2018 1:37 pm
Maxim wrote
The stats say (for my test data set):
Stack: 134270 cycles
Outi: 157247 cycles
So it does save ~15%.


Two OUTIs should be 32 cycles, and one POP followed by two OUT (C) should be 34 cycles. How can using the stack be faster? Am I calculating something wrong?
  • Site Admin
  • Joined: 19 Oct 1999
  • Posts: 14726
  • Location: London
Posted: Wed Apr 04, 2018 1:39 pm
No, I had a bug in my code - see above.
  • Joined: 05 Sep 2013
  • Posts: 3794
  • Location: Stockholm, Sweden
Posted: Wed Apr 04, 2018 1:43 pm
oh - OK, didn't see that.
  • Joined: 14 Mar 2018
  • Posts: 19
Posted: Wed Apr 04, 2018 4:17 pm
Yeah, on a computer using one set of RAM is the way to go, as it has to handle both games (where you would happily throw 8/16K at just graphics) and a word processor, where you want only 2~3K spent on a text screen so you can hold as much text as possible.

Most still have the screen and data in a fixed location, though.
What makes the C64 so powerful is that you can move the 16K VIC-II bank to any 16K-aligned bank in the 64K ($0000, $4000, $8000, $C000); you can even move it per line if you want. So on a C64 I can have the top half of the screen use one 16K bank and the lower half use another.
The 128 lets me put the 16K VIC-IIe bank into any bank in the 128K.
The banks have to be 16K aligned, though.
The 128 then also has the VDC, which is port based and has its own 16/64K of VRAM, allowing the 128 to display a 40-column screen and an 80-column screen simultaneously.
The VIC-II(e) has char mode (you call them tiles), bitmap mode, hires and multicolour modes, plus hires and/or multicolour sprites.
The VDC has text mode and bitmap modes.

I think the MSX range has the same (MSX2?) or similar chips to the SMS for its graphics system.
The Spectrum, I think, only has bitmap mode.
The BBC Micro has text and bitmap.
The Apple II has text and bitmap modes.