Forums

I'm trying to work out how the "fast move" using the stack works efficiently.

set SP to end of start area/load SP
pull AF
pull BC
pull DE
pull HL
exx
ex af,af' ; exx swaps BC/DE/HL only, so AF needs its own swap
pull AF
pull BC
pull DE
pull HL
save SP
set SP to dest
push HL
push DE
push BC
push AF
exx
ex af,af'
push HL
push DE
push BC
push AF

The main issue is setting and saving the SP. It eats 20 clocks going to an address. Moving to/from IX/IY would be better, but it seems I can go IX/IY -> SP but not SP -> IX/IY? I could load IX with 0 and add SP to it, but that is even slower..
Am I missing a trick here?
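
For reference, the options mentioned above can be compared by T-state cost. This is only a Python sketch that adds up the documented Z80 instruction timings; the dictionary names are made up for illustration, and it assumes no wait states.

```python
# T-states for the Z80 instructions involved (documented timings, no wait states).
T = {
    "ld (nn),sp": 20,
    "ld sp,(nn)": 20,
    "ld sp,nn": 10,
    "ld sp,hl": 6,
    "ld sp,ix": 10,
    "ld hl,nn": 10,
    "ld ix,nn": 14,
    "add hl,sp": 11,
    "add ix,sp": 15,
}

# Ways to get the current SP out into memory or a register pair.
save_sp = {
    "ld (nn),sp": T["ld (nn),sp"],
    "ld hl,0 / add hl,sp": T["ld hl,nn"] + T["add hl,sp"],
    "ld ix,0 / add ix,sp": T["ld ix,nn"] + T["add ix,sp"],
}

for name, cost in sorted(save_sp.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {cost} T-states")
```

With these numbers the direct store is the cheapest of the three, and the IX route is the slowest, which matches the observation in the post. The cheap direction is the restore: `ld sp,hl` is 6 T-states and an immediate `ld sp,nn` is 10.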

Your routine takes over 200 clock cycles to execute. Why is 20 cycles breaking the bank?

PUSH and POP are the fastest ways to move bytes between memory and registers. Not including setup time, even an unrolled LDI routine would take 256 cycles to move the same 16 bytes.
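
The totals quoted here can be reproduced with a quick tally. This Python sketch assumes both SP loads come from memory variables (`ld sp,(nn)`) and that SP is saved with `ld (nn),sp`; the routine as posted does not say which forms are used, so treat the SP-handling figure as one plausible reading.

```python
POP, PUSH, EXX, LDI = 10, 11, 4, 16          # documented Z80 T-states
LD_SP_FROM_MEM, LD_MEM_FROM_SP = 20, 20      # ld sp,(nn) and ld (nn),sp

# 8 pops, 8 pushes, 2 register-bank swaps.
# Each ex af,af' used alongside exx would add 4 more T-states.
moves = 8 * POP + 8 * PUSH + 2 * EXX

# Assumed SP handling: load source SP, save it, load destination SP.
sp_handling = LD_SP_FROM_MEM + LD_MEM_FROM_SP + LD_SP_FROM_MEM

stack_total = moves + sp_handling
ldi_total = 16 * LDI                         # 16 unrolled LDIs

print("stack copy:", stack_total, "T-states")
print("unrolled LDI:", ldi_total, "T-states")
```

Under those assumptions the stack routine comes out a little over 200 T-states for 16 bytes, with roughly a quarter of that spent on SP handling.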

Because my Z80 is not up against a Z80, it's up against a 6502 ;) Before I found the catch, the Z80 version was a good 25% faster than the 6502 code, but adding in the extra setup clocks really waters down the speed gained. Which puts me in the opposite mental state of "YAY so much faster than the old LDI code". It went from "sweet speed boost" to "not really sure it's worth the switch-over time". So every clock counts, as the more clocks eaten the less effective it becomes ;) (At the moment I'm working on learning the Z80 before moving on to learning the SMS. I feel that taking on 2 new things would just make my life harder; the Z80 is a very different beast to the ones I'm used to.)

Being able to go IX/IY -> SP but not the other way around just seemed an odd restriction, as in "it's almost perfect, but nope, this opcode doesn't exist, so no nice things for you" ;)
But I thought that I was just being dense and had missed something ;)

I'm thinking it might be faster to ditch the save and use some maths to write the SP address for the load into the code before the main loop starts.
so

ld a,[offset]
ld c,16
ld [firstRun+1],a
add a,c
ld [secondRun+1],a
add a,c
ld [thirdRun+1],a
...
firstRun
ld SP,$20FF
pop AF
pop BC
pop DE
pop HL
exx
.....
secondRun
ld SP,$20FF
...

Then before it runs I change the SP address. Since the load point will be on the same page, I only need to update the lo byte, at fixed constant offsets, which means the code might work out faster in the long run. If the Z80 is too slow, I can try setting them with the 6502 before I switch over. If only you could add to 'l'. With this CPU it's always 95% there :D

To be honest, memory to memory copies aren't that useful on the SMS. Some home computers have memory-mapped video RAM and it's much more useful there.

In addition, it becomes quite tricky when you have to allow for interrupts during the stack manipulation. Is it acceptable to overrun your destination if the user presses Pause?
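
To make the overrun concrete, here is a small Python model of what an interrupt does while SP is parked on the destination buffer. It is only a sketch: the addresses, the 0xEE marker and the return address 0x1234 are arbitrary illustration values, and `push16` is a hypothetical helper that mimics how the Z80 stores a pushed word.

```python
def push16(mem, sp, value):
    """Model a Z80 PUSH: high byte to SP-1, low byte to SP-2, SP drops by 2."""
    mem[sp - 1] = (value >> 8) & 0xFF
    mem[sp - 2] = value & 0xFF
    return sp - 2

DEST_START, DEST_END = 0x10, 0x20      # 16-byte destination buffer
mem = bytearray([0xEE] * 0x40)         # 0xEE marks bytes the copy does not own

sp = DEST_END
for word in range(8):                  # the copy itself: 8 pushes fill the buffer
    sp = push16(mem, sp, 0x0101 * word)

# An NMI arriving now pushes its return address just below the buffer.
sp_after_nmi = push16(mem, sp, 0x1234)
clobbered = [addr for addr in range(DEST_START) if mem[addr] != 0xEE]
print([hex(addr) for addr in clobbered])
```

The two bytes immediately below the destination get overwritten by the return address, which is the overrun the question is asking about.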

Self-modifying code is another thing we rarely use as usually we leave the code in ROM... only some significant performance gain would make it worthwhile.

It would still be faster to use it to pull data into regs and then pump it down the port, right? So you would get more updated per VBlank?

Looking up the Pause button, it's on NMI?? How does one handle getting it during VBlank, as when you resume, your code is going to think you're in VBlank and try and pump data, right.. I guess you make your NMI handler wait until you are in VBlank again and resume on the off chance.... On the SMS I would probably be pulling pre-built data out of ROM, so the stack push would be "to never never land", so I would set a flag somewhere in RAM. Then I would just wait for VBlank, restore the stack position, enable interrupts, and jump to the VBlank handler..

It may be faster, yes. You need to use c to output other registers, leaving only de and hl, plus ix and iy if the prefixes are not too slow. Then you need to be able to chunk your data to the right size - although VBlank pushes tend to be the palette, sprite table and tiles which are all nice multiples of 8 anyway.

I'd be interested to see the numbers for this approach compared to outi.

Well, I had a quick go:

; hl = src
; de = dest
; bc = count
; stack version
; Set VRAM address
ld c,Port_VDPAddress
out (c),e
out (c),d
ld c,Port_VDPData

ld ($c001),sp
ld sp,hl

; Output 32 byte chunks
; ccccccccbbbbbbbb
; ^^^^^^^^
ld a,c
.repeat 3
sla b
rla
.endr
ld b,a
-:
.repeat 16
pop hl ; 1b
out (c),l ; 1b
out (c),h ; 1b
.endr

djnz -
ld sp,($c001)
ret

You only need one register pair to pop into, as there's no need to keep swapping sp. This only works for multiples of 64 bytes, of course. Compared to this:

; outi256 version
; save c
ld a, c
; Set VRAM address
ld c,Port_VDPAddress
out (c),e
out (c),d
ld c,Port_VDPData
; Output 256 byte chunks
inc b
jp +
-:call outiblock
+:djnz -
; Output remaining bits
; We want to jump to outiblock + (256 - a) * 2
neg ; 256 - a
add a,a ; *2
ld ixl,a
; handle carry - sense is inverted
ld ixh,>outiblock
jr nc,+
inc ixh
+:jp (ix) ; and ret

...where outiblock is a block of 256 outis. The stats say (for my test data set):

Stack: 134270 cycles
Outi: 157247 cycles

So it does save ~15%.

Edit: this is wrong, see below.

; hl = src
; de = dest
; bc = count
; stack version
; Set VRAM address
ld c,Port_VDPAddress ; won't this trash BC and hence break the count?
out (c),e
out (c),d
ld c,Port_VDPData

ld ($c001),sp
ld sp,hl

; Output 32 byte chunks
; ccccccccbbbbbbbb
; ^^^^^^^^
ld a,c ; This is the Port_VDPData ?
.repeat 3
sla b
rla
.endr
ld b,a
-:
.repeat 16
pop hl ; 1b
out (c),l ; 1b
out (c),h ; 1b
.endr

djnz -
ld sp,($c001)
ret

Why not, instead of doing the shifts with the CPU, get the assembler to pre-shift the value loaded into B before the call?

Good points. I had considered that the shift is easy to do at assembly time, which would also take away the c register problem. I was fitting it into my compressor benchmark, which already passed the length in bytes, and with the values in question it seemed to work, but maybe I'm not actually copying all the data.

Hmm... I did it again, this time taking the count divided by 64 as otherwise it could not cover my data length ($2600 bytes) - which also gives it some advantage for inlining, but some disadvantage as it can no longer use djnz.
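
Both constraints mentioned here can be checked numerically. The Python sketch below assumes the reason djnz drops out is its signed 8-bit displacement (the post does not say so explicitly), and uses the instruction sizes given in the listing comments: 1 byte for `pop hl`, 2 bytes for each `out (c),r`. The function name `loop_plan` is made up for illustration.

```python
DATA_LEN = 0x2600                 # bytes to transfer, from the post
BYTES_PER_POP = 2

def loop_plan(pops_per_pass):
    """Loop count and code size for an unrolled pop/out/out loop."""
    bytes_per_pass = pops_per_pass * BYTES_PER_POP
    passes = DATA_LEN // bytes_per_pass
    body_bytes = pops_per_pass * (1 + 2 + 2)    # pop hl + out (c),l + out (c),h
    return {
        "passes": passes,
        "fits_in_b": passes <= 256,             # B = 0 loops 256 times
        "djnz_reaches": body_bytes + 2 <= 128,  # displacement is a signed byte
    }

print(loop_plan(16))   # 32 bytes per pass
print(loop_plan(32))   # 64 bytes per pass
```

With 16 pops per pass the loop is short enough for djnz but needs more passes than B can count; with 32 pops per pass the count fits in B but the body is too long for a relative jump.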

raw_decompress_stack:
; hl = src
; de = dest
; b = count/64
; stack version, limited lengths possible
; Set VRAM address
ld c,Port_VDPAddress
out (c),e
out (c),d
ld c,Port_VDPData

ld ($c001),sp
ld sp,hl

-:
.repeat 32
pop hl ; 1b 10c
out (c),l ; 2b 12c
out (c),h ; 2b 12c
.endr
dec b
jp nz,-
ld sp,($c001)
ret

...and it is actually slower than the outi version, 167598 cycles for stack vs 161389 for outi, which is also interrupt safe. If you get an NMI while the stack pointer is in ROM, the game may as well reset.

Thinking about it, runtime is totally dominated by the data output. outi is 16 cycles per byte, pop/out/out is 17.
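
The per-byte figures can be set against the measured totals with a few lines of Python. This is a sketch of the arithmetic only: the measured numbers and the $2600 data length are taken from the posts above, and everything that is not data output is lumped into "overhead".

```python
DATA_LEN = 0x2600                        # bytes in the test data set
OUTI = 16                                # T-states, one byte per outi
POP_HL, OUT_C_R = 10, 12                 # T-states

stack_per_pair = POP_HL + 2 * OUT_C_R    # 34 T-states per 2 bytes, i.e. 17 per byte

outi_floor = DATA_LEN * OUTI
stack_floor = (DATA_LEN // 2) * stack_per_pair

measured = {"stack": 167598, "outi": 161389}
overhead = {
    "stack": measured["stack"] - stack_floor,
    "outi": measured["outi"] - outi_floor,
}
print("data output only:", outi_floor, "vs", stack_floor)
print("everything else:", overhead)
```

On these numbers the data output alone accounts for well over 95% of both measurements, so loop and setup overhead cannot make up the one cycle per byte the stack version loses.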

Yeah, you really need the pop/push combo, which gets you to 10.5 clocks per byte.

At 16 bytes that's 168 clocks vs 256 for outi. It's just a case of whether the extra setup eats more than 88 clocks... But I guess you are either on a machine that needs a port or one that needs stack push; nothing really does both... Well, actually the 128 does do both, kinda: 40 col is memory mapped, so stack push, and 80 col is via a port, so outi... but it's a unique snowflake in this regard, I would think. Seems LDI/LDD is just as fast/slow as outi at 16 clocks per byte.
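
The break-even point can be written as a small check. This Python sketch uses the per-byte costs quoted in the post; `stack_copy_wins` is a hypothetical helper name, and the setup figures passed to it are just sample values.

```python
POP, PUSH = 10, 11        # T-states per 2 bytes moved via the stack
PER_BYTE_BLOCK_OP = 16    # outi and ldi both cost 16 T-states per byte

def stack_copy_wins(block_bytes, setup_cycles):
    """True if pop/push plus its setup beats outi/ldi for one block."""
    stack = block_bytes * (POP + PUSH) / 2 + setup_cycles
    return stack < block_bytes * PER_BYTE_BLOCK_OP

# Setup budget for a 16-byte block before the advantage is gone.
budget_16 = 16 * PER_BYTE_BLOCK_OP - 16 * (POP + PUSH) / 2
print(budget_16)
print(stack_copy_wins(16, 60), stack_copy_wins(16, 100))
```

The budget scales with block size (5.5 T-states saved per byte), so larger blocks tolerate proportionally more setup.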

So this gives an interesting table (normalised time = clocks per byte divided by the clock speed in MHz):

CPU    Speed      Method     Normalised time
Z80    4.2 MHz    Stack      2.5 clocks
Z80    3.58 MHz   Stack      3 clocks
6502   2 MHz      ZP/PHA     3 clocks
Z80    4.2 MHz    Outi/ldi   3.8 clocks
6502   2 MHz      LDA/STA    4 clocks
Z80    3.58 MHz   Outi/ldi   4.5 clocks
Z80    2 MHz      Stack      5.25 clocks
6502   1 MHz      ZP/PHA     6 clocks
6502   1 MHz      LDA/STA    8 clocks
Z80    2 MHz      Outi/ldi   8 clocks

This ignores setup times and other shuffling needed and is just the raw "actually moves stuff" clock rates...
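
The table can be regenerated from per-byte cycle counts divided by the clock speed. In this Python sketch the Z80 figures come from the thread; the 6502 figures (6 and 8 cycles per byte) are inferred from the 1 MHz rows of the table, and the instruction breakdown in the comments is an assumption.

```python
CYCLES_PER_BYTE = {
    ("Z80", "Stack"): 10.5,     # pop (10) + push (11) per 2 bytes
    ("Z80", "Outi/ldi"): 16,
    ("6502", "ZP/PHA"): 6,      # assumed: LDA zp (3) + PHA (3)
    ("6502", "LDA/STA"): 8,     # assumed: LDA abs (4) + STA abs (4)
}
CLOCKS_MHZ = {"Z80": [4.2, 3.58, 2.0], "6502": [2.0, 1.0]}

# Normalised time = cycles per byte / clock in MHz, sorted fastest first.
rows = sorted(
    (cycles / mhz, cpu, mhz, method)
    for (cpu, method), cycles in CYCLES_PER_BYTE.items()
    for mhz in CLOCKS_MHZ[cpu]
)

for t, cpu, mhz, method in rows:
    print(f"{cpu:5s} {mhz:5.2f} MHz  {method:9s} {t:5.2f}")
```

Sorting by the computed value gives the same ordering as the table, with the posted figures rounded to one or two digits.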

As a rule of thumb, the Z80 is clocked 4x higher than 6502 but gets 1/4 as much work done per cycle. Direct comparisons need to then settle on specific hardware (e.g. SMS vs NES) and then the other features start to dominate (CHR ROM vs VRAM). Still, memory mapped VRAM - or even DMA - would have made a huge difference. Working with the VDP can feel like accessing a library via the letter box at times.

The Game Boy is the only tile-based console I know of that does map VRAM into the CPU memory (maybe its successors as well). All others I know of use either memory-mapped or I/O-mapped register-based access.
Though GB has its con in that I don't think mid-frame VRAM access is permitted at all, not even like GG/SMS where it seems slowly writing to VRAM during active display is permitted.

I confirm, GBA (and DS) are "tile-based consoles" with VRAM mapped to CPU memory too. :)

Try the VDC sometime.. you will love the VDP after it.
You see, the main system runs at a clock derived from either 8.18 MHz or 7.88 MHz (NTSC or PAL), and then it's /4 to get the bus clock and then /2 to get the CPU clock, basically. However the VDC runs at 16 MHz in its own little clock domain... so to talk to the VDC you do this

Hello operator Can I speak to register 21 please
...
Hi yes, please hold
now no
now no
now no
Hello this 21, be quick...
pump
pump
pump
pump
BEEEEEEEEPPPPP
Hello operator Can I speak to register 21 please
...
Hi yes, please hold...

The GB lets you access VRAM during HBlank as far as I know. The GB has the benefit of 4.19 MHz, or the GBC at 8 MHz :) but alas it's not a Z80, it's more of an 8080 with a little tiny bit more... So no IX/IY and no 2nd set of registers, no H'L' or EXX... but you still get the Z80 "bit" opcode extensions..

Memory mapped video RAM (or even shared RAM for screen and general use) seems quite common in home computers, although it's not really tile based in the sense we're used to, often you can redefine the font.

Maxim wrote

The stats say (for my test data set):
Stack: 134270 cycles
Outi: 157247 cycles
So it does save ~15%.

two OUTIs should be 32 cycles, one POP followed by two OUT (C) should be 34 cycles - how can using the stack be faster? Am I calculating something wrong?

No, I had a bug in my code - see above.

oh - OK, didn't see that.

Yeah, on a computer using 1 set of RAM is the way to go, as it has to handle doing games (where you would happily throw 8/16K at just graphics) and a word processor where you want only 2~3K wasted on a text screen so you can hold as much text as possible.

Most still have the screen and data in a fixed location though.
What makes the C64 so powerful is that you can move the 16K VIC-II bank to any 16K-aligned bank in the 64K ($0000, $4000, $8000, $C000). You can even move it per line if you want. So on a C64 I can have the top half of the screen use one 16K bank and the lower half use another 16K.
The 128 lets me put the 16K VIC-IIe bank into any bank in the 128K.
The banks have to be 16K aligned though.
The 128 then also has the VDC, which is port based and has its own 16/64K of VRAM, allowing the 128 to display a 40 column screen and an 80 column screen simultaneously.
The VIC-II(e) has char mode (you call them tiles), bitmap mode, hires and multicolour modes + hires and/or multicolour sprites.
The VDC has text mode and bitmap modes.

I think the MSX range has the same (MSX2?) or similar chips to the SMS for its graphics system.
Spectrum I think only has bitmap mode.
BBC Micro has text and bitmap.
Apple II has text and bitmap modes.

Author	Message
oziphantom Joined: 14 Mar 2018 Posts: 19	Stack Move Posted: Wed Mar 28, 2018 10:19 am

asynchronous Joined: 14 Aug 2000 Posts: 741 Location: Adelaide, Australia	Stack Move Posted: Fri Mar 30, 2018 4:18 am

oziphantom Joined: 14 Mar 2018 Posts: 19	Posted: Fri Mar 30, 2018 12:38 pm

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Fri Mar 30, 2018 1:36 pm

oziphantom Joined: 14 Mar 2018 Posts: 19	Posted: Fri Mar 30, 2018 3:11 pm

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Fri Mar 30, 2018 6:34 pm

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Fri Mar 30, 2018 6:57 pm Last edited by Maxim on Wed Apr 04, 2018 1:39 pm; edited 1 time in total

oziphantom Joined: 14 Mar 2018 Posts: 19	Posted: Sat Mar 31, 2018 4:49 am

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Sat Mar 31, 2018 6:12 am

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Mon Apr 02, 2018 3:58 pm

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Mon Apr 02, 2018 8:51 pm

oziphantom Joined: 14 Mar 2018 Posts: 19	Posted: Tue Apr 03, 2018 8:04 am

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Tue Apr 03, 2018 9:15 am

KingMike Joined: 14 Oct 2008 Posts: 510	Posted: Tue Apr 03, 2018 7:13 pm

sverx Joined: 05 Sep 2013 Posts: 3794 Location: Stockholm, Sweden	Posted: Wed Apr 04, 2018 9:06 am

oziphantom Joined: 14 Mar 2018 Posts: 19	Posted: Wed Apr 04, 2018 10:31 am

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Wed Apr 04, 2018 1:26 pm

sverx Joined: 05 Sep 2013 Posts: 3794 Location: Stockholm, Sweden	Posted: Wed Apr 04, 2018 1:37 pm

Maxim Site Admin Joined: 19 Oct 1999 Posts: 14726 Location: London	Posted: Wed Apr 04, 2018 1:39 pm

sverx Joined: 05 Sep 2013 Posts: 3794 Location: Stockholm, Sweden	Posted: Wed Apr 04, 2018 1:43 pm

oziphantom Joined: 14 Mar 2018 Posts: 19	Posted: Wed Apr 04, 2018 4:17 pm

View topic - Stack Move