Block instructions
The repeating block instructions such as otir and ldir are used to copy or transfer large blocks of data. Internally the Z80 handles these like individual outi and ldi instructions: after executing one, it moves the program counter back to re-execute the instruction until the counter (register b for otir, register pair bc for ldir) becomes zero.
It is actually faster to use multiple outi and ldi instructions in sequence to mimic the behavior of otir or ldir: each repeated iteration of otir or ldir takes 21 cycles, while a standalone outi or ldi takes 16. For example:
.rept 1024
outi
.endr
outiblk: ret
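This defines 1024 outi instructions with a ret at the end, making it into a subroutine. Each outi is two bytes long, so calling the address N*2 bytes before the label executes the last N of them. You can then call it like so: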
ld hl, data ; Source data
ld c, $be ; Output port
call outiblk-768*2 ; Transfer 768 bytes from (hl) to (c)
To keep your code readable, it might be a good idea to put a call to the block of outi instructions in a macro.
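A minimal sketch of such a macro, assuming WLA-DX macro syntax (.macro with args, closed by .endm) and the outiblk label from the listing above. The macro name FastOuti is hypothetical, and other assemblers will need their own macro syntax.

```asm
.macro FastOuti args count   ; count = number of bytes to send (1-1024)
  call outiblk-count*2       ; Enter the block count outi instructions before the ret
.endm

  ld hl, data                ; Source data
  ld c, $be                  ; Output port
  FastOuti 768               ; Expands to: call outiblk-768*2
```

The call site then states the byte count directly instead of repeating the address arithmetic.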
Using this technique with ldi can be useful for filling memory too. Just point the source address at a table that contains the fill value.
cleartab: .rept 1024 ; Table of 1024 zeroes
.db $00
.endr
.rept 1024
ldi
.endr
ldiblk: ret
ld hl, cleartab ; Quickly clear $c000-$c3ff
ld de, $c000
call ldiblk-1024*2
Code speed and size optimizations
Left-shifting
add a,a is faster than sla a for single-bit left shifts (4 cycles and 1 byte versus 8 cycles and 2 bytes).
Zeroing a
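xor a is faster and smaller than ld a,0 (4 cycles and 1 byte versus 7 cycles and 2 bytes). Note that xor a also changes the flags, while ld a,0 leaves them untouched.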
ret == reti
- IRQ handlers can end in ei; ret instead of ei; reti to spare a byte and 4 clock cycles.
Extra registers
- If you are out of registers, try using ixl/ixh/iyl/iyh (the undocumented 8-bit halves of ix and iy) and even the i register for loop counters instead of maintaining a counter in memory or pushing and popping an already used register inside a loop.
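As a rough illustration, a loop counted in ixl might look like the sketch below. It assumes your assembler accepts the undocumented ixl mnemonics (WLA-DX does, as far as I know); the label row_loop and the loop body are placeholders.

```asm
  ld ixl, 8         ; ixl = loop counter (undocumented low half of ix)
row_loop:           ; Hypothetical label
  ld a, (hl)        ; Loop body: b, d and e stay free for other work
  out (c), a
  inc hl
  dec ixl           ; 2 bytes, 8 cycles; sets Z when the counter reaches zero
  jr nz, row_loop
```

The prefixed dec ixl costs 4 cycles more than dec on a plain register, which is still far cheaper than a push/pop pair or a counter kept in memory.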
Conditional rst
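- For a smaller conditional rst $38, use jr cc, -1. This takes 2 bytes instead of the 3 needed for call cc, $0038. The conditional jump lands on its own displacement byte ($FF), which is the rst $38 opcode. Only the conditions jr supports (nz, z, nc, c) can be used.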
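The byte layout can be made explicit by emitting the instruction as raw data, which also sidesteps differences in how assemblers interpret a literal jr displacement. This is only a sketch; the cp 10 comparison is an arbitrary example condition.

```asm
  cp 10             ; Example test: carry is set when a < 10
  .db $38, $FF      ; jr c, -1: if carry is set, jump onto the $FF byte, which executes as rst $38
  ; Execution continues here when carry is clear,
  ; and also when the rst $38 handler returns
```

Because the rst executes at the address of the displacement byte, the return address it pushes is the instruction that follows, just as with a conditional call.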
Use shadow registers for interrupts
- To minimize interrupt handler response time (such as with raster effects), keep the handler's working data in the alternate register set and switch to it using exx (and ex af,af' if you need af) at the start of your interrupt routine. Before returning, calculate the data for the next interrupt and leave it in the alternate register set.
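A minimal sketch of such a handler follows. It assumes the main program never touches the alternate registers, that the alternate set was preloaded with c = output port, d = first value and hl = pointer to a table of further values, and that any hardware-specific interrupt acknowledgement is handled elsewhere. The label irq_handler is hypothetical.

```asm
irq_handler:
  ex af, af'        ; Switch to the alternate af (only needed if the handler uses a or the flags)
  exx               ; Switch to the alternate bc, de, hl, which already hold the working data
  out (c), d        ; Emit the value prepared during the previous interrupt straight away
  ld d, (hl)        ; Then fetch the value for the next interrupt
  inc hl
  exx               ; Switch back to the main register set
  ex af, af'
  ei
  ret
```

Nothing is pushed or loaded from memory before the first out, which is what keeps the response time short.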
Rotate the other way, it's shorter
- When moving a bitfield within a register to a desired position, it may be faster to rotate the register in the opposite direction rather than shift it in the intended direction. E.g. rotating right twice instead of shifting left 6 times to move bits 1-0 to bits 7-6.
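For the example above, a sketch using rrca (1 byte, 4 cycles each) takes 2 bytes and 8 cycles for the move, versus 6 bytes and 24 cycles for six add a,a instructions. The final and is only needed if the other bits of a are not already zero.

```asm
  ; a = %------xy  ->  a = %xy------
  rrca              ; Bit 0 wraps around to bit 7
  rrca              ; Original bits 1-0 now sit in bits 7-6
  and $C0           ; Clear whatever rotated into bits 5-0
```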
Fallthrough looping
- If you need to repeat a routine several times but can't spare registers for a loop counter or ROM space to duplicate the code, try structuring the routine so it can call itself several times and fall through at the end. For example:
foo:
ld hl, data
call bar ; Run routine once
call bar ; .. twice
call bar ; .. three times
bar:
ld a, (hl) ; .. fourth and final time
inc l
and $0F
out (c), a
ret
Incrementing pages
- The write-only ROM paging registers located at $FFFC-$FFFF overlay work RAM, so a value written there is also stored in RAM and can be read back. This allows you to conveniently increment the page number in place when cycling through multiple ROM pages:
; Outside of loop
ld ix, $FFFE ; Point to register
:
; Within loop
inc (ix+0) ; Next page
Table alignment
- If you align tables to a 256-byte boundary, you can access the contents by placing the index in a register such as l and the high byte of the table address in h. This is faster than loading the full unaligned 16-bit address and adding a 16-bit index to it, and makes accessing tables with a size of 256 bytes or less very convenient:
ld h, (sineTable >> 8) & $FF ; Get MSB of table
ld a, (frame_count) ; Get index
ld l, a
ld a, (hl) ; Look up value
Instead of:
ld hl, sineTable ; Get address of table
xor a
ld d, a ; Set index high byte to zero
ld a, (frame_count)
ld e, a ; Set index low byte
add hl, de ; Add offset to base
ld a, (hl) ; Look up value
sub hl,de
- Cursing the lack of a 16-bit SUB instruction? If you're using a constant for one side of the operation, try this:
; 4 bytes, 21 cycles
ld de,-1000
add hl,de
Instead of:
; 6 bytes, 29 cycles
ld de,1000
or a ; reset carry flag
sbc hl,de
Two's complement takes care of the rest.
Never call and then ret
- Any function that looks like
SomeFunction:
; ...
call SomeOtherFunction
ret
can be optimised to
SomeFunction:
; ...
jp SomeOtherFunction
16-bit neg
Changes hl to -hl in 6 bytes and 24 cycles.
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
Returning set/reset carry flag
Rather than
return_set:
scf
jr +
return_unset:
or a ; clears carry flag
+:pop ...
ret
...you can save a bit of space and execution time with:
return_set:
scf
.db $3e
return_unset:
or a ; clears carry flag
pop ...
ret
...which changes the "or a" into a (relatively) harmless ld a,$b7, at the cost of overwriting a and being quite obscure to read. On the return_set path it saves you 1 byte and 5 cycles (or 2 bytes and 3 cycles if you used a jp instead of a jr in the first case).
8-bit Loop Counters
- Prefer to use the b register to hold 8-bit loop counters. This allows the djnz instruction to be used, which efficiently decrements the counter and performs a conditional jump back to the top of the loop.
- If b is not available and the counter is placed in a different register, the loop will require separate decrement and jump instructions. If the loop body is likely to execute 3 or more times, it is faster to use dec & jp nz rather than the slightly smaller dec & jr nz: jp nz always takes 10 cycles, while jr nz takes 12 cycles when taken and 7 when it falls through.
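The two loop shapes can be sketched as follows. The fill loop body and the labels fill_b and fill_c are placeholders, chosen only to show the loop control instructions.

```asm
; Counter in b: djnz is 2 bytes, 13 cycles when it loops, 8 when it exits
  ld b, 16
fill_b:             ; Hypothetical label
  ld (hl), a
  inc hl
  djnz fill_b

; Counter in c: dec + jp nz is 4 bytes, 14 cycles on every iteration
  ld c, 16
fill_c:             ; Hypothetical label
  ld (hl), a
  inc hl
  dec c
  jp nz, fill_c
```

With dec c and jr nz instead, the loop control would be 3 bytes and cost 16 cycles per repeated iteration and 11 on exit.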