SavagePencil wrote

Pretty sure this is how the NES version of Hard Drivin' worked:

I think you do still need your main loop defined. There's player input, there's buffers to fill in ROM and RAM for the VDP to start pulling, and there are a LOT of timing hazards here.

I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (or could be rewritten to do so) and add in trees, billboards, signs, stands, and other structures that fly past at Formula One speeds (100kph through turns, 200kph on straightaways).

Then color cycle the race track itself to give the player a wrap-around (canyon and dome) feel of rapid motion.

This seems to be pretty near the SMS's actual limits, even using prerendered tiles, precalculated sprite-texture tables, HiFi audio, digitized backgrounds, and 8x16 sprites that gradually move away from each other to give the appearance that buildings, signs, billboards, stands, and trees are getting closer.

Remember, we must build the algorithm around the hardware that is actually there. That means that, only able to display 64 8x16 sprites per screen, with another race car using up 6 of those at one time, we're already down to 58 sprites. The race track and barriers would all be done with background tiles and well proven techniques like what we see in Hang On.

That means that any new 3D effects need to fill two areas about 120x96 with sprites that would only fill about 64x96.
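To put rough numbers on that sprite budget, here is a back-of-the-envelope sketch in C. The helper names are made up for illustration; the 64-sprite, 6-sprite, 8x16 and 120x96 figures are the ones quoted above. It assumes no sprites overlap and ignores any per-line sprite limit, so treat the result as an optimistic upper bound.

```c
#include <stdint.h>

/* Figures taken from the post: 64 hardware sprites, 6 used by the other car,
   8x16 pixels per sprite, two side areas of roughly 120x96 pixels each. */
enum {
    TOTAL_SPRITES = 64,
    CAR_SPRITES   = 6,
    SPRITE_W      = 8,
    SPRITE_H      = 16,
    AREA_W        = 120,
    AREA_H        = 96,
    NUM_AREAS     = 2
};

/* Sprites left over for scenery once the other car is drawn. */
static inline uint32_t scenery_sprites(void)
{
    return TOTAL_SPRITES - CAR_SPRITES;
}

/* Pixels those sprites can cover if none of them overlap. */
static inline uint32_t sprite_pixels(void)
{
    return scenery_sprites() * SPRITE_W * SPRITE_H;
}

/* Pixels in both side areas combined. */
static inline uint32_t area_pixels(void)
{
    return (uint32_t)NUM_AREAS * AREA_W * AREA_H;
}

/* Share of the side areas that sprites alone could cover, in percent (rounded down). */
static inline uint32_t coverage_percent(void)
{
    return sprite_pixels() * 100u / area_pixels();
}
```

Under those assumptions the sprites cover roughly a third of the two side areas, which is why background tiles have to carry the rest.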

Using background tiles to cover large parts of that area with solid or dithered colors (a large section of buildings, forest, canyons on one side with a few sprite columns to provide a feeling of scaling and rapid movement), we'd only have 5-6 columns on the other side to draw anything interesting that doesn't use background tiles as a safety net.

And we can only stream, IIRC, another 91 tiles per frame (13x7, or roughly 9x10). Even flipping those horizontally, that's not enough to fill out the screen. It's barely enough to give the player that canyon and dome feeling of being surrounded by rapidly moving objects.

Kagesan wrote

It seems you still haven't made up your mind what exactly you want to achieve, just that it's supposed to be something that's vaguely 3D-ish. However, defining a clear goal is the very first step if you want to get somewhere.

Outside of a demo context, effects are not an end in themselves, so if you want to put them in the context of a game, you need to make sure that the game side of your project works first and foremost.

You need to get away from those Coding Secrets videos. The techniques presented there are not universally reusable and are tailored very specifically to the Mega Drive. The Master System has a very different architecture, which you should be aware of by now after several people have explained it to you in detail, but which you choose to ignore.

Of the effects shown, the floor could be easily done on the SMS, but the furniture would need to use up a lot of sprites you'd probably want to use elsewhere for more game-relevant things, like, you know, displaying a player character or some enemies.
Those 3D walls? Forget it. The best you can hope for is something like the dungeons in Phantasy Star, but those are basically prerendered animations, and even those only work at the speed required through some crazy optimizations, one of which is the sparseness of their design.

But having this conversation go around in yet another circle is pointless. Since you seem determined to think that us naysayers are just too narrow-minded and don't know what we're talking about while you have it all figured out, I suggest you get to work and prove us wrong. That code's not going to write itself, and I think it's clear by now that no one else is willing to do it for you either.

In the meantime, "we" are eagerly waiting for your results. Good luck!

This is all very sad to me, because this forum is one of the last "pure" places on the internet. In the years I've been here there has never really been any drama and tons of great information and experienced people who are willing to help when they can.

However, I do feel a bit angry at myself for getting frustrated with this guy. As you have said, and others, he needs to write some code. Just do it, experiment, anything. All this conjecture going around in more circles is annoying. It was annoying last week, and even more so now. No advice is being absorbed. It's a real shame; I make an effort to read every new thread and reply, because it's a community where the daily activity is just the right amount that it's actually feasible to do so. I may not understand every last thing in the dev section as my ASM knowledge is novice x86, but it's still an interesting read and great to see the knowledge living on. It's sad that for the past two weeks I come here to see what's new and it's the same result in this thread, just paraphrased.

I want to be clear, I am not trying to be mean. But maxx, it's good to have enthusiasm and to throw new ideas out there... but nobody will write this for you. A "team" will not happen until you show some promising progress. Maybe you are young? When I was young I used to think I was gonna write duke3d mods on a "team" and clearly that was silly and it's something I had to learn. Write some code, any code, for the SMS. Start building on those mistakes. When you start coming here with interesting code that can be refined it will attract attention and someone will be interested in working with you. But, at this point in the conversation, it is going nowhere and is a frustrating experience all around.

maxxoccupancy wrote

This seems to be pretty near the SMS's actual limits

this is as near to SMS limits as moon.

I would love to be blown away to see the Master System hitting new heights using modern tools and modern computer science.....but I'm struggling to see it here.

The Z80 is slow at any kind of math due to its single accumulator and low registers count. Even using LUTs is slow. And it only gets worse when you start doing meaningful things like handling data structures and program loops etc.

To show what I mean, here is a simple hypothetical block of code to perform multiplication of 2 FP8 numbers using a LUT:

LUT-based FP8 Multiplication

Takes in two FP8 numbers and returns the FP8 multiplication result using a LUT
LUT is 64kB starting on a 16kB boundary

HL = the two FP8 numbers to be multiplied
A = the FP8 multiplication result

MultiplyFP8LUT:
    LD A,H            ;work out which 16kB segment of the 64kB LUT to map in to $8000-$BFFF
    RLCA              ;rotate the top two bits of H down into bits 1-0
    RLCA              ;(RLA would shift the carry flag in instead)
    AND $3
    ADD A,LUTBasePage
    LD ($FFFF),A      ;map in the 16kB LUT segment
    LD A,H            ;form the pointer
    AND $3F
    OR $80
    LD H,A
    LD A,(HL)         ;FP8 multiplication result is in A

This block of code takes 68 CPU cycles to execute. At 228 cycles per scanline, that's a little over 3 calculations per scanline, or about 640 calculations per 192-line frame.
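For anyone who finds the bit-twiddling easier to follow in C, here is a small model of the address arithmetic the routine above is aiming for. It is only a sketch: the function names are hypothetical, and it assumes the 64kB table is split into four 16kB pages that are mapped one at a time into $8000-$BFFF, with H and L being the two FP8 operands.

```c
#include <stdint.h>

/* Which 16kB page of the 64kB table holds the entry: the top two bits of H.
   The real routine would add LUTBasePage to this before writing it to $FFFF. */
static inline uint8_t lut_page(uint8_t h)
{
    return (uint8_t)(h >> 6);
}

/* Address of the entry inside the $8000-$BFFF window:
   low six bits of H with bit 7 forced on as the high byte, L as the low byte. */
static inline uint16_t lut_address(uint8_t h, uint8_t l)
{
    uint8_t high = (uint8_t)((h & 0x3F) | 0x80);
    return (uint16_t)(((uint16_t)high << 8) | l);
}
```

For example, operands H=$C5 and L=$12 land in page 3 of the table at address $8512.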

I imagine by the time you add program and data structures around this it will probably be half that rate.

That ain't great. :(

probably even a FP8 sum would have to be performed by using a LUT, and that would require the same amount of code, so at the end your Z80 would be barely able to perform at most around 880 (on NTSC) 'operations' per frame, meaning with 100% usage on this, so no CPU left for anything else.
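The 640 and 880 figures above can be sanity-checked with a few lines of C. This is just arithmetic, assuming 228 Z80 cycles per scanline and 262 scanlines per NTSC frame (both assumptions on my part), plus the 68-cycle cost quoted for one lookup. The names are made up for the example.

```c
#include <stdint.h>

enum {
    CYCLES_PER_LINE   = 228, /* assumed Z80 cycles per scanline */
    ACTIVE_LINES      = 192, /* visible lines, as in the post */
    NTSC_LINES        = 262, /* assumed total lines per NTSC frame */
    CYCLES_PER_LOOKUP = 68   /* cost quoted for one LUT multiplication */
};

/* How many 68-cycle lookups fit into the given number of scanlines,
   if the CPU does nothing else at all. */
static inline uint32_t lookups_in_lines(uint32_t lines)
{
    return lines * CYCLES_PER_LINE / CYCLES_PER_LOOKUP;
}
```

That gives 3 lookups in one line, 643 in the 192 visible lines and 878 in a whole NTSC frame, before any loop or data-structure overhead is counted.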

I suggest sticking to 16 bit fixed point. An addition is 11 cycles.
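As a rough illustration of what 16-bit fixed point looks like, here is a minimal sketch in C. The 8.8 split (8 integer bits, 8 fraction bits) and all the names are assumptions chosen for the example; the post does not say which split to use. On the Z80 the addition would map onto a single 16-bit add, while the multiply has no hardware support and would need a software routine or tables.

```c
#include <stdint.h>

/* Hypothetical 8.8 format: value = raw / 256. */
typedef int16_t fix8_8;

#define FIX_ONE 256

/* Convert a small integer to 8.8. */
static inline fix8_8 fix_from_int(int v)
{
    return (fix8_8)(v * FIX_ONE);
}

/* Addition is a plain 16-bit add, no rescaling needed. */
static inline fix8_8 fix_add(fix8_8 a, fix8_8 b)
{
    return (fix8_8)(a + b);
}

/* Multiplication needs a wider intermediate and a rescale by 256. */
static inline fix8_8 fix_mul(fix8_8 a, fix8_8 b)
{
    return (fix8_8)(((int32_t)a * (int32_t)b) / FIX_ONE);
}
```

For example, 3 + 0.5 is `fix_add(fix_from_int(3), 128)`, which gives a raw value of 896 (3.5), and 1.5 times 2 is `fix_mul(384, 512)`, which gives 768 (3.0).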

sverx wrote

probably even a FP8 sum would have to be performed by using a LUT, and that would require the same amount of code, so at the end your Z80 would be barely able to perform at most around 880 (on NTSC) 'operations' per frame, meaning with 100% usage on this, so no CPU left for anything else.

and what to do with that result of multiplication? :) you need to convert that value into the normal number with another table. also fp8 tolerance does not nearly allow to perform all the required calculations. here are ALL possible values for FP8 format: https://en.wikipedia.org/wiki/Minifloat#All_values_as_decimals

one can use FP8 as a delta, but you must have at least single (or better double) value as accumulator.

toxa wrote

and what to do with that result of multiplication? :) you need to convert that value into the normal number with another table. also fp8 tolerance does not nearly allow to perform all the required calculations. here are ALL possible values for FP8 format: https://en.wikipedia.org/wiki/Minifloat#All_values_as_decimals

wow, this seems to be pretty useless then, at least in all the cases I can think of...

as said before: https://www.smspower.org/forums/18937-AdvancedMathematicsFor3dPolygonsOnTheSMS?s...

Is it worth anyone's time coding an actual demo if there are experienced people here who already know about some of the obstacles? For example, someone reminded me that banks were 16KB--not 64KB as I'd read somewhere else. I looked it up, and sure enough, 16KB.

That's the problem with being a newb. We have to keep asking questions about what the limitations of the hardware are.

However, I don't agree with the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second--compared to 1.2 million for the 68000 on the Megadrive. That's about 9,000 instructions per frame. Since we only have to move about 90 tiles and draw 64 sprites per frame, that isn't terrible and has made pseudo 3D possible on the earliest SMS games like Space Harrier and Outrun.

While more accurate placement and better 3D effects can be achieved using floating point to place groups of sprites, I feel like this thread has pretty well exhausted itself. Though I've learned a ton here about what can't be done on SMS1 hardware, I'm going to heed the advice of many and start a P.O.C. thread for a much, much simpler project that I'd started out with.

Send me a PM if anyone would like to go further on this race track game, but I don't see enough enthusiasm to continue this thread to complete an actual project.

maxxoccupancy wrote

Is it worth anyone's time coding an actual demo if there are experienced people here who already know about some of the obstacles? For example, someone reminded me that banks were 16KB--not 64KB as I'd read somewhere else. I looked it up, and sure enough, 16KB.

That's the problem with being a newb. We have to keep asking questions about what the limitations of the hardware are.

However, I don't agree with the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second--compared to 1.2 million for the 68000 on the Megadrive. That's about 9,000 instructions per frame. Since we only have to move about 90 tiles and draw 64 sprites per frame, that isn't terrible and has made pseudo 3D possible on the earliest SMS games like Space Harrier and Outrun.

While more accurate placement and better 3D effects can be achieved using floating point to place groups of sprites, I feel like this thread has pretty well exhausted itself. Though I've learned a ton here about what can't be done on SMS1 hardware, I'm going to heed the advice of many and start a P.O.C. thread for a much, much simpler project that I'd started out with.

Send me a PM if anyone would like to go further on this race track game, but I don't see enough enthusiasm to continue this thread to complete an actual project.

Also, just wanted to mention that reducing the number of instructions is good, but if the new instructions take more cycles in total than your current code, it is not actually better. This may sound obvious, but I didn't know this, and apparently it was (is?) a common misconception.

I learned this from Michael Abrash's Graphics Programming Black Book. The book is now available for free on the web, or you can pay some $$$ to get a physical copy. It's a good read if you're really interested in graphics tricks as well as optimization. Obviously, it will be geared towards x86 so not everything will be applicable.
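A Z80 case that shows the same point is block copying. A single LDIR replaces a whole run of LDI instructions, yet each repeating LDIR step costs 21 cycles against 16 for an LDI. The sketch below just adds up those cycle counts in C; the function names are invented for the example, and the timings are the commonly documented ones, so double-check them against your own reference.

```c
#include <stdint.h>

enum {
    LDIR_REPEAT_CYCLES = 21, /* LDIR step that loops again */
    LDIR_LAST_CYCLES   = 16, /* final LDIR step */
    LDI_CYCLES         = 16  /* one standalone LDI */
};

/* Cycles for one LDIR instruction copying n bytes (n >= 1). */
static inline uint32_t ldir_cycles(uint32_t n)
{
    return (n - 1) * LDIR_REPEAT_CYCLES + LDIR_LAST_CYCLES;
}

/* Cycles for n unrolled LDI instructions copying n bytes. */
static inline uint32_t unrolled_ldi_cycles(uint32_t n)
{
    return n * LDI_CYCLES;
}
```

Copying 16 bytes costs 331 cycles with one LDIR and 256 cycles with sixteen LDIs, so the version with far more instructions is the faster one (at the price of more ROM).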

maxxoccupancy wrote

I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (...) using prerendered tiles (...) that gradually move away from each other to give the appearance that buildings, signs, billboards, stands, and trees are getting closer.

Prerendered tiles? Out Run Europa. See Turbo Charge (C64) or Power Drift (C64) for similar tile-based engines.

The problem with Out Run Europa engine is that scenery moves too linearly and too fast: visual distance = 5 4 3 2 1.
In my opinion, using a "more progressive" visual distance, where far scenery moves slower and keeps longer in the screen, the result would look much better: visual distance = 5 5 5 5 4 4 4 3 3 2 1.
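One way to read those two sequences is as a table that maps each animation step of an approaching object to the prerendered size level to show. The sketch below takes that reading, which is my assumption and not something spelled out above, and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Size level shown at each step as an object approaches (5 = farthest). */
static const uint8_t linear_steps[]      = {5, 4, 3, 2, 1};
static const uint8_t progressive_steps[] = {5, 5, 5, 5, 4, 4, 4, 3, 3, 2, 1};

/* Level to draw at a given step; steps past the end stay on the nearest level. */
static inline uint8_t scale_level(const uint8_t *table, size_t len, size_t step)
{
    return table[step < len ? step : len - 1];
}

/* How many steps an object spends at one level. */
static inline size_t steps_at_level(const uint8_t *table, size_t len, uint8_t level)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (table[i] == level) {
            n++;
        }
    }
    return n;
}
```

With the progressive table a far object stays at level 5 for four steps instead of one, which is the "keeps longer on the screen" effect described above, at no extra cost in prerendered tiles.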

maxxoccupancy wrote

For example, someone reminded me that banks were 16KB--not 64KB as I'd read somewhere else.

Banks cannot be 64K on any Z80 system because 64K is the whole address space of the CPU. No exceptions.

maxxoccupancy wrote

However, I don't agree with the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second--compared to 1.2 million for the 68000 on the Megadrive.

68000 is a 16-bit CPU (if we count by bitness of ALU) with 32-bit instruction set. You can not compare it with Z80 simply BECAUSE.

It looks to me that you are just writing some random statements that come into your head, not even bothering to make any analysis. You don't even fully understand what you are writing.

theNestruo wrote

maxxoccupancy wrote

I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (...) using prerendered tiles (...) that gradually move away from each other to give the appearance that buildings, signs, billboards, stands, and trees are getting closer.

Prerendered tiles? Out Run Europa. See Turbo Charge (C64) or Power Drift (C64) for similar tile-based engines.

The problem with Out Run Europa engine is that scenery moves too linearly and too fast: visual distance = 5 4 3 2 1.
In my opinion, using a "more progressive" visual distance, where far scenery moves slower and keeps longer in the screen, the result would look much better: visual distance = 5 5 5 5 4 4 4 3 3 2 1.

There are three ways to index for distance. One is to just use the actual distance between planes, as most of the 1980s sprite superscalars did in the arcades. Another, like modern 3D raster engines, is to use trigonometry to calculate the distance. The third approach is to set the distance according to the scale factor that you're going to use.

Using the race track example, we might cut the race track into a set of lines or slices along the track. Each object along the track has its own slice or distance. Using a 16-bit index, we could subtract the two:

uint_fast16_t distance = distObject - distCar;

We could then scale something up by the square root of the distance, or we could use a hash table

distance   address
3          0x80
4          0xA0
5          0xC0
and so forth.
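As a sketch of that lookup, the table above can be a plain array indexed by distance (strictly a lookup table, not a hash table). Only the three rows actually listed are filled in here; the function names and the choice to return 0 for any other distance are assumptions made for the example.

```c
#include <stdint.h>

enum { DIST_MIN = 3, DIST_MAX = 5 };

/* Only the rows given in the post: distances 3, 4 and 5. */
static const uint8_t dist_to_addr[] = {0x80, 0xA0, 0xC0};

/* Distance along the track between an object and the car.
   Unsigned 16-bit subtraction, so it still works when the track index wraps. */
static inline uint16_t track_distance(uint16_t distObject, uint16_t distCar)
{
    return (uint16_t)(distObject - distCar);
}

/* Table address for a distance, or 0 if the distance is outside the listed rows. */
static inline uint8_t lookup_address(uint16_t distance)
{
    if (distance < DIST_MIN || distance > DIST_MAX) {
        return 0;
    }
    return dist_to_addr[distance - DIST_MIN];
}
```

An object at slice 1004 with the car at slice 1000 gives a distance of 4 and therefore address 0xA0.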

Regarding the Z80's 520,000 instructions per second and the 68k's 1.2 MIPS, it's true that the 68k had wider registers and a larger ALU with more instructions. However, for the address calculations and memory moves, the Z80A runs at just under half the speed of the 68k. The SMS VDP is a 14MHz 16-bit RISC coprocessor that operates like many other coprocessors out there: controlled through registers rather than having its own decoder, branch unit, etc. There are many coprocessors still in use that operate this way.

toxa wrote

It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing.

🤡

Maraakate wrote

toxa wrote

It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing.

🤡

My specialization is RISC pipelines. Admittedly, the Z80 ISA is very different, and optimization techniques are very different, but quasi-personal attacks don't do anything but send out an email notification to everyone on this thread, and they come in and see that someone is trashing the newb rather than offering anything constructive.

The actual mini demo concept has moved to its own thread to avoid confusion with the advanced floating point, trigonometric operations, LookUp Tables, and complex topics that are better left in this thread.

If you don't have anything constructive to say, don't post anything.

Quote

If you don't have anything constructive to say, don't post anything.

Does this also apply to you?
As the saying goes: well-ordered charity begins with oneself.

maxxoccupancy wrote

The SMS VDP is a 14MHz 16-bit RISC coprocessor that operates like many other coprocessors out there: controlled through registers rather than having its own decoder, branch unit, etc. There are many coprocessors still in use that operate this way.

Why do you keep repeating this mantra? So what? How will that help you to render 3D scenes?

How does 14MHz help you? You cannot write to the VDP faster than once every 29 (twenty nine!) t-states, and only one byte at a time. If you need random access, speed drops DRAMATICALLY.

How does 16-bit help you? All user-visible registers are 8-bit, and the only internal 16-bit VRAM pointer has to be written in two steps and cannot even be read back...

How does RISC help you? You cannot write programs for it like shaders or whatever.

All in all - VDP is a very limited and slow graphics chip with inconvenient and slow access that is intended for rendering of tiled 2D graphics.
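To see what the 29 t-state figure means in practice, here is a quick calculation in C. It assumes an NTSC frame of 228 x 262 cycles and 32 bytes per tile (8x8 pixels at 4 bits per pixel), and it pretends the Z80 does nothing but feed the VDP for the entire frame. As far as I know the limit is looser while the display is blanked, so take this only as a ballpark. The names are made up.

```c
#include <stdint.h>

enum {
    CYCLES_PER_FRAME_NTSC = 228 * 262, /* assumed: 228 cycles per line, 262 lines */
    CYCLES_PER_VDP_BYTE   = 29,        /* minimum spacing between writes, from the post */
    BYTES_PER_TILE        = 32         /* 8x8 pixels at 4 bits per pixel */
};

/* Most bytes that can be written in the given number of cycles at one byte per 29 cycles. */
static inline uint32_t max_bytes(uint32_t cycles)
{
    return cycles / CYCLES_PER_VDP_BYTE;
}

/* The same budget expressed in whole tiles. */
static inline uint32_t max_tiles(uint32_t cycles)
{
    return max_bytes(cycles) / BYTES_PER_TILE;
}
```

At that rate a whole frame of nonstop writing moves 2059 bytes, or 64 tiles, with no CPU time left for game logic.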

maxxoccupancy wrote

Maraakate wrote

toxa wrote

It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing.

🤡

My specialization is RISC pipelines. Admittedly, the Z80 ISA is very different, and optimization techniques are very different, but quasi-personal attacks don't do anything but send out an email notification to everyone on this thread, and they come in and see that someone is trashing the newb rather than offering anything constructive.

The actual mini demo concept has moved to its own thread to avoid confusion with the advanced floating point, trigonometric operations, LookUp Tables, and complex topics that are better left in this thread.

If you don't have anything constructive to say, don't post anything.

Specialization? You went to school for this? Are you currently IN college for this? If yes, that would probably make a lot of sense.

There's newer replies below my previous post, but toxa and ichigobankai are right. You do not listen and are just regurgitating the same things like there is some kind of authority in what you are saying, but you have proved nothing.

But go ahead, keep talking about FP8 and word sizes. Post more videos of OTHER PLATFORMS where interesting things have been done. To people like me, someone who is a software developer by profession in C/C++ with some light ASM, at first glance it sounds like you are saying smart things. But then we have experienced people who know this hardware inside and out and say why it's bologna and will never work, and I believe them, because they have proven multiple times over that they know what they are talking about. This is just like the Bogdanov twins!

In all seriousness, it's not because you are a "newb". You can be brand new to Z80 assembly, nobody here will make fun of you for that (really!). But when you keep asking the same, silly thing, over and over and do not listen to why it cannot work then yes, people get fed up and either argue with, make fun of, or ignore you.

Just for reference, this is the closest video I could find for the Pseudo 3D effect for the road and objects on the side of it for the 3D race track game. Road Rash does something like this with the rolling hills. I'm currently reading up on ways that this visual trick can be done.

For the stands, signs, billboards, backgrounds, and scenery, see the other posts about placing groups of sprite-textures together.

cool! i like this one:

OP sounds like a clueless politician in his late 40's who got dropped in a position he knows nothing about, who had to cobble up various bits of vaguely related information in a hurry to try and gain the public's confidence.

You're inventing problems (floating point on a Z80 to do basic 3D) to bring your own solutions (LUTs), effectively achieving nothing useful towards that initial goal.
You're convinced that any state machine (VDP) can be a CPU (no it isn't).
First 3D polygons, then "sprite-textures" (whatever that is... a sprite is a simple and well defined term in the 2D console world), then pseudo 3D à la Super Scaler which has hardware specifically designed to render very large amounts of scaled sprites (unlike the SMS).
Then out of nowhere you mention Romero and start fantasising about a team working to build your dream of pushing things to their limits with programs written in C, or "tweaks" of ready made games like it was just a matter of turning a knob...

As others have said, having high ambitions is a good thing, but keeping your feet on the ground is essential. All you have right now are clouds of dots you haven't been able to connect.

Sorry man, you can't expect to be taken seriously on a tech forum while sitting at the peak of the Dunning-Kruger curve, unable to follow or express clear ideas, and on top of that being borderline arrogant.

Like many others who were interested in rather obscure technical stuff, I've had my cringy know-it-all frantic micro-manager period when I was 16, but aren't you past that stage ?

That's literally how Tomb Raider was ported to the GBA, although I didn't know about that when I brought up this approach. They even use LUTs for the more complex mathematics. Floating point and embedded development are literally my background.

Sprite-textures I described earlier as sprites that are used as textures. 4MB ROMs can now hold thousands of prerendered polygons that are displayed as sprites. This shouldn't require any further explanation.

The Z80 gets used to write the SAT and update the locations of those sprite-textures each frame, just as many other consoles and machines do now to render 3D graphics. The only question is whether the Z80A at 3.59MHz is powerful enough to color cycle and move 64 sprite locations in 60,000 cycles, or one frame.

You're saying that the Video Display Processor is not a processor for some reason, even though it's described as a 14MHz 16-bit "Video Display Processor" that is programmed using registers, tables, and interrupts, even though that's how most graphics coprocessors are programmed--including the SMS Video Display Processor.

There are more personal attacks on this thread than I've seen on almost any other forum. Others have also posted lots of examples of polygons and 3D graphics performed by the Z80 and even the SMS itself. Lead, follow, or get out of the way.

Quote

That's literally how Tomb Raider was ported to the GBA

What does this have to do with the SMS ? Still no answer.

Quote

Floating point and embedded development are literally my background.

If the only tool you have is a hammer, you will start treating all your problems like a nail.

Quote

Sprite-textures I described earlier as sprites that are used as textures. 4MB ROMs can now hold thousands of prerendered polygons that are displayed as sprites.

Eager to see what jerky, slow, and jumbled pixel mess is possible with 40 8x16 pre-rendered triangles.

Also, have you finally read about the 8 sprites per line limit ?
Ignoring it won't make it go away.
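A quick way to check a sprite layout against that limit is to count how many sprites touch each scanline. The C sketch below does that for 8x16 sprites. It is deliberately simplified: it takes the top line of each sprite directly and ignores the finer details of the real sprite attribute table, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

enum { SCREEN_LINES = 192, SPRITE_HEIGHT = 16, MAX_PER_LINE = 8 };

/* Returns how many scanlines have more than MAX_PER_LINE sprites on them.
   y[i] is the top scanline of sprite i; count is at most 64. */
static inline int overloaded_lines(const uint8_t *y, size_t count)
{
    uint8_t per_line[SCREEN_LINES] = {0};
    int bad = 0;

    /* Mark every scanline each sprite covers. */
    for (size_t i = 0; i < count; i++) {
        for (int row = 0; row < SPRITE_HEIGHT; row++) {
            int line = y[i] + row;
            if (line < SCREEN_LINES) {
                per_line[line]++;
            }
        }
    }

    /* Count the lines where the hardware would have to drop sprites. */
    for (int line = 0; line < SCREEN_LINES; line++) {
        if (per_line[line] > MAX_PER_LINE) {
            bad++;
        }
    }
    return bad;
}
```

Eight sprites side by side on the same row are fine; add a ninth on that row and all 16 of its scanlines go over the limit.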

Quote

The Z80 gets used to write the SAT and update the locations of those sprite-textures each frame

Thanks for explaining what almost every single piece of software written for the SMS does.

Quote

, just as many other consoles and machines do now to render 3D graphics.

No. Vertex-defined shapes in a 3D space aren't sprites. The concept of hardware sprites doesn't even exist anymore on modern GPUs.

The fact that you're mixing up such things says a lot.

Quote

The only question is whether the Z80A at 3.59MHz is powerful enough to color cycle and move 64 sprite locations in 60,000 cycles, or one frame.

The Z80 can do much more than change a few palette entries and write 128 bytes in the SAT in the period of one frame. But that doesn't make anything 3D.

Quote

You're saying that the Video Display Processor is not a processor for some reason

Read again. I used the acronym that you used yourself: "CPU", not "processor".

Quote

, even though it's described as a 14MHz 16-bit "Video Display Processor"

Yup, it's a processor alright, it processes things.
A clock frequency and a bus width don't make a "processor" a CPU.
The embedded dev that you are (were ?) ought to know the difference.

Quote

that is programmed using registers, tables, and interrupts

VDP registers aren't a program, they're just a few configuration switches to select some options. "Configuration registers" aren't called "program memory" for a reason.
The official VDP doc is like 20 pages long, read them.

What you call "tables" is VRAM.

Stating that "interrupts program the VDP" doesn't make any sense.
Someone who has basic understanding of either or both wouldn't write that.

Quote

There are more personal attacks on this thread than I've seen on almost any other forum.

I'm attacking your nonsense, not your person.

Quote

Others have also posted lots of examples of polygons and 3D graphics performed by the Z80 and even the SMS itself.

So in a single, simple sentence, what are you looking for now that others have done your research work ?

Quote

Lead, follow, or get out of the way.

Cheesy manager talk.

You're leading nobody, and you're following no one either since you've ignored all the advice you've been given until now.
So who's got to get out of the way ?

I am having a serious crisis of conscience about whether to post this, but here goes nothing:

As a newcomer to contributing to this forum (lurker for a fair bit longer), ENORMOUS appreciator of the resource that it represents and - dare I say it - sometime expert on a fair few things and a self-confessed “newb” on many, many more; I think this thread is starting to do a discredit to what appears to be an otherwise incredibly respectful and harmonious community of enthusiasts.

It’s not that I don’t agree with all of the comments about the OP’s netiquette - to be clear, I do - but in the course of this conversation nothing appears to be improving, only escalating and becoming more and more unpleasant to observe.

As another has alluded to, it seems highly likely that the OP is not, as some may have suspected, some impressionable teenager whose behaviour can be "corrected" with the perfect mixture of logic and chastisement, but most likely a grown adult who, by this point in life, may well be set in his ways and unlikely to respond positively in any way to what has now become a fairly constant stream of criticism, however well intentioned that criticism may ultimately be.

With the greatest of respect for everyone’s views, feelings and contributions to this conversation to date; and also very mindful that I’m an outsider quite possibly with no right to say so, might I suggest that it could be time to just let this conversation quietly fade into the past and move on, for the sake of maintaining the generally welcoming and constructive ethos of this community? 🕊️

To be honest, I thought that the OP was some kind of troll from his very first post, but I did my best to keep my mouth shut because I didn't want to be rude (since I have a history of bad behaviour on various forums and I'm trying to steer away from that). Glad that someone else took care of the elephant in the room and said what many of us have been thinking. As much as I'd love to be proven wrong, I don't think we're going to see an "advanced 3D" demo anytime soon.

maxxoccupancy wrote

That's literally how Tomb Raider was ported to the GBA

but GBA has six buttons: "A","B","START","SELECT","L","R" while SMS has only two: "1" and "2"!

With all due respect, I'll have to quote the text from https://atariage.com/forums/topic/82555-to-all-non-programmer-idea-peddlers/

Quote

To All Non-Programmer Idea Peddlers:

Programmers already have more ideas than they know what to do with; without any of yours. They don't need your ideas. Probably most of them don't want your ideas. Most game programmers have more ideas already than they have time to begin, let alone complete.

That said, a good, well-thought out, well-presented idea is worth looking at, always.

So if you want a programmer to even consider your idea for 30 seconds, here's what you need to do:

1. Present a concrete, good idea with lots of visual aids. Writing a game takes hours and hours of work. If you want a coder to even consider, for a minute, dedicating that kind of time, you had better put in some serious time of your own preparing your idea. Time measured in hours. Make mock screenshots. Design some sprites. Learn the capabilities of the machine you want the game written for and fit your idea to them. It isn't easy to understand the Stella guide if you aren't a coder, but if you want somebody to even glance sideways at your idea you better be willing to put in the time to at least understand a little of it. Do some legwork and demonstrate it. Spend some time working out your idea on paper; playtesting it to make sure it works.

2. Present a compelling reason why a coder should take on the project *other* than the fact that you think it would be really cool. Does your idea fill an underserved niche in the 2600's library? Is it a completely unique concept? Are lots of people clamoring for a game of this type? Does it present a unique, fun challenge?

3. Be humble. You are asking for far, far, far more than you are giving or will ever contribute. Coders already work for pennies/hour working on their own ideas, if that. You want the programmer to do something for you, essentially for free? Don't make demands.

4. Be flexible. Be willing to put in yet more time reworking screenshots, rethinking game mechanics, designing different sprites.

The gold standard here is Adam Tierney (salstadt here at AA). Find some of the threads he started to publicize his own ideas and see what he did. See especially the Prince of Persia thread, and see how much work he put into that, over a period of weeks. You don't have to be the artist he is, but you better be willing to make up the difference in sweat.

What's a good idea? Can't give an exact definition, but here are some starters:

1. It is unique. Either absolutely unique or unique to the platform.

2. It has a tested, proven game mechanic. Which is fun.

3. It uses the capabilities of the machine it is designed for well.

4. It is fun to more people than just you.

The most important thing is to DO SOME WORK. If your idea doesn't have some mock screenshots, then it is worthless. Period.

haroldoop wrote

With all due respect,

we simply need more LUTs.

I thought furrtek’s post was civil myself.

Ultimately this is still a technical discussion about our favourite gaming console, even if its capabilities are not fully understood by all.

@toxa, c’mon man, no need to LUT-shame the guy. Some people just like big LUTs. :)

maxxoccupancy wrote

Lead, follow, or get out of the way.

I totally agree. Now provide some POC, or understand where the limits are, or get out of the way. You have plenty of time to pick yours.

Our friend, Supermaxx, would be proud of this, but he is busy with F1 2022 season, lol.

Well, not a master system, but using the tms9918a:

About as impressive as I've ever seen on a computer of this vintage...

segarule wrote

https://www.youtube.com/watch?v=qrq9GqWyBF0
Our friend, Supermaxx, would be proud of this, but he is busy with F1 2022 season, lol.

What's the floating point precision?

Maraakate wrote

segarule wrote

https://www.youtube.com/watch?v=qrq9GqWyBF0
Our friend, Supermaxx, would be proud this, but he is busy with F1 2022 season, lol.

What's the floating point precision?

If you were referring to my proposal, the precision of the Lookup Table limits us to seven bits, plus another seven bits for the exponent and another bit for the sign. That assumes a mantissa with implied '1' before the 1.xxxxxxx. That's the equivalent of 9-bit fractions.

Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well-written code. That would give us a respectable 5-10k polygons/sec, more than the Sega CD, but not as accurate.

So, 7-bits plus a sign and implied 1, or 9-bit fraction.

word.

maxxoccupancy wrote

Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well written code.

If you can describe the steps needed to perform an addition between two of those floating point values, I can code a snippet to perform that and we can see how many cycles are required. I suspect we're nowhere close to those values, since you'd need less than 60 cycles per FLOP to reach 60 kFLOPS (and that's using the whole frame time just for math).
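As a rough check on that budget, here is a minimal C sketch. It assumes the NTSC Master System Z80 clock of about 3,579,545 Hz, and the function name `cycles_per_op` is made up for illustration.

```c
#include <stdint.h>

/* Whole Z80 cycles available per operation at a target rate, assuming the
   CPU does nothing but math. The clock value is supplied by the caller. */
uint32_t cycles_per_op(uint32_t clock_hz, uint32_t ops_per_second)
{
    return clock_hz / ops_per_second;
}
```

With those assumptions, `cycles_per_op(3579545, 60000)` gives 59 and `cycles_per_op(3579545, 100000)` gives 35, so the whole operation, loads and stores included, would have to fit in that many cycles.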

sverx wrote

maxxoccupancy wrote

Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well written code.

If you can describe the steps needed to perform an addition between two of those floating point values, I can code a snippet to perform that and we can see how many cycles are required. I suspect we're nowhere close to those values, since you'd need less than 60 cycles per FLOP to reach 60 kFLOPS (and that's using the whole frame time just for math).

I know. I'm just trying to get within an order of magnitude to see where we are. IIRC, the Atari Jaguar could render about 10,000 polygons per second, so even 3-4,000 flat shaded polygons would be impressive and enough for a basic FPS.

Both operations assume that, when we load the mantissa from memory into the registers, the '1' is implied. The multiply-add and multiply LUTs results are already filled out with this assumption in mind. That is, we would not support tiny denormalized numbers.

Exponents are signed bytes. Mantissa could be signed or unsigned, depending on the final implementation, but I'm assuming signed to simplify the logic. The 7x7 LUT table in the cartridge would only see the bottom seven bits.

// Multiply: Exponent (one byte), Mantissa (one byte, upper is always 0)
// C = A * B
Cexp = Aexp + Bexp // overflow results in max saturation or 255
Cman = Aman * Bman // Multiply LUT (7x7) access from ROM cart
Norm = CountLeadingZeroes(Cman) // number of zeroes should be 1 or none, so this step may be simplified
Cexp = Cexp - Norm // the exponent absorbs the normalization shift
Cman = LeftShift(Cman, Norm)
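To make the multiply concrete, here is a minimal C sketch of a 7x7 table multiply. It is not a definitive layout: it assumes each table byte holds the 7-bit result mantissa in bits 6..0 and an "exponent carry" flag in bit 7, it truncates rather than rounds, and it handles positive values only. The names `mul_lut`, `build_mul_lut`, `mini_float` and `mf_mul` are hypothetical.

```c
#include <stdint.h>

/* 128 x 128 entries = 16 KB; on the console this would sit in cartridge ROM. */
static uint8_t mul_lut[128 * 128];

void build_mul_lut(void)
{
    for (unsigned a = 0; a < 128; a++) {
        for (unsigned b = 0; b < 128; b++) {
            /* (1.a) * (1.b) with the implied 1 restored, scaled by 2^14 */
            unsigned p = (128u + a) * (128u + b);
            uint8_t entry;
            if (p >= 32768u)   /* product in [2,4): shift down one, flag the carry */
                entry = (uint8_t)(0x80u | ((p >> 8) & 0x7Fu));
            else               /* product in [1,2): already normalized */
                entry = (uint8_t)((p >> 7) & 0x7Fu);
            mul_lut[(a << 7) | b] = entry;
        }
    }
}

/* Assumed format: signed exponent byte plus 7 stored mantissa bits. */
typedef struct { int8_t exp; uint8_t man; } mini_float;

mini_float mf_mul(mini_float x, mini_float y)
{
    uint8_t e = mul_lut[((x.man & 0x7Fu) << 7) | (y.man & 0x7Fu)];
    int exp = x.exp + y.exp + (e >> 7);   /* bit 7 of the entry bumps the exponent */
    if (exp > 127)  exp = 127;            /* clamp the exponent instead of wrapping */
    if (exp < -128) exp = -128;
    mini_float r = { (int8_t)exp, (uint8_t)(e & 0x7Fu) };
    return r;
}
```

After `build_mul_lut()`, multiplying 1.5 by 1.5 (exponent 0, mantissa 64 on both sides) returns exponent 1 and mantissa 16, which decodes to 1.125 * 2 = 2.25. Packing the carry into bit 7 means the multiply needs no leading-zero count at run time, at the cost of one bit of table output.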

// Add: Exponent (one byte), Mantissa (one byte, upper is always 0)
// D = A + B
Dexp = Greater(Aexp, Bexp)
DexpOV = Dexp + 1 (a large minority of adds lead to overflow, requiring that the exponent be incremented)
Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent

Since geometry transforms are made up of four multiplies and 12 multiply-adds, we may be able to save ourselves a normalization and rounding step by using a single function for both as modern hardware does.

Let's find errors in logic before we start optimizing first, then look for features in the Z80 that allow us to accomplish some of these tasks for free.

Since the 16KB LUT table delivers 8 bits of results, we could return either the 7-bit fraction with the upper bit showing an overflow and losing the ULP (unit in the last place) or preserve the ULP and always shift up by one bit, letting us use the free overflow detection to increment the Exponent. The latter approach might be a bit slower, but would preserve some of the already borderline accuracy of these 16-bit fp numbers.

Here are some optimized Z80 routines for 24 and 32 bit fp numbers for the TI 83/84's.
https://www.ticalc.org/archives/files/fileinfo/472/47243.html

Accurate subtraction (or adding negative numbers) is basically impossible because there are so many subtractions where A and B are close enough together to lose 2, 3, or even 4 bits of accuracy in a single operation.

I'm proposing that we use something similar to bfloat16 (top image in attachment), since it's already in wide use, has well tested routines and code, and can be operated on in two efficient byte operations in the Z80.
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

bfloat16.png (31.44 KB)

maxxoccupancy wrote

// Add: Exponent (one byte), Mantissa (one byte, upper is always 0)
// D = A + B
Dexp = Greater(Aexp, Bexp)
DexpOV = Dexp + 1 (a large minority of adds lead to overflow, requiring that the exponent be incremented)
Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent

I'm not an expert on the subject but I do believe this doesn't work. I suspect it would give something like 100 + 1 = 200

did anyone see that toy story coding secrets video?

Maraakate wrote

did anyone see that toy story coding secrets video?

Yes...?

sverx wrote

maxxoccupancy wrote

// Add: Exponent (one byte), Mantissa (one byte, upper is always 0)
// D = A + B
Dexp = Greater(Aexp, Bexp)
DexpOV = Dexp + 1 (a large minority of adds lead to overflow, requiring that the exponent be incremented)
Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent

I'm not an expert on the subject but I do believe this doesn't work. I suspect it would give something like 100 + 1 = 200

A = 1.0000000 * 2^0000001 // 2
B = 1.0000000 * 2^0000000 // 1
A + B should be 3, or
C = 1.1000000 * 2^0000001

Dexp = Greater(1, 0) // 1, so Bman is right-shifted by Aexp - Bexp = 1
DexpOV = Dexp + 1 // binary 10, or 2 in decimal; only used if the mantissa add overflows
Dman = 1.000 + 0.100{0} // Dman gets 1.100{0}, no overflow, so the exponent stays Dexp

If we drop/truncate {0} in the Guard bit, we end up with
1.100 * 2^0001
11.00, which is 3 in decimal

Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent.

Does anyone know any hacks with Z80 Assembly to preserve the values in the lower order bits that might otherwise be lost when they're shifted out?

I suppose we could just perform both adds, one using the Aexp and another for Bexp, then using one branch for rounding to +/- infinity if the {Guard and Round} bits are 1 and 1. Truncation (dropping these low order bits) is fastest and usually accurate enough with an effectively 17-bit number.

maxxoccupancy wrote

Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent.

Yes. So the whole process would be
- find what's the max exponent and who provided that
- right shift the mantissa of the other number a number of times that it's the difference of the two exponents
- add the two mantissa
- if there's carry, increment the exponent

then there's the infinity problem, how should it be tackled?
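The steps listed above can be sketched in C as follows. This is only an illustration under assumptions: a signed exponent byte, 7 stored mantissa bits with an implied leading 1, positive values only, truncation of shifted-out bits, and saturation on exponent overflow. `mini_float` and `mf_add` are hypothetical names.

```c
#include <stdint.h>

typedef struct { int8_t exp; uint8_t man; } mini_float;

mini_float mf_add(mini_float a, mini_float b)
{
    /* Step 1: make 'a' the operand with the larger exponent. */
    if (a.exp < b.exp) { mini_float t = a; a = b; b = t; }

    /* Restore the implied 1 so both fractions read 1.xxxxxxx in 8 bits. */
    unsigned fa = 0x80u | (a.man & 0x7Fu);
    unsigned fb = 0x80u | (b.man & 0x7Fu);

    /* Step 2: right-shift the smaller operand by the exponent difference. */
    unsigned shift = (unsigned)(a.exp - b.exp);
    fb = (shift < 8) ? (fb >> shift) : 0;     /* low bits are simply dropped */

    /* Step 3: add the aligned fractions. */
    unsigned sum = fa + fb;
    int exp = a.exp;

    /* Step 4: on carry out of bit 7, renormalize and bump the exponent. */
    if (sum & 0x100u) { sum >>= 1; exp++; }

    /* Saturate to the largest representable value rather than wrapping. */
    if (exp > 127) { exp = 127; sum = 0xFFu; }

    mini_float r = { (int8_t)exp, (uint8_t)(sum & 0x7Fu) };
    return r;
}
```

For example, 100 is exponent 6 with mantissa 72 (1.5625 * 64) and 1 is exponent 0 with mantissa 0; `mf_add` returns exponent 6 and mantissa 74, which decodes to 101. The variable shift is the part that has no direct table equivalent, and on a Z80 it would be a loop or a jump into a run of shift instructions.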

sverx wrote

maxxoccupancy wrote

Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent.

Yes. So the whole process would be
- find what's the max exponent and who provided that
- right shift the mantissa of the other number a number of times that it's the difference of the two exponents
- add the two mantissa
- if there's carry, increment the exponent

then there's the infinity problem, how should it be tackled?

I should've been even more precise: The fraction is the 1.xxxxxxx, whereas the mantissa is just the .xxxxxx, where the '1' is implied.

For addition and subtraction, you do have to calculate
Aexp - Bexp

If the result is negative, then B is larger, so we branch and rightshift Aman.

If performance is all that matters, we can require ordered operations such that Aexp must be >= Bexp.

Infinity is represented as maximum saturation (usually all 1's) in graphics calculations. This turns out to produce acceptable results in generating images, though it fails in most scientific calculations and simulations.

An even simpler implementation would use a custom 8-bit format where the lower 7 bits are used only for the LookUp Tables, and that's what I'd originally proposed.

nVidia's FP8 (E4M3) is a well studied format where the limitations are pretty well known. Since our output is 256x192, the inaccuracies that occur with such short precision are likely to be less noticeable.

The entire calculation is already worked out in the 7x7 (16KB) LookUp Table for even functions, including division, square root, and trigonometric functions.

The ideal would be to use two 16KB (7x7) LookUp Tables, one for the exponent and one for the fraction. However, no simple solution comes to mind. Once you know that the exponent of one is larger than the other, the number with the smaller exponent must be shifted down by the difference between them, by definition.

The alternative is to create a completely custom format or use 16-bit fixed point numbers, forcing four different lookups from the ROM cart for each multiply, followed by several adds and carries.

maxxoccupancy wrote

The alternative is to create a completely custom format or use 16-bit fixed point numbers

the Z80 will handle 16 bit fixed point numbers just fine... well, at least additions will be pretty fast
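For comparison, here is a minimal C sketch of 16-bit fixed point, assuming an unsigned 8.8 layout (the thread does not fix a layout, so that split is an assumption). The multiply is built from four 8x8 partial products, which is where the four lookups per multiply mentioned earlier would come from if each product were read from a table. `fix8_8`, `fix_add` and `fix_mul` are hypothetical names.

```c
#include <stdint.h>

typedef uint16_t fix8_8;   /* unsigned 8.8 fixed point: value = raw / 256 */

/* A 16-bit add is one instruction on the Z80; this version wraps on overflow. */
fix8_8 fix_add(fix8_8 a, fix8_8 b)
{
    return (fix8_8)(a + b);
}

fix8_8 fix_mul(fix8_8 a, fix8_8 b)
{
    uint8_t ah = (uint8_t)(a >> 8), al = (uint8_t)(a & 0xFFu);
    uint8_t bh = (uint8_t)(b >> 8), bl = (uint8_t)(b & 0xFFu);

    /* Four 8x8 partial products; each one could be a table lookup. */
    uint32_t hh = (uint32_t)ah * bh;
    uint32_t hl = (uint32_t)ah * bl;
    uint32_t lh = (uint32_t)al * bh;
    uint32_t ll = (uint32_t)al * bl;

    /* Recombine into the full 32-bit product (16.16), then drop back to 8.8. */
    uint32_t p = (hh << 16) + ((hl + lh) << 8) + ll;
    return (fix8_8)(p >> 8);   /* truncates; integer overflow above 255 wraps */
}
```

As a usage example, 1.5 is `0x0180` and 2.5 is `0x0280`; `fix_mul` returns `0x03C0` (3.75) and `fix_add` returns `0x0400` (4.0). The add is the cheap part; the recombination of partial products is the "several adds and carries".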

sverx wrote

maxxoccupancy wrote

The alternative is to create a completely custom format or use 16-bit fixed point numbers

the Z80 will handle 16 bit fixed point numbers just fine... well, at least additions will be pretty fast

So how about 14-bit fixed point numbers plus one for the sign, then another bit for overflows. That is, if we're anticipating overflows (based on the scale to be used), then the second bit in the number is reserved to show that the number has exceeded MaxSat, or Maximum Saturation.

Using 14-bit numbers, we can also use 16KB LookUp Tables for reciprocals, square root, RecipSqrt, Sin, Cos, Tan, etc, where you have only a single operand and an 8-bit output. Each of these mathematical operations could then use separate Upper and Lower output tables, so a dozen of these one-operand functions would consume just:

12 functions * 2 tables * 16KB = 384KB or about 3 megabits, plus the most frequently used fp operation, multiply.

Fixed Point 15-bit numbers can capture most of the accuracy needed for geometry transforms, object physics, and even DSP functions.
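A single-operand table of this kind might look like the following C sketch, using the reciprocal as the example. The fixed-point split is an assumption made only for illustration: the 14-bit input is read as 6.8 and the 16-bit result as 8.8, stored as separate upper and lower byte tables of 16 KB each. `recip_hi`, `recip_lo`, `build_recip_tables` and `fix_recip` are hypothetical names.

```c
#include <stdint.h>

static uint8_t recip_hi[1u << 14];   /* upper result bytes, 16 KB */
static uint8_t recip_lo[1u << 14];   /* lower result bytes, 16 KB */

void build_recip_tables(void)
{
    for (uint32_t u = 0; u < (1u << 14); u++) {
        /* input value = u / 256, so 1 / value in 8.8 is 65536 / u */
        uint32_t r = (u == 0) ? 0xFFFFu : 65536u / u;
        if (r > 0xFFFFu)
            r = 0xFFFFu;             /* saturate instead of overflowing */
        recip_hi[u] = (uint8_t)(r >> 8);
        recip_lo[u] = (uint8_t)(r & 0xFFu);
    }
}

uint16_t fix_recip(uint16_t u14)
{
    uint16_t i = u14 & 0x3FFFu;      /* only the low 14 bits index the tables */
    return (uint16_t)((recip_hi[i] << 8) | recip_lo[i]);
}
```

After `build_recip_tables()`, `fix_recip(512)` (2.0) returns `0x0080` (0.5), and an input of 0 returns the saturated value `0xFFFF`. On the console the two tables would sit in separate 16 KB banks, so each lookup also pays for the bank switch.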

I think it would surely be an interesting exercise, and I think you should probably learn Z80 asm and try to implement the basic operations yourself. I would say with your background it would probably just take a few weeks to get the grasp of how the Z80 processor works, after all it's a pretty simple processor and doesn't have many complex features like modern ones like out of order execution, speculative execution, advanced pipelining, etc... there's even no cache so you get exactly what you code.

heck, just write it up in C first to prove it works. Then worry about making it fast.

sverx wrote

I think it would surely be an interesting exercise, and I think you should probably learn Z80 asm and try to implement the basic operations yourself. I would say with your background it would probably just take a few weeks to get the grasp of how the Z80 processor works, after all it's a pretty simple processor and doesn't have many complex features like modern ones like out of order execution, speculative execution, advanced pipelining, etc... there's even no cache so you get exactly what you code.

Just the opposite. I learned dynamic scheduling, dynamic branch prediction, caching, etc, after I left college. All of these features are abstracted away from the programmer and are now available even in x86 implementations.

After looking into Z80, it's actually really complex if you want to fully optimize the code and build performance fp code out of its faster primitives. You can't build high performance libraries without cycle counting instructions and picking data types ahead of time that are just the right fit.

The chip is capable of about 600,000 instructions per second, but that drops off dramatically without careful selection of algorithms and data types.

I really need someone who understands the inner workings of the chip and the best possible use of its unusual upper and lower register sets.

I need programmers who can figure out how many random/sequential lookups per second we can get out of the game ROM.

I need to know if we could use 14-bit mantissa to create a 15-bit fraction and 2-bit exponent (using a trap for overflows), because a lot of mathematical functions become a lot easier if we can do that.

The simple theory of using the lower 14 bits to select from a 16K-entry LUT is not hard, and we could get fp performance on the order of 100 KFlops, which

"The fastest data copying is 10.5 cycles per byte on Z80 with no address limits using the stack pointer:"
https://retrocomputing.stackexchange.com/questions/5748/comparing-raw-performanc...

Since the proposed approach to performing fp using LUTs is basically a data copy plus 2-6 additional instructions, the most that we could realistically hope for would be about 200 KFlops, so maybe 5,000 polygons/second (160 polygons/frame at 30fps) on the best day ever using sprite engine rendering in the 16-bit VDP.

Virtua Racing's SVP was supposed to be able to generate 20,000 polygons/sec, but we never saw more than 9,000.

Author	Message
maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Mon Mar 28, 2022 11:48 pm
	SavagePencil wrote Pretty sure this is how the NES version of Hard Drivin' worked: I think you do still need your main loop defined. There's player input, there's buffers to fill in ROM and RAM for the VDP to start pulling, and there are a LOT of timing hazards here. I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (or could be rewritten to do so) and add in trees, billboards, signs, stands, and other structures that fly past at Formula One speeds (100kph through turns, 200kph on straightaways). Then color cyle the race track itself to give the player a wrap around (canyon and dome) feel of rapid motion. This seems to be pretty near the SMS's actual limits, even using prerendered tiles, precalculated sprite-texture tables, HiFi audio, digitized backgrounds, and 8x16 sprites that gradually move away from eachother to give the appearance that buildings, signs, billboards, stands, and trees are getting closer.

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Tue Mar 29, 2022 12:03 am
	Remember, we must build the algorithm around the hardware that is actually there. That means that, only able to display 64 8x16 sprites per screen, with another race car using up 6 of those at one time, we're already down to 58 sprites. The race track and barriers would all be done with background tiles and well proven techniques like what we see in Hang On. That means that any new 3D effects need to fill two areas about 120x96 with sprites that would only fill about 64x96. Using background tiles to cover large parts of that area with solid or dithered colors (a large section of buildings, forest, canyons on one side with a few sprite columns to provide a feeling of scaling and rapid movement), we'd only have 5-6 columns on the other side to draw anything interesting that doesn't use background tiles as a safety net. And we can only stream, IIRC, another 91 tiles per frame, 13x7 or 9x10 per frame. Even flipping those horizontally, that's not enough to fill out the screen. It's barely enough to give the player that canyon and dome feeling of being surrounded by rapidly moving objects.

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Tue Mar 29, 2022 3:30 am
Kagesan wrote

It seems you still haven't made up your mind what exactly you want to achieve, just that it's supposed to be something that's vaguely 3D-ish. However, defining a clear goal is the very first step if you want to get somewhere. Outside of a demo context, effects are not an end in itself, so if you want to put them in the context of a game, you need to make sure that the game side of your project works first and foremost.

You need to get away from those Coding Secrets videos. The techniques presented there are not universally reusable and tailored very specifically to the Mega Drive. The Master System has a very different architecture, which you should be aware of by now after several people have explained it to you in detail, but which you choose to ignore.

Of the effects shown, the floor could be easily done on the SMS, but the furniture would need to use up a lot of sprites you'd probably want to use elsewhere for more game-relevant things, like, you know, displaying a player character or some enemies. Those 3D walls? Forget it. The best you can hope for is something like the dungeons in Phantasy Star, but those are basically prerendered animations, and even those only work at the speed required through some crazy optimizations, one of which is the sparseness of their design.

But having this conversation go around in yet another circle is pointless. Since you seem determined to think that us naysayers are just too narrow-minded and don't know what we're talking about while you have it all figured out, I suggest you get to work and prove us wrong. That code's not going to write itself, and I think it's clear by now that no one else is willing to do it for you either. In the meantime, "we" are eagerly waiting for your results. Good luck!

This is all very sad to me, because this forum is one of the last "pure" places on the internet.
In the years I've been here there has never really been any drama and tons of great information and experienced people who are willing to help when they can. However, I do feel a bit angry at myself for getting frustrated with this guy. As you have said, and others, he needs to write some code. Just do it, experiment, anything. All this conjecture going around in more circles is annoying. It was annoying last week, and even more so now. No advice is being absorbed. It's a real shame; I make an effort to read every new thread and reply, because it's a community where the daily activity is just the right amount that it's actually feasible to do so. I may not understand every last thing in the dev section as my ASM knowledge is novice x86, but it's still an interesting read and great to see the knowledge living on. It's sad that for the past two weeks I come here to see what's new and it's the same result in this thread, just paraphrased. I want to be clear, I am not trying to be mean. But maxx, it's good to have enthusiasm and to throw new ideas out there... but nobody will write this for you. A "team" will not happen until you show some promising progress. Maybe you are young? When I was young I used to think I was gonna write duke3d mods on a "team" and clearly that was silly and it's something I had to learn. Write some code, any code, for the SMS. Start building on those mistakes. When you start coming here with interesting code that can be refined it will attract attention and someone will be interested in working with you. But, at this point in the conversation, it is going nowhere and is a frustrating experience all around.

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Tue Mar 29, 2022 7:23 am
maxxoccupancy wrote

This seems to be pretty near the SMS's actual limits

this is as near to SMS limits as the moon.

asynchronous Joined: 14 Aug 2000 Posts: 746 Location: Adelaide, Australia	advanced mathematics for 3d polygons on the SMS Posted: Tue Mar 29, 2022 11:57 am
I would love to be blown away to see the Master System hitting new heights using modern tools and modern computer science.....but I'm struggling to see it here. The Z80 is slow at any kind of math due to its single accumulator and low register count. Even using LUTs is slow. And it only gets worse when you start doing meaningful things like handling data structures and program loops etc.

To show what I mean, here is a simple hypothetical block of code to perform multiplication of 2 FP8 numbers using a LUT:

; LUT-based FP8 Multiplication
; Takes in two FP8 numbers and returns the FP8 multiplication result using a LUT
; LUT is 64kB starting on a 16kB boundary
; HL = the two FP8 numbers to be multiplied
; A = the FP8 multiplication result
MultiplyFP8LUT:
    LD A,H             ; work out which 16kB segment of the 64kB LUT to map in to $8000-$BFFF
    RLCA               ; rotate the top two bits of H down into bits 1-0
    RLCA
    AND $3
    ADD A,LUTBasePage
    LD ($FFFF),A       ; map in the 16kB LUT segment
    LD A,H             ; form the pointer
    AND $3F
    OR $80
    LD H,A
    LD A,(HL)          ; FP8 multiplication result is in A

This block of code takes 68 CPU cycles to execute. That's 3 calculations per scanline, or 640 calculations per 192-line frame. I imagine by the time you add program and data structures around this it will probably be half that rate. That ain't great. :(
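The address arithmetic in that routine can be modelled in C to see what it selects. This is a sketch of the same idea, not a replacement for the assembly: the struct and function names are hypothetical, and it assumes the first operand sits in H (the high index byte), the second in L, and that slot 2 is mapped at $8000-$BFFF.

```c
#include <stdint.h>

typedef struct {
    uint8_t  bank;   /* page number that would be written to the mapper */
    uint16_t addr;   /* CPU address inside the $8000-$BFFF window */
} lut_ref;

lut_ref fp8_lut_locate(uint8_t x, uint8_t y, uint8_t lut_base_page)
{
    lut_ref r;
    /* The top two bits of the high byte pick one of four 16 KB pages. */
    r.bank = (uint8_t)(lut_base_page + (x >> 6));
    /* The remaining 14 bits are the offset, rebased onto $8000. */
    r.addr = (uint16_t)(0x8000u | ((unsigned)(x & 0x3Fu) << 8) | y);
    return r;
}
```

For instance, with operands `0xC5` and `0x12` and a base page of 4, the lookup lands in page 7 at address `0x8512`. In C this is two lines; on the Z80 every mask, rotate and store is a separate instruction, which is where the cycle count comes from.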

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Tue Mar 29, 2022 2:09 pm
	probably even a FP8 sum would have to be performed by using a LUT, and that would require the same amount of code, so at the end your Z80 would be barely able to perform at most around 880 (on NTSC) 'operations' per frame, meaning with 100% usage on this, so no CPU left for anything else. I suggest sticking to 16 bit fixed point. An addition is 11 cycles.
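The per-frame figure can be reproduced with a few lines of C. The inputs are assumptions, not measurements: 262 scanlines per NTSC frame, 228 Z80 cycles per scanline, and the 68-cycle lookup quoted above. `ops_per_frame` is a made-up name.

```c
/* Operations that fit in one frame if every CPU cycle is spent on them. */
unsigned ops_per_frame(unsigned scanlines, unsigned cycles_per_line,
                       unsigned cycles_per_op)
{
    return (scanlines * cycles_per_line) / cycles_per_op;
}
```

`ops_per_frame(262, 228, 68)` gives 878, in line with "around 880", and restricting it to the 192 active lines gives 643.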

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Tue Mar 29, 2022 2:34 pm
sverx wrote

probably even a FP8 sum would have to be performed by using a LUT, and that would require the same amount of code, so at the end your Z80 would be barely able to perform at most around 880 (on NTSC) 'operations' per frame, meaning with 100% usage on this, so no CPU left for anything else.

and what to do with that result of multiplication? :) you need to convert that value into the normal number with another table. also fp8 tolerance does not nearly allow to perform all the required calculations. here are ALL possible values for FP8 format: https://en.wikipedia.org/wiki/Minifloat#All_values_as_decimals

one can use FP8 as a delta, but you must have at least single (or better double) value as accumulator.

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Tue Mar 29, 2022 2:37 pm
toxa wrote

and what to do with that result of multiplication? :) you need to convert that value into the normal number with another table. also fp8 tolerance does not nearly allow to perform all the required calculations. here are ALL possible values for FP8 format: https://en.wikipedia.org/wiki/Minifloat#All_values_as_decimals

wow, this seems to be pretty useless then, at least in all the cases I can think of...

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Tue Mar 29, 2022 3:14 pm
as said before: https://www.smspower.org/forums/18937-AdvancedMathematicsFor3dPolygonsOnTheSMS?s...

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Tue Mar 29, 2022 5:29 pm
Is it worth anyone's time coding an actual demo if there are experienced people here who already know about some of the obstacles? For example, someone reminded me that banks were 16KB--not 64KB as I'd read somewhere else. I looked it up, and sure enough, 16KB. That's the problem with being a newb. We have to keep asking questions about what the limitations of the hardware are.

However, I don't agree with the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second--compared to 1.2 million for the 68000 on the Megadrive. That's roughly 9,000 instructions per frame. Since we only have to move about 90 tiles and draw 64 sprites per frame, that isn't terrible and has made pseudo 3D possible on the earliest SMS games like Space Harrier and Outrun.

While more accurate placement and better 3D effects can be achieved using floating point to place groups of sprites, I feel like this thread has pretty well exhausted itself. Though I've learned a ton here about what can't be done on SMS1 hardware, I'm going to heed the advice of many and start a P.O.C. thread for a much, much simpler project that I'd started out with. Send me a PM if anyone would like to go further on this race track game, but I don't see enough enthusiasm to continue this thread to complete an actual project.

segarule Joined: 23 Jan 2010 Posts: 445	Posted: Tue Mar 29, 2022 6:17 pm
maxxoccupancy wrote

Is it worth anyone's time coding an actual demo if there are experienced people here who already know about some of the obstacles? For example, someone reminded me that banks were 16KB--not 64KB as I'd read somewhere else. I looked it up, and sure enough, 16KB. That's the problem with being a newb. We have to keep asking questions about what the limitations of the hardware are. However, I don't agree with the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second--compared to 1.2 million for the 68000 on the Megadrive. That's roughly 9,000 instructions per frame. Since we only have to move about 90 tiles and draw 64 sprites per frame, that isn't terrible and has made pseudo 3D possible on the earliest SMS games like Space Harrier and Outrun. While more accurate placement and better 3D effects can be achieved using floating point to place groups of sprites, I feel like this thread has pretty well exhausted itself. Though I've learned a ton here about what can't be done on SMS1 hardware, I'm going to heed the advice of many and start a P.O.C. thread for a much, much simpler project that I'd started out with. Send me a PM if anyone would like to go further on this race track game, but I don't see enough enthusiasm to continue this thread to complete an actual project.

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Tue Mar 29, 2022 7:07 pm
	Also, just wanted to mention that reducing number of instructions is good, but if the instructions take longer than the cycle count of your current code it is not always better. This may sound obvious, but I didn't know this and apparently was (is?) a common misconception. I learned this from Michael Abrash's Graphics Programming Black Book. The book is now available for free on the web, or you can pay some $$$ to get a physical copy. It's a good read if you're really interested in graphics tricks as well as optimization. Obviously, it will be geared towards x86 so not everything will be applicable.

theNestruo Joined: 07 Jul 2021 Posts: 7	Posted: Tue Mar 29, 2022 7:45 pm
maxxoccupancy wrote

I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (...) using prerendered tiles (...) that gradually move away from eachother to give the appearance that buildings, signs, billboards, stands, and trees are getting closer.

Prerendered tiles? Out Run Europa. See Turbo Charge (C64) or Power Drift (C64) for similar tile-based engines.

The problem with Out Run Europa engine is that scenery moves too linearly and too fast: visual distance = 5 4 3 2 1.

In my opinion, using a "more progressive" visual distance, where far scenery moves slower and keeps longer in the screen, the result would look much better: visual distance = 5 5 5 5 4 4 4 3 3 2 1.

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Tue Mar 29, 2022 9:04 pm
maxxoccupancy wrote

For example, someone reminded me that banks were 16KB--not 64KB as I'd read somewhere else.

Banks can not be 64K on any Z80 system because 64K is a whole possible address space of the CPU. No exceptions.

maxxoccupancy wrote

However, I don't agree with the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second--compared to 1.2 million for the 68000 on the Megadrive.

68000 is a 16-bit CPU (if we count by bitness of ALU) with 32-bit instruction set. You can not compare it with Z80 simply BECAUSE.

It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing.

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Tue Mar 29, 2022 10:01 pm
theNestruo wrote

maxxoccupancy wrote

I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (...) using prerendered tiles (...) that gradually move away from eachother to give the appearance that buildings, signs, billboards, stands, and trees are getting closer.

Prerendered tiles? Out Run Europa. See Turbo Charge (C64) or Power Drift (C64) for similar tile-based engines. The problem with Out Run Europa engine is that scenery moves too linearly and too fast: visual distance = 5 4 3 2 1. In my opinion, using a "more progressive" visual distance, where far scenery moves slower and keeps longer in the screen, the result would look much better: visual distance = 5 5 5 5 4 4 4 3 3 2 1.

There are three ways to index for distance. One is to just use the actual distance between planes, as most of the 1980s sprite super scalers did in the arcades. Another, like modern 3D raster engines, is to use trigonometry to calculate the distance. The third approach is to set the distance according to the scale factor that you're going to use.

Using the race track example, we might cut the race track into a set of lines or slices along the track. Each object along the track has its own slice or distance. Using a 16-bit index, we could subtract the two:

uint_fast16_t distance = distObject - distCar

We could then scale something up by the square root of the distance, or we could use a hash table

distance  address
3         0x80
4         0xA0
5         0xC0

and so forth.

Regarding the Z80's 520,000 instructions per second and the 68k's 1.2 MIPS, it's true that the 68k had wider registers and a larger ALU with more instructions. However, for the address calculations and memory moves, the Z80A runs at just under half the speed of the 68k.

The SMS VDP is a 14MHz 16-bit RISC coprocessor that operates like many other coprocessors out there: controlled through registers rather than having its own decoder, branch unit, etc. There are many coprocessors still in use that operate this way.
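The subtract-then-look-up idea from this post could be sketched in C as below. Only the three table rows given in the post are included; what the "address" column points at is not stated, so the sketch just returns the byte. `scale_entry`, `scale_table` and `lookup_scale` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t distance; uint8_t address; } scale_entry;

/* Only the rows quoted in the post; any further rows would be guesses. */
static const scale_entry scale_table[] = {
    { 3, 0x80 }, { 4, 0xA0 }, { 5, 0xC0 },
};

/* Returns 1 and writes *address when the distance has an entry, else 0. */
int lookup_scale(uint16_t dist_object, uint16_t dist_car, uint8_t *address)
{
    /* 16-bit subtraction wraps modulo 65536, which suits a looping track. */
    uint16_t distance = (uint16_t)(dist_object - dist_car);
    for (size_t i = 0; i < sizeof scale_table / sizeof scale_table[0]; i++) {
        if (scale_table[i].distance == distance) {
            *address = scale_table[i].address;
            return 1;
        }
    }
    return 0;
}
```

An object at slice 105 with the car at slice 100 yields `0xC0`. With a dense table the search would likely become a direct array index, which is a cheaper operation on a Z80 than a search or a square root.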

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Wed Mar 30, 2022 1:36 am
	toxa wrote It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing. 🤡

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Wed Mar 30, 2022 1:50 am
	Maraakate wrote toxa wrote It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing. 🤡 My specialization is RISC pipelines. Admittedly, the Z80 ISA is very different, and optimization techniques are very different, but quasi-personal attacks don't do anything but send out an email notification to everyone on this thread, and they come in and see that someone is trashing the newb rather than offering anything constructive. The actual mini demo concept has moved to its own thread to avoid confusion with the advanced floating point, trigonometric operations, LookUp Tables, and complex topics that are better left in this thread. If you don't have anything constructive to say, don't post anything.

ichigobankai Joined: 04 Jul 2010 Posts: 542 Location: Angers, France	Posted: Wed Mar 30, 2022 6:15 am
	Quote If you don't have anything constructive to say, don't post anything. Does this also apply to you? As the saying goes ; well-ordered charity begins with oneself.

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Wed Mar 30, 2022 6:35 am
	maxxoccupancy wrote The SMS VDP is a 14MHz 16-bit RISC coprocessor that operates like many other coprocessors out there: controlled through registers rather than having its own decoder, branch unit, etc. There are many coprocessors still in use that operate this way. Why do you keep repeating this mantra? So what? How that will help you to render 3D scenes? How 14MHz help you? You can not write to VDP faster than each 29 (twenty nine!) t states and only one byte at once. If you need random access, speed drops DRAMATICALLY. How 16-bit help you? All user visible registers are 8-bit and the only internal 16-bit VRAM pointer should be written in two iterations and can not be even read back... How RISC help you? You can not write programs for it like shaders or whatever. All in all - VDP is a very limited and slow graphics chip with inconvenient and slow access that is intended for rendering of tiled 2D graphics.
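To put the 29 T-state figure from the post above into numbers, here is a rough sketch in C. The 228 Z80 cycles per scanline and 262 scanlines per NTSC frame are assumed timing figures that are not stated in the thread, and the function name is made up for illustration. The result is an upper bound that ignores all loop and address-setup overhead.

```c
#include <stdint.h>

/* Assumed NTSC timing (not stated in the thread). */
#define CYCLES_PER_LINE 228u
#define LINES_PER_FRAME 262u

/* Upper bound on single-byte VRAM writes in one frame, given the
   minimum number of Z80 cycles between two writes. */
uint32_t max_vram_writes_per_frame(uint32_t cycles_per_write)
{
    return (CYCLES_PER_LINE * LINES_PER_FRAME) / cycles_per_write;
}
```

With 29 cycles per write this gives 2059 bytes per frame, or about 64 tiles at 32 bytes per tile, before any game logic runs.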

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Wed Mar 30, 2022 12:17 pm
	maxxoccupancy wrote Maraakate wrote toxa wrote It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing. 🤡 My specialization is RISC pipelines. Admittedly, the Z80 ISA is very different, and optimization techniques are very different, but quasi-personal attacks don't do anything but send out an email notification to everyone on this thread, and they come in and see that someone is trashing the newb rather than offering anything constructive. The actual mini demo concept has moved to its own thread to avoid confusion with the advanced floating point, trigonometric operations, LookUp Tables, and complex topics that are better left in this thread. If you don't have anything constructive to say, don't post anything. Specialization? You went to school for this? Are you currently IN college for this? If yes, that would probably make a lot of sense. There are newer replies below my previous post, but toxa and ichigobankai are right. You do not listen and are just regurgitating the same things as if there were some kind of authority in what you are saying, but you have proved nothing. But go ahead, keep talking about FP8 and word sizes. Post more videos of OTHER PLATFORMS where interesting things have been done. To people like me, someone who is a software developer by profession, working in C/C++ with some light ASM, at first glance it sounds like you are saying smart things. But then we have experienced people who know this hardware inside and out and say why it's bologna and will never work, and I believe them, because they have proven multiple times over that they know what they are talking about. This is just like the Bogdanov twins! In all seriousness, it's not because you are a "newb". You can be brand new to Z80 assembly, nobody here will make fun of you for that (really!). 
But when you keep asking the same, silly thing, over and over and do not listen to why it cannot work then yes, people get fed up and either argue with, make fun of, or ignore you.

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Fri Apr 01, 2022 4:21 am
	Just for reference, this is the closest video I could find for the Pseudo 3D effect for the road and objects on the side of it for the 3D race track game. Road Rash does something like this with the rolling hills. I'm currently reading up on ways that this visual trick can be done. For the stands, signs, billboards, backgrounds, and scenery, see the other posts about placing groups of sprite-textures together.

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Fri Apr 01, 2022 9:31 am
	cool! i like this one:

furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Posted: Fri Apr 01, 2022 2:03 pm
	OP sounds like a clueless politician in his late 40's who got dropped in a position he knows nothing about, who had to cobble up various bits of vaguely related information in a hurry to try and gain the public's confidence. You're inventing problems (floating point on a Z80 to do basic 3D) to bring your own solutions (LUTs), effectively achieving nothing useful towards that initial goal. You're convinced that any state machine (VDP) can be a CPU (no it isn't). First 3D polygons, then "sprite-textures" (whatever that is... a sprite is a simple and well defined term in the 2D console world), then pseudo 3D à la Super Scaler which has hardware specifically designed to render very large amounts of scaled sprites (unlike the SMS). Then out of nowhere you mention Romero and start fantasising about a team working to build your dream of pushing things to their limits with programs written in C, or "tweaks" of ready made games like it was just a matter of turning a knob... As others have said, having high ambitions is a good thing, but keeping your feet on the ground is essential. All you have right now are clouds of dots you haven't been able to connect. Sorry man, you can't expect to be taken seriously on a tech forum while sitting at the peak of the Dunning-Kruger curve, unable to follow or express clear ideas, and on top of that being borderline arrogant. Like many others who were interested in rather obscure technical stuff, I've had my cringy know-it-all frantic micro-manager period when I was 16, but aren't you past that stage ?

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Sat Apr 02, 2022 12:16 am
	That's literally how Tomb Raider was ported to the GBA, although I didn't know about that when I brought up this approach. They even use LUTs for the more complex mathematics. Floating point and embedded development are literally my background. Sprite-textures I described earlier as sprites that are used as textures. 4MB ROMs can now hold thousands of prerendered polygons that are displayed as sprites. This shouldn't require any further explanation. The Z80 gets used to write the SAT and update the locations of those sprite-textures each frame, just as many other consoles and machines do now to render 3D graphics. The only question is whether the Z80A at 3.59MHz is powerful enough to color cycle and move 64 sprite locations in 60,000 cycles, or one frame. You're saying that the Video Display Processor is not a processor for some reason, even though it's described as a 14MHz 16-bit "Video Display Processor" that is programmed using registers, tables, and interrupts, even though that's how most graphics coprocessors are programmed--including the SMS Video Display Processor. There are more personal attacks on this thread than I've seen on almost any other forum. Others have also posted lots of examples of polygons and 3D graphics performed by the Z80 and even the SMS itself. Lead, follow, or get out of the way.

furrtek Joined: 05 Mar 2006 Posts: 53 Location: France	Posted: Sat Apr 02, 2022 1:55 am
	Quote That's literally how Tomb Raider was ported to the GBA
What does this have to do with the SMS ? Still no answer.
Quote Floating point and embedded development are literally my background.
If the only tool you have is a hammer, you will start treating all your problems like a nail.
Quote Sprite-textures I described earlier as sprites that are used as textures. 4MB ROMs can now hold thousands of prerendered polygons that are displayed as sprites.
Eager to see what jerky, slow, and jumbled pixel mess is possible with 40 8x16 pre-rendered triangles. Also, have you finally read about the 8 sprites per line limit ? Ignoring it won't make it go away.
Quote The Z80 gets used to write the SAT and update the locations of those sprite-textures each frame
Thanks for explaining what almost every single piece of software written for the SMS does.
Quote , just as many other consoles and machines do now to render 3D graphics.
No. Vertex-defined shapes in a 3D space aren't sprites. The concept of hardware sprites doesn't even exist anymore on modern GPUs. The fact that you're mixing up such things says a lot.
Quote The only question is whether the Z80A at 3.59MHz is powerful enough to color cycle and move 64 sprite locations in 60,000 cycles, or one frame.
The Z80 can do much more than change a few palette entries and write 128 bytes in the SAT in the period of one frame. But that doesn't make anything 3D.
Quote You're saying that the Video Display Processor is not a processor for some reason
Read again. I used the acronym that you used yourself: "CPU", not "processor".
Quote , even though it's described as a 14MHz 16-bit "Video Display Processor"
Yup, it's a processor alright, it processes things. A clock frequency and a bus width don't make a "processor" a CPU. The embedded dev that you are (were ?) ought to know the difference.
Quote that is programmed using registers, tables, and interrupts
VDP registers aren't a program, they're just a few configuration switches to select some options. "Configuration registers" aren't called "program memory" for a reason. The official VDP doc is like 20 pages long, read them. What you call "tables" is VRAM. Stating that "interrupts program the VDP" doesn't make any sense. Someone who has basic understanding of either or both wouldn't write that.
Quote There are more personal attacks on this thread than I've seen on almost any other forum.
I'm attacking your nonsense, not your person.
Quote Others have also posted lots of examples of polygons and 3D graphics performed by the Z80 and even the SMS itself.
So in a single, simple sentence, what are you looking for now that others have done your research work ?
Quote Lead, follow, or get out of the way.
Cheesy manager talk. You're leading nobody, and you're following no one either since you've ignored all the advice you've been given until now. So who's got to get out of the way ?

willbritton Joined: 06 Mar 2022 Posts: 689 Location: London, UK	Posted: Sat Apr 02, 2022 7:46 am
	I am having a serious crisis of conscience about whether to post this, but here goes nothing: As a newcomer to contributing to this forum (lurker for a fair bit longer), ENORMOUS appreciator of the resource that it represents and - dare I say it - sometime expert on a fair few things and a self-confessed “newb” on many, many more; I think this thread is starting to do a discredit to what appears to be an otherwise incredibly respectful and harmonious community of enthusiasts. It’s not that I don’t agree with all of the comments about the OP’s netiquette - to be clear, I do - but in the course of this conversation nothing appears to be improving, only escalating and becoming more and more unpleasant to observe. As another has alluded to, it seems highly likely that the OP is not, as some may have suspected, some impressionable teenager whose behaviour can be "corrected" with the perfect mixture of logic and chastisement, but most likely a grown adult who, by this point in life, may well be set in his ways and unlikely to respond positively in any way to what has now become a fairly constant stream of criticism, however well intentioned that criticism may ultimately be. With the greatest of respect for everyone’s views, feelings and contributions to this conversation to date; and also very mindful that I’m an outsider quite possibly with no right to say so, might I suggest that it could be time to just let this conversation quietly fade into the past and move on, for the sake of maintaining the generally welcoming and constructive ethos of this community? 🕊️

Tom Joined: 16 May 2002 Posts: 1356 Location: italy	Posted: Sat Apr 02, 2022 8:52 am
	To be honest, I thought that the OP was some kind of troll from his very first post, but I did my best to keep my mouth shut because I didn't want to be rude (since I have a history of bad behaviour on various forums and I'm trying to steer away from that). Glad that someone else took care of the elephant in the room and said what many of us have been thinking. As much as I'd love to be proven wrong, I don't think we're going to see an "advanced 3D" demo anytime soon.

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Sat Apr 02, 2022 9:43 am
	maxxoccupancy wrote That's literally how Tomb Raider was ported to the GBA but GBA has six buttons: "A","B","START","SELECT","L","R" while SMS has only two: "1" and "2"!

haroldoop Joined: 25 Feb 2006 Posts: 875 Location: Belo Horizonte, MG, Brazil	Posted: Sat Apr 02, 2022 10:11 am
	With all due respect, I'll have to quote the text from https://atariage.com/forums/topic/82555-to-all-non-programmer-idea-peddlers/
Quote To All Non-Programmer Idea Peddlers: Programmers already have more ideas than they know what to do with; without any of yours. They don't need your ideas. Probably most of them don't want your ideas. Most game programmers have more ideas already than they have time to begin, let alone complete. That said, a good, well-thought out, well-presented idea is worth looking at, always. So if you want a programmer to even consider your idea for 30 seconds, here's what you need to do:
1. Present a concrete, good idea with lots of visual aids. Writing a game takes hours and hours of work. If you want a coder to even consider, for a minute, dedicating that kind of time, you had better put in some serious time of your own preparing your idea. Time measured in hours. Make mock screenshots. Design some sprites. Learn the capabilities of the machine you want the game written for and fit your idea to them. It isn't easy to understand the Stella guide if you aren't a coder, but if you want somebody to even glance sideways at your idea you better be willing to put in the time to at least understand a little of it. Do some legwork and demonstrate it. Spend some time working out your idea on paper; playtesting it to make sure it works.
2. Present a compelling reason why a coder should take on the project other than the fact that you think it would be really cool. Does your idea fill an underserved niche in the 2600's library? Is it a completely unique concept? Are lots of people clamoring for a game of this type? Does it present a unique, fun challenge?
3. Be humble. You are asking for far, far, far more than you are giving or will ever contribute. Coders already work for pennies/hour working on their own ideas, if that. You want the programmer to do something for you, essentially for free? Don't make demands.
4. Be flexible. Be willing to put in yet more time reworking screenshots, rethinking game mechanics, designing different sprites.
The gold standard here is Adam Tierney (salstadt here at AA). Find some of the threads he started to publicize his own ideas and see what he did. See especially the Prince of Persia thread, and see how much work he put into that, over a period of weeks. You don't have to be the artist he is, but you better be willing to make up the difference in sweat.
What's a good idea? Can't give an exact definition, but here are some starters:
1. It is unique. Either absolutely unique or unique to the platform.
2. It has a tested, proven game mechanic. Which is fun.
3. It uses the capabilities of the machine it is designed for well.
4. It is fun to more people than just you.
The most important thing is to DO SOME WORK. If your idea doesn't have some mock screenshots, then it is worthless. Period.

toxa Joined: 09 Aug 2021 Posts: 140	Posted: Sat Apr 02, 2022 10:32 am
	haroldoop wrote With all due respect, we simply need more LUTs.

asynchronous Joined: 14 Aug 2000 Posts: 746 Location: Adelaide, Australia	advanced mathematics for 3d polygons on the SMS Posted: Sat Apr 02, 2022 4:29 pm
	I thought furrtek’s post was civil myself. Ultimately this is still a technical discussion about our favourite gaming console, even if its capabilities are not fully understood by all. @toxa, c’mon man, no need to LUT-shame the guy. Some people just like big LUTs. :)

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Mon Apr 04, 2022 12:53 pm
	maxxoccupancy wrote Lead, follow, or get out of the way. I totally agree. Now provide some POC, or understand where the limits are, or get out of the way. You have plenty of time to pick yours.

segarule Joined: 23 Jan 2010 Posts: 445	Posted: Sat Aug 06, 2022 6:05 pm
	Our friend, Supermaxx, would be proud this, but he is busy with F1 2022 season, lol.

TheMole Joined: 19 Oct 2012 Posts: 22	Posted: Tue Aug 09, 2022 3:52 am
	Well, not a master system, but using the tms9918a: About as impressive as I've ever seen on a computer of this vintage...

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Tue Aug 09, 2022 12:40 pm
	segarule wrote https://www.youtube.com/watch?v=qrq9GqWyBF0 Our friend, Supermaxx, would be proud this, but he is busy with F1 2022 season, lol. What's the floating point precision?

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Tue Sep 20, 2022 5:43 pm
	Maraakate wrote segarule wrote https://www.youtube.com/watch?v=qrq9GqWyBF0 Our friend, Supermaxx, would be proud this, but he is busy with F1 2022 season, lol. What's the floating point precision? If you were referring to my proposal, the precision of the Lookup Table limits us to seven bits, plus another seven bits for the exponent and another bit for the sign. That assumes a mantissa with implied '1' before the 1.xxxxxxx. That's the equivalent of 9-bit fractions. Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well written code. That would give us a respectable 5-10k polygons/sec, more than the Sega CD, but not as accurate. So, 7-bits plus a sign and implied 1, or 9-bit fraction.

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Wed Sep 21, 2022 4:32 am
	word.

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Wed Sep 21, 2022 8:11 am
	maxxoccupancy wrote Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well written code. If you can describe the steps needed to perform an addition between two of those floating point values I can code a snippet to perform that and we can see how many cycles are required. I suspect we're nowhere close to those values, since you'd need less than 60 cycles per FLOP to reach 60 kFLOPS (and that's using the whole frame time just for math).
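The cycle budget in the post above is simple division, sketched here in C. The clock constant is the NTSC Master System figure (the thread rounds it to 3.59MHz) and is an assumption of this sketch, as is the function name.

```c
#include <stdint.h>

/* Assumed NTSC Z80 clock in Hz; the thread quotes it as roughly 3.59MHz. */
#define Z80_CLOCK_HZ 3579545u

/* Whole Z80 cycles available per floating point operation if the CPU
   did nothing but math for the entire second. */
uint32_t cycles_per_flop(uint32_t flops_per_second)
{
    return Z80_CLOCK_HZ / flops_per_second;
}
```

At 60,000 operations per second the budget is 59 cycles each, and at 100,000 it drops to 35, which has to cover operand loads, the table lookup, and the store.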

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	advanced mathematics for 3d polygons on the SMS Posted: Wed Sep 21, 2022 7:11 pm
	sverx wrote maxxoccupancy wrote Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well written code. If you can describe the steps needed to perform an addition between two of those floating point values I can code a snippet to perform that and we can see how many cycles are required. I suspect we're nowhere close to those values that since you'd need less than 60 cycles per FLOP to reach 60k kFLOPS (and that's using the whole frame time just for math). I know. I'm just trying to get within an order of magnitude to see where we are. IIRC, the Atari Jaguar could render about 10,000 polygons per second, so even 3-4,000 flat shaded polygons would be impressive and enough for a basic FPS. Both operations assume that, when we load the mantissa from memory into the registers, the '1' is implied. The multiply-add and multiply LUTs results are already filled out with this assumption in mind. That is, we would not support tiny denormalized numbers. Exponents are signed bytes. Mantissa could be signed or unsigned, depending on the final implementation, but I'm assuming signed to simplify the logic. The 7x7 LUT table in the cartridge would only see the bottom seven bits. 
// Multiply: Exponent (one byte), Mantissa (one byte, upper is always 0)
// C = A * B
Cexp = Aexp + Bexp // overflow results in max saturation or 255
Cman = Aman * Bman // Multiply LUT (7x7) access from ROM cart
Norm = CountLeadingZeroes(Cman) // number of zeroes should be 1 or none, so this step may be simplified
Cexp = Cexp - Norm
Cman = LeftShift(Cman, Norm)

// Add: Exponent (one byte), Mantissa (one byte, upper is always 0)
// D = A + B
Dexp = Greater(Aexp, Bexp)
DexpOV = Dexp + 1 // a large minority of adds lead to overflow, requiring that the exponent be incremented
Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent

Since geometry transforms are made up of four multiplies and 12 multiply-adds, we may be able to save ourselves a normalization and rounding step by using a single function for both as modern hardware does. Let's find errors in logic before we start optimizing first, then look for features in the Z80 that allow us to accomplish some of these tasks for free.
Since the 16KB LUT table delivers 8 bits of results, we could return either the 7-bit fraction with the upper bit showing an overflow and losing the ULP (unit in the last place) or preserve the ULP and always shift up by one bit, letting us use the free overflow detection to increment the Exponent. The latter approach might be a bit slower, but would preserve some of the already borderline accuracy of these 16-bit fp numbers.
Here are some optimized Z80 routines for 24 and 32 bit fp numbers for the TI 83/84's. https://www.ticalc.org/archives/files/fileinfo/472/47243.html
Accurate subtraction (or adding negative numbers) is basically impossible because there are so many subtractions where A and B are close enough together to lose 2, 3, or even 4 bits of accuracy in a single operation. 
I'm proposing that we use something similar to bfloat16 (top image in attachment), since it's already in wide use, has well tested routines and code, and can be operated on in two efficient byte operations in the Z80. https://en.wikipedia.org/wiki/Bfloat16_floating-point_format bfloat16.png (31.44 KB)
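As a way to check the multiply logic before worrying about Z80 cycle counts, here is a minimal C sketch of the 7x7 table idea. It follows the first option described in the post (7-bit result with the top bit of the table byte flagging an overflow into the exponent). The type and function names, the omission of the sign bit, truncation instead of rounding, and the clamp of the exponent to the int8_t range are all assumptions of this sketch, not the poster's implementation.

```c
#include <stdint.h>

/* Hypothetical layout: signed exponent byte plus a 7-bit mantissa.
   The value represented is 1.mmmmmmm * 2^exp (implied leading 1). */
typedef struct { int8_t exp; uint8_t man; } minifloat_t;

/* 128 x 128 entries = 16 KB; on a cartridge this would live in ROM. */
static uint8_t mul_lut[128 * 128];

void mul_lut_init(void)
{
    for (unsigned a = 0; a < 128; a++) {
        for (unsigned b = 0; b < 128; b++) {
            /* (1.a * 1.b) scaled by 2^14, so the product lies in [1.0, 4.0). */
            unsigned p = (128u + a) * (128u + b);
            if (p >= 32768u)   /* product >= 2.0: renormalize and flag it */
                mul_lut[(a << 7) | b] = (uint8_t)(0x80u | ((p >> 8) & 0x7Fu));
            else
                mul_lut[(a << 7) | b] = (uint8_t)((p >> 7) & 0x7Fu);
        }
    }
}

minifloat_t fp_mul(minifloat_t a, minifloat_t b)
{
    uint8_t entry = mul_lut[((unsigned)(a.man & 0x7F) << 7) | (b.man & 0x7F)];
    int e = a.exp + b.exp + (entry >> 7);   /* flag bit bumps the exponent */

    if (e > 127)  e = 127;                  /* saturate (assumed policy) */
    if (e < -128) e = -128;

    minifloat_t c = { (int8_t)e, (uint8_t)(entry & 0x7Fu) };
    return c;
}
```

For example, 1.5 * 1.5 (mantissa 64, exponent 0 on both sides) comes back as mantissa 16 with exponent 1, i.e. 1.125 * 2 = 2.25. Because the table already returns a normalized mantissa, no count-leading-zeroes step is needed at run time.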

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Thu Sep 22, 2022 8:53 am
	maxxoccupancy wrote // Add: Exponent (one byte), Mantissa (one byte, upper is always 0) // D = A + B Dexp = Greater(Aexp, Bexp) DexpOV = Dexp + 1 (a large minority of adds lead to overflow, requiring that the exponent be incremented) Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent I'm not an expert on the subject but I do believe this doesn't work. I suspect it would give something like 100 + 1 = 200

Maraakate Joined: 05 Jun 2010 Posts: 757 Location: Pennsylvania, USA	Posted: Sat Sep 24, 2022 4:54 am
	did anyone see that toy story coding secrets video?

haroldoop Joined: 25 Feb 2006 Posts: 875 Location: Belo Horizonte, MG, Brazil	Posted: Sat Sep 24, 2022 11:45 am
	Maraakate wrote did anyone see that toy story coding secrets video? Yes...?

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Mon Sep 26, 2022 4:24 am
	sverx wrote maxxoccupancy wrote // Add: Exponent (one byte), Mantissa (one byte, upper is always 0) // D = A + B Dexp = Greater(Aexp, Bexp) DexpOV = Dexp + 1 (a large minority of adds lead to overflow, requiring that the exponent be incremented) Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent I'm not an expert on the subject but I do believe this doesn't work. I suspect it would give something like 100 + 1 = 200
A = 1.0000000 * 2^0000001 // 2
B = 1.0000000 * 2^0000000 // 1
A + B should be 3, or C = 1.1000000 * 2^0000001
Dexp = Greater(1, 0) // 1
DexpOV = 1++ (10, or 2 in decimal, forcing Bman to be rightshifted by 1)
Dman = 1.000 + 0.100{0} // Dman gets 1.100{0}
If we drop/truncate {0} in the Guard bit, we end up with 1.100 * 2^0001
11.00, which is 3 in decimal
Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent. Does anyone know any hacks with Z80 Assembly to preserve the values in the lower order bits that might otherwise be lost when they're shifted out? I suppose we could just perform both adds, one using the Aexp and another for Bexp, then using one branch for rounding to +/- infinity if the {Guard and Round} bits are 1 and 1. Truncation (dropping these low order bits) is fastest and usually accurate enough with an effectively 17-bit number.

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Mon Sep 26, 2022 7:34 am
	maxxoccupancy wrote Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent. Yes. So the whole process would be
- find what's the max exponent and who provided that
- right shift the mantissa of the other number a number of times that it's the difference of the two exponents
- add the two mantissa
- if there's carry, increment the exponent
then there's the infinity problem, how should it be tackled?
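The four steps listed above can be written out in C to check the logic before porting anything to Z80. This is a sketch for positive values only: the struct layout (signed exponent byte, 7-bit mantissa with an implied leading 1), the names, truncation of shifted-out bits, and saturating to the largest value on exponent overflow are assumptions, not something settled in the thread.

```c
#include <stdint.h>

/* Hypothetical layout: value = 1.mmmmmmm * 2^exp, sign not handled. */
typedef struct { int8_t exp; uint8_t man; } fpval_t;

fpval_t fp_add(fpval_t a, fpval_t b)
{
    /* Step 1: make 'a' the operand with the larger exponent. */
    if (b.exp > a.exp) { fpval_t t = a; a = b; b = t; }

    /* Restore the implied leading 1 so both fractions are 8-bit values. */
    unsigned fa = 0x80u | (a.man & 0x7Fu);
    unsigned fb = 0x80u | (b.man & 0x7Fu);

    /* Step 2: shift the smaller operand right by the exponent difference.
       Bits shifted out are simply dropped (truncation). */
    unsigned shift = (unsigned)(a.exp - b.exp);
    fb = (shift < 8u) ? (fb >> shift) : 0u;

    /* Step 3: add the two fractions. */
    unsigned sum = fa + fb;
    int e = a.exp;

    /* Step 4: a carry out of bit 7 means the result is >= 2.0. */
    if (sum & 0x100u) { sum >>= 1; e++; }

    /* Assumed answer to the "infinity problem": saturate at the maximum. */
    if (e > 127) { e = 127; sum = 0xFFu; }

    fpval_t r = { (int8_t)e, (uint8_t)(sum & 0x7Fu) };
    return r;
}
```

Adding 2 (exponent 1, mantissa 0) and 1 (exponent 0, mantissa 0) yields exponent 1 with mantissa 64, i.e. 1.5 * 2 = 3, and 1 + 1 takes the carry path and yields exponent 1 with mantissa 0.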

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Tue Sep 27, 2022 9:03 pm
	sverx wrote maxxoccupancy wrote Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent. Yes. So the whole process would be - find what's the max exponent and who provided that - right shift the mantissa of the other number a number of times that it's the difference of the two exponents - add the two mantissa - if there's carry, increment the exponent then there's the infinity problem, how should it be tackled? I should've been even more precise: The fraction is the 1.xxxxxxx, whereas the mantissa is just the .xxxxxxx, where the '1' is implied. For addition and subtraction, you do have to calculate Aexp - Bexp. If the result is negative, then B is larger, so we branch and rightshift Aman. If performance is all that matters, we can require ordered operations such that Aexp must be >= Bexp. Infinity is represented as maximum saturation (usually all 1's) in graphics calculations. This turns out to produce acceptable results in generating images, though it fails in most scientific calculations and simulations. An even simpler implementation would use a custom 8-bit format where the lower 7 bits are used only for the LookUp Tables, and that's what I'd originally proposed. nVidia's FP8 (E4M3) is a well studied format where the limitations are pretty well known. Since our output is 256x192, the inaccuracies that occur with such short precision are likely to be less noticeable. The entire calculation is already worked out in the 7x7 (16KB) LookUp Table for even functions, including division, square root, and trigonometric functions. The ideal would be to use two 16KB (7x7) LookUp Tables, one for the exponent and one for the fraction. However, no simple solution comes to mind. Once you know that the exponent of one is larger than the other, the number with the smaller exponent must be shifted down by the difference between them, by definition. 
The alternative is to create a completely custom format or use 16-bit fixed point numbers, forcing four different lookups from the ROM cart for each multiply, followed by several adds and carries.

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Wed Sep 28, 2022 7:01 am
	maxxoccupancy wrote The alternative is to create a completely custom format or use 16-bit fixed point numbers the Z80 will handle 16 bit fixed point numbers just fine... well, at least additions will be pretty fast

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Wed Sep 28, 2022 10:46 pm
	sverx wrote maxxoccupancy wrote The alternative is to create a completely custom format or use 16-bit fixed point numbers the Z80 will handle 16 bit fixed point numbers just fine... well, at least additions will be pretty fast So how about 14-bit fixed point numbers plus one bit for the sign, then another bit for overflows? That is, if we're anticipating overflows (based on the scale to be used), then the second bit in the number is reserved to show that the number has exceeded MaxSat, or Maximum Saturation. Using 14-bit numbers, we can also use 16KB LookUp Tables for reciprocals, square root, RecipSqrt, Sin, Cos, Tan, etc, where you have only a single operand and an 8-bit output. Each of these mathematical operations could then use separate Upper and Lower output tables, so a dozen of these one-operand functions would consume just: 12 functions * 2 tables * 16KB = 384KB, or about 3 megabits, plus the most frequently used fp operation, multiply. Fixed Point 15-bit numbers can capture most of the accuracy needed for geometry transforms, object physics, and even DSP functions.
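The split Upper/Lower table idea can be prototyped in C for one function, here square root. This is a sketch under assumptions that are not in the thread: the input is treated as an unsigned 14-bit integer, the output is 8.8 fixed point, and the table and function names are invented for illustration. On real hardware the tables would be precomputed into ROM rather than filled at startup.

```c
#include <stdint.h>

#define LUT_ENTRIES 16384u              /* 2^14, one entry per 14-bit input */

static uint8_t sqrt_hi[LUT_ENTRIES];    /* Upper output table (16 KB) */
static uint8_t sqrt_lo[LUT_ENTRIES];    /* Lower output table (16 KB) */

/* Integer square root, used only to fill the tables. */
static uint16_t isqrt32(uint32_t v)
{
    uint32_t r = 0, bit = 1u << 30;
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= r + bit) { v -= r + bit; r = (r >> 1) + bit; }
        else              { r >>= 1; }
        bit >>= 2;
    }
    return (uint16_t)r;
}

void sqrt_lut_init(void)
{
    for (uint32_t x = 0; x < LUT_ENTRIES; x++) {
        uint16_t r = isqrt32(x << 16);  /* sqrt(x) scaled by 256 (8.8) */
        sqrt_hi[x] = (uint8_t)(r >> 8);
        sqrt_lo[x] = (uint8_t)(r & 0xFFu);
    }
}

/* Two byte-wide lookups rebuild the 16-bit result. */
uint16_t lut_sqrt(uint16_t x)
{
    x &= 0x3FFFu;
    return (uint16_t)(((unsigned)sqrt_hi[x] << 8) | sqrt_lo[x]);
}

/* ROM footprint in bytes for a set of one-operand functions. */
uint32_t lut_rom_bytes(uint32_t functions, uint32_t tables_per_function)
{
    return functions * tables_per_function * LUT_ENTRIES;
}
```

After calling sqrt_lut_init(), lut_sqrt(16) returns 1024 (4.0 in 8.8) and lut_sqrt(2) returns 362 (about 1.414). lut_rom_bytes(12, 2) gives 393216 bytes, i.e. 384KB for a dozen functions with two tables each.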

sverx Joined: 05 Sep 2013 Posts: 3859 Location: Stockholm, Sweden	Posted: Thu Sep 29, 2022 8:09 am
	I think it would surely be an interesting exercise, and I think you should probably learn Z80 asm and try to implement the basic operations yourself. I would say with your background it would probably just take a few weeks to get the grasp of how the Z80 processor works, after all it's a pretty simple processor and doesn't have many complex features like modern ones like out of order execution, speculative execution, advanced pipelining, etc... there's even no cache so you get exactly what you code.

SavagePencil Joined: 23 Aug 2009 Posts: 213 Location: Seattle, WA	Posted: Thu Sep 29, 2022 3:56 pm
	heck, just write it up in C first to prove it works. Then worry about making it fast.

maxxoccupancy Joined: 05 Mar 2022 Posts: 129 Location: Seabrook, New Hampshire	Posted: Fri Sep 30, 2022 4:35 am
	sverx wrote I think it would surely be an interesting exercise, and I think you should probably learn Z80 asm and try to implement the basic operations yourself. I would say with your background it would probably just take a few weeks to get the grasp of how the Z80 processor works, after all it's a pretty simple processor and doesn't have many complex features like modern ones like out of order execution, speculative execution, advanced pipelining, etc... there's even no cache so you get exactly what you code. Just the opposite. I learned dynamic scheduling, dynamic branch prediction, caching, etc, after I left college. All of these features are abstracted away from the programmer and are now available even in x86 implementations. After looking into Z80, it's actually really complex if you want to fully optimize the code and build performance fp code out of its faster primitives. You can't build high performance libraries without cycle counting instructions and picking data types ahead of time that are just the right fit. The chip is capable of about 600,000 instructions per second, but that drops off dramatically without careful selection of algorithms and data types. I really need someone who understands the inner workings of the chip and the best possible use of its unusual upper and lower register sets. I need programmers who can figure out how many random/sequential lookups per second we can get out of the game ROM. I need to know if we could use 14-bit mantissa to create a 15-bit fraction and 2-bit exponent (using a trap for overflows), because a lot of mathematical functions become a lot easier if we can do that. The simple theory of using the lower 14 bits to select from a 16K-entry LUT is not hard, and we could get fp performance on the order of 100 KFlops. "The fastest data copying is 10.5 cycles per byte on Z80 with no address limits using the stack pointer:" https://retrocomputing.stackexchange.com/questions/5748/comparing-raw-performanc... 
Since the proposed approach to performing fp using LUTs is basically a data copy plus 2-6 additional instructions, the most that we could realistically hope for would be about 200 KFlops, so maybe 5,000 polygons/second (160 polygons/frame at 30fps) on the best day ever using sprite engine rendering in the 16-bit VDP. Virtua Racing's SVP was supposed to be able to generate 20,000 polygons/sec, but we never saw more than 9,000.
