|
Author | Message |
---|---|
|
Posted: Mon Mar 28, 2022 11:48 pm |
I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (or could be rewritten to do so) and add in trees, billboards, signs, stands, and other structures that fly past at Formula One speeds (100kph through turns, 200kph on straightaways). Then color cycle the race track itself to give the player a wrap-around (canyon and dome) feel of rapid motion. This seems to be pretty near the SMS's actual limits, even using prerendered tiles, precalculated sprite-texture tables, HiFi audio, digitized backgrounds, and 8x16 sprites that gradually move away from each other to give the appearance that buildings, signs, billboards, stands, and trees are getting closer. |
|
|
Posted: Tue Mar 29, 2022 12:03 am |
Remember, we must build the algorithm around the hardware that is actually there. With only 64 8x16 sprites available per screen, and another race car using up 6 of those at one time, we're already down to 58 sprites. The race track and barriers would all be done with background tiles and well-proven techniques like what we see in Hang On.
That means that any new 3D effects need to fill two areas about 120x96 with sprites that would only fill about 64x96. Using background tiles to cover large parts of that area with solid or dithered colors (a large section of buildings, forest, canyons on one side with a few sprite columns to provide a feeling of scaling and rapid movement), we'd only have 5-6 columns on the other side to draw anything interesting that doesn't use background tiles as a safety net. And we can only stream, IIRC, another 91 tiles per frame, 13x7 or 9x10 per frame. Even flipping those horizontally, that's not enough to fill out the screen. It's barely enough to give the player that canyon and dome feeling of being surrounded by rapidly moving objects. |
|
|
Posted: Tue Mar 29, 2022 3:30 am |
This is all very sad to me, because this forum is one of the last "pure" places on the internet. In the years I've been here there has never really been any drama and tons of great information and experienced people who are willing to help when they can. However, I do feel a bit angry at myself for getting frustrated with this guy. As you have said, and others, he needs to write some code. Just do it, experiment, anything. All this conjecture going around in more circles is annoying. It was annoying last week, and even more so now. No advice is being absorbed. It's a real shame; I make an effort to read every new thread and reply, because it's a community where the daily activity is just the right amount that it's actually feasible to do so. I may not understand every last thing in the dev section as my ASM knowledge is novice x86, but it's still an interesting read and great to see the knowledge living on. It's sad that for the past two weeks I come here to see what's new and it's the same result in this thread, just paraphrased. I want to be clear, I am not trying to be mean. But maxx, it's good to have enthusiasm and to throw new ideas out there... but nobody will write this for you. A "team" will not happen until you show some promising progress. Maybe you are young? When I was young I used to think I was gonna write duke3d mods on a "team" and clearly that was silly and it's something I had to learn. Write some code, any code, for the SMS. Start building on those mistakes. When you start coming here with interesting code that can be refined it will attract attention and someone will be interested in working with you. But, at this point in the conversation, it is going nowhere and is a frustrating experience all around. |
|
|
Posted: Tue Mar 29, 2022 7:23 am |
this is as near to SMS limits as moon. |
|
|
advanced mathematics for 3d polygons on the SMS
Posted: Tue Mar 29, 2022 11:57 am
|
I would love to be blown away to see the Master System hitting new heights using modern tools and modern computer science.....but I'm struggling to see it here.
The Z80 is slow at any kind of math due to its single accumulator and low register count. Even using LUTs is slow. And it only gets worse when you start doing meaningful things like handling data structures and program loops etc. To show what I mean, here is a simple hypothetical block of code to perform multiplication of 2 FP8 numbers using a LUT:
; LUT-based FP8 Multiplication
; Takes in two FP8 numbers and returns the FP8 multiplication result using a LUT
; LUT is 64kB starting on a 16kB boundary
; HL = the two FP8 numbers to be multiplied
; A = the FP8 multiplication result
MultiplyFP8LUT:
    LD A,H              ;work out which 16kB segment of the 64kB LUT to map in to $8000-$BFFF
    RLCA                ;rotate the top two bits of H down to bits 1-0 (RLA would pull in the carry flag instead)
    RLCA
    AND $3
    ADD A,LUTBasePage
    LD ($FFFF),A        ;map in the 16kB LUT segment
    LD A,H              ;form the pointer
    AND $3F
    OR $80
    LD H,A
    LD A,(HL)           ;FP8 multiplication result is in A
This block of code takes 68 CPU cycles to execute. That's a little over 3 calculations per scanline, or about 640 calculations per 192-line frame. I imagine by the time you add program and data structures around this it will probably be half that rate. That ain't great. :( |
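To check the addressing in the routine above off the console, here is a minimal Python model of how the 16-bit index in HL is split into a mapper page and a pointer inside $8000-$BFFF. It is only a sketch: `lut_address`, `lut_offset` and the `lut_base_page` parameter are hypothetical names (the last one standing in for `LUTBasePage`), and nothing here is cycle accurate.

```python
LUT_PAGE_SIZE = 0x4000   # one 16 kB mapper page
SLOT2_BASE = 0x8000      # the selected page appears at $8000-$BFFF


def lut_address(h, l, lut_base_page=0):
    """Split the index HL into (ROM page, CPU address inside slot 2)."""
    page = lut_base_page + (h >> 6)                 # top two bits of H pick the page
    address = SLOT2_BASE | ((h & 0x3F) << 8) | l    # remaining 14 bits index into the page
    return page, address


def lut_offset(h, l):
    """Flat offset into the whole 64 kB table, used to cross-check the split."""
    return (h << 8) | l


page, address = lut_address(0xC5, 0x12, lut_base_page=4)
flat = (page - 4) * LUT_PAGE_SIZE + (address - SLOT2_BASE)
```

For H = $C5 and L = $12 the model selects the fourth page of the table and points at $8512, which is the same byte as flat offset $C512.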
|
|
Posted: Tue Mar 29, 2022 2:09 pm |
probably even a FP8 sum would have to be performed by using a LUT, and that would require the same amount of code, so at the end your Z80 would be barely able to perform at most around 880 (on NTSC) 'operations' per frame, meaning with 100% usage on this, so no CPU left for anything else.
I suggest sticking to 16 bit fixed point. An addition is 11 cycles. |
|
|
Posted: Tue Mar 29, 2022 2:34 pm |
and what to do with that result of multiplication? :) you need to convert that value into the normal number with another table. also fp8 tolerance does not nearly allow to perform all the required calculations. here are ALL possible values for FP8 format: https://en.wikipedia.org/wiki/Minifloat#All_values_as_decimals one can use FP8 as a delta, but you must have at least single (or better double) value as accumulator. |
|
|
Posted: Tue Mar 29, 2022 2:37 pm |
wow, this seems to be pretty useless then, at least in all the cases I can think of... |
|
|
Posted: Tue Mar 29, 2022 3:14 pm |
as said before: https://www.smspower.org/forums/18937-AdvancedMathematicsFor3dPolygonsOnTheSMS?s... | |
|
Posted: Tue Mar 29, 2022 5:29 pm |
Is it worth anyone's time coding an actual demo if there are experienced people here who already know about some of the obstacles? For example, someone reminded me that banks were 16KB, not 64KB as I'd read somewhere else. I looked it up, and sure enough, 16KB.
That's the problem with being a newb. We have to keep asking questions about what the limitations of the hardware are. However, I don't agree about the poor performance of the Z80. At 3.59 MHz, it's rated at a sad 520,000 instructions per second, compared to 1.2 million for the 68000 on the Megadrive. That's about 9,000 instructions per frame. Since we only have to move about 90 tiles and draw 64 sprites per frame, that isn't terrible and has made pseudo 3D possible on the earliest SMS games like Space Harrier and Outrun. While more accurate placement and better 3D effects can be achieved using floating point to place groups of sprites, I feel like this thread has pretty well exhausted itself. Though I've learned a ton here about what can't be done on SMS1 hardware, I'm going to heed the advice of many and start a P.O.C. thread for a much, much simpler project that I'd started out with. Send me a PM if anyone would like to go further on this race track game, but I don't see enough enthusiasm to continue this thread to complete an actual project. |
|
|
Posted: Tue Mar 29, 2022 6:17 pm |
|
|
|
Posted: Tue Mar 29, 2022 7:07 pm |
Also, just wanted to mention that reducing number of instructions is good, but if the instructions take longer than the cycle count of your current code it is not always better. This may sound obvious, but I didn't know this and apparently was (is?) a common misconception.
I learned this from Michael Abrash's Graphics Programming Black Book. The book is now available for free on the web, or you can pay some $$$ to get a physical copy. It's a good read if you're really interested in graphics tricks as well as optimization. Obviously, it will be geared towards x86 so not everything will be applicable. |
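A small illustration of the point about instruction count versus cycle count, using Z80 rather than x86. The T-state figures are the ones from the standard Z80 timing tables; the two code sequences are just an example chosen for this sketch and do not come from the thread.

```python
# T-states per instruction, from the standard Z80 timing tables.
T_STATES = {
    "LD A,(IX+d)": 19,   # indexed load: one instruction, but slow
    "LD A,(HL)": 7,      # plain pointer load
    "INC HL": 6,         # advance the pointer by hand
}


def total_t_states(instructions):
    """Sum the T-states of a straight-line instruction sequence."""
    return sum(T_STATES[name] for name in instructions)


one_instruction = ["LD A,(IX+d)"]
two_instructions = ["LD A,(HL)", "INC HL"]

print(len(one_instruction), total_t_states(one_instruction))
print(len(two_instructions), total_t_states(two_instructions))
```

The shorter sequence is the slower one here, which is why cycle counting matters more than instruction counting.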
|
|
Posted: Tue Mar 29, 2022 7:45 pm |
maxxoccupancy wrote: "I think that a better approach is to take an existing Hang On or Outrun style race game that already runs at 60fps (...) using prerendered tiles (...) that gradually move away from each other to give the appearance that buildings, signs, billboards, stands, and trees are getting closer."
Prerendered tiles? Out Run Europa. See Turbo Charge (C64) or Power Drift (C64) for similar tile-based engines. The problem with the Out Run Europa engine is that scenery moves too linearly and too fast: visual distance = 5 4 3 2 1. In my opinion, using a "more progressive" visual distance, where far scenery moves slower and stays longer on the screen, the result would look much better: visual distance = 5 5 5 5 4 4 4 3 3 2 1. |
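The two sequences above can be generated with a few lines of Python. The hold rule used here (keep distance level d on screen for max(1, d - 1) steps) is an assumption fitted to the example sequence, not something stated in the post, and the function names are made up for the sketch.

```python
def linear_distances(far=5):
    """One step per distance level: 5 4 3 2 1."""
    return list(range(far, 0, -1))


def progressive_distances(far=5):
    """Hold each level d for max(1, d - 1) steps, so far scenery lingers."""
    steps = []
    for d in range(far, 0, -1):
        steps.extend([d] * max(1, d - 1))
    return steps


print(linear_distances())
print(progressive_distances())
```

Any other hold rule that gives the far levels more steps than the near ones would serve the same purpose.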
|
|
Posted: Tue Mar 29, 2022 9:04 pm |
Banks can not be 64K on any Z80 system because 64K is a whole possible address space of the CPU. No exceptions.
68000 is a 16-bit CPU (if we count by bitness of ALU) with 32-bit instruction set. You can not compare it with Z80 simply BECAUSE. It looks to me that you are just writing some random statements that come into your head not even bothering to make any analysis. Not even fully understand what you are writing. |
|
|
Posted: Tue Mar 29, 2022 10:01 pm |
There are three ways to index for distance. One is to just use the actual distance between planes, as most of the 1980s sprite superscalars did in the arcades. Another, like modern 3D raster engines, is to use trigonometry to calculate the distance. The third approach is to set the distance according to the scale factor that you're going to use. Using the race track example, we might cut the race track into a set of lines or slices along the track. Each object along the track has its own slice or distance. Using a 16-bit index, we could subtract the two:
uint_fast16_t distance = distObject - distCar
We could then scale something up by the square root of the distance, or we could use a hash table:
distance  address
3         0x80
4         0xA0
5         0xC0
and so forth.
Regarding the Z80's 520,000 instructions per second and the 68k's 1.2 MIPS, it's true that the 68k had wider registers and a larger ALU with more instructions. However, for the address calculations and memory moves, the Z80A runs at just under half the speed of the 68k. The SMS VDP is a 14MHz 16-bit RISC coprocessor that operates like many other coprocessors out there: controlled through registers rather than having its own decoder, branch unit, etc. There are many coprocessors still in use that operate this way. |
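A rough Python sketch of the distance lookup described above. Only the three table entries for distances 3, 4 and 5 come from the post; the extra entry, the clamping, and the names `ADDRESS_TABLE` and `address_for` are assumptions made for illustration.

```python
# Distances 3..5 are from the post; 6 assumes the same 0x20 step continues.
ADDRESS_TABLE = {3: 0x80, 4: 0xA0, 5: 0xC0, 6: 0xE0}
NEAREST, FARTHEST = min(ADDRESS_TABLE), max(ADDRESS_TABLE)


def address_for(dist_object, dist_car):
    """Subtract the two track positions as a 16-bit index, then look it up."""
    distance = (dist_object - dist_car) & 0xFFFF
    distance = min(max(distance, NEAREST), FARTHEST)   # clamp to the table range
    return ADDRESS_TABLE[distance]
```

Objects outside the table range are clamped to the nearest entry here; a real engine would more likely skip drawing them.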
|
|
Posted: Wed Mar 30, 2022 1:36 am |
🤡 |
|
|
Posted: Wed Mar 30, 2022 1:50 am |
My specialization is RISC pipelines. Admittedly, the Z80 ISA is very different, and optimization techniques are very different, but quasi-personal attacks don't do anything but send out an email notification to everyone on this thread, and they come in and see that someone is trashing the newb rather than offering anything constructive. The actual mini demo concept has moved to its own thread to avoid confusion with the advanced floating point, trigonometric operations, LookUp Tables, and complex topics that are better left in this thread. If you don't have anything constructive to say, don't post anything. |
|
|
Posted: Wed Mar 30, 2022 6:15 am |
Does this also apply to you? As the saying goes ; well-ordered charity begins with oneself. |
|
|
Posted: Wed Mar 30, 2022 6:35 am |
Why do you keep repeating this mantra? So what? How will that help you to render 3D scenes? How does 14MHz help you? You cannot write to the VDP faster than once every 29 (twenty nine!) T-states, and only one byte at a time. If you need random access, speed drops DRAMATICALLY. How does 16-bit help you? All user-visible registers are 8-bit, and the only internal 16-bit VRAM pointer has to be written in two steps and cannot even be read back... How does RISC help you? You cannot write programs for it like shaders or whatever. All in all, the VDP is a very limited and slow graphics chip with inconvenient and slow access that is intended for rendering tiled 2D graphics. |
|
|
Posted: Wed Mar 30, 2022 12:17 pm |
Specialization? You went to school for this? Are you currently IN college for this? If yes, that would probably make a lot of sense. There are newer replies below my previous post, but toxa and ichigobankai are right. You do not listen and are just regurgitating the same things like there is some kind of authority in what you are saying, but you have proved nothing. But go ahead, keep talking about FP8 and word sizes. Post more videos of OTHER PLATFORMS where interesting things have been done. To people like me, someone who is a software developer by profession in C/C++ with some light ASM, at first glance it sounds like you are saying smart things. But then we have experienced people who know this hardware inside and out and say why it's bologna and will never work, and I believe them, because they have proven multiple times over that they know what they are talking about. This is just like the Bogdanov twins! In all seriousness, it's not because you are a "newb". You can be brand new to Z80 assembly, nobody here will make fun of you for that (really!). But when you keep asking the same silly thing over and over and do not listen to why it cannot work, then yes, people get fed up and either argue with, make fun of, or ignore you. |
|
|
Posted: Fri Apr 01, 2022 4:21 am |
Just for reference, this is the closest video I could find for the Pseudo 3D effect for the road and objects on the side of it for the 3D race track game. Road Rash does something like this with the rolling hills. I'm currently reading up on ways that this visual trick can be done.
For the stands, signs, billboards, backgrounds, and scenery, see the other posts about placing groups of sprite-textures together. |
|
|
Posted: Fri Apr 01, 2022 9:31 am |
cool! i like this one:
|
|
|
Posted: Fri Apr 01, 2022 2:03 pm |
OP sounds like a clueless politician in his late 40's who got dropped in a position he knows nothing about, who had to cobble up various bits of vaguely related information in a hurry to try and gain the public's confidence.
You're inventing problems (floating point on a Z80 to do basic 3D) to bring your own solutions (LUTs), effectively achieving nothing useful towards that initial goal. You're convinced that any state machine (VDP) can be a CPU (no it isn't). First 3D polygons, then "sprite-textures" (whatever that is... a sprite is a simple and well defined term in the 2D console world), then pseudo 3D à la Super Scaler which has hardware specifically designed to render very large amounts of scaled sprites (unlike the SMS). Then out of nowhere you mention Romero and start fantasising about a team working to build your dream of pushing things to their limits with programs written in C, or "tweaks" of ready made games like it was just a matter of turning a knob... As others have said, having high ambitions is a good thing, but keeping your feet on the ground is essential. All you have right now are clouds of dots you haven't been able to connect. Sorry man, you can't expect to be taken seriously on a tech forum while sitting at the peak of the Dunning-Kruger curve, unable to follow or express clear ideas, and on top of that being borderline arrogant. Like many others who were interested in rather obscure technical stuff, I've had my cringy know-it-all frantic micro-manager period when I was 16, but aren't you past that stage ? |
|
|
Posted: Sat Apr 02, 2022 12:16 am |
That's literally how Tomb Raider was ported to the GBA, although I didn't know about that when I brought up this approach. They even use LUTs for the more complex mathematics. Floating point and embedded development are literally my background.
Sprite-textures I described earlier as sprites that are used as textures. 4MB ROMs can now hold thousands of prerendered polygons that are displayed as sprites. This shouldn't require any further explanation. The Z80 gets used to write the SAT and update the locations of those sprite-textures each frame, just as many other consoles and machines do now to render 3D graphics. The only question is whether the Z80A at 3.59MHz is powerful enough to color cycle and move 64 sprite locations in 60,000 cycles, or one frame. You're saying that the Video Display Processor is not a processor for some reason, even though it's described as a 14MHz 16-bit "Video Display Processor" that is programmed using registers, tables, and interrupts, even though that's how most graphics coprocessors are programmed--including the SMS Video Display Processor. There are more personal attacks on this thread than I've seen on almost any other forum. Others have also posted lots of examples of polygons and 3D graphics performed by the Z80 and even the SMS itself. Lead, follow, or get out of the way. |
|
|
Posted: Sat Apr 02, 2022 1:55 am |
What does this have to do with the SMS ? Still no answer.
If the only tool you have is a hammer, you will start treating all your problems like a nail.
Eager to see what jerky, slow, and jumbled pixel mess is possible with 40 8x16 pre-rendered triangles. Also, have you finally read about the 8 sprites per line limit ? Ignoring it won't make it go away.
Thanks for explaining what almost every single piece of software written for the SMS does.
No. Vertex-defined shapes in a 3D space aren't sprites. The concept of hardware sprites doesn't even exist anymore on modern GPUs. The fact that you're mixing up such things says a lot.
The Z80 can do much more than change a few palette entries and write 128 bytes in the SAT in the period of one frame. But that doesn't make anything 3D.
Read again. I used the acronym that you used yourself: "CPU", not "processor".
Yup, it's a processor alright, it processes things. A clock frequency and a bus width don't make a "processor" a CPU. The embedded dev that you are (were ?) ought to know the difference.
VDP registers aren't a program, they're just a few configuration switches to select some options. "Configuration registers" aren't called "program memory" for a reason. The official VDP doc is like 20 pages long, read them. What you call "tables" is VRAM. Stating that "interrupts program the VDP" doesn't make any sense. Someone who has basic understanding of either or both wouldn't write that.
I'm attacking your nonsense, not your person.
So in a single, simple sentence, what are you looking for now that others have done your research work ?
Cheesy manager talk. You're leading nobody, and you're following no one either since you've ignored all the advice you've been given until now. So who's got to get out of the way ? |
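The 8 sprites per line limit mentioned above is easy to test against a planned layout before writing any Z80 code. This is a simplified Python sketch: it assumes 8x16 sprites and a 192-line display, ignores the details of how the VDP decides which sprites to drop, and `overloaded_lines` is a hypothetical helper name.

```python
def overloaded_lines(sprite_ys, height=16, limit=8, lines=192):
    """Return the scanlines on which more than `limit` sprites overlap."""
    counts = [0] * lines
    for y in sprite_ys:
        for line in range(max(y, 0), min(y + height, lines)):
            counts[line] += 1
    return [line for line, count in enumerate(counts) if count > limit]


# Nine sprites stacked on the same rows overflow every line they cover.
print(overloaded_lines([40] * 9))
# Eight sprites on the same rows are still within the limit.
print(overloaded_lines([40] * 8))
```

A wall of "sprite-textures" several sprites wide on both sides of the road runs into this check long before the 64 sprite total does.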
|
|
Posted: Sat Apr 02, 2022 7:46 am |
I am having a serious crisis of conscience about whether to post this, but here goes nothing:
As a newcomer to contributing to this forum (lurker for a fair bit longer), ENORMOUS appreciator of the resource that it represents and - dare I say it - sometime expert on a fair few things and a self-confessed “newb” on many, many more; I think this thread is starting to do a discredit to what appears to be an otherwise incredibly respectful and harmonious community of enthusiasts. It’s not that I don’t agree with all of the comments about the OP’s netiquette - to be clear, I do - but in the course of this conversation nothing appears to be improving, only escalating and becoming more and more unpleasant to observe. As another has alluded to, it seems highly likely that the OP is not, as some may have suspected, some impressionable teenager whose behaviour can be "corrected" with the perfect mixture of logic and chastisement, but most likely a grown adult who, by this point in life, may well be set in his ways and unlikely to respond positively in any way to what has now become a fairly constant stream of criticism, however well intentioned that criticism may ultimately be. With the greatest of respect for everyone’s views, feelings and contributions to this conversation to date; and also very mindful that I’m an outsider quite possibly with no right to say so, might I suggest that it could be time to just let this conversation quietly fade into the past and move on, for the sake of maintaining the generally welcoming and constructive ethos of this community? 🕊️ |
|
|
Posted: Sat Apr 02, 2022 8:52 am |
To be honest, I thought that the OP was some kind of troll from his very first post, but I did my best to keep my mouth shut because I didn't want to be rude (since I have a history of bad behaviour on various forums and I'm trying to steer away from that). Glad that someone else took care of the elephant in the room and said what many of us have been thinking. As much as I'd love to be proven wrong, I don't think we're going to see an "advanced 3D" demo anytime soon. | |
|
Posted: Sat Apr 02, 2022 9:43 am |
but GBA has six buttons: "A","B","START","SELECT","L","R" while SMS has only two: "1" and "2"! |
|
|
Posted: Sat Apr 02, 2022 10:11 am |
With all due respect, I'll have to quote the text from https://atariage.com/forums/topic/82555-to-all-non-programmer-idea-peddlers/
|
|
|
Posted: Sat Apr 02, 2022 10:32 am |
we simply need more LUTs. |
|
|
Posted: Sat Apr 02, 2022 4:29 pm
|
I thought furrtek’s post was civil myself.
Ultimately this is still a technical discussion about our favourite gaming console, even if its capabilities are not fully understood by all. @toxa, c’mon man, no need to LUT-shame the guy. Some people just like big LUTs. :) |
|
|
Posted: Mon Apr 04, 2022 12:53 pm |
I totally agree. Now provide some POC, or understand where the limits are, or get out of the way. You have plenty of time to pick yours. |
|
|
Posted: Sat Aug 06, 2022 6:05 pm |
Our friend, Supermaxx, would be proud of this, but he is busy with the F1 2022 season, lol. |
|
|
Posted: Tue Aug 09, 2022 3:52 am |
Well, not a master system, but using the tms9918a:
About as impressive as I've ever seen on a computer of this vintage... |
|
|
Posted: Tue Aug 09, 2022 12:40 pm |
What's the floating point precision? |
|
|
Posted: Tue Sep 20, 2022 5:43 pm |
If you were referring to my proposal, the precision of the Lookup Table limits us to seven bits, plus another seven bits for the exponent and another bit for the sign. That assumes a mantissa with implied '1' before the 1.xxxxxxx. That's the equivalent of 9-bit fractions. Using this method for fp, we would see approximately 100 kFLOPS at best, but realistically 60-80k with well written code. That would give us a respectable 5-10k polygons/sec, more than the Sega CD, but not as accurate. So, 7-bits plus a sign and implied 1, or 9-bit fraction. |
|
|
Posted: Wed Sep 21, 2022 4:32 am |
word. | |
|
Posted: Wed Sep 21, 2022 8:11 am |
If you can describe the steps needed to perform an addition between two of those floating point values I can code a snippet to perform that and we can see how many cycles are required. I suspect we're nowhere close to those values, since you'd need fewer than 60 cycles per FLOP to reach 60 kFLOPS (and that's using the whole frame time just for math). |
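The cycle budget behind that estimate can be worked out directly. This is a back-of-the-envelope Python sketch that assumes the NTSC Z80 clock of 3,579,545 Hz and 100% of CPU time spent on math; the function names are made up for the example.

```python
Z80_CLOCK_HZ = 3_579_545   # NTSC SMS CPU clock (assumption for this sketch)


def cycles_per_op(ops_per_second):
    """Average cycle budget per operation at a target rate."""
    return Z80_CLOCK_HZ / ops_per_second


def ops_per_second(cycles_each):
    """Best-case rate if every operation costs `cycles_each` cycles."""
    return Z80_CLOCK_HZ / cycles_each


print(round(cycles_per_op(60_000), 1))    # budget for 60 kFLOPS
print(round(cycles_per_op(100_000), 1))   # budget for 100 kFLOPS
print(int(ops_per_second(68)))            # rate of a 68-cycle LUT multiply
```

So 60 kFLOPS leaves just under 60 cycles per operation and 100 kFLOPS leaves under 36, while the 68-cycle LUT multiply posted earlier in the thread tops out near 52,600 operations per second with nothing else running.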
|
|
Posted: Wed Sep 21, 2022 7:11 pm
|
I know. I'm just trying to get within an order of magnitude to see where we are. IIRC, the Atari Jaguar could render about 10,000 polygons per second, so even 3-4,000 flat shaded polygons would be impressive and enough for a basic FPS.
Both operations assume that, when we load the mantissa from memory into the registers, the '1' is implied. The multiply-add and multiply LUT results are already filled out with this assumption in mind. That is, we would not support tiny denormalized numbers. Exponents are signed bytes. Mantissa could be signed or unsigned, depending on the final implementation, but I'm assuming signed to simplify the logic. The 7x7 LUT table in the cartridge would only see the bottom seven bits.
// Multiply: Exponent (one byte), Mantissa (one byte, upper is always 0)
// C = A * B
Cexp = Aexp + Bexp // overflow results in max saturation or 255
Cman = Aman * Bman // Multiply LUT (7x7) access from ROM cart
Norm = CountLeadingZeroes (Cman) // number of zeroes should be 1 or none, so this step may be simplified
Cman = Cman - Norm
Cman = LeftShift(Cexp, Norm)
// Add: Exponent (one byte), Mantissa (one byte, upper is always 0)
// D = A + B
Dexp = Greater(Aexp, Bexp)
DexpOV = Dexp + 1 // a large minority of adds lead to overflow, requiring that the exponent be incremented
Dman = Aman + Bman // if overflow occurs, we use DexpOV for the exponent
Since geometry transforms are made up of four multiplies and 12 multiply-adds, we may be able to save ourselves a normalization and rounding step by using a single function for both as modern hardware does. Let's find errors in logic before we start optimizing first, then look for features in the Z80 that allow us to accomplish some of these tasks for free.
Since the 16KB LUT table delivers 8 bits of results, we could return either the 7-bit fraction with the upper bit showing an overflow and losing the ULP (unit in the last place) or preserve the ULP and always shift up by one bit, letting us use the free overflow detection to increment the Exponent. The latter approach might be a bit slower, but would preserve some of the already borderline accuracy of these 16-bit fp numbers.
Here are some optimized Z80 routines for 24 and 32 bit fp numbers for the TI 83/84's. https://www.ticalc.org/archives/files/fileinfo/472/47243.html
Accurate subtraction (or adding negative numbers) is basically impossible because there are so many subtractions where A and B are close enough together to lose 2, 3, or even 4 bits of accuracy in a single operation. I'm proposing that we use something similar to bfloat16 (top image in attachment), since it's already in wide use, has well tested routines and code, and can be operated on in two efficient byte operations in the Z80. https://en.wikipedia.org/wiki/Bfloat16_floating-point_format |
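To make the 7x7 table idea concrete, here is a minimal Python model of the first variant described above: a 16 KB table indexed by the two 7-bit mantissas, where each byte holds the 7-bit result fraction and the top bit flags that the exponent needs incrementing. It is a sketch under assumptions, not the proposed implementation: values are positive only, the exponent is an unbounded Python integer with no saturation, results are truncated, and `build_mul_lut`, `fp_mul` and `to_float` are hypothetical names.

```python
def build_mul_lut():
    """16 KB table: index = (Aman << 7) | Bman, value = carry bit + 7-bit fraction."""
    lut = bytearray(128 * 128)
    for am in range(128):
        for bm in range(128):
            product = (128 + am) * (128 + bm)   # restore the implied 1 on both sides
            carry = product >= 1 << 15          # product reached 2.0 or more
            if carry:
                product >>= 1                   # renormalize back into [1.0, 2.0)
            fraction = (product >> 7) & 0x7F    # truncate, drop the implied 1
            lut[(am << 7) | bm] = (0x80 if carry else 0) | fraction
    return lut


MUL_LUT = build_mul_lut()


def fp_mul(a, b):
    """Multiply two (exponent, mantissa) pairs using one table lookup."""
    (a_exp, a_man), (b_exp, b_man) = a, b
    entry = MUL_LUT[(a_man << 7) | b_man]
    return a_exp + b_exp + (entry >> 7), entry & 0x7F


def to_float(value):
    """Decode (exponent, mantissa) for checking results on a PC."""
    exp, man = value
    return (1 + man / 128) * 2.0 ** exp
```

For example, `fp_mul((0, 64), (0, 64))` models 1.5 * 1.5 and decodes to 2.25. On the Z80 the lookup itself would still cost a bank switch plus a pointer build, as in the 68-cycle routine earlier in the thread.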
|
|
Posted: Thu Sep 22, 2022 8:53 am |
I'm not an expert on the subject but I do believe this doesn't work. I suspect it would give something like 100 + 1 = 200 |
|
|
Posted: Sat Sep 24, 2022 4:54 am |
did anyone see that toy story coding secrets video? | |
|
Posted: Sat Sep 24, 2022 11:45 am |
Yes...? |
|
|
Posted: Mon Sep 26, 2022 4:24 am |
A = 1.0000000 * 2^0000001 // 2
B = 1.0000000 * 2^0000000 // 1
A + B should be 3, or C = 1.1000000 * 2^0000001
Dexp = Greater(1, 0) // 1
DexpOV = 1++ // 10, or 2 in decimal, forcing Bman to be rightshifted by 1
Dman = 1.000 + 0.100{0} // Dman gets 1.100{0}
If we drop/truncate {0} in the Guard bit, we end up with 1.100 * 2^0001 = 11.00, which is 3 in decimal.
Obviously, I should've been more explicit about the right-shifting of the value with the smaller exponent. Does anyone know any hacks with Z80 Assembly to preserve the values in the lower order bits that might otherwise be lost when they're shifted out? I suppose we could just perform both adds, one using the Aexp and another for Bexp, then using one branch for rounding to +/- infinity if the {Guard and Round} bits are 1 and 1. Truncation (dropping these low order bits) is fastest and usually accurate enough with an effectively 17-bit number. |
|
|
Posted: Mon Sep 26, 2022 7:34 am |
Yes. So the whole process would be:
- find what's the max exponent and who provided that
- right shift the mantissa of the other number a number of times equal to the difference of the two exponents
- add the two mantissas
- if there's carry, increment the exponent
Then there's the infinity problem, how should it be tackled? |
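The steps above can be sketched in Python to check the logic before counting Z80 cycles. This is a model under assumptions: positive numbers only, a 7-bit mantissa with an implied leading 1, truncation instead of rounding, and no handling of the infinity question, which is left open here as well. `fp_add` is a hypothetical name.

```python
def fp_add(a, b):
    """Add two (exponent, mantissa) pairs, 7-bit mantissa with implied 1."""
    (a_exp, a_man), (b_exp, b_man) = a, b
    # Step 1: find the larger exponent and which operand provided it.
    if a_exp < b_exp:
        (a_exp, a_man), (b_exp, b_man) = (b_exp, b_man), (a_exp, a_man)
    # Step 2: right shift the other fraction by the exponent difference.
    big = 128 + a_man
    small = (128 + b_man) >> (a_exp - b_exp)   # bits shifted out are simply lost
    # Step 3: add the aligned fractions.
    total = big + small
    # Step 4: on carry, renormalize and increment the exponent.
    exp = a_exp
    if total >= 256:
        total >>= 1
        exp += 1
    return exp, total - 128


print(fp_add((1, 0), (0, 0)))    # 2 + 1
print(fp_add((6, 72), (0, 0)))   # 100 + 1
```

With the alignment shift in place, 2 + 1 comes out as 1.5 * 2^1 and 100 + 1 as (1 + 74/128) * 2^6, which is 101.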
|
|
Posted: Tue Sep 27, 2022 9:03 pm |
I should've been even more precise: The fraction is the 1.xxxxxxx, whereas the mantissa is just the .xxxxxxx, where the '1' is implied. For addition and subtraction, you do have to calculate Aexp - Bexp. If the result is negative, then B is larger, so we branch and rightshift Aman. If performance is all that matters, we can require ordered operations such that Aexp must be >= Bexp. Infinity is represented as maximum saturation (usually all 1's) in graphics calculations. This turns out to produce acceptable results in generating images, though it fails in most scientific calculations and simulations. An even simpler implementation would use a custom 8-bit format where the lower 7 bits are used only for the LookUp Tables, and that's what I'd originally proposed. nVidia's FP8 (E4M3) is a well studied format where the limitations are pretty well known. Since our output is 256x192, the inaccuracies that occur with such short precision are likely to be less noticeable. The entire calculation is already worked out in the 7x7 (16KB) LookUp Table for even functions, including division, square root, and trigonometric functions. The ideal would be to use two 16KB (7x7) LookUp Tables, one for the exponent and one for the fraction. However, no simple solution comes to mind. Once you know that the exponent of one is larger than the other, the number with the smaller exponent must be shifted down by the difference between them, by definition. The alternative is to create a completely custom format or use 16-bit fixed point numbers, forcing four different lookups from the ROM cart for each multiply, followed by several adds and carries. |
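The "four lookups plus adds and carries" route for 16-bit fixed point looks like this in a quick Python model. It assumes unsigned 8.8 fixed point and an 8x8 multiply table standing in for the ROM lookup; `MUL8`, `mul8` and `fx_mul_8_8` are hypothetical names.

```python
# Stand-in for an 8x8 multiply table in ROM (65,536 entries).
MUL8 = [a * b for a in range(256) for b in range(256)]


def mul8(a, b):
    """One table lookup: the 16-bit product of two bytes."""
    return MUL8[(a << 8) | b]


def fx_mul_8_8(a, b):
    """Multiply two unsigned 8.8 fixed-point values using four 8x8 lookups."""
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    full = (
        (mul8(a_hi, b_hi) << 16)
        + ((mul8(a_hi, b_lo) + mul8(a_lo, b_hi)) << 8)
        + mul8(a_lo, b_lo)
    )
    return (full >> 8) & 0xFFFF   # keep the middle 16 bits as the 8.8 result


print(hex(fx_mul_8_8(0x0180, 0x0200)))   # 1.5 * 2.0
print(hex(fx_mul_8_8(0x0140, 0x0140)))   # 1.25 * 1.25
```

Each `mul8` call would be a banked ROM read on the console, so the four lookups alone cost several times the single-lookup multiply discussed earlier.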
|
|
Posted: Wed Sep 28, 2022 7:01 am |
the Z80 will handle 16 bit fixed point numbers just fine... well, at least additions will be pretty fast |
|
|
Posted: Wed Sep 28, 2022 10:46 pm |
So how about 14-bit fixed point numbers plus one for the sign, then another bit for overflows. That is, if we're anticipating overflows (based on the scale to be used), then the second bit in the number is reserved to show that the number has exceeded MaxSat, or Maximum Saturation. Using 14-bit numbers, we can also use 16KB LookUp Tables for reciprocals, square root, RecipSqrt, Sin, Cos, Tan, etc, where you have only a single operand and an 8-bit output. Each of these mathematical operations could then use separate Upper and Lower output tables, so a dozen of these one-operand functions would consume: 12 functions * 2 tables * 16KB = 384KB or about 3 megabits, plus the most frequently used fp operation, multiply. Fixed Point 15-bit numbers can capture most of the accuracy needed for geometry transforms, object physics, and even DSP functions. |
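Here is a small Python sketch of one such one-operand function, to show the upper/lower table split and the ROM cost. Square root is used as the example; the scaling (14-bit integer in, 8.8 fixed-point result out) and the names `build_sqrt_tables`, `sqrt_lookup` and `rom_bytes` are assumptions made for illustration.

```python
from math import isqrt

ENTRIES = 1 << 14   # 14-bit operand, so 16,384 entries per table


def build_sqrt_tables():
    """Build separate upper-byte and lower-byte tables, 16 KB each."""
    upper = bytearray(ENTRIES)
    lower = bytearray(ENTRIES)
    for x in range(ENTRIES):
        root = isqrt(x << 16)        # floor(sqrt(x) * 256), fits in 15 bits
        upper[x] = root >> 8
        lower[x] = root & 0xFF
    return upper, lower


SQRT_UPPER, SQRT_LOWER = build_sqrt_tables()


def sqrt_lookup(x):
    """Two byte reads, one per table, combined into a 16-bit result."""
    return (SQRT_UPPER[x] << 8) | SQRT_LOWER[x]


def rom_bytes(functions, tables_per_function=2):
    """Total ROM used by a set of one-operand function tables."""
    return functions * tables_per_function * ENTRIES


print(sqrt_lookup(4), rom_bytes(12) // 1024)
```

`sqrt_lookup(4)` returns 512, which is 2.0 in 8.8 fixed point, and twelve functions with two tables each come to 384 KB before the multiply table is added.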
|
|
Posted: Thu Sep 29, 2022 8:09 am |
I think it would surely be an interesting exercise, and I think you should probably learn Z80 asm and try to implement the basic operations yourself. I would say with your background it would probably just take a few weeks to get the grasp of how the Z80 processor works, after all it's a pretty simple processor and doesn't have many complex features like modern ones like out of order execution, speculative execution, advanced pipelining, etc... there's even no cache so you get exactly what you code. | |
|
Posted: Thu Sep 29, 2022 3:56 pm |
heck, just write it up in C first to prove it works. Then worry about making it fast. | |
|
Posted: Fri Sep 30, 2022 4:35 am |
Just the opposite. I learned dynamic scheduling, dynamic branch prediction, caching, etc, after I left college. All of these features are abstracted away from the programmer and are now available even in x86 implementations. After looking into Z80, it's actually really complex if you want to fully optimize the code and build performance fp code out of its faster primitives. You can't build high performance libraries without cycle counting instructions and picking data types ahead of time that are just the right fit. The chip is capable of about 600,000 instructions per second, but that drops off dramatically without careful selection of algorithms and data types. I really need someone who understands the inner workings of the chip and the best possible use of its unusual upper and lower register sets. I need programmers who can figure out how many random/sequential lookups per second we can get out of the game ROM. I need to know if we could use 14-bit mantissa to create a 15-bit fraction and 2-bit exponent (using a trap for overflows), because a lot of mathematical functions become a lot easier if we can do that. The simple theory of using the lower 14 bits to select from a 16K-entry LUT is not hard, and we could get fp performance on the order of 100 KFlops. From Retrocomputing Stack Exchange: "The fastest data copying is 10.5 cycles per byte on Z80 with no address limits using the stack pointer:" https://retrocomputing.stackexchange.com/questions/5748/comparing-raw-performanc... Since the proposed approach to performing fp using LUTs is basically a data copy plus 2-6 additional instructions, the most that we could realistically hope for would be about 200 KFlops, so maybe 5,000 polygons/second (160 polygons/frame at 30fps) on the best day ever using sprite engine rendering in the 16-bit VDP. Virtua Racing's SVP was supposed to be able to generate 20,000 polygons/sec, but we never saw more than 9,000. |
|