-+*> Simple Speed Optimizations <*+-


A trainer by Kent "MenThal" Dahl

Unofficial edition Version 0.5

Latest change: 07/03-98

 DISCLAIMER:

This information is LEARNWARE, which means the copyright remains with the author, who allows it to be freely distributed, in its original form. It is illegal to sell, lend, or in other ways make money on this product. If you find this very useful, please give credit when credit is due. (Mainly like Freeware!)

NB: This trainer is aimed at hobby coders and beginners. More experienced coders will probably know most of the stuff in this trainer. This is mainly the stuff I wish I knew when I was a really inexperienced coder. Still, if you code, you might learn something from my ravings, so go ahead and have a look.

Code snippets are either Pascal or pseudo-code unless otherwise stated. Knowledge of programming is obviously required to make any use of this trainer. I don't take any responsibility for the validity of ANY of this, although I have tested most of it and I trust it myself...

 

Index:

  Why bother OPTIMISE?
  How do I OPTIMISE?
  Am I supposed to OPTIMISE it all?
  Traps to avoid...
  Speed optimizing concepts:
    Extract code from within loops
    Specialisation
    Assembly / Machine Code
    Pre-calculated values / Look-up tables
    Fixed-point Math
    Multiply, Divide and Remainder of 2^X, SHL, SHR and AND
    Clever Math
    FPS: God of game execution speed
    Clever Concepts
  Tech-heads only / Last gasp

Why bother OPTIMISE?

Speed, speed, speed is the answer. Speed optimisation is important in games, graphic intensive code, and nearly all sorts of programs. A good portion of optimisation can maybe change your slowly-chugging-along-at-a-snail's-pace program into a speedy-slick-supercool program. (Then again, maybe not!)

Look, just trust me. Speed is important. Imagine Wolfenstein, Doom, Descent, Future Crew's "Second Reality" demo or any program that needs a 486 (or a Pentium, for that matter) running at less than half its normal speed. Or try running Quake on a 486/33... BOOORING, eh? Oh, you probably agree, since you're reading this. Let's get on with it, shall we?

How do I OPTIMISE?

Speed opt'ing your code is mainly a case of helping your PC use a faster way of getting the work done. There are very many ways to approach this issue, several of which are closely related. Throughout this trainer I will explain several speed opt'ing concepts, examples and techniques, so when you're done you will have a few ideas on where to start when you want to speed opt' your code.

Am I supposed to OPTIMISE it all?

No. You only have to (and you only ought to) optimise the more time-critical parts of your code, f.ex. frequently used functions and procedures, loops and other code that is used a lot. If your program draws a lot on the screen at each frame update/screen rewrite, your drawing functions ought to be well optimised.

Traps to avoid... (Help, I've fallen and I can't get up!)

All this talk about speed has probably got some of you into a speed-freak state. Snap out of it. Do NOT start optimising your code as you develop it. Get it working PERFECTLY & SLOWLY, then start squeezing speed out of it. That way you'll be sure that all your formulas and stuff are working before your optimisations make your code a dreadful mess... Of course, it would be smart to consider future speed optimisations when designing the code, but you should try to avoid making your code too complex early in your project.


-+> SPEED OPTIMIZING CONCEPTS <+-


Extract code from within loops.

The code inside a loop is repeated several times. So if some of the code does exactly the same thing, over and over, why not extract it from the loop?

F.ex.

FOR i := 1 TO 1000 DO y := i*10*a*b

As you see, the variables _a_ and _b_ don't change within the loop. Neither does the constant 10. How about calculating 10*a*b outside the loop, putting it in a variable, and then just calculating y := i*precalced_variable? The optimised version looks like this:

temp := 10*a*b

FOR i := 1 TO 1000 DO y := i*temp

This is a bit faster, as the computer does not have to recalculate 10*a*b a thousand times inside the loop.


Specialisation

Trading adjustability and compatibility for more speed is a reasonable alternative towards the end of development. If you have decided what screen mode your program will use, you can opt' your code for that specific screen mode, making it faster. In general, specialising your code will a) reduce adjustability, b) reduce compatibility, and c) make your code messier and less readable, but it will also d) IMPROVE the program execution SPEED. Example: My mouse routines used to be in procedures, and these procedures called library procedures which called an interrupt. Very structured programming. I recoded the mouse routines in assembly, directly calling the interrupt and putting the results where I wanted them. BINGO. More than twice as fast. (OK, it's a crap example, and it was VERY early, when I was still an extreme novice at Asm on the PC! Come to think of it, I'm still not too good...)
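
To make the idea concrete, here is a minimal Turbo Pascal-style sketch (not my actual code; the CurrentMode variable and the general routine are made up for illustration). The general PutPixel stays adjustable but pays for a mode check on every single pixel, while the specialised one simply assumes mode 13h (320x200x256) and writes straight to video memory:

VAR
  CurrentMode : Byte;                    { made up: which screen mode is active }

PROCEDURE PutPixelGeneral(x, y: Word; colour: Byte);
BEGIN
  CASE CurrentMode OF
    $13 : Mem[$A000 : y * 320 + x] := colour;
    { ...other modes would be handled here... }
  END;
END;

PROCEDURE PutPixel13h(x, y: Word; colour: Byte);
BEGIN
  { specialised: assumes mode 13h, no checks, straight to video memory }
  Mem[$A000 : y * 320 + x] := colour;
END;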


Assembly / Machine Code. Taking the stairs down.

The more closely your code communicates with your system, the faster it will run. Direct hardware coding is REALLY fast, as coin-ops and game consoles have shown. On a PC, assembly is the closest you get to your system, and the fastest. Today it is unnecessary to code completely in assembly; instead, use assembly in the really speed-critical parts of your program. In the beginning, (almost?) all professional games were coded completely in assembly. With today's faster machines this isn't necessary, but some assembly here and there is worth it. I won't go much into assembly in this trainer; there are books and other trainers on that subject.


Pre-calculated values / Look-up tables

Sine, cosine, logarithms, square roots? This kind of heavy math can slow your program down to below a cooked snail's pace :-) When I use sine and cosine, I have my program calculate the values and store them in an array/vector. Sine and cosine go through 360 degrees, so I usually set up an array of floating-point or fixed-point numbers with room for 3600 values. F.ex. (Pascal):

sine, cosine : array[0..3599] of Real;   { one entry per tenth of a degree }

.......

FOR i := 0 TO 3599 DO
BEGIN
  sine[i]   := Sin((i / 10) * Pi / 180);   { degrees -> radians }
  cosine[i] := Cos((i / 10) * Pi / 180);
END;

 

When I need a sine value, I just use:

sine[degree*10 MOD 3600]

This reduces the accuracy of my results, but it is hardly noticeable.

I've recently dropped the use of 3600 FP values myself, and gone for a table size that's a power of two (256, 512, 1024, 2048, 4096...), as this makes it possible to replace the MOD with an AND (much faster). Please read about "Multiply, Divide and Remainder of 2^X, SHL, SHR and AND" later on.....
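
As a sketch of that power-of-two variant (the table size of 1024, and the names, are just my choice for the example):

CONST
  TabSize = 1024;                          { must be a power of two }
VAR
  SinTab : array[0..TabSize - 1] of Real;
  i      : Integer;

.......

FOR i := 0 TO TabSize - 1 DO
  SinTab[i] := Sin(i * 2 * Pi / TabSize);  { the table covers one full circle }

When I need a value for an angle (now measured in 1024ths of a full circle rather than tenths of a degree), the lookup becomes SinTab[angle AND (TabSize - 1)] instead of a MOD.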

Oh, and another thing... If you calculate a value, like cos(a)*sin(b), and you use this value in more than one place in your code, then you could do worse than calculating it once, putting it in a variable and using the variable in your calculations. That is, if the variables a and b don't change in the meantime...


Fixed-point Math

Floating-point math is the best your PC can do at representing real numbers. It can handle several digits above and below the decimal point, but it is SLOOOW. Having a maths co-processor does help considerably, but not everybody has got one, and it still leaves a bit to be desired. Enter FIXED-POINT numbers: less accuracy, more speed.

A fixed-point number is a word (16 bits), longword (32 bits) or some bigger datatype which represents a real number by letting some of the bits hold the integer part and the rest hold the fraction. F.ex. you could have a longword (32 bits), with 16 bits representing your number's integer part and 16 representing its fraction. Ex: real decimal 0.125 = hex $0000.2000 = integer decimal 8192.
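
To make the representation concrete, here is a tiny sketch of converting between a real number and such a 16.16 fixed-point longword (the Fixed type and the ToFixed/ToReal names are mine, nothing standard):

TYPE
  Fixed = Longint;                   { 16 integer bits . 16 fraction bits }

FUNCTION ToFixed(r: Real): Fixed;
BEGIN
  ToFixed := Round(r * 65536.0);     { 0.125 -> 8192 = $00002000 }
END;

FUNCTION ToReal(f: Fixed): Real;
BEGIN
  ToReal := f / 65536.0;
END;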

Addition and subtraction goes like a charm:

 $0000.2000 = 0.125
+$0000.2000 = 0.125
-------------------
 $0000.4000 = 0.250

But be careful when you multiply and divide:

  $00008000 * $00008000 = 32768 * 32768
= $40000000             = 1073741824

 $0000.8000 * $0000.8000 = 0.5 * 0.5
=$0000.4000              = 0.25     (This is what it SHOULD be)

As you can see above, when you multiply two fixed-point numbers, the raw result has to be shifted down by exactly the number of bits used for the fraction. The catch is that the full product needs twice as many bits, so if you stay within the same datatype you can lose some important bits: the integer part of the number.

 $00004000 / $00008000 = 0.25 / 0.5
=$00000000      = 0 (The answer is too small for the longword)

 $0000.4000 / $0000.8000 = 0.25 / 0.5
=$0000.8000              = 0.5      (This is what it SHOULD be)

Here the result should be shifted up by the number of bits used to denote the fraction, but by the time the integer division has been performed it is too late: the fractional part has already disappeared (and 0 shifted up is still 0).

HOW DO WE SOLVE FIXED-POINT MULS/DIVS?

One way would be to treat the numbers as a bigger datatype when we MUL and DIV, i.e. make a longword out of a word, calculate and resize the result by shifting the bits like I mentioned. But this is problematic if you use the biggest datatype available to integer math. A 32bit number would need to be treated as a 64bit number(!!!) A 64bit num. would be calculated as a 128bit number etc. See the problem? And with fixed-point math, we often need the biggest possible datatype to acquire the necessary accuracy.

So what do we do? As for multiplies, try shifting both numbers (the factors) down by half of the number of bits used for decimal notation, and just multiply. This reduces the accuracy of the decimals, but at least it preserves the important integer part.

 $0000.8000 * $0000.8000 = 0.5 * 0.5
 $0000 8000 * $0000 8000 = 32768 * 32768
 $0000 0080 * $0000 0080 = 128 * 128      (Shift them halfway down)
=$0000 4000              = 16384          (Multiply)
=$0000.4000              = 0.25           (And BINGO!)

Divides can be solved using the same logic. Shift the divisor halfway down, divide, and shift the result halfway up. This reduces the decimal accuracy, but works. Alternatively, shift the dividend halfway up and the divisor halfway down, and use the result directly. This may lose some integer bits, so it's up to you which method you use.

Be careful when using fixed-point math, as all this shifting can make the result go haywire if you don't watch out. Mainly you need to evaluate your needs, and weigh the need for accuracy against the need for speed. And range is REALLY important whenever you're using fixed-point math.
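
To tie it together, here is a minimal Pascal sketch of the halfway-shift trick, using a 16.16 fixed-point Longint (the FixedMul/FixedDiv names are mine; there are no overflow or divide-by-zero guards, and I use DIV rather than SHR so negative numbers at least keep their sign):

TYPE
  Fixed = Longint;                   { 16.16 fixed point, as above }

FUNCTION FixedMul(a, b: Fixed): Fixed;
BEGIN
  { shift each factor down by half the fraction bits (8), then multiply }
  FixedMul := (a DIV 256) * (b DIV 256);
END;

FUNCTION FixedDiv(a, b: Fixed): Fixed;
BEGIN
  { shift the divisor halfway down, divide, shift the result halfway up }
  FixedDiv := (a DIV (b DIV 256)) * 256;
END;

With a = $00008000 (0.5) and b = $00008000, FixedMul gives 128*128 = $00004000 = 0.25, exactly like the worked example above.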

Michael Abrash (graphics guru, and on the Quake team at id) has been writing a series of 3D articles in "Dr. Dobb's Sourcebook", and he is promoting the use of FLOATING POINT, mainly because of the growing number of people with Pentiums and the accuracy problems with fixed point. True, as the power of our processors grows, the gains of fixed-point math over floating point are shrinking, but it is still worthwhile sometimes. To balance Abrash's statement, I'd just like to mention that Intel's Pentium MMX extension is/was designed with the fact in mind that power-hungry games use integer instructions (as one does with fixed-point math) for speed... at least, I read that somewhere...


Multiply, Divide and Remainder of 2^X, SHL, SHR and AND

The computer's multiply and divide instructions are slow, but multiplying or dividing by a power of 2 is fast: just use the bitwise shift operators SHL (bitwise shift left) and SHR (bitwise shift right).

x * 2 <=> x shl 1

x div 2 <=> x shr 1

x * 4 <=> x shl 2

x div 16 <=> x shr 4

So you see:
x SHL y = x*(2^y)		x SHR y = x/(2^y)

This is the old concept of moving the decimal point when multiplying by ten in ordinary decimal math, carried over to the binary nature of bytes...

The slow MOD (remainder) operation can also sometimes be swapped for a cleverer alternative. The AND operation returns a bitwise, logical AND of all the bits. Like SHL/SHR, this only applies when we're taking the remainder of a division by a power of 2.

Ex.

    %0010111101 = 189
AND %0000001111 =  15
---------------------
 =  %0000001101 =  13

So:

x MOD 16  <=>  x AND 15
x MOD  4  <=>  x AND  3
x MOD  y  <=>  x AND (y-1), but only if y = 2^z, where z is an integer

This stuff is really useful when using FIXED-POINT MATH. But be careful with negative values: a plain SHR doesn't behave like a signed division, and AND always gives a non-negative result, so these equivalences really only hold for non-negative numbers. And of course none of it works on floating-point numbers.


Clever Math

Mathematical operations can be a heavy drag, but you can try cutting them into smaller, faster pieces.

Ex.: In graphics programming, you'll probably encounter something like the " Offset := y*320 + x " formula. Now, as you know, MULtiply is SLOOW, so what are we gonna do?


y*320+x
y*64*5+x         the y*64 = y shl 6, but we still got a MUL 5
y*64*4+y*64+x
y*256+y*64+x     y shl 8 + y shl 6 + x, should be snappy.

(Since y is usually less than 256, you can punch it into the hi-byte (AH, BH, CH, DH). If you don't understand, don't worry! In Asm, AH is the high 8 bits of AX, which means that AH=Y, AL=0 will give AX = Y*256.)
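
In Pascal the same decomposition could look something like this (ScreenOffset is just my name for it, and it assumes mode 13h sized coordinates so the result fits in a Word):

FUNCTION ScreenOffset(x, y: Word): Word;
BEGIN
  { y*320 = y*256 + y*64 = (y shl 8) + (y shl 6) }
  ScreenOffset := (y SHL 8) + (y SHL 6) + x;
END;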

Try fiddling like this with your own formulas, but only once you're sure they are correct. A good rule of thumb is: get it working perfectly first, THEN optimise to your heart's content. I know I've said it before, it's just that it is so important!!!


FPS: God of game execution speed...

When programming a game that runs fluidly, and not step-by-step, you'll probably start measuring something like the FPS - Frames Per Second. The process of rewriting the contents of the screen, updating the objects' positions etc. is, by some, called the FRAME UPDATE. The number of frame updates per second, or simply frames per second, fps, is quite important when it comes to how smooth your game/demo/fast-graphical-project looks.

Now, just adding some debug code, counting the number of frames, setting up a timer, and dividing the counter by the number of seconds might seem to give you the average fps. Yes, it does, but this value is not quite an accurate measure of your program's speed and smoothness. Your average may be 15 fps (reasonable enough), but it is important to consider your CRITICAL values: maximum and minimum FPS. If your program only manages 5 (or LESS!!) fps at its worst, it is going to chug BADLY. What you need to do is reduce the workload at the right places, creating a more consistent frame rate.

How do we do that? There is no single solution, but one thing to check is how well you've spread the workload across the frame updates. An example: creating a maze game with monsters, we wouldn't want them to move EACH frame, as they'd be too fast for the player. So the obvious solution is to ONLY move the monsters at certain times, f.ex. every 8 frames. So we bluntly insert this code:

IF FrameCounter MOD 8 = 0

THEN FOR i:=0 TO MaxMonsters DO MoveMonster(i)

Or something like that. Now, every eighth frame the workload is a lot bigger, resulting in a speed drop. Especially the more monster-packed levels might chug badly. But your testing may not show this, as your average FPS will seem just fine, and there may not be enough monsters on your test level to noticeably slow the game down...

As the bright person you are, you've probably figured out the solution. Move one eighth (1/8) of the monsters EVERY frame!!! By spreading the workload, our average fps might go down a bit, but our worst fps will (probably) improve!! I actually faced this problem in a small, early, simple game project of mine, when I still wasn't too seasoned a coder... ;-) But the solution worked!
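
A sketch of the spread-out version, reusing the hypothetical FrameCounter, MaxMonsters and MoveMonster names from the snippet above; each frame it steps through roughly one eighth of the monster list, starting at a different offset:

i := FrameCounter MOD 8;
WHILE i <= MaxMonsters DO
BEGIN
  MoveMonster(i);
  i := i + 8;                { only every 8th monster gets moved this frame }
END;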


Clever Concepts

The potential speed increase from clever designs and algorithms is vast. You'll have to think some out for yourself, but I thought I'd share this simple example that for some reason popped into my head:

You've got two "objects", A and B, placed at co-ordinates (A.x, A.y) and (B.x, B.y). Now, the problem is: are they within a range of 5 of each other? Gee, that's simple... let's just use the ol' Pythagorean formula:

IF SquareRoot( (A.x-B.x)^2 + (A.y-B.y)^2 )<=5 THEN YepTheyreWithin

I hope you don't have to be a brain surgeon to figure out that ... this... is... a... sloooow method. How about using some clever math? Remember doing equations in math class? Well, let us remove the square-root function by squaring both sides:

IF ( (A.x-B.x)^2 + (A.y-B.y)^2 )<=25 THEN YepTheyreWithin

But all this multiplying isn't always necessary. If either the vertical or the horizontal distance between the objects is over 5, then they're not within range:

IF Abs(A.x-B.x)<=5 THEN

IF Abs(A.y-B.y)<=5 THEN

IF ((A.x-B.x)^2+(A.y-B.y)^2)<=25 THEN YepTheyreWithin

This last example adds a little work for objects that are within each other's range, but if most of the objects we're going to compare are way out of each other's range, it saves us quite some time.
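
Wrapped up as a little Pascal function (the TThing record and the InRange name are my own invention for the example; Sqr(x) is just x*x):

TYPE
  TThing = RECORD
    x, y : Integer;
  END;

FUNCTION InRange(A, B: TThing; Range: Integer): Boolean;
BEGIN
  InRange := False;
  IF Abs(A.x - B.x) <= Range THEN              { cheap horizontal reject }
    IF Abs(A.y - B.y) <= Range THEN            { cheap vertical reject }
      IF Sqr(A.x - B.x) + Sqr(A.y - B.y) <= Sqr(Range) THEN
        InRange := True;
END;

Then the test is simply: IF InRange(A, B, 5) THEN YepTheyreWithin.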


** TECH-HEADS ONLY / The remaining notes left on my desk when I was done...

At the end of this project, I was left with some information that was uncertain, untested and more difficult to understand. I mention it here briefly for the reading pleasure of those who understand it. I'm TOTALLY UN-responsible for this stuff...

* Concerning assembly coding only!

"...code for the 486 should deliberately DEC and then branch, [instead of LOOPing]" Heard mention of an article by Abrash about splitting an instruction into 2, or more, simpler, yet equivalent instructions. F. ex. LOOP should be split into DEC CX and JNE ... I believe this to be true for 486's and Pentiums... As for the 686/Pentium Pro, it is supposed to be splitting CISC instructions into smaller RISC instructions, making the gain non-existent.....?

Note on the Pentium: It can ..sometimes.. execute more than one (integer) instruction per clock-cycle. It can also (sometimes) issue a branch after an integer instruction on the same clock, and it can do 2 floating point instructions on the same clock-cycle if one of the instructions is an FXCH....
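
For what it's worth, the LOOP splitting mentioned above looks something like this in Turbo Pascal's built-in assembler (just an illustration of the idea, not something I've benchmarked; the loop body is left as a comment):

asm
  mov  cx, 100              { loop counter }
@Top:
  { ...loop body... }
  dec  cx                   { DEC sets the zero flag... }
  jnz  @Top                 { ...so JNZ (= JNE) replaces LOOP @Top }
end;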

(Hmm, I'm having difficulties grasping this myself!) As for assembly and low-level instruction optimisation, I suggest you get some other, more thorough, trainer or document. If you're hooked into the Net, you could, f.ex., try out the X2FTP.OULU.FI ftp site, which I found through Future Crew's home-page.

..LAST GASP..

I hope this info was interesting, useful or something. It is far from perfect, but at least it is a gentle (and nearly hopeless :-) introduction to the area of speed optimisation. Right now, it lacks some practical usefulness, examples and code, and I've just skimmed the surface. But, a later release ...perhaps.

 

 

 

Copyright © 1998 by Kent Dahl alias 'MenThal'


Send your well-devised thoughts to me at MenThal@bigfoot.com ( or kent_dahl@hotmail.com )
URL: http://www.geocities.com/ResearchTriangle/Thinktank/1207/speedopt.html

G E O C I T I E S


- the people to thank for this nice, little, FREE Homepage in Cyberspace.
