Jump to content
Sign in to follow this  
basil4j

Optimising math

Recommended Posts

I'm replying only to this statement, but in the spirit of the topic.

 

MSP430 has an average ALU by MCU-standards.  Shifting is fast.  Addition is pretty fast too.  Multiplication, however, occurs through a peripheral and it is quite slow.  It is faster than using a software multiplier, but it is still pretty slow because it does not work through registers.

Shift of a single bit is "fast", but shifting multiple bits is linear in the number of bits (not so much c+n, but c*n): IMO the msp430 does a pretty poor job of shifting in general. The time also depends on which instruction is used: the standard one-bit shift operation in a loop, or either the constant (1-4 bits) or register (1-15 bits) CPUX instruction. The latter are slightly faster and take less code.

 

Multiplication is not that slow. The user's guide for the FR57xx MPY32 has full results available no more than 4 cycles (16-bit; 11 cycles for 32-bit) after initiating the operation. Other families probably differ slightly.

 

It strikes me as I write this that, without doing the timing to prove it, left shifts by more than about four bits would probably be faster as multiplications, if the multiplier peripheral is used.   The compiler might use strength reduction and generate shifts instead.  (Dunno if CCS generates code to use the multiplier peripheral; mspgcc does, but msp430-elf-gcc does not.)

 

Anyway, often it is much faster to add than to multiply.  In particular, one algorithm I wrote once had enormous, enormous, ENORMOUS improvement by taking logarithm (I made a table), then doing additions, then doing antilog (again, a table).

I agree that changing the domain of the calculation is the best path to improving speed. Even if you don't need to go all the way to log tables, I'd make a serious effort to avoid floating point, by scaling and factoring to express an integer approximation that retains the accuracy required. The biggest hassle is making sure no intermediate values overflow the integral type.

Share this post


Link to post
Share on other sites

Ok I was thinking about what Enl said about pre-computing stuff.

 

Pressure target for altitude is easy. I can do this on the pad at any frequency, say 1Hz, and stop after say 30 seconds to allow everything to stabilise and average out.

 

altitude = user altitude target

p0 = sea level pressure (changes depending on the weather)

T = temperature at sea level (changes depending on the weather)

L = lapse rate (Can be determined based on use configured altitude target and 4 'if else'

^ = to the power of, not XOR :)

 

p = p0 * (1- (L*altitude/T)) ^ (0.2840437/8.31447*L)

 

In flight I still need to calculate temperature compensated pressure. This needs to be done within 5ms on a 20Mhz clock which I understand shouldn't be a problem.

The actual ADC readings from the barometer are only occurring at 50Hz, but the math is all being done in the (interruptable) main loop, and so the time frame is determined by the accelerometer math which is being performed at 200Hz and is also in the main loop.

 

dT      = temp reading - Tref.................... ....(Tref is a value stored in baro, so only needs to be read once. These values are signed shorts)

temp  = 2000 + dT * TEMPSENS / 2........... (TEMPSENS is another variable stored in baro, read once). Result is signed short (20.01deg =  2001 etc)

OFF   = POFF * 2^16 + (TCO*dT)/2..... .......(POFF and TCO from baro, read once). Result is signed 64bit(??) int. Maybe make this on 32bit. Will see how it effects precision.

SENS = PSENS * 2^15 + (TCS * dT)/2^8......(PSENS and TCS from baro, read once). Again, 64bit result...

P        = (ADC Reading * SENS / 2^21 - OFF)/2^15

 

Then in flight all i'd need to do is compare p with P...

 

So I count 1 integer mul, 1 integer div, 1 integer add, 4 FP muls, 4 FP div's, 5 FP exponentials, 2 FP adds and 1 FP subtract...

Share this post


Link to post
Share on other sites

Lots of interesting stuff, but you're forgetting one thing. Do you NEED to optimize? For example, if you're checking altitude 5 times a second and your non-optimised code takes a (relatively) very slow millisecond to do the calculation then you are very far from needing to optimize.

 

It's very relevant to consider performance on an embedded device. However, optimised code tends to be less readable, less maintainable and more likely to have bugs. If your "improved" code is quicker but causes you to trip up later on a subtle bug, is it really better?

 

Don't take this it mean that you shouldn't have asked or that I'm not interested in the answers! My view may be tainted by the fact I spend a lot of my day job (as a C#/Java developer) working with convoluted buggy code where people were trying to prove they were clever when really they didn't need to.

Share this post


Link to post
Share on other sites

@Fred: I agree with you 100%. As I said: modern compilers tend to be smart.

 

@basil4j: Based on the operations you listed, I would guess close order of magnitude of 200 cycles NOT INCLUDING the logs/antilogs for the exponentials (basing this on about 20 cycles for FP mult or div with the hardware integer mult) At 25MHz, this is 8 microseconds. The only real time taker will be the exponentials. I doubt that they will be more than on order of magnitude longer, giving an estimate of roughly 100 microseconds for the math, almost certainly less than a millisecond.

 

Going back to your needed time scale, I don't think efficiency is likely your main concern. Efficient enough does the job. Within broad limits, the guideline is make it right, then make it fast. If you need to worry about fast to make it work, only make it as fast as you must, then put effort elsewhere. Ditto for size: If it fits, don't worry about saving a few bytes. If it doesn't fit, you need to worry, and it is often more than a few bytes that are the concern.

Share this post


Link to post
Share on other sites

You can probably used fixed point arithmetic instead of floating point with improved speed and accuracy, but at the expense of slightly more complex code. I have done this for e.g. PID temperature control.

 

The libraries I am familiar with are templated C++ (i.e. Energia friendly) but you can probably search around for something open source and usable in C if that's preferable to you.

Share this post


Link to post
Share on other sites

Fred & Enl. I completely agree, I always start a program with the idea in mind that it has to make sense for someone else to read it.

I asked about optimising because I like to learn all I can about whatever i'm doing to enable me to make smart choices. This was an educational thread, not one which is currently critical to my project. 

 

As it is, I have already applied some of the suggestions to some non-math code in my interrupts to keep them as fast as possible :) And they were a bit convoluted...

 

Fred, not that it is too important in the scheme of things, but I need to do all the math within 5ms. Parts of the math need to be performed at 200Hz, and the other bits cant hold it up :) But as ENL said (and has been pointed out to me in other threads), even then its not something I need to worry about, because with a 20Mhz clock I get 100,000 clock cycles to work with...

 

Regarding size. 736 bytes of my 1kb are being consumed by a circular buffer :/ This doesn't leave me with much

Share this post


Link to post
Share on other sites

Ok looks like I do have a problem :/ When I add anything that needs the math library (either msp430_math.h or just math.h) my program wont fit in RAM.

 

I have a 0.4second circular buffer which is 736 bytes, which if I reduce for 368 bytes allows me to fit everything in. Trouble is 0.2 second is hardly enough time as I need to store the 0.4 seconds of data prior to launch detection (which is delayed by a smidgen to prevent false detection).

 

Also, the math library takes my FRAM usage from 2700bytes to 14438bytes! Nearly full!

 

This is all just with the addition of

 

     palt = psealevel / pres;
    palt = pow(palt,lr);
    palt = palt - 1;
    palt = palt * (tempsealevel + 273.15f);
    palt = palt * 153.84615;
 
(Even if I just add the first line I get these problems)
 
Any tips? I can understand the FRAM, but the math library must a heck of a lot of variables to use that much RAM...

Share this post


Link to post
Share on other sites

Well, this is where things get interesting... The math lib likely uses a bit of RAM, but only functions you use should build in, so anywhere you can eliminate a function (floating point) you gain. RAM usage is a different matter.  Pow will take a bit of workspace. Shouldn't be that much. Knock the buffer size down 4 bytes at a time to find the limit. I will guess at about 20 bytes (5 floats) but wouldn't be surprised at 64 bytes (16 floats). 

 

The mult and div also come in as functions but they should be fairly small, and are required for the power function anyway. Have you looked at the optimization settings? The only functions that should be included are those that are used or needed by those used.I have not had any problem memory-wise using float mult. add. and subtract on a G2553. No where near that much memory usage (CCS5).

 

Also, Don't break up the computation. Make it one statement and let the compiler do it, with optimization on. It can likely do better than you or I can, and may same a little memory.

 

Have you looked at MSPMATHLIB as a possible solution? (MSP5 and 6 series only, so I haven't tried it... see slau499) It is nominally for performance, so I don't know what it has as space requirements. Might be worse, might be better.

 

You may find that fixed point, a table, and interpolation for the power will do the job for you with less memory.

Share this post


Link to post
Share on other sites

Hi Enl,

 

I am using MSP430_math.h which is, I believe, MSPMATHLIB?

Making that math all 1 line doesnt change anything unfortunately.

 

I have to reduce the buffer to 444bytes to fit it in, but that takes it right to 1024bytes RAM...

 

Using Whole program optimisation for size has a negligible effect.

 

Im sure most of this could be done in fixed point, and I will only ever need 8 decimal places, but I have no idea where to start with fixed point...

EDIT : Ill start by reading this http://www.ti.com/lit/an/slaa329/slaa329.pdf

 

Edit 2: Plenty of good stuff on Wikipedia too!

 

Share this post


Link to post
Share on other sites

Over in the Energia libraries forum I have posted a port of the libfixmath 16.16 fixed point math library. It is not Energia specific, so should compile if you are using CCS. It is much smaller and faster than the floating point libraries, and is more than accurate enough for barometric altitude calculations. The whole reason I ported it was because I was using a BMP085 on an MSP430G2553 and the floating point was taking up so much flash I didn't have enough left to do anything useful.

Share this post


Link to post
Share on other sites

Over in the Energia libraries forum I have posted a port of the libfixmath 16.16 fixed point math library. It is not Energia specific, so should compile if you are using CCS. It is much smaller and faster than the floating point libraries, and is more than accurate enough for barometric altitude calculations. The whole reason I ported it was because I was using a BMP085 on an MSP430G2553 and the floating point was taking up so much flash I didn't have enough left to do anything useful.

Thanks! Ill take a look!

 

Does the fact that the math routines have a 'do not optimise' flag mean anything? Thats what the compiler is saying and im not sure the effect its having

 

Sent from my GT-I9300 using Tapatalk

 

 

Share this post


Link to post
Share on other sites

Ok, have managed to get my accelerometer math into fixed point which was pretty easy.

    		//Calculate temperature compensated pressure.
    		/*
    		  	dT      = temp reading - Tref
				temp  	= 2000 + dT * TEMPSENS / 2... Result is signed short (20.01deg =  2001 etc)
				OFF   	= POFF * 2^16 + (TCO*dT)/2...Result is signed 64bit(??) long long. 
				SENS 	= PSENS * 2^15 + (TCS * dT)/2^8.....Again, 64bit result
				P       = (ADC Reading * SENS / 2^21 - OFF)/2^15
    		*/

     		pDT 		= bar[0] - Tref * 0x100;
    		pTEMP 		= 2000 + pDT * TEMPSENS / 0x1000000;
    		pOFF 		= OFF * 0x10000 + (TCO * pDT) / 0x80;
    		pSENS		= SENS * 0x780 + (TCS * pDT) / 0x100;
    		pPres		= (bar[1] * pSENS / 0x200000 - pOFF) / 0x8000; 

The barometric stuff is proving difficult still.

 

This POW function is what really puts it over the top, it adds about 600 RAM usage. I dont know much about how the routine works so I have a few questions.

Will changing the number of decimal places in the exponent reduce RAM requirements?

Is there any way to do fixed point POW without using this function? The exponent ranges from +0.19026666 to -0.08196102

 

Under 'failed allocation', I see '.data' is in there twice, and each entry seems to have identical values (450byte each to make a total 900 unallocated). Why is this?

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Sign in to follow this  

×
×
  • Create New...