
Stellaris/Tiva MFLOPS/MHz?



Good question.

 

VADD and VMUL each take 1 clock cycle to issue. Then there is VFMA (fused multiply-add), which takes 3 clock cycles, which implies a latency of 1 clock for the multiply.

 

So I'd say you get 1 MFLOPS/MHz at best, assuming perfect code. That, however, does not take into account that you have to load the operands and then store the results again.
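To put numbers on that estimate, here is a minimal sketch in C. The function name `effective_mflops` and the 80 MHz clock used in the note below are illustrative assumptions, not figures from the TRM. It simply scales FLOPs per cycle by the clock, so any load/store cycles added to the total pull the result below the 1 MFLOPS/MHz peak.

```c
#include <assert.h>

/* Effective MFLOPS for a code sequence that performs `flops` floating-point
 * operations in `total_cycles` clock cycles on a core clocked at `clock_mhz`.
 * total_cycles should include load/store cycles, not just the arithmetic.
 * (Hypothetical helper for illustration only.) */
double effective_mflops(double clock_mhz, unsigned flops, unsigned total_cycles)
{
    assert(total_cycles > 0);   /* avoid dividing by zero */
    return clock_mhz * (double)flops / (double)total_cycles;
}
```

With an assumed 80 MHz clock, `effective_mflops(80.0, 1, 1)` gives the 80 MFLOPS peak, while a single operation that costs three cycles in total comes out at roughly 26.7 MFLOPS.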

 

- Thomas


Thanks for the info.

 

So VADD/VMUL take 1 cycle, but you have to load/store the operands... So unless these instructions can read operands and write results into SRAM without needing extra cycles, each operation will take more than 1 cycle (maybe 3 or 4, right?).

 

In the past I have worked with some TI DSPs (like the C55XX ones) that had a MAC instruction that could read two operands from RAM, multiply-accumulate them (and shift the result, and update the operand pointers) and write the result back to internal RAM in 1 cycle(*). I suppose this is not the case here.

 

 

_________

(*): Not really one cycle, it would take 5 cycles, but because of the pipeline you can count it as a 1 cycle instruction unless you are calculating latencies.


This is all tricky, and to be honest there are some things I have not understood (especially how VFMA is implemented with 3 cycles + 1 cycle latency as opposed to 1 cycle + 3 cycles latency).

 

VFMA is a MAC, but it takes 3 cycles. Why? If VMUL and VADD take 1 cycle each, what do the various MAC variants gain you?
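To make the comparison concrete, here is a rough sketch in C that counts arithmetic cycles for a dot product both ways, using only the issue counts quoted in this thread (1 for VADD/VMUL, 3 for VFMA). The function names are made up for illustration, and the count ignores the extra latency cycle for dependent results as well as all loads and stores.

```c
#include <assert.h>

/* Issue cycles as quoted in this thread (Cortex-M4 TRM, section 7.2). */
enum { CYC_VADD = 1, CYC_VMUL = 1, CYC_VFMA = 3 };

/* Arithmetic cycles for an n-element dot product using separate
 * instructions: n VMUL + (n - 1) VADD. */
unsigned dot_cycles_separate(unsigned n)
{
    assert(n >= 1);
    return n * CYC_VMUL + (n - 1) * CYC_VADD;
}

/* Same dot product with one VMUL to start the sum, followed by
 * (n - 1) fused multiply-adds. */
unsigned dot_cycles_fused(unsigned n)
{
    assert(n >= 1);
    return CYC_VMUL + (n - 1) * CYC_VFMA;
}
```

For one row of a 4x4 matrix this gives 7 cycles with separate instructions against 10 with VFMA, so by these figures alone the fused form does not save cycles. What it does provide is a single rounding step instead of two, and fewer instructions per row (4 instead of 7).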

 

Anyway, let's say you do matrix operations (which are typically MAC operations). Let's say you multiply a 4 element vector by a 4x4 matrix. Then you have 20+2 loads, 4+1 stores, 16 multiplications, and 12 additions. That is 55 cycles in total, into which you crammed 28 floating point operations. Thus about 0.51 MFLOPS/MHz. (The +2 and +1 are the overhead cycles of the VLDM/VSTM instructions, during which no data gets transferred.)
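The same count can be written out as a small C sketch. It assumes, as in the paragraph above, that a VLDM or VSTM costs one cycle per word plus one cycle of overhead, and that every VMUL and VADD issues in one cycle with no stalls. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles for a VLDM or VSTM that moves `words` single-precision values:
 * one cycle per word plus one cycle of overhead (assumption from the post). */
unsigned multi_transfer_cycles(unsigned words)
{
    assert(words >= 1);
    return words + 1;
}

/* Total cycles for y = M * x with a 4x4 matrix M and a 4-element vector x:
 * two loads (matrix and vector), one store (result), 16 VMUL and 12 VADD. */
unsigned matvec4_cycles(void)
{
    unsigned loads  = multi_transfer_cycles(16) + multi_transfer_cycles(4);
    unsigned stores = multi_transfer_cycles(4);
    unsigned math   = 16 + 12;   /* one cycle per VMUL and per VADD */
    return loads + stores + math;
}
```

`matvec4_cycles()` returns 55, and 28 FLOPs over 55 cycles is the roughly 0.51 MFLOPS/MHz figure.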

 

The data is from the Cortex-M4 TRM, section 7.2. It also points out a latency of 1 cycle.

 

Back to the example above. Say you want to multiply an array of 4 element vectors by a 4x4 matrix, and the matrix is preloaded. Then for each vector you'd spend 4+1 cycles on loads, 4+1 on stores, 16 on multiplies and 12 on adds. That is 38 cycles for 28 floating point operations, or 0.74 MFLOPS/MHz.
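A short C sketch of how the one-time matrix load is amortized over a batch. It assumes 17 cycles (16 words + 1 overhead) to preload the matrix and 38 cycles per vector as counted above, and it ignores loop overhead such as branches and pointer updates. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles to transform n vectors by one 4x4 matrix when the matrix is loaded
 * once (17 cycles) and each vector then costs
 * 5 (load) + 5 (store) + 28 (math) = 38 cycles. */
unsigned batch_cycles(unsigned n)
{
    assert(n >= 1);
    return 17 + 38 * n;
}

/* FLOPs per cycle (equivalently MFLOPS/MHz) for the whole batch. */
double batch_flops_per_cycle(unsigned n)
{
    return (28.0 * n) / batch_cycles(n);
}
```

For a single vector this reproduces the 55-cycle case (about 0.51 MFLOPS/MHz), and for large batches the ratio approaches 28/38, or about 0.74, from below.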
