doragasu, posted December 11, 2014:
I think the topic title is clear enough. Does anybody have info about the FPU performance of these Cortex-M4 chips? I have been googling for a while and can't find anything. It would be great to know the total MFLOPS achievable with these chips, or even better the MFLOPS/MHz.
igor, posted December 11, 2014:
Or, simplifying the units: MFLOPS/MHz = (M * FLO/s) / (M * cycles/s) = FLO/cycle, i.e. floating-point operations per cycle.
Thought I had a reference on this, but I can't find it at the moment. (Of course you could download a benchmark like Whetstone, try it, and post the results.)
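To make the unit cancellation concrete, here is a minimal Python sketch. The function name `flop_per_cycle` and the sample figures (40 MFLOPS at an 80 MHz clock) are made up for illustration; they are not measurements of any chip discussed in this thread.

```python
def flop_per_cycle(mflops: float, clock_mhz: float) -> float:
    """Convert a MFLOPS figure and a clock in MHz to floating-point ops per cycle.

    The 'mega' prefixes and the 'per second' cancel, leaving FLO/cycle.
    """
    flo_per_second = mflops * 1e6        # M * FLO/s
    cycles_per_second = clock_mhz * 1e6  # M * cycles/s
    return flo_per_second / cycles_per_second


# Hypothetical numbers, chosen only to show the arithmetic.
example = flop_per_cycle(40.0, 80.0)
print(f"{example} FLO/cycle")
```

Any benchmark result (Whetstone or otherwise) can be normalised the same way, which makes results from parts running at different clock speeds comparable.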
GrumpyOldPizza, posted December 11, 2014:
Good question. VADD and VMUL take 1 clock cycle each (to issue). Then there is VFMA (fused multiply-add), which takes 3 clock cycles, which implies a latency of 1 clock for the multiply. So I'd say you have 1 MFLOPS/MHz, assuming perfect code. That, however, does not take into account that you have to load the operands and then store the results again.
- Thomas
doragasu (thread author), posted December 14, 2014:
Thanks for the info. So VADD/VMUL take 1 cycle, but you have to load/store the operands... So unless these instructions can read operands and write results into SRAM without needing extra cycles, each operation will take more than 1 cycle (maybe 3 or 4, right?).
In the past I have worked with some TI DSPs (like the C55XX ones) that had a MAC instruction that could read two operands from RAM, multiply-accumulate them (and shift the result, and do stuff with the operand pointers) and write the result back to internal RAM in 1 cycle(*). I suppose this is not the case here.
(*): Not really one cycle; it would take 5 cycles, but because of the pipeline you can count it as a 1-cycle instruction unless you are calculating latencies.
GrumpyOldPizza, posted December 15, 2014:
This is all tricky, and to be honest there are some things I have not understood (especially how VFMA is implemented with 3 cycles + 1 cycle latency as opposed to 1 cycle + 3 cycles latency). VFMA is a MAC, but it takes 3 cycles. Why? If VMUL and VADD take 1 cycle each, how do the various MAC variants help?
Anyway, let's say you do matrix operations (which are typically MAC operations). Let's say you multiply a 4-element vector by a 4x4 matrix. Then you have 20+2 loads (16 matrix elements plus 4 vector elements), 4+1 stores, 16 multiplications, and 12 additions. This is 55 operations, hence 55 cycles, into which you crammed 28 floating-point operations. Thus about 0.5 MFLOPS/MHz. (The +2 and +1 are the overhead of the VLDM/VSTM instructions, cycles in which no data gets transferred.)
The data is from the Cortex-M4 TRM, section 7.2. It also points out a latency of 1 cycle.
Back to the example above. Say you want to multiply an array of 4-element vectors by a 4x4 matrix, and the matrix is preloaded. Then, per vector, you'd spend 4+1 loads, 4+1 stores, 16 multiplies and 12 adds: 38 cycles to do 28 floating-point operations, or 0.74 MFLOPS/MHz.
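The cycle counting above can be reproduced with a short Python sketch. It is only a back-of-the-envelope model using the assumptions stated in the post: 1 cycle per VADD/VMUL, 1 cycle per register moved by VLDM/VSTM, and 1 overhead cycle per VLDM/VSTM instruction. It ignores pipeline latency, memory wait states and loop overhead, and the function name `estimate` is made up for illustration.

```python
def estimate(loaded_regs, load_instrs, stored_regs, store_instrs, muls, adds):
    """Return (flops, cycles, flops_per_cycle) for a simple issue-cycle model.

    Each multi-register load/store instruction costs one overhead cycle plus
    one cycle per register transferred; each multiply or add costs one cycle.
    """
    load_cycles = loaded_regs + load_instrs
    store_cycles = stored_regs + store_instrs
    flops = muls + adds
    cycles = load_cycles + store_cycles + flops
    return flops, cycles, flops / cycles


# Case 1: 4x4 matrix (16 values) and vector (4 values) both loaded,
# using two load instructions and one store instruction.
case_all_loaded = estimate(20, 2, 4, 1, 16, 12)

# Case 2: matrix already in registers, only the vector is loaded and stored.
case_preloaded = estimate(4, 1, 4, 1, 16, 12)

for name, (flops, cycles, ratio) in [("matrix + vector loaded", case_all_loaded),
                                     ("matrix preloaded", case_preloaded)]:
    print(f"{name}: {flops} flops in {cycles} cycles = {ratio:.2f} flop/cycle")
```

Multiplying the resulting flop/cycle figure by the core clock in MHz gives a rough upper bound in MFLOPS for this kind of kernel; real code will land below it.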