43oh

Energia-written program runs faster than CCS equivalent


Hello,

I'm writing a program that involves fast pin switching (as fast as it can go!) on an MSP432P401R. Initially I used Energia for this, but due to some unpredictable behaviour I switched to CCS 6.1.0. Apart from having to port some Energia-specific functions, such as the I/O functions, I soon found that even the bare-minimum code for flipping a pin's state differs significantly in speed.

Below is example pin-switching code for CCS and Energia, respectively:

#include "msp.h"

void main(void)
{
	
    WDTCTL = WDTPW | WDTHOLD;     

    CSKEY = 0x695A;
    CSCTL0 = 0;
    CSCTL0 = DCORSEL_5; //set DCO to 48 MHz
    CSKEY = 0;

    P6DIR |= BIT4; //set pin 6.4 as output

    while(1){
    	P6OUT ^= BIT4; //flip state
    	P6OUT ^= BIT4; 
    }
}

Measured output frequency on the pin: around 1.05 MHz.

Energia code:

#include "msp432p401r.h"

void setup()
{
  //nothing goes here, makes execution even faster for some reason?
}

void loop()
{
 
  while (1){
    P6OUT ^= BIT4;
    P6OUT ^= BIT4;
  }
}

Execution speed: 2.16 MHz! 

My guess is that the speed increase is due to settings in Energia's somewhat hidden main.cpp and related libraries, whereas CCS gives you the chip "as is" and needs more knowledge to set up properly. Where can I go from here? My current knowledge doesn't extend much past DCO settings.

Thanks for the answers.

I'm not familiar with the internals of Energia, but from your plain C version I can see a few possible problems. First of all, you're not setting the Power Control Manager (PCM) to VCORE1, which is required for MCLK frequencies above 24 MHz.

Second, you're probably getting hit with excess flash wait states. You only need 2 wait states for ordinary flash reads at 48MHz, but the default is 3. See this thread for more information: http://forum.43oh.com/topic/8435-msp432-sram-retention-and-flash-waitstates-check-your-settings/

Thanks for the tips, it works now. In the meantime, though, I installed some updates for CCS, and now the speed reaches 2 MHz even without the VCORE1 command, which seems strange:

PCMCTL0 = AMR__AM_DCDC_VCORE1;

I changed the flash wait state number to 2, it adds some speed:

 FLCTL_BANK0_RDCTL = FLCTL_BANK0_RDCTL_WAIT_2;

I'm now sitting at 2.19 MHz, so it's looking pretty alright. Could I go much higher than this? Thanks!
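
For scale, the measured frequencies can be converted into CPU cycles per loop iteration, since each iteration flips the pin twice and so produces one full output period. This is only a back-of-the-envelope sketch: it assumes MCLK really is running at 48 MHz, and `cycles_per_iteration` is a made-up helper name, not something from the TI headers.

```c
/* Hypothetical helper: CPU cycles spent per loop iteration, given the
 * core clock and the square-wave frequency measured on the pin.
 * Assumes one loop iteration produces exactly one output period. */
double cycles_per_iteration(double mclk_hz, double measured_hz)
{
    return mclk_hz / measured_hz;
}
```

With the figures from this thread, 2.19 MHz works out to roughly 22 cycles per iteration, and the original 1.05 MHz to roughly 46.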

Thanks for the tips, it works now. In the meantime, though, I installed some updates for CCS, and now the speed reaches 2 MHz even without the VCORE1 command, which seems strange:

Yeah, I was surprised it worked at all without VCORE1. The chances are that the MCU would flake out and crash or reset if your program did anything too demanding at 48MHz with VCORE0.

I'm now sitting at 2.19 MHz, so it's looking pretty alright. Could I go much higher than this? Thanks!

You can go a lot higher by using the timer peripherals instead of bit-banging the port with the CPU. You can also output one of the clocks (MCLK or SMCLK, can't remember which) to a particular port pin. That's useful if you just need a fast square wave for some reason.

There's also the option of using DMA to write to a port, but then you have to set 8 pins at a time (i.e. the whole port has its eight bits overwritten).

Similarly, the CPU can toggle a pin faster if you don't make it read the current state of the port. Read-modify-write operations (as they're called) are relatively slow on ARM cores because you have to load values into a CPU register to modify them, then store them back to where they came from.

For instance, you could replace your loop with:

  while (1){
    P6OUT = BIT4;
    P6OUT = 0;
  }

or something like this:

  unsigned char P6Bit4Off = P6OUT & ~BIT4;
  unsigned char P6Bit4On = P6Bit4Off | BIT4;

  while (1){
    P6OUT = P6Bit4On;
    P6OUT = P6Bit4Off;
  }

In either case the CPU doesn't read the current state of P6OUT during the loop; it just writes to P6OUT. The other pins end up as zero in the first case, or retain the value they had before the loop started in the second.
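
The difference between the XOR loop and the two write-only loops can be checked on a PC with a small model. This is just a sketch for illustration: `mock_port` and its helper functions are hypothetical stand-ins that count accesses, not real MSP432 registers, and they say nothing about actual cycle counts on the chip.

```c
#include <stdint.h>

#define BIT4 0x10u  /* same bit value as BIT4 in the TI headers */

/* Hypothetical stand-in for a port output register that counts accesses. */
typedef struct {
    uint8_t value;
    unsigned reads;
    unsigned writes;
} mock_port;

uint8_t port_read(mock_port *p)
{
    p->reads++;
    return p->value;
}

void port_write(mock_port *p, uint8_t v)
{
    p->writes++;
    p->value = v;
}

/* Models "P6OUT ^= BIT4": one read and one write per toggle. */
void toggle_rmw(mock_port *p)
{
    port_write(p, (uint8_t)(port_read(p) ^ BIT4));
}

/* Models one output period with values fixed beforehand: two writes, no reads. */
void pulse_write_only(mock_port *p, uint8_t on, uint8_t off)
{
    port_write(p, on);
    port_write(p, off);
}
```

Starting from a port value of 0x03, two calls to `toggle_rmw` cost two reads and two writes, while `pulse_write_only(&port, 0x13, 0x03)` costs two writes and leaves the low two bits alone. Calling `pulse_write_only(&port, BIT4, 0)` instead models the first variant and clears every other pin.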

If for some reason you need faster I/O access (sometimes it's useful for multiple SPIs or weird protocols), M cores are normally really good at that; I'm actually surprised it's only about 2 MHz. On a Tiva you can get half the system clock (so at 80 MHz you get 40 MHz bit-banged I/O). My guess is that it's down to the MSP432's low-power peripheral buses. I'm actually curious what the bit-banged speed would be with the DMA on the MSP432.

If for some reason you need faster I/O access (sometimes it's useful for multiple SPIs or weird protocols), M cores are normally really good at that; I'm actually surprised it's only about 2 MHz. On a Tiva you can get half the system clock (so at 80 MHz you get 40 MHz bit-banged I/O).

That I/O performance on Tiva is impressive. It looks like it's due to the use of an AHB bus for the GPIO on the Tiva rather than the APB bus used on the MSP432. That said, I think it can only achieve that rate when not performing RMW operations. Toggling one pin with XOR would presumably have a maximum frequency of 13.33 MHz (one cycle to read, one to XOR and one to write back).
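
The arithmetic behind those two figures, written out as a small sketch. `toggle_freq_hz` is a hypothetical helper, and it assumes the loop is unrolled so that nothing but the toggles themselves costs cycles.

```c
/* Hypothetical helper: square-wave frequency when every pin toggle costs
 * a fixed number of CPU cycles. One output period needs two toggles
 * (pin high, then pin low). Loop overhead is ignored. */
double toggle_freq_hz(double sysclk_hz, unsigned cycles_per_toggle)
{
    return sysclk_hz / (2.0 * cycles_per_toggle);
}
```

At 80 MHz, one cycle per write gives 80 / 2 = 40 MHz, and three cycles per read-XOR-write gives 80 / 6 = 13.33 MHz.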

I'm actually curious what the bit-banged speed would be with the DMA on the MSP432.

The uDMA on Tiva, CC3200 and MSP432 isn't great in terms of raw performance. It takes a lot of cycles to read the channel data structure and source data before each write to the destination. It's quicker to write using the CPU with values held in registers.

I saw 40 MHz achieved with direct register access, using the faster bus (there are two options on TM4C123 devices). The I/O toggling was done with either "high", "low", "loop", or simply a ton of high and low commands copied over and over again (I don't quite remember).

I remember using the DMA for 800 kHz bit-banging, with 4 DMA transfers happening at the same time. I never really tested the limits (hmm, maybe I should when I get some time and access to the logic analyzer).

Thanks for all the suggestions, I'll try them out now. I also found out that building a program in CCS with "debug" checked instead of "release" makes the execution slower. That's understandable but easily overlooked, in case someone was wondering about a mysterious lack of performance.

I remember using the DMA for 800 kHz bit-banging, with 4 DMA transfers happening at the same time. I never really tested the limits (hmm, maybe I should when I get some time and access to the logic analyzer).

I took some measurements for CC3200 DMA in this thread. The peak speed is four cycles per transfer (whether byte, halfword or word). The first transfer takes eight cycles and the transfer following an arbitration check is seven cycles. I think Tiva will follow the same pattern, since this matches the timings indicated by the ARM documentation.
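
As a rough way to play with those numbers, here is a sketch that totals the cycles for a burst of transfers. It assumes the three measured costs simply add up, with an arbitration check after every `arb_period` transfers; both `dma_cycles` and `arb_period` are names invented for this example, and the model has not been checked against hardware.

```c
/* Hypothetical cycle estimate for n back-to-back uDMA transfers, using
 * the per-transfer costs quoted above:
 *   8 cycles for the first transfer,
 *   7 cycles for a transfer that follows an arbitration check,
 *   4 cycles for every other transfer.
 * arb_period is the number of transfers between arbitration checks
 * (0 means "never re-arbitrate" in this model). */
unsigned long dma_cycles(unsigned long n, unsigned long arb_period)
{
    unsigned long total = 0;
    for (unsigned long i = 0; i < n; i++) {
        if (i == 0)
            total += 8;
        else if (arb_period != 0 && i % arb_period == 0)
            total += 7;
        else
            total += 4;
    }
    return total;
}
```

Under this model, 8 transfers with an arbitration check every 4 come to 8 + 3*4 + 7 + 3*4 = 39 cycles, which is a little under 5 cycles per transfer on average.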
