L293D

__delay_cycles


Is there a workaround on the StellarPad for the __delay_cycles macro?

 

It is used extensively in some of the library code for the MSP430, but I do not see a NOP or __delay_cycles for the StellarPad.

 

The best I could come up with would be to define __delay_cycles(x) as something like delayMicroseconds(x/80), assuming the 80 MHz clock (80 cycles per microsecond).
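One catch with x/80 is that integer division truncates, so __delay_cycles(50) would become delayMicroseconds(0). A minimal sketch of a rounding-up conversion, assuming x is already counted in cycles of a clock that is a whole number of MHz (cycles_to_us_ceil is a hypothetical helper, not part of Energia):

```c
#include <stdint.h>

/* Convert a cycle count to whole microseconds at cpu_hz, rounding up so a
   short request never collapses to a zero-length delay.
   Assumes cpu_hz is a whole number of MHz (e.g. 80000000 on the StellarPad). */
uint32_t cycles_to_us_ceil(uint32_t cycles, uint32_t cpu_hz)
{
    uint32_t cycles_per_us = cpu_hz / 1000000u;
    /* 64-bit intermediate so cycles near UINT32_MAX cannot overflow */
    return (uint32_t)(((uint64_t)cycles + cycles_per_us - 1u) / cycles_per_us);
}
```

A shim could then be #define __delay_cycles(x) delayMicroseconds(cycles_to_us_ceil((x), 80000000u)), accepting that every delay is stretched to at least 1 µs.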

 

Any help would be appreciated.

 

L293D


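You can use SysCtlDelay() from driverlib. Below is an example sketch.

#include <stdint.h>
#include <stdbool.h>          // TivaWare headers use bool in their prototypes
#include <driverlib/sysctl.h>

void setup()
{
}

void loop()
{
  // 3 cycles per iteration: 2,000,000 * 3 = 6,000,000 cycles,
  // about 75 ms assuming an 80 MHz system clock
  SysCtlDelay(2000000);
}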
21.2.2.7 SysCtlDelay
    Provides a small delay.
 
Prototype:
 
  void SysCtlDelay(uint32_t ui32Count)
 
Parameters:
ui32Count is the number of delay loop iterations to perform.
 
Description:
This function provides a means of generating a constant length delay. It is written in assembly
to keep the delay consistent across tool chains, avoiding the need to tune the delay based on
the tool chain in use.
The loop takes 3 cycles/loop.
 
Returns:
None
 
See also the TivaWare driverlib user guide: http://www.ti.com/lit/ug/spmu298/spmu298.pdf
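To map __delay_cycles(n) onto SysCtlDelay(), the cycle count has to be divided by the 3 cycles/loop figure quoted above. A minimal sketch (the helper names are hypothetical); it ignores the call overhead and rounds up. It also clamps the count to at least 1 because, as I understand the TivaWare loop, it decrements before testing, so SysCtlDelay(0) would wrap around and spin for roughly 2^32 iterations:

```c
#include <stdint.h>

/* Loop count for SysCtlDelay() that covers at least `cycles` cycles,
   using the documented 3 cycles per iteration (call overhead ignored).
   Never returns 0, since a zero count is assumed to wrap in the loop. */
uint32_t sysctl_count_for_cycles(uint32_t cycles)
{
    uint32_t count = (uint32_t)(((uint64_t)cycles + 2u) / 3u);
    return count ? count : 1u;
}

/* Approximate delay in whole milliseconds for a given count at cpu_hz. */
uint32_t sysctl_delay_ms(uint32_t count, uint32_t cpu_hz)
{
    return (uint32_t)(((uint64_t)count * 3u * 1000u) / cpu_hz);
}
```

For the example sketch above, sysctl_delay_ms(2000000, 80000000) gives 75, matching 6,000,000 cycles at an assumed 80 MHz.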


Which code needs a tweak, and what is that tweak?

 

TimingDelay was the function I had in mind.

What I was thinking about was what happens when the cycle count overflows, so that start is greater than current.

But maybe I spoke too soon. Right now I am not sure whether that is a problem, or whether it actually works in all cases there.


I think it's probably right: C defines the behavior of arithmetic on unsigned int very carefully. (Beware, though, that unsigned char does not behave as nicely, because it is promoted to a signed int before any arithmetic, and once something becomes signed all bets are off.)

 

http://www.thetaeng.com/TimerWrap.htm is my go-to site whenever I start having doubts about whether I'm handling counter comparisons correctly.
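A small illustration of why the start > current case works for a 32-bit unsigned counter but needs a cast for an 8-bit one. The function names are hypothetical; the point is the arithmetic rules, not any particular TimingDelay implementation:

```c
#include <stdint.h>

/* Elapsed ticks on a free-running 32-bit counter. Unsigned subtraction is
   defined modulo 2^32, so the result is correct even when the counter
   wrapped once between the two reads (start > now). */
uint32_t elapsed32(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Same idea on an 8-bit counter: both operands are promoted to (signed)
   int, so now - start can be negative; casting back to uint8_t restores
   the modulo-256 result. */
uint8_t elapsed8(uint8_t start, uint8_t now)
{
    return (uint8_t)(now - start);
}

/* Wrap-safe timeout test: compare elapsed time, never raw counter values. */
int timed_out(uint32_t start, uint32_t now, uint32_t duration)
{
    return elapsed32(start, now) >= duration;
}
```

Comparing now >= start + duration instead would break as soon as start + duration wraps.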


As a follow-up to the original question: how does one add a very small delay to a C program?

By small I mean maybe one or two clock cycles (as compared to SysCtlDelay, which involves the overhead of a procedure call plus 3 cycles per loop).

From this thread http://forum.stellarisiti.com/topic/1577-very-simple-question-using-noop/ it appears that a NOP doesn't necessarily fill the bill.


Apparently not. From the ARM CMSIS core_cmInstr.h header (ARM GCC flavor):

/** \brief  No Operation
   No Operation does nothing. This instruction can be used for code alignment purposes.
 */
__attribute__( ( always_inline ) ) __STATIC_INLINE void __NOP(void)
{
  __ASM volatile ("nop");
}

So this is the first "nop" I've ever encountered that explicitly notes its use might not produce a delay. Good to know, and for pipelined architectures obvious (at least once it's pointed out).

 

I could imagine that an unadorned asm("nop") as described by @Lyon in that thread might not work if the compiler optimizes it away, but the volatile qualifier should ensure it gets into the instruction stream. In practice, if it's present in the instruction stream it simply has to impact execution, even if the net effect is that it gets absorbed into an unused pipeline stage. If you're concerned about that happening, insert another one, and repeat until the desired effect is visible.

 

The situation where I've used __NOP() is to enforce the 3-cycle delay after enabling a GPIO module before accessing its registers, and __NOP() works fine there (in the sense that I get a hard fault if I don't put any in, and don't when I put three in).
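A minimal sketch of emitting a fixed run of NOPs in one volatile asm statement, following the same pattern as the CMSIS __NOP() above. NOP_N and settle_then_read are hypothetical names; .rept/.endr are GNU assembler directives, and n must be a literal number because it is stringified. As the CMSIS comment warns, NOP timing is not architecturally guaranteed, so the "three is enough" result is empirical:

```c
/* Emit n NOP instructions back to back. volatile keeps the compiler from
   deleting or moving the asm statement; n must be a literal (e.g. 3). */
#define NOP_N(n) __asm__ __volatile__ (".rept " #n "\n\tnop\n\t.endr")

/* Hypothetical pattern: after a clock-gating write (not shown), pad before
   touching the newly enabled peripheral's register. */
int settle_then_read(const volatile int *reg)
{
    NOP_N(3);
    return *reg;
}
```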


Hi,

In my previous post about "nop" I gave the CCS syntax, since that was the user's IDE. The behavior should be the same on all toolchains, since they all target the same core hardware, as licensed by ARM.

If you need extremely short delays (one system clock), the best idea is to use assembly instructions that do no harm, such as "mov r5,r5", or a push/pop combination. ARM treats NOP as a 'hint' instruction, which does not necessarily do anything.

L


Very interesting. The reason I asked this in the first place is that I would really like to use the SharpLCD library code with the StellarPad. It fails to compile because of the call to __delay_cycles. Should I just define it as a longer delay, like 1 µs?


It gets interesting if you dig into it.

 

The ARMv6-M Architecture Reference Manual in section A6.7.47 specifies that NOP is an "architected NOP" that is a hint instruction as defined in section A5.2.5. Hint instructions are what implement sleep/wake features (SEV, WFE, WFI, YIELD). The assembly code "NOP" expands to the 16-bit instruction 0xBF00. This specific instruction was introduced in ARMv6T2. (The phrase "architected" appears to mean that every ARM implementation must behave within the defined limits, as opposed to non-architected behaviors where a vendor may change the behavior, e.g. to reduce power consumption.)

 

A6.7.47 does say that the timing effects are not guaranteed (it could even reduce execution time), and notes they are not suitable for timing loops. Other resources also note that it may be removed from the pipeline before it reaches the execution stage.

 

Section D.2 says that before the Unified Assembly Language, NOP was a pseudo-instruction that was replaced by MOV r0, r0 (ARM) or MOV r8, r8 (Thumb).

 

Based on that detail, I'm going to conclude that the architected behavior of NOP is indeed limited to instruction alignment, and that it is incidental and non-architected that the pseudo-instruction implementation had the effect of a delay.

 

I'm still reluctant to roll my own __DELAY_ONE_CYCLE() function that expands to MOV r8, r8. I still say that if the instruction has to be decoded, put enough of them in there and you'll get a delay. But I'm not as confident in that decision as I was before doing the research.
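For reference, a minimal sketch of what such a macro could look like. DELAY_ONE_CYCLE is hypothetical (not part of CMSIS or BSPACM), and it drops the leading double underscore because such names are reserved for the implementation. On ARM it emits MOV r8, r8, the old Thumb expansion of the NOP pseudo-instruction; whether that actually costs a cycle on a given core remains an empirical question. The non-ARM branch exists only so the sketch builds on a host and has no timing meaning:

```c
#include <stdint.h>

#if defined(__thumb__) || defined(__arm__)
/* Ordinary data-processing instruction rather than the NOP hint. */
#define DELAY_ONE_CYCLE() __asm__ __volatile__ ("mov r8, r8")
#else
/* Host fallback so the sketch compiles off-target. */
#define DELAY_ONE_CYCLE() __asm__ __volatile__ ("nop")
#endif

/* Hypothetical usage: pad three instructions after a register write. */
void write_then_pad(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
    DELAY_ONE_CYCLE();
    DELAY_ONE_CYCLE();
    DELAY_ONE_CYCLE();
}
```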


The SharpLCD library code I found has __delay_cycles parameters no smaller than 100 on an MCU running no faster than 20 MHz, except for a couple that are computed explicitly from the clock speed, where the comment says they need to be at least 2 µs.
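Because __delay_cycles counts cycles of the MSP430 clock, porting a value means rescaling it to the target clock: 100 cycles at 20 MHz is 5 µs, which is 400 cycles at an 80 MHz TM4C123. A minimal sketch (rescale_cycles is a hypothetical helper), rounding up so the delay is never shortened:

```c
#include <stdint.h>

/* Convert a cycle count written for src_hz into the equivalent count at
   dst_hz, rounding up. The 64-bit product avoids overflow; the result is
   assumed to fit in 32 bits (true unless dst_hz is far above src_hz). */
uint32_t rescale_cycles(uint32_t cycles, uint32_t src_hz, uint32_t dst_hz)
{
    uint64_t scaled = (uint64_t)cycles * dst_hz;
    return (uint32_t)((scaled + src_hz - 1u) / src_hz);
}
```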

 

So it looks like you do need something that's on the order of microseconds (as opposed to < 100ns), and that's long enough to be worth using the cycle timer. I'd use the same approach as BSPACM_CORE_DELAY_CYCLES. (Well, of course I would; I wrote it specifically for that sort of situation.)
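A minimal sketch of a cycle-counter delay in that spirit (this is not the BSPACM source). The counter read is injected so the loop can run off-target; on a Cortex-M4 it would read the DWT cycle counter, assumed here to be DWT_CYCCNT at 0xE0001004, which has to be enabled first (TRCENA in DEMCR, CYCCNTENA in DWT_CTRL). The fake counter and all names are hypothetical:

```c
#include <stdint.h>

typedef uint32_t (*counter_fn)(void);

/* Busy-wait until at least `cycles` ticks of a free-running 32-bit counter
   have elapsed. Unsigned subtraction keeps the comparison correct even if
   the counter wraps once during the wait. */
void delay_cycles_with(counter_fn read_counter, uint32_t cycles)
{
    uint32_t start = read_counter();
    while ((uint32_t)(read_counter() - start) < cycles) {
        /* spin */
    }
}

/* On target, something like:
       static uint32_t dwt_cyccnt(void)
       { return *(volatile const uint32_t *)0xE0001004u; }
   Off-target, this fake counter advances 5 ticks per read. */
uint32_t fake_ticks;
unsigned fake_reads;

uint32_t fake_counter(void)
{
    uint32_t now = fake_ticks;
    fake_ticks += 5u;
    ++fake_reads;
    return now;
}
```

Since the loop only compares elapsed time, the call overhead makes the delay slightly longer than requested, never shorter.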


Hi,

@pabigot,

It would be interesting to see (if you can post it) the 5-6 assembly instructions before your "nop" and the 1-2 instructions after it. The 3-cycle delay behavior could just be pipeline flushing.

Also, replacing it with a mov rx,rx could reveal something interesting.

Regards,


I don't get why every time I get into this sort of discussion I'm the one who gets to spend time doing the experiment to find the real answer, but what the heck.

 

In the interests of best scientific practices, before I get started I'll define the experiment:

  • Timing will be performed by reading the cycle count register, executing an instruction sequence, then reading the cycle counter. The observation will be the difference between the two counter reads.
  • The sequence will consist of zero or one context instructions followed by zero or more (max 7) delay instructions
  • The only context instruction tested will be a bit-band write of 1 to SYSCTL->RCGCGPIO enabling a GPIO module that had not been enabled prior to the sequence.
  • The two candidate delay instructions will be NOP and MOV R8, R8
  • Evaluation will be performed on an EK-TM4C123GXL experimenter board using gcc-arm-none-eabi-4_8-2013q4 with the following flags: -Wall -Wno-main -Werror -std=c99 -ggdb -Os -ffunction-sections -fdata-sections -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=softfp
  • The implementation will be in C using BSPACM, with the generated assembly code inspected to ensure the sequences as defined above are what has been tested
The predictions:
  • Null hypothesis (my bet): There will be no measurable cycle count difference in any test cases that vary only in the selected delay instruction. I.e., there is no pipeline difference on the Cortex-M4.
  • "Learn something" result (consistent with my previous claims but not my expectations): For sequences with N > 0 delay instructions, one cycle fewer will be measured in sequences using NOP than in sequences using MOV R8,R8. I have no prediction whether the context instruction will impact this behavior. I.e., on the Cortex-M4 only one NOP instruction may be absorbed.
  • "Surprise me" result (still consistent with my previous claims but demonstrating a much higher level of technology in Cortex-M4 than I would predict): A difference of more than one cycle will be observed between any two cases that vary only in the selected delay instruction, but the difference has an upper bound less than the sequence length. I.e., the pipeline is so deep that multiple decoded instructions can be dropped without impacting execution time.
  • "The universe is borked" result (can't happen): The duration of a sequence involving NOP is constant regardless of sequence length, while the duration of the sequence involving MOV R8,R8 is (in the limit) linear in sequence length. I.e., the CPU is able to decode and discard an arbitrary number of NOP instructions in constant time.
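Since the predictions are fixed before the run, they can be written down as a decision procedure on the measured deltas. A minimal sketch, assuming the per-case cycle deltas for one context (with or without the RCGCGPIO write) are collected into two 8-entry arrays indexed by N = 0..7, one for NOP and one for MOV R8,R8. classify() and the result names are hypothetical; "linear" is approximated by strictly increasing, and "upper bound less than the sequence length" is read as: no gap exceeds N, and the largest gap stays below 7:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DELAY 7 /* sequences of 0..7 delay instructions */

typedef enum {
    RESULT_NULL_HYPOTHESIS, /* NOP and MOV R8,R8 never differ */
    RESULT_LEARN_SOMETHING, /* NOP exactly one cycle cheaper whenever N > 0 */
    RESULT_SURPRISE_ME,     /* some gap > 1 cycle, but bounded below 7 */
    RESULT_UNIVERSE_BORKED, /* NOP flat in N while MOV keeps growing */
    RESULT_UNCLASSIFIED
} result_t;

result_t classify(const uint32_t nop[MAX_DELAY + 1],
                  const uint32_t mov[MAX_DELAY + 1])
{
    int all_equal = 1, one_less = 1, bounded = 1, nop_flat = 1, mov_grows = 1;
    int64_t max_gap = 0;

    for (size_t n = 0; n <= MAX_DELAY; ++n) {
        /* Signed gap, so a NOP slower than MOV shows up as negative. */
        int64_t gap = (int64_t)mov[n] - (int64_t)nop[n];

        if (gap != 0)
            all_equal = 0;
        if (gap != (n > 0 ? 1 : 0))
            one_less = 0;
        /* Cannot save more cycles than there are delay instructions. */
        if (gap < 0 || gap > (int64_t)n)
            bounded = 0;
        if (gap > max_gap)
            max_gap = gap;
        if (nop[n] != nop[0])
            nop_flat = 0;
        /* Strict growth stands in for "linear in N". */
        if (n > 0 && mov[n] <= mov[n - 1])
            mov_grows = 0;
    }

    if (all_equal)
        return RESULT_NULL_HYPOTHESIS;
    if (one_less)
        return RESULT_LEARN_SOMETHING;
    if (nop_flat && mov_grows)
        return RESULT_UNIVERSE_BORKED;
    if (bounded && max_gap > 1 && max_gap < MAX_DELAY)
        return RESULT_SURPRISE_ME;
    return RESULT_UNCLASSIFIED;
}
```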
@Lyon, @spirilis, @igor, @bluehash, and anybody else: please post your predictions (or state you have none) while I'm gone. I expect the execution and analysis of the experiment to take less than the 40 minutes it took to design the experiment and document the plan, but because I'm curious at a meta level and I'm donating time to this, I'm not going to comment further or post my results until other people put some skin in the game.

 

To the lab!

