On Bit-Banding

pabigot

Having reached a new stage of enlightenment, I now realize that what I wrote in another topic is wrong. I can't find any other posts here that discuss bit-banding in detail, so I'm starting this topic to make the correction easier to find. See also the relevant section of Cortex-M3 Application Note 179.

 

I had said:

AFSEL is a single register that is not accessed using bitbanding: the TivaWare code that implements the equivalent to the CMSIS operation is:

   HWREG(ui32Port + GPIO_O_AFSEL) = ((ui32PinIO & 2) ?
                                      (HWREG(ui32Port + GPIO_O_AFSEL) |
                                       ui8Pins) :
                                      (HWREG(ui32Port + GPIO_O_AFSEL) &
                                       ~(ui8Pins)));
Which is just as much an RMW operation as the CMSIS version.

 

Bit-banding is relevant for operations on a GPIO DATA register, which is (in the CMSIS-based modification I used) declared as an array, and:

  GPIOF->DATA[GPIO_PIN_2 | GPIO_PIN_6] = 0xFF
does use bit-banding and only affects pins 2 and 6.

 

This second use, which allows selected GPIO pins to be modified while their neighbors are left alone, is really cool and slick, but it is not bit-banding.

 

It works because TI's implementation exposes the DATA register of each GPIO module as a 256-element array, where the array index acts as a mask selecting which of the module's eight pins are affected by the access. In Energy Micro's GPIO interface, which has 16 pins per port, the same problem is solved with DOUTSET, DOUTCLR, and DOUTTGL registers, which set, clear, and toggle (respectively) the pins for which a 1 is written to the register.
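
For concreteness, here is a minimal sketch of the two mechanisms. The register behavior is paraphrased from the reference manuals; the helper names are mine and not from TivaWare or emlib.

#include <stdint.h>

/* TM4C/Stellaris: the GPIO DATA register occupies 256 word-aligned slots,
 * and address bits [9:2] of the access act as a pin mask, so only the
 * masked pins are read or written. */
static inline void gpio_write_masked(uint32_t port_base, uint8_t pin_mask,
                                     uint8_t value)
{
  *(volatile uint32_t *)(port_base + ((uint32_t)pin_mask << 2)) = value;
}

/* EFM32: dedicated DOUTSET / DOUTCLR / DOUTTGL registers.  A 1 written to
 * a bit position sets, clears, or toggles the corresponding pin in a
 * single store; other pins are untouched. */
static inline void gpio_set_pins(volatile uint32_t *doutset, uint32_t pin_mask)
{
  *doutset = pin_mask;
}

So gpio_write_masked(GPIO_PORTF_BASE, GPIO_PIN_2 | GPIO_PIN_6, 0xFF) would have the same effect as the GPIOF->DATA assignment quoted above.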

 

Bit-banding specifically maps each addressable 32-bit word onto thirty-two consecutive alias words, one per bit, in which only bit 0 is significant. In the Cortex-M architecture this is done for words in one 1MB region of SRAM and one 1MB region of the peripheral memory. Only bit 0 of the RHS is relevant to the assignment, regardless of which bit of the original location is accessed, unlike the case above where it was only bits 2 and 6 of the RHS that were relevant.

 

Bit-banding is not useful for the original example (enabling two distinct bits in a register simultaneously). It is useful for enabling or disabling a single bit without affecting its neighbors, or for eliminating the window of vulnerability if an interrupt might occur during the multi-instruction read-modify-write sequence that the ARM instruction set requires. (MSP430 has instructions that set and clear multiple bits in a value atomically; ARM does not, and bit-banding solves that problem for the case of a single-bit manipulation. The DOUTSET/DOUTCLR approach of the EFM32 is much closer to the MSP430 model.)

 

If the address of the word being modified and the bit index are not known at compile time, there will be little to no code savings from using bit-banding: the calculation for the address is:

#define BITBAND_SRAM(word_, bit_) (*(volatile uint32_t *)((BITBAND_RAM_BASE + 4 * ((bit_) + 8 * ((uintptr_t)&(word_) - SRAM_BASE)))))
#define BITBAND_PER(word_, bit_) (*(volatile uint32_t *)((BITBAND_PER_BASE + 4 * ((bit_) + 8 * ((uintptr_t)&(word_) - PER_RAM_BASE)))))
which can be optimized to a constant only if both &word_ and bit_ are compile-time constants.
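
As a worked example, with the region bases written as literals from the standard Cortex-M3/M4 memory map (SRAM at 0x20000000 aliased at 0x22000000, peripherals at 0x40000000 aliased at 0x42000000) and illustrative names of my own: bit 0 of a peripheral register at 0x40004000 aliases to 0x42000000 + 32 * 0x4000 + 4 * 0 = 0x42080000, a constant the compiler can fold away. When the address isn't a compile-time constant the alias has to be computed:

#include <stdint.h>

volatile uint32_t flags;   /* somewhere in the first 1MB of SRAM */

/* Alias word for bit `bit` of `word` in the SRAM bit-band region:
 * 0x22000000 + 32 * (byte offset of word within SRAM) + 4 * bit. */
static volatile uint32_t *sram_bitband_alias(volatile uint32_t *word,
                                             unsigned int bit)
{
  uintptr_t offset = (uintptr_t)word - 0x20000000u;
  return (volatile uint32_t *)(0x22000000u + 32u * offset + 4u * bit);
}

void set_flag_bit4(void)
{
  /* Single store; no read-modify-write window. */
  *sram_bitband_alias(&flags, 4) = 1;
}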

 

Bit-banding is used extremely rarely in TI's driverlib (only in SysCtlPeripheral functions, and the motivation in those situations is unclear). It's used somewhat more frequently in Energy Micro's emlib.

 

The primary use I can see for it in user code is to set a flag in a word that holds shared state without having to disable interrupts. It is used in quite a few TivaWare examples this way:

 HWREGBITW(&g_ui32Flags, FLAGS_STREAMING) = 1;
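
A minimal sketch of that idiom, with a hypothetical interrupt handler and flag name (HWREGBITW is the TivaWare macro from inc/hw_types.h used in the line above):

#include <stdint.h>
#include "inc/hw_types.h"              /* TivaWare HWREG / HWREGBITW */

#define FLAG_RX_READY 0                /* hypothetical flag bit index */

static volatile uint32_t g_ui32Flags;  /* state shared between ISR and main loop */

void UARTIntHandler(void)              /* hypothetical interrupt handler */
{
  /* ... acknowledge the interrupt and drain the FIFO ... */
  HWREGBITW(&g_ui32Flags, FLAG_RX_READY) = 1;   /* one store through the alias */
}

void main_loop(void)
{
  for (;;) {
    if (HWREGBITW(&g_ui32Flags, FLAG_RX_READY)) {
      /* Clearing through the alias is also a single store, so flag bits
       * set from interrupt context in the meantime are never lost. */
      HWREGBITW(&g_ui32Flags, FLAG_RX_READY) = 0;
      /* ... process the received data ... */
    }
  }
}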


As noted in the previous post, a primary value of bit-banding for user memory is recording an event in a flag variable without risking a race condition. Finding myself using this idiom inside an interrupt handler, where it wasn't necessary, I wanted to see whether doing so was reducing code size or improving performance.

 

Compiler: gcc version 4.8.3 20131129 (release) [ARM/embedded-4_8-branch revision 205641] (GNU Tools for ARM Embedded Processors)

 

Optimization-related flags: -ggdb -Os -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=softfp

 

A Read-Modify-Write update of a single bit in a SRAM variable produces this code:

  34:main.c        ****   events |= EVENT;
  36 0002 054A                  ldr     r2, .L2+4
  41 0006 1068                  ldr     r0, [r2]     /* BEGIN RACE CONDITION */
  42 0008 40F01000              orr     r0, r0, #16
  43 000c 1060                  str     r0, [r2]     /* END RACE CONDITION */
which executes in 7 cycles (including 1 cycle overhead reading the cycle counter) on a TM4C123GH6PM. (NB: I removed from the listing the instruction at offset zero that read the cycle counter).

 

The bitband update produces this code:

  45:main.c        ****   BSPACM_CORE_BITBAND_SRAM32(events, EVENT_S) = 1;
  73 0000 0549                  ldr     r1, .L5
  77 0004 4901                  lsls    r1, r1, #5
  78 0006 01F10851              add     r1, r1, #570425344 /* 0x22000000 */
  79 000a 0120                  movs    r0, #1
which executes in 6 cycles. (Cycle counter read at offset 2 removed from listing.)

 

So: No difference in code size, one cycle timing difference. No clear reason to pick one over the other for performance reasons.

 

Full code (which will also eventually show up in BSPACM). There is no performance difference between inline and outline code; both were included to ensure previous use of the address of events within the function didn't affect the timing.

/* BSPACM - misc/bitband demonstration application
 *
 * Written in 2014 by Peter A. Bigot <http://pabigot.github.io/bspacm/>
 *
 * To the extent possible under law, the author(s) have dedicated all
 * copyright and related and neighboring rights to this software to
 * the public domain worldwide. This software is distributed without
 * any warranty.
 *
 * You should have received a copy of the CC0 Public Domain Dedication
 * along with this software. If not, see
 * <http://creativecommons.org/publicdomain/zero/1.0/>.
 */

/* Evaluate performance of a read-modify-write sequence to set a
 * single bit in an event mask versus a bitband assignment.
 */

#include <bspacm/core.h>
#include <stdio.h>

#define EVENT_S 4
#define EVENT (1U << EVENT_S)
volatile unsigned int events;

unsigned int rmw_set ()
{
  unsigned int t0;
  unsigned int t1;

  t0 = BSPACM_CORE_CYCCNT();
  events |= EVENT;
  t1 = BSPACM_CORE_CYCCNT();
  return t1-t0;
}

unsigned int bitband_set ()
{
  unsigned int t0;
  unsigned int t1;

  t0 = BSPACM_CORE_CYCCNT();
  BSPACM_CORE_BITBAND_SRAM32(events, EVENT_S) = 1;
  t1 = BSPACM_CORE_CYCCNT();
  return t1-t0;
}

void main ()
{
  unsigned int t0;
  unsigned int t1;
  unsigned int cycles;

  BSPACM_CORE_ENABLE_INTERRUPT();

  printf("\n" __DATE__ " " __TIME__ "\n");
  printf("System clock %lu Hz\n", SystemCoreClock);
  BSPACM_CORE_ENABLE_CYCCNT();

  events = 0;
  t0 = BSPACM_CORE_CYCCNT();
  events |= EVENT;
  t1 = BSPACM_CORE_CYCCNT();
  printf("Inline RMW %x took %u cycles including overhead\n", events, t1-t0);

  events = 0;
  t0 = BSPACM_CORE_CYCCNT();
  BSPACM_CORE_BITBAND_SRAM32(events, EVENT_S) = 1;
  t1 = BSPACM_CORE_CYCCNT();
  printf("Inline BITBAND %x took %u cycles including overhead\n", events, t1-t0);

  events = 0;
  cycles = rmw_set();
  printf("Outline RMW %x took %u cyclesincluding overhead\n", events, cycles);

  events = 0;
  cycles = bitband_set();
  printf("Outline BITBAND %x took %u cycles including overhead\n", events, cycles);

  t0 = BSPACM_CORE_CYCCNT();
  t1 = BSPACM_CORE_CYCCNT();
  printf("Timing overhead %u cycles\n", t1-t0);
}


One last comment: bit-banding is an optional feature of the ARMv7-M (Cortex-M3/M4) architecture, and it is not available at all on ARMv6-M (Cortex-M0 and M0+) devices such as the EFM32 Zero Gecko. Don't get too used to using it if you want to support those architectures.


Have you tried using exclusive load/stores?  Something like:

while(__strex((__ldrex(&events)|EVENTS),&events));

using intrinsics supplied by CCS. I'm not sure what the equivalent would be for GCC, other than simply inlining the assembly.


Have you tried using exclusive load/stores?

 

Thanks for the pointer; I was unaware of them.  More information here.

 

They're accessed in gcc through the ARM variants of the atomic builtins.  Thus something like:

  (void)__atomic_fetch_or(&events_, flg, __ATOMIC_ACQUIRE);
becomes something like:

.L2:
        ldrex   r1, [r3]
        orr     r2, r1, r0
        strex   r4, r2, [r3]
        cmp     r4, #0
        bne     .L2
        dmb     sy
which looks to be about what your loop does.

 

It appears this was designed for multiprocessor systems, not for protecting writes from interrupts. The instructions don't exist on Cortex-M0 but the GCC builtins are implemented and delegate to a library function which presumably does the right thing.

 

I think bit-banding is a superior solution where it's available: there's no loop, and no overhead for reading the updated value back when it isn't interesting. Where bit-banding isn't available I'd probably just disable interrupts around the RMW code.
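
For reference, a minimal sketch of that fallback using the CMSIS PRIMASK intrinsics (the helper name is mine; the intrinsics normally come in through the vendor device header, e.g. em_device.h on EFM32):

#include <stdint.h>
#include "em_device.h"   /* pulls in the CMSIS core header and intrinsics */

/* Mask interrupts, perform the read-modify-write, then restore the previous
 * PRIMASK state so this is safe to call with interrupts already disabled. */
static inline void atomic_set_bits(volatile uint32_t *word, uint32_t mask)
{
  uint32_t primask = __get_PRIMASK();
  __disable_irq();
  *word |= mask;
  __set_PRIMASK(primask);
}

A call like atomic_set_bits(&events, EVENT) then stands in for the bit-band assignment on parts without the alias regions.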

