jon1426459908

Tiva FatFs driver with DMA


https://github.com/jmagnuson/fatfs-tiva-cm4f/blob/master/src/third_party/fatfs/port/mmc-tiva-cm4f.c

 

It's still a work in progress, but should be a working drop-in replacement for the MMC driver TI provides.  I've only done minimal testing on it thus far, so any corrections or improvements would be greatly appreciated!


https://github.com/jmagnuson/fatfs-tiva-cm4f/blob/master/src/third_party/fatfs/port/mmc-tiva-cm4f.c

 

It's still a work in progress, but should be a working drop-in replacement for the MMC driver TI provides.  I've only done minimal testing on it thus far, so any corrections or improvements would be greatly appreciated!

 

Pretty cool. Working on that for my own FAT File System as well (after I figured out why I get every now and then a 200ms delay from the SDHC Card I am using).

 

Couple of questions/comments:

 

(1) You might want to use SSI_FRF_MOTO_MODE_3. In MODE_0 the SSI spends an extra clock cycle per byte getting the SCLK/MOSI lines back to idle signal levels, so each byte effectively costs 9 clocks instead of 8. Hence you are running at roughly 88% of the max throughput.

 

(2) Where does the ROM allocate the space for the uDMA control structs? Hardcoded, or dynamic through some wrapper?

 

(3) Keep the drive strength at 4mA. The SD spec would actually lead you to use 8mA to get the steeper signal edges, but I have found a lot of SDHC Cards just won't work.

 

(4) In DESELECT, you might want to wait for the SSI to be idle (NOT BUSY, I think, was the bit to poll). The other issue is that 7.5.1.1 states that the card keeps driving the DO line for at least one more clock cycle after CS goes high. So if you share the SPI bus, you need to send an extra byte after driving CS high (RobG's TFT/SD booster pack comes to mind). The latter part is partly in the code already, but it would be wise to move it into DESELECT so that after sending the dummy byte you can also wait for the SSI to be idle.

 

(5) In the equivalent of disk_initialize, I found that some cards will not respond to CMD0 right away, because the SD CARD might be in a state where it is perhaps still reading or writing (due to a MCU reset). So I added a loop to allow a few retries for it to answer.

 

(6) In xmit_datablock() ... one needs to write the 16 bit CRC (or dummy it). Given that the card ignores it, and given that the last few bytes of the RAM are taken by the stack, one could simply write 514 bytes, rather than 512 ... 
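
A minimal sketch of point (4), assuming the bus is reached through a small table of function pointers. The sd_bus_t type, sd_deselect() and the fake_* helpers are hypothetical names for illustration and are not part of the driver on GitHub. On target the hooks would wrap the chip-select GPIO and the SSI data/status registers.

```c
#include <stdint.h>

/* Hypothetical bus hooks (not from the driver); on target these would wrap
 * the chip-select GPIO and the SSI data/status registers. */
typedef struct {
    void    (*cs_high)(void);      /* drive CS inactive (high) */
    uint8_t (*xfer)(uint8_t out);  /* shift one byte out, return the byte read */
    int     (*busy)(void);         /* nonzero while the SSI is still shifting */
} sd_bus_t;

/* Deselect: finish pending traffic, raise CS, then clock one dummy byte so
 * the card lets go of DO before another device uses the bus. */
static void sd_deselect(const sd_bus_t *bus)
{
    while (bus->busy()) { }        /* wait for the SSI to go idle */
    bus->cs_high();
    (void)bus->xfer(0xff);         /* eight extra clocks with CS high */
    while (bus->busy()) { }        /* dummy byte fully shifted out */
}

/* Host-side fake that records the call order, for trying this off-target. */
static char fake_log[8];
static int  fake_len;
static int  fake_busy_polls;

static void fake_cs_high(void)
{
    fake_log[fake_len++] = 'C';
}

static uint8_t fake_xfer(uint8_t out)
{
    fake_log[fake_len++] = (out == 0xff) ? 'D' : '?';
    fake_busy_polls = 2;           /* pretend the shifter needs two polls */
    return 0xff;
}

static int fake_busy(void)
{
    if (fake_busy_polls > 0) {
        fake_busy_polls--;
        return 1;
    }
    return 0;
}

static const sd_bus_t fake_bus = { fake_cs_high, fake_xfer, fake_busy };
```

Calling sd_deselect(&fake_bus) on a host build records CS going high followed by the dummy byte, which is the ordering the shared-bus case depends on.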

 

- Thomas


Hi Thomas, thanks for the response!

 

To be honest, I haven't looked too closely at the existing TI code (mainly just set on getting DMA working), but I'll definitely test your suggestions after I get RX running.

 

(2) Where does the ROM allocate the space for the uDMA control structs? Hardcoded, or dynamic through some wrapper?

 

That's one omission that I really need to figure out.  On a different project (dma_memcpy) I chose to put the control structure within the driver file itself, but that didn't make the software very modular.  In this case I have left out the sample test project containing the uDMA control structure (for now), along with the monolithic interrupt vector definitions, which are also needed by the FatFs driver.  I will probably eventually leave it to the preprocessor to ensure that the control structure isn't declared more than once.

 

(5) In the equivalent of disk_initialize, I found that some cards will not respond to CMD0 right away, because the SD CARD might be in a state where it is perhaps still reading or writing (due to a MCU reset). So I added a loop to allow a few retries for it to answer.

 

I always assumed this is what the initial clock train was for.  Is the card still not responding even after that?

 

(6) In xmit_datablock() ... one needs to write the 16 bit CRC (or dummy it). Given that the card ignores it, and given that the last few bytes of the RAM are taken by the stack, one could simply write 514 bytes, rather than 512 ...

 

I had this in mind when making the list of To-do's; it would make much more sense to eliminate the extra CRC transfer.  I'd also like to eliminate having to disable and re-initialize the uDMA for every 512 byte transfer in a multiblock transfer, but alas the driver currently isn't set up for those kinds of shortcuts.

 

Thanks again,

Jon


That's one omission that I really need to figure out.  On a different project (dma_memcpy) I chose to put the control structure within the driver file itself, but that didn't make the software very modular.  In this case I have left out the sample test project containing the uDMA control structure (for now), along with the monolithic interrupt vector definitions, which are also needed by the FatFs driver.  I will probably eventually leave it to the preprocessor to ensure that the control structure isn't declared more than once.

 

I was more after "how does the ROM code know where the uDMA control structure is located" ... Mainly, because I tend to do most of the coding bare metal, but to be honest for setting up things it might be a waste ...

 

I always assumed this is what the initial clock train was for.  Is the card still not responding even after that?

 

Guess the naming is unclear in the code. An SD card comes up in SD bus mode. This initial sequence switches its interface over to SPI. If the SD card was already in SPI mode, this is a no-op. There is no SD card reset sequence ... So if your MCU resets in the middle of a read or write sequence, you have to deal with that another way.

 

I had this in mind when making the list of To-do's; it would make much more sense to eliminate the extra CRC transfer.  I'd also like to eliminate having to disable and re-initialize the uDMA for every 512 byte transfer in a multiblock transfer, but alas the driver currently isn't set up for those kinds of shortcuts.

 

You need to send the extra 2 bytes, there is no way around that. On the other hand if this would be interrupt based, it would not hurt. At the time the uDMA is done and the ISR gets called, you have slots in the FIFO available. So that the ISR could simply do 2 writes without checking ...


Sorry to pester this thread, but here is some food for thought.

 

First off, my comment regarding the clock train was wrong, bad memory. 6.4.1.1 states that 74 clocks (@ 400kHz) are needed with CS tied high. 7.2.1 is clear that CMD0 with CS tied low triggers the switch to SPI. Perhaps that is part of my problem: with RobG's TFT/SD the SPI bus is shared, and the TFT in my case is active before the SD card gets initialized, so the card sees clocks at more than 400kHz. So perhaps the init sequence has to be split, so that the 74 SPI clocks @ 400kHz and CMD0 are sent right away before anything else happens. If there is no response to CMD0, then there is no card ... Thanks for making me revisit this issue.
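
As a rough illustration of that split init sequence, here is a minimal sketch that clocks 80 cycles with CS high and then retries CMD0 until the card reports idle. It assumes the SPI clock has already been set to 400kHz or less. The sd_spi_t hooks, both sd_* functions and the fake_* card are hypothetical names for illustration only. The CMD0 frame (0x40, four zero argument bytes, CRC byte 0x95) and the 0x01 idle response are the standard values.

```c
#include <stdint.h>

/* Hypothetical bus hooks (not part of the driver under discussion). */
typedef struct {
    void    (*cs)(int level);      /* set chip select: 1 = high, 0 = low */
    uint8_t (*xfer)(uint8_t out);  /* shift one byte out, return the byte read */
} sd_spi_t;

#define SD_R1_IDLE 0x01u           /* R1 response: card is in idle state */

/* Send CMD0 (GO_IDLE_STATE) and return the R1 byte, or 0xff on no answer. */
static uint8_t sd_cmd0(const sd_spi_t *spi)
{
    static const uint8_t frame[6] = { 0x40, 0x00, 0x00, 0x00, 0x00, 0x95 };
    uint8_t r1 = 0xff;
    unsigned i;

    for (i = 0; i < 6; i++) {
        (void)spi->xfer(frame[i]);
    }
    /* A response byte has bit 7 clear; give the card up to 8 bytes. */
    for (i = 0; i < 8 && (r1 & 0x80); i++) {
        r1 = spi->xfer(0xff);
    }
    return r1;
}

/* Returns 0 once the card answers CMD0 with "idle", -1 if it never does.
 * Assumes the SPI clock has already been set to 400kHz or less. */
static int sd_enter_spi_mode(const sd_spi_t *spi, unsigned retries)
{
    unsigned i;

    spi->cs(1);
    for (i = 0; i < 10; i++) {
        (void)spi->xfer(0xff);     /* 10 bytes = 80 clocks, covers the 74 */
    }
    while (retries-- > 0) {
        uint8_t r1;

        spi->cs(0);
        r1 = sd_cmd0(spi);
        spi->cs(1);
        (void)spi->xfer(0xff);     /* trailing clocks with CS high */
        if (r1 == SD_R1_IDLE) {
            return 0;
        }
    }
    return -1;
}

/* Host-side fake card that ignores the first two CMD0 attempts. */
static int fake_cs_level = 1, fake_frame_bytes, fake_cmd0_seen;
static int fake_high_clocks, fake_preamble_clocks = -1;

static void fake_cs(int level)
{
    if (level == 0 && fake_preamble_clocks < 0) {
        fake_preamble_clocks = fake_high_clocks;  /* clocks before first select */
    }
    fake_cs_level = level;
    fake_frame_bytes = 0;
}

static uint8_t fake_xfer(uint8_t out)
{
    (void)out;
    if (fake_cs_level) {
        fake_high_clocks += 8;
        return 0xff;
    }
    if (fake_frame_bytes < 6) {
        if (++fake_frame_bytes == 6) {
            fake_cmd0_seen++;
        }
        return 0xff;
    }
    return (fake_cmd0_seen > 2) ? SD_R1_IDLE : 0xff;
}

static const sd_spi_t fake_spi = { fake_cs, fake_xfer };
```

With the fake card, sd_enter_spi_mode(&fake_spi, 5) succeeds on the third CMD0. A return of -1 would be the "no card" case.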

 
 
Anyway, in the code on github, the scheme essentially is to set up the DMA, and then wait for it to be done. So you are waiting pretty much most of the time. The time you are waiting is time that an ISR can execute without affecting data transfer. If you have an RTOS then it becomes more complex, as you want to have your task sleep till the transfer is done.
 
On the other hand, your MCU is running at 80MHz. The SPI link in the code is at 12.5MHz, which means you have 51.2 CPU clock cycles per byte sent. If you want to write say 20kB/sec, then the time it takes to transfer via SPI is equivalent to about 1.4% of the total clock cycles available to the MCU. So, IMHO it's a drop in the bucket. Hence an interesting question is whether you can send data via SPI under CPU control at that speed ...
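
The arithmetic behind those numbers can be written down as a small sketch. The function names are made up for illustration, and only payload bytes are counted (no command, token or CRC overhead). For 20000 bytes/sec it comes out at roughly 1.3%, in the same ballpark as the figure quoted above.

```c
/* CPU cycles that pass while one 8-bit frame is shifted out on the SPI bus. */
static double cycles_per_byte(double cpu_hz, double spi_hz)
{
    return cpu_hz / spi_hz * 8.0;
}

/* Fraction of all CPU cycles that coincide with SPI traffic at a given
 * payload rate (payload bytes only, protocol overhead ignored). */
static double spi_cycle_fraction(double cpu_hz, double spi_hz,
                                 double bytes_per_sec)
{
    return bytes_per_sec * cycles_per_byte(cpu_hz, spi_hz) / cpu_hz;
}
```

For example, cycles_per_byte(80e6, 12.5e6) gives 51.2 and spi_cycle_fraction(80e6, 12.5e6, 20000.0) gives about 0.0128.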
 
Well, here is what I came up with:
 

#define LM4F120_DISK_FIFO_COUNT 8

/*
 * lm4f120_disk_send(uint8_t data)
 *
 * Send one byte, discard read data. The assumption
 * is that at this point both TX and RX FIFOs are
 * empty, so that a write is always possible. On
 * the read part there is a wait for the RX FIFO to
 * become not empty.
 */

static void lm4f120_disk_send(uint8_t data)
{
    SSI2_DR_R = data;

    while (!(SSI2_SR_R & SSI_SR_RNE)) { continue; }

    SSI2_DR_R;
}

/*
 * lm4f120_disk_receive()
 *
 * Receive one byte, send 0xff as data. The assumption
 * is that at this point both TX and RX FIFOs are
 * empty, so that a write is always possible. On
 * the read part there is a wait for the RX FIFO to
 * become not empty.
 */

static uint8_t lm4f120_disk_receive(void)
{
    SSI2_DR_R = 0xff;

    while (!(SSI2_SR_R & SSI_SR_RNE)) { continue; }

    return SSI2_DR_R;
}

/*
 * lm4f120_disk_send_data(const uint8_t *data, uint32_t count)
 *
 * Returns a "Data Response Token".
 */

static uint8_t lm4f120_disk_send_data(const uint8_t *data, uint32_t count)
{
    unsigned int n;
    uint8_t response;
    uint16_t crc16;
 
#if (LM4F120_DISK_CONFIG_CRC == 1)
    crc16 = lm4f120_compute_crc16(data, count);
#else /* LM4F120_DISK_CONFIG_CRC == 1 */
    crc16 = 0xffff;
#endif /* LM4F120_DISK_CONFIG_CRC == 1 */

#if (LM4F120_DISK_CONFIG_OPTIMIZE == 1)

    /*
     * Idea is to first stuff data into the TX FIFO till it's full
     * (or better said till there are no more slots in the RX FIFO).
     * Then wait for at least one item in the RX FIFO to read it back,
     * and refill the TX FIFO. At the end, the RX FIFO is drained.
     */

    for (n = 0; n < LM4F120_DISK_FIFO_COUNT; n++)
    {
        SSI2_DR_R = data[n];
    }
    
    for (n = LM4F120_DISK_FIFO_COUNT; n < count; n++)
    {
        while (!(SSI2_SR_R & SSI_SR_RNE)) { continue; }
        
        SSI2_DR_R;
        SSI2_DR_R = data[n];
    }
    
    while (!(SSI2_SR_R & SSI_SR_RNE)) { continue; }
    
    SSI2_DR_R;
    SSI2_DR_R = crc16 >> 8;
    
    while (!(SSI2_SR_R & SSI_SR_RNE)) { continue; }
    
    SSI2_DR_R;
    SSI2_DR_R = crc16;
    
    for (n = 0; n < LM4F120_DISK_FIFO_COUNT; n++)
    {
        while (!(SSI2_SR_R & SSI_SR_RNE)) { continue; }
        
        SSI2_DR_R;
    }
    
#else /* LM4F120_DISK_CONFIG_OPTIMIZE */

    for (n = 0; n < count; n++)
    {
        lm4f120_disk_send(data[n]);
    } 
    
    lm4f120_disk_send(crc16 >> 8);
    lm4f120_disk_send(crc16 >> 0);
    
#endif /* LM4F120_DISK_CONFIG_OPTIMIZE */

    /* At last read back the "Data Response Token":
     *
     * 0x05 No Error
     * 0x0b CRC Error
     * 0x0d Write Error
     */
    
    response = lm4f120_disk_receive() & 0x1f;
    
    return response;
}

If LM4F120_DISK_CONFIG_OPTIMIZE is set, the code prestuffs the FIFO with data and then only adds a new entry when read data is available (read data is driving the FIFO in that case). Hence you need to check only one FIFO status condition. After all data is written, the read FIFO is drained. By not using the ROM routines the compiler can optimize the accesses quite a bit. Inquisitive minds have of course already spotted a further optimization. In reality this code always has a "count" of 512 (or at least it could guarantee that "count" is always even), which means one could switch the SPI port to a word length of 16 bits and hence halve the number of reads/writes to the SPI device ...
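
A minimal sketch of the 16-bit idea, assuming the SSI shifts the most significant bit first, so that the first byte of each pair has to sit in the high half of the frame. pack_be16() is a hypothetical helper, not something from the posted code. A real implementation would more likely build each frame on the fly while writing the data register than fill a second buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack byte pairs into 16-bit frames, first byte in the high half so that an
 * MSB-first shifter puts it on the wire first. count is expected to be even;
 * a trailing odd byte is ignored. Returns the number of frames written. */
static size_t pack_be16(const uint8_t *data, size_t count, uint16_t *frames)
{
    size_t n;

    for (n = 0; n + 1 < count; n += 2) {
        frames[n / 2] = (uint16_t)((data[n] << 8) | data[n + 1]);
    }
    return count / 2;
}
```

A 512 byte block then becomes 256 frames, and the 16-bit CRC fits into one more frame as is.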
 

 

However, what's bothering me in general is that after the write operation the SD card may enter a "busy" state, which it signals by putting a magic "busy" value on the bus ("effectively holding the DataOut line low", 7.2.1). So the traditional way is to poll that, which may take quite a long time (for FatFs this wait is hidden in send_cmd() -> wait_ready()). So in many cases you end up wasting more time waiting than actually writing the data. If there is nothing better to do, this busy wait is OK, but if you want to interleave usage of a shared SPI bus, it might hurt ...

 

- Thomas

 


Anyway, in the code on github, the scheme essentially is to set up the DMA, and then wait for it to be done. So you are waiting pretty much most of the time. The time you are waiting is time that an ISR can execute without affecting data transfer. If you have an RTOS then it becomes more complex, as you want to have your task sleep till the transfer is done.

 

On the other hand, your MCU is running at 80MHz. The SPI link in the code is at 12.5MHz, which means you have 51.2 CPU clock cycles per byte sent. If you want to write say 20kB/sec, then the time it takes to transfer via SPI is equivalent to about 1.4% of the total clock cycles available to the MCU. So, IMHO it's a drop in the bucket. Hence an interesting question is whether you can send data via SPI under CPU control at that speed ...

 

I think ultimately it will come down to actual benchmarking numbers, particularly considering the time it takes to set up DMA vs the benefits of not needing the MCU for the transfer.  In the case of dma_memcpy, IIRC DMA didn't make practical sense until around 6-8k transfer sizes.  I feel the same is true for FatFs, at least for the code in its present state, but the appeal of (nearly) MCU-less SD card transfers is too great to ignore.

 

However in general what's bothering me is that after the write operation that SD card may enter a "busy" state, which it signals by putting a magic "busy" value on the bus ("effectively holding the DataOut line low", 7.2.1). So the traditional way is to poll that, which may take quite a long time (for FatFs this wait is hidden in send_cmd() -> wait_ready()). So you end up wasting more time waiting in many cases than actually writing the data. If there is nothing else better to do this busy wait is ok, but if you want to interleave usage of a shared SPI bus, it might hurt ...

 

I haven't found it in the official spec and haven't tested it personally, but according to ChaN ('Cosideration on Multi-slave Configuration') you can release CS and send a single CLK, and the SD card will in turn release DO but continue whatever is making it 'busy', allowing you to service another slave and return to the SD card at a later time.


I haven't found it in the official spec and haven't tested it personally, but according to ChaN ('Cosideration on Multi-slave Configuration') you can release CS and send a single CLK, and the SD card will in turn release DO but continue whatever is making it 'busy', allowing you to service another slave and return to the SD card at a later time.

 

Ah, yes ... it's in 7.5.1.2 ... Here is a link to a forum that has the version of the spec that you are interested in. http://forums.parallax.com/showthread.php/146864-4-bit-SD-interface.

 

The releasing CS works. I am using that with RobG's TFT/SD boosterpack. The TFT can be updated while the SD card is busy. My point was more along the line that you might spend 1ms to write a bunch of data and then spend 100ms or more waiting for it to be idle again. And the problem with the waiting is that there does not seem to be a good way to offload the CPU from polling.
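
One way to make the polling less painful on a shared bus is a non-blocking check that selects the card, reads one byte, and deselects again, so other devices can use the bus in between. This is only a sketch under the assumption that the card behaves as described above, i.e. it resumes holding DO low when reselected while still busy and releases it after CS goes high. sd_spi_t, sd_poll_ready() and the fake_* card are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical bus hooks (illustration only). */
typedef struct {
    void    (*cs)(int level);      /* set chip select: 1 = high, 0 = low */
    uint8_t (*xfer)(uint8_t out);  /* shift one byte out, return the byte read */
} sd_spi_t;

/* Returns 1 if the card is ready, 0 if it is still busy. Always leaves the
 * card deselected with DO released, so the bus is free for other devices. */
static int sd_poll_ready(const sd_spi_t *spi)
{
    uint8_t b;

    spi->cs(0);
    b = spi->xfer(0xff);           /* a busy card holds DO low: b != 0xff */
    spi->cs(1);
    (void)spi->xfer(0xff);         /* extra clocks so the card releases DO */
    return b == 0xff;
}

/* Host-side fake card that stays busy for the first two polls. */
static int fake_cs_level = 1;
static int fake_busy_polls = 2;

static void fake_cs(int level)
{
    fake_cs_level = level;
}

static uint8_t fake_xfer(uint8_t out)
{
    (void)out;
    if (fake_cs_level == 0 && fake_busy_polls > 0) {
        fake_busy_polls--;
        return 0x00;               /* DO held low while busy */
    }
    return 0xff;
}

static const sd_spi_t fake_spi = { fake_cs, fake_xfer };
```

The CPU still has to come back and poll, for instance from a timer tick, but it no longer has to sit in a loop for the full busy period.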

 

Regarding the DMA, I did some benchmarks way back with a LM3S811, so I had no DMA to play with. Using the mentioned FIFO scheme, it could saturate the SPI bus though. 

 

- Thomas


Another update, in the form of a feature branch.  I used the scatter-gather functionality to consolidate the token, block buffer write, and CRC into a single DMA call.  Again, this is pretty experimental and I haven't yet consulted the SD spec to see just how many rules I am breaking, but it works and is cool to see in action.

 

Next on the list (apart from implementing the corresponding read functionality) will be to move the 'buff += 512' for multiblock transmissions out of disk_write() and into the interrupt handler.
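
A rough sketch of what moving the buffer advance into the handler could look like, assuming a small job structure shared between disk_write() and the interrupt handler. All names here (sd_write_job_t, sd_write_begin(), sd_write_on_block_done(), the start_block hook) are hypothetical and not taken from the feature branch. The start_block hook is assumed to take care of everything a real multiblock write needs between blocks, such as the data token, CRC, data response and busy wait.

```c
#include <stdint.h>

#define SD_BLOCK_SIZE 512u

/* Hypothetical per-transfer state shared with the interrupt handler. */
typedef struct {
    const uint8_t    *buff;                      /* block currently in flight */
    volatile uint32_t blocks_left;               /* blocks not yet completed */
    void (*start_block)(const uint8_t *block);   /* kicks off one block */
} sd_write_job_t;

/* Called from disk_write(): remember the job and start the first block. */
static void sd_write_begin(sd_write_job_t *job, const uint8_t *buff,
                           uint32_t count,
                           void (*start_block)(const uint8_t *block))
{
    job->buff = buff;
    job->blocks_left = count;
    job->start_block = start_block;
    if (count > 0) {
        start_block(buff);
    }
}

/* Called from the transfer-complete handler: advance to the next block. */
static void sd_write_on_block_done(sd_write_job_t *job)
{
    if (job->blocks_left == 0) {
        return;                                  /* spurious call, nothing to do */
    }
    job->buff += SD_BLOCK_SIZE;                  /* the former 'buff += 512' */
    job->blocks_left--;
    if (job->blocks_left > 0) {
        job->start_block(job->buff);
    }
}

/* Host-side fake that records which block was started last. */
static const uint8_t *fake_last_block;
static int fake_starts;

static void fake_start_block(const uint8_t *block)
{
    fake_last_block = block;
    fake_starts++;
}
```

disk_write() would then only call sd_write_begin() and wait (or sleep) until blocks_left reaches zero.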
