Jump to content
43oh

Parsing "strings" in C - Attempting to parse MQTT payload


Recommended Posts

I'm attempting to parse a byte/char array that contains the following data

 

cmd=led&color=red&state=on

 

This data is made available to me in a callback function whose signature looks like this.

void callback(char* topic, byte* payload, unsigned int length) {

The payload parameter is the one that holds the data I'm interested in. I've tried various ways to parse this string and I've been unsuccessful (I mean besides shear brute force and tons of memory allocations). I eventually resorted with coding a test harness using Visual Studio and C/C++ and I get it where it works but then when I move it to Energia/Code Composer Studio it doesn't work. Also, I'm not sure if what I have is the right/best way to do this.

 

Here is my C/C++ code that works in Visual Studio.

 

Known issues

1. payload variable is declared as a char* while the actual payload parameter is a byte* so I'm not sure how to get past that. I'm guess I can cast the byte* to a char*?

2. I know the strtok function modifies the original. So I need to use something else. Not sure what

#include "stdafx.h"
#include <string.h>
#include <stdio.h>

enum Command
{
	Led,
	Motor,
	Servo,
	Ligth
};

enum LedColor
{
	red,
	green,
	blue
};

enum LedState
{
	on,
	off
};

typedef struct RemoteCommand
{
	Command command;
	LedColor ledColor;
	LedState ledState;
};

char* getValuePart(char* str);
bool startsWith(const char *pre, const char *str);

int main()
{
	const char parameterDelimiter[2] = "&";	
	
	char payload[] = "cmd=led&color=red&state=off";		

	RemoteCommand remoteCommand;
	char *token = strtok(payload, parameterDelimiter);
	
	if (startsWith(token, "cmd"))
	{
		char* separator = strchr(token, '=');
		if (separator != NULL)
		{
			++separator;			
		}

		// Parse the LED Command
		if (strstr(token, "led") != NULL)
		{
			remoteCommand.command = Command::Led;
			
			// Parse the Color attribute
			token = strtok(NULL, parameterDelimiter);
			if (startsWith(token, "color"))
			{
				if (strstr(token, "red") != NULL)
					remoteCommand.ledColor = LedColor::red;
				else if (strstr(token, "green") != NULL)
					remoteCommand.ledColor = LedColor::green;
				else if (strstr(token, "blue") != NULL)
					remoteCommand.ledColor = LedColor::blue;
			}

			// Parse the State attribute
			token = strtok(NULL, parameterDelimiter);
			if (startsWith(token, "state"))
			{
				if (strstr(token, "on") != NULL)
					remoteCommand.ledState = LedState::on;
				else if (strstr(token, "off") != NULL)
					remoteCommand.ledState = LedState::off;
			}
		}				
	}
	
	return(0);
}

char* getValuePart(char* str)
{
	char* color = strchr(str, '=');
	if (color != NULL)
	{
		++color;
	}

	return color;
}

bool startsWith(const char *str, const char *pre)
{
	return strncmp(pre, str, strlen(pre)) == 0;
}

Any suggestions to improve this code (less memory allocation, more efficient etc.) are most welcome.

Link to post
Share on other sites

First of all, the code you're showing above seems overly complex for performing such a simple task. I say seems, because honestly I do not care to even read the whole thing, but instead just glossed over it all. Anyway, the best code generally tends to be the simplest code. So perhaps you need to think on how you can simplify what you need done.

 

MQTT should not even be a consideration for this. As MQTT is just a communications method / protocol. You can implement whatever high level protocol you wish(on top of MQTT ) . . .but consider a simple byte field. Byte 1-2 is for X, byte 3-4 is for Y, etc.

Link to post
Share on other sites

@yyrkoon,

 

Thanks for your reply. Yes, the only reason I brought up MQTT is just in case someone else has done something similar for their needs. I guess simple and complex can be moved around depending on ones design and style. The protocol chosen here is intentional and is not up for discussion I'm afraid.

 

Given that the protocol is what it is, any help on parsing would be apprecaited.

Link to post
Share on other sites

Basically writing your own tokenizer, and I would suggest making it modify the original (set the separator characters to '\0') but tokenize based on the & and =

However, the output of this tokenizer should be an array of struct's whose members (2 of them) are char* pointers, one called "key" and one called "value" and point to the first character of each respective text in the string. For memory allocation sake this can either be dynamic or a static allocation of e.g. 20 copies, but the struct array member after the last one should have its "key" member point to NULL to denote end of the set.

 

Then your code can interpret the results by walking the key/value struct array and analyzing the keys to determine what to do.

Link to post
Share on other sites

Although my stated approach of storing the token results in an array of struct's may be unnecessarily fat here. You can always establish the key and value pointers (after zeroing the separator bytes) and then process the key right there (in the parser with a dictionary lookup-to-integer followed by switch{} block or just pass the key,value pointers to a callback where you compartmentalize the interpretation work).

Link to post
Share on other sites

If a byte and a char have the same size, casting is no problem.

 

To avoid modifying the data, don't use zero-terminated strings; just use pointer+length everywhere:

static void parseParameter(const char *param, size_t length);
static void handleNameValue(const char *name,  size_t nameLength,
			    const char *value, size_t valueLength);
static void handleCmd(const char *value, size_t length);
static void handleColor(const char *value, size_t length);
// etc.

void parsePayload(const char *payload, size_t length)
{
	size_t i, paramStartIndex = 0;

	for (i = 0; i < length; i++) {
		if (payload[i] == '&') {
			parseParameter(paramStartIndex, i - paramStartIndex);
			paramStartIndex = i + 1;
		}
	}
	parseParameter(paramStartIndex, length - paramStartIndex);
}

static void parseParameter(const char *param, size_t length)
{
	size_t i;

	for (i = 0; i < length; i++)
		if (param[i] == '=') {
			handleNameValue(param, i, &param[i + 1], length - (i + 1));
			break;
		}
}

#define memeq(a,b,sz) (!memcmp(a,b,sz))

#define isString(value, length, literal) (length == sizeof(literal) - 1 && memeq(value, literal, length))

static void handleNameValue(const char *name,  size_t nameLength,
			    const char *value, size_t valueLength)
{
	if (isString(name, nameLength, "cmd")) {
		handleCmd(value, valueLength);
	} else if (isString(name, nameLength, "color")) {
		handleColor(value, valueLength);
	} else if (isString(name, nameLength, "state")) {
		handleState(value, valueLength);
	}
}

static void handleCmd(const char *value, size_t length)
{
	if (isString(value, length, "led")) {
		remoteCommand.command = Command::Led;
	}
}

static void handleColor(const char *value, size_t length)
{
	if (isString(value, length, "red")) {
		remoteCommand.ledColor = LedColor::red;
	} else if (isString(value, length, "green")) {
		remoteCommand.ledColor = LedColor::green;
	} else if (isString(value, length, "blue")) {
		remoteCommand.ledColor = LedColor::blue;
	}
}

// etc.
Link to post
Share on other sites

Here's a simple custom tokenizer that modifies data in-place (I wrote this in C for execution at the command line in Linux, but, the code can be ripped into an MCU just the same):

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <string.h>

void run_callback(const char *, const char *);

int main(int argc, char *argv[])
{
	size_t i;
	char text[] = "cmd=load&color=red&bg=blueα=0.56";
	size_t tlen = strlen(text);
	char *key=NULL, *value=NULL;

	for (i=0; i < tlen; i++) {
		if (text[i] == '&') { // end of key/value pair
			text[i] = '\0';
			if (key != NULL) {
				run_callback(key, value);
			}
			key = NULL;
			value = NULL;
			continue;
		}
		if (text[i] == '=') { // end of key, start of value
			text[i] = '\0';
			value = &text[i+1];
			continue;
		}
		if (key == NULL) {
			key = &text[i];
		}
		// otherwise do nothing, keep incrementing i
	}
	// end of string, we might have a loaded key/value pair to process at the very end
	if (key != NULL) {
		run_callback(key, value);
	}
	return(0);
}

void run_callback(const char *key, const char *value)
{
        // do something with key & value - replace these with something more relevant to your application
	if (key == NULL || value == NULL) {
		printf("key=ptr[%p], value=ptr[%p]\n", key, value);
	} else {
		printf("key=[%s], value=[%s]\n", key, value);
	}
}

If it's absolutely mandatory that the original text NOT be modified (you should qualify this restriction though, since if it can be modified, this solution works fine), you can modify this tokenizer to copy the key & value into buffers:

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <string.h>

void run_callback(const char *, const char *);

#define KEYVALUE_BUFFER_MAXLEN 32

int main(int argc, char *argv[])
{
        size_t i;
        char text[] = "cmd=load&color=red&bg=blueα=0.56";
        size_t tlen = strlen(text);
        char *key=NULL, *value=NULL;
        size_t keystart=0, valuestart=0, tmpsz;
        char keybuf[KEYVALUE_BUFFER_MAXLEN], valuebuf[KEYVALUE_BUFFER_MAXLEN];

        for (i=0; i < tlen; i++) {
                if (text[i] == '&') { // end of key/value pair
                        // copy value into buffer
                        tmpsz = i-valuestart;
                        if (tmpsz > KEYVALUE_BUFFER_MAXLEN-1) {
                                tmpsz = KEYVALUE_BUFFER_MAXLEN - 1;  // minus 1 to provide space for the '\0' terminator
                        }
                        memcpy(valuebuf, value, tmpsz);
                        valuebuf[tmpsz] = '\0';

                        if (key != NULL) {
                                run_callback(keybuf, valuebuf);
                        }

                        key = NULL;
                        value = NULL;
                        continue;
                }
                if (text[i] == '=') { // end of key, start of value
                        // copy key into buffer
                        tmpsz = i-keystart;
                        if (tmpsz > KEYVALUE_BUFFER_MAXLEN-1) {
                                tmpsz = KEYVALUE_BUFFER_MAXLEN - 1;
                        }
                        memcpy(keybuf, key, tmpsz);
                        keybuf[tmpsz] = '\0';
                        value = &text[i+1];
                        valuestart = i+1;
                        continue;
                }
                if (key == NULL) {
                        key = &text[i];
                        keystart = i;
                }
                // otherwise do nothing, keep incrementing i
        }
        // end of string, we might have a loaded key/value pair to process at the very end
        if (key != NULL) {
                // copy value into buffer
                tmpsz = i-valuestart;
                if (tmpsz > KEYVALUE_BUFFER_MAXLEN-1) {
                        tmpsz = KEYVALUE_BUFFER_MAXLEN - 1;
                }
                memcpy(valuebuf, value, tmpsz);
                valuebuf[tmpsz] = '\0';
                run_callback(keybuf, valuebuf);
        }
        return(0);
}

void run_callback(const char *key, const char *value)
{
        // do something with key & value - replace these with something more relevant to your application
        if (key == NULL || value == NULL) {
                printf("key=ptr[%p], value=ptr[%p]\n", key, value);
        } else {
                printf("key=[%s], value=[%s]\n", key, value);
        }
}

Note the cost here .... extra code.  Being able to modify the data in-place tidies things up a bit.  What I wrote up above could be reworked to use strtok, but I don't see much point since replicating strtok's code is reasonably compact anyhow (and we're doing a purpose-built version that acts as a state machine unique to your particular data).

Link to post
Share on other sites

Also for kicks, I ran the first code (test.c) with some extra quirks to test its limits:

 

char text[] = "cmd=load&&&&&&&&&&&&&color=red&bg=blueα=0.56&&blah=&&&&=wtf&&&&";

 

Here's the output:

ebrundic@spock:~/inst$ ./test
key=[cmd], value=[load]
key=[color], value=[red]
key=[bg], value=[blue]
key=[alpha], value=[0.56]
key=[blah], value=[]
key=[wtf], value=[wtf]

The only output that seems odd is the "=wtf" case, since the code took the value and pointed both key and value pointers to the value.  A state variable (called no_key here) could protect against this:

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <string.h>

void run_callback(const char *, const char *);

int main(int argc, char *argv[])
{
	size_t i;
	char text[] = "cmd=load&&&&&&&&&&&&&color=red&bg=blueα=0.56&&blah=&&&&=wtf&&&&";
	size_t tlen = strlen(text);
	char *key=NULL, *value=NULL;
	int no_key = 0;

	for (i=0; i < tlen; i++) {
		if (text[i] == '&') { // end of key/value pair
			text[i] = '\0';
			if (key != NULL) {
				run_callback(key, value);
			}
			key = NULL;
			value = NULL;
			no_key = 0;
			continue;
		}
		if (text[i] == '=') { // end of key, start of value
			no_key = 1;
			text[i] = '\0';
			value = &text[i+1];
			continue;
		}
		if (!no_key && key == NULL) {
			key = &text[i];
		}
		// otherwise do nothing, keep incrementing i
	}
	// end of string, we might have a loaded key/value pair to process at the very end
	if (key != NULL) {
		run_callback(key, value);
	}
	return(0);
}

void run_callback(const char *key, const char *value)
{
	if (key == NULL || value == NULL) {
		printf("key=ptr[%p], value=ptr[%p]\n", key, value);
	} else {
		printf("key=[%s], value=[%s]\n", key, value);
	}
}

which produces:

key=[cmd], value=[load]
key=[color], value=[red]
key=[bg], value=[blue]
key=[alpha], value=[0.56]
key=[blah], value=[]

Link to post
Share on other sites

Anyway, that was fun... I'm glad I spent some time to write all that.  I'll end up using it later ;)

edit: Never did test the &key& scenario - that would produce a key that is equal to the whole rest of the string and NULL pointer for value.  Probably need if (key != NULL && value != NULL) before doing the run_callback stuff.

Link to post
Share on other sites

@spirilis and @Clavier, thank you so much for all of your ideas, code and suggestions! Really appreciate it. It will take me time to digest all of this, try out your suggestions etc. All this is a lot of help, so thank you both.

 

?Yes the original "payload" parameter should not be modified. I believe the socket reuses this buffer at the time publishing. It's not multi-threaded but I'd rather be working/manipulating my own copy.

 

?Ok, lots to do now! I'll be back.

Link to post
Share on other sites

I made this version, it doesn't take a callback, instead it returns a pointer to the value (the key is always at the start of a token anyway) and the key and value lengths in parameters, as well as the start of the next token as return value.

It may be easier to use in a state machine. Also, keys without values are allowed, as well as values without keys, but two = signs cause abortion of the function. empty tokens result in a key and value length of 0.

const char *getKeyValue(const char *const input, int *key_len, const char **value, int *value_len)
{
  const char *iter;
  *value = input;

  for (iter = input; (*iter != '&') && (*iter != '\0'); ++iter)
  {
    if (*iter == '=')
    {
      if (*value != input) // second '=' found, abort parsing
      {
        *key_len = 0;
        *value_len = 0;
        return NULL;
      }
      *key_len = iter - input;
      *value = iter + 1;
    }
  }

  *value_len = iter - *value;
  if (*value == input)
  {
    *key_len = *value_len;
    *value_len = 0;
  }
  return *iter == '\0' ? NULL : iter + 1;
}


#include <stdio.h>
#include <string.h>

char *test[] = {
  "cmd=led&color=red&state=on",
  "cmd=&bla=true",
  "cmd==false&blue=grey",
  "x&&&&&g=t",
  "cmd=load&&&&&&&&&&&&&color=red&bg=blueα=0.56&&blah=&&&&=wtf&&&&"
};

int main()
{
  int key_len, value_len;
  const char *key, *value, *next;
  char key_str[20], value_str[20];
  int n;

  for(n = 0; n < 5; ++n)
  {
    next = test[n];
    printf("%s\n", next);
    do
    {
      const char *current = next;
      next = getKeyValue(current, &key_len, &value, &value_len);
      memcpy(key_str, current, key_len);
      key_str[key_len] = '\0';
      memcpy(value_str, value, value_len);
      value_str[value_len] = '\0';
      printf("%s, %s\n",key_str, value_str);
    } while (next != NULL);
  }
  return 0;
}

Oh, and it doesn't touch the input string at all and uses practically no memory at all, it does require the user to iterate over all tokens though, but that's by design (since, no callback).

Link to post
Share on other sites

I made this version, it doesn't take a callback, instead it returns a pointer to the value (the key is always at the start of a token anyway) and the key and value lengths in parameters, as well as the start of the next token as return value.

It may be easier to use in a state machine. Also, keys without values are allowed, as well as values without keys, but two = signs cause abortion of the function. empty tokens result in a key and value length of 0.


Oh, and it doesn't touch the input string at all and uses practically no memory at all, it does require the user to iterate over all tokens though, but that's by design (since, no callback).

I like it!  Still very compact code.

Link to post
Share on other sites

@roadrunner84, this is good stuff. I appreciate everyone's participation here tremendously!

 

To be honest, it takes me quite a while to wrap my head around all of the code presented here. I've tried all of the code presented here and they all would work (obviously) and within each of the answers there are nuggets or artful pointer manipulations that have taught me how to handle other things that I've been struggling/confused with. At the end of the day, (the way I see it) string/char manipulation/parsing is a big component of code that most noobs are going to have to learn/master so I'm confident this thread will help others who struggle with C/C++.

 

I have a general (best practice) kind of question. In the world I come from (Delphi, C# etc.) even though we have pointers, we don't really know them to be pointers and we also don't use them like I see being done here. So when a method needs to return multiple parameters, we favor defining a struct or class and return it, over using the input parameters as "out" parameters like I see here. Of course in C/C++ do as the C/C++ folks do!

 

So my questions are:

1. Is this a best practice in general C/C++ programming or more of an embedded system area of practice?

2. When I'm trying to digest these functions I feel like I'm having to juggle multiple balls and as I go deeper into the function more balls are being added. In my Delphi/C# code I don't feel like that at all :). Do you do the same (albeit way more naturally)?

3. What is a good book for me to revise/learn and get a good grasp of pointer manipulation? I remember starting out (eons ago) teaching myself C with Kernighan and Ritchie's book. I know I'm not suddenly going to morph into a hardcore C/C++ programmer and do tricks like I see here overnight :)

Link to post
Share on other sites

Returning structs is a bit tricky in C, this is because the way parameters are assigned. You cannot assign a struct to another struct the naive way; it will work though, if you do not have any pointers or char* things in the struct.

If you do have pointers or strings (char*) inside your struct, you must bear in mind that the struct does only carry a pointer to the "member" array, so altering one copy will also alter the other.

struct foo
{
  char *text;
}
struct foo bar;
struct foo baz;

int main()
{
  bar.text = "hello";
  baz = bar;
  printf("%s\n", baz.text); // prints "hello"
  bar.text[1] = 'a'; // change "hello" into "hallo"
  printf("%s\n", baz.text); // prints "hallo"
  baz.text = "olla"; // changes the pointer in baz, so bar is unaltered
  printf("%s\n", bar.text); // prints "hallo"
  return 0;
}

So you need to keep watch over the internals of your struct constantly.

 

I do feel your juggling problem. Since you need to be more aware of slip-ups you cannot approach stuff the naive way. Things like "could this pointer point to invalid memory" or "am I sure I handle cases for an empty string" are essential to writing sound code. Your compiler or runtime will not (always) notify you if you screw things up.

To take my own code as an example, this case is not covered:

getKeyValue(NULL, &key_len, &value, &value_len);

Oops, passing a null pointer to the function will not be caught. My function assumes all four pointers are valid, if they're not, well, things go sour.

 

My code is intended to be plain C code, so no C++ tricks there. I started out with an old C book too, but I mostly learned just by doing things. Note that I debugged my function on a PC, not in the msp430 itself. Almost all behaviour wourks out fine in both. There's just tiny differences, like the size of a pointer (2 bytes in the msp430 and 8 bytes in my 64 bit linux machine), but they do not matter for good C code. I could do ugly things like assign a pointer to an integer or the other way around, which would work on the msp430 since both integers and pointers are 2 bytes, but not on my pc, since integers are 4 bytes but pointers are 8 bytes. Better steer clear from tricks like that.

Well.. I don't assign pointers to integers, but I do assign pointer differences to integers! in C++ those should be ptrdiff_t not int! But I want a string length to result, which is an integer. C++ is a bit more picky on types than C, which is a good thing, because simple errors are more easily detected.

Link to post
Share on other sites

You suggested returning a struct instead of passing out parameters, you could rewrite things like this

#include <stdio.h>
#include <string.h>

const char *test[] = {
  "cmd=led&color=red&state=on",
  "cmd=&bla=true",
  "cmd==false&blue=grey",
  "x&&&&&g=t",
  "cmd=load&&&&&&&&&&&&&color=red&bg=blueα=0.56&&blah=&&&&=wtf&&&&"
};

struct key_value
{
  const char *key;
  int key_len;
  const char *value;
  int value_len;
  const_char *next;
};

struct key_value getKeyValue(const char *const input)
{
  struct key_value result;
  const char *iter;
  result.value = input;
  result.key = input;

  for (iter = input; (*iter != '&') && (*iter != '\0'); ++iter)
  {
    if (*iter == '=')
    {
      if (result.value != input) // second '=' found, abort parsing
      {
        result.key_len = 0;
        result.value_len = 0;
        result.next = NULL;
        return result;
      }
      result.key_len = iter - input;
      result.value = iter + 1;
    }
  }

  result.value_len = iter - result.value;
  if (result.value == input)
  {
    result.key_len = result.value_len;
    result.value_len = 0;
  }
  result.next = *iter == '\0' ? NULL : iter + 1;
  return result;
}


int main()
{
  struct key_value result;
  char key_str[20], value_str[20];
  int n;

  for(n = 0; n < 5; ++n)
  {
    result.next = test[n];
    printf("%s\n", result.next);
    do
    {
      const char *current = result.next;
      result = getKeyValue(current);
      memcpy(key_str, result.key, result.key_len);
      key_str[result.key_len] = '\0';
      memcpy(value_str, result.value, result.value_len);
      value_str[result.value_len] = '\0';
      printf("%s, %s\n",key_str, value_str);
    } while (result.next != NULL);
  }
  return 0;
}

But you need to bear in mind that the key, value and next properties are pointers, not strings. So you can read them, but changing them will alter the input string too!

 

The drawback of this solution is that in embedded systems it takes more memory and more time. You see, the function now needs to allocate memory to store the result variable, as well as the return value. Then when the function returns it copies the result from result to the return value. Then the main function would copy the return value to a local variable again. So that's three times memory for the struct and two memcpy calls done in the dark. So if speed or size is of the essence, this method is less appropriate.

Add to that the issue of keeping track of pointers all the time and it's just not worth it most of the time.

Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...