Convert char to integer issue


Hello

 

I am working on an MSP430G2553 project, and a strange thing happens when I convert two chars to an integer.

 

If I only convert one value like this:

char a[2] = {data[9], data[10]};  // Put in a array + terminating NULL byte.
int i_a = (int) atoi(a); // Convert to a integer.

It converts "a" correctly to an integer, but when I do it like this:

char a[2] = {data[9], data[10]};  // Put in a array + terminating NULL byte.
int i_a = (int) atoi(a); // Convert to a integer.

char b[2] = {data[12], data[13]};  // Put in a array + terminating NULL byte.
int i_b = (int) atoi(b); // Convert to a integer.

It converts both "a" and "b" to integers, but it appends a 1 at the end. For example, if "a"="30", after the conversion "i_a"=301?

 

So this just happens when I convert both "a" and "b". If I remove one of the conversion code it converts the other value correctly...

 

BUT...

 

If I manually add a terminating NULL byte on both "a" and "b" like this:

char a[2] = {data[9], data[10]};  // Put in a array + terminating NULL byte.
a[2] = NULL;
int i_a = (int) atoi(a); // Convert to a integer.

char b[2] = {data[12], data[13]};  // Put in a array + terminating NULL byte.
b[2] = NULL;
int i_b = (int) atoi(b); // Convert to a integer.

it works as intended. Why is this? Shouldn't declaring like "char a[2] = {data[9], data[10]};" autoadd the terminating NULL byte at the end?

 

As you might suspect I am a newbie to this so what is the correct way of doing this?

 

Hope I am clear enough and you understand. Looking forward to a reply.

 

Best regards

Andreas


"Why is this? Shouldn't declaring like "char a[2] = {data[9], data[10]};" autoadd the terminating NULL byte at the end?"

 

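Nope.  A string literal adds the terminator automatically; a brace-enclosed list only leaves one if the array has more elements than initializers. On top of this, you've only declared char a[] as a 2-byte array, and then you do: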

  a[2] = NULL;

 

which is actually referencing a memory location OUTSIDE the declared bounds of char a[] !  This will overwrite something else on your stack which may cause erroneous application behavior from there on.

 

The right way to do this:

char a[3] = {data[9], data[10], 0}; // Put in a array + terminating NULL byte.

 

Likewise with char b[].
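
As a minimal sketch of the fix above, the conversion can be wrapped in a small helper. The name two_digits_to_int and the assumption that data holds ASCII digits at the given positions are mine, not from the original code. Without a terminator inside the array, atoi keeps reading whatever bytes happen to follow in memory, which is probably where the stray 1 came from.

```c
#include <stdlib.h>

/* Hypothetical helper: convert the two ASCII digits at data[pos] and
   data[pos + 1] to an int. The buffer has three elements so that the
   terminating '\0' fits inside the array. */
int two_digits_to_int(const char *data, int pos)
{
    char buf[3] = { data[pos], data[pos + 1], '\0' };
    return atoi(buf);
}
```

With this, the two conversions become i_a = two_digits_to_int(data, 9) and i_b = two_digits_to_int(data, 12).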


Thanks for your reply.

 

Sorry, I forgot to mention that I am using Energia. Or maybe that doesn't matter?

 

Please see

http://energia.nu/String.html

 

Here it states

 

 

char Str2[8] = {'e', 'n', 'e', 'r', 'g', 'i', 'a'};

 

Declare an array of chars (with one extra char) and the compiler will add the required null character, as in Str2

 

Best regards

Andreas
 


Yes, the array always starts at zero, but the number inside your declaration "char a[2]" defines the total number of elements, not the "maximum index number".  Common mistake with C/C++ coding BTW.

 

So "char a[2]" has two valid indices: a[0], and a[1].  a[2] is 1 byte beyond the upper boundary of the array's reserved memory space.
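
To tie this back to the Energia example: it works there because Str2 has 8 elements but only 7 initializers, and C zero-initializes the elements that are left over. A small sketch (the function names are made up for illustration):

```c
#include <stddef.h>

/* Returns 1 if the last element of an 8-element array with only seven
   initializers is zero. C guarantees this: elements without an
   initializer are zero-initialized. */
int last_element_is_zero(void)
{
    char Str2[8] = {'e', 'n', 'e', 'r', 'g', 'i', 'a'};
    return Str2[7] == '\0';
}

/* Number of elements in a char a[2]: the valid indices are 0 and 1 only,
   so there is no room left for a terminator. */
size_t element_count(void)
{
    char a[2] = {'3', '0'};
    return sizeof a / sizeof a[0];
}
```

So the "one extra char" in the Energia text is the whole point: the array has to be declared one element larger than the number of characters you put in.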


Being pedantic (which I am...):

 

 

char a[2];
/* ... */
a[2] = NULL;

 

Ignoring the past-end-of-array error: In C, the macro NULL as defined in various headers denotes a null pointer constant, nominally compatible with type void * but not necessarily of pointer type. In C++ NULL is an integral constant expression that evaluates to zero; it cannot be a pointer type. In C++11 the literal nullptr replaces NULL.

 

The end-of-string terminating character in C and C++ is the ASCII code NUL, which is character value '\0' or equivalently an integral value zero.

 

One is a pointer, the other is a character. You shouldn't mix those concepts: you'll just confuse yourself and whoever has to maintain your code. (If you use a compiler where NULL is equivalent to ((void*)0) and you enable all warnings (which you should do) you'd get complaints about assigning a pointer to a non-pointer object.)

 

In situations like the above, assign either '\0' or simply 0 to a char that marks the end of the string.

 

(When documenting an array like a that holds a NUL-terminated array of characters, you might refer to its content as "ASCIIZ", denoting a zero-terminated ASCII encoded string, as opposed to the length-plus-characters encoding used in some other languages. There are cases where the object would intentionally exclude the terminating zero, but that's a topic for another day.)


Or zero-terminated string, or C-string.

Strictly speaking, the content could be legal UTF-8 encoded, which would be identical when using only characters with number 127 or lower. Of course, when using UTF-8 strlen would not have the correct behaviour, nor would most terminals.
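
To illustrate the strlen point, here is a rough sketch of counting code points rather than bytes. The function name is hypothetical and it assumes the input is valid UTF-8. Note that it counts code points, which is still not always the same as what a user would call a character.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For pure ASCII text this gives the same result as strlen; for the two-byte sequence C3 A9 strlen reports 2 while this reports 1.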


When documenting an array like a that holds a NUL-terminated array of characters, you might refer to its content as "ASCIIZ", denoting a zero-terminated ASCII encoded string, as opposed to the length-plus-characters encoding used in some other languages.

Or zero-terminated string, or C-string.

These are alternatives to ASCIIZ, but they're imprecise (does "string" mean text?) or unnecessarily language-specific ("string" means text in C). If you see them, you can guess what they mean; I recommend you do not use them, though, unless you intend to be language-specific.

 

Strictly speaking, the content could be legal UTF-8 encoded, which would be identical when using only characters with number 127 or lower. Of course, when using UTF-8 strlen would not have the correct behaviour, nor would most terminals.

But that's not ASCIIZ: if it's UTF-8 encoded the content is not characters, but octets. That should be documented as "null(U+0000)-terminated UTF-8 encoded text", and the underlying type of the array would ideally be uint8_t (though it might need to be char because of the API for Unicode support libraries). That C provides three types char, signed char, and unsigned char that conflate character and small integer data was a design flaw, especially that char is distinct from but functionally identical to an implementation-defined selection of one of the other two types. That char may be larger than 8 bits is also...inconvenient. (Taking up wchar_t and other language-supported character types is out of scope here.)

 

Python 2 continued C's error by using one type str to hold both binary data and non-Unicode text, and it caused enough trouble that Python 3 made all text Unicode and data a distinct type bytes.

 

When you're dealing with both text and encodings of text, it's important to be very clear what any particular data object holds. This is a case where something like Apps Hungarian may be appropriate. In PyXB unit tests I use (for example) xmlt for local variables holding XML in text format and xmld for variables holding XML encoded in UTF-8 or another data format. In early versions of PyXB I considered them the same because most XML schema used English to assist in data interchange, so the underlying bits were identical. I paid for that a couple years later when somebody wanted to use it for Japanese, which supports multiple encodings where the representation bits are very different but the full Unicode text content is the same.


Hi all,

 

Well, my programming skills are not as good as most, but could you not use an itoa function to convert a char to an integer by using a different base value?  The itoa code I usually use, by Lukas Chmela, is linked below.

 

So I guess this is more of a question than an answer, but by changing the int base value, would this not allow char to int conversion?

 

http://www.jb.man.ac.uk/~slowe/cpp/itoa.html


You're right, UTF-8 should be stored in a uint8_t array. I do disagree on storing xml text being different from storing xml UTF-8 encoded. There is an underlying assumption that text is encoded in ASCIIZ, which is not the case by default. Especially in modern systems the default text encoding is UTF-8.

If you built PyXB with English as your single language in mind, then your text is most certainly fully contained in the ASCII character set, whose printable characters are Unicode code points 32 to 126. Since those are equal, the only thing you'd need to change when supporting Japanese is to replace your incorrect string length function (which counts chars/bytes/octets/uint8_ts) with one that counts actual characters, as well as bounded string copy functions etc.

 

@Antscran: No, it wouldn't. itoa converts an integer into a human-readable ASCIIZ text string in the given base. The topic starter wished to do the opposite.
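
For what it's worth, the two directions look roughly like this in standard C. This is only a sketch with made-up wrapper names: strtol goes from text to integer and takes a base, while snprintf goes from integer to text, like itoa but limited to decimal here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Text to integer: parse `text` as a number written in `base`. */
long text_to_long(const char *text, int base)
{
    return strtol(text, NULL, base);
}

/* Integer to text (the itoa direction), decimal only, using snprintf.
   `out` must have room for the digits plus the terminating '\0'. */
void long_to_text(long value, char *out, size_t out_size)
{
    snprintf(out, out_size, "%ld", value);
}
```

So text_to_long("30", 10) gives 30 and text_to_long("1A", 16) gives 26, whereas long_to_text(301, buf, sizeof buf) fills buf with the characters '3', '0', '1' and a terminator.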


I'm gonna have to object, mostly because your statements on XML and programming for Unicode are unclear in a way that obscures my main point. So I'm going to expand on that main point and try to clarify XML along the way.

 

There is a strict conceptual difference between text and representations of text by encoding schemes such as ASCII and Unicode, just as there's a difference between integers and representations of integers as two's complement, one's complement, or sign-magnitude. I'm trying to express two points: first, be aware of that difference; and second, take into account the possibility of alternative representations.

 

UTF-8 only uses single-byte encodings for characters that are in the ASCII character set (U+0000 through U+007F). The representation of character U+0080 (integer value 0x80 = 128) in UTF-8 is a two-byte sequence hex "C2 80".
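
As a sketch of that rule, here is an encoder that covers only the one-byte and two-byte cases (code points below U+0800). The function name is hypothetical, and the three- and four-byte forms are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below U+0800 as UTF-8. Writes 1 or 2 bytes to out
   and returns the byte count, or 0 if the code point is outside the
   range this sketch handles. */
size_t utf8_encode_small(uint32_t cp, uint8_t out[2])
{
    if (cp < 0x80) {              /* ASCII: one byte, 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {             /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}
```

Calling it with 0x80 produces the bytes C2 80, while 0x41 ('A') stays the single byte 41.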

 

The only time ASCIIZ and UTF-8 representations are equivalent is when the encoded text is ASCII and the UTF-8 indicates text length by a terminating null. While any standard ASCII C string is bitwise equivalent to its null-terminated UTF-8 encoded representation, many UTF-8 encoded strings cannot be expressed as ASCII strings.

 

This means that "the content of a is ASCIIZ" tells you something very different about how a can be used than what you're told by "the content of a is null-terminated UTF-8". My original point: clearly document your data objects so the reader knows what they contain.

 

XML (which I selected only as an example) by definition uses characters from Unicode. If you're using a language that has a Unicode text data type (unicode in Python 2, str in Python 3, std::wstring in C++ 11) then you will operate on XML text as Unicode characters, using length, copy, catenation, and other functions that manipulate Unicode data and that are distinct from the corresponding narrow-character C functions. You would not operate on them as text in their encoded form with those narrow-character C functions.

 

Encoding comes into play when you need to transfer the text to another system (via storing in a file or sending it over a network). Then you can encode it as UTF-8, UTF-16, UTF-32, shift_jis, or whatever. This representation is not text: it is a sequence of integral values representing code points. If the values are not 8-bit then you also need to know byte ordering before you can treat it as a sequence of octets. For XML, absence of an encoding declaration requires that the content be UTF-8 or UTF-16. (This may be what you meant by "disagree on storing xml text being different from storing xml UTF-8 encoded...the default text encoding is UTF-8".)

 

My whole point was: In early PyXB I mistakenly assumed text and data were identical, because by chance everything I encountered was in the ASCII subset of Unicode. This was an error, demonstrated when somebody used PyXB for (non-romaji) Japanese; I certainly never intended PyXB to only support English. After a fair amount of gratuitous rework, PyXB is very robust for languages where text can't be represented in ASCII. The lesson: Unless you know absolutely that you're dealing only with ASCII text in C, keep in mind from the beginning that text and the representation of text as data are two distinct things.

 

I don't know what Windows does, but in modern POSIX systems like Linux the default text encoding is specified by the "C" or "POSIX" locale which uses the POSIX portable character set which is a subset of ASCII. UTF-8 is what POSIX calls a "state-dependent encoding", and is not the default. On my Linux systems I have to specifically override the environment variable LANG to en_US.UTF-8 to enable UTF-8 encoding to see non-ASCII content.


Encoding comes into play when you need to transfer the text to another system (via storing in a file or sending it over a network). Then you can encode it as UTF-8, UTF-16, UTF-32, shift_jis, or whatever. This representation is not text: it is a sequence of integral values representing code points. If the values are not 8-bit then you also need to know byte ordering before you can treat it as a sequence of octets. For XML, absence of an encoding declaration requires that the content be UTF-8 or UTF-16. (This may be what you meant by "disagree on storing xml text being different from storing xml UTF-8 encoded...the default text encoding is UTF-8".)

I think I did indeed misunderstand you.

Though I think encoding comes into play whenever you do anything with text. So displaying a line of text on a screen implies that the "screen" can decode whichever encoding you're sending to it. There is no possible way to have anything like unencoded text on a computer, it is either encoded as ASCIIZ, length+ASCII, NUL-terminated UTF-8, etc.

So I understand that in earlier versions of PyXB you made the (implied?) decision that your encoding was ASCIIZ, while in fact this was a bypass to use text that was actually UTF-8, am I correct?

Then what did you mean when you said you were storing text in xmlt variables, while the data stored in xmld was encoded? Because as far as I can see, both would in fact be UTF-8 encoded text.
