I decided to take another look at how to handle UTF-8 strings on AmigaOS4. Utility library has had functions for converting strings between UTF-8 and UCS4 for a little while, and I had always wondered what this type of conversion is good for. Then I realized that UCS4 is compatible with ASCII (ISO-8859-15 on my system). The point is that UTF-8 is variable-size while UCS4 is fixed-size. UTF-8 seems to be compatible with 7-bit ASCII only, whereas UCS4 seems to be compatible with 8-bit ASCII.
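As a rough illustration of the size difference, here is a minimal sketch in plain portable C. It does not use the utility.library calls, and `utf8_len` and `ucs4_len` are made-up helper names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for one code point.
   Returns 0 for values that are not valid Unicode scalar values. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;                       /* 7-bit ASCII */
    if (cp < 0x800)
        return 2;                       /* e.g. U+00E4 */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000)
        return 3;                       /* e.g. U+20AC */
    if (cp <= 0x10FFFF)
        return 4;
    return 0;                           /* beyond the Unicode range */
}

/* Hypothetical helper: UCS4 stores every code point in one 32-bit unit. */
size_t ucs4_len(uint32_t cp)
{
    (void)cp;                           /* the size does not depend on the value */
    return sizeof(uint32_t);
}
```

With these helpers, `utf8_len('a')` is 1 and `utf8_len(0x20AC)` is 3, while `ucs4_len()` is 4 for both.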
STRPTR ConvertToASCII(CONST_STRPTR inbuffer, uint32 inbuff_size)
{
    STRPTR ascii_buffer=NULL;
    uint32 r=0;
    uint32 ucs_size=(inbuff_size+1)*sizeof(int32); // UCS4 buffer should be int32*
    // +1 is for null terminator
    int32 *ucs_buffer=(int32 *)IExec->AllocVecTags(ucs_size,TAG_DONE);

    if (ucs_buffer!=NULL)
    {
        int32 conv_chars_cnt=IUtility->UTF8toUCS4(inbuffer,ucs_buffer,ucs_size,UTF_INVALID_SUBST_FFFD);

        STRPTR ascii_buffer=(STRPTR)IExec->AllocVecTags(conv_chars_cnt+1,TAG_DONE);
        if (ascii_buffer!=NULL)
        {
            // Copy UCS4 chars to ASCII buffer
            for (r=0; r<conv_chars_cnt; r++)
                ascii_buffer[r]=(char)ucs_buffer[r];
            ascii_buffer[conv_chars_cnt]=0;
        }
        IExec->FreeVec(ucs_buffer);
    }
    return ascii_buffer;
}

// It works the other way around as well
STRPTR ConvertToUTF8(CONST_STRPTR inbuffer, uint32 inbuff_size)
{
    STRPTR utf8_buffer=NULL;
    uint32 r=0;
    uint32 ucs_size=(inbuff_size+1)*sizeof(int32); // UCS4 buffer should be int32*
    // +1 is for null terminator
    int32 *ucs_buffer=(int32 *)IExec->AllocVecTags(ucs_size,TAG_DONE);

    if (ucs_buffer!=NULL)
    {
        // Copy ASCII chars to UCS4 buffer
        for (r=0; r<inbuff_size; r++)
            ucs_buffer[r]=(int32)inbuffer[r];
        ucs_buffer[inbuff_size]=0;

        int32 ucs_chars_cnt=IUtility->UCS4Count(ucs_buffer,FALSE);
        uint32 utf8_buff_size=ucs_chars_cnt*4+1;
        // UTF8 char size can be 1-4 bytes so reserve room for four bytes per char
        // +1 is for null terminator
        utf8_buffer=(STRPTR)IExec->AllocVecTags(utf8_buff_size,TAG_DONE);
        if (utf8_buffer!=NULL)
        {
            int32 conv_chars_cnt=IUtility->UCS4toUTF8(ucs_buffer,utf8_buffer,ucs_chars_cnt,UTF_INVALID_SUBST_FFFD);
            // You can save utf8_buffer to a file or do whatever you like
        }
        IExec->FreeVec(ucs_buffer);
    }
    return utf8_buffer;
}

// The IUtility->UTF8Count() and IUtility->UCS4Count() functions contain a validator that checks the validity of the strings.
I tested with a-z, Scandinavian letters and a couple of accented letters.
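For comparison, the 8-bit to UTF-8 direction can be sketched without the intermediate UCS4 buffer. This is portable C rather than AmigaOS code, `latin1_to_utf8` is a hypothetical name, and it assumes what the plain copy loop in ConvertToUTF8 effectively assumes: that each input byte value is the code point itself, which amounts to treating the input as ISO-8859-1.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sketch: widen each 8-bit character to a code point (i.e. read the input
   as ISO-8859-1) and write it out as UTF-8.
   Returns a malloc()ed, null-terminated string, or NULL on failure. */
char *latin1_to_utf8(const char *in, size_t in_len)
{
    /* Code points below 0x100 need at most 2 bytes each; +1 for terminator. */
    unsigned char *out = malloc(in_len * 2 + 1);
    size_t o = 0;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < in_len; i++) {
        /* Go through unsigned char so values above 0x7F are not sign-extended. */
        unsigned char c = (unsigned char)in[i];

        if (c < 0x80) {
            out[o++] = c;                                   /* ASCII: 1 byte */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    out[o] = '\0';
    return (char *)out;
}
```

For example, the two input bytes 0x61 0xE4 ("aä" in ISO-8859-1) come out as the three bytes 0x61 0xC3 0xA4. The caller releases the result with `free()`.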


Comments
Submitted by OldFart on
Hi,
Whilst looking at your code, which btw is nice and clear, I came across a point of concern, a possible shadowing declaration:
- Line 13 contains a shadowing declaration of line 3
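The effect can be reproduced in a few lines of portable C. This is only a sketch with made-up function names (`copy_shadowed`, `copy_fixed`), using `malloc()` in place of the Exec allocator:

```c
#include <stdlib.h>
#include <string.h>

/* Buggy: the inner declaration creates a second variable named buf,
   so the outer buf is never assigned and the allocation is leaked. */
char *copy_shadowed(const char *src)
{
    char *buf = NULL;

    if (src != NULL) {
        char *buf = malloc(strlen(src) + 1);    /* shadows the outer buf */
        if (buf != NULL)
            strcpy(buf, src);
    }
    return buf;                                 /* outer buf: always NULL */
}

/* Fixed: assign to the existing variable instead of redeclaring it. */
char *copy_fixed(const char *src)
{
    char *buf = NULL;

    if (src != NULL) {
        buf = malloc(strlen(src) + 1);
        if (buf != NULL)
            strcpy(buf, src);
    }
    return buf;
}
```

Here `copy_shadowed("abc")` returns NULL while `copy_fixed("abc")` returns a copy of the string. GCC reports this kind of redeclaration when compiling with `-Wshadow`.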
Regards,
OldFart
-
Submitted by drule on
Interesting effort, but while your program will work, I suspect it doesn't work the way you think it does. Unicode is directly and deliberately compatible with ISO-8859-1. UTF-8, like its 16 and 32 bit counterparts, will represent all code points in the UCS code space, and not just "compatible with 7-bit ASCII" (although, technically, you're kinda right). You're confusing your code sets in your program.
You state you're using iso-8859-15, which I don't doubt. You don't (can't) specify anywhere in your calls to the conversion routines that this is the case. I'm guessing the string you pass on line 11 to UTF8toUCS4() is your raw 8-bit character array? For example, for character 0xA4 (EURO SIGN) in 8859-15 you SHOULD receive U+20AC as UCS4 which you can't simply cast to an 8-bit char. In these cases you're disregarding what UTF-8 is.
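To illustrate the mapping in question, here is a small sketch in portable C. The function name `iso8859_15_to_ucs4` is invented for this example and is not part of utility.library. ISO-8859-15 differs from ISO-8859-1 in eight positions, and each of them maps to a code point above 0xFF:

```c
#include <stdint.h>

/* Sketch: map one ISO-8859-15 byte to its Unicode code point.
   Only eight positions differ from ISO-8859-1. */
uint32_t iso8859_15_to_ucs4(unsigned char c)
{
    switch (c) {
    case 0xA4: return 0x20AC;   /* EURO SIGN */
    case 0xA6: return 0x0160;   /* LATIN CAPITAL LETTER S WITH CARON */
    case 0xA8: return 0x0161;   /* LATIN SMALL LETTER S WITH CARON */
    case 0xB4: return 0x017D;   /* LATIN CAPITAL LETTER Z WITH CARON */
    case 0xB8: return 0x017E;   /* LATIN SMALL LETTER Z WITH CARON */
    case 0xBC: return 0x0152;   /* LATIN CAPITAL LIGATURE OE */
    case 0xBD: return 0x0153;   /* LATIN SMALL LIGATURE OE */
    case 0xBE: return 0x0178;   /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
    default:   return c;        /* same value as in ISO-8859-1 */
    }
}
```

Casting the result for 0xA4 back to an 8-bit char keeps only the low byte, 0xAC, which is no longer the euro sign.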
Your environment may be configured to use 8859-15, which handles 0xA4 as EURO SIGN, but if you're merely passing the raw 8-bit value of that character into a UTF-8 routine then you're trying to convert 8859-1 (CURRENCY SIGN) instead, and that probably won't work either, as it will probably be an invalid UTF-8 sequence. To use the UTF-8 functions properly, you first need to convert your 8-bit character to UTF-8; in UTF-8, the EURO SIGN is [0xE2 0x82 0xAC]. Where several other characters in 8859-15 differ from 8859-1, they will be similarly encoded in UTF-8, while the other encoded chars above 0x7F will be a shorter sequence.
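The byte sequences mentioned above can be checked with a small encoder. This is a sketch in portable C with hypothetical helper names (`utf8_encode`, `utf8_is_continuation`); it does not show how the utility.library functions are implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                               /* surrogates are invalid */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                   /* beyond the Unicode range */
}

/* A byte of the form 10xxxxxx is a continuation byte; it can never
   start a UTF-8 sequence on its own. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Encoding U+20AC yields 0xE2 0x82 0xAC, and U+00A4 (CURRENCY SIGN) yields 0xC2 0xA4. A lone raw byte 0xA4 has the continuation-byte pattern, which is why it is not valid UTF-8 by itself.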
Unless the utility lib functions do this for you under the covers for whatever your locale/charset is set to (I can't test right now and I don't think they do), I think you would be garbling the characters if you were to use this kind of approach in any cross platform communications.
SignatureNotFoundException at line 1.