Convert UTF8 strings to ASCII and back (without alien libs)


I decided to take another look at handling UTF-8 strings on AmigaOS4. Utility library has had functions for converting strings between UTF-8 and UCS-4 for a while now, and I always wondered what this type of conversion is good for. Then I realized that UCS-4 is compatible with 8-bit ASCII (ISO-8859-15 on my system). The point is that UTF-8 is variable size (one to four bytes per character) while UCS-4 is fixed size (four bytes per character). UTF-8 seems to be compatible with 7-bit ASCII only, whereas UCS-4 seems to be compatible with 8-bit ASCII as well, so an 8-bit string can be converted by copying character values between a char buffer and a UCS-4 buffer.
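To make the size difference concrete, here is a minimal sketch in plain, portable C. It is not AmigaOS-specific and has nothing to do with how utility.library works internally; the function name utf8_encode is hypothetical and only used for illustration. It shows how the number of UTF-8 bytes depends on the code point value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns the byte count,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                 /* 7-bit ASCII: one byte, value unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* includes the 8-bit range 0x80-0xFF */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, 'a' (U+0061) takes one byte, 'ä' (U+00E4) takes two (0xC3 0xA4) and the euro sign (U+20AC) takes three (0xE2 0x82 0xAC), whereas each of them takes exactly four bytes in UCS-4.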

STRPTR ConvertToASCII(CONST_STRPTR inbuffer, uint32 inbuff_size)
{
 STRPTR ascii_buffer=NULL;
 uint32 r=0;
 
 uint32 ucs_size=(inbuff_size+1)*sizeof(int32); // UCS4 buffer holds one int32 per char // +1 is for the null terminator, which is a full int32 too
 
 int32 *ucs_buffer=(int32 *)IExec->AllocVecTags(ucs_size,TAG_DONE);
 if (ucs_buffer!=NULL)
 {
  int32 conv_chars_cnt=IUtility->UTF8toUCS4(inbuffer,ucs_buffer,ucs_size,UTF_INVALID_SUBST_FFFD);
 
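  ascii_buffer=(STRPTR)IExec->AllocVecTags(conv_chars_cnt+1,TAG_DONE); // assign the outer variable; redeclaring it here would shadow it and the function would always return NULL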
  if (ascii_buffer!=NULL)
  {
   // Copy UCS4 chars to ASCII buffer (only code points up to 0xFF survive the cast to char)
   for (r=0; r<conv_chars_cnt; r++) ascii_buffer[r]=(char)ucs_buffer[r];
   ascii_buffer[conv_chars_cnt]=0;
  }
 
  IExec->FreeVec(ucs_buffer);
 }
 
 return ascii_buffer;
}
 
// It works the other way around as well
 
STRPTR ConvertToUTF8(CONST_STRPTR inbuffer, uint32 inbuff_size)
{
 STRPTR utf8_buffer=NULL;
 uint32 r=0;
 
 uint32 ucs_size=(inbuff_size+1)*sizeof(int32); // UCS4 buffer holds one int32 per char // +1 is for the null terminator, which is a full int32 too
 
 int32 *ucs_buffer=(int32 *)IExec->AllocVecTags(ucs_size,TAG_DONE);
 if (ucs_buffer!=NULL)
 {
  // Copy ASCII chars to UCS4 buffer (go through uint8 so chars above 0x7F never turn negative)
  for (r=0; r<inbuff_size; r++) ucs_buffer[r]=(int32)(uint8)inbuffer[r];
  ucs_buffer[inbuff_size]=0;
 
  uint32 utf8_buff_size=IUtility->UCS4Count(ucs_buffer,FALSE)*4+1; // UTF8 char size can be 1-4 bytes so reserve room for four bytes per char // +1 is for null terminator
  utf8_buffer=(STRPTR)IExec->AllocVecTags(utf8_buff_size,TAG_DONE);
  if (utf8_buffer!=NULL)
  {
   int32 conv_chars_cnt=IUtility->UCS4toUTF8(ucs_buffer,utf8_buffer,utf8_buff_size,UTF_INVALID_SUBST_FFFD); // pass the destination buffer size, as in the UTF8toUCS4() call
   // You can save utf8_buffer to a file or do whatever you like
  }
 
  IExec->FreeVec(ucs_buffer);
 }
 
 return utf8_buffer;
}
 
// The IUtility->UTF8Count() and IUtility->UCS4Count() functions contain a validator that checks the validity of the strings.
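As an illustration of what validating a UTF-8 string involves, here is a rough sketch in portable C. I have not seen how UTF8Count() is implemented, so this is not its algorithm; utf8_count_valid is a hypothetical name, and the rules it checks (lead and continuation bytes, truncation, overlong forms, surrogates, the U+10FFFF limit) are simply the general UTF-8 well-formedness rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the number of code points in the first len bytes of s,
   or -1 if those bytes are not well-formed UTF-8. */
long utf8_count_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    long count = 0;

    assert(s != NULL);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra, k;
        uint32_t cp, min;

        /* The lead byte tells how many continuation bytes follow. */
        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return -1;              /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1) return -1;              /* sequence is cut off */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return -1;    /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min) return -1;                         /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;     /* surrogate */
        if (cp > 0x10FFFF) return -1;                    /* outside Unicode */

        i += extra + 1;
        count++;
    }
    return count;
}
```

Note that a raw 8-bit byte such as 0xA4 or 0xE4 on its own fails this check, so an 8-bit string containing such characters is not valid UTF-8 input.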

I tested this with a-z, Scandinavian letters and a couple of accented letters.

Blog post type:

Tutorial

Comments

Submitted by OldFart on Sun, 2024-02-25 14:36

Hi,

Whilst looking at your code, which btw is nice and clear, I came across a point of concern, due to a possibly shadowing declaration,
- Line 13 contains a shadowing declaration of line 3

Regards,
OldFart

Submitted by drule on Thu, 2026-02-26 22:38

Interesting effort, but while your program will work, I suspect it doesn't work the way you think it does. Unicode is directly and deliberately compatible with ISO-8859-1. UTF-8, like its 16 and 32 bit counterparts, will represent all code points in the UCS code space, and not just "compatible with 7-bit ASCII" (although, technically, you're kinda right). You're confusing your code sets in your program.

You state you're using iso-8859-15, which I don't doubt. You don't (can't) specify anywhere in your calls to the conversion routines that this is the case. I'm guessing the string you pass on line 11 to UTF8toUCS4() is your raw 8-bit character array? For example, for character 0xA4 (EURO SIGN) in 8859-15 you SHOULD receive U+20AC as UCS4 which you can't simply cast to an 8-bit char. In these cases you're disregarding what UTF-8 is.

Your environment may be configured to use 8859-15, which handles 0xA4 as EURO SIGN, but if you're merely passing the raw 8-bit char of that character into a UTF-8 function then you're trying to convert 8859-1 (CURRENCY SIGN) instead, and that probably won't work either, as a lone 0xA4 byte is not a valid UTF-8 sequence. To properly use the UTF-8 functions, you first need to convert from your 8-bit character to UTF-8, and in UTF-8 the EURO SIGN is [0xE2 0x82 0xAC]. The several other characters where 8859-15 differs from 8859-1 need the same kind of remapping, although they, like all the other chars above 0x7F, end up as a shorter two-byte sequence.

Unless the utility lib functions do this for you under the covers for whatever your locale/charset is set to (I can't test right now and I don't think they do), I think you would be garbling the characters if you were to use this kind of approach in any cross platform communications.
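The explicit conversion described above can be sketched in portable C as follows. This is only an illustration under stated assumptions: latin9_to_codepoint and latin9_to_utf8 are hypothetical names, nothing here comes from utility.library, and the table covers just the eight positions where 8859-15 differs from 8859-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map one ISO-8859-15 byte to its Unicode code point. Only eight
   positions differ from ISO-8859-1; every other byte has the same
   numeric value as its code point. */
uint32_t latin9_to_codepoint(unsigned char c)
{
    switch (c) {
    case 0xA4: return 0x20AC; /* EURO SIGN */
    case 0xA6: return 0x0160; /* S with caron */
    case 0xA8: return 0x0161; /* s with caron */
    case 0xB4: return 0x017D; /* Z with caron */
    case 0xB8: return 0x017E; /* z with caron */
    case 0xBC: return 0x0152; /* ligature OE */
    case 0xBD: return 0x0153; /* ligature oe */
    case 0xBE: return 0x0178; /* Y with diaeresis */
    default:   return c;
    }
}

/* Convert in_len bytes of ISO-8859-15 text to a NUL-terminated UTF-8
   string. Returns the number of bytes written (terminator excluded),
   or (size_t)-1 if out_size is too small. Three bytes per input char
   plus one for the terminator is always enough. */
size_t latin9_to_utf8(const unsigned char *in, size_t in_len,
                      char *out, size_t out_size)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; i < in_len; i++) {
        uint32_t cp = latin9_to_codepoint(in[i]);
        size_t need = (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : 3;

        if (n + need + 1 > out_size) return (size_t)-1;  /* keep room for NUL */

        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (n + 1 > out_size) return (size_t)-1;   /* covers the empty-input case */
    out[n] = '\0';
    return n;
}
```

For the 8859-15 input bytes 0x61 0xA4 0xE4 this produces 0x61, then 0xE2 0x82 0xAC for the euro sign, then 0xC3 0xA4, which is the form the UTF-8 functions expect as input.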

SignatureNotFoundException at line 1.
