Monthly Archives: March 2014

UTF-8 and standard C library

Take a look at this:

binf@home:~/codesnippets/C$ cat utf8.c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* strlen() counts bytes, not characters */
    printf("%s : %i\n", "a", (int)strlen("a"));
    printf("%s : %i\n", "æ", (int)strlen("æ"));
    return 0;
}

And compile:

binf@home:~/codesnippets/C$ gcc -o utf8 utf8.c

Output:

binf@home:~/codesnippets/C$ ./utf8
a : 1
æ : 2

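Yes, that is UTF-8 at work. strlen() counts bytes, not characters, and in UTF-8 the letter æ (U+00E6) is encoded as two bytes. If you are used to working in an ISO-8859-1 environment, where every character is exactly one byte, keep in mind that many C library functions and system calls were designed with single-byte encodings such as ASCII in mind.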
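If you want the number of characters rather than bytes, one common approach is to count every byte that is not a UTF-8 continuation byte (a byte of the form 10xxxxxx). The following is only a minimal sketch: the name utf8_length is made up for this post, and the function assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count the code points in a UTF-8 string.
 * utf8_length is a hypothetical helper, not part of the C library.
 * Assumption: s points to valid, NUL-terminated UTF-8. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes look like 10xxxxxx; every other byte
         * starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With this helper, utf8_length("æ") gives 1 where strlen("æ") gives 2. Note that it counts code points, so a letter followed by a combining accent still counts as two.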

Here is an example that gets straight to the point. On Linux, the dirent structure is defined as follows:

struct dirent {
    ino_t          d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[256]; /* filename */
};

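In a UTF-8 environment, with its variable-width encoding, each character uses one to four of the 256 bytes the system reserves for a file name (and one of those bytes is needed for the terminating null). With characters like the æ in my example, which take two bytes each, you are limited to about half of that: a file name of roughly 127 characters.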
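A practical consequence is that a name which is too long has to be shortened by bytes, without cutting a multi-byte character in half. The sketch below shows one way to find a safe cut-off point. The name utf8_prefix_bytes is hypothetical, and the function again assumes valid UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/* Return the length in bytes of the longest prefix of s that fits into
 * max_bytes and does not end in the middle of a UTF-8 character.
 * utf8_prefix_bytes is a hypothetical helper name.
 * Assumption: s points to valid, NUL-terminated UTF-8. */
size_t utf8_prefix_bytes(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t end;

    if (len <= max_bytes)
        return len;             /* the whole string fits */

    end = max_bytes;
    /* If the first byte we would drop is a continuation byte (10xxxxxx),
     * the cut falls inside a character, so step back to its lead byte. */
    while (end > 0 && ((unsigned char)s[end] & 0xC0) == 0x80)
        end--;
    return end;
}
```

Called with a limit of 255 on a name made up only of æ characters, this returns 254, which is 127 complete characters.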

binf@home:~/codesnippets/C$ uname -rvm
3.8.0-35-generic #52~precise1-Ubuntu SMP Thu Jan 30 17:24:40 UTC 2014 x86_64