This is the utf-happy project ("utf-happy")
This is a library to do utf32 to utf8 conversion. It is unusual
because it tries to use special CPU instructions, like Intel's
SSE2 Streaming SIMD whatchamacallits and so forth, to speed up processing.
That might seem kind of silly when a modern PC can process millions of characters
in less than a second with ordinary C code but... oh well. It is an interesting puzzle.
cd to where it downloaded (I'm assuming Desktop), then unpack and build:

    cd ~/Desktop
    tar -zxvf utf-happy.*.tgz
    cd utf-happy
    make
sudo apt-get install cmake nasm tcc
Unknown. The API is not even decided on. I would like to have it usable something like this:
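    FILE *f = fopen("utf32data.txt", "rb");
    uint32_t buf[1024];                  /* buffer size picked arbitrarily for this sketch */
    fread(buf, sizeof buf[0], 1024, f);
    uint8_t *newbuf = utf_happy_utf32_to_utf8(buf);  /* how the length gets passed is undecided */
    // do stuff with utf8
    free(newbuf);
    printf("%s", utf_happy_get_info());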
Basically, utf32->utf8 ... is a bunch of ANDs and ORs and SHIFTs of binary data. The 32 bit UTF-32 character is just a 32 bit number. The bits get shuffled around into a sequence of between one and four 8-bit numbers.
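To make the bit shuffling concrete, here is a minimal plain C sketch that converts a single UTF-32 character. The function name utf32_char_to_utf8 is made up for this example and is not part of the (still undecided) utf-happy API. It also rejects surrogates and values above 0x10FFFF, which is an extra assumption on top of the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf-happy: encode one UTF-32 code point
 * as UTF-8. Writes 1 to 4 bytes into out and returns how many were written,
 * or 0 if cp is a surrogate or lies above 0x10FFFF. */
size_t utf32_char_to_utf8(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 7 bits:  0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* out of Unicode range */
}
```

Each branch is just a shift to pick out a group of bits, an AND to mask it down to 6 bits, and an OR to stamp on the marker bits of the lead or continuation byte.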
There is a good manpage on this, you can type the following to read it:
Now, with the special SIMD instructions that are in so many CPUs nowadays, you can theoretically make things go faster. The SSE2 instructions, for example, use 128-bit registers. That is enough to hold 4 UTF-32 characters at one time. If you could do the UTF-32->UTF-8 conversion on 4 characters at once, it might be a little speedup. That's the theory here.
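As a rough illustration of the "4 characters at once" idea, here is a sketch of only the easiest case: four UTF-32 characters that are all ASCII, squeezed into four UTF-8 bytes with one 128-bit load. It uses C compiler intrinsics from emmintrin.h instead of hand-written assembler, so it is not the project's actual code. The name pack4_if_ascii and the UTF_SKETCH_SSE2 macro are made up for this example, and there is a plain C fallback for machines without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define UTF_SKETCH_SSE2 1
#else
#define UTF_SKETCH_SSE2 0
#endif

/* Hypothetical helper, not part of utf-happy: if all four code points are
 * ASCII (< 0x80), write them as four UTF-8 bytes and return 1.
 * Otherwise leave out untouched and return 0. */
int pack4_if_ascii(const uint32_t in[4], uint8_t out[4])
{
    assert(in != NULL && out != NULL);
#if UTF_SKETCH_SSE2
    /* One unaligned 128-bit load holds all four UTF-32 characters. */
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Keep only the bits above the low 7; they must be zero for ASCII. */
    __m128i high = _mm_and_si128(v, _mm_set1_epi32(~0x7F));
    __m128i ok = _mm_cmpeq_epi32(high, _mm_setzero_si128());
    if (_mm_movemask_epi8(ok) != 0xFFFF)
        return 0;                         /* at least one lane is not ASCII */

    /* Narrow 32 -> 16 -> 8 bits; values below 0x80 survive the saturation. */
    __m128i w16 = _mm_packs_epi32(v, v);
    __m128i w8 = _mm_packus_epi16(w16, w16);
    int32_t low = _mm_cvtsi128_si32(w8);  /* low 4 bytes = the 4 characters */
    memcpy(out, &low, 4);
    return 1;
#else
    /* Plain C fallback with the same behavior. */
    if ((in[0] | in[1] | in[2] | in[3]) & ~(uint32_t)0x7F)
        return 0;
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)in[i];
    return 1;
#endif
}
```

A full converter would call something like this in a loop and drop back to ordinary one-character-at-a-time code whenever it returns 0. The 2, 3 and 4 byte cases need more shuffling than this, because the output length changes from lane to lane.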
Preliminary testing, not rigorous: SSE2 assembler code vs. -O2 plain C code.
Input: 1024*1024*2 (2MB) of utf-32 chars.

    random byte-length (1 to 4 mixed):  ~15% faster
    all 1-byte utf-8 (ascii):           ~150% (2.5x) faster
    all 2-byte utf-8:                   ~100% (2x) faster
    all 3-byte utf-8:                   ~50% faster