utf-happy: Project Web Hosting - Open Source Software

utf-happy

Users

Download utf-happy files

Project details and discussion

Look at the source code (SVN)

Get support

Project Information

This is the utf-happy project ("utf-happy")

Developers

Join this project:

To join this project, please contact the project administrators, as shown on the project summary page.

Get the source code:

Source code for this project may be available as downloads or through the Subversion SCM repository used by the project, as accessible from the project summary page.

About this project:

This is a library for UTF-32 to UTF-8 conversion. It is unusual
because it tries to use special CPU instructions, such as Intel's
SSE2 (Streaming SIMD Extensions 2), to speed up processing.
That might seem kind of silly when a modern PC can process millions of characters
in less than a second with ordinary C code, but... oh well. It is an interesting puzzle.

Install:

Please note this is an alpha release: the API is not decided on yet, and there are many bugs. To try it anyway:
Click 'download files', above.
Download the newest .tgz file, then run the following in a shell
(cd to wherever it downloaded; Desktop is assumed here):
cd ~/Desktop
tar -zxvf utf-happy.*.tgz
cd utf-happy
make

The build also requires cmake, nasm, and tcc.
On Ubuntu you can install them like so:
sudo apt-get install cmake nasm tcc

Usage:

Unknown; the API is not even decided on. I would like it to be usable something like this:


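#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int main(void) {
    FILE *f = fopen("utf32data.txt", "rb");
    if (!f) return 1;
    uint32_t buf[1024] = {0};             /* zero-filled; this sketch assumes 0-terminated input */
    fread(buf, sizeof buf[0], 1023, f);   /* read at most 1023 chars, keeping the terminator */
    fclose(f);
    uint8_t *newbuf = utf_happy_utf32_to_utf8(buf);   /* proposed API, no header exists yet */
    // do stuff with utf8
    free(newbuf);
    printf("%s", utf_happy_get_info());
}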

How it works:

Basically, UTF-32 to UTF-8 conversion is a bunch of ANDs, ORs, and shifts on binary data. A UTF-32 character is just a 32-bit number. Its bits get shuffled around into a sequence of between one and four 8-bit numbers.
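To make the bit shuffling concrete, here is a minimal plain C sketch of encoding a single character. The function name encode_utf8 and its signature are made up for illustration and are not part of utf-happy; the bit layout follows the standard UTF-8 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf-happy: encodes one code point
 * into out (which must have room for 4 bytes). Returns the number of
 * bytes written, or 0 if cp is a surrogate or above U+10FFFF. */
static size_t encode_utf8(uint32_t cp, uint8_t *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* out of Unicode range */
}
```

Each branch is just a range check followed by shifts, masks, and ORs, which is why the conversion looks like a candidate for SIMD in the first place.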

There is a good manpage on this; you can type the following to read it:

man utf-8

Now, with the special SIMD instructions that are in so many CPUs nowadays, you can theoretically make things go faster. The SSE2 instructions, for example, use 128-bit registers. That is enough to hold four UTF-32 characters at one time. If you could do the UTF-32 to UTF-8 conversion on four characters at once, it might give a little speedup. That's the theory here.
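As a rough illustration of the four-at-a-time idea, the sketch below handles only the easiest case, a run of ASCII characters, using SSE2 intrinsics from C. It is not utf-happy's actual code (which is written in assembler); the function name convert_ascii_groups is made up, and a plain C fallback is included for compilers that do not define __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not utf-happy's code: copies leading ASCII code
 * points from in to out in groups of four, stopping at the first group
 * that contains anything above U+007F. Returns the number of code
 * points (and therefore bytes) written. */
static size_t convert_ascii_groups(const uint32_t *in, size_t n, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i non_ascii = _mm_set1_epi32(-128);   /* 0xFFFFFF80 in each lane */
    while (i + 4 <= n) {
        /* Load four UTF-32 characters into one 128-bit register. */
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        /* A lane is ASCII only if every bit above bit 6 is clear. */
        __m128i high = _mm_and_si128(v, non_ascii);
        __m128i ok = _mm_cmpeq_epi32(high, _mm_setzero_si128());
        if (_mm_movemask_epi8(ok) != 0xFFFF)
            break;
        /* Narrow 32 -> 16 -> 8 bits; values are <= 0x7F, so the
         * saturating packs leave them unchanged. */
        __m128i w = _mm_packs_epi32(v, v);
        __m128i b = _mm_packus_epi16(w, w);
        int32_t four = _mm_cvtsi128_si32(b);          /* low 4 bytes */
        memcpy(out + i, &four, sizeof four);
        i += 4;
    }
#else
    /* Plain C fallback with the same behavior. */
    while (i + 4 <= n) {
        if ((in[i] | in[i + 1] | in[i + 2] | in[i + 3]) & 0xFFFFFF80u)
            break;
        for (int k = 0; k < 4; k++)
            out[i + k] = (uint8_t)in[i + k];
        i += 4;
    }
#endif
    return i;
}
```

A caller would presumably fall back to ordinary one-character-at-a-time code for whatever this leaves unconverted. The multi-byte cases are harder because each input character can produce a different number of output bytes.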

Does it work?

Preliminary testing, not rigorous:

SSE2 assembler code vs. plain C code compiled with -O2
input: 1024*1024*2 (2MB) of UTF-32 chars

random byte-length (1 to 4 mixed): ~15% faster
all 1-byte utf-8 (ascii): ~150% (2.5x) faster
all 2-byte utf-8: ~100% (2x) faster 
all 3-byte utf-8: ~50% faster
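
For anyone wanting to repeat this kind of comparison, here is a minimal timing sketch. It is not the harness used for the numbers above; the convert_fn signature, time_converter, percent_faster, and the narrow_ascii stand-in are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical converter signature, chosen only for this sketch. */
typedef size_t (*convert_fn)(const uint32_t *in, size_t n, uint8_t *out);

/* Stand-in converter used only to exercise the harness: narrows code
 * points that are assumed to be ASCII. */
static size_t narrow_ascii(const uint32_t *in, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)in[i];
    return n;
}

/* Runs fn reps times over the same input and returns the average CPU
 * time per run in seconds, as measured by clock(). */
static double time_converter(convert_fn fn, const uint32_t *in, size_t n,
                             uint8_t *out, int reps)
{
    assert(fn != NULL && reps > 0);
    volatile size_t sink = 0;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += fn(in, n, out);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}

/* "X% faster" in the sense used above: 150% faster means 2.5x the speed. */
static double percent_faster(double baseline_s, double candidate_s)
{
    return (baseline_s / candidate_s - 1.0) * 100.0;
}
```

Timing each converter with time_converter and passing the two results to percent_faster gives a figure in the same form as the list above. clock() has coarse resolution, so a large input and several repetitions are needed for the result to mean anything.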

Project Web Hosted by SourceForge.net

©Copyright 1999-2009 - Geeknet, Inc., All Rights Reserved
