-
Posted: December 24th, 2008, 12:56am CET
pa href="http://www.kfish.org/software/tractorgen/"Tractorgen/a is now on github:/p
ul
lia href="http://github.com/kfish/tractorgen/"http://github.com/kfish/tractorgen//a/li
/ul
h4REPOSITORIAL/h4
pre
The contents of this revision controlled document repository are a computer
source code implementation of TRACTORGEN, being a model of ASCII tractor
mechanics.
It is recommended that one study these documents closely in order to better
understand the finer details of the subject at hand. The authors firmly
believe that only through such preparation, preferably during the course of
one's daily study regimen, can a deeper appreciation of the theory be
attained.
As a side note, it has been noted by correspondents that it is possible to
derive a computer readable binary executable from these documents through
the use of sophisticated compiler technology. On the off chance that any
readers would wish to pursue this path, we include the apparent preparation
for doing so herein, as quoted:
$ automake -a
$ autoreconf
Upon completion of this procedure, which we expect should take on the
order of one to two weeks (of course the actual time depends on the
staffing resources of your local computer centre), a new document shall
be generated _as though from nought!_ [emphasis added]. The name of
this document is expected to be "configure", and it may itself be
executed thus:
$ ./configure
We recommend scheduling a vacation!
Upon your return, type "make", then "make install", and prepare your
experimental apparati forthwith:
$ tractorgen
Generates ASCII tractors.
/pre
h4Commit messages/h4
p
One must eschew the typically terse and perfunctory style of commit messages
that are common in software projects, and ensure that the purpose, significance,
and experimental procedure for each incremental change are appropriately
recorded.
/p
ul
liSubscribe to the a href="http://github.com/feeds/kfish/commits/tractorgen/master"tractorgen commit feed/a/li
/ul
p
Obviously, commit messages are a good place to store source code for important tools:
a href="http://github.com/kfish/tractorgen/commit/9112c05d755091231818aba8c3ce46524e1100a5"9112c05/a.
/p
pre
r-------
_|
/ |_______\_ \\
| |o|----\\
|_____________\_--_\\
(O)_O_O_O_O_O_(O) \\
/pre
-
Posted: December 23rd, 2008, 8:34am CET
pA new release of HOgg, on Hackage:/p
ul
lia href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/hogg-0.4.1"hogg-0.4.1/a/li
/ul
p
This contains updates to work with Hackage, the Haskell source package system; and also a new
tthogg man/tt subcommand to generate man pages for subcommands.
/p
h4Updated for Hackage/h4
p
a href="http://hackage.haskell.org/trac/hackage/"Hackage/a is Haskell's source packaging
system. It makes it very easy to keep up to date with bleeding-edge releases.
/pp
You'll need the ttcabal/tt
command. This is already in Gentoo (ttemerge cabal/tt) and Arch Linux (ttpacman -S cabal-install/tt).
If you're on a system where cabal is not already packaged, you'll first need to
a href="http://book.realworldhaskell.org/read/installing-ghc-and-haskell-libraries.html"install GHC/a
(eg. ttapt-get install ghc6/tt on Ubuntu 8.10 or Debian Lenny systems), then:
/p
pre
$ wget http://hackage.haskell.org/packages/archive/cabal-install/0.6.0/cabal-install-0.6.0.tar.gz
$ tar zxf cabal-install-0.6.0.tar.gz
$ cd cabal-install-0.6.0
$ chmod +x bootstrap.sh
$ ./bootstrap.sh
/pre
p
This will download and build the packages required to set up cabal. From there, a new Haskell
package like tthogg/tt can be installed by simply doing:
/p
pre
$ cabal update
$ cabal install hogg
/pre
p
This will build and install tthogg/tt into tt$HOME/.cabal/bin/tt (which of course you
should add to your $PATH if you actually want to use anything you install via cabal :-)
/p
h4man page output of self-documentation/h4
p
tthogg/tt already generated its own help text, with runtime
a href="http://blog.kfish.org/2008/03/release-hogg-040.html"checking of example syntax/a.
This release adds a tthogg man/tt subcommand which generates the same help text in
Unix man page format:
/p
pre
$ hogg man man
.TH HOGG 1 "December 2008" "hogg" "Annodex"
.SH SYNOPSIS
.B hogg
.RI man
[
.I OPTIONS
]
.SH DESCRIPTION
Generate Unix man page for a specific subcommand (eg. "hogg man chop")
.SH OPTIONS
-h, -? --help Display this help and exit
-V --version Output version information and exit
.SH EXAMPLES
.PP
Generate a man page for the "hogg chop" subcommand:
.PP
.RS
\f(CWhogg man chop\fP
.RE
.SH AUTHORS
hogg was written by Conrad Parker
This manual page was autogenerated by
.B hogg man man.
Please report bugs to <ogg-dev@xiph.org>
/pre
-
Posted: July 4th, 2008, 12:03pm CEST
a href="http://lists.xiph.org/pipermail/ogg-dev/2008-July/001082.html"liboggz 0.9.8/a
includes the first release of ttoggz-chop/tt, as well as support for the new karaoke
codec a href="http://wiki.xiph.org/index.php/OggKate"OggKate/a.
p
ttoggz-chop/tt can be used to serve time ranges of Ogg media
over HTTP by any web server that supports CGI. The oggz-chop binary simply checks if it
is being run as a CGI script by checking some environment variables, and if so acts
based on the CGI query parameter ttt=/tt, much like ttmod_annodex/tt.
It accepts all the time specifications that
ttmod_annodex/tt accepts (ttnpt/tt and various ttsmpte/tt framerates),
and start and end times separated by a /.
/p
p
All you need to do is set up the following Apache config:
/p
blockquotett
ScriptAlias /oggz-chop /usr/bin/oggz-chop
Action application/ogg /oggz-chop
/tt/blockquote
p
With that in place, all your Ogg files will be handled with ttoggz-chop/tt, which means that you can
put a time range on the end, like:
blockquote
tthttp://www.example.com/candidate_speech.ogv?t=00:23/00:26/tt
/blockquote
p
The minimal amount of data required to play the section between 23 and 26 seconds will
be sent to you, such that it plays back immediately from the time requested.
As for caching, it generates ttLast-Modified/tt HTTP headers, and responds correctly to
ttIf-Modified-Since/tt conditional GET requests.
/p
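As a rough illustration of the ttt=/tt handling described above, here is a minimal C sketch that parses a value such as tt00:23/00:26/tt into start and end times in seconds. It is not taken from oggz-chop: the function names are hypothetical, it only understands bare ttnpt/tt-style clock values of the form tt[[hh:]mm:]ss[.frac]/tt, it treats the end time as optional (an assumption), and it does not handle ttsmpte/tt specifications.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse "[[hh:]mm:]ss[.frac]" into seconds.
 * Returns 0 on success, -1 on error.
 * Hypothetical helper for illustration; not oggz-chop's parser. */
static int parse_npt(const char *s, double *seconds)
{
    double total = 0.0;
    int fields = 0;

    for (;;) {
        char *end;
        double v;

        /* Each field must start with a digit (rejects "", "-1", "abc"). */
        if (!isdigit((unsigned char)*s)) return -1;
        v = strtod(s, &end);
        total = total * 60.0 + v;
        if (++fields > 3) return -1;   /* at most hh:mm:ss */
        if (*end == '\0') break;       /* consumed the whole string */
        if (*end != ':') return -1;    /* only ':' may separate fields */
        s = end + 1;
    }
    *seconds = total;
    return 0;
}

/* Parse the value of a "t=" query parameter of the form "start/end".
 * The end is treated as optional here; *end is set to -1.0 when absent.
 * Returns 0 on success, -1 on a malformed or inverted range. */
int parse_time_range(const char *value, double *start, double *end)
{
    char buf[64];
    char *slash;

    if (strlen(value) >= sizeof buf) return -1;
    strcpy(buf, value);

    *end = -1.0;
    slash = strchr(buf, '/');
    if (slash != NULL) {
        *slash = '\0';   /* split into "start" and "end" */
        if (slash[1] != '\0' && parse_npt(slash + 1, end) != 0) return -1;
    }
    if (parse_npt(buf, start) != 0) return -1;
    if (*end >= 0.0 && *end < *start) return -1;
    return 0;
}
```

Called with the value from the example URL, ttparse_time_range("00:23/00:26", &start, &end)/tt would yield a start of 23 seconds and an end of 26 seconds.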
p
It implements the same chopping algorithm as the Haskell version tthogg chop/tt,
released in a href="http://blog.kfish.org/2007/12/release-hogg-030.html"HOgg 0.3.0/a,
so it will insert an
a href="http://wiki.xiph.org/OggSkeleton"Ogg Skeleton/a
track which can give players hints about what time the in-sync
audio and video data should start being rendered. If any of the input files include
Skeleton information, it will be preserved, and the output will contain only one Skeleton
track.
/p
p
Many thanks to Michael Dale, j^ and John Ferlito for testing out ttoggz-chop/tt
during its development.
/p
-
Posted: June 30th, 2008, 1:10pm CEST
pWe recently opened the
a href="http://www.foms-workshop.org/foms2009/pmwiki.php/Main/CFP"FOMS 2009 Call for Participation/a. FOMS (Foundations of Open Media Software) is a developer workshop "to widen cooperation and interoperability among open source media projects". It will be held a few days before a href="http://linux.conf.au/"linux.conf.au/a, in Hobart, Tasmania.
/p
p
This year's FOMS had a large emphasis on
a href="http://blog.kfish.org/2008/02/foms-lca-2008-roundup.html"free codecs/a, with many of the Xiph.Org developers in attendance.
In 2009 we really hope to expand the participation to include people from projects with alternate technical viewpoints, such as those of
a href="http://www.mplayerhq.hu"MPlayer/a and a href="http://nut-container.org/"NUT/a. It would also be good to get some exposure to projects like
a href="http://openbossa.indt.org/canola/"Canola2/a and
a href="http://omxil.sourceforge.net/"Bellagio/a, in order to deal with the issues of mobile multimedia. If you're involved in development of those or similar projects and would be interested in attending FOMS 2009, please respond to the CFP; there will be some travel grants available to help get you to Tasmania.
/p
bRelated conferences/b
ul
liI'll be at a href="http://linuxplumbersconf.org/"Linux Plumbers Conference/a 2008 in Portland, Oregon, specifically for the "Audio" microconf./li
liWe hope to run a Multimedia Miniconf at LCA 2009 :-)/li
/ul
-
Posted: April 17th, 2008, 4:45pm CEST
p
Some of my favourite Firefox plugins are:
ul
lia href="http://www.polarcloud.com/rikaichan/"Rikaichan/a, a Japanese dictionary, which adds instant translation popups when you mouse over a word;/li
lia href="http://vimperator.mozdev.org/"Vimperator/a, which provides ttvi/tt-like user interface;/li
liand a href="https://addons.mozilla.org/en-US/firefox/addon/1337"Hide Tab Bar/a, because Vimperator's buffer list is more useful./li
/ul
p
Vimperator hides the menu bar by default. bTools-Toggle Rikaichan/b has no default keybinding, and the keybindings to navigate the menubar are not available if the menubar is not visible, so Rikaichan can no longer be activated.
/p
p
The following adds a vimperator command tt:rikaichan/tt; save it to tt.vimperator/plugin/toggleRikaichan.js/tt:
blockquotepre
(function(){
vimperator.commands.add(new vimperator.Command(
['rikaichan', 'rikai'],
function(){
rcxMain.inlineToggle();
}
))
}) ();
/pre/blockquote
/p
p
It is aliased to tt:rikai/tt for short, but unfortunately vimperator won't recognize tt:理解/tt.
Thanks to ktsukagoshi for the explanation of how to write a vimperator plugin (a href="http://d.hatena.ne.jp/ktsukagoshi/20080305/1204730962"vimperatorのプラグインの作成/a).
/p
p
iRemember, a href="http://www.vergenet.net/~conrad/syre/"the interface is inside your mind/a./i
/p
-
Posted: April 14th, 2008, 9:30am CEST
p
Yesterday was a href="http://logic.cs.tsukuba.ac.jp/Continuation/"Continuation Fest 2008/a,
at the University of Tokyo's campus in Akihabara (a very nice venue!).
It was very well attended; latecomers overflowed to a second room and participated by video conference. It was a little strange to see so many people interested in such an
strikeobscure, troublesome and malignant/strike expressively powerful
programming construct; the breadth of talks made for a very inspiring and practical introduction to the theory, applications and implementation of continuations in many different languages.
/p
p
I recommend reading
a href="http://pllab.is.ocha.ac.jp/~asai/"Kenichi Asai/a's
introduction to delimited continuations
(a href="http://pllab.is.ocha.ac.jp/~asai/papers/contfest08slide.pdf"slides/a [PDF]).
He introduced the ttshift/tt and ttreset/tt operators
through the problem of expressing exceptional control flow, and
then explained how to use these to type (ie. determine a concrete type for)
ttprintf/tt. The main point was that
ttshift/reset/tt provide a high-level abstraction over control flow, with minimal impact
on the implementation of your existing functions.
/p
p
a href="http://okmij.org/ftp/"Oleg Kiselyov/a demonstrated some new code for transactional
web applications, using delimited continuations for explicit state sharing between parallel connections. The result is that the user has a consistent view even when multiple tabs are open on the same site, and the state is transactional, so there is no need for warnings like "Do not press the BUY button more than once!". He said that everyone already understands delimited continuations, they just don't realize it.
/p
p
The topic of my presentation at Continuation Fest was
bContinuations for video decoding and scrubbing/b:
/p
blockquotep
Playback of encoded video involves scheduling the decoding of audio and video frames and synchronizing their playback. "Scrubbing" is the ability to quickly seek to and display an arbitrary frame, and is a common user interface requirement for a video editor. The implementation of playback and scrubbing is complicated by data dependencies in compressed video formats, which require setup and manipulation of decoder state.
/pp
We present the preliminary design of a continuation-based system for video decoding, reified as a cursor into a stream of decoded video frames. Frames are decoded lazily, and decoder states are recreated or restored when seeking. To reduce space requirements, a sequence of decoded frames can be replaced after use by the continuation which created them.
strikeWe outline implementations in Haskell and C./strike
/p/blockquote
p
ul
lia href="http://seq.kfish.org/~conrad/static/continuation-fest-2008/continuations-for-video.pdf"Slides/a [383KB PDF]/li
lia href="http://seq.kfish.org/~conrad/static/continuation-fest-2008/continuations-for-video.article.pdf"Article/a [215KB PDF]/li
/ul
/p
pI'll be introducing the code for this over the next few months.
Whereas in my presentation about
a href="http://blog.kfish.org/2008/03/bossa-2008-video-player-internals.html"video player internals/a at BOSSA I outlined the problem space in designing a multimedia architecture,
at Continuation Fest I tried to break it down into subproblems and considered
useful data structures and programming techniques for dealing with them.
/p
p
I got a lot of great feedback, and I think I succeeded in my mission to introduce this problem space to some really smart people.
Thanks particularly to
a href="http://www.cs.rutgers.edu/~ccshan"Chung-chieh Shan/a for some insightful ideas
about how to deal with existing stateful codec implementations. It was also very interesting to
talk with
a href="http://www.ie.u-ryukyu.ac.jp/~kono/index-e.html"Shinji Kono/a about
a href="http://sourceforge.jp/projects/cbc/"Continuation-based C (cBc)/a
(a href="http://www.ie.u-ryukyu.ac.jp/~kono/tmp/cf08-kono.tgz"slides/a [HTML tarball]),
a C-like language capable of expressing continuations, non-local jumps, multiple function entry-points, and assorted other ways to shoot yourself in the foot. He suggested that it was designed for exactly the kind of thing I'm doing, and I'll be interested to try it
out. It is implemented in a modified GCC 4.x as an RTL code generator, so should now be (fairly)
architecture-independent.
/p
p
Thanks to the organizers of Continuation Fest 2008 for putting together such a useful and interesting event. I look forward to implementing just some of the things I learned :-)
/p
-
Posted: April 11th, 2008, 11:43pm CEST
This is a bugfix release of a href="http://www.metadecks.org/software/sweep/"Sweep/a,
addressing a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-1686"CVE-2008-1686/a.
For details, see my earlier post about
a href="http://blog.kfish.org/2008/04/release-libfishsound-091.html"libfishsound 0.9.1/a.
Thanks to Peter Shorthose for managing this release.
-
Posted: April 7th, 2008, 3:08am CEST
p
This is a maintenance release, fixing a security vulnerability in Speex header processing as outlined in a href="http://www.ocert.org/advisories/ocert-2008-2.html"oCERT 2008-02/a.
When used in a client for web video content, as in the
a href="http://www.annodex.net/"OggPlay Firefox Plugin/a or the
a href="http://www.illiminable.com/ogg/"Ogg DirectShow filters/a, a specially crafted Ogg Speex stream hosted on a server could be used to allow an attacker to execute arbitrary code on the client system. The OggPlay plugin binaries available from a href="http://www.annodex.net/"www.annodex.net/a have already been updated.
/p
h4Details/h4
p
The a href="http://wiki.xiph.org/OggSpeex"Speex header/a contains a 32-bit ttmodeID/tt field, interpreted by libspeex as a signed int (ttspx_int32_t/tt).
The normal way to use this is to index into a global mode list to retrieve a SpeexMode *:
blockquotepre
mode = (SpeexMode *)speex_mode_list[modeID];
/pre/blockquote
and then use that to set up a decoder:
blockquotepre
st = speex_decoder_init(mode);
/pre/blockquote
This calls ttspeex_decoder_init()/tt in libspeex, which looks like:
blockquotepre
void *speex_decoder_init(const SpeexMode *mode)
{
return mode->dec_init(mode);
}
/pre/blockquote
So if you don't check that the ttmodeID/tt given in the stream header is within the bounds of ttspeex_mode_list[]/tt, arbitrary code can be executed.
ttlibfishsound/tt was checking the upper bound (ttmodeID < SPEEX_NB_MODES/tt) but was not checking against negative values.
/p
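To make the missing check concrete, here is a minimal C sketch of a lookup that validates both bounds of a signed mode index before using it. The type, table and function names (ttDemoMode/tt, ttdemo_mode_list/tt, ttdemo_get_mode/tt) are stand-ins invented for this example and are not the libspeex or libfishsound definitions; only the shape of the check is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a codec mode descriptor. */
typedef struct DemoMode {
    const char *name;
} DemoMode;

#define DEMO_NB_MODES 3

/* Illustrative mode table; the real list lives inside the codec library. */
static const DemoMode demo_mode_list[DEMO_NB_MODES] = {
    { "narrowband" }, { "wideband" }, { "ultra-wideband" }
};

/* Return the mode for modeID, or NULL if modeID is out of range.
 * modeID comes from an untrusted stream header and is signed, so the
 * lower bound must be checked as well as the upper bound. */
const DemoMode *demo_get_mode(int32_t modeID)
{
    if (modeID < 0 || modeID >= DEMO_NB_MODES)
        return NULL;
    return &demo_mode_list[modeID];
}
```

A caller would then refuse to set up a decoder when the lookup returns ttNULL/tt, rather than following a pointer read from outside the table.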
h4Discussion/h4
p
This header processing is all boilerplate, and a reference implementation is given in
a href="http://svn.xiph.org/trunk/speex/src/speexdec.c"speexdec.c/a.
I took a copy of that about 7 years ago for
a href="http://www.metadecks.org/software/sweep/"Sweep/a, which I then adapted for libfishsound. The current reference speexdec.c does not have this bug.
/p
p
For the Symbian port of Speex we created a function which returns the desired mode given a modeID, rather than having application code index into a global mode list.
I wrote and committed a href="https://trac.xiph.org/changeset/7511"speex_get_mode()/a
to libspeex in September 2004, and it does the correct bounds checking.
So if I'd been using that function in libfishsound then today's problem would never have happened. As it turns out, the libfishsound svn trunk version of
a href="http://svn.annodex.net/libfishsound/trunk/src/libfishsound/speex.c"speex.c/a
does use that function. As far as I am aware, the OggPlay plugin binaries have always been built against the libfishsound svn trunk, so they were never vulnerable in the first place. However, recent tarball releases of libfishsound have been coming off a separate branch, so the advisory is valid for applications linked against those releases.
/p
p
Finally, I sent a patch to
a href="http://people.xiph.org/~jm/"Jean-Marc Valin/a yesterday which entirely removes the possibility of this bug happening again by bounding the mode values returned by ttspeex_packet_to_header()/tt in libspeex. It will be available very soon in a libspeex release.
/p
h4Acknowledgements/h4
p
Thanks to the team at a href="http://www.ocert.org/"oCERT/a for the efficient reporting of this advisory, and to the anonymous submitter for the details.
I was able to patch the offending branches, which allowed
a href="http://v2v.cc/~j/"j^/a to build and upload new OggPlay plugin binaries (within 24 hours of contact by oCERT).
/p
p
ul
lia href="http://lists.xiph.org/pipermail/speex-dev/2008-April/006636.html"libfishsound 0.9.1 release notes/a/li
lia href="http://www.ocert.org/advisories/ocert-2008-2.html"oCERT 2008-02/a/li
/ul
/p
-
Posted: March 24th, 2008, 7:23pm CET
p
a href="http://www.vergenet.net/~conrad/software/hogg/"HOgg/a
is a Haskell library and commandline tool for manipulating Ogg files.
This release contains a bunch of code written during a href="http://blog.kfish.org/2008/02/foms-lca-2008-roundup.html"FOMS and LCA 2008/a, including
a new sort subcommand and proper handling of Skeleton when merging and ripping files. Full details are in the
a href="http://www.vergenet.net/~conrad/software/hogg/release_notes/hogg-0.4.0.txt"release notes/a.
/p
h3sort implementation/h3
p
My favourite part is the implementation of the new ttsort/tt subcommand:
blockquote
pre
sort :: [OggPage] -> [OggPage]
sort = sortHeaders . listMerge . demux
/pre
/blockquote
/p
p
This is somewhat shorter than the equivalent C implementation,
a href="http://svn.annodex.net/liboggz/trunk/src/tools/oggz-sort.c"oggz-sort.c/a;
bHaskell affords abstraction whereas in C it's a trade-off/b.
ttsortHeaders/tt is a long (21 line) function that re-orders header pages according to
the Theora and Skeleton specifications, and ttlistMerge/tt is a generic list merging function, also used in the ttmerge/tt subcommand. ttdemux/tt is tiny:
blockquote
pre
demux :: (Serialled a) => [a] -> [[a]]
demux = classify serialEq
/pre
/blockquote
You can read that as "demux is classification by serial number": ttclassify/tt is a generic list function, classifying list elements according to some criterion you give it. Here, for example, the list of pages:
blockquote
tt[Video0, Audio0, Video1, Audio1, Audio2, Audio3, Video2, Audio4, Video3, ...]/tt
/blockquote
will get classified into two separate lists:
blockquote
pre
[[Video0, Video1, Video2, Video3, ...],
[Audio0, Audio1, Audio2, Audio3, Audio4, ...]]
/pre
/blockquote
This is done lazily, meaning that the processing is done on the fly and big intermediate lists are not constructed in memory. ttVideo0/tt, ttAudio0/tt will be passed through ttlistMerge/tt and ttsortHeaders/tt and written to disk by the consumer of ttsort/tt well before ttVideo103/tt and ttAudio5007/tt are seen.
/p
h3Documentation improvements and self-checking/h3
p
The help for each subcommand now contains long descriptions, mostly similar to the man pages of the
a href="http://www.annodex.net/software/liboggz/index.html"ttOggz/tt/a tools.
The descriptions also have explicit sections describing how Theora, Skeleton and chained files are handled.
The example commandlines for each subcommand use the
a href="http://wiki.xiph.org/index.php/MIME_Types_and_File_Extensions"Ogg MIME types and file extensions/a that we are now recommending in Xiph.Org.
/p
p
The best bit though is tthogg selfcheck/tt, which checks that the help examples are valid.
It checks that all the example commandlines pass through getOpt without errors, and that all file extensions used in options are valid. This is the kind of nice touch which would have been a pain to code up in C, but fell out cleanly in the Haskell implementation. As it is fairly cheap to run (and printing help text is hardly a performance-critical operation), this option is also silently run after printing out any help output at all, so that such errors are more likely to be found
and reported. The same commit that introduced tthogg selfcheck/tt also fixed two such documentation errors which were found by this option :-)
/p
-
Posted: March 24th, 2008, 7:17pm CET
p
a href="http://www.kfish.org/software/xsel/"XSel/a is a command-line tool for manipulating the X selection.
This is a maintenance release, improving argument handling, documentation and X11 library detection.
/p
ul
lia href="http://www.vergenet.net/~conrad/software/xsel/download/xsel-1.2.0.tar.gz"xsel-1.2.0.tar.gz/a/li
lia href="http://svn.kfish.org/xsel/trunk/release_notes/xsel-1.2.0.txt"Release notes/a/li
/ul
-
Posted: March 24th, 2008, 6:29pm CET
pLast week I attended
a href="http://www.bossaconference.indt.org/"BOSSA/a, a conference on open source software
for mobile embedded platforms, organized by a href="http://www.indt.org.br/"INdT/a. It was held in the town of Porto de Galinhas, Brazil.
Since then I have been hanging out in the INdT labs in Recife, hacking on xine, catching up with friends and exploring the old city.
/pp
The topic of my presentation at BOSSA was bVideo Player Internals/b:
blockquote
Embedded platforms put demands on latency and memory use. Video playback
makes these difficult to guarantee. This presentation discusses the
architecture of video players, and the problems imposed on them by the
design of video codecs and their containers. To explain these problems
we look at both proprietary and open source formats (MPEG, Ogg, Theora,
Dirac, etc.) and evaluate open source video players in this context.
We particularly examine xine and GStreamer, and introduce the minimal
architecture of OggPlay.
/blockquote
/p
p
ul
lia href="http://seq.kfish.org/~conrad/static/bossa-2008/video-player-internals.pdf"Slides/a [613KB PDF]/li
lia href="http://seq.kfish.org/~conrad/static/bossa-2008/video-player-internals.article.pdf"Article/a [330KB PDF]/li
/ul
/p
p
I'm very grateful to INdT for the opportunity to attend; it was an awesome conference in a very beautiful part of the world.
/p
/p
-
Posted: February 15th, 2008, 9:54am CET
p
There's been a whole bunch of work on
a href="http://www.annodex.net/software/liboggz/index.html"liboggz/a recently; it deserves a few more weeks of
shaking out and perhaps some updated Win32/MacOS support before it gets 1.0 slapped on it.
/p
p
a href="http://lists.xiph.org/pipermail/ogg-dev/2008-February/000847.html"liboggz 0.9.7/a
includes a new tool called oggz-sort, which addresses a problem with some encoders that
Shane Stephens brought up at
a href="http://www.annodex.org/events/foms2008/pmwiki.php/Main/Proceedings"FOMS/a. The
discussion was going around in circles, so my response was to write this C code. It implements a function that Shane has written but not yet released in his OCaml implementation of Ogg
(a href="http://svn.annodex.net/oogg/trunk/"oogg/a), and
which I've written but not yet released in my Haskell implementation (a href="http://www.kfish.org/software/hogg/"HOgg/a). Of course, people will take this version more seriously because it's written in C.
/p
p
From ttboggz-sort (1)/b/tt:
blockquote
p
boggz-sort/b sorts an Ogg file, interleaving pages in order of presentation time. It correctly interprets the granulepos timestamps of Ogg
Vorbis, Speex, FLAC and Theora bitstreams, and all bitstreams of Annodex files.
/pp
Some encoders produce files with incorrect page ordering; for example,
some audio and video pages may occur out of order. Although these files
are usually playable, it can be difficult to accurately seek or scrub
on them, increasing the likelihood of glitches during playback. Players
may also need to use more memory in order to buffer the audio and
video data for synchronized playback, which can be a problem when the
files are viewed on low-memory devices.
/pp
The tool boggz-validate/b can be used to check the relative ordering of
packets in a file. If out of order packets are reported, use boggz-sort/b
to fix the problem.
/p
/blockquote
/p
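The interleaving described in the man page excerpt above can be sketched as a simple k-way merge. The following C fragment is only an illustration under stated assumptions: the ttDemoPage/tt and ttDemoStream/tt types and the ttdemo_merge_pages/tt function are invented for this example, each page is assumed to already carry a presentation time in seconds, and the pages of each individual bitstream are assumed to be in order. The real oggz-sort also has to interpret granulepos values and handle header pages, which this sketch leaves out.

```c
#include <stddef.h>

/* Hypothetical, simplified view of an Ogg page. */
typedef struct {
    long   serialno;  /* which logical bitstream the page belongs to */
    double time;      /* presentation time in seconds */
} DemoPage;

/* One logical bitstream: its pages, assumed already in time order. */
typedef struct {
    const DemoPage *pages;
    size_t          count;
    size_t          next;   /* index of the next page not yet written */
} DemoStream;

/* Interleave the pages of nstreams bitstreams into out[] in order of
 * presentation time. out must have room for every page. Ties go to the
 * earlier stream in the array. Returns the number of pages written. */
size_t demo_merge_pages(DemoStream *streams, size_t nstreams, DemoPage *out)
{
    size_t written = 0;

    for (;;) {
        DemoStream *best = NULL;
        size_t i;

        /* Find the stream whose next page has the earliest time. */
        for (i = 0; i < nstreams; i++) {
            DemoStream *s = &streams[i];
            if (s->next >= s->count) continue;  /* stream exhausted */
            if (best == NULL ||
                s->pages[s->next].time < best->pages[best->next].time)
                best = s;
        }
        if (best == NULL) break;  /* every stream is exhausted */
        out[written++] = best->pages[best->next++];
    }
    return written;
}
```

A linear scan over the streams is used for clarity; with only a handful of bitstreams per file there is little to gain from a heap.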
p
This release also adds support for the experimental
a href="http://lists.xiph.org/pipermail/ogg-dev/2007-December/000706.html"CELT/a audio codec, which is being developed
by Jean-Marc Valin (the primary author of a href="http://www.speex.org/"Speex/a). CELT is
designed as a low-latency codec for high-quality audio. When wiretapping conversations
encoded in CELT, we recommend that you record using the Ogg container format. You can then use oggz-tools to help with your analysis.
/p
-
Posted: February 10th, 2008, 12:30am CET
p
This is a story about the meaning of "version 1.0".
A few weeks ago I released
a href="http://blog.kfish.org/2008/01/release-xsel-100.html"version 1.0 of xsel/a, a simple commandline utility
for manipulating the X selection and clipboard.
I chose to call it 1.0 after recalling a discussion with
a href="http://www.algorithm.com.au/"André Pang/a,
about how the meaning of version numbers in open source software tends to differ from that in other software communities. For example, it is often advised not to buy the first version of a proprietary software product as it is sure to be buggy and incomplete; open source projects on the other hand often aspire to 1.0 being a major milestone, bug-free and fully-functional. The Windows and Mac freeware and shareware communities tend to follow a middle ground, content to release a useful but incomplete version 1.0, but thereafter avoiding the quick version creep that afflicts companies with marketing departments (and version-limited support contracts).
/p
p
I'll argue that that middle way makes for more meaningful version numbers. Putting the label "1.0" on a release should be your way of saying that it's the first version that:
/p
ul
liwon't hose a user's system, and/li
lihopefully does something useful./li
/ul
p
Any version number less than 1.0 is sending out a signal that the software isn't quite ready yet; perhaps that you could lose or damage data by using it. Many people intuitively wait for version 1.0 before trying out some software, and this is fair enough. In fact, we ineed/i a way of warning that a project isn't ready for widespread adoption, that it could damage data, that the tarball is only out there so that other people can grab the code and help fix bugs. That's what version numbers less than 1.0 mean.
/p
p
After 1.0, you can keep adding features and bumping the version number, working towards version 2.0 which perhaps does useful things in a different way. And from 2.0, onwards to 3.0 and beyond; integers are cheap. The important thing is not to fall into the trap of thinking of 1.0 as some kind of asymptotic upper bound representing the perfect release.
/p
p
Back to xsel. At first I wrote up release notes as version 0.9.7, but
then remembered that discussion with André and realized that it should really
just be 1.0.
More to the point the previous release (in July 2001, which went five years without a bug report or patch) should have been 1.0. So yeah, it was a good feeling to just write "1.0" and send it out.
/p
p
The morning after releasing 1.0 I got a report from someone who couldn't get
it to compile -- turns out they didn't have the X11 development libraries
installed, and for some reason I had commented out the checks for that
in ttconfigure.ac/tt while testing something or other a while ago.
As a result, the configure script wasn't checking for its only dependency.
I considered doing a canonical 1.0.1 (LOL) release. Within the next day, though,
I got a report about how to fix handling of COMPOUND_TEXT, an archaic way of
handling international text (since superseded by UTF8_STRING). And there follows
the next lesson (as berated by a href="http://www.rasterman.com/"Raster/a): random
bug reports stream in emphafter/emph a release, not before.
a href="http://www.mega-nerd.com/"Erik de Castro Lopo/a is up to his 20th
pre-release of libsndfile 1.0.18; each pre-release he gets bombarded with reports.
/p
p
Anyway, the post-1.0 bug reports have died down, so today I'm releasing version 1.1.0 of
a href="http://www.vergenet.net/~conrad/software/xsel"xsel/a.
i"This release adds basic support for COMPOUND_TEXT and fixes a configuration bug"/i.
And I'm still waiting to hear good uses of bttxsel --append/tt/b and
bttxsel --follow/tt/b.
/p
-
Posted: February 8th, 2008, 6:58am CET
p
I arrived back in Japan after a few awesome weeks in Australia for
a href="http://www.annodex.org/events/foms2008/pmwiki.php/Main/HomePage"FOMS/a and a href="http://linux.conf.au/"LCA/a. The weather in Melbourne was great, and the food was fantastic.
/p
p
Between FOMS and LCA, dozens of free multimedia software developers were in town. It was the first time that developers of Dirac, Speex, Theora, Vorbis, Ogg, and most of the Annodex crew were all in the same place, so we spent most of the week of LCA holed up in a room designing content description and packaging formats. One immediate outcome will be finalization of the Dirac mapping into the Ogg container.
/p
p
I organised the multimedia miniconf on the Monday of LCA, which was jam-packed with excellent presentations and lightning talks. Thanks to everyone who came, and talked, and video recorded. There were plenty of comments along the lines of it being "pretty hardcore for a miniconf".
If you are interested in helping with next year's LCA Multimedia, or have friends in Hobart who might be able to help, let's start throwing around ideas. In particular, quite a few people asked what happened to the audio miniconf parties from a few years ago, and it might be a good chance to revive those ...
/p
h4Videos/h4
p
The following pages contain embedded videos of the presentations from these events, and the multimedia-related presentations from LCA:
/p
ul
lia href="http://www.annodex.org/events/foms2008/pmwiki.php/Main/Proceedings"FOMS Proceedings/a: introductions by the participants/li
lia href="http://www.annodex.org/events/lca2008_mmm/pmwiki.php/Main/Schedule"LCA Multimedia/a: Dirac, Xiph, EngageMedia, FFADO and many others/li
lia href="http://www.annodex.org/events/lca2008_mmm/pmwiki.php/Main/LCA"Multimedia talks @LCA/a: PulseAudio, Ogg, Theora, Telepathy, Farsight .../li
/ul
blockquote
p
The videos on these pages are embedded with a href="http://metavid.ucsc.edu/wiki/index.php/Mv_embed"mv_embed/a, which supports playback via the a href="http://www.annodex.net/"OggPlay plugin for Firefox/a, vlc-plugin or generic application/ogg.
mv_embed is a JavaScript library by Michael Dale of a href="http://metavid.org/"MetaVid/a. It is really easy to use: you just include that library (tt<script src="...">/tt) and then write tt<video src="...">/tt anywhere in your page. No need to wait for native HTML5 support in your browser :-)
/p
/blockquote
-
Posted: January 13th, 2008, 6:11am CET
p
This release of
a href="http://lists.xiph.org/pipermail/ogg-dev/2008-January/000717.html"Oggz 0.9.6/a contains a new tool, bttoggz-comment/tt/b, which can be used to edit the basic metadata (title, producer, copyright etc.) of Ogg Theora files.
The library also has some pretty major improvements to the way it works out timestamps and does seeking, mostly the work of Shane Stephens.
/p
p
In media files, timing and synchronization are extremely important. If the image and audio start to go out of sync, it is very noticeable and the video quickly becomes unwatchable. When you scan through a file you often need to decode a lot more data than you actually display. This is particularly the case when you jump backwards, which is common in a user interface that supports scrubbing. As video frames are stored as a difference relative to earlier (or later) frames, you end up needing to secretly jump further back in the file to the previous keyframe, and then decode many frames up to the one you actually want to show. For a smooth user experience you need to do this as quickly as possible.
/p
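p
As a rough illustration of the keyframe problem described above, here is a minimal sketch in Haskell. The ttFrameKind/tt type and the ttframesToDecode/tt function are hypothetical names invented for this example, and it assumes frames only depend on earlier frames; it is not how liboggz or any decoder is actually implemented.
/p

```haskell
module Main where

-- | A frame is either self-contained (a keyframe) or stored as a
-- difference against earlier frames. (Hypothetical type for illustration.)
data FrameKind = Key | Delta deriving (Eq, Show)

-- | Indices of the frames that must be decoded in order to display the
-- frame at index 'target': everything from the nearest keyframe at or
-- before the target, up to and including the target itself.
framesToDecode :: [FrameKind] -> Int -> [Int]
framesToDecode kinds target
  | target < 0 || target >= length kinds = []
  | otherwise = [start .. target]
  where
    -- Positions of all keyframes at or before the target.
    keyIndices = [i | (i, Key) <- zip [0 ..] kinds, i <= target]
    -- Fall back to the start of the stream if no keyframe was found.
    start = if null keyIndices then 0 else last keyIndices

main :: IO ()
main = do
  let stream = [Key, Delta, Delta, Delta, Key, Delta, Delta]
  print (framesToDecode stream 3) -- [0,1,2,3]
  print (framesToDecode stream 5) -- [4,5]
```

p
Showing frame 3 means decoding four frames, while showing frame 5 means decoding only two, which is why the distance to the previous keyframe dominates the cost of scrubbing backwards.
/p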
p
Ogg has some interesting framing properties. Given that timing is so important, you might expect that every packet has its precise timing information associated with it. In Ogg, it turns out not to be so. Packets are stored in pages, and there is only one timestamp per page. It is common for many audio packets to be crammed onto one page; the timing information for all the rest is not stored in the file. On the other hand, the encoded data for video keyframes is usually much larger, and spans multiple pages. Only the last packet on a page has its timestamp recorded, so if the keyframe is followed by a much smaller packet of frame data in the same page, the timestamp for the keyframe will be lost. For these reasons I tend to refer to Ogg as a "lossy" container.
/p
p
In order to minimize these problems, liboggz now inspects the encoded data in order to reconstruct the expected granulepos (corresponding to a timestamp) for every packet in an Ogg stream. This allows applications to use reliable timestamps, even though these are only sparsely recorded in most Ogg streams.
This is not as easy as it sounds, particularly for Ogg Vorbis.
To get a flavour of what's involved, read Shane's rant in the comments, explaining how to
a href="http://trac.annodex.net/browser/liboggz/trunk/src/liboggz/oggz_auto.c#L468"calculate Vorbis timestamps/a.
/p
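p
The basic idea of reconstructing per-packet positions can be sketched as follows. This is a deliberately simplified model and not the liboggz algorithm: it assumes the granulepos is a plain running sample count, and that the duration of every packet on the page is already known. The name ttpacketGranules/tt is made up for this example. Real Vorbis is harder because a packet's duration depends on the block sizes of its neighbours.
/p

```haskell
module Main where

-- | Given the granulepos recorded on a page (which belongs to the last
-- packet on that page) and the duration of each packet on the page,
-- work backwards to recover a granulepos for every packet.
packetGranules :: Integer -> [Integer] -> [Integer]
packetGranules pageGranule durations =
  -- 'scanr (+) 0' yields, for each packet, the total duration from that
  -- packet to the end of the page; dropping the first element leaves the
  -- total duration of the packets that come after each one.
  map (pageGranule -) (drop 1 (scanr (+) 0 durations))

main :: IO ()
main = print (packetGranules 1000 [100, 200, 300]) -- [500,700,1000]
```

p
Only the final value, 1000, is actually stored in the page header; the other two are inferred from the packet durations.
/p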
p
For an in-depth discussion, come to Ralph Giles' talk at linux.conf.au,
a href="http://linux.conf.au/programme/detail?TalkID=68"Seeking is hard: Ogg design internals/a.
/p
-
Posted: January 12th, 2008, 4:53pm CET
p
a href="http://www.vergenet.net/~conrad/software/xsel/"XSel/a is a command-line program for getting and setting the contents of the X selection. You can use ttxsel/tt in shell scripts and desktop keybindings, so that the contents of the X selection are available to command arguments:
/p
blockquote
bttmozilla --remote "openurl(`xsel`)"/tt/b
/blockquote
p
This release adds UTF-8 support and fixes various bugs. The last version of XSel was 0.9.6, released sometime around 2001. It may have been the first version also. For some reason a bunch of patches came in recently, and I've had the joy of revisiting this project.
/p
p
For old time's sake, my
a href="http://lists.slug.org.au/archives/slug-chat/2001/July/msg00054.html"thoughts on ICCCM/a. (Warning: explicit language).
Back then I made a point of implementing as much of that crack as possible. You can even tell applications to delete their selected text:
/p
blockquote
ul
liTo delete the contents of the selection: bttxsel --delete/tt/b/li
/ul
/blockquote
p
(This really works: you can try it on ttxedit/tt to remotely delete text in the editor window.)
/p
p
This time around, of course, nothing does what the docs say anymore.
So we ignore the details in the 2001 proposal for Inter-Client
Exchange of Unicode Text and just grunt atoms at the selection owner
until they yield all their secrets. And now, finally, ttxsel/tt works on
Japanese.
/p
p
emPeople have come up with some interesting uses for ttxsel/tt over the years, but nobody has yet come up with a nifty use for the following options:/em
/p
blockquote
ul
liTo append to the X selection: bttxsel --append &lt; file/tt/b/li
liTo follow a growing file: bttxsel --follow &lt; file/tt/b/li
/ul
/blockquote
p
Any ideas?
/p
-
Posted: January 12th, 2008, 11:34am CET
p
Now a href="http://lists.xiph.org/pipermail/flac-dev/2008-January/002472.html"libfishsound 0.9.0/a supports
a href="http://flac.sourceforge.net/"FLAC/a, the Free Lossless Audio Codec.
The a href="http://www.annodex.net/software/libfishsound/libfishsound-flac/"patches/a
were originally contributed by Tobias Gehrig in 2004. There hasn't been much use of Ogg FLAC, whereas FLAC in its native encoding is very popular. However, the point of the Ogg mapping is to allow FLAC to be used in parallel with other codecs, in particular as the audio codec for video files.
The combination of Theora video and FLAC audio can be very useful for music videos, where you might not care too much if the image has lost some quality but you want the sound to be as good as possible.
/p
p
However, creating such a file isn't so easy. Let's say you have a source video, like
a href="http://www.archive.org/details/gtv204_jacobfredjazzodyssey"GrooveTV #204 - Jacob Fred Jazz Odyssey/a. I took the MPEG-1 file as recommended; for clarity, let's call it ttsource.mpg/tt. To make a video to test on, I did:
/p
blockquote
p
bttffmpeg2theora source.mpg/tt/bbr/
to encode the video into an Ogg file containing Theora video and Vorbis audio. This produces bttsource.ogv/tt/b.
/p
p
bttoggzrip -c theora source.ogv -o video-theora.ogv/tt/bbr/
to extract only the Theora video track, into bttvideo-theora.ogv/tt/b.
/p
p
bttmpg123 -w source.wav source.mpg/tt/bbr/
to extract the audio to a WAV file, bttsource.wav/tt/b. Here the audio in the source material was encoded as MPEG-1 Layer II; obviously if you were producing a music video, you'd skip this step and encode FLAC from the original recording. I didn't have that here, and I just wanted a file I could test on.
/p
p
However, at the least this step means that no further artifacts are introduced into the audio, other than those which were present in the MPEG encoding. If the only source material you have is already encoded, you don't want to degrade it further by re-encoding it with a different codec.
/p
p
bttflac --ogg source.wav -o audio-flac.oga/tt/bbr/
to encode the audio. This produces an Ogg FLAC file called bttaudio-flac.oga/tt/b.
/p
p
bttoggzmerge video-theora.ogv audio-flac.oga -o final.ogv/tt/bbr/
to merge the video and audio tracks into the final Ogg video file, bttfinal.ogv/tt/b.
/p
/blockquote
p
Note that we're using the recently recommended
a href="http://wiki.xiph.org/index.php/MIME_Types_and_File_Extensions"file extensions for Ogg video and audio/a.
/p
p
If you know an easier way to create Ogg Theora+FLAC files, please leave a note in the comments :-)
/p
-
Posted: December 11th, 2007, 5:04am CET
pThere has been a bit of a href="http://lists.xiph.org/pipermail/advocacy/2007-December/001469.html"FUD about Ogg Theora/a recently
[a href="http://yro.slashdot.org/yro/07/12/09/2045200.shtml"2/a]
[a href="http://www.boingboing.net/2007/12/09/nokia-to-w3c-ogg-is.html"3/a].
So, over on a href="http://wiki.whatwg.org/wiki/IRC"#whatwg/a, one day before the a href="http://www.w3.org/2007/08/video/"W3C Video on the Web Workshop/a:
/p
blockquote
table
trtd11:35:59/tdtd * Hixie casually removes Ogg from the spec and sees what happens/td/tr
trtd11:36:43/tdtd * othermaciej_ takes shelter/td/tr
trtdnbsp;/tdtd.../td/tr
/table
/blockquote
p
The editor of the HTML5 draft specification, Ian Hickson (Hixie), sent a href="http://lists.w3.org/Archives/Public/public-html/2007Dec/0136.html"this message /a:
/p
blockquote
I've temporarily removed the requirements on video codecs from the HTML5
spec, since the current text isn't helping us come to a useful
interoperable conclusion. When a codec is found that is mutually
acceptable to all major parties I will update the spec to require that
instead and then reply to all the pending feedback on video codecs.
/blockquote
blockquote
table
trtd12:05:02/tdtd &lt;kfish&gt; Hixie!/td/tr
trtd12:11:47/tdtd * kfish throws a tantrum on behalf of the free software community/td/tr
trtdnbsp;/tdtd.../td/tr
/table
/blockquote
p
However, the change didn't turn out to be so bad after all. The new text reads:
/p
blockquote
...; we need a codec that is known to not require per-unit or per-distributor licensing, that is compatible with the open source development model, that is of
sufficient quality as to be usable, and that is not an additional submarine patent risk for large companies.
/blockquote
p
The previous draft stated no such requirements. As no rationale was given for choosing Ogg, that recommendation was easy to attack.
Members of the a href="http://www.mpegla.com/"MPEG LA/a, the cabal whose members receive money when people use content in MPEG formats, then had a fairly easy job of inciting a href="http://www.whatwg.org/issues/#graphics-video-codec"flamewars/a
on the whatwg list.
/p
p
The new, clearer wording should allow more productive technical discussion, so that we can actually build an a href="http://perens.com/OpenStandards/Definition.html"open standard/a which encourages anyone, anywhere, to publish their videos freely.
/p
blockquote
table
trtd12:29:48/tdtd * kfish reads the replacement text and revokes the tantrum/td/tr
trtd12:30:15/tdtd &lt;kfish&gt; Hixie, actually you didn't casually remove Ogg, you made the case for Ogg stronger, so thankyou :-)/td/tr
trtd12:35:37/tdtd &lt;Dashiva&gt; "Lift the cat who was amongst the pigeons up and put him back on his pedestal for now."/td/tr
trtd12:35:40/tdtd &lt;Dashiva&gt; Poetic/td/tr
trtd12:37:49/tdtd &lt;Hixie&gt; kfish: :-)/td/tr
/table
/blockquote
-
Posted: December 6th, 2007, 2:31pm CET
a href="http://www.kfish.org/software/hogg/"Hogg/a is a command-line tool for manipulating Ogg files. It has subcommands, like tthogg chop/tt for cutting out bits of video, tthogg info/tt for telling you about the codecs, and tthogg dump/tt for hexdumping the packet data.
It's basically a re-implementation of most of the stuff in a href="http://www.annodex.net/software/liboggz/index.html"liboggz/a, but the new features in
a href="http://www.kfish.org/software/download/hogg-0.3.0.tar.gz"hogg 0.3.0/a,
such as chopping out a section of a file and adding a href="http://wiki.xiph.org/OggSkeleton"Ogg Skeleton/a metadata, are not yet in ttoggz-tools/tt.
pre
$ hogg help chop
chop: Extract a section (specify start and/or end time)
Usage: hogg chop [options] filename ...
Examples:
Extract the first minute of file.ogg:
hogg chop -e 1:00 file.ogg
Extract from the second to the fifth minute of file.ogg:
hogg chop -s 2:00 -e 5:00 -o output.ogg file.ogg
Extract only the Theora video stream, from 02:00 to 05:00, of file.ogg:
hogg chop -c theora -s 2:00 -e 5:00 -o output.ogg file.ogg
Extract, specifying SMPTE-25 frame offsets:
hogg chop -c theora -s smpte-25:00:02:03::12 -e smpte-25:00:05:02::04 -o output.ogg file.ogg
/pre
Nevertheless, I'm continuing to work on both ttliboggz/tt and tthogg/tt. ttliboggz/tt, in pure C, is faster; tthogg/tt, in pure (but unoptimised) Haskell, is more correct.
I spent a few hours earlier today tracking down a corner case in ttliboggz/tt, coincidentally triggered by the chopping routines in ttlibannodex/tt. It reminded me that one of my first realizations about Haskell was that its sanity-checker often tells you about forgotten corner cases of algorithms.
-
Posted: November 15th, 2007, 12:50pm CET
Haskell source is interpreted as UTF-8, but internally the data is stored as Unicode code points. However the generic show method does not serialize Strings as UTF-8
(when using GHC).
So, when reading or writing documents it is necessary to introduce an explicit conversion from or to the desired character set. This article outlines how to use Unicode in Haskell, and surveys three alternatives for character set conversion: iiconv/i, iutf8-string/i and iencoding/i, providing working examples for each.
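To make the distinction between code points and serialized text concrete, here is a small sketch using only the base library. The string is written with a numeric escape so that the example does not depend on the encoding of the source file; 26376 is the code point of the character 月.

```haskell
module Main where

import Data.Char (ord)

main :: IO ()
main = do
  let s = "\26376"        -- one character: U+6708
  print (length s)        -- 1: a String is a list of code points
  print (map ord s)       -- [26376]
  putStrLn (show s)       -- "\26376": 'show' escapes it, it does not emit UTF-8
```

The output of ttshow/tt is pure ASCII, which is safe to print anywhere but is not what you want to put in a document meant for human readers.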
h2Unicode in Haskell source/h2
The Haskell Prime standardization wiki contains discussions of
a href="http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource"Unicode in Haskell Source/a, and of ways of handling
a href="http://hackage.haskell.org/trac/haskell-prime/wiki/CharAsUnicode"Char as Unicode/a.
In particular, GHC (as of release 6.6, October 2006) interprets source files as UTF-8. Hence the following is a valid source file:
pre
import System.Time

main :: IO ()
main = do
    time <- getClockTime
    cal <- toCalendarTime time
    putStrLn $ dayName $ ctWDay cal

dayName :: Day -> String
dayName d = case d of
    Monday -> "月曜日"
    Tuesday -> "火曜日"
    Wednesday -> "水曜日"
    Thursday -> "木曜日"
    Friday -> "金曜日"
    Saturday -> "土曜日"
    Sunday -> "日曜日"
/pre
The ttdayName/tt function provides the Japanese name for a given ttDay/tt. However the ttmain/tt function, which tries to ttprint/tt that onto ttstdout/tt, dumps it without any character set conversion, truncating each character to 8 bits. In order to control the output charset, we need to use a Unicode conversion library. The three libraries
iiconv/i, iutf8-string/i and iencoding/i have similar purposes but some different features.
h2a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/iconv"iconv/a/h2
table
trthDescription:/thtdBinding to C iconv() function/td/tr
trthAuthor:/thtdDuncan Coutts/td/tr
trthdarcs get:/thtdtta href="http://code.haskell.org/iconv/"http://code.haskell.org/iconv//a/tt/td/tr
trthExports:/thtdttCodec.Text.IConv/tt/td/tr
trthInterface:/thtdttByteString.Lazy/tt/td/tr
trthAdvantages:/thtdSpeed, coverage of charset support/td/tr
trthDisadvantages:/thtdPortability: requires POSIX tticonv()/tt/td/tr
/table
This is a Haskell binding to the tticonv()/tt C library function, providing a lazy ByteString interface.
The only module exported is ttCodec.Text.IConv/tt, which provides a single
function:
pre
-- | Convert fromCharset toCharset input output
convert :: String -> String -> Lazy.ByteString -> Lazy.ByteString
/pre
where ttfromCharset/tt and tttoCharset/tt are the names of the input and output character set encodings, and input and output are the input and output text
as lazy ByteStrings.
An example program to convert the encoding of an input file, similar to the
GNU iconv program, is given in
a href="http://haskell.org/~duncan/iconv/examples/hiconv.hs"examples/hiconv.hs/a.
The guts of that program is:
pre
output = convert (fromEncoding config) (toEncoding config) input
/pre
which is somewhat clearer than the
a href="http://lists.slug.org.au/archives/coders/2006/12/msg00003.html"brain-damaged/a interface exported by the C library. Exceptions are provided for handling unsupported conversions, invalid and incomplete characters. These errors can be silently ignored if desired by calling ttconvertFuzzy/tt instead.
As this library wraps the system tticonv()/tt implementation, all character sets supported on the underlying system are available. The Lazy.ByteString interface works directly on the memory buffers used by the C library, which may give a speed advantage for large conversions.
Note however that the tticonv()/tt C library function is defined by POSIX.1-2001 and may not be available on some older systems. In most such cases it should be possible to install
a href="http://www.gnu.org/software/libiconv/"GNU libiconv/a separately.
h2a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string"utf8-string/a/h2
table
trthDescription:/thtdSimple UTF-8 conversion library/td/tr
trthAuthor:/thtdEric Mertens/td/tr
trthdarcs get:/thtdtta href="http://code.haskell.org/utf8-string/"http://code.haskell.org/utf8-string//a/tt/td/tr
trthExports:/thtdttCodec.Binary.UTF8.String, System.IO.UTF8/tt/td/tr
trthInterface:/thtdttString/tt/td/tr
trthAdvantages:/thtdSimplicity/td/tr
trthDisadvantages:/thtdOnly supports UTF-8 conversions/td/tr
/table
This library contains both a simple module for data conversion with a String interface, and a useful IO module.
The String conversion module, ttCodec.Binary.UTF8.String/tt, provides two pairs of complementary encoding and decoding functions:
pre
-- | Encode a string using 'encode' and store the result in a 'String'.
encodeString :: String -> String

-- | Decode a string using 'decode' using a 'String' as input.
-- | This is not safe but it is necessary if UTF-8 encoded text
-- | has been loaded into a 'String' prior to being decoded.
decodeString :: String -> String

-- | Encode a Haskell String to a list of Word8 values, in UTF8 format.
encode :: String -> [Word8]

-- | Decode a UTF8 string packed into a list of Word8 values, directly to String
decode :: [Word8] -> String
/pre
I guess "not safe" in the comment for ttdecodeString/tt refers to type-safety; for example this function doesn't stop you from trying to decode the same text twice, whereas if you tried that with the plain ttdecode/tt function, the compiler would point out your bug for you.
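To show what an ttencode/tt function of this type has to do, here is a minimal UTF-8 encoder written against the base library only. It is a sketch for illustration, not the implementation used by iutf8-string/i; the names ttencodeChar/tt and ttencodeUtf8/tt are invented for this example, and it makes no attempt to reject surrogate code points.

```haskell
module Main where

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode a single code point as one to four UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- The leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte: binary 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode a whole String; the result is bytes, not characters.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

main :: IO ()
main = do
  print (encodeUtf8 "A")       -- [65]
  print (encodeUtf8 "\233")    -- [195,169]      (U+00E9)
  print (encodeUtf8 "\26376")  -- [230,156,136]  (U+6708)
```

Because the result has type tt[Word8]/tt rather than ttString/tt, an accidental ttencodeUtf8 (encodeUtf8 s)/tt is rejected by the type checker, which is the kind of protection the ttString/tt-to-ttString/tt variants cannot offer.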
To see how this might look in the wild, the following is a complete "Hello World" web application (err, CGI script) in Japanese:
pre
import Codec.Binary.UTF8.String
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=utf-8"
    output $ renderHtml $ h1 << encodeString "おはよう御座います!"
/pre
The iutf8-string/i library also includes an entire IO module, ttSystem.IO.UTF8/tt, exporting
ttprint, putStr, putStrLn, getLine, readLn, readFile, writeFile, appendFile, getContents, hGetLine, hGetContents, hPutStr, hPutStrLn/tt. These essentially wrap the default IO functions in ttencodeString/tt and ttdecodeString/tt, which you may find convenient if you are doing lots of UTF-8 processing.
This library is tiny, and implemented natively in Haskell so there are no portability issues. As it works directly on ByteStrings it should be sufficiently fast for practical purposes. Of course, if you need to do conversions to or from character sets other than UTF-8, you will need to use a different library.
h2a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding"encoding/a/h2
table
trthDescription:/thtdNative Haskell charset conversion library/td/tr
trthAuthor:/thtdHenning Günther/td/tr
trthdarcs get:/thtdtta href="http://code.haskell.org/encoding/"http://code.haskell.org/encoding//a/tt/td/tr
trthExports:/thtdttData.Encoding.*, System.IO.Encoding/tt/td/tr
trthInterface:/thtdttByteString.Lazy/tt/td/tr
trthAdvantages:/thtdPortable; covers more charsets than iutf8-string/i/td/tr
trthDisadvantages:/thtdCovers fewer charsets than iiconv/i/td/tr
/table
ttData.Encoding/tt provides native Haskell implementations for encoding and decoding of many common character sets: ASCII, UTF8, UTF16, UTF32, ISO8859[1-16],
CP125[0-8], KOI8R, and GB18030, as well as BootString (for a href="http://www.ietf.org/rfc/rfc3492.txt"Punycode/a). For each of these, it implements an ttEncoding/tt interface:
pre
{- | Represents an encoding, supporting various methods of de- and encoding.
     Minimal complete definition: encode, decode
-}
class Encoding enc where
    -- | Encode a 'String' into a strict 'ByteString'. Throws the
    --   'HasNoRepresentation'-Exception if it encounters an unrepresentable
    --   character.
    encode :: enc -> String -> ByteString
    -- | Encode a 'String' into a lazy 'Data.ByteString.Lazy.ByteString'.
    encodeLazy :: enc -> String -> LBS.ByteString
    encodeLazy e str = LBS.fromChunks [encode e str]
    -- | Whether or not the given 'Char' is representable in this encoding. Default: 'True'.
    encodable :: enc -> Char -> Bool
    encodable _ _ = True
    -- | Decode a strict 'ByteString' into a 'String'. If the string is not
    --   decodable, a 'DecodingException' is thrown.
    decode :: enc -> ByteString -> String
    decodeLazy :: enc -> LBS.ByteString -> String
    decodeLazy e str = concatMap (decode e) (LBS.toChunks str)
    -- | Whether or not a given 'ByteString' is decodable. Default: 'True'.
    decodable :: enc -> ByteString -> Bool
    decodable _ _ = True
/pre
Notice that this interface provides exceptions for handling unrepresentable characters.
Instances of ttEncoding/tt can be found by importing charset-specific modules; each simply exports a value with the same name as the module, ie. ttData.Encoding.ISO88592/tt exports ttISO88592/tt, which is an instance of ttEncoding/tt. Here is a "Hello World" CGI in Polish, using ISO-8859-2:
pre
import Data.Encoding
import Data.Encoding.ISO88592
import Data.ByteString.Char8 (unpack)
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=iso-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode ISO88592 "Cześć")
/pre
You'll notice the call to ttunpack/tt, which converts the ttByteString/tt into a plain ttString/tt as expected by ttHtml/tt.
The iencoding/i library also provides a way to select an encoding by name:
pre
-- | Takes the name of an encoding and creates a dynamic encoding from it.
encodingFromString :: String -> DynEncoding
/pre
(Anything which is a DynEncoding is by definition an instance of Encoding). So we could choose the encoding at runtime, or we can just be lazy and pick encodings by name. If we do this, we don't need to import the charset-specific module, and we can replace the last line of our CGI with:
pre
    let enc = encodingFromString "ISO-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode enc "Cześć")
/pre
The iencoding/i library also provides a pair of functions for converting character sets directly between two ByteStrings:
pre
-- | This decodes a string from one encoding and encodes it into another.
recode :: (Encoding from, Encoding to) => from -> to -> ByteString -> ByteString
recodeLazy :: (Encoding from, Encoding to) => from -> to -> Lazy.ByteString -> Lazy.ByteString
/pre
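Going by its documentation comment, ttrecode/tt is a decode followed by an encode. The following self-contained sketch shows that shape with a toy class. Everything in it (ttCodec/tt, ttLatin1/tt, ttAscii/tt, ttrecodeWith/tt) is a hypothetical stand-in written for this example, not part of the iencoding/i library; it works on lists of bytes rather than ByteStrings, and it substitutes a question mark for unrepresentable characters instead of throwing an exception.

```haskell
module Main where

import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical, simplified stand-in for an encoding class.
class Codec enc where
  encodeWith :: enc -> String -> [Word8]
  decodeWith :: enc -> [Word8] -> String

data Latin1 = Latin1
data Ascii = Ascii

-- | The byte for '?', used in place of unrepresentable characters.
questionMark :: Word8
questionMark = 63

instance Codec Latin1 where
  encodeWith _ = map (\c -> if ord c < 256 then fromIntegral (ord c) else questionMark)
  decodeWith _ = map (chr . fromIntegral)

instance Codec Ascii where
  encodeWith _ = map (\c -> if ord c < 128 then fromIntegral (ord c) else questionMark)
  decodeWith _ = map (\b -> if b < 128 then chr (fromIntegral b) else '?')

-- | Decode with the source codec, then encode with the target codec.
recodeWith :: (Codec from, Codec to) => from -> to -> [Word8] -> [Word8]
recodeWith from to = encodeWith to . decodeWith from

main :: IO ()
main = print (recodeWith Latin1 Ascii [67, 97, 102, 233]) -- [67,97,102,63]
```

The intermediate ttString/tt of code points is what makes any pair of codecs composable: each codec only needs to know how to get to and from Unicode.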
The ttSystem.IO.Encoding/tt module does not try to provide as many convenience functions as the similar module provided by iutf8-string/i, providing only the generic tthGetContents/tt and tthPutStr/tt. However, it does provide a way of retrieving the current system's default encoding (when used on systems supporting POSIX.1-2001 ttnl_langinfo()/tt), which iutf8-string/i lacks.
pre
-- | Like the normal 'System.IO.hGetContents', but decodes the input using an
--   encoding.
hGetContents :: Encoding e => e -> Handle -> IO String

-- | Like the normal 'System.IO.hPutStr', but encodes the output using an
--   encoding.
hPutStr :: Encoding e => e -> Handle -> String -> IO ()

-- | Returns the encoding used on the current system.
getSystemEncoding :: IO DynEncoding
/pre
As this library is native Haskell it is portable, and as it uses lazy ByteStrings it can be fast. While it does not (yet) provide as many character sets as your system's tticonv()/tt, it does support many of the most commonly used ones.
h2Notes/h2
The libraries surveyed here are under fairly active maintenance, and there are rumours of unifying their implementations. Nevertheless the existing interfaces are fairly similar where common functionality exists.
strike
Historically, all serialized data was handled in Haskell as Strings, and there was a legitimate concern that transparently converting the character set of arbitrary Strings could mangle data.
The newer ByteString and Binary interfaces may allow future Haskell standards to clearly disambiguate binary and textual data, and simply serialize Strings as UTF-8 by default.
/strike
Although it might be nice to "simply" serialize Strings as UTF-8, ttshow/tt is the wrong place to do it. Haskell's ttRead/Show/tt serialization serializes to ttString/tt, which is a list of ttChar/tt, ie. a list of abstract Unicode code points. Character set conversion should rather happen on conversion to tt[Word8]/tt, at which point byte values become significant. This also encompasses direct conversions to ttByteString/tt, and the internals of primitive IO functions such as:
pre
putChar :: Char -> IO ()
putChar = primPutChar

getChar :: IO Char
getChar = primGetChar
/pre
, ttgetContents/tt, ttreadFile/tt, ttwriteFile/tt, and ttappendFile/tt defined in the a href="http://www.haskell.org/onlinereport/standard-prelude.html"Haskell Prelude/a, and the various character IO functions on ttHandle/tts defined in a href="http://www.haskell.org/ghc/docs/latest/html/libraries/base-3.0.0.0/System-IO.html"System.IO/a.
Whether or not this conversion can be done everywhere transparently, and backwards-compatibly, is an open issue for Haskell Prime. Meanwhile these libraries provide useful interfaces for explicit tt[Word8]/tt and ttByteString/tt conversion, and various IO wrappers.
h2Summary/h2
Although all Haskell Strings are Unicode, Haskell98 does not specify a character set representation for their IO. Unicode strings can be written directly into Haskell source files and hence exist as data within a program, but character set conversion is required if you wish to read or write these Strings in files, user input or on the network.
We looked at ways of dealing with Unicode in Haskell, surveyed some useful libraries and provided working examples. Although we might hope that a future version of Haskell will provide a way to handle UTF-8 conversions, in the meantime we need to choose an appropriate library for each project that handles Unicode text.
h2Updates/h2
bFri Nov 16/b: Edited to incorporate some feedback from #haskell:
ul
liThanks to Tim Newsham for clarifying GHC's default a href="http://hpaste.org/3908"character encoding/a when printing Strings./li
liThanks to Stefan N. O'Rear for pointing out that Show/Read is not the right place for serialization, but that it should instead occur on conversion to/from tt[Word8]/tt./li
/ul