Wednesday, August 4, 2010

Text-or-Binary?

Sometimes it is useful to be able to tell whether a file should be treated as text or as binary data. Rather than rely on the file extension (which might be missing or wrong), Subversion uses a simple heuristic based on the file contents:

Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary.

Someone implemented this in a library written in Clojure. Here's my take, but in Factor.

Some vocabularies we will use, and a namespace:

USING: io io.encodings.binary io.files kernel math sequences ;

IN: text-or-binary

Checking if any of the bytes are zero:

: includes-zeros? ( seq -- ? )
    0 swap member? ;

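The first 32 characters of ASCII (i.e., 0-31) are reserved for non-printing control characters. The check below treats any byte above 31 as printable, so tabs and newlines count as control characters, while DEL (127) and bytes above 127 count as printable. Checking that a large majority (over 85%) of the characters are printable (and assuming an empty sequence is printable):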

: majority-printable? ( seq -- ? )
    [ t ] [ 
        [ [ 31 > ] count ] [ length ] bi / 0.85 >
    ] if-empty ;
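
To get a feel for where the 0.85 cutoff falls, it can help to look at the ratio on its own. The sketch below is only an illustration: `printable-ratio` and `ascii-printable-ratio` are hypothetical helpers, not part of the code above, and both assume a non-empty sequence. The second one is a stricter reading of "ASCII printing characters" that counts only 32-126.

```factor
USING: kernel math math.order prettyprint sequences ;
IN: scratchpad

! Hypothetical helper: fraction of elements greater than 31,
! the same ratio majority-printable? compares against 0.85.
! Assumes a non-empty sequence (an empty one divides by zero).
: printable-ratio ( seq -- ratio )
    [ [ 31 > ] count ] [ length ] bi / ;

! Hypothetical stricter variant: only 32-126 count as printable.
: ascii-printable-ratio ( seq -- ratio )
    [ [ 32 126 between? ] count ] [ length ] bi / ;

! "Hi" plus a newline: the newline (10) is not counted.
B{ 72 105 10 } printable-ratio .             ! prints 2/3

! "Hi" plus the two UTF-8 bytes of an accented e (195 169).
B{ 72 105 195 169 } printable-ratio .        ! prints 1
B{ 72 105 195 169 } ascii-printable-ratio .  ! prints 1/2
```

The first result shows that very short lines push the ratio down, since each newline counts against the file. The last two show why the looser test is friendlier to UTF-8 text than a strict ASCII range would be.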

Then, determining whether a sequence of bytes is text:

: text? ( seq -- ? )
    [ includes-zeros? not ] [ majority-printable? ] bi and ;
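
Since these words only need a sequence of numbers, they can be tried on in-memory data before touching any files. Here is a small self-contained sketch that repeats the three definitions in the scratchpad (the `IN: scratchpad` line and the sample inputs are my own choices for illustration):

```factor
USING: kernel math prettyprint sequences ;
IN: scratchpad

: includes-zeros? ( seq -- ? )
    0 swap member? ;

: majority-printable? ( seq -- ? )
    [ t ] [
        [ [ 31 > ] count ] [ length ] bi / 0.85 >
    ] if-empty ;

: text? ( seq -- ? )
    [ includes-zeros? not ] [ majority-printable? ] bi and ;

! A string is a sequence of code points, all above 31 here.
"Hello, world!" text? .            ! prints t

! Same kind of bytes, but with a trailing zero byte.
B{ 72 101 108 108 111 0 } text? .  ! prints f

! An empty sequence has no zeros and is assumed printable.
B{ } text? .                       ! prints t
```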

And implementing the operation to check if a file is text or binary:

: text-file? ( path -- ? )
    binary [ 1024 read text? ] with-file-reader ;

Using it is pretty easy:

( scratchpad ) "/usr/share/dict/words" text-file? .
t

( scratchpad ) "/bin/sh" text-file? .
f

The code for this (and some tests) is available on my GitHub.

3 comments:

sthsnthsn said...

There's an io.encodings.detect library already that does binary file detection using a trick similar to svn. You should incorporate your code into that and send it upstream. Nice work.

Joakim Hårsman said...

What if the text is in UTF-16 but in a script that's mostly ASCII compatible? You'll have plenty of null bytes then.

UTF-16 is a pretty common encoding for text files on Windows. Or maybe not common, but it's what you usually get if you save something as "Unicode".

mrjbq7 said...

@Joakim: That's exactly right, this was designed with ASCII in mind (also UTF-8). The io.encodings.detect vocabulary handles some Unicode cases. To do this right, it's probably best to have a less simple algorithm!