Wednesday, December 30, 2009

Counting Word Frequencies

One of my favorite memes right now is for people to write small programs in various languages to compare and contrast and learn from one another.

Recently, someone blogged an example of counting word frequencies in a collection of newsgroup articles. The initial implementation was written in Ruby and Scala, with someone else implementing a Clojure solution. These solutions are compared for lines of code as well as time to run.

The basic idea is to iterate over a bunch of files, where each represents a newsgroup posting and is organized by newsgroup into directories. Split each file into words, and then increment a count per word found (comparing case-insensitively).

First, we need a test that is equivalent to the "\w" regular expression. In ASCII, this is essentially a-zA-Z0-9_. We use a short-circuit combinator to break early if one of the tests succeeds.

: \w? ( ch -- ? )
    { [ Letter? ] [ digit? ] [ CHAR: _ = ] } 1|| ; inline

We can then build a word to split a sequence of characters.

: split-words ( seq -- seq' )
    [ \w? not ] split-when harvest ;

This now leaves the main task of counting and aggregating the word counts. The Ruby and Scala examples do this sequentially, but the Clojure example tries to do things in parallel. We are going to keep it simple, and do things sequentially.

: count-words ( path -- assoc )
    f recursive-directory-files H{ } clone [
        '[
            ascii file-contents >lower 
            split-words [ _ inc-at ] each
        ] each
    ] keep ;

We need a word to generate the desired output (tab-delimited words and counts).

: print-words ( seq -- )
    [ first2 "%s\t%d\n" printf ] each ;

And another word to do the actual writing of output to files.

: write-count ( assoc -- )
    >alist [
        [ "/tmp/counts-decreasing-factor" ascii ] dip
        '[ _ sort-values reverse print-words ] with-file-writer
    ] [
        [ "/tmp/counts-alphabetical-factor" ascii ] dip
        '[ _ sort-keys print-words ] with-file-writer
    ] bi ;

Unfortunately, performance isn't quite what I was hoping. I tested this on a 2.8 GHz MacBook Pro. Ruby (using 1.8.7) runs in roughly 41 seconds, Factor runs in 20 seconds, and Python (using 2.6.1) runs in 13 seconds.

I was sort of hoping Factor would come in under the typical scripting languages, and I'd love to get feedback on how to improve it.

For reference, the Python version that I wrote is:

import os
import re
import time
from collections import defaultdict
from operator import itemgetter

root = '/tmp/20_newsgroups'
#root = '/tmp/mini_newsgroups'

t0 = time.time()

counts = defaultdict(int)

for dirpath, dirname, filenames in os.walk(root):
    for filename in filenames:
        f = open(os.path.join(dirpath, filename))
        for word in re.findall('\w+', f.read()):
            counts[word.lower()] += 1
        f.close()

print "Writing counts in decreasing order"
f = open('counts-decreasing-python', 'w')
for k, v in sorted(counts.items(), key=itemgetter(1), reverse=True):
    print >> f, '%s\t%d' % (k, v)
f.close()

print "Writing counts in decreasing order"
f = open('counts-alphabetical-python', 'w')
for k, v in sorted(counts.items(), key=itemgetter(0)):
    print >> f, '%s\t%d' % (k, v)
f.close()

print 'Finished in %s seconds' % (time.time() - t0)

Tuesday, December 29, 2009

Recursively Listing Files

Update: It was pointed out to me that the recursive-directory-files word in io.directories.search solves this problem. Good to know, and the below article can be thought of as a learning exercise! :)

Factor has several vocabularies for interacting with the filesystem, and quite a lot of work has been done on making these useful. Some interesting blog posts from early 2009 include:

One feature that I haven't found yet, is a word to simply (and recursively) list files within a directory. This is often useful with the intent of processing some (or all) of the files in some way.

This feature exists in other language standard libraries with different names. In Python, this is called os.walk(). In Perl, this is called File::Find. Usually, even if it doesn't exist, it can be built relatively easily.

I'm going to show you how to create a word to do that with Factor.

Many words within the standard library are implemented by a public word that defers to a private word for part of the functionality. In this case, I want to separate the "setup" functionality of the word, from the "work" functionality.

Let's define a word list-files that will recursively list all the files within a directory. First, we need to create a growable sequence (in this case a vector) to hold the result, and then we want to call a word that will be used to do the actual work.

: list-files ( path -- seq )
    [ V{ } clone ] dip (list-files) ;

For each path that we process, we will want to handle differently depending on what type of file the path points to:

  1. If a symbolic link, we want to read the link and recurse into it.
  2. If a directory, we want to recurse into each of the files within it.
  3. If a regular file, we want to add it to the list of files.

Using some of the words in io.directories, io.files, io.files.info, io.files.links, and io.files.types, we can build such a function:

: (list-files) ( seq path -- seq )
    normalize-path dup link-info type>> {
        { +symbolic-link+ [ read-link (list-files) ] }
        { +directory+ [
            [ directory-files ] keep
            '[ normalize-path _ prepend (list-files) ] each ] }
        { +regular-file+ [ over push ] }
        [ "unsupported" throw ]
    } case ;

We can setup a directory structure like so:

$ tree -f /tmp/foo
/tmp/foo
|-- /tmp/foo/bar
`-- /tmp/foo/baz
    `-- /tmp/foo/baz/foo

1 directory, 2 files

And then use our new function from Factor:

( scratchpad ) "/tmp/foo" list-files .
V{ "/tmp/foo/bar" "/tmp/foo/baz/foo" }

This could be improved further by handling file permissions issues, infinite recursion, and lazily generating the list of files (for better performance with large directory trees).

Sunday, December 27, 2009

Generating Text in Factor

Some months back, I came across a few blog posts about generating text using algorithms. A simple algorithm was implemented in Clojure and Haskell.

Basically, the idea is to:

  1. Read in a text document.
  2. Count the frequency of word pairs in the document.
  3. Pick a starting word.
  4. Generate random text, using the word pair probabilities.

I wanted to see what a simple Factor implementation would look like, and thought I would share one below.

First, create a list of words from some lines of text. In Factor, it is easy to write a processing word that can then be used from standard input, files, or other input streams.

: split-words ( -- sequence )
    V{ } clone [
        "\r\n\t.,\" " split [ over push ] each
    ] each-line harvest ;

Next, create a sequence of all word pairs (including the pair linking the tail to the head of the list) from the input sequence.

: word-pairs ( sequence -- sequence' )
    dup 1 head-slice append
    dup 1 tail-slice zip ;

Next, we need a map of word pairs, for random sampling by frequency of occurrence. I have chosen to use words in the text as keys, and have the values be a sequence of all words that ever followed it in the text.

: word-map ( sequence -- assoc )
    [ H{ } clone ] dip word-pairs [
        [ second ] [ first ] bi pick push-at
    ] each ;

This makes sampling words with the proper probability quite simple:

: next-word ( word assoc -- word' )
    at random ;

We can now generate paragraphs of random text with the "feel" of the original document.

: wordgen-from ( start n -- string )
    [ [ 1vector ] keep split-words word-map ] dip
    [ [ next-word dup pick push ] keep ] times
    2drop " " join ;

And for convenience, as in the original blog posts, we can start generating from a common English word.

: wordgen ( n -- string )
    "the" swap wordgen-from ;

Putting all of this together, we can make 250 words from Dracula:

( scratchpad ) "/tmp/345.txt" ascii [ 250 wordgen ] with-file-reader .
"the bank notes of electronic works based on the same time to you are fading and it came I too much Without taking a quick he had struck by our eyes and she was not asleep twice when she sank down again There were striving to think yet when the Huns This to get to her all at her on it really Lucy's father Goodbye my own time when she look at the facts before them away and his diary which is out Look! Look! Look! The bright as ever since then we had said Dr Van Helsing will be a chair with you said Yes! And it was a pistol shot up to write for a woman can love and added Friend John and pain nor a long enough You were so seeing it high tide altogether I awoke She is confined within a room lit by the old sweet that the temptation of lion-like disdain His cries are great love must go there would tear my eyes the colour of him and quiet; but still Renfield sitting on one ready Madam Mina Harker had retired early Lucy and there ran downstairs then we get out his fair accuracy when I found that my dear do our visit to oppose him can look had a measure of the other since I secured and seemed to tell upon your all our furs and the face I put in his eyes never be This afternoon she was that such horrors when the"

Not quite Bram Stoker, but just enough to taste the flavor. More complex algorithms exist and can be used for fun and profit. Similar ideas can even be applied to other areas, such as music.

Wednesday, August 12, 2009

Calculating with EBNF

Factor has a syntax for EBNF parsing grammars, implemented in the peg.ebnf vocabulary.

Several useful vocabularies are partially implemented using EBNF syntax, including (among many) the formatting, globs, urls, regexp, brainfuck, javascript, and smalltalk vocabularies.

I thought I'd walk through the creation of a small macro for parsing "calculator" expressions, as a small demonstration of their utility.

First, the goal. We'd like to be able to parse, and compute, the following:

( scratchpad ) "2+2" calc .
4

So, first we need to define a parsing grammar for numbers. Since these can be represented as an optional sign (for negative numbers), a whole number, and optional digits (for floating point numbers), we start with this structure:

sign   = "-"
whole  = ([0-9])*
digit  = "." ([0-9])*
number = sign? whole digit?

We would like the grammar to parse each component as a string, if specified, concatenate it together, and then convert the resulting string to a number, to be used by the rest of the grammar in calculations.

sign   = "-" 
=> [[ >string ]]

whole  = ([0-9])* 
=> [[ >string ]]

digit  = "." ([0-9])* 
=> [[ [ >string ] map concat ]]

number = sign? whole digit? 
=> [[ [ f eq? not ] filter concat string>number ]]

The calculator needs to support operations, so we start with the basic four: addition, subtraction, multiplication, and division. These map the characters "+", "-", "*", and "/" to the words implementing those functions in Factor.

add  = "+" => [[ [ + ] ]] 
sub  = "-" => [[ [ - ] ]] 
mul  = "*" => [[ [ * ] ]] 
div  = "/" => [[ [ / ] ]]

ops  = add|sub|mul|div

We define an expression to be two numbers combined by an operator, which changes infix to postfix notation, using the fry vocabulary to create a quotation that will perform the intended computation.

expr = number ops number
=> [[ [ first ] [ third ] [ second ] tri '[ _ _ @ ] ]]

Putting this all together, we have our calculator:

USING: fry kernel macros peg.ebnf 
math math.parser sequences strings ;

EBNF: parse-calc

sign   = "-"          => [[ >string ]]
whole  = ([0-9])*     => [[ >string ]]
digit  = "." ([0-9])* => [[ [ >string ] map concat ]]

number = sign? whole digit?  
=> [[ [ f eq? not ] filter concat string>number ]]

add  = "+"  => [[ [ + ] ]]         
sub  = "-"  => [[ [ - ] ]]         
mul  = "*"  => [[ [ * ] ]]         
div  = "/"  => [[ [ / ] ]]

ops  = add|sub|mul|div

expr = number ops number
=> [[ [ first ] [ third ] [ second ] tri '[ _ _ @ ] ]]

;EBNF

MACRO: calc ( string -- ) parse-calc ;

You can use the macros.expander vocabulary to see the code that is created when you compile the original example:

( scratchpad ) [ "2+2" calc ] expand-macros .
[ 2 2 [ + ] call ]

This model can be reasonably extended to support other operators (such as "mod", "shift", "pow", etc.), functions (such as "sin", "sqrt", etc.), spaces in the expressions, chained expressions, grouped expressions, input validation, compile-time calculations, and other useful features. But, then it wouldn't be so simple...

Friday, June 26, 2009

IPInfoDB

Providing location-based services is all the rage these days. On the newer mobile devices, there are services that use GPS and nearby cellular towers to improve your user experience with the knowledge of where you are at any moment.

In the wild internet, there hasn't been many convenient ways to do this, but a recent website called IPInfoDB (previously known as "IP Location Tools"), provides an API for mapping an IP address to a physical address. The API consists of a PHP page that returns an XML response that contains your location, or the location of an IP address that you provide. They also seem to support CSV and JSON responses.

For example, to lookup the location of an IP address used by google.com:

http://ipinfodb.com/ip_query.php?ip=74.125.45.100

Curious to see how easily it would be to use this API from Factor, I wrote a vocabulary for accessing IPInfoDB.

First, we define the characteristics of an ip-info tuple.

TUPLE: ip-info ip country-code country-name region-code
region-name city zip-code latitude longitude gmtoffset
dstoffset ;

The API returns XML response, which we need to convert into an ip-info object. For this, we use the excellent XML parsing vocabulary that has seen a lot of improvements recently.

: find-tag ( tag name -- value )
    deep-tag-named children>string ; inline

: xml>ip-info ( xml -- info )
    [ ip-info new ] dip
    {
        [ "Ip" find-tag >>ip ]
        [ "CountryCode" find-tag >>country-code ]
        [ "CountryName" find-tag >>country-name ]
        [ "RegionCode" find-tag >>region-code ]
        [ "RegionName" find-tag >>region-name ]
        [ "City" find-tag >>city ]
        [ "ZipPostalCode" find-tag >>zip-code ]
        [ "Latitude" find-tag >>latitude ]
        [ "Longitude" find-tag >>longitude ]
        [ "Gmtoffset" find-tag >>gmtoffset ]
        [ "Dstoffset" find-tag >>dstoffset ]
    } cleave ;

With that done, it is very easy to make an http request to retrieve one's own location:

: locate-my-ip ( -- info ) 
    "http://ipinfodb.com/ip_query.php"
    http-get string>xml xml>ip-info nip ;

Or determine the location of any arbitrary IP:

: locate-ip ( ip -- info )
    "http://ipinfodb.com/ip_query.php?ip=" prepend
    http-get string>xml xml>ip-info nip ;

The example I showed earlier would be:

( scratchpad ) "74.125.45.100" locate-ip .
T{ ip-info
    { ip "74.125.45.100" }
    { country-code "US" }
    { country-name "United States" }
    { region-code "06" }
    { region-name "California" }
    { city "Mountain View" }
    { zip-code "94043" }
    { latitude "37.4192" }
    { longitude "-122.057" }
    { gmtoffset "-8.0" }
    { dstoffset "-7.0" }
}

Presumably, if this code were to be used more frequently, it would need some error handling for when the http server is not available, or non-responsive.

Anyway, this vocabulary is available on my GitHub

Sunday, June 7, 2009

Brainf*ck

The Brainfuck programming language is a curious invention. Seemingly useful only for proving oneself as a True Geek at a party, it could also be useful for educational purposes.

The first programming example in any language is typically "Hello, world!". In Brainfuck, this could be written as:

++++++++++[>+++++++>++++++++++>+++>+<<<<-]
>++.>+.+++++++..+++.>++.<<+++++++++++++++
.>.+++.------.--------.>+.>.

For fun, I thought I would build a Brainfuck compiler for Factor.

( scratchpad ) USE: brainfuck

( scratchpad ) "
               ++++++++++[>+++++++>++++++++++>+++>+<<<<-]
               >++.>+.+++++++..+++.>++.<<+++++++++++++++
               .>.+++.------.--------.>+.>.
               " run-brainfuck
Hello World!

Behind the scene, the Brainfuck code is being compiled into proper Factor using a macro that parses the Brainfuck code string. When translated into Factor, the "Hello, world!" example becomes:

<brainfuck>
10 (+) [ (?) ] [
    1 (>) 7 (+) 1 (>) 10 (+) 1 (>) 3 (+)
    1 (>) 1 (+) 4 (<) 1 (-)
] while
1 (>) 2 (+) (.) 1 (>) 1 (+) (.)
7 (+) (.) (.) 3 (+) (.) 1 (>) 2 (+) (.)
2 (<) 15 (+) (.) 1 (>) (.) 3 (+)
(.) 6 (-) (.) 8 (-) (.) 1 (>) 1 (+) (.) 
1 (>) (.) drop flush

I made only a slight optimization, which you might notice above, to collapse a series of identical operators together into a single call to the operator word, while staying true to the original set of Brainfuck operators.

Some fun examples of Brainfuck in the brainfuck-tests.factor unit tests include addition, multiplication, division, uppercase, and a cat utility.

It is available on my Github, and hopefully will be pulled into the main repository soon.

Wednesday, February 25, 2009

cd factor

When working with Factor source code, you might find yourself moving between several directories on the filesystem, navigating and editing several files.

Due to source code being kept inside core, basis, extra, work, and other locations, it is not always a simple matter of moving up and down directories to find the files and vocabularies that you are looking for.

Today, I wrote a bit of bash script to simplify, and improve, this part of the development process.

# change directories to a factor module
cdfactor() {
    code=$(printf "USING: io io.pathnames vocabs vocabs.loader ; "
           printf "\"%s\" <vocab> vocab-source-path (normalize-path) print" $1)
    echo $code > $HOME/.cdfactor
    fn=$(factor $HOME/.cdfactor)
    dn=$(dirname $fn)
    echo $dn
    if [ -z "$dn" ]; then
        echo "Warning: directory '$1' not found" 1>&2
    else
        cd $dn
    fi
}

This function takes a vocabulary as argument. When run, it generates a temporary Factor script that is executed to find the pathname to the source file. It then changes directories into the directory containing the source file.

For example, if you want to switch to the io.files vocab, just run:

$ cdfactor io.files
Note: this requires factor to be on the $PATH.

Wednesday, February 18, 2009

Readline support for Factor

When using Factor from the command line, you might notice that it lacks many of the features of typical REPL's offered by other languages. For example, using arrow keys, home, end, delete, and backspace to navigate and edit text, and having proper history support for using previous lines of text.

These features are often implemented by using the GNU Readline library.

But, how do you use readline with Factor? The easiest way is to use the rlwrap command, which wraps an arbitrary command in readline.

To install it on Linux, run sudo yum install rlwrap (for RPM-based distros) or sudo apt-get install rlwrap (for Debian-based distros).

On the Mac, you can install it with the MacPorts system by running sudo port install rlwrap.

Once installed, add rlwrap in front of your factor command. For instance:

rlwrap ./factor

You should find interacting with Factor to be much easier.