Wednesday, June 29, 2011

TF-IDF

A few days ago, someone implemented a consice TF-IDF ("term frequency - inverse document frequency") search engine in Scala. That was followed by a similarly concise version in Clojure. I thought to contribute a simple (hopefully also concise) implementation using Factor.

USING: accessors arrays assocs combinators.short-circuit fry
io.encodings.utf8 io.files kernel math math.functions
math.statistics memoize sequences sets sorting splitting
unicode.case unicode.categories ;

IN: tf-idf

Tokenize

Since our inputs are going to be text files, our program will need to:

  1. Read the file into a string.
  2. Split the string into words.
  3. Eliminate common (or "stop") words.

Fist, let's build a word to split our text on any non-letter or non-number characters.

: split-words ( string -- words )
    [ { [ Letter? ] [ digit? ] } 1|| not ] split-when harvest ;

We load a list of "stop words" that are to be ignored. These are typically common occurring words such as "and, in, or, of, the, is".

MEMO: stopwords ( -- words )
    "vocab:tf-idf/stopwords.txt" utf8 file-lines fast-set ;

We can then tokenize a piece of text, removing all of the "stop words".

: tokenize ( string -- words )
    >lower split-words [ stopwords in? not ] filter ;

And, finally, we can create a word to tokenize a series of files into an assoc mapping path to words.

: tokenize-files ( paths -- assoc )
    [ dup utf8 file-contents tokenize ] H{ } map>assoc ;

Index

To implement our search engine, we need to build an index that maps each word to list of (path, count) pairs.

: index1 ( path words -- path index )
    histogram [ pick swap 2array ] assoc-map ;

: index-all ( assoc -- index )
    [ index1 ] assoc-map values assoc-merge ;

The tokenized files and our index will form the basis for a "database":

TUPLE: db docs index ;

: <db> ( paths -- db )
    tokenize-files dup index-all db boa ;

TF-IDF

The "inverse document frequency" is calculated by the log of the total number of documents divided by number of documents where a particular word appears.

: idf ( term db -- idf )
    [ nip docs>> ] [ index>> at ] 2bi [ assoc-size ] bi@ / log ;

Using this, we can create a "TF-IDF" scores by multiplying the number of times a term appears in each document by the previously calculated "inverse document frequency":

: tf-idf ( term db -- scores )
    [ index>> at ] [ idf ] 2bi '[ _ * ] assoc-map ;

Search

Our search engine is now just a matter of:

  • Splitting an input query into terms.
  • Calculating the "TF-IDF" score for each term.
  • Normalizing the scores across documents.
  • Sorting the scores from highest to lowest.
: scores ( query db -- scores )
    [ split-words ] dip '[ _ tf-idf ] map assoc-merge ;

: (normalize) ( path db -- value )
    [ docs>> at ] keep '[ _ idf 2 ^ ] map-sum sqrt ;

: normalize ( scores db -- scores' )
    '[ sum over _ (normalize) / ] assoc-map ;

: search ( query db -- scores )
    [ scores ] keep normalize sort-values reverse ;

Notes

The implementation above uses an assoc-merge utility word that performs a "union", for each key collecting a list of all values.
: (assoc-merge) ( assoc1 assoc2 -- assoc1 )
    over [ push-at ] with-assoc assoc-each ;

: assoc-merge ( seq -- merge )
    H{ } clone [ (assoc-merge) ] reduce ;
Also, searching for words that don't exist produces a "divide by zero" error. Making it more robust requires some changes to either catch and ignore that error or to "add 1" to the denominator of the "IDF" formula.

The code for this is on my Github.

Friday, June 10, 2011

Yahoo! Finance

Many programmers work for publicly traded companies, and probably spend more time than they realize watching their (or their friends) company's stock prices. To make that easier, I wanted to implement a simple wrapper to retrieve prices from Yahoo! Finance.

Quotes

You can use a "quotes.csv" interface to retrieve current price information. The way it works is to perform an HTTP request to a specially formatted URL that looks like:

http://finance.yahoo.com/d/quotes.csv?s=SYMBOLS&f=FIELDS

In the URL, SYMBOLS is a list of symbols separated by "+" and FIELDS is a list of letters and numbers (from the following table) representing fields to be requested.


a Ask a2 Average Daily Volume a5 Ask Size
b Bid b2 Ask (Real-time) b3 Bid (Real-time)
b4 Book Value b6 Bid Size c Change & Percent Change
c1 Change c3 Commission c6 Change (Real-time)
c8 After Hours Change (Real-time) d Dividend/Share d1 Last Trade Date
d2 Trade Date e Earnings/Share e1 Error Indication (returned for symbol changed / invalid)
e7 EPS Estimate Current Year e8 EPS Estimate Next Year e9 EPS Estimate Next Quarter
f6 Float Shares g Day’s Low h Day’s High
j 52-week Low k 52-week High g1 Holdings Gain Percent
g3 Annualized Gain g4 Holdings Gain g5 Holdings Gain Percent (Real-time)
g6 Holdings Gain (Real-time) i More Info i5 Order Book (Real-time)
j1 Market Capitalization j3 Market Cap (Real-time) j4 EBITDA
j5 Change From 52-week Low j6 Percent Change From 52-week Low k1 Last Trade (Real-time) With Time
k2 Change Percent (Real-time) k3 Last Trade Size k4 Change From 52-week High
k5 Percent Change From 52-week High l Last Trade (With Time) l1 Last Trade (Price Only)
l2 High Limit l3 Low Limit m Day’s Range
m2 Day’s Range (Real-time) m3 50-day Moving Average m4 200-day Moving Average
m5 Change From 200-day Moving Average m6 Percent Change From 200-day Moving Average m7 Change From 50-day Moving Average
m8 Percent Change From 50-day Moving Average n Name n4 Notes
o Open p Previous Close p1 Price Paid
p2 Change in Percent p5 Price/Sales p6 Price/Book
q Ex-Dividend Date r P/E Ratio r1 Dividend Pay Date
r2 P/E Ratio (Real-time) r5 PEG Ratio r6 Price/EPS Estimate Current Year
r7 Price/EPS Estimate Next Year s Symbol s1 Shares Owned
s7 Short Ratio t1 Last Trade Time t6 Trade Links
t7 Ticker Trend t8 1 yr Target Price v Volume
v1 Holdings Value v7 Holdings Value (Real-time) w 52-week Range
w1 Day’s Value Change w4 Day’s Value Change (Real-time) x Stock Exchange
y Dividend Yield

Using this URL format, we can build a word that retrieves current quotes for a list of symbols:

: quotes ( symbols -- csv )
    "http://finance.yahoo.com/d/quotes.csv" >url
        swap "+" join "s" set-query-param
        "sbal1v" "f" set-query-param
    http-get nip >string string>csv
    { "Symbol" "Bid" "Ask" "Last" "Volume" } prefix ;

With the strings.table vocabulary, we can format the response as a table:

( scratchpad ) { "MSFT" "GOOG" "AAPL" } quotes
               format-table [ print ] each
Symbol Bid    Ask    Last    Volume
MSFT   23.64  23.69  23.705  49327104
GOOG   505.00 509.05 509.505 2440475
AAPL   N/A    334.80 325.90  15504200

Historical Prices

You can also retrieve historical prices using a "table.csv" interface. Similar to retrieving quotes, you make an HTTP request to a special URL:

http://ichart.finance.yahoo.com/table.csv?s=SYMBOL

In the URL, SYMBOL is the symbol that you are requesting prices for, and you can further limit the response using additional parameters:

Start date for historical prices:

  • a - Month number, starting with 0 for January.
  • b - Day number, eg, 1 for the first of the month.
  • c - Year.

End date for historical prices (default is the most current available closing price):

  • d - Month number, starting with 0 for January.
  • e - Day number, eg, 1 for the first of the month.
  • f - Year.

And finally, the frequency of historical prices:

  • g - Possible values are 'd' for daily (the default), 'w' for weekly, and 'm' for monthly.

With this knowledge, we can build a word to retrieve historical prices from January 1, 2009 until the current day.

: historical-prices ( symbol -- csv )
    "http://ichart.finance.yahoo.com/table.csv" >url
       swap "s" set-query-param
       "0" "a" set-query-param
       "1" "b" set-query-param
       "2009" "c" set-query-param
    http-get nip string>csv ;

To use it, and demonstrate the impact that AAPL has had over the last few years, we can chart the daily closing prices (remembering to reverse the order of the prices, oldest to newest):


The code for this is on my Github.