Monday, September 23, 2013

YouTube Downloader

YouTube needs no introduction - it hosts many videos and has a virtual firehose of new content uploaded every second of every day. The primary interface is through a web browser with no easy way to download content for offline viewing. Naturally, "YouTube downloader" plugins have become a popular addition to most web browsers.

I thought it would be fun to implement a basic "YouTube downloader" in Factor.

First, we obtain information about a video by requesting YouTube's get_video_info endpoint and parsing the response as a query string:

CONSTANT: video-info-url URL" http://www.youtube.com/get_video_info"

: get-video-info ( video-id -- video-info )
    video-info-url clone
        3 "asv" set-query-param
        "detailpage" "el" set-query-param
        "en_US" "hl" set-query-param
        swap "video_id" set-query-param
    http-get nip query>assoc ;

Next, we can get a list of the available video formats:

: video-formats ( video-info -- video-formats )
    "url_encoded_fmt_stream_map" of
    "," split [ query>assoc ] map ;
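As a rough illustration of why the response ends up being parsed twice, here is a small sketch using a made-up stream map. The example.com URLs and the signature values are hypothetical placeholders, not real YouTube data, and example-stream-map is a name invented for this snippet. Each comma-separated entry is itself a query string, and query>assoc percent-decodes its values:

```factor
USING: assocs sequences splitting tools.test urls.encoding ;

! Hypothetical stand-in for a "url_encoded_fmt_stream_map" value;
! real entries carry more keys (such as "type").
CONSTANT: example-stream-map
    "url=http%3A%2F%2Fexample.com%2Fa&sig=AAA,url=http%3A%2F%2Fexample.com%2Fb&sig=BBB"

! Split into entries, parse each one, then pull out a single key.
{ { "http://example.com/a" "http://example.com/b" } } [
    example-stream-map "," split
    [ query>assoc "url" of ] map
] unit-test
```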

A particular video format includes a download URL and a signature that needs to be attached to it to successfully download the video:

: video-download-url ( video-format -- url )
    [ "url" of ] [ "sig" of ] bi "&signature=" glue ;
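To make the glue step concrete, here is a minimal sketch that runs the word on a hand-built format entry. The URL and signature are invented placeholders, and the definition is repeated only so the snippet stands alone:

```factor
USING: assocs kernel sequences tools.test ;

! Same definition as above, repeated so the snippet stands alone.
: video-download-url ( video-format -- url )
    [ "url" of ] [ "sig" of ] bi "&signature=" glue ;

! A made-up format entry; the URL and signature are placeholders.
{ "http://example.com/video?id=1&signature=ABC123" } [
    H{ { "url" "http://example.com/video?id=1" } { "sig" "ABC123" } }
    video-download-url
] unit-test
```

Because the signature is joined on with "&", this relies on the download URL already containing a query string.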

We are going to use the video title to create a filename, but first we want to sanitize it: remove unprintable characters, remove a few characters that might conflict with your filesystem, and truncate it to no more than 200 characters:

: sanitize ( title -- title' )
    [ 0 31 between? not ] filter
    [ "\"#$%'*,./:;<>?^|~\\" member? not ] filter
    200 short head ;
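A couple of quick checks, written as unit tests, show what this does to a title. The titles here are invented for the example, and the definition is repeated so the snippet can be pasted on its own:

```factor
USING: kernel math.order sequences strings tools.test ;

! Same definition as above, repeated so the snippet stands alone.
: sanitize ( title -- title' )
    [ 0 31 between? not ] filter
    [ "\"#$%'*,./:;<>?^|~\\" member? not ] filter
    200 short head ;

! Punctuation from the blocked list is dropped; spaces are kept.
{ "Factor The Movie" } [ "Factor: The Movie?" sanitize ] unit-test

! A 250-character title is cut down to 200 characters.
{ 200 } [ 250 CHAR: a <string> sanitize length ] unit-test
```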

Finally, to download a video, we look up its video info, find the first mp4 formatted video, convert it to a download URL, and then use download-to to save it to a file:

: download-video ( video-id -- )
    get-video-info [
        video-formats [ "type" of "video/mp4" head? ] find nip
        video-download-url
    ] [
        "title" of sanitize ".mp4" append download-to
    ] bi ;

You can choose a directory to download it to using the with-directory word. For example, downloading a video to your home directory:

IN: scratchpad "~" [ "G8LC8ES6ogw" download-video ] with-directory

The code for this is on my GitHub.

Wednesday, September 18, 2013

ZeroMQ

ZeroMQ is a "high-level" socket library that provides some middleware capabilities and markets itself as "The Simplest Way to Connect Pieces".

About a year ago, Eungju Park created a binding that was usable from Factor. He was gracious enough to allow us to merge his vocabulary into the main repository. In the process, I cleaned up the API a little bit and made sure it works with the latest version of ZeroMQ (currently 3.2.3).

As a quick tease, I thought I would show off a simple "echo" example.

echo

Our "echo" server will bind to port 5000. It will loop forever, sending any messages that it receives back to the client:

: echo-server ( -- )
    [
        <zmq-context> &dispose
        ZMQ_REP <zmq-socket> &dispose
        dup "tcp://127.0.0.1:5000" zmq-bind
        [ dup 0 zmq-recv dupd 0 zmq-send t ] loop
        drop
    ] with-destructors ;

Our "echo" client will connect to the server on port 5000. Each second, we grab the current time (by calling now), format it as a string, and send it to the server, printing both the message being sent and the message that was received:

: echo-client ( -- )
    [
        <zmq-context> &dispose
        ZMQ_REQ <zmq-socket> &dispose
        dup "tcp://127.0.0.1:5000" zmq-connect
        [
            now present
            [ "Sending " write print flush ]
            [ >byte-array dupd 0 zmq-send ] bi
            dup 0 zmq-recv >string
            "Received " write print flush
            1 seconds sleep
            t
        ] loop
        drop
    ] with-destructors ;

Perhaps not as simple as a pure Factor echo server, but all the nuts and bolts are working. Maybe in the future we could write a wrapper for a wrapper to simplify using ZeroMQ in common use-cases.

The zeromq vocabulary and several other examples of its use are available in the latest development version of Factor.

Note: the calls into ZeroMQ are blocking in such a way that we have to run the echo-server in one Factor process and the echo-client in another Factor process. It would be nice to integrate it in such a way that this is not necessary.

Wednesday, September 4, 2013

Verbal Expressions

Recently, thechangelog.com wrote a blog post asking programmers to stop writing regular expressions and begin using "verbal expressions".

Instead of writing this regular expression to parse a URL:

^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$

You could be writing this verbal expression (in JavaScript):

var tester = VerEx()
            .startOfLine()
            .then( "http" )
            .maybe( "s" )
            .then( "://" )
            .maybe( "www." )
            .anythingBut( " " )
            .endOfLine();

I'm not sure the second is more readable than the first once you get comfortable with regular expressions, but it could be useful in some circumstances, particularly to programmers who aren't as familiar with the esoteric syntax that is frequently required when matching text.

These "verbal" expressions seem to have become popular, with a GitHub organization listing implementations in 19 languages so far.

With that in mind, let's make a Factor implementation!

Basics

We need to create an object that will keep our state as we build our regular expression, holding a prefix and suffix that surround a source string as well as any modifiers that are requested (like case-insensitivity):

TUPLE: verbexp prefix source suffix modifiers ;

: <verbexp> ( -- verbexp )
    "" "" "" "" verbexp boa ; inline

Making a regular expression is as simple as combining the prefix, source, and suffix, and creating a regular expression with the requested modifiers:

: >regexp ( verbexp -- regexp )
    [ [ prefix>> ] [ source>> ] [ suffix>> ] tri 3append ]
    [ modifiers>> ] bi <optioned-regexp> ; inline

For convenience, we could have a combinator that creates the verbal expression, calls a quotation with it on the stack, then converts it to a regular expression:

: build-regexp ( quot: ( verbexp -- verbexp ) -- regexp )
    '[ <verbexp> @ >regexp ] call ; inline

When we want to add to our expression, we just append it to the source:

: add ( verbexp str -- verbexp )
    '[ _ append ] change-source ;

Anything that is not a letter or a digit can be escaped with a backslash:

: re-escape ( str -- str' )
    [
        [
            dup { [ Letter? ] [ digit? ] } 1||
            [ CHAR: \ , ] unless ,
        ] each
    ] "" make ;
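To see the escaping in action, here is a small sketch with two of the strings used later in the URL example. The definition is repeated so the snippet stands alone, and the USING: line assumes a recent Factor release where Letter? and digit? are found in the unicode vocabulary:

```factor
USING: combinators.short-circuit kernel make sequences
tools.test unicode ;

! Assumes a recent Factor, where Letter? and digit? live in the
! unicode vocabulary (older releases used unicode.categories).
: re-escape ( str -- str' )
    [
        [
            dup { [ Letter? ] [ digit? ] } 1||
            [ CHAR: \ , ] unless ,
        ] each
    ] "" make ;

! Letters pass through; the dot gets a backslash in front of it.
{ "www\\." } [ "www." re-escape ] unit-test

! Every character here is punctuation, so each one is escaped.
{ "\\:\\/\\/" } [ "://" re-escape ] unit-test
```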

Methods

We can specify "anything" or "anything but":

: anything ( verbexp -- verbexp )
    "(?:.*)" add ;

: anything-but ( verbexp value -- verbexp )
    re-escape "(?:[^" "]*)" surround add ;

We can specify "something" and "something but":

: something ( verbexp -- verbexp )
    "(?:.+)" add ;

: something-but ( verbexp value -- verbexp )
    re-escape "(?:[^" "]+)" surround add ;

We can specify looking for "start of line" or "end of line":

: start-of-line ( verbexp -- verbexp )
    [ "^" append ] change-prefix ;

: end-of-line ( verbexp -- verbexp )
    [ "$" append ] change-suffix ;

We can specify a value ("then"), or an optional value ("maybe"):

: then ( verbexp value -- verbexp )
    re-escape "(?:" ")" surround add ;

: maybe ( verbexp value -- verbexp )
    re-escape "(?:" ")?" surround add ;

We could specify "any of" a set of characters:

: any-of ( verbexp value -- verbexp )
    re-escape "(?:[" "])" surround add ;

Or, maybe simply a line break, tab, word, or space:

: line-break ( verbexp -- verbexp )
    "(?:(?:\\n)|(?:\\r\\n))" add ;

: tab ( verbexp -- verbexp ) "\\t" add ;

: word ( verbexp -- verbexp ) "\\w+" add ;

: space ( verbexp -- verbexp ) "\\s" add ;

Perhaps many of whatever has been specified so far:

: many ( verbexp -- verbexp )
    [
        dup ?last "*+" member? [ "+" append ] unless
    ] change-source ;
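The string manipulation inside many can be tried on its own. In this sketch, ensure-repeated is a hypothetical helper name (not part of the vocabulary) that holds the same quotation body so it can be run on plain strings:

```factor
USING: kernel sequences tools.test ;

! Hypothetical helper: the quotation that many passes to
! change-source, lifted out so it can run on a plain string.
: ensure-repeated ( source -- source' )
    dup ?last "*+" member? [ "+" append ] unless ;

! A "+" is appended when the source does not already end in one.
{ "(?:ab)+" } [ "(?:ab)" ensure-repeated ] unit-test

! A source that already ends in "+" or "*" is left alone.
{ "\\w+" } [ "\\w+" ensure-repeated ] unit-test
{ "(?:.*)" } [ "(?:.*)" ensure-repeated ] unit-test
```

Note that in the resulting regular expression the appended "+" binds to the last group or character in the source, not to everything built up so far.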

Modifiers

Some helper words allow us to easily add and remove modifiers:

: add-modifier ( verbexp ch -- verbexp )
    '[ _ suffix ] change-modifiers ;

: remove-modifier ( verbexp ch -- verbexp )
    '[ _ swap remove ] change-modifiers ;

Should we be case-insensitive or not:

: case-insensitive ( verbexp -- verbexp )
    CHAR: i add-modifier ;

: case-sensitive ( verbexp -- verbexp )
    CHAR: i remove-modifier ;

Should we search across multiple lines or not:

: multiline ( verbexp -- verbexp )
    CHAR: m add-modifier ;

: singleline ( verbexp -- verbexp )
    CHAR: m remove-modifier ;
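The modifiers end up as the options string handed to <optioned-regexp>, so their effect can be checked without the verbexp tuple at all. A minimal sketch, using a throwaway "http" pattern chosen for this example:

```factor
USING: regexp tools.test ;

! With an empty options string, matching is case-sensitive.
{ f } [ "HTTP" "http" "" <optioned-regexp> matches? ] unit-test

! "i" is the options string produced after calling case-insensitive.
{ t } [ "HTTP" "http" "i" <optioned-regexp> matches? ] unit-test
```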

Testing

We can try out our original example using the unit test framework to show that it works:

{ t } [
    "https://www.google.com" [
        start-of-line
        "http" then
        "s" maybe
        "://" then
        "www." maybe
        " " anything-but
        end-of-line
    ] build-regexp matches?
] unit-test
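For comparison, the words used in that test build up the same string as the hand-written regular expression from the start of this post. The sketch below uses that pattern directly and adds a few inputs of my own choosing (the example.com URLs and the url-pattern name are made up for this snippet) to show what should and should not match:

```factor
USING: regexp tools.test ;

! The pattern from the start of the post, written as a Factor
! string (each backslash is doubled inside the string literal).
CONSTANT: url-pattern
    "^(?:http)(?:s)?(?:\\:\\/\\/)(?:www\\.)?(?:[^\\ ]*)$"

{ t } [ "https://www.google.com" url-pattern <regexp> matches? ] unit-test
{ t } [ "http://example.com" url-pattern <regexp> matches? ] unit-test

! A different scheme, or a space anywhere in the rest, is rejected.
{ f } [ "ftp://example.com" url-pattern <regexp> matches? ] unit-test
{ f } [ "https://www.google.com/a b" url-pattern <regexp> matches? ] unit-test
```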

I'm not convinced this is an improvement. In the current specification for "verbal" expressions, the language for expressing characteristics to match against is relatively limited. Perhaps with some effort, this could evolve into a more capable (but still readable) syntax.

In any event, the code for this is on my GitHub.