Friday, October 15, 2010

Text-to-PDF

In this article, we will build, step by step, a program that converts text files into PDF. The PDF specification is at version 1.7 (approximately 750 pages plus some supplements) and is available for download from the Adobe website.

The entire solution listed below is approximately 140 lines of code, which compares favorably to a 600-line Python version and a 450-line C version.

First, we will need a list of imports and a namespace:

USING: assocs calendar combinators environment formatting
grouping io io.files kernel make math math.ranges sequences
splitting xml.entities ;

IN: text-to-pdf

PDF files are essentially text files that contain numbered objects with display instructions and a cross-reference table that allows random access to those objects. We will need a word to create an object from its contents and number (e.g., start with n 0 obj and end with endobj):

: pdf-object ( str n -- str' )
    "%d 0 obj\n" sprintf "\nendobj" surround ;

References (used to provide pointers to objects) are of the form n 0 R where n is the number of the object being referred to.
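
As a quick check of the object wrapper, here is a minimal sketch that assumes the words from this article are loaded as the text-to-pdf vocabulary. The pdf-ref word is hypothetical (it is not part of the code in this article) and only illustrates the n 0 R reference form:

```factor
USING: formatting io text-to-pdf ;
IN: scratchpad

! Hypothetical helper, not part of the article's code:
! format a reference to object number n.
: pdf-ref ( n -- str ) "%d 0 R" sprintf ;

! Wrap a small dictionary as object number 2.
"<< /Type /Catalog >>" 2 pdf-object print
! prints:
! 2 0 obj
! << /Type /Catalog >>
! endobj

4 pdf-ref print
! prints: 4 0 R
```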

The PDF standard supports various data types, one of which is the text string. These strings are surrounded by parentheses and require escaping to include certain characters:

: pdf-string ( str -- str' )
    H{
        { HEX: 08    "\\b"  }
        { HEX: 0c    "\\f"  }
        { CHAR: \n   "\\n"  }
        { CHAR: \r   "\\r"  }
        { CHAR: \t   "\\t"  }
        { CHAR: \\   "\\\\" }
        { CHAR: (    "\\("  }
        { CHAR: )    "\\)"  }
    } escape-string-by "(" ")" surround ;
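
For example, assuming the definitions above are loaded as the text-to-pdf vocabulary, a string that contains parentheses should come out with each one escaped:

```factor
USING: io text-to-pdf ;

! The inner parentheses are escaped, then the whole string is wrapped.
"Hello (world)" pdf-string print
! prints: (Hello \(world\))
```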

Hashtables of key/value properties are used frequently within the specification and are essentially space-separated key/value pairs surrounded by << and >>.

Our PDF file will start with an "info" section that contains a creation timestamp, author, and software details:

: pdf-info ( -- str )
    [
        "<<" ,
        ! PDF dates are strings, so the D: value is wrapped in parentheses
        "/CreationDate (D:" now "%Y%m%d%H%M%S" strftime ")" 3append ,
        "/Producer (Factor)" ,
        "/Author " "USER" os-env "unknown" or pdf-string append ,
        "/Creator (created with Factor)" ,
        ">>" ,
    ] { } make "\n" join ;

We will follow it with a catalog indicating which object (by reference) contains the list of pages.

: pdf-catalog ( -- str )
    {
        "<<"
        "/Type /Catalog"
        "/Pages 4 0 R"
        ">>"
    } "\n" join ;

As is typical of "text2pdf" programs, we will use the Courier monospace font. We can just refer to it by name since it is one of the fonts directly supported by the PDF specification.

: pdf-font ( -- str )
    {
        "<<"
        "/Type /Font"
        "/Subtype /Type1"
        "/BaseFont /Courier"
        ">>"
    } "\n" join ;

Each page is composed of two objects: a resource object and a contents object. We convert the number of pages into an index that specifies the canvas size (US Letter: 612 by 792 points, or 8.5 by 11 inches) and a list of pages (or "kids") given as references to each page's resource object:

: pdf-pages ( n -- str )
    [
        "<<" ,
        "/Type /Pages" ,
        "/MediaBox [ 0 0 612 792 ]" ,
        [ "/Count %d" sprintf , ]
        [
            5 swap 2 range boa
            [ "%d 0 R " sprintf ] map concat
            "/Kids [ " "]" surround ,
        ] bi
        ">>" ,
    ] { } make "\n" join ;
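
To see how the kids are numbered, here is a small sketch for a three-page document. It assumes the same Factor version as the rest of this article (where range lives in math.ranges and its slots are from, length, and step) and that the article's words are loaded as the text-to-pdf vocabulary:

```factor
USING: arrays io math.ranges prettyprint text-to-pdf ;

! Start at object 5, take 3 values, step by 2: one entry per page.
5 3 2 range boa >array .
! prints: { 5 7 9 }

3 pdf-pages print
! prints:
! <<
! /Type /Pages
! /MediaBox [ 0 0 612 792 ]
! /Count 3
! /Kids [ 5 0 R 7 0 R 9 0 R ]
! >>
```

The step of 2 leaves the even-numbered objects (6, 8, 10) free for each page's contents.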

The resources for each page n include a reference to the contents object (n + 1) and the font object:

: pdf-page ( n -- page )
    [
        "<<" ,
        "/Type /Page" ,
        "/Parent 4 0 R" ,
        1 + "/Contents %d 0 R" sprintf ,
        "/Resources << /Font << /F1 3 0 R >> >>" ,
        ">>" ,
    ] { } make "\n" join ;
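
As an illustration, and again assuming the text-to-pdf vocabulary is loaded, the first page (object 5) should point at object 6 for its contents:

```factor
USING: io text-to-pdf ;

5 pdf-page print
! prints:
! <<
! /Type /Page
! /Parent 4 0 R
! /Contents 6 0 R
! /Resources << /Font << /F1 3 0 R >> >>
! >>
```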

Pages are essentially objects which contain a stream of text operations. The stream is prefixed with the length of its contents:

: pdf-stream ( str -- str' )
    [ length 1 + "<<\n/Length %d\n>>" sprintf ]
    [ "\nstream\n" "\nendstream" surround ] bi append ;
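
A tiny sketch of what this produces, assuming the text-to-pdf vocabulary is loaded. The five-character input is just a placeholder stream body:

```factor
USING: io text-to-pdf ;

! "BT\nET" is 5 characters; the word adds 1 for the newline before endstream.
"BT\nET" pdf-stream print
! prints:
! <<
! /Length 6
! >>
! stream
! BT
! ET
! endstream
```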

We will use the following text operations to draw lines of text on each page:

  • BT - begin text
  • Td - location (where 0,0 is the bottom left of the page)
  • Tf - font and size
  • TL - line height
  • ' - move to the next line and draw a line of text
  • ET - end text

: pdf-text ( lines -- str )
    [
        "BT" ,
        "54 738 Td" ,
        "/F1 10 Tf" ,
        "12 TL" ,
        [ pdf-string "'" append , ] each
        "ET" ,
    ] { } make "\n" join pdf-stream ;
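
For a page with two short lines, the generated stream should look roughly like this (a sketch, assuming the text-to-pdf vocabulary is loaded):

```factor
USING: io text-to-pdf ;

{ "Hello" "World" } pdf-text print
! prints:
! <<
! /Length 50
! >>
! stream
! BT
! 54 738 Td
! /F1 10 Tf
! 12 TL
! (Hello)'
! (World)'
! ET
! endstream
```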

Using a 10-point Courier font (each character is 6 points wide and 10 points tall), 12 points of line spacing, and 3/4-inch left and right margins, each page holds 57 lines of text of up to 84 characters each. We use splitting and grouping words to convert a string into pages of text:

: string>lines ( str -- lines )
    "\t" split "    " join string-lines
    [ [ " " ] when-empty ] map ;

: lines>pages ( lines -- pages )
    [ 84 <groups> ] map concat 57 <groups> ;
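
The 84 and 57 used above can be checked with a little arithmetic. This sketch assumes the bottom margin matches the 54-point (3/4-inch) side margins, which the text does not state explicitly:

```factor
USING: math prettyprint ;

! Width: 612-point page minus two 54-point margins, at 6 points per character.
612 2 54 * - 6 / .
! prints: 84

! Height: from y = 738 down to y = 54, at 12 points per line.
738 54 - 12 / .
! prints: 57
```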

We can then take these "pages" and assemble PDF objects (including the info, catalog, font, and page index objects):

: pages>objects ( pages -- objects )
    [
        pdf-info ,
        pdf-catalog ,
        pdf-font ,
        dup length pdf-pages ,
        dup length 5 swap 2 range boa zip
        [ pdf-page , pdf-text , ] assoc-each
    ] { } make
    dup length [1,b] zip [ first2 pdf-object ] map ;

Given a list of objects and the 9-byte %PDF-1.4 version header at the start of the file, we can write a complete PDF, including the cross-reference table, the trailer, and the %%EOF "end-of-file" marker:

: pdf-trailer ( objects -- str )
    [
        "xref" ,
        dup length 1 + "0 %d" sprintf ,
        "0000000000 65535 f" ,
        9 over [
            ! xref byte offsets must be 10-digit decimal numbers
            over "%010d 00000 n" sprintf , length 1 + +
        ] each drop
        "trailer" ,
        "<<" ,
        dup length 1 + "/Size %d" sprintf ,
        "/Info 1 0 R" ,
        "/Root 2 0 R" ,
        ">>" ,
        "startxref" ,
        [ length 1 + ] map-sum 9 + "%d" sprintf ,
        "%%EOF" ,
    ] { } make "\n" join ;

: objects>pdf ( objects -- str )
    [ "\n" join "\n" append "%PDF-1.4\n" ]
    [ pdf-trailer ] bi surround ;
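
The offset bookkeeping can be sketched on its own with accumulate. The two objects here are made-up placeholders (18 characters each), and the starting value 9 is the length of the version header:

```factor
USING: math prettyprint sequences ;

! Each object is followed by one newline, so it occupies length + 1 bytes.
{ "1 0 obj\n(a)\nendobj" "2 0 obj\n(b)\nendobj" }
9 [ length 1 + + ] accumulate
.   ! prints: { 9 28 }  (byte offset of each object)
.   ! prints: 47        (byte offset of the xref table, used for startxref)
```

Note that this counts characters rather than bytes, so like the code above it is only accurate for single-byte text.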

Putting it all together, we can convert a text string to PDF, and a text file to PDF:

: text-to-pdf ( str -- str' )
    string>lines lines>pages pages>objects objects>pdf ;

: file-to-pdf ( path encoding -- )
    [ file-contents text-to-pdf ]
    [ [ ".pdf" append ] dip set-file-contents ] 2bi ;
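
A usage sketch, assuming the text-to-pdf vocabulary is loaded. The file name is hypothetical:

```factor
USING: io.encodings.utf8 text-to-pdf ;

! Reads notes.txt as UTF-8 and writes the result to notes.txt.pdf.
"notes.txt" utf8 file-to-pdf
```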

Running this code on itself produces this PDF file (so you can see what it will look like).

As usual, the code is available on my GitHub.

1 comment:

kib said...

Reducing the Python version by ~75% is really impressive.

Thanks for sharing!