Markup: a minimalist markup frontend for the PostScript language

Markup: minimalist markup for the PostScript® language

Markup is a procedure-set resource for the PostScript language that changes the form of input to free lines of text that can be interrupted by fragments of PostScript programming. The obvious application is text formatting, but Markup can be adapted to many jobs that involve reading material line-by-line.

What Markup (by itself) is not

Quite sophisticated resources for in-PostScript text formatting exist, as can be seen in my direct PostScript resources survey. They offer full justification, tables, columns, and many other capabilities associated with complete text typesetting systems. Those systems occupy 40 to 200 kilobytes or more of interpreter memory; Markup, at about eight kilobytes, is intended to complement them, not to replace them. It has no preordained idea what to do with the lines it reads, but can be linked to the procedures of a sophisticated typesetting resource to drive it with a convenient form of input. It works well in this capacity with the TinyDict, which does not have free-text input provisions of its own.

Thumbnail image of a business letter formatted with Markup Basic

What Markup is (even by itself)

Markup does include simple provisions, usable without any larger typesetting library, for line-for-line setting of text: one set line is produced from each input line, without filling words to fit, and lines can be set ragged right, ragged left, or centered. Markup's especially simple relationship to the underlying PostScript language makes those standalone facilities versatile enough for everyday business correspondence, promotional flyers, labeling of figures, and other jobs that do not demand the greater automation the elaborate systems provide. There is no plan to add significantly to these built-in capabilities. Markup is meant always to be lightweight enough to be an attractive front end for other libraries and, by not being tied to any one in particular, to stimulate use and development of new and existing libraries built on it or usable with it. Its standalone capabilities are meant to be adequate for a range of simple tasks but never to become the main point.

What Markup does differently

Markup is intended to stake out a distinctive position in the relationship of the markup language to the underlying PostScript. Some of the typesetting libraries I have surveyed introduce markup codes with an all new look and new rules for scanning and syntax, new mechanisms for defining commands or selecting fonts, and so on. Markup strives to avoid reinventing anything that PostScript already does easily and well, and to behave as nearly as possible as a natural outgrowth of PostScript. This example is written for Markup with its own built-in Basic formatter:

Markup may offer no dedicated new codes for, for example, font switching
but will certainly accept \{/Times-Italic 12 selectfont}the appropriate
ordinary PostScript\{/Times-Roman 12 selectfont} anywhere in the input. The effect
depends on the back-end typesetting library in use, but if it follows the
same philosophy of transparency, as Markup's Basic certainly does, this will
do just what you expect.

If there are only a few font changes in a one-off document, it would be hard to beat that form for clarity: there is no burden of learning or remembering new commands for font selection, or which fonts have been assigned to them. In a longer document, or where a consistent style is important, it will make sense to define some compact abbreviations. Now style changes can be made in one place. But that doesn't get any easier than PostScript already makes it:

\{
  /ro {currentfont /Times-Roman 12 selectfont} bind def
  /it {currentfont /Times-Italic 12 selectfont} bind def
  /last {setfont} bind def
}\ro This text is in Roman; \it this is emphasized, but to
\ro really \last emphasize something that's already in italics, one
sometimes goes back to Roman. \last It is easy to set up
a \/quoteleft last \/quoteright  command when the stack behaves as expected.

I like using the standard PostScript glyph names for the quotes rather than
remembering to write \<60>last\<27> in StandardEncoding,
or \<91>last\<92> in CE encoding\/emdash but that's a matter of
preference.

Thumbnail image of a Markup Basic sampler

Markup's only rules

Markup's clear family resemblance to PostScript comes from its scanning rules, which rely on PostScript's own scanning operators. The rules are as simple as can be:

Markup reads lines. That is, it uses the readline operator, which imposes the usual PostScript rule making the line endings of various operating systems equivalent. The line-ending characters are not included in the returned string. That solves a common annoyance in getting text handling right across platforms, and does so with PostScript's own mechanism.
If the line is interrupted, by a backslash unless configured otherwise, the portion read to the interruption is a PartLine. Otherwise, the line is a FullLine if there is anything on it, or a BareLine if there isn't. Markup is configured by a dictionary for what to do in each case; that is how it is set up to drive any given typesetting library. Typically BareLine is hooked to whatever makes the library start a new paragraph, so running paragraphs of text can be separated simply by bare lines.
When interrupted, Markup reads a single token, using PostScript's token operator. Therefore the interruption can be any single PostScript token, for example an integer or a name, a string, a hex string, or an entire bracketed procedure of any length; PostScript's rules for white space and comments apply. What to do with the token, and what to do with any result of doing that, are configured in the dictionary; reasonable conventions are to execute procedures, add strings to the current line or paragraph, pass literal names to glyphshow, and leave other things on the stack alone (to allow uses like the currentfont stacking example). Then Markup resumes reading lines.

To express those rules in PostScript took even fewer words than to describe them here. A few natural consequences are worth mentioning:

The PostScript tokens that are self-delimiting (procedures, strings) consume nothing past the end-delimiting character, but those that are not (such as names) consume one following whitespace character. An extra space must follow if a space is intended in the text.
A zero-length PartLine occurs before an interruption that begins a line, and between two contiguous interruptions.
A self-delimiting token that ends a line will be read as followed by a BareLine: the scanner, resuming after the interruption, finds zero characters and then the line end. To avoid an unwanted paragraph break, move the ending delimiter to the beginning of the next line:
```
  The token at the end of this line\{0.5 setgray}
  causes an unwanted paragraph break, but this one\{0.5 setgray
  } does not. There is no unwanted break with this glyph name\/emdash
  because the newline was consumed in terminating it.
```
There is no special provision for the interruption character to appear as literal text, but if string-type tokens are simply added to the text, the problem is not difficult.
```
  This line contains \(\\) a backslash.
```
If that is too uncomfortable, the name \ could be defined (just once) as the string (\\):
```
  \{
    /\ (\\) def
  }This line contains \\  one too. Don't forget the extra space.
```
If there are lots of backslashes to be included, perhaps it is worth changing EODString in Markup's configuration dictionary to some other value. A large block of verbatim text can be included by temporarily setting EODString to some unlikely value like ThisIsTheEndOfAllThatVerbatimText.
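
As a sketch of the verbatim technique just described, the input might look like the following. It assumes the parameter dictionary is on the dictionary stack (as in the Basic setup `14 Basic dup begin userdict begin \markup`) and that HandleToken executes procedures. The sample text is invented for illustration.

```postscript
\{/EODString (ThisIsTheEndOfAllThatVerbatimText) store  % switch the interruption marker
}A path like C:\TEMP\NEW is now read as ordinary text.
ThisIsTheEndOfAllThatVerbatimText{/EODString (\\) store  % restore the backslash
}A backslash interrupts the text again from here on.
```

Both procedures close at the start of the following line, so neither one produces an unwanted BareLine.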

Markup: reference

Markup is a ProcSet resource. To make it available to your own code, include in the setup section of your file:

/net.anastigmatix.Markup /ProcSet findresource begin

The findresource will succeed if you have made the Markup resource file [download] available in any of these ways:

You have included the resource file in the prolog of your own file, before the findresource (which belongs in the setup section)
You have downloaded the resource file to your printer's persistent memory since it was last powered on
You have saved the resource file under the right name in your printer's resource directory (if the printer has a disk or flash filesystem) or in Ghostscript's
You use document manager software that recognizes the %%DocumentNeededResources and %%IncludeResource DSC comments, you include these comments at the right position in your file to specify that it needs net.anastigmatix.Markup, your document manager software is configured to automatically insert needed resources in files being printed, and you have put the Markup resource file where your document manager can find it.

Markup relies on another resource, MetaPre (the eight kilobyte memory figure given earlier is the total for both). You will need that file also. If you use the first method, you should include both files in the prolog of your document, MetaPre first. The other methods should Just Work as long as both files are where they need to be. In any case, your document only needs the single findresource line shown above. A findresource for MetaPre is not needed unless you also use MetaPre features in your own document. You pay no penalty to do so, as the resource must be there anyway.

The resource files are in a compact form. That is for efficiency, not to keep you from viewing them; there is a script for that on the resource packaging page.

The Markup dictionary is read-only. Before creating any definitions, you will want either userdict begin or your own dict begin so that you have a writable dictionary on top of the dictionary stack.

Markup dictionary contents

This section describes the contents of the read-only dictionary that is returned by /net.anastigmatix.Markup /ProcSet findresource.

Markup proper

The dictionary contains one definition that implements Markup itself, reading and processing free text input as described in the introduction.

\markup

dict \markup -

\markup configures itself according to the supplied parameter dictionary dict and begins reading and processing input until it reaches end of input, or the PostScript operator stop is executed. It reads from the file given as SourceFile in the parameter dictionary, supplying a definition based on currentfile if the entry is not present. If currentfile is the source, ordinary PostScript interpretation of the file resumes after a stop is executed.

Lines are read by readline. If readline reads a complete line (terminated by newline), the line is a FullLine if it has nonzero length, a BareLine otherwise, and the corresponding procedure is executed with the line on the stack.

If readline does not read a complete line, either the end of input has been reached, or an interruption has been reached, defined by the values EODCount and EODString. The PartLine procedure is executed with the partial line read on the stack, and then the token operator is used to read a single token from SourceFile.

If token succeeds, the procedure HandleToken is executed with the token on the stack. If HandleToken returns (without executing stop), the procedure HandleResult is executed. If that also returns without executing stop, \markup freshly fetches SourceFile, EODCount, and EODString from the parameter dictionary in case their values were changed in handling the token, and resumes reading lines from the (possibly changed) SourceFile.

If token returns false, the end of SourceFile has been reached. If no EOFToken entry is present in the dictionary, \markup completes. If an EOFToken is present, \markup executes HandleToken and HandleResult just as if that token had been read and, if stop has not been executed, freshly fetches SourceFile, EODCount, and EODString, and resumes reading from the (presumably changed) SourceFile.
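
The reading cycle described above can be approximated with standard operators alone. The following is only a sketch, not Markup's actual implementation: it assumes that the interruption is detected with the standard SubFileDecode filter (whose EODCount and EODString parameters share their names with the keys used here), it hard-codes the default values, and the handler names OnFull, OnBare, and OnPart are hypothetical stand-ins for FullLine, BareLine, and PartLine. It needs a LanguageLevel 2 interpreter and reads the text that follows it in the same file.

```postscript
%!PS
/Src currentfile def               % plays the role of SourceFile
/Buf 192 string def                % plays the role of LineBuffer
/OnFull { (FULL: ) print = } def   % hypothetical handlers that just report
/OnBare { pop (BARE) = } def
/OnPart { (PART: ) print = } def

/ReadAll {
  {                                % one pass per interruption
    /F Src 0 (\\) /SubFileDecode filter def
    {                              % read lines until the filter reports EOD
      F Buf readline               % substring bool
      { dup length 0 eq { OnBare } { OnFull } ifelse }
      { OnPart exit }
      ifelse
    } loop
    Src token                      % any true, or false at end of file
    { exec }                       % simplest possible HandleToken
    { exit }
    ifelse
  } loop
} def
ReadAll
A complete line.

Text before a token,\{0.5 setgray} and the rest of the line.
```

The sketch omits HandleResult, EOFToken, the handling of stop, and the fresh fetching of SourceFile, EODCount, and EODString after each token.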

The parameter dictionary passed to \markup may be crafted for some underlying library for typesetting (or even some other purpose), to cause \markup to execute the appropriate procedures of that library. The dictionary may contain additional entries controlling the interface to that library and of no interest to \markup. The entries that are meaningful to \markup are described here:

Key	Purpose
BareLine	Procedure to be executed when readline has returned true, meaning a newline was encountered, and the line read was of zero length. This can happen for an actual bare line in the input, or for the final readline of a line that ends with a self-delimiting PostScript token. For consistency with FullLine and PartLine, the stack contains (a) whatever was on it when \markup was invoked (except for the parameter dictionary), (b) as modified by any embedded PostScript tokens since, and not consumed by HandleToken or HandleResult, and (c) on top, the string just read, though in this case it is known to be empty; in some uses (simple line-by-line setting), it can make sense for BareLine and FullLine to be the same procedure. When used with paragraph-at-a-time systems, BareLine will usually be defined to invoke the library procedure for filling and setting the completed paragraph. \markup fetches this value only once, so changing it from embedded PostScript code has no effect, but nothing stops the supplied procedure looking at other keys and having changeable behavior.
EODCount	An integer that, in combination with EODString, determines how embedded-PostScript “interruptions” are recognized. Determines how many instances of EODString can be encountered in the input before reading is interrupted; overlapping instances are not multiply counted. If the value is zero, the first occurrence of EODString interrupts reading and is not read as part of the text. If this entry is not present in the dictionary, \markup adds a definition of zero. If EODString is empty, reading will be interrupted when exactly EODCount bytes have been read; this combination is not likely to have practical use in \markup. This value is freshly fetched from the dictionary after any embedded token has been handled, so embedded PostScript tokens may change it.
EODString	A string that, in combination with EODCount, determines how embedded-PostScript “interruptions” are recognized. See EODCount. If this entry is not present in the dictionary, \markup adds a definition of (\\). This value is freshly fetched from the dictionary after any embedded token has been handled, so embedded PostScript tokens may change it.
EOFToken (optional)	When the end of SourceFile has been reached, ordinarily \markup completes. If this entry is present, its value is treated just as if it had been returned by token at the end of the file; HandleToken and HandleResult are executed, and if stop was not executed, reading resumes from the (presumably changed) SourceFile. This is the mechanism by which a file-inclusion operation can be supplied. The idea is for the file-include procedure to store the new file in SourceFile and supply an EOFToken that restores the old one (and the old EOFToken). Such a procedure can be generated by includegen, which is used to provide the file inclusion feature in Basic. If there is an EOFToken and it does not either replace SourceFile, replace EOFToken, or execute stop, \markup will reach end of file and spin.
FullLine	Procedure to be executed when readline has returned true, meaning a newline was encountered, and the line read was of nonzero length. This can represent an entire full line in the input, or the final segment of a line after an embedded PostScript token. The stack contains (a) whatever was on it when \markup was invoked (except for the parameter dictionary), (b) as modified by any embedded PostScript tokens since, and not consumed by HandleToken or HandleResult, and (c) on top, the string just read. In driving a line-at-a-time text setting system, this procedure may invoke PartLine and then a library procedure to set the complete accumulated line; for a paragraph-at-a-time system, this procedure and PartLine will probably be the same. \markup fetches this value only once, so changing it from embedded PostScript code has no effect, but nothing stops the supplied procedure looking at other keys and having changeable behavior.
HandleResult	Procedure to be executed when an embedded token has been read, after HandleToken has been executed, and if HandleToken did not incur a stop. Anticipated uses may check the number and type of objects on the stack, and perhaps add a string on top of stack to the accumulating line, interpret a literal name as a glyph name to be shown, or other conventions appropriate to the application.
HandleToken	Procedure to be executed when an embedded token has been read; any results on the stack may then be processed by HandleResult. One reasonable implementation of HandleToken is a simple unconditional exec. At the time this procedure is executed, SourceFile contains the file object from which the token was read.
LineBuffer	A string that will be used as the buffer for readline. The size of this string determines the maximum line length that can be found in the input without incurring a rangecheck.
PartLine	Procedure to be executed when readline has returned false, meaning it encountered the end of input (or an interruption for embedded PostScript) before a newline. The line may be of zero or nonzero length. The stack contains (a) whatever was on it when \markup was invoked (except for the parameter dictionary), (b) as modified by any embedded PostScript tokens since, and not consumed by HandleToken or HandleResult, and (c) on top, the string just read. \markup fetches this value only once, so changing it from embedded PostScript code has no effect, but nothing stops the supplied procedure looking at other keys and having changeable behavior.
SourceFile	File object from which \markup reads input. If this entry is not initially present, \markup defines it with the file object obtained from currentfile. This value is freshly fetched from the dictionary after any embedded token has been handled, so embedded PostScript tokens may change it.
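
To make the table concrete, here is a minimal sketch of a hand-built parameter dictionary that copies its input to standard output and prints any string left behind by an embedded token. The procedure bodies are illustrative choices, not part of Markup, and the example cannot run unless the Markup and MetaPre resources are available to the interpreter.

```postscript
/net.anastigmatix.Markup /ProcSet findresource begin
userdict begin
<<
  /LineBuffer   192 string
  /PartLine     { print }          % partial line: print with no newline
  /FullLine     { = }              % full line: print followed by a newline
  /BareLine     { = }              % the string is empty, so this prints a blank line
  /HandleToken  { exec }           % execute whatever token was read
  /HandleResult {                  % if a string is on top, add it to the output
    count 0 gt { dup type /stringtype eq { print } if } if
  }
>> \markup
A full line.
Two plus three is \{2 3 add 10 string cvs} exactly.
\{stop}
```

SourceFile, EODCount, and EODString are left out so that \markup supplies its defaults, as described in the table.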

Utility procedures

The resource dictionary includes four utility procedures likely to be useful to other code.

includegen: dict includegen proc
Returns a procedure proc that will implement an “include” facility by manipulating the SourceFile and EOFToken entries in a markup dictionary dict. The returned proc has stack effect (file proc –). It will install file as SourceFile, and install an EOFToken that will restore the previous SourceFile and EOFToken, if any, when it is invoked by \markup at the end of the included file.
glyphwidth: name glyphwidth wx wy
Returns the change in current point that would result from executing name glyphshow. I used to wonder why PostScript had both show and glyphshow, but only stringwidth and no glyphwidth. I got tired of wondering.
gr-path: – gr-path –
Read “grestore minus path.” Replaces the current graphics state using all values saved by the matching gsave except the current path, which is preserved. Any current path that was saved by gsave is lost.
sgs-path: gstate sgs-path –
Read “setgstate minus path.” Replaces the current graphics state using all values saved in gstate except the current path, which is preserved. Any current path that was saved in gstate is not lost—it is still in gstate—but does not become the current path.

Predefined Markup configurations

The resource dictionary includes prototypes for parameter dictionaries giving two usable configurations of \markup.

Dump: This is a parameter dictionary that simply supplies values for PartLine, FullLine, BareLine, HandleToken and HandleResult that will cause \markup to dump to standard output the same input that it reads (PostScript tokens will be as formatted with ==). A 192-byte default LineBuffer is provided. The dictionary is read-only; you should allocate a new dictionary and copy the contents so automatically-added contents like SourceFile, EODCount, and EODString can be added.
Basic: num Basic dict
This is a procedure to produce a parameter dictionary configuring \markup as the simple, line-for-line Basic text formatter described in the introduction. The dictionary is initially configured to set ragged-right with an advance of num units downward between consecutive baselines, though both can be changed in the resulting dictionary before or during use. “Right” and “down” are taken, as in PostScript default coordinates, to be increasing x and decreasing y, respectively. A 192-byte default LineBuffer is provided. The details of the Basic formatter are described in the next section.
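
The copy step recommended for Dump might be written as follows. This is a sketch that assumes a LanguageLevel 2 interpreter and the Markup and MetaPre resources being available; the extra capacity of three allows for the entries \markup may add.

```postscript
/net.anastigmatix.Markup /ProcSet findresource begin
userdict begin
% Copy the read-only Dump prototype into a fresh, writable dictionary with
% room for SourceFile, EODCount, and EODString.
Dump dup length 3 add dict copy \markup
This line and the token \{1 2 add} are echoed to standard output.
```

Because Dump only echoes tokens, the example does not rely on an embedded stop; reading simply continues to the end of the file.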

Basic: reference

Basic is a minimal, line-for-line text formatter that can be driven by Markup. Its versatility stems from its very close relationship to the PostScript language beneath; what happens when arbitrary PostScript sequences are embedded in text being set should (and often does) match what you might expect without thinking too hard.

Basic in one paragraph

Basic requires an initial current point to be set, as with moveto. Each line of text is placed with its reference point at the current point: in RagRight mode the left end of the line is placed there, in RagLeft mode the right end, and in Center mode the middle. The current point is then moved by the Baselines advance. These modes are special cases of the values of two matrices that completely control line placement. And with that, the operation of Basic is almost completely described; nothing remains but details.

The details underneath

Basic is implemented as a parameter dictionary for \markup. An instance of the dictionary is generated by the procedure Basic as described in the previous section. The dictionary has several entries in addition to those required by \markup itself. Some of these entries are procedures that can be used to change settings from embedded PostScript code. To use them that way, it is convenient to put the parameter dictionary on the lookup stack, something \markup does not automatically do:

14 Basic dup begin userdict begin \markup

The userdict begin merely arranges that any definitions created in your embedded PostScript do not wind up in the parameter dictionary itself. Be sure to use store and not def when it is your intent to change values in the parameter dictionary: Basic looks only there for its own state.
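
Putting the pieces together, a complete small document might look like the sketch below. The font, page position, and text are arbitrary choices for illustration, and the file will run only where the Markup and MetaPre resources can be found.

```postscript
%!PS
/net.anastigmatix.Markup /ProcSet findresource begin
/Times-Roman 12 selectfont
72 720 moveto                      % Basic needs an initial current point
14 Basic dup begin userdict begin \markup
Dear reader,

This line is set ragged right, \{/Times-Italic 12 selectfont}with emphasis.
\{/Times-Roman 12 selectfont 20 Baselines}From here on the baselines are 20 units apart.
Each input line becomes one set line.
\{stop}
showpage
```

After the embedded stop, ordinary PostScript interpretation resumes, so the showpage on the last line is executed rather than set as text.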

The Basic parameter dictionary defines \markup's required keys BareLine, PartLine, FullLine, and HandleResult all in terms of each other and Basic's two fundamental operations, Track and Place. Track and Place can be used directly from embedded PostScript for particular effects. Consider two similar lines of input:

\{(M)stringwidth rmoveto}This is a line of text.
\{(M)stringwidth{rmoveto}Track}This is another line of text.

The first example simply moves the current point one em to the right, essentially behind Basic's back. Because Basic simply places all text according to the current point and advances it with relative moves, the effect is a one-em indent for the current and all following lines, until changed by another rmoveto. You would do this to indent a block of text, or to pick up and move to an entirely new area of the page.

The second example makes the rmoveto into a tracked element of the current line, which will be compensated when the line is placed and the current point is advanced. The one-em indent will last only for one line. You would use this for a paragraph indent (probably by supplying a definition of BareLine that used it, so the indent would be applied for each bare line in the input).

An untracked move will have the same effect no matter where in a line it appears—it is simply a change to the current point made behind Basic's back while the line is being assembled and before it is placed. Tracked moves can appear anywhere within lines, and do just what you would think.

The entries specific to Basic in the parameter dictionary are these:

Baselines

num Baselines -

Set the baseline advance to num (the initial value was given by the num argument to Basic when the dictionary was generated). A positive value advances downward (decreasing y). Modifies the After matrix.

RagRight Center RagLeft

- RagRight -
- Center -
- RagLeft -

Change the line placement mode (RagRight is the default when the dictionary is generated) by modifying the Before and After matrices.

Track

proc Track -

The first part of Basic's internal mechanism for text placement. Track is used to build a queue of procs representing segments of a line. It takes a snapshot of the current graphics state and operand stack, executes proc without marking the page, finds a distance vector from the current point before executing proc to the current point after, restores the graphics state from the snapshot, updates a cumulative distance vector for the line, and adds proc and the snapshot to the queue. The cumulative distance vector is maintained in device space to be independent of changes to user space.

If proc executes stop, Track does not update the distance vector or add to the queue. This is one way a line-filling formatter could be built on top of Basic: Track each candidate element wrapped in a suitable procedure that will stop if it cannot fit on the current line. The origin is the initial current point when Track executes proc, so following the graphics operations proc can obtain its own distance vector (in user units) with a simple currentpoint. This is one application that would justify designing proc to behave differently when executed by Track and when later executed by Place—a bad idea, in more ordinary circumstances.

Place

after before Place -

The second part of the text placement process. Before executing the queued procs that make up the line, Place transforms the line's cumulative distance vector to user space and then through the matrix before to obtain arguments for an rmoveto. For each proc on the queue, Place then imposes its saved operand stack and graphics state, using sgs-path so the current point is not altered, executes proc, this time marking the page, and cleans the operand stack. After executing the last proc, Place restores the original operand stack and again transforms the line's distance vector, this time using the after matrix to find an rmoveto that reaches the next line's reference point, and resets the distance vector.

From the description of Track and Place it should be clear that every proc representing a segment of the line is executed twice, first by Track and later, with the same operand stack, by Place. The necessary assumption is that the procs do not depend on side effects or other state than their operands that could change their behavior from one execution to the next. If a proc violates that assumption, surprises may result. Absolute moves are not recommended because the current point is unlikely to be the same both times proc is executed.

Width

– Width wx wy

Returns the current accumulated width vector of all that has been Tracked and not yet Placed, that is, the vector from the start point of the first item Tracked since the last Place to the end point of the most recent, in units of the user space in effect when Width is executed. This could be used in conjunction with conditional Tracking to implement a line-filling formatter as suggested in the description of Track.

ResultHandlers

This entry is a dictionary mapping object type names to procedures, used to customize the behavior of HandleResult after an embedded PostScript token has been executed. If there is at least one item on the stack, and the type of the topmost item is a key in ResultHandlers, then the associated procedure is executed. Three types are initially present: stringtype maps to the procedure PartLine, nametype to the procedure {{glyphshow}Track}, and filetype maps to a procedure generated by includegen.

Before After

These entries are matrices. Before is used to transform the complete line's distance vector, in user space, to an rmoveto locating the point where the line should begin. From the current point after the line has been placed, After transforms the line's distance vector to an rmoveto locating the reference point for the line to follow. In RagRight mode, the matrices are [0 0 0 0 0 0] and [-1 0 0 -1 0 -num], respectively, where num is the baseline distance given to Baselines or to Basic when the dictionary was generated.
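
The effect of the RagRight matrices can be checked with the standard transform operator and no Markup resources at all. The sketch below uses an arbitrary line width of 100 units and a baseline advance of 14; the names w and num are chosen for illustration.

```postscript
%!PS
/w 100 def                         % width of the line's distance vector
/num 14 def                        % baseline advance
/show2 { exch = = } def            % print x, then y

% Before: the zero matrix maps any vector to (0, 0), so the line
% starts at the reference point.
w 0 [0 0 0 0 0 0] transform show2

% After: the vector is negated and shifted down by num, giving (-w, -num).
% That returns from the end of the line to the start and drops one baseline.
w 0 [-1 0 0 -1 0 num neg] transform show2
```

By the same reasoning, and as an inference rather than something taken from Basic's source, RagLeft would correspond to [-1 0 0 -1 0 0] and [0 0 0 0 0 -num], and Center to [-0.5 0 0 -0.5 0 0] and [-0.5 0 0 -0.5 0 -num].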

Examples

Here are three examples of Basic in use; you can view them in a PostScript viewer or a text editor, depending on whether you would like to see how they look on the page or how they were written. Each one is in two versions, one with all resources included, and one that does not include them and can be viewed only if your viewer or printer already can find the MetaPre and Markup resources, as discussed in the reference section.

The “labeler demo” was inspired by this newsgroup thread in which the original poster wanted a simple PostScript template that a PHP script could emit to make a simple label, and some of the suggested solutions had the PHP script emit LaTeX code, to be postprocessed with LaTeX and dvips. For a label!

	Bare file	Resources included
Sampler	PostScript view text view	PostScript view text view	PDF view
Business letter	PostScript view text view	PostScript view text view	PDF view
Labeler demo	PostScript view text view	PostScript view text view	PDF view

$Id: Markup.html,v 1.14 2009/11/12 03:16:30 chap Exp $