Beyond OmniMark

Sam Wilmott, sam@wilmott.ca

Balisage: The Markup Conference 2019, July 30 - August 2 2019

Copyright © 2019 Sam Wilmott

Document processed: Thu Jul 18 11:36:16 PDT 2019

Abstract

The field of programming languages is in continual flux: there are new languages coming along every few years. In the field of text and markup processing languages, things have settled down rather, with XSLT dominating and a few other languages like OmniMark filling in the gaps, but it is no more exempt from change than any other application area. Whether change is always improvement is not a certainty, but we must always be striving for improvement, so one hopes that there is such a thing that applies to our text and markup language field.

This paper starts with an overview of some existing text and markup processing languages, and concludes with an outline and examples of a new programming language that I hope will make text and markup processing easier than is now the case, or at least provide thoughts as to which directions things can go.

Table Of Contents

  1 Text And Markup Processing Languages
  2 Why Programming Languages Advance
  3 An Attempt At A Future Programming Language
   3.1 Language Features
   3.2 Markup Language Features
  4 Examples
   4.1 Hello World
   4.2 XML Processing
   4.3 More XML Processing
   4.4 Even More XML Processing
   4.5 XML Processing Without Rules: Serial Processing
   4.6 XML Processing Without Rules: Tree Processing
   4.7 Text Matching Rules
   4.8 Using Regular Expressions
   4.9 Defining Operators
   4.10 toArray
   4.11 Procedural And Functional Methods
   4.12 Numbers and Strings
   4.13 Classes
   4.14 Parameterized Classes
   4.15 Synchronous Pipes and Streams
    4.15.1 Pipes
    4.15.2 Text Streams
   4.16 Locally Scoped Names
   4.17 Selecting
   4.18 Iterators
   4.19 Patchable Print Streams
   4.20 Program Profiles
   4.21 Defining XML Processing
  5 Conclusion
  Appendix A — BML: The Smallest Markup Language
   A.1 An Example of Using BML
   A.2 How BML Tree Processing Is Defined
   A.3 A Couple Of More Examples

1 Text And Markup Processing Languages

In the old days, most of the work required of computers was about processing numbers, for either commercial or scientific applications. That, together with manipulating the system hardware on which their programs ran, was what programmers required of computer programming languages. Even 50 years ago, in the late 1960's, there were many programming languages produced to help with all of that.

In the still early days of computer languages, in the 1960's and 1970's, text and markup processing were none-the-less important application areas. In fact, there were more text processing languages than there seem to be now. Most notable were Ralph Griswold's SNOBOL and Icon, and the less well known HUGO -- a text processing and electronic publishing language that I designed and helped implement at the Canadian Publishing and Printing Office -- it was used in the publishing of Hansard, Canadian Laws and other government publications. In the last couple of decades what computers do has greatly changed. A lot of what is done with computers these days is manipulating text, and how it is marked up: formatting it for different platforms, converting it from one markup form to another, and enhancing it with new information. Yet there are very few computer languages that are designed to help with these tasks.

Perl more-or-less leads the field of commonly used text processing languages these days, and a lot of work has been put into it to keep it up-to-date. XSLT is very useful and is widely used in the markup language field, even though it only supports input in XML and is somewhat weak in its text processing features. OmniMark is strong in its text processing abilities, and it handles SGML, XML and, more recently, JSON. (OmniMark's unpopularity relative to XSLT, in spite of its wider range of applications, is due as much as anything else to the high cost of acquiring it, not to its comparative utility.)

The total history of programming languages of all sorts is a little over 60 years -- less than the age of some of us still in this business. When I first got interested in programming languages, in the late 1960's, the people who had pioneered programming languages in the 1950's were still active in the field. In this context, both XSLT and OmniMark are getting rather long in the tooth. XSLT is 20 years old and OmniMark's basic design and most useful features date from 30 years ago. In fact, OmniMark's syntax, processing model and text processing capabilities are based on the HUGO language, making its style more like 40 years old. There have been a few recent advances in both languages, including serial markup processing in XSLT and JSON support in both OmniMark and XSLT, but development in both cases has definitely slowed down.

2 Why Programming Languages Advance

For programming languages, the available memory and speed of processors at the time of their initial design and implementation are major factors in a language's design and in the cost of its implementation. Early design decisions in a language greatly impact what can be added to it later. This is why new programming languages come along every decade or so: one has to go back to the beginning to really make a change in a language, and increased computer power makes that progressively easier to do over time.

As an example of how the world changes, when OmniMark was first designed and implemented, the largest machine in the office was a 2 megabyte memory desktop Macintosh. Now I'm typing this paper on a much faster 8 gigabyte memory laptop, which is no more than average for these days. It is about seven years old, so it is getting old, and may already be considered slow and small.

More memory and faster processors make it easier to implement a new programming language. Easier yes, but not easy -- it is still a big job. And more is now expected of a new programming language than was the case in the past. It is getting less and less clear what hasn't already been done, and what new could be done. However, even though it is over 60 years since the first widely-used programming languages like FORTRAN and COBOL were created, computer hardware, system software and other software tools have continued to develop, what they are used for has changed and expanded, and one would think that programming languages should participate in this continued development. As well, a number of useful programming features from the 1960's seem to have disappeared, and really need reviving (in my opinion, of course).

So if nothing else, there is an opportunity for a new programming language, combining desirable features both from current languages and past languages, together with a few minor innovations.

3 An Attempt At A Future Programming Language

I've been working on a new programming language, on and off, for the best part of a decade now, and it is coming close to completion. It is an example of what I think should be on the horizon. It is a bit too large a topic for this talk, but I'll outline a few features, and give a few examples, which I hope will demonstrate what I'm talking about.

The new language is currently called Bobbee.

3.1 Language Features

The language has a number of interesting features:

3.2 Markup Language Features

The most interesting aspects of Bobbee's markup language support are that:

Currently, the libraries available for use with Bobbee support the markup languages SGML, XML, MicroXML, JSON and BML (for BML, see Appendix A).

4 Examples

Following are examples of some of the interesting features of the language. Given that the language description is hundreds of pages long, this is of necessity just a taste of what the language does and how it works. As well, many features do not work well for small examples: they need much larger examples to be convincing.

4.1 Hello World

We start with the basic "Hello World" program, which is required for all programming language examples:

println ("Hello World");

People with Java experience will notice that println doesn't need prefixing with "System.out". This is a user selected option that applies to anything the user chooses. There are a few defaults in the "standard" profile (which can be overridden). Here is the declaration that makes the above happen:

use System.out.(print, println, printf);

The user can declare their own short-hand forms. Where the short-hand is not appropriate, fully qualified names can be used, as in:

System.err.println ("Goodbye Cruel World");

4.2 XML Processing

Here is an XML example that converts an XML document to a JSON form:

program xmltree;

parseXml (System.in);

choose xmlElementNode {
  print (" [\"%s\" {" (*.qName));
  for a = *.attributes, comma = "" then "," do
    print ("%s\"%s\":\"%s\"" (comma, a.qName, a.allValues));
  print ("}");
  processChildren;
  print ("]");
}

So:

<section id="X"><title>Title Text</title><p>Para 1.</p></section>

is converted to:

["section", {"id":"X"}, ["title", {}, "Title Text"], ["p", {}, "Para 1."]]

An XML processing program is prefixed by a program directive that indicates what kind of XML processing is to be done, in this case, XML tree processing, as in the original release of XSLT.

A markup processing rule is heralded by a two-word prefix: choose, indicating it is a markup processing rule and, in the above case, xmlElementNode indicating what is to be recognized by the rule.
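
For readers more familiar with Java, the following is a minimal sketch of the same element-to-array conversion written against the standard DOM API. It is offered only as a point of comparison, not as a description of how Bobbee implements the rule above. The class name XmlToJsonSketch and its methods are illustrative assumptions, and the quoting helper is a simplification that escapes only backslashes and double quotes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical comparison code: converts an XML document into nested
// ["name", {attributes}, children...] arrays, using a DOM tree walk.
final class XmlToJsonSketch {

    static String convert(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            writeElement(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Plays the role of the element rule: emit the name and attributes,
    // then process the children in document order.
    private static void writeElement(Element e, StringBuilder out) {
        out.append("[").append(quote(e.getTagName())).append(", {");
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            if (i > 0) {
                out.append(",");
            }
            out.append(quote(a.getNodeName())).append(":").append(quote(a.getNodeValue()));
        }
        out.append("}");
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                out.append(", ");
                writeElement((Element) c, out);
            } else if (c.getNodeType() == Node.TEXT_NODE) {
                out.append(", ").append(quote(c.getNodeValue()));
            }
        }
        out.append("]");
    }

    // Simplified quoting: only backslash and double quote are escaped.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(convert(
            "<section id=\"X\"><title>Title Text</title><p>Para 1.</p></section>"));
    }
}
```

Run on the sample input above, this sketch produces the same array form as the sample output. The explicit recursion in writeElement is what processChildren hides in the rule-based version.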

4.3 More XML Processing

Here is another XML processing program, using serial parsing this time:

program xmlserial;

sendToMatch (System.in);

choose xmlElement ("section") {
  processChildren;
}

choose xmlElement ("title") when *.parent.is ("section") {
  print (".section ");
  processChildren;
  println ();
}

choose xmlElement ("p") {
  print (".para ");
  processChildren;
  println ();
}

About markup processing:

4.4 Even More XML Processing

Serial and tree markup processing can be combined. For example, when processing a potentially large document that is best done with serial processing, there may be subcomponents, like tables, that require the added functionality of tree parsing. As a simple example:

program xmlserial, xmltree, textpatterns;

parseXmlSerially ();

choose xmlElement ("doc") {
  println (".startdoc"); processChildren; println (".enddoc");
}

choose xmlElement ("p") :> processChildren;

choose xmlElement ("table") :> xmlElementNode (*.captureTree ());

choose xmlElementNode ("table") {
  println (".starttable"); processChildren; println (".endtable");
}

choose xmlElementNode ("tr") {
  println (".startrow"); processChildren; println (".endrow");
}

choose xmlElementNode ("td") {
  println (".tableitem "); processChildren;
}

choose xmlText :>
  println (*.data) unless *.data matches ([" \t\n"]* & -|);

choose xmlTextNode :>
  println (*.data) unless *.data matches ([" \t\n"]* & -|);

The "Node" rules do tree processing, and the non-"Node" rules, serial processing.

"captureTree" captures the current element (in this case "table") as a tree data structure, and invoking "xmlElementNode" passes the tree to the tree processing rules.

4.5 XML Processing Without Rules: Serial Processing

A markup parser can be invoked without using rule-based processing. It is an iterator of the serial processing item type returned by the parser, in this case XmlItem:

import bj.xml.*;
for item = XmlSerialParser.parse () do
  println ("%s: %s" (name of item, item));

With the following input:

<doc><p>The text.</p></doc>

this program outputs the following:

bj.xml.XmlStartDocumentItem: <?xml version="1.0"?>
bj.xml.XmlStartTagItem: <doc>
bj.xml.XmlStartTagItem: <p>
bj.xml.XmlTextItem: The text.
bj.xml.XmlEndTagItem: </p>
bj.xml.XmlEndTagItem: </doc>
bj.xml.XmlEndDocumentItem: <!-- end document -->

The above sample program illustrates two minor but useful features of Bobbee:

4.6 XML Processing Without Rules: Tree Processing

As well as direct access to a markup parser's serial parser as an iterator, its tree parser can be used to return a tree-structured data structure:

import bj.xml.*;
println (XmlTreeParser.parse ());

With input as:

<?xml version="1.0"?><doc><p>The text.</p></doc>

this program produces the same output as its input.

However, because "XmlTreeParser.parse" returns a data structure, working with it can be very useful. For example, if one wants to sort a table, based on some of its fields, prior to outputting information from it, the data structure is something one can do that with.

4.7 Text Matching Rules

For a taste of text pattern matching rules, here is a program that converts all words by capitalizing the first letter and putting all the other letters in lower case. It has two rules: the first recognizes a word (a string of letters, including apostrophes and dashes), and does the conversion, and the second copies out all other things:

program textpatterns;

sendToMatch (System.in);

match uLetter => "first" & (uLetter | ["'-"])* => "rest" {
  print (upper => "first" + lower => "rest");
}

match \uLetter+ => "other" {
  print (=> "other");
}

The "=>" operator is used in two ways (in the same way that "+" can be used in various ways):

"upper" and "lower" are prefix operators that respectively upper-case and lower-case their argument, which in this example is captured text from a text pattern match. ("length" is a similar prefix operator that returns the length of its argument.) And "+" joins two string values.

One major feature in the above example is starting the program with a program declaration. In this example, it names the text pattern features that are used in the program. (More about program later.)
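
As a point of comparison only, here is a minimal Java sketch of the same word-capitalizing conversion using java.util.regex. It is not Bobbee code, and the class and method names are illustrative assumptions. The first rule corresponds to the regular expression and its replacement function. The second rule has no explicit counterpart, because replaceAll copies unmatched text through unchanged.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison code: capitalize the first letter of each word
// and lower-case the rest, leaving everything else untouched.
final class CapitalizeSketch {

    // A word: one letter, then any run of letters, apostrophes and dashes.
    private static final Pattern WORD = Pattern.compile("(\\p{L})([\\p{L}'-]*)");

    static String capitalizeWords(String text) {
        // Matcher.replaceAll with a function argument requires Java 9 or later.
        return WORD.matcher(text).replaceAll(m ->
            Matcher.quoteReplacement(
                m.group(1).toUpperCase(Locale.ROOT)
                    + m.group(2).toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("hello WORLD, it's o'NEIL-smith!"));
    }
}
```

Note the doubled backslashes in the Java pattern string and the numbered groups standing in for the named captures "first" and "rest".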

4.8 Using Regular Expressions

Here is the previous text matching rules example using Regular Expressions instead of text pattern matching expressions:

program textpatterns;

sendToMatch (System.in);

match `\=(\p{L})([\p{L}\p{N}]*)` {
  print (upper * [1] + lower * [2]);
}

match `\=([^\p{L}])` {
  print (* [1]);
}

When using regular expressions:

"\=" helps make strings a bit easier to read. The alternative to using "\=" is to double "\" characters as is used in Java text patterns:

match `(\\p{L})([\\p{L}\\p{N}]*)` {...}

"\=" is like Java's "\Q" except that:

A string can be split into parts, and the effect of "\=" and "\Q" is limited to one part:

print ("\=\\" "\\");

prints three backslash characters: two because "\=" means that neither of the following "\"'s are considered escapes, and one more because "\=" doesn't affect the second part of the string, where a first "\" is needed to escape the second one. The two-part string in this example is a single string, not two strings "joined" together.

4.9 Defining Operators

In the previous example, a variety of operators are used. Operators can be defined in a Bobbee program or library. For example, these operators are defined with their name, their argument names and types, the result of the operator and finally the code that produces the result:

operator length (arg1 : String) : integer :> arg1.length ();
operator lower (arg1 : String) : String :> arg1.toLowerCase ();
operator upper (arg1 : String) : String :> arg1.toUpperCase ();
operator (arg1 : String) + (arg2 : String) : String :> arg1.concat (arg2);

defines the "length", "lower", "upper" and "+" operators as they apply to "String" values. These operators are defined in a "functional" style, with ":>" meaning that what follows is the returned value of the operator. Operators and methods can be defined in this functional style, or have a "body" containing "return" statements.

Operators need a "precedence" defined for them, which is what determines how tightly they bind. For example, one wants "*" (for multiplication) to bind more tightly than "+" (for addition). For these operators, again defined in a Bobbee program or library:

operator length 311;
operator lower 311;
operator upper 311;
operator 240 + 241;

To improve program performance, Bobbee provides a mechanism for eliding the call to the "length" operator definition, by telling the compiler what to call instead, by use of a "@Builtin" annotation:

@Builtin ("VCALL1:public:java.lang.String.length:()I")
operator length (arg1 : String) : javaInt :> arg1.length ();
@Builtin ("VCALL1:public:java.lang.String.toLowerCase:()Ljava/lang/String;")
operator lower (arg1 : String) : String :> arg1.toLowerCase ();
@Builtin ("VCALL1:public:java.lang.String.toUpperCase:()Ljava/lang/String;")
operator upper (arg1 : String) : String :> arg1.toUpperCase ();
@Builtin ("VCALL2:public:java.lang.String.concat:(Ljava/lang/String;)Ljava/lang/String;")
operator (arg1 : String) + (arg2 : String) : String :> arg1.concat (arg2);

(The "length" operator actually returns a "javaInt" value rather than whatever Bobbee's "integer" type may be implemented as. This reflects what the underlying Java library method returns.)

The "length", "lower", "upper" and "+" operators are defined in a class (bj.lang.Operators). To allow them to be used in a user's program without class qualification, Bobbee provides a means for declaring which values, methods and operators can be used unqualified. For these operators this is:

use bj.lang.Operators.{length, lower, upper, +};

(Different overloadings of an operator or method can have a "use" from different defining classes.)

All operators, even plain old arithmetic "+" and "*", are defined by these mechanisms.

The primary motivation for including operator definitions in Bobbee is to allow libraries to be implemented in Bobbee itself, and not requiring them to be "built in". All of Bobbee's markup language and text pattern matching support, and much of its other functionality, is implemented in libraries written in Bobbee using these mechanisms.

4.10 toArray

There are many small things that I find irritating about existing programming languages. I've attempted to address some of these concerns in designing and implementing Bobbee, although addressing all of them is beyond practical. One major irritation using Java has been the form of the "toArray" method for converting a value of a subclass of the "java.util.Collection" interface to an array, as in:

var c : ArrayList<String>;
var a : String [] = c.toArray (new String [0]);

The problem is that the item type of "c" ("java.lang.String") needs repeating in the argument of "toArray", even though it's known to the compiler what the type of "c" is. Bobbee provides a postfix operator that does the job without this repetition:

var c : ArrayList<String>;
var a : String [] = c toArray;
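
For reference, the Java idiom being complained about looks like the following minimal sketch. The class and method names are illustrative assumptions; the two toArray overloads are part of the standard java.util.Collection interface (the second was added in Java 11). In both, the element type has to be restated by the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical comparison code showing the repeated element type in Java.
final class ToArraySketch {

    // Classic form: an array argument supplies the runtime element type.
    static String[] withArrayArgument(List<String> c) {
        return c.toArray(new String[0]);
    }

    // Java 11 form: an array constructor reference supplies the type instead.
    static String[] withGenerator(List<String> c) {
        return c.toArray(String[]::new);
    }

    public static void main(String[] args) {
        List<String> c = new ArrayList<>(List.of("one", "two"));
        System.out.println(withArrayArgument(c).length);
        System.out.println(withGenerator(c).length);
    }
}
```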

4.11 Procedural And Functional Methods

A classic method example is the "factorial" function. In Bobbee there are a number of ways of defining it, that illustrate different ways of implementing it. First, here's the traditional recursive functional form:

def factorial (n : integer) : integer :>
  if n == 0 then 1 else n * factorial (n - 1);

Here's the same thing in an iterative procedural form:

def factorial (n : integer) : integer {
  var f : integer = 1;
  for i = 2 to n do
    f *= i;
  return f;
}

":>" indicates that the (returned) value of the method or other form is given as an expression. return provides a method result in a procedural manner. Both methods and operators can be defined in either a functional or procedural form. For some things one form is best. For others the other.

4.12 Numbers and Strings

Bobbee uses two kinds of numbers: integer (64-bit integer) and real (64-bit floating-point number). The idea is that given that computer memories are way bigger than they were not long ago, there's no real need to be careful with storage sizes the way there used to be. This makes programming a bit easier.

var n : integer = 0;
# declare "n" to be an integer, and initialize it to zero.

One can access Java type numbers using special names: javaByte, javaShort, javaInt, javaLong, javaFloat and javaDouble. The next section has an example of using this feature.

Bobbee supports two kinds of strings (and characters):

Most importantly, for string, the length of a string is the actual number of characters in the string, unlike the case for Java strings, where the length may be more than the number of characters.
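
The Java behaviour referred to here can be seen in a minimal sketch like the one below. The class and method names are illustrative assumptions. Java's String.length() counts UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two, whereas codePointCount counts Unicode code points.

```java
// Hypothetical comparison code: two ways of measuring a Java string.
final class StringLengthSketch {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Length(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, written as a surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println(utf16Length(s));
        System.out.println(characterCount(s));
    }
}
```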

By default, all types, including numbers, characters and boolean values, are implemented as objects. There's a lot to be said for making everything an object: it makes the language more uniform in its use. And there's not as much cost as one might think: for example, calls to the various "print" methods convert all their arguments to objects. On the other hand, the language can handle non-object forms of these types.

4.13 Classes

Here is a simple example of a class that implements a subclass of OutputStream, that just puts what is written to it in a buffer, which then becomes the class's "toString" value:

class BufferedOutput : OutputStream {
  def this () {}
  def this (this.buffer) {}
  def write (b : javaInt) : void :>
    write (new javaByte [] {b}, 0, 1);
  def write (b : javaByte []) : void :> write (b, 0, length b);
  def write (b : javaByte [], off, len : javaInt) : void :>
    buffer += new String (b, off, len);
  def close () : void {}
  def flush () : void {}
  def toString () : String :> buffer;
  private:
  var buffer : String = "";
}

This class illustrates a number of features of the language:

4.14 Parameterized Classes

The "generic type" or "parameterized type" feature of many current languages is very useful in the definition and use of classes whose subcomponents can be of many different types. For example, the following is a useful way of defining a variable-sized list of string values, that can have values added to it or removed from it:

val stringList : ArrayList<string>;

As well, the very useful Iterable and Iterator types can be used with a specified type that is returned when using those types. There's an iterator example later that illustrates this feature.

Parameterized classes (interfaces and enums) are defined using type parameters following the name of the class, interface or enum, as in this class that defines a very general implementation of value pairs:

class Pair<H,T> {
  val head : H;
  val tail : T;
  def this (this.head, this.tail);
  def toString () : String :> "Pair(%s,%s)" (head, tail);
}

Used like:

var myPair : Pair<string,Integer>;
myPair = new Pair<string,Integer> ("third", 3);

This simple, class-level parameterized type syntax is supported, but there's no corresponding support for parameterizing methods and the "super-type" relation.
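
For comparison, a minimal Java sketch of an equivalent generic pair class follows. It is not taken from the Bobbee libraries, and the name PairSketch is an illustrative assumption. The constructor has to assign each field explicitly, which is what the "this.head, this.tail" parameter form abbreviates.

```java
// Hypothetical comparison code: a generic pair of values in Java.
final class PairSketch<H, T> {
    final H head;
    final T tail;

    PairSketch(H head, T tail) {
        this.head = head;
        this.tail = tail;
    }

    @Override
    public String toString() {
        return String.format("Pair(%s,%s)", head, tail);
    }

    public static void main(String[] args) {
        PairSketch<String, Integer> myPair = new PairSketch<>("third", 3);
        System.out.println(myPair);
    }
}
```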

4.15 Synchronous Pipes and Streams

Bobbee supports synchronous threads -- which are called "coroutines" in other languages. Synchronous threads run in parallel as do other threads, but are implemented so that only one is running at any one time. This means that no synchronization is required for one thread to use properties of another such thread. There are two kinds of synchronous threads: object-passing pipes and text streams.

Synchronous pipes and text streams are useful when doing context-sensitive data processing, where one's position in the original input is significant for down-stream processing. Object-passing pipes are useful for things such as parsing and processing markup languages and other data encodings in parallel. Text streams are useful where what is passed is a text stream.

4.15.1 Pipes

An object-passing synchronous pipe passes objects of a particular type from one thread to another, pausing the sending pipe until the receiving pipe has used the passed value and requires another value, and pausing the receiving pipe when it requires another value until the sending pipe has a value to send to it. A single SynchronousPipe is created to communicate between two processes:

val pipe = new SynchronousPipe<string> ();

The pipe is initialized to wait for something to be written ("put") to it by one process:

pipe.put ("Start");

Once a value is written, the writing process is suspended until another process retrieves the value. That other process can wait for something to be written to the pipe, and get the passed value when it is written:

val nextValue : String = pipe.get ();

This reading process continues until it does another "get", at which time the reading process is suspended, and the writing process is resumed so that it can produce another value.
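
The put/get hand-off can be approximated in plain Java with java.util.concurrent.SynchronousQueue, as in the minimal sketch below. This is an analogy, not Bobbee's implementation: it uses ordinary preemptive threads rather than coroutines, so it does not guarantee that only one thread runs at a time. It only guarantees that each "put" waits for the matching "take". The class name, method name and end-of-stream sentinel are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

// Hypothetical comparison code: a writer thread hands values to a reader
// one at a time, each put blocking until the reader takes the value.
final class PipeSketch {

    // Unique sentinel marking the end of the stream (compared by identity).
    private static final String END = new String("END");

    static List<String> transfer(List<String> values) {
        SynchronousQueue<String> pipe = new SynchronousQueue<>();
        Thread writer = new Thread(() -> {
            try {
                for (String v : values) {
                    pipe.put(v);   // suspended here until the reader takes v
                }
                pipe.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        List<String> received = new ArrayList<>();
        try {
            // The reader is suspended in take() until the writer puts a value.
            for (String v = pipe.take(); v != END; v = pipe.take()) {
                received.add(v);
            }
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(transfer(List.of("Start", "Middle", "Finish")));
    }
}
```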

4.15.2 Text Streams

A variation on the synchronous pipe is the SynchronousStream, which instead of passing objects between synchronous threads, passes a stream of text. Text-communicating coroutines are created by calling one of two static methods of the SynchronousStream type, as in:

local System.out = outputTo (theOtherRegime);

or as in:

local System.in = inputFrom (theOtherRegime);

where "theOtherRegime" is a java.lang.Runnable class which is started out by setting its "standard output" to return text to the current thread (for "outputTo") or which is started out by setting its "standard in" to send text to the current thread (for "inputFrom"). In both cases, synchronization is kept by only allowing one of the two threads to be active at any one time.

4.16 Locally Scoped Names

A couple of features have been revived from 1960's languages.

The local prefix says that the value of the given name is to be restored on exit from the current local scope, no matter how it is exited:

local depth += 1;

means save away the current value of "depth", increment its value for use in the local scope, and restore its saved value at the end of the scope. Even if the local scope exits with a throw or an error, the restoring will happen. This reduces the corruption of data when scopes are exited in strange ways.

local can be used with qualified names as well. The following temporarily rebinds "System.out" but ensures it is restored for later use:

local System.out = new PrintStream ("myoutputfile.txt");
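
In Java, the same save-and-restore discipline has to be written out by hand with try/finally. The following minimal sketch, with an illustrative class name and method that are assumptions of this example, temporarily rebinds System.out to an in-memory buffer and restores it however the body exits.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical comparison code: the manual equivalent of a "local" rebinding.
final class LocalRebindSketch {

    // Runs body with System.out redirected, and returns what it printed.
    static String capture(Runnable body) {
        PrintStream saved = System.out;              // save the current value
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true));
        try {
            body.run();
        } finally {
            System.out.flush();
            System.setOut(saved);                    // restore, even after a throw
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String text = capture(() -> System.out.print("hidden"));
        System.out.println("captured: " + text);
    }
}
```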

4.17 Selecting

select is the Bobbee language version of "switch" in other languages. It has a number of features beyond the selecting of numeric and string values. Two forms of select are of special interest. Pattern matching can be done using select with match parts rather than case parts:

def upperize (x : string) : string {
  select x {
    match uLetter => "first" & (uLetter | ["'-"])* => "rest":
      result += upper => "first" + lower => "rest";
    match \uLetter+ => "other":
      result += => "other";
  }
  return result;
}

A value can be selected based on its type:

def show (x : Object) : string {
  select x {
    case (y : String) : return "String: \"%s\"" (y);
    case (y : Long) : return "Number: %d" (y);
    default: return "Other: %s" (x);
  }
}

The type selecting statement does three things of use:

4.18 Iterators

One can define iterators in Bobbee as methods, rather than as classes as in Java. For example, the following method creates an Iterator that returns all the space separated words in a passed-in string:

def splitWords (sentence : string) : Iterator<string> {
  selectAll sentence {
    match [" \t\n"]* & \[" \t\n"]+ => "word":
      yield => "word";
  }
}

This example illustrates the use of the selectAll statement, which extends the select statement by looping over a string, array, Iterator or other collection, by selecting parts of the string or components of the collection on each iteration of the selectAll.

Iterators are used extensively in the implementation of the language's markup libraries, where the serial parsers are all iterators of their item types.
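
By way of contrast, the following is a minimal sketch of the same word-splitting iterator written as a Java class. The class name and its helper methods are illustrative assumptions. The position in the string, which "yield" keeps track of implicitly, has to be held in a field and advanced by hand.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical comparison code: iterate over the space-separated words
// of a string using an explicit Iterator class.
final class WordIteratorSketch implements Iterator<String> {
    private final String sentence;
    private int pos = 0;

    WordIteratorSketch(String sentence) {
        this.sentence = sentence;
        skipSpace();
    }

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n';
    }

    private void skipSpace() {
        while (pos < sentence.length() && isSpace(sentence.charAt(pos))) {
            pos++;
        }
    }

    @Override
    public boolean hasNext() {
        return pos < sentence.length();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int start = pos;
        while (pos < sentence.length() && !isSpace(sentence.charAt(pos))) {
            pos++;
        }
        String word = sentence.substring(start, pos);
        skipSpace();   // leave pos at the next word, or at the end
        return word;
    }

    // Convenience helper that drains the iterator into a list.
    static List<String> words(String sentence) {
        List<String> result = new ArrayList<>();
        new WordIteratorSketch(sentence).forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("  the quick\tbrown fox\n"));
    }
}
```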

4.19 Patchable Print Streams

Patchable print streams allow later-found information, such as chapter numbers, to be used earlier in an output document, even when using serial processing.

with pps = new PatchablePrintStream () do {
  with local System.out = pps do processDocument ();
  pps.emit ();
}

Within whatever "processDocument" does, the current print output is bound to the "patchable" stream. This means "marks" can be written and defined. The "pps.emit ()" writes out the result to the current output, which in the example is the "System.out" outside of its binding to the patchable print stream. "Marks" are written to the patchable stream using "writePrintMark". In the following example, a "ref" element's "id" attribute value is used as a mark:

choose xmlElement ("ref") {
  writePrintMark (* ["id"].value);
}

"Marks" are defined by assigning values to items in the PatchablePrintStream value. In the following example, a copy of the title text is bound to the mark value given by the chapter title's "id" attribute value, if it has one:

choose xmlElement ("title") when *.parent.is ("chapter") {
  with title = new ByteArrayOutputStream () do {
    print ("<H2>");
    with local System.out = new PrintStream (title) do
      processChildren;
    with id = *.parent ["id"] do
      if id != null then {
        print ("<A NAME=\"%s\"></A>" (id.value));
        pps [id] = "<A HREF=\"#%s\">%s</A>" (id.value, title);
      }
    println ("%s</H2>" (title));
  }
}
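
One way to picture the mechanism is as a buffer of text segments interleaved with named placeholders that are filled in when the buffer is emitted. The Java sketch below illustrates that idea only. It is an assumption about the general technique, not a description of how PatchablePrintStream is implemented, and all of its names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: text and named marks are buffered in order,
// and marks are replaced by their definitions when the buffer is emitted.
final class PatchableBufferSketch {

    private static final class Part {
        final String value;     // literal text, or the name of a mark
        final boolean isMark;

        Part(String value, boolean isMark) {
            this.value = value;
            this.isMark = isMark;
        }
    }

    private final List<Part> parts = new ArrayList<>();
    private final Map<String, String> definitions = new HashMap<>();

    void print(String text) {
        parts.add(new Part(text, false));
    }

    // Leaves a placeholder whose text may not be known yet.
    void writeMark(String name) {
        parts.add(new Part(name, true));
    }

    // Supplies the text for a mark, possibly long after it was written.
    void define(String name, String text) {
        definitions.put(name, text);
    }

    // Undefined marks are emitted as empty text in this sketch.
    String emit() {
        StringBuilder out = new StringBuilder();
        for (Part p : parts) {
            out.append(p.isMark ? definitions.getOrDefault(p.value, "") : p.value);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PatchableBufferSketch pps = new PatchableBufferSketch();
        pps.print("See ");
        pps.writeMark("ch2");          // referenced before it is defined
        pps.print(" for details.");
        pps.define("ch2", "Chapter 2");
        System.out.println(pps.emit());
    }
}
```

The cost of this approach is that the whole output is held in memory until it is emitted, even when the input is processed serially.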

4.20 Program Profiles

Program features are defined by one or more "profile" files. There is one that defines the basic features of the language that is always used (the "standard profile"), and others imported at the start of a program using the program directive, as in:

program xmlserial;

Profiles define defaults for:

The user can define their own profile files, and override the "standard profile". More than one profile can be declared for a program. For example, the following says that the program can use text pattern, serial XML processing, tree XML processing and tree JSON processing:

program textpatterns, xmlserial, xmltree, jsontree;

4.21 Defining XML Processing

As an example of a profile file, here is how XML serial processing is defined:

"Use XML serial profile.";

import bj.xml.*;

def choose xmlComment (XmlCommentItem) default {}

def choose xmlDataEntity (XmlDataEntityItem)
  choose (*.entity.name)
  default {System.err.println
             ("ERROR: No rule for entity \"%s\"!" (*.entity.toRef));}

def choose xmlDocument (XmlStartDocumentItem ... XmlEndDocumentItem)
  default {processChildren;}

def choose xmlDtd (XmlStartDtdItem ... XmlEndDtdItem) default {processChildren;}

def choose xmlElement (XmlStartTagItem ... XmlEndTagItem)
  choose (*.uri default null) : (*.localName)
  default {System.err.println
             ("ERROR: No rule for element \"%s\"!" (*.element.name));
           processChildren;}

def choose xmlError (XmlErrorItem)
  default {System.err.println ("ERROR: %s" (*.message));}

def choose xmlProcessingInstruction (XmlProcessingInstructionItem)
  choose (*.target) default {}

def choose xmlText (XmlTextItem) default {print (*.data);}

def choose xmlTextEntity (XmlStartTextEntityItem ... XmlEndTextEntityItem)
  choose (*.entity.name) default {processChildren;}

def parseXmlFileSerially (systemId : String, options : String = "") : void :>
  parseXmlSerially (XmlSerialParser.parseFile (systemId, options));

def parseXmlSerially (data : String, options : String = "") : void :>
  parseXmlSerially (XmlSerialParser.parse (data, options));

def parseXmlSerially (in : InputStream = System.in,
                      options : String = "") : void :>
  parseXmlSerially (XmlSerialParser.parse (in, options));

def parseXmlSerially (buffer : CharSequence, options : String = "") : void :>
  parseXmlSerially (XmlSerialParser.parse (buffer, options));

def parseXmlSerially (source : Readable, options : String = "") : void :>
  parseXmlSerially (XmlSerialParser.parse (source, options));

def parseXmlSerially (parser : XmlSerialParser) {
  val iterator : Iterator<XmlItem> = parser.iterator ();
  for item = iterator do
    $processAllChildren (item, iterator);
}

This definition for a serial XML parser consists of:

The choose definitions consist of:

5 Conclusion

Tools for processing text and markup (programming languages) have slowly developed over the last six decades and one thinks that that development will continue. I'm hoping that the Bobbee programming language will participate in that continued development, either being the next step forward, or providing ideas for the future.

As to where the Bobbee language is at the moment, I'm still in the process of debugging the implementation, refining its user documentation, and documenting its code, so that others can help with or take over maintaining it.

How it'll be distributed, and whether it'll be for sale or freely distributed, I don't yet know. But once the implementation and its documentation are completed, I'll be working on figuring that out.

Appendix A — BML: The Smallest Markup Language

There's been a general trend toward simplifying markup languages, initiated by a not necessarily accurate perception that SGML was too complex. XML was the first step in this trend; JSON and MicroXML are further attempts to simplify things. A new markup language called BML (Basic Markup Language), minimalistic in all respects, has been created strictly as an aid in developing the Bobbee programming language: a simple markup language was needed to test the language's markup features. That given, BML can also be considered an example of how simple a markup language can become while remaining useful:

These limitations are appropriate for machine-to-machine communication, with only the occasional intervention by human and similar-type beings.

The BML package, implemented using the Bobbee language, supports both rule-based and procedural processing of BML data, and supports both serial and tree-based rule processing.

A.1 An Example of Using BML

Here's a small fragment of BML markup:

(p (text: (Each\ node:)))
(list
  (item (text: (is\ surrounded\ in\ parentheses,)))
  (item (text: (consists\ of\ a\ text\ sequence,\ and)))
  (item (text: (has\ zero\ or\ more\ nested\ nodes.)))
)

The designer of a particular markup notation determines which nodes are markup labels and which are data, in the same way that they determine the import of any element or data in XML. In this example, the subcomponents of the "text:" tag are text, and everything else is markup.

Although BML input superficially looks like Lisp s-expressions, it is textual markup and data. Don't be fooled by the parentheses. On the other hand, yes, BML was inspired, in part, by Lisp, which is itself a markup language.
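The three properties spelled out in the fragment (parentheses, a text sequence, nested nodes) are enough to sketch a reader for it. The following Python sketch is not the Bobbee BML package. It assumes that a backslash escapes the next character, that unescaped whitespace or a parenthesis ends a node's text, and that the text may be empty. The names Node and parse_bml are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One BML node: a text sequence plus zero or more nested nodes."""
    text: str = ""
    children: list = field(default_factory=list)


def parse_bml(source):
    """Parse BML text into a list of top-level Node objects.

    Assumed syntax (inferred from the examples, not from a specification):
    node ::= "(" text? node* ")", with backslash escaping the next character.
    """
    pos = 0
    end = len(source)

    def skip_space():
        nonlocal pos
        while pos < end and source[pos].isspace():
            pos += 1

    def parse_node():
        nonlocal pos
        if pos >= end or source[pos] != "(":
            raise ValueError(f"expected '(' at offset {pos}")
        pos += 1
        skip_space()
        chars = []
        # Collect the node's text up to whitespace or a parenthesis.
        while pos < end and source[pos] not in "()" and not source[pos].isspace():
            if source[pos] == "\\":
                pos += 1
                if pos >= end:
                    raise ValueError("backslash at end of input")
            chars.append(source[pos])
            pos += 1
        node = Node("".join(chars))
        skip_space()
        # Collect any nested nodes.
        while pos < end and source[pos] == "(":
            node.children.append(parse_node())
            skip_space()
        if pos >= end or source[pos] != ")":
            raise ValueError(f"expected ')' at offset {pos}")
        pos += 1
        return node

    nodes = []
    skip_space()
    while pos < end:
        nodes.append(parse_node())
        skip_space()
    return nodes


doc = parse_bml(r"(p (text: (Each\ node:)))")
print(doc[0].children[0].children[0].text)  # prints: Each node:
```

Whether "text:" is markup or data is not decided by this reader; as in BML itself, that is left to whoever interprets the resulting tree.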

Here's a program that translates this BML fragment into an HTML fragment:

program bmltree;

choose bmlNode ("p") {
  print ("<P>"); processChildren; println ("</P>");
}
choose bmlNode ("list") {
  print ("<UL>"); processChildren; println ("</UL>");
}
choose bmlNode ("item") {
  print ("<LI><P>"); processChildren; println ("</P></LI>");
}
choose bmlNode ("text:") {
  print (* [0].textValue);
}

The program translates the BML input into the following HTML fragment:

<P>Each node:</P>
<UL><LI><P>is surrounded in parentheses,</P></LI>
<LI><P>consists of a text sequence, and</P></LI>
<LI><P>has zero or more nested nodes.</P></LI>
</UL>
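For readers more at home in a general-purpose language, the rule-per-node-name style of the program above can be approximated in Python with a dispatch table. This is only an illustrative sketch: the (text, children) tuple representation and every function name here are assumptions, and the fallback rule simply echoes the node in parenthesized form.

```python
import io

# A node is represented as a (text, children) pair; this is an assumption
# made for the sketch, not the Bobbee BmlNode class.


def process_children(node, out):
    """Apply the matching rule to each nested node, in order."""
    for child in node[1]:
        process(child, out)


def rule_p(node, out):
    out.write("<P>")
    process_children(node, out)
    out.write("</P>\n")


def rule_list(node, out):
    out.write("<UL>")
    process_children(node, out)
    out.write("</UL>\n")


def rule_item(node, out):
    out.write("<LI><P>")
    process_children(node, out)
    out.write("</P></LI>\n")


def rule_text(node, out):
    # The data is the text value of the first nested node.
    out.write(node[1][0][0])


def rule_default(node, out):
    # Fallback for node names without a rule: echo the node back.
    out.write("(" + node[0])
    process_children(node, out)
    out.write(")")


RULES = {"p": rule_p, "list": rule_list, "item": rule_item, "text:": rule_text}


def process(node, out):
    """Select a rule by the node's text value and run it."""
    RULES.get(node[0], rule_default)(node, out)


def to_html(nodes):
    out = io.StringIO()
    for node in nodes:
        process(node, out)
    return out.getvalue()


print(to_html([("p", [("text:", [("Each node:", [])])])]), end="")
# prints: <P>Each node:</P>
```

The dictionary lookup plays the part of rule selection; what the Bobbee version adds is that the rules are declared independently and the selection key is fixed once, in the profile.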

The example BML program processes BML input as follows:

A.2 How BML Tree Processing Is Defined

The BML implementation is defined entirely using the Bobbee language itself, like the other markup languages. The language mechanisms are defined below and can be used to define any markup language, and any markup language processing a user wishes. At present, there are packages for SGML, XML, JSON, MicroXML and BML processing available, but others can be easily added.

This is the file that defines BML tree processing. It declares that BML parsing is done by the bj.bml.* package with help from the bj.patternmatching.* package.

# Display a message when compiling:
"Use BML tree profile.";

# Import what's needed by these declarations or the user:
import bj.bml.*, bj.patternmatching.*;

# Define the "choose" rules, what's used as the selecting
# "name", and the rule's default behaviour, if any:
def choose bmlNode (BmlNode) choose (*.textValue)
  default {print ("(" + *.textValue); processChildren;
           print (")");}

# Define the different ways that markup processing can be
# initiated, especially what kinds of inputs are supported:
def parseBml (in : CharSequence) : void :>
  bmlNode (BmlTreeParser.parse (in));

def parseBml (in : InputStream) : void :>
  bmlNode (BmlTreeParser.parse (in));

def parseBml (source : MatchableInput) : void :>
  bmlNode (BmlTreeParser.parse (source));

def parseBml (in : Readable) : void :>
  bmlNode (BmlTreeParser.parse (in));

def parseBml (data : String) : void :>
  bmlNode (BmlTreeParser.parse (data));

def parseBmlFile (fileName : String) : void :>
  bmlNode (BmlTreeParser.parseFile (fileName));

The def choose declaration defines the meaning of a "bmlNode" rule:

A.3 A Couple More Examples

A final example of using Bobbee and BML is the following pair of programs, the first of which translates BML into a simple XML form, and the second of which translates XML into BML. These programs translate the markup form of one markup language into another. They don't change the node names or structure of the documents.

Here's BML-to-XML, using tree processing:

program oldstyle, textpatterns, bmltree, xmltree;

parseBml ("(a ((a1 (a1v))) (b () (text: (Some\\ text.))) (c ()))");
println ();

choose bmlNode ("text:") {
  processText (*);
}

choose bmlNode {
  print ("<%s" (XmlNode.escapeXml (*.textValue)));
  for a = * [0].children do {
    print (" %s=\"" (a.textValue));
    processText (a);
    print ("\"");
  }
  if length * > 1 then {
    print (">");
    for i = 1 to length * - 1 do
      bmlNode (* [i]);
    print ("</%s>" (XmlNode.escapeXml (*.textValue)));
  } else
    print ("/>");
}

def processText (node : BmlNode) {
  for subNode = node.children do {
    print (XmlNode.escapeXml (subNode.textValue));
    processText (subNode);
  }
}

Here's XML-to-BML, using serial processing:

program oldstyle, textpatterns, xmlserial, bmlserial;

parseXmlSerially ("<a a1=\"a1v\"><b>Some text.</b><c/></a>");
println ();

choose xmlElement {
  print (" (%s (" (BmlItem.escape (*.qName)));
  for a = *.attributes do
    print ("(%s (%s))" (a.qName, BmlItem.escape (a.allValues)));
  print (")");
  processChildren;
  print (")");
}

choose xmlText {
  print (" (text: (%s))" (BmlItem.escape (*.data)));
}
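Both programs rely on escaping so that data survives the change of notation. The behaviour of BmlItem.escape is not spelled out here; the Python sketch below assumes it places a backslash before whitespace, parentheses and backslashes, which is consistent with the "Some\ text." data in the BML-to-XML example. The names bml_escape and bml_unescape are hypothetical.

```python
def bml_escape(text):
    """Backslash-escape the characters assumed to be significant in BML."""
    escaped = []
    for ch in text:
        if ch in "()\\" or ch.isspace():
            escaped.append("\\")
        escaped.append(ch)
    return "".join(escaped)


def bml_unescape(text):
    """Reverse bml_escape: drop each backslash and keep the character after it."""
    result = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "")
        result.append(ch)
    return "".join(result)


print(bml_escape("Some text."))  # prints: Some\ text.
```

Under these assumptions the two functions are inverses, so text passed through XML-to-BML and back again is unchanged.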

References

SNOBOL: http://snobol4.com

Icon: http://cs.arizona.edu/icon

HUGO: http://wilmott.ca/hugo.pdf

Perl: https://www.perl.org

XSL Transformations (XSLT): http://www.w3.org/TR/xslt

Extensible Markup Language (XML) 1.0 (Fifth Edition): https://www.w3.org/TR/2008/REC-xml-20081126

OmniMark Developer Resources: http://developers.omnimark.com

ISO 8879:1986 Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML): https://www.iso.org/standard/16387.html

JSON: http://json.org

FORTRAN: http://www.phy.pku.edu.cn/climate/class/software/Fortran95-Manual.pdf

COBOL: http://math.uni.lodz.pl/~arogow/os390/podr/cobol-manual.pdf

Footnotes

1. In both cases, the JSON support is provided by the conversion of JSON to XML, rather than direct support of JSON itself.

2. I call it an "attempt", based on what English change-ringing performances are called prior to their completion: it is an "attempt" before it is successfully finished, and only called a "touch", "quarter-peal" or "peal" once ringing has been finished and what has been rung has conformed to what was intended. This language is an "attempt" prior to its completion and prior to being accepted by a significant number of users.

Keywords: Markup Language Implementation, XML, SGML, MicroXML, JSON, BML

Sam Wilmott, sam@wilmott.ca

Sam Wilmott designed his first programming language in the winter of 1967-1968 and was using early non-standardized markup languages in the late 1960's. Since then he has led the development of typesetting/text-formatting systems for the Canadian Government Printing Office (in the 1970's) and for a major real-estate company (in the 1980's), implemented one of the first SGML parsers (which was also the first pull-model markup parser), and is the originator of the OmniMark programming language (in the early 1990's), with its strong support of SGML, XML, and text transformation.

After leaving OmniMark, Sam worked in the XSLT world: he contributed to the implementation of an XSLT compiler and worked as an XSLT programmer and analyst (in the early 2000's). Currently he is largely retired, happily married, does voluntary work locally and walks a little dog every day, but in spite of his advancing age, he is nonetheless working on new programming language ideas for markup language and text processing.