Syntax tree generator with XSLT

My employer publishes (among other things) books on linguistics. These are full of pretty tree diagrams that show the structure of sentences.

Here's an example.

A simple syntax tree diagram showing the structure of the sentence "This is a wug"

The diagram contains a sentence and breaks it down into labelled syntactic components.

When we produce an ebook version, these diagrams are rendered as GIF images, which are generally unsatisfactory: the text in the diagram does not match the surrounding text, and the images don't scale.

I've been saying for years that we should produce these diagrams in SVG, but it's never quite reached the top of the work priority list.

Anyway, about a month ago my son (now a linguistics student) showed me a tool he was using to generate syntax tree diagrams.

This is written in JavaScript and converts a plain text representation of the tree into a PNG file.

The plain text consists of a bunch of nested square bracketed expressions, each consisting of a label and one or more values. A value may be a simple string (part of a sentence) or a nested expression.

Here's the example from the tool:

[S [NP This] [VP [V is] [^NP a wug]]]

Or with indentation:

[S 
    [NP This] 
    [VP 
        [V is] 
        [^NP a wug]
    ]
]

In a rare moment of clarity I realised that this plain text format was structured text and could be converted to produce an SVG version of the tree diagram, using XSLT.

I must have said this out loud because my son asked if I could get it done by the end of August.

So I took a look. My first thought was that invisible XML might be part of the solution.

I wrote a grammar and got some XML out using John Lumley's iXML parser.

But I realised that it was going to take me too long to plug this into a second tool to turn the XML into SVG, so I wrote XSLT with my own parser for this format. The parser is basically a couple of regular expressions that identify a "balanced" expression, i.e. one with the same number of "[" as "]" characters.
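To make the shape of the parsing problem concrete, here is a minimal JavaScript sketch. It is not the regular-expression approach used in the XSLT: it swaps in a small recursive-descent parser, and the function name parseTree and the { label, children } node shape are assumptions made for this illustration.

```javascript
// Hypothetical recursive-descent parser for the bracket notation.
// Each node is { label, children }; a child is either a string
// (part of the sentence) or another node.
function parseTree(source) {
  // A token is "[", "]", or a run of characters that is neither a
  // bracket nor white space, so spaces and line breaks are ignored.
  const tokens = source.match(/\[|\]|[^\[\]\s]+/g) || [];
  let pos = 0;

  function parseExpression() {
    if (tokens[pos] !== "[") throw new Error("Expected '[' at token " + pos);
    pos++; // consume "["
    const label = tokens[pos++];
    if (label === undefined || label === "[" || label === "]") {
      throw new Error("Expected a label after '['");
    }
    const children = [];
    let words = []; // adjacent words are joined into one string value
    const flush = () => {
      if (words.length) children.push(words.join(" "));
      words = [];
    };
    while (pos < tokens.length && tokens[pos] !== "]") {
      if (tokens[pos] === "[") {
        flush();
        children.push(parseExpression());
      } else {
        words.push(tokens[pos++]);
      }
    }
    if (tokens[pos] !== "]") throw new Error("Unbalanced brackets: missing ']'");
    pos++; // consume "]"
    flush();
    return { label, children };
  }

  const tree = parseExpression();
  if (pos !== tokens.length) throw new Error("Unexpected text after the closing ']'");
  return tree;
}

const tree = parseTree("[S [NP This] [VP [V is] [^NP a wug]]]");
console.log(JSON.stringify(tree, null, 2));
```

An unbalanced input such as "[S [NP This]" makes the sketch throw, which is the same condition the balanced-expression check guards against.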

The first pass over the source plain text creates an intermediate XML format. This is then run through a couple of other XSLT templates to add widths and coordinates to the text items. There are parameters and variables to set margins and spacing between items.

<dc:expression x="5" y="5" width="38.52864456176758">
   <dc:category 
       width="34.088539123535156" 
       x="7.220052719116211" 
       y="31.666666666666664">NP</dc:category>
   <dc:values>
      <dc:value 
          width="38.52864456176758" 
          x="5" 
          y="98.33333333333331">this</dc:value>
   </dc:values>
</dc:expression>
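The width and coordinate passes can be pictured with a small JavaScript sketch. It is an illustration under assumptions, not the XSLT itself: the function names computeWidths and assignPositions, the node shape { label, children }, the fixed row height and the measure callback are all hypothetical. Widths are computed from the leaves upwards, then positions are handed down from the root, with each label centred over its expression.

```javascript
// Pass 1 (bottom-up): an expression is as wide as the wider of its label
// and its children laid side by side with `gap` between them.
// `measure` is a hypothetical callback returning the width of a string.
function computeWidths(node, measure, gap) {
  if (typeof node === "string") return { text: node, width: measure(node) };
  const children = node.children.map((c) => computeWidths(c, measure, gap));
  const childrenWidth =
    children.reduce((sum, c) => sum + c.width, 0) +
    gap * Math.max(0, children.length - 1);
  const labelWidth = measure(node.label);
  return {
    label: node.label,
    labelWidth,
    childrenWidth,
    width: Math.max(labelWidth, childrenWidth),
    children,
  };
}

// Pass 2 (top-down): place each box at (x, y) and its children one row lower.
function assignPositions(box, x, y, rowHeight, gap) {
  if (box.text !== undefined) return { ...box, x, y };
  // Centre the row of children if the label is wider than they are.
  let cursor = x + (box.width - box.childrenWidth) / 2;
  const children = box.children.map((c) => {
    const placed = assignPositions(c, cursor, y + rowHeight, rowHeight, gap);
    cursor += c.width + gap;
    return placed;
  });
  // Centre the label over the whole expression.
  const labelX = x + (box.width - box.labelWidth) / 2;
  return { ...box, x, y, labelX, children };
}

// Usage with a stand-in measure function (10 units per character).
const sample = {
  label: "S",
  children: [
    { label: "NP", children: ["This"] },
    { label: "VP", children: ["is"] },
  ],
};
const boxes = computeWidths(sample, (s) => s.length * 10, 10);
const placed = assignPositions(boxes, 5, 5, 40, 10);
console.log(placed.width, placed.labelX, placed.children[1].x);
```

Splitting the work into two passes is one way to cope with the fact that a label's position depends on everything beneath it in the tree.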

The only part of this process that unavoidably requires JavaScript is calculating the width of a text item, which is not possible with XSLT. So we have a JavaScript function that creates a canvas in the document and then uses the measureText method of its 2D drawing context to get the width. The font information (e.g. "bold 12pt serif") is supplied with parameters that the user will be able to control. (Requiring a bit more JavaScript.)
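A minimal sketch of such a measuring function follows. The name measureTextWidth and the optional doc parameter are assumptions for this illustration; the canvas calls themselves (createElement, getContext("2d"), the font property and measureText) are the standard browser API. The font string is expected in CSS font shorthand order, e.g. "bold 12pt serif".

```javascript
// Measure the rendered width of a string with an off-screen canvas.
// `font` is a CSS font shorthand string such as "bold 12pt serif".
// `doc` defaults to the browser's global document; it is a parameter
// only so the function can be exercised outside a browser.
function measureTextWidth(text, font, doc = document) {
  const canvas = doc.createElement("canvas");
  const context = canvas.getContext("2d");
  context.font = font;
  // measureText returns a TextMetrics object; width is in CSS pixels.
  return context.measureText(text).width;
}
```

In a browser, measureTextWidth("This", "12pt serif") returns a fractional pixel width, which would account for long decimal values like those in the intermediate XML.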

Syntax trees may contain arrows joining values. The plain text format indicates the start of an arrow with "<1>" and the end with "_1". A second arrow might use "<2>" etc.

Here's an example:

[CP 
    [C'
        [C \0]
        [TP
            [DP^ Sarah]
            [T'
                [V+T_1 BE+{pres}]
                [VP 
                    [V'
                        tV<1>
                        [VP
                            [V'
                                [V eating]
                                [DP^ fruit]
                            ]
                        ]
                    ]
                ]
            ]
        ]
    ]
]

A more complex syntax tree diagram showing a movement arrow
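As a rough illustration of how the arrow markers might be picked out of the plain text, here is a JavaScript sketch. It assumes a marker always sits at the very end of a label or value, as in "tV<1>" and "V+T_1" above; the function names splitArrowMarker and pairArrows are hypothetical, and the real tool may be more permissive.

```javascript
// Split a label or value into its display text and an optional arrow marker.
// Assumption: the marker is the last thing in the token.
function splitArrowMarker(token) {
  const start = token.match(/^(.*)<(\d+)>$/);
  if (start) {
    return { text: start[1], arrow: { id: Number(start[2]), role: "start" } };
  }
  const end = token.match(/^(.*)_(\d+)$/);
  if (end) {
    return { text: end[1], arrow: { id: Number(end[2]), role: "end" } };
  }
  return { text: token, arrow: null };
}

// Pair up start and end markers that share a number.
// Returns [{ id, start, end }] where start and end are indexes into `tokens`.
function pairArrows(tokens) {
  const arrows = new Map();
  tokens.forEach((token, index) => {
    const { arrow } = splitArrowMarker(token);
    if (!arrow) return;
    const entry = arrows.get(arrow.id) || { id: arrow.id };
    entry[arrow.role] = index;
    arrows.set(arrow.id, entry);
  });
  // Keep only arrows that have both ends.
  return [...arrows.values()].filter(
    (a) => a.start !== undefined && a.end !== undefined
  );
}

console.log(splitArrowMarker("tV<1>"));
console.log(pairArrows(["V+T_1", "BE+{pres}", "tV<1>"]));
```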

Another XSLT pass calculates the geometry of the arrows.

So that an arrow does not cross over another part of the diagram, the XSLT identifies the text items between the start and end points of the arrow, and places the bottom of the arrow beneath the lowest of these items (which have coordinates by this point).
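The idea can be sketched in a few lines of JavaScript. This is an illustration only: the function name arrowBottom, the item shape { x, y, width } and the clearance parameter are assumptions, and y grows downwards as it does in SVG.

```javascript
// Find the y coordinate for the bottom of an arrow so that it passes
// beneath every item lying horizontally between its two ends.
// items: [{ x, y, width }], start/end: { x, y }, clearance: extra space below.
function arrowBottom(items, start, end, clearance) {
  const left = Math.min(start.x, end.x);
  const right = Math.max(start.x, end.x);
  // Keep every item whose horizontal extent overlaps the arrow's span.
  const between = items.filter((i) => i.x + i.width >= left && i.x <= right);
  // "Lowest" on the page means the largest y value.
  const lowest = Math.max(start.y, end.y, ...between.map((i) => i.y));
  return lowest + clearance;
}

const items = [
  { x: 0, y: 100, width: 20 },
  { x: 50, y: 160, width: 30 }, // hangs lower than both ends of the arrow
  { x: 200, y: 300, width: 10 }, // outside the arrow's span, so ignored
];
console.log(arrowBottom(items, { x: 10, y: 100 }, { x: 120, y: 100 }, 15)); // 175
```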

A syntax tree diagram showing a movement arrow ducking under an intervening item in the diagram

Now we can generate the SVG with another template, using the coordinates to place the text items and draw lines and arrows between them with <line> and <path>. An arrow is made up of a pair of quadratic curves and a triangle.
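For illustration, one way to build such an arrow as SVG path data is sketched below in JavaScript. The function name arrowSvg, the headSize parameter, the stroke and fill values, and the assumption that the arrow arrives at its end point from directly below are all made up for this sketch; the geometry in the actual stylesheet may differ.

```javascript
// Build SVG markup for an arrow that leaves `start`, dips to `bottomY`
// and rises into `end`. start/end: { x, y }; y grows downwards.
function arrowSvg(start, end, bottomY, headSize = 6) {
  const midX = (start.x + end.x) / 2;
  // The shaft stops at the base of the arrowhead.
  const shaftEndY = end.y + headSize;
  // Two quadratic curves: down from the start to the lowest point,
  // then up to the base of the arrowhead. Each control point sits at
  // the bottom level directly below one end of the arrow.
  const shaft =
    `M ${start.x} ${start.y} ` +
    `Q ${start.x} ${bottomY} ${midX} ${bottomY} ` +
    `Q ${end.x} ${bottomY} ${end.x} ${shaftEndY}`;
  // Triangle with its tip on the end point, pointing upwards.
  const half = headSize / 2;
  const head =
    `M ${end.x} ${end.y} ` +
    `L ${end.x - half} ${shaftEndY} ` +
    `L ${end.x + half} ${shaftEndY} Z`;
  return (
    `<path d="${shaft}" fill="none" stroke="black"/>` +
    `<path d="${head}" fill="black"/>`
  );
}

console.log(arrowSvg({ x: 10, y: 100 }, { x: 110, y: 100 }, 150));
```

Because the second control point is directly below the end point, the curve arrives vertically, so an upward-pointing triangle lines up with it.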

There's a complication in drawing the movement arrows: if a text item is both the start and end point of an arrow, these points must be displaced horizontally so as not to overlap. This requires checking the directions of the arrows (left or right) and which of them has the lowest arc in the diagram.

Finally I added a few bells and whistles to allow the user to select the font styling and line colour, and to download the resulting SVG file.

The XSLT is compiled and run in the browser with SaxonJS.

Now available to users at https://linguistics.datacraft.co.uk

Challenges

Writing the regular expressions for parsing the plain text format took a few goes.

I wanted to be liberal about letting the user add spaces and line breaks to their plain text, to make it more readable. This comes at the expense of clarity in the regular expression, since by default the . character in a regular expression does not match a line break.
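The line-break behaviour is easy to demonstrate. The snippet below uses JavaScript regular expressions purely as an illustration; the pattern is made up and is not one of the expressions from the stylesheet. XPath regular expressions behave the same way by default, and the XPath functions matches, replace, tokenize and xsl:analyze-string also accept an "s" flag that turns on the same dot-all mode.

```javascript
// A bracketed expression with a line break inside it.
const source = "[NP\n    This]";

// Without a flag, "." stops at the line break, so this does not match.
const withoutFlag = /\[NP.*\]/.test(source);

// With the "s" (dot-all) flag, "." also matches line breaks.
const withDotAll = /\[NP.*\]/s.test(source);

// An explicit character class works without any flag, but is harder to read.
const withClass = /\[NP[\s\S]*\]/.test(source);

console.log(withoutFlag, withDotAll, withClass); // false true true
```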

XSLT processes XML from the root node downwards. But the coordinates of a category label depend on what is beneath it in the tree. The XSLT code is a mixture of templates and recursive function calls to work within this architecture.

The position and width of a nested expression are calculated multiple times: once for its own position in the SVG, and again for each of its ancestor expressions in the tree.

Next steps

There are probably more features I could add. I'm relying on my son to use the tool for his assignments and tell me what's broken or missing.

I'm also starting to write XSpec unit tests. To get to this stage I have been repeatedly pasting in examples linked from Drake Prebyl's fork of syntree, which has been very useful, but more time-consuming than necessary.

Using XSpec with this kind of stylesheet is challenging, because it includes iXSL and JavaScript functions that are recognized by SaxonJS but not defined in the XSLT code.

So a basic XSpec test suite that points at the stylesheet does not work. It is not executed with SaxonJS and falls over because of the undefined functions.

Fortunately I found a paper from Balisage 2020 that provided a solution. This involves creating a "test harness" stylesheet that includes dummy definitions for these functions and imports the actual stylesheet.