This post covers a technique that’s an oldie but a goodie, with some thoughts on how it applies with MarkLogic features. We will cover some available implementations and the raw technique itself, and when to use each of them.

It’s a common problem: you’ve got some XML in one format and you need to change it. With MarkLogic, you can make updates to stored documents using functions like xdmp:node-insert-child(), but when you want to update nodes that you have in memory, you need to turn to a different technique.

XQuery versus XSLT

Some of you reading this will be thinking, “that’s easy, just apply an XSL transform” — and you’re right, that’s a good way to do it, if you know XSLT. Personally, I learned XQuery first and never learned XSLT. There’s an XQuery wikibook where others have written up their thoughts on how XQuery compares to XSLT; I’ll refer rather than rehash. For me, it’s a simple matter of already having a tool that does the job well, so I spent my time learning other stuff.

XQuery/Typeswitch Transformations

The essential tool in using XQuery to transform XML is a recursive function with a typeswitch. If you haven’t encountered it before, a typeswitch is the switch statement we know from Java and other languages. The XQuery wikibook has a pretty good page discussing XQuery/Typeswitch Transformations.

xquery version "1.0-ml";

declare variable $stuff :=
  <doc>
    <content>
      <change-me my-attr="look at me">some text</change-me>
      <count>1</count>
      <name>Fred Smith</name>
    </content>
  </doc>;

declare function local:change($node)
{
  typeswitch($node)
  case element() return 
    element { fn:node-name($node) } {
      $node/@*,
      $node/node() ! local:change(.)
    }
  default return $node
};

local:change($stuff)

typeswitch.xqy hosted with ❤ by GitHub

typeswitch.xqy shows the essence of the technique. Note that it doesn’t yet make any changes; this is just showing the mechanics. The cases in the typeswitch check for some type of node. When there’s a match, it returns something. To transform an element, we create a new element with the desired change, and then (typically) recursively call the function on the element’s children. For any node that doesn’t match (such as a text node), we just return the node.

With this approach, we can change namespaces, local names, add or remove children, and change the text content of an element. Note that we can’t write a case to match an attribute, so we make attribute changes by matching on the element the attribute belongs to.

Adding a Namespace

Let’s make this more interesting by adding a namespace to the change-me element.

declare namespace dmc = "dmc";

declare function local:change($node)
{
  typeswitch($node)
  case element(change-me) return
    element { xs:QName("dmc:" || fn:local-name($node)) } {
      $node/@*,
      $node/node() ! local:change(.)
    }
  case element() return 
    element { fn:node-name($node) } {
      $node/@*,
      $node/node() ! local:change(.)
    }
  default return $node
};

add-namespace.xqy hosted with ❤ by GitHub

We’ve added a case that targets the change-me element. Note that order matters: the first matching case will win. This is similar to our general case statement, but we’re modifying the namespace as we create the new element.

This highlights an important aspect of the technique: we are creating a whole, new XML node, a modified copy of the original. We are not modifying in place. More on why that’s important in a bit.

FunctX Functions

FunctX is a collection of XQuery functions covering a range of common needs. A copy of the library is distributed with MarkLogic, so here’s the same change as the above, but using an off-the-shelf implementation:

xquery version "1.0-ml";

import module namespace functx = "http://www.functx.com" at "/MarkLogic/functx/functx-1.0-nodoc-2007-01.xqy";

declare namespace dmc = "dmc";

declare variable $stuff :=
  <doc>
    <content>
      <change-me my-attr="look at me">some text</change-me>
      <count>1</count>
      <name>Fred Smith</name>
    </content>
  </doc>;

functx:change-element-names-deep(
  $stuff,
  xs:QName("change-me"),
  xs:QName("dmc:change-me")
)

functx-namespace.xqy hosted with ❤ by GitHub

That was easy. So if we have an existing function that does what we need, why bother looking at the recursive descent code?

First, it’s useful to understand the implications of what the libraries you are using are doing. That lets you make informed decisions about when to use them.

Second, building on the first, suppose you have a much bigger XML node to start with, and you’re going to be running on a lot of them. So far, you’ll still want to use functx. But now suppose you need to do multiple changes that can’t be handled by one of those functions. You’ll want to consolidate that into one run through the XML structure.

Multiple Changes

Here’s another version of our local:change function, this time adding a count increase and redaction (no, I can’t think of why you’d update a count in a non-persisted transformation; work with me here):

declare function local:change($node)
{
  typeswitch($node)
  case element(change-me) return
    element { xs:QName("dmc:" || fn:local-name($node)) } {
      $node/@*,
      $node/node() ! local:change(.)
    }
  case element(name) return element name { "redacted" }
  case element(count) return
    element count {
      xs:int($node) + 1
    }
  case element() return 
    element { fn:node-name($node) } {
      $node/@*,
      $node/node() ! local:change(.)
    }
  default return $node
};

typeswitch-multiple.xqy hosted with ❤ by GitHub

This illustrates making multiple changes to the XML structure using a single descent through the XML.

In-Memory-Update Library

This review would be incomplete without mentioning another library that comes with MarkLogic: /Modules/MarkLogic/appservices/utils/in-mem-update.xqy. This library module contains five functions that are analogous to the xdmp:node-* functions, but act on in-memory XML nodes instead of in-database documents. For easy reference, the five are:

  • mem:node-insert-child()
  • mem:node-insert-before()
  • mem:node-insert-after()
  • mem:node-replace()
  • mem:node-delete()

Note that these functions use the recursive descent approach as well (see mem:_process()). You can use these functions in your code after importing them:

import module namespace mem = "http://xqdev.com/in-mem-update" at 
  "/MarkLogic/appservices/utils/in-mem-update.xqy";

import-mem.xqy hosted with ❤ by GitHub

When To Use

You can use this technique anytime you want to transform a block of XML. Commonly, you’d use it during ingest (putting data into a better format before storing it) or for display (formatting, supplementing, or redacting data before showing it to the user). With the MarkLogic REST API, applying transforms during ingest and display has become a common pattern.

A member of my team recently had a project in which Office 2007 documents were to be stored, but needed to have internal links updated to reflect the documents’ new home in the database. The documents were opened up using ooxml:package-parts(), surgically adjusted using the transformation method above, then rebuilt as zips using xdmp:zip-create(). No need to create temporary docs in the database; no need to worry about transactions.

This article first appeared as a post on David’s blog.

This website uses cookies.

By continuing to use this website you are giving consent to cookies being used in accordance with the MarkLogic Privacy Statement.