My Trusted Companion: XML (Which Is Now Everywhere)

The XML hype curve has long since passed … but where did XML end up? Dragged through the trough of disillusionment, left broken and forgotten in the gutter, looking up forlornly at the stars of Big Data? Or did XML make it through the wilderness to the promised land of the plateau of productivity?

And why should you care? Well, while we wade through the ever-growing complexity of today’s IT landscape, replete with Big Data Analytics, Hadoop, Spark, Machine Learning, hundreds of NoSQL players, Microservices, *aaS, IoT, SMAC, blockchain … and on and on, liddle ol’ XML continues to be the lifeblood of many organizations. So take a moment to think about how you treat the data format that underpins your intellectual property. First-class citizen or afterthought?

But hang on … what XML? We don’t have XML in our organization.

Yeah, you might think you don’t have XML … but are you sure? What’s moving around on the ESB and message queues? What do Microservices and REST endpoints speak? What are the data interchange formats provided to or received from business partners?

XML standards exist in most industries. Examples include FpML, FIXML and XBRL in Financial Services, BXF (Broadcast Exchange Format) in Broadcasting, ICD-10 in Healthcare, and LegalXML. Standards organizations use XML for describing standards, scientific publishing is full of XML … even the MS Office documents flowing around your organization are an XML format (OOXML). The reality is that XML is everywhere, and is likely coursing through your organization.

But XML is not in our core databases … They are relational …

Ah… so now I have to ask: Why is XML not in your core databases?

Do you remember when XML hit the IT world? Perhaps you worked in an industry where your data was what is now called unstructured or semi-structured. If so, it was huge … SGML had only ever been successfully implemented in a small number of industries. It required specialists and special toolsets. There’d be that person on the team who was a “doc head” and knew James Clark’s SP parser and concepts of SGML DTDs that made the average developer’s toes curl. Then XML hit the scene. It was simple yet rich enough to model most data requirements. Open-source tools proliferated. Standards were rapidly built. Traditional databases scrambled to add XML support. It was going to solve world hunger.

And then the trough …

The implementation of XML in your application wasn’t as smooth as many would have liked … some of the tools had heavyweight memory requirements and performance issues, and some developers complained XML was too verbose and complex for simple data models … The rise of JSON has in part been a response to those issues. However, the biggest issue by far was storing and querying your XML. The choices were poor: store it in file systems, or shred it into a relational schema, with the important fields extracted into columns and the rest kept as an un-queryable, unsearchable CLOB … not much more than a multi-key key/value store … and we all know from today’s key/value stores that they have a limited set of uses.
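To make the shredding trade-off concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `articles` table, its columns and the sample document are all hypothetical, and a plain `TEXT` column stands in for the CLOB. It is only meant to illustrate the pattern described above, not any particular product’s XML support.

```python
import sqlite3

# Hypothetical document; element names and content are illustrative only.
raw_xml = (
    "<article><title>Q3 results</title>"
    "<body>Revenue rose in <region>EMEA</region>.</body></article>"
)

conn = sqlite3.connect(":memory:")
# One "important" field is extracted; the rest sits in a TEXT column (our stand-in CLOB).
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body_xml TEXT)"
)
conn.execute(
    "INSERT INTO articles (title, body_xml) VALUES (?, ?)",
    ("Q3 results", raw_xml),
)

# The extracted column can be queried (and indexed) like any relational field.
by_title = conn.execute(
    "SELECT id FROM articles WHERE title = ?", ("Q3 results",)
).fetchall()

# Everything left in the blob is reachable only by substring matching,
# which knows nothing about elements: this "hit" comes from a tag name.
tag_noise = conn.execute(
    "SELECT id FROM articles WHERE body_xml LIKE ?", ("%body%",)
).fetchall()

print(by_title, tag_noise)
```

The second query matches even though the word “body” never appears in the article’s text, which is the kind of blunt, structure-blind searching the CLOB approach leaves you with.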

If your XML was really, really simple and you had no regard for performance, then the XML support bolted on top of the relational algebra of the traditional databases was an option. If you wanted an easy-in, easy-out, enterprise-grade XML storage option, you had limited choices. There was a small group of XML databases; however, they mostly had scale and speed limitations. Faced with this, most organizations stuck with what they knew, relational, and just poured on more and more ETL … until NoSQL.

This is the reason you don’t see XML in your core databases today, even though it fills your company’s file systems and pipes and is the interface between you and the rest of the world.

Whoa, hold up friend … our systems are structured data. Relational is just fine thanks.

Really? Perhaps the data in your relational databases is structured. What about your knowledge management systems, customer information systems, document systems, CMS, mail, etc.? How do you integrate that data with structured data to get a holistic view of all your data? What do you do when you want to bring a group of relational schemas from different systems together to get that elusive 360 view – which is being demanded by the world’s financial institution regulators? Mergers and acquisitions drive this requirement too. How do you search across that data?

Sure, there are solution stack answers. We’ve all seen whiteboards with an ever-growing number of boxes and those innocuous puny arrows between them that translate to teams of people, buckets of code, and test and operations teams. They all add up to ever-increasing costs, complexity, missed deadlines and market share loss. Sound overly dramatic? Gartner calculated a worldwide spend of $5 billion on data integration software in 2015. How much did you spend … would you know where to start calculating that cost?

OK I see that, our organization is drowning in these problems, but is XML really the best data format to capture and store structured data in?

Kudos to the creators of XML: they saw the need for XML to support both data-centric and document-centric use cases, and so XML and XML Schema support both. In fact, the XML bolt-on capability of relational databases could only cope with data-centric or structured forms of XML and struggled with document-centric data. However, XML is flexible enough to capture structured data and type that data (this is an integer, that is a date) as well as mixed content models, such as free-flowing text with inline markup for italics, bold, headings, etc.

Here’s a simple example:

XML Code

It’s a trivial example, but it shows structured data (the page number, typed as an integer, so an application can reliably perform mathematics on the value) alongside free-flowing text with additional information inline (the bold and italics). As you can see, XML can represent both structured and unstructured data. Since, as we’ve established, you have both in your organization and need to solve business problems every day, both can be brought together into a single actionable view.
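The original listing is not reproduced in this copy, so here is a minimal sketch of the same idea in Python, using the standard library’s `xml.etree.ElementTree`. The `page`, `number` and `para` element names, the sample text and the use of an inline `xsi:type` attribute are all assumptions made for illustration, not the author’s actual example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and content are illustrative only.
DOC = """\
<page xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <number xsi:type="xs:integer">42</number>
  <para>XML handles <b>structured</b> and <i>free-flowing</i> content together.</para>
</page>
"""

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

root = ET.fromstring(DOC)

# Structured part: the page number carries its declared type inline.
number = root.find("number")
declared_type = number.get(XSI_TYPE)
# ElementTree does not validate against a schema, so the conversion is explicit here.
page_number = int(number.text)
next_page = page_number + 1

# Unstructured part: mixed content, i.e. text interleaved with inline markup.
para = root.find("para")
plain_text = "".join(para.itertext())
inline_tags = [child.tag for child in para]

print(declared_type, next_page, inline_tags)
print(plain_text)
```

A schema-aware processor would enforce the integer type rather than trusting the `int()` call, but the point stands: one document holds a value you can do arithmetic on and prose you can still read as prose.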

Now, what if we want to bring together a slew of structured records from a variety of relational sources and unstructured information from document or knowledge management systems? Again, XML can represent all of these data sources with no loss of fidelity.

Allow me a brief digression on JSON. If you want a lightweight, self-describing data structure for simple or highly structured data, then JSON works well, and it is often found tightly coupled to your application. However, representing unstructured data, specifically mixed content models, in JSON can quickly become a mess. I’ve even seen folks escape XML inside their JSON to get around this! The horror! No, JSON is not the data format for combining structured and unstructured data. Here’s the example above, now in JSON … clunky, eh? Imagine an even richer, more varied data structure. It’s not pretty. The inline integer typing on the number has also been lost.

JSON Code
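The original JSON listing is not reproduced in this copy either. As a rough sketch of why mixed content gets awkward, here is one possible JSON encoding of a page with a number and a paragraph containing bold and italic runs. The structure (a list of strings and single-key objects) is just one convention I have assumed for illustration; JSON itself does not prescribe any.

```python
import json

# Hypothetical JSON rendering; the layout is an assumed convention, not a standard.
page = {
    "page": {
        "number": 42,
        "para": [
            "XML handles ",
            {"b": "structured"},
            " and ",
            {"i": "free-flowing"},
            " content together.",
        ],
    }
}


def flatten(nodes):
    """Rebuild the plain text from the list of strings and single-key objects."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        else:
            # Each object carries one inline element: {tag: text}.
            parts.extend(node.values())
    return "".join(parts)


encoded = json.dumps(page)
decoded = json.loads(encoded)
plain_text = flatten(decoded["page"]["para"])

print(encoded)
print(plain_text)
```

Every consumer has to know the convention and carry a helper like `flatten` just to read the sentence back. JSON also has only a generic number type, so a declaration such as `xs:integer` has no natural home next to the value.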

But what about the verbosity of XML? Well, if it is compressed efficiently, the repeated tags compress very well and the verbosity has little impact on storage costs.
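A quick way to get a feel for this is to compress some deliberately repetitive XML with Python’s standard `gzip` module. The `records` document below is made up, and general-purpose gzip is only a stand-in for whatever compression a given database actually uses, so treat the ratio as indicative rather than a benchmark.

```python
import gzip

# Hypothetical, deliberately repetitive XML: 1,000 records with the same tags.
records = "".join(
    f"<record><id>{i}</id><status>active</status></record>" for i in range(1000)
)
xml_bytes = f"<records>{records}</records>".encode("utf-8")

compressed = gzip.compress(xml_bytes)
ratio = len(compressed) / len(xml_bytes)

# Decompression returns the original bytes, so nothing is lost.
restored = gzip.decompress(compressed)

print(f"raw: {len(xml_bytes)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.2f}")
```

Real documents are less uniform than this, so expect smaller savings, but repeated element names are exactly the kind of redundancy compressors handle well.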

If you haven’t guessed by now, I am a fan of XML. As someone who has lived through this journey as an implementer, I can say XML has served me well. It’s like the Labrador of the data formats world: it may not be super cool or the latest thing, but it won’t let you down, no matter what the situation.
