
My Trusted Companion: XML (Which Is Now Everywhere)

7 minute read

The XML hype curve has long since passed … but where did XML end up? Dragged through the trough of disillusionment, left broken and forgotten in the gutter looking up forlornly at the stars of Big Data? Or did XML make it through the wilderness to the promised land of the plateau of productivity?

And why should you care? Well, while we wade through the ever-growing complexity of today’s IT landscape, replete with Big Data Analytics, Hadoop, Spark, Machine Learning, hundreds of NoSQL players, Microservices, *aaS, IoT, SMAC, blockchain … and on and on, liddle ol’ XML continues to be the lifeblood of many organizations. So take a moment to think about how you treat the data format that underpins your intellectual property. Is it a first-class citizen or an afterthought?

But hang on … what XML? We don’t have XML in our organization.

Yeah, you might think you don’t have XML, but are you sure? What’s moving around on the ESB and message queues? What do Microservices and REST endpoints speak? What are the data interchange formats provided to or received from business partners?

XML standards exist in most industries. Examples include FpML, FIXML and XBRL in Financial Services, BXF (Broadcast Exchange Format) in Broadcasting, ICD-10 in Healthcare, and LegalXML. Standards organizations use XML for describing standards, scientific publishing is full of XML … even the MS Office documents flowing around your organization are an XML format (OOXML). The reality is that XML is everywhere and is likely coursing through your organization.

But XML is not in our core databases … They are relational …

Ah… so now I have to ask: Why is XML not in your core databases?

Do you remember when XML hit the IT world? If you worked in an industry where your data was what is now called unstructured or semi-structured, it was huge … SGML had only ever been successfully implemented in a small number of industries. It required specialists and special toolsets. There’d be that person on the team who was a “doc head” and knew James Clark’s SP parser and the concepts of SGML DTDs that made the average developer’s toes curl. Then XML hit the scene. It was simple, yet rich enough to model most data requirements. Open-source tools proliferated. Standards were rapidly built. Traditional databases scrambled to add XML support. It was going to solve world hunger.

And then the trough …

The implementation of XML in your application wasn’t as smooth as many would have liked … some of the tools had heavy memory requirements and performance issues, and some developers complained XML was too verbose and complex for simple data models … The rise of JSON has in part been a response to those issues. However, the biggest issue by far was storing and querying your XML. The choices were poor: store it in file systems, or shred it into a relational schema, with the important fields extracted and the rest kept as an un-queryable, unsearchable CLOB … not much more than multi-key value stores … and we all know from today’s key/value stores that they have a limited set of uses.
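To make the shred-and-CLOB approach concrete, here is a minimal sketch in Python using SQLite as a stand-in for a traditional relational database. The document, table and column names are all made up for illustration, and a TEXT column plays the part of the CLOB.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document and table layout, for illustration only.
XML_DOC = (
    "<article><id>7</id><title>Trusted Companion</title>"
    "<body><p>XML is <i>everywhere</i>.</p></body></article>"
)
root = ET.fromstring(XML_DOC)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, raw_xml TEXT)")

# "Shred" the important fields into columns and keep the whole document
# as an opaque text value (the stand-in for a CLOB).
conn.execute(
    "INSERT INTO article VALUES (?, ?, ?)",
    (int(root.findtext("id")), root.findtext("title"), XML_DOC),
)

# The extracted fields are queryable with ordinary SQL...
title = conn.execute("SELECT title FROM article WHERE id = 7").fetchone()[0]

# ...but a structural question ("which words are italicised?") cannot be
# answered by the column model. The application has to pull the text
# back out and re-parse it.
raw = conn.execute("SELECT raw_xml FROM article WHERE id = 7").fetchone()[0]
italic = [e.text for e in ET.fromstring(raw).iter("i")]

print(title, italic)
conn.close()
```

Anything you did not think to extract up front stays locked inside the text column, which is the limitation described above.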

If your XML was really, really simple and you had no regard for performance, then the XML support bolted on top of the relational algebra of the traditional databases was an option. If you wanted an easy-in, easy-out, enterprise-grade XML storage option, you had limited choices. There was a small group of XML databases, but they mostly had scale and speed limitations. Faced with this, most organizations stuck with what they knew, relational, and just poured on more and more ETL … until NoSQL.

This is the reason you don’t see XML in your core databases today, yet it fills your company’s file systems and pipes and is the interface between you and the rest of the world.

Whoa, hold up friend … our systems are structured data. Relational is just fine thanks.

Really? Perhaps the data in your relational databases is structured. What about your knowledge management systems, customer information systems, document systems, CMS, mail, etc.? How do you integrate that data with structured data to get a holistic view of all your data? What do you do when you want to bring a group of relational schemas from different systems together to get that elusive 360 view – which is being demanded by the world’s financial institution regulators? Mergers and acquisitions drive this requirement too. How do you search across that data?

Sure, there are solution stack answers. We’ve all seen whiteboards with an ever-growing number of boxes and those innocuous puny arrows between them that translate to teams of people, buckets of code, and test and operations teams. They all add up to ever-increasing costs, complexity, missed deadlines and market share loss. Sound overly dramatic? Gartner calculated a worldwide spend of $5 billion on data integration software in 2015. How much did you spend … would you know where to start calculating that cost?

OK I see that, our organization is drowning in these problems, but is XML really the best data format to capture and store structured data in?

Kudos to the creators of XML: they saw the need for XML to support both data-centric and document-centric use cases, and so XML and XML Schema support both. In fact, the XML bolt-on capability of relational databases could only cope with data-centric, structured forms of XML; it struggled with document-centric data. XML, however, is flexible enough both to capture structured data and type it (this is an integer, that is a date) and to handle mixed content models, such as free-flowing text with inline markup for italics, bold, headings, etc.

Here’s a simple example:

XML Code
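The listing itself is not reproduced here, so what follows is a stand-in sketch rather than the original. It assumes a page number typed inline with an `xsi:type` attribute and a paragraph with bold and italic markup. The element names, the text and the use of Python’s standard `xml.etree.ElementTree` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and content are illustrative
# assumptions, not the original listing.
XML_DOC = """\
<page xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <number xsi:type="xs:integer">42</number>
  <para>XML handles <b>structured</b> and <i>free-flowing</i> content together.</para>
</page>
"""

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

root = ET.fromstring(XML_DOC)

# Structured, typed data. ElementTree does not validate against XML Schema,
# so this sketch reads the inline type hint and converts the value itself.
number = root.find("number")
page_number = int(number.text) if number.get(XSI_TYPE) == "xs:integer" else number.text

# Mixed content: running text with inline markup, kept in document order.
para = root.find("para")
plain_text = "".join(para.itertext())
inline_markup = [(child.tag, child.text) for child in para]

print(page_number + 1)  # arithmetic works because the value is an int
print(plain_text)
print(inline_markup)
```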

It’s a trivial example, but it shows structured data (the page number, typed as an integer, so an application can reliably perform mathematics on the value) alongside free-flowing text with additional information inline, i.e. the bold and italics. As you can see, XML can represent both structured and unstructured data, and since, as we’ve established, you have both in your organization (and need to solve business problems every day), both can be brought together into a single actionable view.

Now what if we want to bring together a slew of structured records from a variety of relational sources and unstructured information from document or knowledge management systems? Again XML can represent all of these data sources with no loss of fidelity.
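As a rough sketch of that idea, the snippet below wraps one relational row and one document fragment in a single XML envelope. The table, the note and the `envelope` element name are hypothetical, and SQLite stands in for whatever relational source you have.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical relational source: one customer row.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, country TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd', 'UK')")
row = conn.execute("SELECT * FROM customer WHERE id = 1").fetchone()

# Hypothetical document source: free text with inline markup.
note = ET.fromstring("<note>Client asked for an <b>urgent</b> review.</note>")

# One envelope holds both: each column becomes a child element and the
# document fragment is attached unchanged.
envelope = ET.Element("envelope")
record = ET.SubElement(envelope, "customer")
for column in row.keys():
    ET.SubElement(record, column).text = str(row[column])
envelope.append(note)
conn.close()

combined = ET.tostring(envelope, encoding="unicode")
print(combined)
```

Note that this sketch writes every column value out as text. Carrying the column types across as well would need a schema or inline type annotations on top of this.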

Allow me a brief digression on JSON. If you want a lightweight self-describing data structure for simple or highly structured data, then JSON works well, and it is often found tightly coupled to your application. However, representing unstructured data, specifically mixed content models, in JSON can quickly become a mess. I’ve even seen folks escape XML inside their JSON to get around this! The horror! No, JSON is not the data format for combining structured and unstructured data. Here’s the example above now in JSON … clunky eh! Imagine an even richer, more varied data structure: it’s not pretty. The inline integer typing on the number has also been lost.
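The JSON listing is not reproduced here either, so this is one possible encoding rather than the original. The paragraph is a hypothetical piece of mixed content, and the list-of-runs layout is just one convention among several, which is rather the point.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mixed-content paragraph, used only for illustration.
para = ET.fromstring(
    "<para>XML handles <b>structured</b> and <i>free-flowing</i> content together.</para>"
)

def mixed_to_json(elem):
    """Flatten mixed content into an ordered list of text runs and tagged runs."""
    runs = []
    if elem.text:
        runs.append(elem.text)
    for child in elem:
        runs.append({child.tag: mixed_to_json(child)})
        if child.tail:
            runs.append(child.tail)
    return runs

# JSON only has a generic number type, so nothing inline says "integer".
doc = {"page": {"number": 42, "para": mixed_to_json(para)}}
print(json.dumps(doc, indent=2))
```

A single sentence turns into a list that alternates bare strings with single-key objects, and every consumer has to know that convention to put the text back together.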


But what about the verbosity of XML? Well, if it is compressed efficiently, the verbosity has little impact on storage costs, since repetitive markup compresses very well.
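A quick way to see the effect is to compress some repetitive XML yourself. The sketch below uses gzip from the Python standard library on made-up trade records. It is only an illustration, and the ratio you get depends entirely on your data and on the compression scheme your storage layer actually uses.

```python
import gzip

# Hypothetical, deliberately repetitive records; real ratios depend on the data.
records = "".join(
    f"<trade><id>{i}</id><currency>USD</currency><amount>{i * 10}</amount></trade>"
    for i in range(1000)
)
raw = f"<trades>{records}</trades>".encode("utf-8")
compressed = gzip.compress(raw)

# Compare the sizes, then confirm the round trip is lossless.
print(len(raw), len(compressed))
restored = gzip.decompress(compressed)
```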

If you haven’t guessed by now, I am a fan of XML. As someone who has lived through this journey as an implementer, I can say XML has served me well. It’s like the Labrador of the data formats world: it may not be super cool or the latest thing, but it won’t let you down, no matter what the situation.

For more information on this topic

What’s the Deal With an XML Database? John Biedenbach reveals that a great place to store all that mainframe data is an XML database …

The NoSQL Generation: Embracing the Document Model. White paper on how the document database allows a more logical, human approach to modeling data, and is generally the most flexible and easy to use.

XML and JSON Data Modeling Best Practices. 11 min tutorial. Learn some best practices when modeling data in XML and/or JSON. In this tutorial, we will cover document sizing and granularity, keeping the model simple and understandable, and applying envelope patterns.

Lee Pollington
