My company bought MarkLogic. What am I going to do at work next Monday?
You’ve been a long-time SQL developer—expert at handling tables, rows, columns, foreign keys, and SQL. You have hundreds of reports to create every month – and suddenly you are told that the company has bought MarkLogic.
MarkLogic’s marketing says don’t worry about modeling up front, just “load as is!”—and you’re thinking, ‘what is this, magic? How in the world can I generate reports and do analysis without typing select commands against tables in predefined and rigid schemas?’
When you go to work next Monday and start building out your new MarkLogic-based system just what exactly will you be doing?
I know this feeling of uncertainty quite well. I spent two decades in the SQL world before I landed at MarkLogic. I think I had a week to get up to speed before they sent me out on the road. And it was a bit of a shock. I had to let go of many habits and develop new ones. Let me walk you through the differences – and show you how it all becomes quite manageable.
Overview—MarkLogic vs. SQL
As you are well aware, SQL databases store data in tables made up of rows and columns which are accessed through SQL select, insert, update, delete, and a few other related commands.
MarkLogic is a document database that supports text searches, semantic triples based queries, and geospatial queries, and can handle any form of hierarchical data, among other things.
The document-based structure of MarkLogic is richer and more powerful and expressive than an SQL-based approach. Complex, diverse, and changing data structures are much easier to handle with a document based hierarchical database than with a rigid set of columns and rows.
With SQL, development is forced into a waterfall approach. This is because before data can be accessed it must be modeled and ETL must be performed to make the data fit into the data model. As discussed in “Using a Metadata Repository to Improve MDM Success,” this often means that there is a long time (sometimes years) between when a project starts and when the first benefits of the project begin to appear.
MarkLogic suggests a highly iterative development approach instead. Because data can be loaded and accessed as is without any processing, deliverables can be defined and implemented with only the processing and data transform needed to implement that deliverable.
Handling Source Data
With the above in mind, the first change in your workflow will be handling incoming data. Most of the data you get is more naturally hierarchical rather than relational. Instead, you may be getting complex XML in the form of FpML or FIX trades, JSON-based market data, or object based output from Java or .NET applications. In an SQL-based workflow, this complex data needs to be shred through an ETL process so that each file or object can be stored in a normalized form in a number of separate tables.
With MarkLogic much of this work will disappear. Instead, as will be discussed later, you store the data as is with no or minimal changes and add metadata to make the data fully usable.
There are two main reasons for storing the data as is:
- Users can gain benefits from the data, such as full-text search before any transformations or processing is done on the data so waiting to build out an ETL process will delay the time before users gain benefit from the data.
- In an ever-increasing emphasis today on data lineage, storing the data as you receive it makes it much easier to understand how it is transformed before it gets to the ultimate end-user.
Entities and Relationships
Data is often related to other data and indeed they can be related in a variety of ways. In relational terms, there may be a many-to-many relationship between records or a one-to-many relationship. One-to-many relationships may be parent-child relationships where the child must have exactly one parent or they may be more general where a child may have no parent or may have a one-to-many relationship with several different parents for different relationships.
With relational systems, these different types of relationships are handled the same way: normalize the data into tables and use foreign keys to link the tables together. In some cases, this is a great approach. What it means, however, is a lot of work pulling together multiple tables to create a single representation of the data.
For example, the above swap might easily require a dozen tables in a relational system. To store or pull together a single swap can, therefore require a dozen disk IOs. If your application generally requires accessing swaps as a whole it would be much more efficient to store the swap as a single unit.
Going forward, less of your time will be spent doing low-level modeling of data structures. You will do the modeling needed to secure, query, and allow your users to access the data but it will not be necessary for the database engine to formally model every attribute of every data source.
Getting Going on MarkLogic
Now that we have set the background, let’s look at some of the MarkLogic functionality you will be using to build the reports and provide the other kinds of data access your end-users require. This list is not comprehensive but should provide an idea of the kinds of things you will be doing on a day-to-day basis. Specifically, we will look at:
- Search and Query Languages
- Hierarchical Data Models
SQL – MarkLogic can expose selected data attributes using SQL through its ODBC driver. This can be a valuable addition to a MarkLogic-based solution. Because SQL is more limited than NoSQL it is unlikely to be a complete solution: your company probably selected MarkLogic because a pure SQL solution could no longer handle its needs. You may spend part of your time determining which data attributes should be exposed via SQL, but you should use this approach as needed, not as the core of your MarkLogic based offering.
The thought of having to use non-SQL query languages to fully access the power of MarkLogic may be disconcerting. Before deciding on the exact approach(es) to follow in providing your users with data, it is important to understand that while your end users expect data to be delivered to them in an SQL-friendly manner, for many of them, they take data this way because that is the only way it has ever been offered.
In many cases, end-users actually need object-oriented or hierarchical data and need to perform ETL on the data from SQL based systems to transform it to meet their requirements. This is often done through approaches like JPA which can require extensive effort and which can be difficult to maintain. With MarkLogic, data can be stored in a way that is highly compatible with object-oriented languages like Java or .NET. In fact, environments like Java and .NET have the ability to serialize XML and JSON into objects (or deserialize objects into XML or JSON) with a single command that is much more lightweight and easier to maintain than JPA.
Going forward you will be spending time determining the optimal approach to providing your end users with data and offering them a broader and more useful array of data delivery options than they currently have.
Directories – With SQL databases, record IDs are usually integers that grow over time. The reason for this is, with a relational database, if you regularly insert records into the middle of a table you will quickly run out of space and block splitting will occur — which is bad for performance. MarkLogic works quite a bit differently than traditional databases—data is constantly moved between machines and block sizes are larger – so block splitting is much less of an issue.
As a result, with MarkLogic, you are not limited to using meaningless integers as document identifiers and have the option of using identifiers that can make your user’s lives easier. Specifically, MarkLogic URIs can be based on directories similar to those found in file systems. In many cases, users can find the documents they are looking for without the need to evaluate all the documents in the system. For example, if you are looking for an FX Option trade done with counterpartyA you may have a naming convention so the trade is stored as Trades/FXOption/CounterPartyA/Trade20160901.json.
If you know you are looking for FXOption trades with CounterPartyA, with MarkLogic you can just go to the appropriate directory and scan through the limited set of documents located there instead of searching through the entire database. Not all queries will be able to take advantage of a document’s directory structure but those that do can be made to run many times faster.
One of the first things you will want to do going forward is to determine the naming conventions that will improve performance and make your data more understandable. This is one item you might want to work on before you begin doing your initial data loads into MarkLogic.
Hierarchical Data Models/Denormalizing Data – A key aspect to MarkLogic is that it is, among other things, a hierarchical database which can handle complex data in a single document. When data is stored hierarchically the need for SQL-style joins (a major performance killer) can be greatly reduced. As discussed above, much of the data coming into your firm already is hierarchical and your SQL-based approach shreds the data to force it into columns and rows. In many cases you will just be able to take the incoming data as is and, initially at least, just store it.
In other cases to make data easier to understand and improve query performance you will want to “denormalize” the data. It is important to understand that in a MarkLogic context “denormalize” means something very different than it does in the relational world. In the relational world denormalizing data means duplicating data so that queries can be resolved more efficiently. In general, this is a negative because data integrity suffers when more than one data element is logically required to be the same.
With MarkLogic, denormalizing data means pulling child data into the parent’s document which eliminates duplication of data and improves data integrity. Storing children with their parents not only makes the data easier to understand, it eliminates the possibility of “orphans” (children without parents), which should not occur in relational systems — but often do.
To optimize performance and make data easier to understand and maintain, you will be spending time determining which data should be stored in a hierarchical fashion and sometimes pulling this data together.
Semantics/Triple – The closest direct replacement MarkLogic has for relational foreign keys is semantic triples. Triples are far more powerful than foreign keys but can provide foreign key-like functionality. Triples can be used to connect documents and apply meaning to relationships.
At a high level triples are subject/predicate/object relations that can look like:
<Swap><isA><FinancialInstrument> . <DerivativesSwap><isA><Derivative> . <DerivativesSwap><isA><Swap> . <Trade1><isA>< DerivativesSwap > . <Trade1><hasACouponPayment><PaymentA> . <PaymentA><hasAPaymentDate><2016-04-05> . <Trade1><hasCounterParty><CompanyA> . <CompanyA><hasARiskRating></risk/companyARating> .
The subjects and objects in the triples can, but need not, refer to documents in the database and when they do they can act as a foreign key. With the SPARQL language, SQL Select like functionality can be performed to join data in separate documents. Generating this kind of triples is generally not difficult as the information already resides in the foreign keys of the legacy data.
Going beyond simple foreign key replacement, for conducting relational style queries, triples and semantics are often used to build out synonym lists and hierarchies. For example, if different data sets refer to the same financial instrument with different identifiers you can create triples to show all these variants actually refer to the same thing and search on all the terms together. A key point of using triples for synonyms (unlike fields which are discussed below) is that the relationships are defined at run time and not design time so relationships can continually evolve and be added.
With hierarchies you can slice and dice information as you please. For example, if you define a hierarchy like the above with FinanicalInstrument->Swap->DerivativesSwap you can index documents with the narrowest term (here derivatives swap) and query on any of its broader terms.
A key feature of MarkLogic is that semantic queries against triples can be integrated with searches and queries against other types of data. For example, the results of a SPARQL query can be filtered against a date range stored as XML in your underlying documents.
An important part of your job going forward will be creating triples to define relationships between data in your database. Much of this will be repurposing the information used in your legacy foreign keys but to fully exploit MarkLogic’s capabilities you will want to review your data to find more triples.
Fields – If the same attribute is defined differently in different data sources, say customer_id in one document and custID in another, you can create fields to link the different attribute names into a single global identifier. It is common to also associate the field with an in-memory range index to make queries against the field run fast.
Unlike using triples to define synonyms, fields are a design-time tool and their use should be limited to extremely important data attributes.
In the future, you will be spending some time analyzing your data to determine which data attributes should be treated as a single unit with either fields or triples.
Collections – One of the reasons business intelligence (BI) tools are able to provide users with fast answers even when working with large data sets is that sometimes they pre-calculate answers to common queries. When BI tools do this it can have a negative side effect of slowing processing down so much that real-time access must be abandoned and data loaded in batches.
An important capability MarkLogic provides to speed many queries is “collections.” When a document is inserted — or at any time afterward — it can be associated with zero or more collections. The criteria for determining if a document belongs in a given collection can be as complex as needed. Assigning a document to a collection at ingest instead of at query time enables queries involving collections to run very fast – MarkLogic maintains a list of all the URIs of the documents that belong to the collection and so queries of the form: collection = ? are precomputed and take virtually no time to run.
MarkLogic can maintain as many collections as is helpful to your users. Although new collections can be defined at any time, using collections is a design time capability in the sense that when a new collection is defined it is necessary to scan existing documents to determine if they should be added to it.
Determining what collections can improve your user’s query performance should be a part of your regular ongoing workflow.
Pulling It All Together
Now that you have a flavor of the low-level workflow you will be doing with MarkLogic how do you pull it together? With MarkLogic, the general design approach is to load data close to “as-is” and then use the envelope pattern to make the data generally accessible to users.
It is important to note that, unlike relational systems which can require extensive data modeling and ETL to be useful, as soon as data is loaded a great deal of value can be gained from it. With MarkLogic’s Universal Index, as soon as data is loaded it immediately becomes available for search and query. Google-like searches can be conducted without the database administrators understanding what the data is. Descriptors in the data (XML & JSON attributes and csv field descriptors) can be queried against by any user who understands what’s in the data. A substantial percentage of use cases can be largely handled by this functionality alone.
To move beyond this base functionality and allow MarkLogic to fully exploit its capabilities, customers generally use the “envelope” pattern where data is stored “as-is” along with metadata to harmonize and enrich the original content. Primary drivers behind the load data as-is framework is 1) data can be used and accessed as discussed above while the metadata is built and 2) the increasing requirement to provide data lineage. In relational systems data sometimes goes through multiple ETL steps before it is provided to end users. If a question arises on the province of given data value it can be difficult to determine if that value was part of the original source data or if it came out of one of the ETL steps.
By loading data as is and then storing the transformed and enriched metadata with the original data it becomes much easier to track data lineage.
Data stored in the metadata portion of the envelope can include items like normalization of units (some original data comes in millions, others thousands) and identifiers as well as other transformations and enrichment to the data.
As metadata is defined, users can query on a combination of the original data and the canonical and enriched metadata to obtain a consistent and enhanced view of the data.
As an example of the envelope pattern, suppose that the original data for a bond simply gave its full-text issue name and coupon. To make this more useful to your users you might want to provide the normalized identifier of the bond issuer (AAPL) and enrich the data with the bond’s Moody’s Rating.
After this is done the document for the bond is much more accessible and useful for your end-users.
Going back to the MarkLogic design philosophy, in general, you should focus your envelope pattern efforts in an iterative, deliverable-focused manner. It is not necessary to fully build out a document’s metadata for it to be useful. You should focus on the metadata needed for your next deliverable.
After you have your data loaded and delivered your first deliverables, building out the meta data needed to make your documents uniformly accessible and as useful as possible will take a substantial amount of your time.
MarkLogic allows you to take diverse complicated data silos and begin gaining value from them with minimal effort. To gain more value does require work and effort but this work can be done in an incremental effort with each development sprint focused on just the tasks needed to complete a limited set of deliverables. As each sprint is completed its results can be made available to all users of the data and the resulting database becomes the starting point for future development.
In some ways, the work is similar to what is done in relational based systems and in some ways different. In general, projects can be delivered and maintained with significantly less overall effort than has been possible with yesterday’s technologies.
What ‘Load As Is’ Really Means Explanation of the Load As Is pattern, how you can validate and transform JSON and XML, and how it works to speed up development.