Since the mid-1990s, and especially since the development of languages like Java, object-oriented programming has been a mainstay of commercially important IT projects. More recently, the need to treat large, incompatible, and rapidly growing and changing data silos as a single unified whole has led to the development and growth of NoSQL databases like MarkLogic.
Object-oriented programming (OOP) depends on well-defined classes from which the instances that programs work with are created. NoSQL is most powerful and useful when handling dissimilar data that is nearly impossible to force into a single data dictionary.
How can these two vitally important IT assets be brought together so that firms and developers can gain the benefits of both technologies?
The Rise of Object-Oriented Programming
As computers grew in power they became, at a hardware level, capable of processing more data and more complex data. As a result, the traditional relational databases where the data was stored had ever-increasing difficulty expressing the information in a way that was meaningful and useful to users. The entities being described were actually complex and hierarchical data and the effort involved in normalizing this into rows, columns, and tables made the data inaccessible to all but specialized database experts.
As time passed a compromise was reached. The data continued to be stored in the rows and columns that make up relational databases, but developers who needed to model complex entities used object-oriented languages whose instances were populated from relational databases. One of the main approaches to this is the Java Persistence API (JPA) and its implementations (e.g. Hibernate). JPA defines mappings between relational and object-oriented data structures and allows data to be translated from one format to the other.
While JPA was able to extend the ability of relational databases to support object-oriented programming, it has always been an imperfect solution. Entities modeled in object-oriented languages can sometimes require hundreds of tables to fully model with a normalized SQL approach. The development complexity and performance degradation caused by shredding complex objects into SQL tables have always been hurdles to overcome.
As data has become ever more complex and, even more importantly, as the need to consolidate overlapping but heterogeneous datasets has grown, the problems caused by an object-to-SQL approach have multiplied.
Instead of just having to deal with mapping SQL tables to complex entities, today there are many different and incompatible versions of data. Fitting this overlapping, siloed data into SQL tables adds even greater levels of complexity.
The difficulties of handling silos of heterogeneous data explain the rise of NoSQL, but what about object-oriented programming? A substantial part of the business logic in today’s applications is built on developers being able to use complex classes that model the entities being analyzed. If a firm’s data is so complex that an integrated data model is impossible, does this mean object-oriented programming cannot be used for projects that treat data at a corporate level?
Integrating Object-oriented Programming and NoSQL
The answer: Object-oriented programming CAN coexist with complex, heterogeneous, ever-changing data silos.
The key to understanding how this can work is that a database management system can answer user queries, provide rock-solid security, and maintain a high level of data quality without understanding every detail of every attribute in the data. It is certainly true that, all else being equal, the closer a system comes to having a single, common data model, the more powerful the system can be. It is also true that while the ultimate end-users generally need a full understanding of some of the data they receive, the data management system does not.
An Example – Handling FpML With MarkLogic
To understand how this works in practice, let’s look at storing, querying, and processing documents based on the Financial products Markup Language (FpML). FpML is a message standard initially developed for the over-the-counter derivatives industry. FpML messages are complex XML-based documents. For example, converting the XML .xsd files that describe the base of FpML version 5.8 into Java classes yields 1690 individual classes – try shredding that into a normalized relational database!
Unlike schemas designed for database use, message schemas are often created without a major focus on managing schema evolution, that is, limiting schema changes so that older versions remain compatible with new versions. There is no guarantee that the SQL schema you create for FpML version 4.1 will be compatible with version 5.9. In MarkLogic we can handle that difference and, if you wish, ensure that documents match the appropriate schema.
Suppose you want to maintain a database of all the FpML messages you have sent or received in the last 10 years. You want to be able to easily store, query, examine individual transactions and perform aggregates against the data. What is the best way to implement this?
In a relational approach, you will likely attempt to create a common data model that encompasses all the versions of FpML you have messages for, perform ETL on individual messages to fit them into the model, and then let users use SQL select queries to pull the data together.
In principle this is a fine approach, but it does have a few drawbacks.
- You may have retired by the time your database is ready. The 1690 classes representing version 5.8 mentioned above were just for one version of FpML. Other versions have different object-oriented representations, and while there is quite a bit of overlap, there are also differences. Your schema designers will need to do a lot of work to create either a common data model to hold all the messages or, alternatively, separate data models for each FpML version. Designing the ETL needed to move individual messages into the common format is a major job. If you maintain separate data models for each version, how will you query the entire database in an integrated fashion?
- Your performance is likely to be bad. Decomposing a single FpML message into potentially hundreds of tables to store and then reversing the process to populate your FpML object is a very costly process from a database perspective. Performing the joins needed to query the messages in an integrated fashion will likely kill your query performance.
- Accessing the data may be hard. If it takes 1690 tables to represent just one version of FpML how many end-users will be able to construct SQL queries to pull together the data they need?
The NoSQL/MarkLogic Alternative
- MarkLogic approaches it a different way. Just load the FpML messages as is without any processing and begin searching and querying them immediately. One of the beauties of MarkLogic is that data is accessible as soon as it is loaded, even if database developers have no idea what’s in the data. MarkLogic’s Ask Anywhere Universal Index provides users with Google-like search abilities on any data as soon as it is loaded – with no processing. The Universal Index also provides the ability to query on attributes contained in the data – in this case the XML attributes describing the FpML messages. Users who understand FpML can immediately begin issuing structured queries against the FpML without the need for any database transformations, ETL, or the development of a common data model.
- Use lightweight, industry standard technologies to convert FpML content into object representations and then back into XML when needed. There are a variety of technologies to convert XML and JSON documents into objects.
For Java, the most popular tool is JAXB. The workflow for using MarkLogic with JAXB is:
- Download the .xsd files that make up the schemas for the FpML version you are interested in.
- In your Java IDE, use JAXB to generate Java classes from the .xsd files that define the FpML version.
- Use MarkLogic to search the database to find the messages you are interested in.
- Pass the messages to JAXB and use it to create instances of the classes.
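The last step of this workflow, turning a retrieved message into an object, can be sketched roughly as follows. This is only an illustration under stated assumptions: `TradeSummary` is a hypothetical, hand-written stand-in for one of the many classes JAXB would generate from the .xsd files, the mapping is done by hand with the JDK's built-in DOM parser instead of a JAXB `Unmarshaller`, and the sample message is a heavily simplified, namespace-free FpML-like document rather than a schema-valid one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for a class that JAXB would generate from the FpML .xsd files. */
final class TradeSummary {
    final String tradeDate;
    final List<String> partyIds;

    private TradeSummary(String tradeDate, List<String> partyIds) {
        this.tradeDate = tradeDate;
        this.partyIds = partyIds;
    }

    /** Turns one message, as returned by a database search, into an object. */
    static TradeSummary fromXml(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Message is not well-formed XML", e);
        }
        // Assumed element names for the simplified sample; real FpML is namespaced.
        NodeList dates = doc.getElementsByTagName("tradeDate");
        String tradeDate = dates.getLength() == 0
                ? null
                : dates.item(0).getTextContent().trim();
        List<String> partyIds = new ArrayList<>();
        NodeList ids = doc.getElementsByTagName("partyId");
        for (int i = 0; i < ids.getLength(); i++) {
            partyIds.add(ids.item(i).getTextContent().trim());
        }
        return new TradeSummary(tradeDate, partyIds);
    }

    public static void main(String[] args) {
        // Simplified, FpML-like message used purely for illustration.
        String message = "<dataDocument>"
                + "<trade><tradeHeader><tradeDate>2015-03-02</tradeDate></tradeHeader></trade>"
                + "<party id=\"party1\"><partyId>BANKA</partyId></party>"
                + "<party id=\"party2\"><partyId>BANKB</partyId></party>"
                + "</dataDocument>";
        TradeSummary trade = fromXml(message);
        System.out.println(trade.tradeDate + " " + trade.partyIds);
    }
}
```

With real JAXB, the generated classes and an `Unmarshaller` would replace the hand-written `fromXml` method, so no per-element mapping code would need to be written or maintained.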
- Use the envelope pattern to enhance and enrich the documents storing the FpML messages to continually make it easier to access and use the data in the messages.
While the above will allow your users to get to work on day 1 and will, on its own, satisfy a lot of your needs, you may want to do more. For example, you may want to perform structured queries such as "How many counterparties have we worked with?" or "What was the sum of the initial values in 2015?"
Answering these questions does not require creating an object representation of individual FpML messages. These are traditional database aggregate queries.
If you are working with messages from a single FpML version, your users can construct queries to answer these questions using their understanding of how data is laid out in FpML.
If you need to query across versions (where the same attribute may sit at different object paths), or if you want to hide the complexity of FpML from casual users, then you will have to do some work implementing the envelope pattern.
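To make the single-version case concrete, here is a minimal sketch of the two example questions expressed as path-based queries. It makes several assumptions: it uses the JDK's XPath support on in-memory strings rather than a MarkLogic query, the element paths describe a simplified, namespace-free FpML-like layout rather than the real schema, and `SingleVersionQueries`, `sampleMessage`, and the `ownPartyId` parameter are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SingleVersionQueries {
    // Assumed paths for one simplified layout; real FpML paths differ by version.
    static final String PARTY_ID_PATH = "/dataDocument/party/partyId";
    static final String TRADE_DATE_PATH = "/dataDocument/trade/tradeHeader/tradeDate";
    static final String INITIAL_VALUE_PATH = "//notionalStepSchedule/initialValue";

    /** Builds a simplified, FpML-like message for illustration only. */
    static String sampleMessage(String date, String counterparty, String initialValue) {
        return "<dataDocument><trade>"
                + "<tradeHeader><tradeDate>" + date + "</tradeDate></tradeHeader>"
                + "<notionalStepSchedule><initialValue>" + initialValue
                + "</initialValue></notionalStepSchedule></trade>"
                + "<party id=\"party1\"><partyId>BANKA</partyId></party>"
                + "<party id=\"party2\"><partyId>" + counterparty + "</partyId></party>"
                + "</dataDocument>";
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Message is not well-formed XML", e);
        }
    }

    /** "How many counterparties have we worked with?" */
    static int countDistinctCounterparties(List<String> messages, String ownPartyId) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Set<String> seen = new TreeSet<>();
        try {
            for (String message : messages) {
                NodeList ids = (NodeList) xpath.evaluate(
                        PARTY_ID_PATH, parse(message), XPathConstants.NODESET);
                for (int i = 0; i < ids.getLength(); i++) {
                    String id = ids.item(i).getTextContent().trim();
                    if (!id.equals(ownPartyId)) {
                        seen.add(id);
                    }
                }
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    /** "What was the sum of the initial values in a given year?" */
    static double sumInitialValues(List<String> messages, String year) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        double total = 0;
        try {
            for (String message : messages) {
                Document doc = parse(message);
                String tradeDate = xpath.evaluate(TRADE_DATE_PATH, doc).trim();
                if (tradeDate.startsWith(year + "-")) {
                    total += (Double) xpath.evaluate(
                            "sum(" + INITIAL_VALUE_PATH + ")", doc, XPathConstants.NUMBER);
                }
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }
}
```

Both methods rely on every message sharing the same layout, which is exactly the assumption that stops holding once several FpML versions are mixed in one database.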
We are not going to do a deep dive into the envelope pattern here. In brief, the original incoming data is loaded as is and stored in an envelope alongside metadata that standardizes identifiers and units, enriches documents with information from external sources, provides links between different documents (for joins and other purposes), and otherwise processes the data to make it more useful. Users query across the entire envelope and have access to both the original data and the enhancements that have been made to it.
The envelope pattern allows MarkLogic to provide structured access to data sets of any complexity. Implementing the envelope pattern differs from traditional data modeling and ETL exercises in two key ways. It often requires much less total effort, partly because you do not need to build a common data model or shred incoming data to fit one. It is also iterative: you only implement what you need to achieve your immediate goals. You can see more in-depth descriptions of the envelope pattern in these two posts: MarkLogic As an SQL Replacement and What ‘Load As Is’ Really Means.
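As a rough illustration of the idea, the sketch below wraps two differently shaped messages in envelopes that share a common header. Everything specific here is an assumption made for the example: the element names `envelope`, `headers`, and `instance`, the `EnvelopeSketch` class, the version labels, and the simplified, namespace-free message layouts are all hypothetical, and the code uses the JDK's DOM and XPath support rather than anything MarkLogic-specific.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class EnvelopeSketch {
    // Assumed, version-specific locations of the trade date in two simplified layouts.
    private static final Map<String, String> TRADE_DATE_PATHS = new HashMap<>();
    static {
        TRADE_DATE_PATHS.put("4.x", "/FpML/trade/tradeHeader/tradeDate");
        TRADE_DATE_PATHS.put("5.x", "/dataDocument/trade/tradeHeader/tradeDate");
    }

    /** Wraps the untouched original message and adds standardized headers beside it. */
    static Document wrap(String originalXml, String version) {
        String path = TRADE_DATE_PATHS.get(version);
        if (path == null) {
            throw new IllegalArgumentException("No mapping for version " + version);
        }
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document original = builder.parse(new InputSource(new StringReader(originalXml)));
            Document envelope = builder.newDocument();
            Element root = envelope.createElement("envelope");
            envelope.appendChild(root);

            // Metadata: the same header names regardless of the source layout.
            Element headers = envelope.createElement("headers");
            root.appendChild(headers);
            addHeader(headers, "sourceVersion", version);
            addHeader(headers, "tradeDate", query(original, path));

            // The original message is copied in unchanged.
            Element instance = envelope.createElement("instance");
            root.appendChild(instance);
            instance.appendChild(envelope.importNode(original.getDocumentElement(), true));
            return envelope;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not build envelope", e);
        }
    }

    private static void addHeader(Element headers, String name, String value) {
        Element header = headers.getOwnerDocument().createElement(name);
        header.setTextContent(value);
        headers.appendChild(header);
    }

    /** Evaluates an XPath expression and returns its trimmed string value. */
    static String query(Document doc, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String older = "<FpML><trade><tradeHeader><tradeDate>2009-05-14</tradeDate>"
                + "</tradeHeader></trade></FpML>";
        String newer = "<dataDocument><trade><tradeHeader><tradeDate>2015-03-02</tradeDate>"
                + "</tradeHeader></trade></dataDocument>";
        // One query path now works for both layouts.
        System.out.println(query(wrap(older, "4.x"), "/envelope/headers/tradeDate"));
        System.out.println(query(wrap(newer, "5.x"), "/envelope/headers/tradeDate"));
    }
}
```

Casual users can query the shared header path without knowing which version a message came from, while users who understand FpML can still reach the full original message under the `instance` element. New headers can be added later, one at a time, as new questions come up.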
Introducing SWIFT and FIX too
A final point here is that we have been talking about accessing FpML data. While FpML is an important standard, FpML documents are often used as the payload for SWIFT or FIX messages. SWIFT and FIX each have complex, evolving message structures. Implementing FpML may just be the first step in your workload. To get a complete picture of your firm’s trading activities you may need to be able to process information from all these standards. With traditional technologies, each new data set is a major new project. With MarkLogic each new data set is a small increment.
On the surface, it appears that object-oriented programming is incompatible with the kind of complex, ever-changing, and diverse data found in major NoSQL projects. In reality, object-oriented programming outgrew relational technologies long ago and making the two work together becomes more of a struggle every year.
MarkLogic can dramatically reduce the complexity and effort needed to support an object-oriented development approach while maintaining the ability to access the data as a unified whole.
- MarkLogic As an SQL Replacement: Blog post that addresses the differences between querying with MarkLogic vs SQL.
- What ‘Load As Is’ Really Means: Comprehensive blog post by Dan McCreary on what to expect when you load data without modeling first.