Since the mid-1990s, and especially since the development of languages like Java, object-oriented programming has been a mainstay of commercially important IT projects. More recently, the need to treat large, incompatible, and rapidly growing and changing data silos as a single unified whole has led to the development and growth of NoSQL databases like MarkLogic.
Object-oriented programming (OOP) depends on well-defined classes to populate the instances that it works with. NoSQL is most powerful and useful when handling dissimilar data that is nearly impossible to force into a single data dictionary.
How can these two vitally important IT assets be brought together so that firms and developers can gain the benefits of both technologies?
As computers grew in power they became, at a hardware level, capable of processing more data and more complex data. As a result, the traditional relational databases where the data was stored had ever-increasing difficulty expressing the information in a way that was meaningful and useful to users. The entities being described were actually complex and hierarchical data and the effort involved in normalizing this into rows, columns, and tables made the data inaccessible to all but specialized database experts.
As time passed a compromise was reached. The data continued to be stored in the rows and columns that make up relational databases, but developers who needed to model complex entities used object-oriented languages whose instances were populated from relational databases. One of the main approaches to this is the Java Persistence API (JPA) and its implementations (e.g. Hibernate). JPA defines mappings between relational and object-oriented data structures and allows data to be translated from one format to the other.
While JPA was able to extend the ability of relational databases to support object-oriented programming it has always been an imperfect solution. Entities modeled in object-oriented languages can sometimes require hundreds of tables to fully model with a normalized SQL approach. The development complexity and performance degradation caused by shredding complex objects into SQL tables has always been a hurdle to overcome.
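To make the shredding cost concrete, here is a minimal Python sketch. It assumes a hypothetical, heavily simplified trade object (the field names are invented and are not real FpML), and it is not how JPA or any particular mapper works internally. It only shows the general idea: every level of nesting becomes its own table, linked back to its parent by a generated key.

```python
from itertools import count

def shred(entity, table, tables, ids, parent_id=None):
    """Flatten one nested dict into rows, one table per nesting level."""
    row_id = next(ids)
    row = {"id": row_id}
    if parent_id is not None:
        row["parent_id"] = parent_id  # foreign key back to the parent row
    for key, value in entity.items():
        if isinstance(value, dict):
            shred(value, key, tables, ids, row_id)      # nested object -> child table
        elif isinstance(value, list):
            for child in value:                         # repeated object -> many child rows
                shred(child, key, tables, ids, row_id)
        else:
            row[key] = value                            # scalar -> column
    tables.setdefault(table, []).append(row)
    return tables

# Hypothetical trade, for illustration only.
trade = {
    "trade_date": "2015-06-01",
    "party": [{"name": "Bank A"}, {"name": "Fund B"}],
    "swap": {
        "leg": [
            {"rate": 0.02, "notional": {"currency": "USD", "amount": 1000000}},
            {"rate": 0.03, "notional": {"currency": "USD", "amount": 1000000}},
        ]
    },
}

tables = shred(trade, "trade", {}, count(1))
for name, rows in tables.items():
    print(name, len(rows))
```

Even this toy object with three levels of nesting needs five tables and several joins to reassemble. A real entity with hundreds of nested types multiplies that accordingly.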
As data has become ever more complex, and even more importantly, as the need to consolidate overlapping but heterogeneous datasets has grown, the problems caused by an Object to SQL approach have grown.
Instead of just having to deal with mapping SQL tables to complex entities, today there are many different and incompatible versions of data. Fitting this overlapping, siloed data into SQL tables adds even greater levels of complexity.
The difficulties with handling silos of heterogeneous data explain the rise of NoSQL, but what about object-oriented programming? A substantial part of the business logic in today’s applications is built on developers being able to use complex classes that model the entities that are being analyzed. If a firm’s data is so complex that an integrated data model is impossible, does this mean object-oriented programming cannot be used for projects that treat data at a corporate level?
The answer: Object-oriented programming CAN coexist with complex, heterogeneous, ever-changing data silos.
The key to understanding how this can work is that a database management system can answer user queries, provide rock-solid security, and maintain a high level of data quality without understanding every detail of every attribute in the data. It is certainly true that, all else being equal, the closer a system comes to having a single, common data model, the more powerful the system can be. It is also true that while end users generally need a full understanding of some of the data they receive, the data management system does not.
To understand how this works in practice, let’s look at storing, querying, and processing documents based on the Financial products Markup Language (FpML). FpML is a message standard initially developed for the over-the-counter derivatives industry. FpML messages are complex XML-based documents. For example, converting the XML .xsd files that describe the base of FpML version 5.8 into Java classes yields 1690 individual classes – try shredding that into a normalized relational database!
Unlike schemas designed for database use, creators of message schemas often do not place a major focus on maintaining schema evolution – that is, limiting schema changes so that older versions remain compatible with new versions. There is no guarantee that the SQL schema you create for FpML version 4.1 will be compatible with version 5.9. In MarkLogic we can handle that difference and, if you wish, ensure that documents match with the appropriate schema.
Suppose you want to maintain a database of all the FpML messages you have sent or received in the last 10 years. You want to be able to easily store, query, examine individual transactions and perform aggregates against the data. What is the best way to implement this?
In a relational approach you will likely attempt to create a common data model that encompasses all the versions of FpML you have messages for, perform ETL on individual messages to fit them into the model, and then let users run SQL select queries to pull the data together.
In principle this is a fine approach, but it does have a few drawbacks.
For Java the most popular tool is JAXB. To understand how MarkLogic can work with JAXB the workflow is:
Answering these questions does not require creating an object representation of individual FpML messages. These are traditional database aggregate queries.
If you are working with messages within a single FpML version, your users can construct queries to answer these questions using their understanding of how data is laid out in FpML.
If you need to query across versions (where the same attribute may be reached through different object paths), or if you want to hide the complexity of FpML from casual users, then you will have to do some work implementing the envelope pattern.
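The cross-version problem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MarkLogic code: the element names, the `version` attribute, and both paths are invented for the example, and real FpML documents use namespaces and different structures.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-version paths to the same logical attribute.
# These paths are invented for illustration; real FpML paths differ.
TRADE_DATE_PATHS = {
    "4": "./trade/tradeHeader/tradeDate",
    "5": "./trade/header/dates/tradeDate",
}

def trade_date(xml_text):
    """Return the trade date regardless of which layout the message uses."""
    root = ET.fromstring(xml_text)
    major = root.get("version", "").split(".")[0]
    path = TRADE_DATE_PATHS.get(major)
    if path is None:
        raise ValueError(f"unsupported version: {root.get('version')!r}")
    node = root.find(path)
    return node.text if node is not None else None

# Two simplified sample messages with the same fact in different places.
v4_message = (
    '<message version="4.1"><trade><tradeHeader>'
    "<tradeDate>2010-03-15</tradeDate>"
    "</tradeHeader></trade></message>"
)
v5_message = (
    '<message version="5.9"><trade><header><dates>'
    "<tradeDate>2016-11-02</tradeDate>"
    "</dates></header></trade></message>"
)

dates = [trade_date(v4_message), trade_date(v5_message)]
```

Doing this lookup at query time for every attribute and every version gets tedious quickly, which is why it pays to resolve the paths once and store the result alongside the original document.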
We are not going to do a deep dive into the envelope pattern here. In brief, each document is stored in an envelope that holds the original incoming data, loaded as is, along with metadata that standardizes identifiers and units, enriches the document with information from external sources, links it to other documents (for joins and other purposes), and otherwise processes the data to make it more useful. Users query across the entire envelope and have access to both the original data and the enhancements that have been made to it.
The envelope pattern allows MarkLogic to provide structured access to data sets of any complexity. The key difference between implementing the envelope pattern and traditional data modeling/ETL exercises is that it often requires much less total effort, partly because you do not need to build a common data model or shred incoming data to fit one, and partly because it is iterative: you only implement what you need to achieve your immediate goals. You can find more in-depth descriptions of the envelope pattern in two posts: MarkLogic As an SQL Replacement and What ‘Load As Is’ Really Means.
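As a rough sketch of the idea, the Python below wraps an untouched source message in an envelope and adds a few standardized header fields. The element names `envelope`, `headers`, and `instance`, the header fields, and the sample message are all assumptions made for this example. It uses only the Python standard library and does not represent a MarkLogic API.

```python
import xml.etree.ElementTree as ET

def wrap_in_envelope(original_xml, headers):
    """Store the original document as is, next to standardized metadata."""
    envelope = ET.Element("envelope")
    header_node = ET.SubElement(envelope, "headers")
    for name, value in headers.items():
        ET.SubElement(header_node, name).text = str(value)
    # The incoming message is attached unchanged: nothing is shredded or dropped.
    instance = ET.SubElement(envelope, "instance")
    instance.append(ET.fromstring(original_xml))
    return envelope

# Hypothetical, simplified source message (not real FpML).
source = (
    '<message version="4.1"><trade><tradeHeader>'
    "<tradeDate>2010-03-15</tradeDate>"
    "</tradeHeader></trade></message>"
)

envelope = wrap_in_envelope(
    source,
    {"tradeDate": "2010-03-15", "sourceFormat": "FpML", "sourceVersion": "4.1"},
)

# Casual users query the standardized headers...
canonical_date = envelope.findtext("./headers/tradeDate")
# ...while specialists can still reach the original message.
original = envelope.find("./instance/message")
```

Because headers are added one at a time, you can start with the handful of fields your first queries need and grow the envelope as new requirements appear.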
A final point here is that we have been talking about accessing FpML data. While FpML is an important standard, FpML documents are often used as the payload for SWIFT or FIX messages. SWIFT and FIX each have complex, evolving message structures. Implementing FpML may just be the first step in your workload. To get a complete picture of your firm’s trading activities you may need to be able to process information from all these standards. With traditional technologies, each new data set is a major new project. With MarkLogic each new data set is a small increment.
On the surface, it appears that object-oriented programming is incompatible with the kind of complex, ever-changing, and diverse data found in major NoSQL projects. In reality, object-oriented programming outgrew relational technologies long ago and making the two work together becomes more of a struggle every year.
MarkLogic can dramatically reduce the complexity and effort needed to support an object-oriented development approach while maintaining the ability to access the data as a unified whole.