In a previous post, I discussed the limitations found in many existing mortgage infrastructures, which leave them unable to handle the potentially thousands of document variants that cause headaches for mortgage origination, processing, or securitization.
To get past these limitations it is necessary to build an infrastructure that includes:
- A universal repository that is a single source of truth with a consolidated view of ALL information
- Quick access to new data and new data sources — minimal ETL
- Support for new data governance and regulatory demands
- Enterprise data integrity, reliability, and security
- A complete history of each mortgage
- Ability to handle many billions of documents
- Geospatial queries
Combining all these into a universal mortgage repository is not a utopian fantasy. In this entry, I will delve into work being done at major banking institutions to show how a next-generation system should work.
The Scope of the Problem
The biggest issue in building a repository is the number of different types of documents involved in tracking a mortgage over its possibly 30-year lifespan. Just scratching the surface, there are documents on customer information, checking and savings statements, check images, credit reports, and mortgage applications. The formats of many of these documents vary from state to state or possibly even from company to company. The versions of the documents used today may look very different from the versions that existed in 1995 or 2005, and it is necessary to store them all. Every time regulations change, new versions of many of the documents are created.
A large bank that originates and processes millions of mortgages will store many, many thousands of variants of different document types — each of which has a different implied schema.
In some cases a firm does not want a complete document repository, in which the actual documents are centrally stored and queried. It prefers to keep the actual documents with the primary systems that create and maintain them. Instead of a document repository, it maintains a metadata repository. This makes it possible to determine which documents meet a set of search or query parameters and to quickly locate and retrieve them from the primary owner’s systems.
In the past, documents were on paper and were stored in filing cabinets. Each primary system would maintain a metadata database that allowed users to query metadata to determine the ID numbers of required documents. Today, scanned images and PDF or Word documents have largely replaced the filing cabinets, making it easier to access the full document. As time has passed, firms have gradually increased the metadata stored on individual documents, often making it possible to answer some questions with just the metadata.
Metadata repositories can be easier to build than document repositories as it is not necessary to change metadata formats every time there is a small change to the underlying document.
However, even with a metadata repository, there is still a lot of work to do. To start with, there are many different business processes involved in the life of a mortgage, including check processing, monthly statement creation, and mortgage origination processing. In fact, an infrastructure may contain 35 or more primary systems. These systems have largely been developed independently of each other, and the metadata varies from system to system. Mergers and acquisitions mean that different metadata approaches coexist in a single firm. These days, mortgages are regularly bought and sold after origination, so documents, and perhaps metadata, created by many different firms may coexist in the same database. All this means that even with a metadata repository there can still be thousands of variants of document types coexisting within a firm.
Trying to pull all these primary or metadata documents together with ETL and a relational database is a recipe for disaster. The large existing inventory of document types, and the rate at which new document types and variants of existing types appear, mean that any attempt at a systematic modeling and ETL effort will likely never finish.
Building the Universal Repository
When discussing the mechanics of building a new repository we will use MarkLogic as the database backing the repository as it has all the needed functionality.
Loading the Data
The first requirement in building a document or metadata repository is to just load the data. Instead of defining a schema and then engaging in extensive ETL to force underlying data sources to fit into that schema, you just load the data “as is.”
As it is being ingested, MarkLogic’s Universal index makes it immediately available for searching without any ETL required. Additionally, structured queries can be performed against the existing metadata in the primary PDF, Word and other documents, or against the XML descriptors or JSON tags found in metadata documents — again without any modeling or transformations.
This ability to access the data as soon as it is loaded means that the repository can provide value from day one. Users can search and query, and get fast and accurate results, even before development begins.
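To make the "load as is" idea concrete, here is a minimal sketch in plain Python. It is not MarkLogic code and does not use any MarkLogic API; the `ToyRepository` class, its method names, and the sample documents are all hypothetical. It only illustrates the principle: documents with different shapes are stored unchanged, and every word and every field/value pair is indexed at load time, so both search and structured queries work without a predefined schema.

```python
import json
import re
from collections import defaultdict


class ToyRepository:
    """Hypothetical in-memory stand-in for a schema-free document store."""

    def __init__(self):
        self.docs = {}                        # uri -> document exactly as loaded
        self.word_index = defaultdict(set)    # lowercased word -> uris
        self.field_index = defaultdict(set)   # (field name, value) -> uris

    def load(self, uri, raw_json):
        """Store the document unchanged and index it immediately."""
        doc = json.loads(raw_json)
        self.docs[uri] = doc
        self._index(uri, doc)

    def _index(self, uri, node, field=None):
        # Walk whatever structure the document happens to have.
        if isinstance(node, dict):
            for key, value in node.items():
                self._index(uri, value, key)
        elif isinstance(node, list):
            for item in node:
                self._index(uri, item, field)
        else:
            text = str(node)
            if field is not None:
                self.field_index[(field, text)].add(uri)
            for word in re.findall(r"\w+", text.lower()):
                self.word_index[word].add(uri)

    def search(self, word):
        """Free-text search: which documents contain this word anywhere?"""
        return sorted(self.word_index.get(word.lower(), set()))

    def query(self, field, value):
        """Structured query: which documents have field == value?"""
        return sorted(self.field_index.get((field, str(value)), set()))


repo = ToyRepository()
# Two documents with different implied schemas, loaded with no modeling step.
repo.load("/statements/1.json",
          '{"customerID": "C-100", "note": "Escrow shortage notice"}')
repo.load("/applications/7.json",
          '{"acct_num": "C-100", "property": {"state": "TX"}}')

print(repo.search("escrow"))      # ['/statements/1.json']
print(repo.query("state", "TX"))  # ['/applications/7.json']
```

A real repository does far more (persistence, security, transactions, scale), but the order of operations is the point: load first, query immediately, refine later.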
Optimizing the Data
Once data is loaded the repository can be optimized and improved upon in a variety of ways. There are two key factors to keep in mind during this process. First, optimization can be done in an incremental fashion. Data can be continually loaded and the repository can be searched and queried while repository enhancement is constantly making the system ever more powerful. As a result, time to measurable results can be a fraction of that needed for relational/ETL based projects.
Second, all the techniques discussed in this post can be used simultaneously in a single search or query (searches and queries can be performed together as well).
Some specific approaches to enhance the repository include:
- Combining Dissimilar Data Sets: In the legacy data, fields that are logically the same will often have different names: customerID in one document may be acct_num in another. To allow for integrated queries, data mappings can be provided to MarkLogic on either an ad-hoc or system-wide basis. Mappings can be added to the system incrementally. Unlike relational approaches, it is not necessary to identify the mappings the system needs at design time and build them into an ETL process that runs before data load.
- Advanced Queries: To enhance query capabilities and speed up queries, indexes can be added to the repository. MarkLogic comes with a wide set of index types, only a few of which are mentioned here as an illustration of the power they add. Range indexes can be used to list the values of a specific element occurring in documents, along with the number of documents that contain each value and also meet a search or query. An example is a list of account IDs along with the number of documents that have each ID and meet a specified search or query. Such lists are called “facets,” and they make it easy to drill into a data set.
- Location Based Queries: Geospatial indexes make it easy to do location-based queries. For example, if you are pricing a basket of mortgages and want your pricing algorithm to take into account that properties near a specific location may have been affected by a recent oil spill, geospatial searches make it easy. Creating a basket of mortgages that includes a geographic component is obviously easier with a repository that supports location-based queries.
- Semantics: Semantics takes sets of triples and uses them to build indexes that enable powerful new search abilities. As a simple example, if different document types use different customer identifiers, triples can be defined like: “EID0001 is the sameAs act0001.” With semantic queries (performed with the industry standard SPARQL query language) and a set of triples like the example above, all the information for a specific account ID can be queried by adding a SPARQL component to the rest of the query. This approach can be built out to handle more complex use cases. For example, one of the holy grails of master data management is householding: getting customer data to a point where it can support analytics based on families or other broader groupings instead of just individuals. With triples like “EID0001 is_Husband_of EID2343” and “EID2343 is_Sister_of EID6765” it becomes possible to combine the mortgage data with other data sets and create a 360-degree view of household-based groups. Adding more types of triples like “EID6765 works_for IBM” and “EID6765 owns 2014 Buick” allows data to be sliced and diced in a truly wide variety of ways.
- What did you know and when did you know it?: An important issue in today’s mortgage world is the ability of firms to show regulators and others what they knew and when they knew it. Bitemporal functionality tracks both when a fact was true and when the database recorded it, which makes it possible to show what the state of a database was at a point in the past. With relational technology, it is especially difficult to implement bitemporality with long-lived instruments like mortgages. Document structures used in mortgage processing and securitization are continually changing, which, in a relational system, requires schema changes. Displaying data with its schemas as they existed 10 or 20 years ago is a nearly impossible job for a relational database.
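The first two approaches in the list above, field mappings and facets, can be sketched together in a few lines of plain Python. This is an illustration only, not MarkLogic's API: the `FIELD_ALIASES` table, the `logical_view` and `facet` functions, and the sample documents are all hypothetical. The stored documents are never rewritten; the mapping is applied at query time, which is why mappings can be added incrementally.

```python
from collections import Counter

# Hypothetical mapping from source-specific field names to one logical name.
# New aliases can be added at any time without reloading the documents.
FIELD_ALIASES = {"customerID": "customer_id", "acct_num": "customer_id"}


def logical_view(doc, aliases=FIELD_ALIASES):
    """Return a copy of doc with aliased field names replaced by logical names."""
    return {aliases.get(name, name): value for name, value in doc.items()}


def facet(docs, field, predicate=lambda d: True):
    """List each value of `field` with the number of matching documents."""
    counts = Counter()
    for doc in docs:
        view = logical_view(doc)
        if field in view and predicate(view):
            counts[view[field]] += 1
    return counts.most_common()


DOCS = [
    {"customerID": "C-100", "type": "statement", "year": 2014},
    {"acct_num": "C-100", "type": "application", "year": 2005},
    {"acct_num": "C-200", "type": "statement", "year": 2014},
]

# Facet over every document, regardless of which field name it used.
print(facet(DOCS, "customer_id"))
# [('C-100', 2), ('C-200', 1)]

# The same facet restricted to documents that meet a query.
print(facet(DOCS, "customer_id", lambda d: d["year"] == 2014))
# [('C-100', 1), ('C-200', 1)]
```

A real range index precomputes these value lists so that facets stay fast over very large collections; the loop here only shows what is being counted.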
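The location-based query from the list above reduces to asking which properties fall inside a drawn area. Here is a minimal sketch using the standard ray-casting point-in-polygon test. The loan IDs, coordinates, and the `SPILL_AREA` polygon are invented for illustration, and the code treats longitude and latitude as flat plane coordinates, which is a simplification that a real geospatial index would not make.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray drawn from the point toward the east.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical mortgage metadata with property coordinates.
MORTGAGES = [
    {"loan_id": "M-1", "lon": -90.10, "lat": 29.95},
    {"loan_id": "M-2", "lon": -95.37, "lat": 29.76},
]

# Hypothetical area of concern, drawn as a simple rectangle.
SPILL_AREA = [(-91.0, 29.0), (-89.0, 29.0), (-89.0, 31.0), (-91.0, 31.0)]

affected = [
    m["loan_id"]
    for m in MORTGAGES
    if point_in_polygon(m["lon"], m["lat"], SPILL_AREA)
]
print(affected)  # ['M-1']
```

The list of affected loans could then feed a pricing algorithm, or be combined with any of the other query types discussed here.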
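The triples in the semantics item can also be illustrated without a triple store. The sketch below is plain Python, not SPARQL, and it is an assumption-laden toy: the `connected` function is hypothetical, and it treats the chosen predicates as undirected links purely for grouping purposes. It shows how "sameAs" triples resolve different identifiers to one customer, and how adding family predicates widens the group to a household.

```python
from collections import defaultdict

# Triples taken from the examples in the text.
TRIPLES = [
    ("EID0001", "sameAs", "act0001"),
    ("EID0001", "is_Husband_of", "EID2343"),
    ("EID2343", "is_Sister_of", "EID6765"),
    ("EID6765", "works_for", "IBM"),
]


def connected(start, triples, predicates):
    """All identifiers reachable from `start` through the given predicates.

    Links are followed in both directions, which is an assumption made
    here so that the result is a group rather than a one-way chain.
    """
    links = defaultdict(set)
    for subj, pred, obj in triples:
        if pred in predicates:
            links[subj].add(obj)
            links[obj].add(subj)
    seen, stack = {start}, [start]
    while stack:
        for other in links[stack.pop()]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


# Every identifier that refers to the same customer as act0001.
same_customer = connected("act0001", TRIPLES, {"sameAs"})
print(sorted(same_customer))  # ['EID0001', 'act0001']

# Widening the predicates turns the same lookup into householding.
household = connected(
    "act0001", TRIPLES, {"sameAs", "is_Husband_of", "is_Sister_of"}
)
print(sorted(household))  # ['EID0001', 'EID2343', 'EID6765', 'act0001']
```

In a semantic query the same traversal would be expressed declaratively and combined with the document search, rather than written as a loop.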
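Finally, the "what did you know and when did you know it" question can be sketched as a lookup along two time axes. The field names (`valid_from`, `system_from`, and so on), the `as_of` function, and the sample loan are hypothetical and do not reflect how MarkLogic stores temporal documents. The idea is only that each version records both when a fact was true and when the repository held that version, so a corrected record never overwrites what was believed earlier.

```python
from datetime import date

FOREVER = date.max

# Each version carries two time ranges:
#   valid_*  = when the fact was true in the real world
#   system_* = when the repository held this version as current
VERSIONS = [
    {"loan_id": "M-1", "rate": 6.5,
     "valid_from": date(2005, 1, 1), "valid_to": FOREVER,
     "system_from": date(2005, 1, 3), "system_to": date(2006, 3, 1)},
    # A correction recorded in 2006: the rate was 6.25 from the start.
    # The old version is closed off in system time, not deleted.
    {"loan_id": "M-1", "rate": 6.25,
     "valid_from": date(2005, 1, 1), "valid_to": FOREVER,
     "system_from": date(2006, 3, 1), "system_to": FOREVER},
]


def as_of(versions, loan_id, valid_at, known_at):
    """Return the version true at `valid_at` as the repository knew it at `known_at`."""
    for v in versions:
        if (v["loan_id"] == loan_id
                and v["valid_from"] <= valid_at < v["valid_to"]
                and v["system_from"] <= known_at < v["system_to"]):
            return v
    return None


# What did we believe at the end of 2005 about the rate in mid-2005?
believed_then = as_of(VERSIONS, "M-1", date(2005, 6, 1), date(2005, 12, 31))
print(believed_then["rate"])  # 6.5

# What do we believe in 2010 about that same moment?
believed_now = as_of(VERSIONS, "M-1", date(2005, 6, 1), date(2010, 1, 1))
print(believed_now["rate"])  # 6.25
```

Because each version is a self-contained document, it can keep whatever structure it had when it was recorded, which is what makes this practical for long-lived instruments whose document formats keep changing.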
Building a universal mortgage document or metadata repository has been nearly impossible with relational technologies, but major banking institutions are finding it fairly easy to do with MarkLogic. We have not covered it in this blog entry, but all of the capabilities we have discussed come with enterprise-level security, high availability and disaster recovery, and ACID transactions, in a clustered environment that can scale to many billions of documents.
We do a pretty good job, though, of showing the power of a universal mortgage repository in a world where mortgages are constantly changing and where the way mortgage data is accessed grows ever more demanding.
Dave Grant put together a terrific demo that lets you draw polygons in a geographic area and see the risks to your portfolio should there be a flood or plant closings. You can see exactly what I am referring to in this webinar, where we show the very non-uniform metadata from varying documents and how it is seamlessly joined using semantic triples.
In future blog posts, I’ll show how many of the issues and challenges caused by today’s legacy infrastructures melt away.