Data integration initiatives create untold burdens for IT teams. I reached out to four veterans of data integration to identify the tasks they thought caused the biggest challenges. Not surprisingly, answers centered on areas such as knowing your data and data mapping. But what also emerged were the cultural and organizational commitments to change, and to actually leveraging integrated data. The final burden was dealing with sometimes capricious and ever-changing deadlines, set by business, government regulators, and now even the public.
What do you think? Do you agree?
The goal of data integration is to combine disparate sets of data into meaningful information. The tasks involved in accomplishing this goal generally follow common patterns, but can quickly become as varied as the data sources themselves. One complexity that has proven especially difficult is the overloaded column, where a single field carries more than one meaning.
(Example: In an RDBMS, you have a column named product_description with a data type of varchar(100); perhaps the word “discontinued” (or “dis” or “disc”, etc.) is embedded in the text to reflect the status of the product. Now the column product_description carries multiple meanings and kinds of data, and its content is inconsistent with what the column name suggests.)
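A quick sketch of how such an overloaded column might be untangled during integration. The status tokens and the example value here are hypothetical, drawn from the scenario above, not from any particular system:

```python
import re

# Hypothetical tokens that might be embedded in product_description
# to signal status ("discontinued", "disc", "dis" per the example above).
STATUS_TOKENS = re.compile(r"\b(discontinued|disc|dis)\b", re.IGNORECASE)

def split_description(raw: str) -> tuple[str, str]:
    """Split one overloaded varchar(100) value into (clean_description, status)."""
    match = STATUS_TOKENS.search(raw)
    status = "discontinued" if match else "active"
    clean = STATUS_TOKENS.sub("", raw).strip(" -,")
    return clean, status

print(split_description("Widget Pro 3000 - discontinued"))
```

The point is not the regex itself but the extra step: before the column can be mapped to a target model, its hidden second meaning has to be pulled out into a field of its own.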
These complexities demonstrate that a request to “bring data together” is no trivial exercise. When you attempt to combine sources of data that were designed and created in isolation from one another, it will be the details buried deep in the data and systems that challenge the progress toward creating one cohesive, reliable, resilient, and secure source of information.
Maureen Penzenik is a Solutions Architect at Northern Trust, focusing on Information Architecture, Data Integration, and the foundations of BI and Analytics. She previously held positions at RR Donnelley and McDonald’s Corporation, applying Data Warehousing and Business Intelligence methods and technologies.
Currently, the EU General Data Protection Regulation (GDPR) is the biggest challenge for companies handling data, since every company in the European Union has to comply with the GDPR by May 25, 2018.
The GDPR was designed to harmonize data privacy laws across Europe, to protect and empower all EU citizens’ data privacy and to reshape the way organizations across the region approach data privacy.
To comply, companies need to take a range of organizational and technical measures.
As GDPR compliance is mandatory by law and penalties loom, many companies see implementation within the given timeframe as a burden. Further, compliance requirements are not as clear-cut as “do XYZ and you are bullet-proof compliant.” Thus, there is a lot of uncertainty in the market about how to approach this and how to execute in order to be compliant.
Companies need to document how they process data, where they store it, and how they got it, and document that they have approval to process it. What’s more, they also need to establish processes that document and report the data that is stored in a given format and within a given period of time, by request of any EU citizen. Also, deletion of data and documentation thereof must be executed on demand.
When you look at it from a technical perspective, this is a comprehensive data integration project. At a baseline, you’ll need to extract, link, search, and explore/monitor data out of all the silos and formats you can imagine, going back to the company’s founding.
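The extract-link-act workflow just described can be sketched as follows. The silo names, record layout, and audit format are illustrative assumptions, not a reference implementation; real systems would put actual databases and archives behind this interface:

```python
import json
from datetime import datetime, timezone

# Hypothetical silos holding data about one subject (assumed layout).
silos = {
    "crm":     [{"subject": "alice@example.com", "consent": True, "data": {"tier": "gold"}}],
    "billing": [{"subject": "alice@example.com", "consent": True, "data": {"invoices": 3}}],
}
audit_log = []  # GDPR also requires documenting what you did, and when

def export_subject(subject: str) -> str:
    """Subject-access request: collect everything stored about one person."""
    found = {name: [r for r in records if r["subject"] == subject]
             for name, records in silos.items()}
    audit_log.append({"action": "export", "subject": subject,
                      "at": datetime.now(timezone.utc).isoformat()})
    return json.dumps(found, indent=2)

def erase_subject(subject: str) -> int:
    """Right to erasure: delete on demand and document the deletion."""
    removed = 0
    for name, records in silos.items():
        silos[name] = [r for r in records if r["subject"] != subject]
        removed += len(records) - len(silos[name])
    audit_log.append({"action": "erase", "subject": subject,
                      "at": datetime.now(timezone.utc).isoformat()})
    return removed

report = export_subject("alice@example.com")
removed = erase_subject("alice@example.com")
print(removed)
```

Even this toy version makes the integration problem visible: the hard part is not the export or the delete, but getting every silo behind one searchable, linkable interface in the first place.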
The stakes are high: penalties for non-compliance can reach €20 million or 4 percent of annual worldwide turnover, whichever is higher.
Technologies that allow working with historical and operational data at the same time not only ease the burden of compliance, but also enable new revenue streams in data-driven businesses.
Alexander Deles, CEO – EBCONT, is a member of the management board of the EBCONT group of IT companies. He brings almost two decades of experience across different industries. EBCONT is a long-term MarkLogic partner and successfully delivers projects based on MarkLogic around the globe.
As a consultant, I’ve worked with a number of Fortune 500 companies to help manage their data integration strategies better. There are multiple pain points that I’ve seen bring projects to a crawl. In general, the primary difficulty is not in the ETL. Importing a spreadsheet or extracting content from a relational database is old news, and there are multiple strategies for getting content into NoSQL or graph databases that, while not simple, are relatively mechanical.
The more complex issue involves bringing this information into a data store such as MarkLogic. The challenge comes in being able to handle resources, such as products, organizations, people, or similar entities, that are identified by different identity keys from different databases. Today, most master data management solutions use algorithms to try to map how closely two given entities match, but this is an area, in particular, where semantic technologies (a subfield of artificial intelligence) and graph databases (triple stores) shine.
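The field-matching step described above might look like this minimal sketch, using simple string similarity in place of a production master data management algorithm; the example records, field weights, and review threshold are all assumptions:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude 0..1 string similarity; production MDM would use richer models."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted field-by-field similarity between two candidate records."""
    weights = {"name": 0.6, "city": 0.4}  # illustrative weights
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

# The same organization, identified by different keys in two databases.
crm_rec = {"id": "C-102", "name": "Acme Corporation", "city": "Chicago"}
erp_rec = {"id": "884",   "name": "ACME Corp.",       "city": "Chicago"}

score = match_score(crm_rec, erp_rec)
print(f"{score:.2f}", "same entity" if score > 0.85 else "needs review")
```

Notice that the two keys (`C-102` vs. `884`) carry no matching signal at all; everything rests on fuzzy comparison of the descriptive fields, which is exactly where semantic identifiers and triple stores add rigor.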
This is especially true as 360° data initiatives become more prominent. Such initiatives, providing a global view of the data within an organization, should use semantics. Period. This means thinking about data holistically, and making an effort to create a simplified core model that can be a target for ingestion ETL from a wide variety of sources. It also involves baking in data governance, provenance, quality, and rights management as part of the overall design, as relational databases are notoriously bad at capturing any of this.
The payoffs for this, however, are worth it: it becomes possible to determine what is and is not quality data, to determine what data is applicable (or even available) by locality or time, and to build services that get at that data across multiple potential sources more easily. A similar process can be used for managing the data dictionaries and taxonomies your organization has accrued, which means you can also use semantics to coordinate relational databases with far fewer headaches.
One final arena where semantics have simplified the burden of data integration is in providing better tools for managing controlled vocabularies and taxonomies. Machine learning relies heavily upon having good, complementary facet tables and facets, and as stochastic models become more complex, so, too, do the number and types of facet tables necessary to fully describe an information space. This is a natural (and early) use of semantic technologies, and the combination of machine learning and semantics will provide a powerful boost to computational business processing.
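A controlled vocabulary of the kind described here can be sketched as a lookup from surface forms to preferred labels, much as SKOS maps altLabels to a prefLabel. The vocabulary entries below are illustrative, not drawn from any real taxonomy:

```python
# Illustrative controlled vocabulary: preferred label -> known synonyms.
vocabulary = {
    "laptop":  {"notebook", "ultrabook", "laptop computer"},
    "monitor": {"display", "screen", "lcd monitor"},
}

# Invert the taxonomy into a surface-form -> preferred-label lookup
# (the SKOS altLabel -> prefLabel pattern).
surface_to_pref = {alt: pref
                   for pref, alts in vocabulary.items()
                   for alt in alts | {pref}}

def facet_value(raw: str) -> str:
    """Normalize a free-text category to its canonical facet value."""
    return surface_to_pref.get(raw.strip().lower(), "unknown")

print([facet_value(v) for v in ["Notebook", "LCD Monitor", "tablet"]])
```

Normalizing facet values this way before they reach a model is what keeps the facet tables “good and complementary” as the number of facets grows.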
Thus, most of the pain points I’ve seen – master data management, 360° views, provenance/governance, and facet table problems – can be resolved with the proper use of semantic technologies.
An Invited Expert with the W3C, Kurt Cagle has helped develop XML, HTML, CSS, SVG, and RDF standards, regularly writes on data modeling, architecture and data quality for Data Science Central and LinkedIn. His company, Semantical LLC, provides expertise on building smart data applications and metadata repositories.
Data integration should always be driven by concrete business needs: better customer understanding to sell more, improved cost efficiency, or meeting regulatory requirements. In other words, data integration tasks should be applied to grow, optimize, innovate, or protect.
And data integration is not only a technological issue but also a cultural and organizational one. All three need to be considered as a first stage before tackling data integration: Is my organization ready to handle it? Do we have the right people? Are the different areas prepared?
Next, you need to tie data integration to a data governance approach. Data quality and data lineage are crucial to success. In fact, a common trend in the market is to provide end-to-end data lineage, regardless of whether the data is transactional or informational/historical.
The main barriers that can appear when facing data integration are exactly these cultural, organizational, and governance dimensions, so keep them in mind from the start.
As the Big Data Executive Director at everis, Juanjo Lopez leads the eDIN, a Center of Excellence around Data & Analytics, which offers services inside everis & NTT DATA Group worldwide and across all market sectors. Juanjo also belongs to the NTT DATA G1T, a worldwide steering committee defining the strategy and services offering of the Group around Data & Analytics.