We thought it would be interesting to start some conversations on the advances, challenges, and successes in data that our community might soon see, so we asked IT leaders to tell us what the biggest data news will be in 2018. Drawing on their experience in the field and their knowledge of cutting-edge practices, they predict that this is the year data integration moves beyond traditional ETL practices to different, more seamless tools. A common theme among our IT experts was that data will come out of the relational basement, so to speak: it will become easier for business users to see and understand, will be more trusted across the enterprise, and will drive more practical and economical uses.
Rise in Trusted Data Drives API Economy
In 2018, we will see data integration across silos become more trusted, and that, in turn, will fuel the API economy.
The key components of trusted data are:
- Auditability – the ability to easily query and report historically on the data used by APIs, exactly as it existed when it was provisioned, down to the microsecond
- Traceability – the ability to easily link the conformed data used by APIs back to its raw form, along with any other relevant information about the source application
- Consistency – the guarantee that APIs will always render the same result for the same data at any particular timestamp
These are just three of the key requirements of a robust data governance practice.
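To make auditability and consistency concrete, here is a minimal sketch in plain Python (not any particular product; the class, keys, and field names are invented): every write is retained with its provisioning timestamp, so any past state can be reproduced exactly, and an as-of query for a given timestamp always returns the same answer.

```python
from datetime import datetime

class VersionedStore:
    """Toy append-only store: each write is kept with its provisioning
    timestamp, so past states can be reproduced exactly (auditability)
    and an as-of query is deterministic (consistency)."""

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value) pairs

    def put(self, key, value, ts):
        self._versions.setdefault(key, []).append((ts, value))
        self._versions[key].sort(key=lambda pair: pair[0])  # keep chronological

    def as_of(self, key, ts):
        """Return the value that was current at `ts`, or None if none existed yet."""
        result = None
        for version_ts, value in self._versions.get(key, []):
            if version_ts <= ts:
                result = value
            else:
                break
        return result

store = VersionedStore()
store.put("acct-42", {"balance": 100}, datetime(2018, 1, 1))
store.put("acct-42", {"balance": 250}, datetime(2018, 2, 1))
jan_view = store.as_of("acct-42", datetime(2018, 1, 15))
```

Because writes are never overwritten, an API backed by such a store can always answer "what did we serve at this timestamp?" without special-case audit logic.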
Another trend I predict you will see is the increase in the use of analytical data by APIs.
While the data enrichment done by analytical applications has always been valuable, it has historically been hard to share with customers because of the limitations of business intelligence applications.
We’ll see a shift toward combining the operational data that customers currently see with analytical markers produced by predictive models created by data scientists. This combination is key to the personalized user experience that the API economy demands, and a seamless integration, rendered for API consumption, will ensure a delightful user experience.
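A minimal sketch of that combination (the field names and model outputs are invented for illustration): the operational record a customer already sees is merged with analytical markers from a hypothetical predictive model into a single payload rendered for API consumption.

```python
def render_api_payload(operational, analytical):
    """Combine the operational record with model-derived analytical
    markers into one personalized API response. All field names here
    are illustrative, not a real schema."""
    payload = dict(operational)            # keep the familiar operational view
    payload["insights"] = {                # attach the analytical markers
        "churn_risk": analytical.get("churn_risk"),
        "next_best_offer": analytical.get("next_best_offer"),
    }
    return payload

order = {"customer_id": "c-101", "last_order": "2018-01-12", "total": 84.50}
scores = {"churn_risk": 0.12, "next_best_offer": "annual-plan"}
view = render_api_payload(order, scores)
```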
Mike Fillion, Vice President of Data Services at Tahzoo, specializes in data integration and analytics, leads teams of application architects and data modelers, and is a globally recognized data evangelist who speaks and leads workshops on practical data implementations.
Data Integration Methods Meet Practical Problems
I have noticed three trends, and one set of common elements.
- Datasets. Talking about datasets is hardly new, but there is a resurgence of practical interest as government agencies, professional societies, and universities pursue their strategic goals for 2018. The datasets that lurk underneath published research, generated in the testing of specific hypotheses or in pursuit of basic science, are in increasing demand. This is a priority of NIH and of the recently appointed director of the National Library of Medicine, and is the subject of the current round of Hackathons at NCBI. The use cases include a heightened interest in replication studies, and a desire to combine datasets to create new knowledge. Organizations have been formed to facilitate re-use and interoperability, such as the Data Conservancy. The data integration challenges around datasets are similar to those in dealing with metadata, with a few added twists. The metadata used to describe the datasets needs to be harmonized for effective discoverability. Then there is the content of the datasets, with field names, granularity, and details of measurement to be harmonized for comparability.
- Data integration for the discovery of emergent patterns. Classic goals of data integration included pursuit of the insights that can come from compilation, aggregation, and statistics. A new addition is the exploration of more sparsely distributed data. Data Science and Artificial Intelligence are put to practical use in exploring weak connections to determine which are most likely noise and which represent new discoveries.
- And the reverse: AI to find emergent patterns to help with entity relationship modeling and with indexing. Some of the more “out there” approaches to data integration were mentioned in DZone’s January predictions. For some “fun with math” see the approach proposed in The Case for Learned Index Structures.
- Common elements: One thing common across each approach is that there will be messy data that doesn’t fit the models (or, in some cases, messy models that don’t fit the data), and there will be iterations through changing views of the universe as metadata is mapped, models are revised, and entities are outfitted with new IDs. Two tools that are very convenient for helping the cause are the “envelope” pattern long used in MarkLogic data integration (and more recently expanded in the Data Hub), and RDF graphs (an effective, flexible representation of the world as it is understood at this moment). Each supports iterative feedback and adjustment in the course of integration.
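As a concrete sketch of the envelope pattern (the section names follow MarkLogic's headers/instance/attachments convention, but the mapping function and record fields are invented): the raw source record is preserved untouched next to a harmonized, canonical view, so the mapping can be revised and re-run in later iterations without losing traceability.

```python
def wrap_in_envelope(raw_record, source, harmonize):
    """Envelope pattern sketch: keep the raw record intact, add a
    canonical (harmonized) view and provenance headers. `harmonize`
    is any mapping function; re-running a revised mapping produces a
    new canonical view while the raw form stays available."""
    return {
        "headers": {"source": source, "harmonized_by": harmonize.__name__},
        "instance": harmonize(raw_record),  # canonical, model-conformed view
        "attachments": raw_record,          # original form, never modified
    }

def v1_mapping(rec):
    # Illustrative first-pass harmonization: rename fields, convert units.
    return {"patient_id": rec["PatID"], "weight_kg": rec["WtLbs"] * 0.4536}

envelope = wrap_in_envelope({"PatID": "P-9", "WtLbs": 154}, "legacy_emr", v1_mapping)
```

Because the attachments section still holds the original record, swapping `v1_mapping` for a corrected `v2_mapping` later is just a re-run, which is exactly the iterative adjustment described above.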
Beverly Jamison, Ph.D., is a computer scientist and IT architect who specializes in semantics, AI/machine learning and data integration. She has taught grad and undergrad classes in computer science and served as director of IT architecture and publishing solutions at the American Psychological Association. Currently, she is a consultant at Practical Semantics.
Successful Integration Includes Querying Across Data Silos
Data integration remains one of the most intractable and critical constituent pieces of the new era in information management. Unless information seekers can look across sources—information silos—they will not be able to put together the whole picture of what is happening, what patterns are emerging, what risks are looming. Combining human understanding with the speed and volume that machine learning offers is probably the only way to tackle this barrier.
Sue Feldman, founder of the Cognitive Computing Consortium and CEO of Synthexis, provides business advisory services to vendors and buyers of cognitive computing, search and text analytics technologies. She is the author of the book, The Answer Machine.
Distributed Computing Simplifies Integration Processes
Below are two trends I see in data integration in 2018, particularly in the life sciences industry and other regulated industries.
- A new, simplified model for data integration: When we think of data integration, ETL (extract, transform, load) functions come to mind. ETL at the enterprise level is overly complex and crowded with tools built mostly for on-premises, centralized data warehousing. As large organizations move toward Hadoop and MapReduce environments, traditional relational databases are no longer the default. These environments will gain prominence, along with tools that handle unstructured data as simply as tabular data. The complex ETL model of data warehousing is thus falling out of favor, and a simpler model for data integration that leverages cloud-native functions and Hadoop/MapReduce file systems will emerge.
- Data governance and stewardship: There is a growing recognition of the need for enterprise data governance and the engagement of business (functional) users as information experts during data integration, as well as the need to stage the integrated data for data science applications across functional groups. I think business users will take an increasing role in data governance, quality, and stewardship alongside IT.
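The shift away from up-front transformation can be sketched as schema-on-read, with plain Python standing in for a Hadoop or cloud-native job (the record shapes and unit handling are invented): raw records land unmodified, and harmonization is applied only at query time.

```python
import json

# Raw records land as-is, as they would in HDFS or object storage;
# note the two sources disagree on field names and units.
RAW_LINES = [
    '{"device": "t-1", "temp_c": 21.5, "ts": "2018-03-01T10:00:00"}',
    '{"device": "t-2", "temp_f": 70.1, "ts": "2018-03-01T10:00:00"}',
]

def read_with_schema(lines):
    """Schema-on-read sketch: the schema (and unit harmonization) is
    applied when the data is read, replacing the up-front 'T' of
    classic ETL."""
    for line in lines:
        rec = json.loads(line)
        temp_c = rec.get("temp_c")
        if temp_c is None and "temp_f" in rec:
            temp_c = (rec["temp_f"] - 32) * 5 / 9  # harmonize units on read
        yield {"device": rec["device"], "temp_c": round(temp_c, 2), "ts": rec["ts"]}

rows = list(read_with_schema(RAW_LINES))
```

The design trade-off is the one described above: loading is trivial and lossless, and the transformation logic lives in (easily revised) read-side code rather than in a rigid warehouse schema.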
Harsha Rajasimha, Ph.D., Senior Director, Life Sciences at NTT Data, is a scientist and executive leader with more than 15 years of distinguished experience in the fields of life sciences consulting, systems biology, IT systems integration, big data analytics, genomics of rare diseases, and precision medicine.
Digital Transformation Tools Become Seamless
Enterprises’ digital transformations include adapting to the changing demands of different formats of data (structured, semi-structured, unstructured) and different velocities of data (batch, near-real-time, real-time). This has put data integration front and center, and while ETL vendor tools remain relevant, we still see organizations use a bouquet of data integration solutions (scripting plus commercial ETL engines) for their ingestion, data processing and transformation, curation, and consumption/syndication needs.
In 2018, I foresee convergence into more seamless tools so that companies don’t have to buy multiple tools and still resort to engineering and coding custom routines/scripts in Scala, Python etc.
Another trend I anticipate in 2018, with the emergence of data engineering, is the convergence of data preparation and wrangling into the core data integration framework. For too long we have viewed data prep and wrangling as a pure consumption/analytics enabler. We are now seeing the entire data integration value chain of Ingest > Prepare > Catalog > Secure > Govern > Access as, in a way, one giant data preparation step, and DI tool vendors are starting to align their capabilities accordingly.
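That value chain can be sketched as a single composed flow. Every stage below is an invented placeholder, not a vendor API; the point is only that prep-style logic (cleansing, masking, quality rules) runs inside one pipeline rather than in a separate wrangling tool.

```python
def ingest(raw):
    return [dict(r) for r in raw]          # copy records in, untouched

def prepare(records):
    # data prep/wrangling folded into the core pipeline
    return [{**r, "email": r["email"].strip().lower()} for r in records]

def catalog(records):
    for r in records:
        r["_catalog"] = {"dataset": "customers", "fields": sorted(r)}
    return records

def secure(records):
    for r in records:                      # mask PII before downstream access
        r["email"] = r["email"].split("@")[0][:2] + "***"
    return records

def govern(records):
    # enforce a simple quality rule: records must carry an id
    return [r for r in records if r.get("id")]

def access(records):
    return records                         # hand off to consumers

def pipeline(raw):
    """Ingest > Prepare > Catalog > Secure > Govern > Access as one flow."""
    data = raw
    for stage in (ingest, prepare, catalog, secure, govern, access):
        data = stage(data)
    return data
```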
Parthasarathy Rangachari, Senior Director, Analytics and Information Management, Cognizant Technology Solutions, is a leader and practitioner in big data technologies specializing in enterprise data lakes, data management and governance and analytical use cases.