MarkLogic uses machine learning to solve complex data problems by leveraging the new Embedded Machine Learning capability that runs at the core of MarkLogic.
MarkLogic Embedded Machine Learning helps you achieve the best results because your machine learning models have direct access to high-quality, curated, governed data. And if you’re not a data scientist, that’s okay too. We also use this capability to improve how MarkLogic operates and how data is curated, and all of that is completely transparent to users of the MarkLogic Data Hub.
Machine learning can be thought of as pattern recognition in data. The challenge is that voluminous, complex data makes it difficult to detect relationships between attributes without advanced tools. A machine learning model is a mathematical representation of those relationships.
Above all, machine learning provides levels of accuracy with data and insights that were not previously possible.
Lack of quality and governance: You need proper governance to trust your data, both for effective machine learning and to foster trust in machine learning outputs. You need to be able to answer questions such as: What data should be used? Where did it come from, and what’s been done to it? Does it contain PII? Is it the same data we used last time? Good data is critical because machine learning can be even more sensitive to data quality, since you’re using the same data to both train and then execute the model. As a result, any problems with data quality get amplified.
Wild west ecosystem: The machine learning and AI tools ecosystem is incredibly complex, and as security and governance become priorities, it is tough to find people with the right skill sets to build and maintain the systems. According to an article in The New York Times, data scientists spend as much as 80% of their time just wrangling data.
Low business ROI: Often the business doesn’t trust the ‘black box’ outputs of machine learning models, even when they are accurate. AI investments at most companies look more like science projects than core infrastructure because businesses don’t understand or trust the outputs of machine learning models enough to make decisions with them. And data scientists and the hardware infrastructure they need aren’t cheap. High costs and untrusted outputs add up to a low overall ROI.
We think the best place to do machine learning is in a data hub where data can be secured, governed and curated. That’s why we built MarkLogic Embedded Machine Learning into the core of MarkLogic. Machine learning routines can run close to the data, in parallel across a MarkLogic cluster, under the umbrella of a secure environment.
Embedded machine learning was designed for peak performance on both CPUs and GPUs, and it scales to multi-machine, multi-GPU systems. It also uses a compression technique that dramatically reduces inter-node communication costs, enabling highly scalable parallel training across multiple machines.
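To give a feel for how compressing training traffic can work, here is a minimal sketch in Python. The page does not name the compression technique, so this is an assumption: it illustrates 1-bit gradient quantization with error feedback, the general idea behind the 1-bit SGD approach documented for CNTK. The function names are hypothetical and are not part of any MarkLogic or CNTK API.

```python
from typing import List, Tuple


def one_bit_decompress(bits: List[bool], neg_level: float, pos_level: float) -> List[float]:
    """Rebuild an approximate gradient from one bit per value plus two levels."""
    return [pos_level if b else neg_level for b in bits]


def one_bit_compress(
    grad: List[float], residual: List[float]
) -> Tuple[List[bool], float, float, List[float]]:
    """Quantize a gradient to one bit per value (illustrative sketch only).

    Returns (bits, neg_level, pos_level, new_residual). The residual holds
    the quantization error so it can be added back on the next step.
    """
    # Add the error left over from the previous step before quantizing.
    corrected = [g + r for g, r in zip(grad, residual)]

    # One bit per value: True for non-negative, False for negative.
    bits = [v >= 0.0 for v in corrected]

    # Each sign group is reconstructed as the mean of its members.
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    pos_level = sum(pos) / len(pos) if pos else 0.0
    neg_level = sum(neg) / len(neg) if neg else 0.0

    # Whatever the receiver cannot reconstruct is carried forward locally.
    decoded = one_bit_decompress(bits, neg_level, pos_level)
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, neg_level, pos_level, new_residual


if __name__ == "__main__":
    gradient = [0.5, -0.25, 1.5, -0.75]
    bits, neg, pos, residual = one_bit_compress(gradient, [0.0] * len(gradient))
    print(bits, neg, pos, residual)
```

In this sketch a node would send only the bit vector and the two reconstruction levels instead of one floating-point number per gradient value, which is where the reduction in inter-node traffic comes from. The residual stays on the sending node and is folded into the next gradient, so the quantization error is deferred rather than discarded.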
Embedded machine learning also supports the Open Neural Network Exchange (ONNX) format, an open-source shared model representation that allows for framework interoperability and shared optimization. ONNX lets developers move models between popular frameworks such as CNTK, MXNet, PyTorch, and others.
The toolkit used to build MarkLogic Embedded Machine Learning was originally developed by Microsoft and released as the Microsoft Cognitive Toolkit, or CNTK. Microsoft has used CNTK in keystone products like Skype, HoloLens, Cortana, and Bing.
Watch a talk introducing MarkLogic’s new machine learning algorithms and GPU acceleration capabilities. Learn more about data curation and take a deep dive into one company’s machine learning implementation.