MarkLogic uses machine learning to solve complex data problems by leveraging the new Embedded Machine Learning capability that runs at the core of MarkLogic.
Machine learning can be thought of as pattern recognition in data. The challenge, however, is voluminous and complex data that makes it difficult to detect relationships between attributes in the data without advanced tools. A machine learning model is a mathematical representation of relationships that uses algorithms to:
Above all, machine learning delivers levels of accuracy and insight that were not previously possible, helping you create unique models and systems. This advanced computer intelligence will, in turn, help increase intelligence across your enterprise.
Machine learning offers tremendous upside, but you are likely to experience a few challenges along the way. The most common challenges are:
You need proper data governance, not only for effective machine learning but also to foster trust in machine learning outputs. You need to be able to answer questions such as: What data should be used? Where did it come from, and what has been done to it? Does it contain PII? Is it the same data we used last time? Good data is critical because machine learning can be especially sensitive to data quality: you use the same data both to train and then to execute the model. As a result, any problem with data quality can cause regressions in the system.
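As a minimal sketch of how two of the questions above ("Where did it come from?" and "Is it the same data we used last time?") could be answered mechanically, the Python below records a provenance entry with a content fingerprint for a training set. The function names, the record layout, and the list of PII field names are hypothetical choices made for illustration; this is not a MarkLogic API.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def fingerprint(records: List[Dict[str, Any]]) -> str:
    """Return a SHA-256 digest that ignores record order and key order."""
    # Serialize each record with sorted keys, then sort the serialized records,
    # so the same data always produces the same digest.
    canonical = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def provenance_entry(source: str,
                     records: List[Dict[str, Any]],
                     pii_fields: Iterable[str]) -> Dict[str, Any]:
    """Summarize where a training set came from and what it contains."""
    seen_fields = {key for record in records for key in record}
    return {
        "source": source,
        "record_count": len(records),
        # Hypothetical check: flags only fields whose names are listed as PII.
        "pii_fields_present": sorted(seen_fields & set(pii_fields)),
        "fingerprint": fingerprint(records),
    }


if __name__ == "__main__":
    training = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
    print(provenance_entry("crm-export", training, pii_fields=["email", "ssn"]))
```

Comparing the stored fingerprint with one computed before the next training run shows whether the data changed in the meantime.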
The ecosystem of machine learning and artificial intelligence tools is incredibly complex, and as security and governance become priorities, it is tough to find people with the right skill sets to build and maintain these systems. According to an article in The New York Times, data scientists spend up to 80% of their time just wrangling data. It’s important to research the tools and make the right call.
Often, the business doesn’t trust the ‘black box’ outputs of machine learning models, even when they are accurate. For most companies, machine learning investments look more like science projects than core infrastructure, because the business doesn’t understand or trust the outputs of artificial intelligence enough to base decisions on them. In addition, data scientists and the hardware infrastructure they need aren’t cheap. High costs and poor outputs add up to a low overall ROI.
We think the best place to do machine learning is in a data hub where data can be secured, governed and curated. That’s why we built MarkLogic Embedded Machine Learning into the core of MarkLogic. Machine learning routines can run close to the data, in parallel across a MarkLogic cluster, under the umbrella of a secure environment.
Challenges aside, the advantages of machine learning continue to grow with more complex algorithms. As businesses adapt to artificial intelligence, these are the benefits people are experiencing:
With Embedded Machine Learning, MarkLogic will run queries more efficiently and scale autonomously based on workload patterns. With autonomous elasticity, for example, MarkLogic can use models of infrastructure workload patterns to automatically adjust the rules that govern data and index rebalancing.
Embedded Machine Learning reduces complexity and increases automation of various steps in the data curation process. For example, with MarkLogic’s Smart Mastering feature, machine learning will augment the rules-based mastering process so that records are mastered with more accuracy, and models continue to improve as more data is processed—all with less human involvement.
For data scientists, it’s now simpler to just do the work of training and executing models right inside MarkLogic, where we can handle almost every part of the architecture and process. This includes data processing/curation, and the model engineering to build, train, execute and deploy the model.
MarkLogic’s Embedded Machine Learning is a full deep learning toolkit that operates as a run-time library installed right at the core of MarkLogic, in the database kernel. It exposes its functions as built-ins from JavaScript and XQuery, which means these functions run close to the data and are completely integrated.
Embedded Machine Learning was designed for peak performance on both CPUs and GPUs, and it scales to multi-machine, multi-GPU systems. It also uses a compression technique that dramatically reduces inter-node communication costs, enabling highly scalable parallel training across multiple machines.
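The text above does not name the compression technique. One well-known scheme of this kind is 1-bit gradient quantization with error feedback, in which each worker sends one bit per gradient value plus two scaling constants, and carries the rounding error into its next step. The plain-Python sketch below illustrates that general idea only; it assumes this family of techniques, the function names are hypothetical, and it is not MarkLogic's implementation.

```python
from typing import List, Tuple


def decompress_one_bit(bits: List[bool], pos_mean: float, neg_mean: float) -> List[float]:
    """Rebuild an approximate gradient from sign bits and two constants."""
    return [pos_mean if bit else neg_mean for bit in bits]


def compress_one_bit(gradient: List[float],
                     residual: List[float]) -> Tuple[List[bool], float, float, List[float]]:
    """Quantize a gradient to one bit per value, carrying the error forward.

    Returns the sign bits, the two reconstruction constants, and the new
    residual (quantization error) to add to the next gradient.
    """
    # Error feedback: fold in what the previous quantization step lost.
    corrected = [g + r for g, r in zip(gradient, residual)]

    positives = [v for v in corrected if v >= 0.0]
    negatives = [v for v in corrected if v < 0.0]
    pos_mean = sum(positives) / len(positives) if positives else 0.0
    neg_mean = sum(negatives) / len(negatives) if negatives else 0.0

    # One bit per value: True for non-negative, False for negative.
    bits = [v >= 0.0 for v in corrected]

    # Whatever the receiver cannot reconstruct stays local as the residual.
    decoded = decompress_one_bit(bits, pos_mean, neg_mean)
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, pos_mean, neg_mean, new_residual


if __name__ == "__main__":
    grad = [0.5, -0.25, 1.5, -0.75]
    bits, pos, neg, res = compress_one_bit(grad, [0.0] * len(grad))
    print(bits, pos, neg, res)
```

Each node transmits only the bits and two floats instead of one float per value, which is where the reduction in inter-node traffic comes from in schemes like this.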
Embedded Machine Learning also supports the Open Neural Network Exchange (ONNX) format, an open-source shared model representation that allows for framework interoperability and shared optimization. ONNX lets developers move models between popular frameworks such as CNTK, MXNet, PyTorch, and others.
The toolkit used to build MarkLogic Embedded Machine Learning was originally developed by Microsoft and released under the name Cognitive Toolkit, or CNTK. Microsoft used CNTK to develop keystone products like Skype, HoloLens, Cortana, and Bing.