The amount of data flowing into and between systems continues to grow every day. With these ever-increasing volumes of data, system integrators are turning to tools like Apache Kafka to provide a central routing service for streaming that data. Among the primary consumers of that data are databases like MarkLogic.
However, in order to subscribe to the Kafka topics, retrieve the messages, and subsequently load them into a MarkLogic database, we need an efficient and reliable tool to act as the bridge: the Kafka-MarkLogic-Connector.
This tool is intended for anyone interested in using Kafka to stream data to MarkLogic. That could be a solutions engineer working with a potential customer that is considering Kafka, or a consultant working with an existing customer to design a solution around Kafka and MarkLogic. Or it may simply be an experimenter: somebody who is trying out different technologies for learning or for fun.
The Kafka-MarkLogic-Connector, written in Java, uses the standard Kafka APIs and libraries to subscribe to Kafka topics and consume messages. The connector then uses the MarkLogic Data Movement SDK (DMSDK) to efficiently store those messages in a MarkLogic database. As messages stream onto the Kafka topic, the DMSDK's worker threads aggregate them and push them into the database according to a configured batch size and time-out threshold.
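To make that pattern concrete, here is a minimal sketch of a Kafka consumer feeding a DMSDK WriteBatcher. This is not the connector's actual code or configuration; the hostnames, ports, credentials, topic name, URI scheme, batch size, and thread count below are all illustrative assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

public class KafkaToMarkLogicSketch {
    public static void main(String[] args) {
        // Subscribe to a Kafka topic with the standard consumer API.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "marklogic-writer");          // assumed consumer group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("marklogic-topic")); // assumed topic

        // Set up a DMSDK WriteBatcher: its threads aggregate documents and
        // write a batch to MarkLogic whenever the batch size is reached.
        DatabaseClient client = DatabaseClientFactory.newClient(
            "localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("admin", "admin")); // assumed
        DataMovementManager dmm = client.newDataMovementManager();
        WriteBatcher batcher = dmm.newWriteBatcher()
            .withBatchSize(100)   // assumed batch size
            .withThreadCount(4);  // assumed thread count
        dmm.startJob(batcher);

        // Poll Kafka and hand each message to the batcher; full batches are
        // pushed to the database in the background.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                String uri = "/kafka/" + record.topic() + "/" + record.offset() + ".json";
                batcher.add(uri, new StringHandle(record.value()).withFormat(Format.JSON));
            }
            if (records.isEmpty()) {
                // The topic has gone quiet; flush any partial batch so slow
                // streams still land within a bounded delay (the connector
                // uses a configurable time-out threshold for this purpose).
                batcher.flushAsync();
            }
        }
    }
}
```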
All three components of the system (Kafka, MarkLogic, and the Kafka-MarkLogic-Connector) are designed so that new servers can easily be added. New Kafka nodes can provide redundancy to prevent data loss; combined with MarkLogic's ACID transactions, this makes the system highly reliable. New server nodes can also quickly and dynamically increase available bandwidth. As resources become saturated, each of the three components can be scaled independently to meet data-flow requirements.
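As one illustration of the Kafka side of this, a topic can be created with multiple partitions (so several connector consumers can share the load) and a replication factor greater than one (so each message survives the loss of a broker). The topic name, partition count, and replication factor below are assumed values, not settings from the connector:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions allow up to six consumers in one group to split
            // the work; replication factor 3 keeps each message on three
            // brokers, so a single node failure loses no data.
            NewTopic topic = new NewTopic("marklogic-topic", 6, (short) 3); // assumed values
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

Consumers that share a group.id automatically divide a topic's partitions among themselves, so adding connector instances raises throughput without any manual re-routing.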
To summarize, this tool is used primarily for streaming large amounts of data into MarkLogic. Kafka is a message-streaming system capable of handling very high volumes. Those messages may need to be stored somewhere, and that somewhere is MarkLogic. Using just a single MarkLogic server on an AWS t2.xlarge instance, the connector can retrieve and store approximately 4,000 messages per second.
Thus, this system has the potential to work with high-bandwidth data sources such as IoT sensors, satellite constellations, or internet traffic data. Ultimately, the speed of each component means that data from such sources can be stored reliably, which is valuable in nearly any domain.
If you’d like some hands-on experience with the tool, read the Quickstart with the Kafka-MarkLogic-Connector in AWS to get a basic version of this system set up.