
Streaming Data into MarkLogic with the Kafka-MarkLogic Connector

3 minute read

Why use Kafka with MarkLogic?

The amount of data flowing into and between systems continues to grow every day. With these ever-increasing volumes of data, system integrators are turning to tools like Apache Kafka to provide a central routing service for streaming that data. Databases like MarkLogic are among the primary consumers of that data.

However, to subscribe to Kafka topics, retrieve the messages, and load them into a MarkLogic database, we need an efficient and reliable tool to act as the bridge: the Kafka-MarkLogic-Connector.

This tool is intended for anyone interested in using Kafka to stream data to MarkLogic. That could be a solutions engineer working with a potential customer who is considering Kafka, or a consultant working with an existing customer to design a solution around Kafka and MarkLogic. It could also simply be an experimenter: somebody who is trying out different technologies for learning or for fun.

How does the Kafka-MarkLogic-Connector work?

The Kafka-MarkLogic-Connector, written in Java, uses the standard Kafka APIs and libraries to subscribe to Kafka topics and consume messages. The connector then uses the MarkLogic Data Movement SDK (DMSDK) to efficiently store those messages in a MarkLogic database. As messages stream onto the Kafka topic, the DMSDK threads aggregate them and push them into the database in batches, based on a configured batch size and time-out threshold.
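
To make the batch size and time-out behavior concrete, here is a minimal sketch in plain Python. It is not the DMSDK API and not the connector's code; the class name `MicroBatcher`, the `write_batch` callback, and the parameter names are all hypothetical, chosen only to illustrate the idea of flushing either when a batch fills up or when a partial batch has waited too long.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Illustrative only: groups messages and flushes by size or by age."""

    def __init__(
        self,
        write_batch: Callable[[List[str]], None],  # hypothetical "store in database" hook
        batch_size: int = 100,
        max_wait_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._write_batch = write_batch
        self._batch_size = batch_size
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._pending: List[str] = []
        self._oldest: Optional[float] = None  # arrival time of the oldest pending message

    def add(self, message: str) -> None:
        """Queue one message; flush immediately if the batch is now full."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(message)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes a partial batch that has waited too long."""
        if self._pending and self._clock() - self._oldest >= self._max_wait:
            self.flush()

    def flush(self) -> None:
        """Hand all pending messages to the writer as one batch."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._oldest = None
            self._write_batch(batch)


if __name__ == "__main__":
    written: List[List[str]] = []
    fake_now = [0.0]  # a controllable clock keeps the demo deterministic
    batcher = MicroBatcher(written.append, batch_size=3, max_wait_seconds=2.0,
                           clock=lambda: fake_now[0])
    for i in range(4):
        batcher.add(f"msg-{i}")  # the third message fills a batch and triggers a flush
    fake_now[0] = 2.5
    batcher.tick()               # the lone fourth message is flushed by the time-out
    print(written)
```

The time-out matters on quiet topics: without it, a partial batch would sit in memory until enough new messages arrived to fill it.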

All three components of the system (Kafka, MarkLogic, and the Kafka-MarkLogic-Connector) are designed so that new servers can easily be added. New Kafka nodes can be used for redundancy to prevent data loss. Combined with MarkLogic’s ACID transactions, this gives the system extremely high reliability. New server nodes can also quickly and dynamically increase available bandwidth. As resources are maxed out, each of the three components may be expanded independently to meet data flow requirements.

What are the advantages of using the tool?

  • Scalability: A system made up of Kafka, the Kafka-MarkLogic-Connector, and MarkLogic is broadly scalable and reliable, and each of these components can scale independently. As the demands on Kafka increase, additional connectors may be added to monitor the topic. With the MarkLogic cluster behind a load balancer, a properly configured system is capable of processing a very large number of messages per minute.
  • No Code: The Kafka-MarkLogic-Connector is convenient and simple to use. As a part of the system described previously, each component may be set up and integrated without writing any code. All that is required is configuring the connector so it’s aware of the Kafka cluster and connected to the MarkLogic cluster. Once those parameters are set properly, you are ready to stream messages from a Kafka topic to a MarkLogic database.
  • AWS-Ready: All of these components are compatible with AWS cloud computing services, so all the advantages of AWS are available as well. We can design and deploy our system using tools such as CloudFormation and monitor it using CloudWatch. Additionally, since each of the three main components is scalable, we can take advantage of AWS auto-scaling to automatically grow and shrink each component as needs dictate.
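
To give a feel for what "configuring the connector" from the No Code point might involve, the sketch below assembles a sink configuration and renders it in the `key=value` properties format that Kafka Connect reads. The keys `name`, `connector.class`, `tasks.max`, and `topics` are standard Kafka Connect settings, and the Kafka cluster address itself normally lives in the Connect worker's `bootstrap.servers` setting. The `ml.*` key names, the connector class name, and every value shown are assumptions for illustration; check them against the documentation for the connector release you install. The helper `render_properties` is hypothetical.

```python
from typing import Dict

# Illustrative settings only. The ml.* key names and the class name are
# assumptions to verify against the connector's own documentation.
SINK_SETTINGS: Dict[str, str] = {
    "name": "marklogic-sink",
    "connector.class": "com.marklogic.kafka.connect.sink.MarkLogicSinkConnector",
    "tasks.max": "1",
    "topics": "example-topic",                           # Kafka topic(s) to consume
    "ml.connection.host": "marklogic.example.internal",  # placeholder host
    "ml.connection.port": "8000",
    "ml.connection.username": "example-user",
    "ml.connection.password": "change-me",
    "ml.dmsdk.batchSize": "100",                         # batch size used when writing
    "ml.dmsdk.threadCount": "8",
}

# Settings this sketch refuses to render without.
REQUIRED_KEYS = ("name", "connector.class", "topics",
                 "ml.connection.host", "ml.connection.port")


def render_properties(settings: Dict[str, str]) -> str:
    """Return the settings as key=value lines, failing fast on missing keys."""
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    print(render_properties(SINK_SETTINGS), end="")
```

In practice the rendered text would be saved as a properties file and passed to Kafka Connect; no connector code needs to be written or changed.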

What can be done with this tool in MarkLogic?

To summarize, this tool is used primarily for streaming large amounts of data into MarkLogic. Kafka is a message streaming system that is capable of incredible volumes. Those messages may need to be stored somewhere, and that somewhere is MarkLogic. Using just a single MarkLogic server on an AWS t2.xlarge instance, the connector can retrieve and store approximately 4,000 messages per second.
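
As a rough sense of scale, the per-second figure above can be projected over longer periods. The sketch below is simple arithmetic and assumes the quoted rate holds steady, which real workloads may not; the function names are hypothetical.

```python
MESSAGES_PER_SECOND = 4_000  # approximate figure quoted for a single t2.xlarge server


def messages_in(seconds: int, rate_per_second: int = MESSAGES_PER_SECOND) -> int:
    """Messages stored in the given time, assuming a constant rate."""
    return rate_per_second * seconds


def seconds_to_drain(backlog: int, rate_per_second: int = MESSAGES_PER_SECOND) -> int:
    """Whole seconds needed to store a backlog, rounded up."""
    return -(-backlog // rate_per_second)  # ceiling division on integers


if __name__ == "__main__":
    print(messages_in(60))               # one minute: 240000
    print(messages_in(60 * 60))          # one hour:   14400000
    print(messages_in(24 * 60 * 60))     # one day:    345600000
    print(seconds_to_drain(10_000_000))  # 10 million queued messages: 2500
```

At that rate, a backlog of ten million messages would take roughly 42 minutes to store on a single server.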

Thus, this system has the potential to work with high-bandwidth data sources, such as IoT sensors, satellite constellations, or internet traffic data. Because each component can keep pace with the incoming stream, the data can be stored reliably as it arrives, which is valuable in almost any application.

If you’d like some hands-on experience with the tool, read the Quickstart with the Kafka-MarkLogic-Connector in AWS to get the basic version of this system set up.

Phil Barber

Phil has been building solutions using MarkLogic for nearly eight years, including the last five as a MarkLogic consultant. He has more than 30 years of experience in the software industry and loves solving problems. Phil lives with his wife and family in Fredericksburg, VA, and enjoys games and learning new skills.
