We are excited to announce support for using Apache NiFi to ingest data into MarkLogic. Apache NiFi is an open source tool for distributing and processing data. When used alongside MarkLogic, it’s a great tool for building ingestion pipelines. NiFi has an intuitive drag-and-drop UI and over a decade of development behind it, with a big focus on security and governance.
One of the historical challenges to adopting new NoSQL databases is getting legacy relational data migrated over. Relational databases store data in rows and columns in a highly normalized form. MarkLogic, a multi-model NoSQL database, stores data as JSON and XML documents and RDF triples. Typically, you group data into natural “entities” that are modeled as documents, and you add RDF triples to capture meaningful relationships among the entities.
NiFi helps to naturally group your data by either converting relational rows to small documents or joining groups of rows together into hierarchical structures using primary/foreign key relationships. With new MarkLogic processors, this data then moves quickly into MarkLogic with minimal configuration and high performance.
The NiFi approach uses the data model that already exists in the relational database to the extent possible, avoiding costly, fragile and slow ETL jobs. Existing approaches such as MarkLogic Content Pump (mlcp) still work well for getting data into MarkLogic. But, NiFi makes the whole process of ingesting relational data to MarkLogic faster and easier. And, you don’t need to buy a separate ETL tool.
Here is an example:
The above screenshot shows a simple process for getting relational data into MarkLogic. An SQL query is executed to get data out of a relational system. Then, a NiFi processor converts the resulting Avro serialized data to JSON, and the JSON data is put into MarkLogic. Watch this five-minute demo that shows how to get relational data ingested into MarkLogic using NiFi.
NiFi is designed and built to handle real-time data flows at scale. But, NiFi is not advertised as an ETL tool, and we don’t think it should be used for traditional ETL. The sweet spot for NiFi is handling the “E” in ETL. It extracts data easily and efficiently. If necessary, it can do some minimal transformation work along the way. We think it’s better to let the database (i.e., MarkLogic) take care of the data transformation and harmonization.
The main benefits of NiFi include the following:
The main concepts to understand when using NiFi are dataflows, processors and connections. You create a dataflow by wiring together processors with connections. A dataflow can be saved as a template, and these templates can be combined into more complex flows and reused or replicated across servers.
The following table from Hortonworks provides a very nice summary of the individual components and how they map to dataflow programming:
|NiFi Term||FBP Term||Description|
|FlowFile||Information Packet||A FlowFile represents each object moving through the system, and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes.|
|FlowFile Processor||Black Box||Processors actually perform the work. In Enterprise Integration Terms, a processor is doing some combination of data routing, transformation or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or rollback.|
|Connection||Bounded Buffer||Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues then can be prioritized dynamically and can have upper bounds on load, which enable back pressure.|
|Flow Controller||Scheduler||The Flow Controller maintains the knowledge of how processes actually connect and manage the threads and allocations that all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.|
|Process Group||Subnet||A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow for the creation of entirely new components through the composition of other components.|
Using NiFi with MarkLogic is similar to using NiFi with any other database—you just need to use the processors specifically built for getting data in and out of MarkLogic.
There are currently two processors built for MarkLogic: the PutMarkLogic processor for ingesting data into MarkLogic and the QueryMarkLogic processor for querying documents in MarkLogic. Both of these processors are built on top of MarkLogic’s Data Movement SDK.
The below list of capabilities provides a general idea of what each processor is capable of.
The steps below illustrate how fast and easy it is to get started using NiFi with MarkLogic.
Download the NiFi binaries from http://nifi.apache.org/download.html. Make sure you’re on the latest release of NiFi (1.7). Unpack (i.e., unzip) the tar or zip files in a directory of your choice (for example: /abc).
Clone the MarkLogic/nifi-nars repository to get the MarkLogic-specific processors located in the GitHub repository.
Place the MarkLogic-specific processor files in the correct directory. To do this, copy the two .nar files provided by MarkLogic in the zip folder into the lib folder (nifi-1.7.0/lib) of the unpacked NiFi distribution.
Go to the Apache NiFi Development Quickstart and follow the commands in the Decompress and Launch sections. Note that you do not need to follow the decompress instructions. Also, make sure that you are in the directory of your NiFi installation. If not, change your directory using a command (e.g., “cd /abc/nifi-1.7.0”). Now, you are ready to follow the launch instructions provided in the Apache NiFi Development Quickstart for your particular environment.
Now, you’re ready to run NiFi using your browser. You can point to a web browser at http://localhost:8080/nifi/ to run NiFi. Make sure you are running MarkLogic version 9.0+.
Like what you just read, here are a few more articles for you to check out or you can visit our blog overview page to see more.
Get info on recent and upcoming product updates from John Snelson, head of the MarkLogic product architecture team.
The MarkLogic Kafka Connector makes it easy to move data between the two systems, without the need for custom code.
MarkLogic 11 introduces support for GraphQL queries that run against views in your MarkLogic database. Customers interested in or already using GraphQL can now securely query MarkLogic via this increasingly popular query language.
Don’t waste time stitching together components. MarkLogic combines the power of a multi-model database, search, and semantic AI technology in a single platform with mastering, metadata management, government-grade security and more.Request a Demo