Today, we are excited to announce the availability of the MarkLogic Connector for Apache Spark. Apache Spark has gained significant user adoption and is an important tool for complex data processing and analytics, especially when it involves machine learning and AI. By combining Spark with MarkLogic’s data persistence and governance capabilities, organizations can build a modern integration hub that is more consistent, powerful, and well-governed than Spark alone can provide.
To get started, users can download the MarkLogic Connector for Apache Spark here.
What is Apache Spark?
Apache Spark is an in-memory, distributed data processing engine for analytical applications, including machine learning, SQL, streaming, and graph processing. As a unified analytical tool, it is widely used by developers to build scalable data pipelines that span diverse data sources, including relational databases and NoSQL systems. Spark supports a variety of programming languages (such as Scala, Java, and Python), making it a tool of choice for data engineering and data science tasks.
Using Apache Spark with MarkLogic
While Apache Spark is widely used for analytical processing at scale, it does not include its own distributed data persistence layer. This is where MarkLogic Data Hub shines as a unified operational and analytical platform for integrating and managing heterogeneous data from multiple systems.
The combination of Apache Spark and MarkLogic enables organizations to modernize their data analytics infrastructure for faster time-to-insights while reducing cost and risk. Using the MarkLogic Connector for Apache Spark, developers can run Spark jobs for advanced analytics and machine learning directly on data in MarkLogic. This removes the ETL overhead that would otherwise be required when moving and wrangling data between separate operational and analytics systems. Instead, organizations can achieve a simpler architecture and speed up delivery of analytical applications that rely on durable data assets managed in a MarkLogic data hub.
Below are a few use cases for Spark with MarkLogic:
- Scalable Data Ingestion: The MarkLogic Connector for Apache Spark makes it easy to implement Spark jobs that load data as is while tracking provenance, lineage, and other metadata. With readily available connectors to diverse data sources, Spark facilitates both batch and streaming data ingestion. It also provides rich data transformation capabilities (such as joins, filters, and unions), so developers can cleanse and consolidate data from multiple source systems before loading it into MarkLogic. Once data is loaded, MarkLogic has the necessary capabilities to integrate, curate, and enrich source data into durable data assets for multiple use cases.
- Advanced Analytics: Spark provides a rich ecosystem of machine learning and predictive analytics libraries, such as MLlib. Using the MarkLogic Connector for Spark, developers can now run advanced analytics and machine learning directly on the data in MarkLogic. They can also leverage MarkLogic’s multi-model querying capabilities to securely share fit-for-purpose data with Spark libraries (such as streaming, SQL, and machine learning) for analytical processing. Another benefit is that MarkLogic’s distributed design can scale compute capacity so that Spark jobs can process vast amounts of data, which matters because processing capacity needs can fluctuate heavily with machine learning workloads.
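The cleanse-and-consolidate step described under Scalable Data Ingestion might look roughly like the PySpark sketch below. It is illustrative only: the three input DataFrames and the column names (`customer_id`, `email`) are hypothetical, and the final write into MarkLogic is left out because the format name and write options should come from the connector documentation.

```python
def consolidate_customers(crm_customers, web_customers, orders):
    """Union two customer sources, cleanse them, and join in order data.

    All three arguments are expected to be Spark DataFrames. The column
    names used here are hypothetical and chosen only for illustration.
    """
    # Union: stack the two customer sources (columns are matched by position).
    combined = crm_customers.union(web_customers)

    # Filter: drop rows without an email, then keep one row per customer.
    cleansed = combined.filter("email IS NOT NULL").dropDuplicates(["customer_id"])

    # Join: attach each customer's orders, keeping customers that have none.
    return cleansed.join(orders, on="customer_id", how="left")
```

The function only uses standard DataFrame operations (`union`, `filter`, `dropDuplicates`, `join`), so the same shaping logic applies whether the inputs come from files, JDBC sources, or other Spark connectors.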
The MarkLogic Connector for Spark is compatible with Spark’s DataSource API, providing a seamless developer experience. The connector returns the data in MarkLogic as a Spark DataFrame that can then be processed using Spark SQL and other Spark APIs. Developers can leverage existing skills as they use Spark native libraries (such as SQL and machine learning) in a variety of programming languages (such as Java, Scala, and Python) to build sophisticated analytics on top of MarkLogic.
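As a rough sketch of what reading through the DataSource API and querying with Spark SQL involves, the helper below uses only generic PySpark calls. The `source_format` string and the `options` mapping are placeholders, not the connector's actual values; the real format name and option keys should be taken from the connector documentation.

```python
def query_source_with_sql(spark, source_format, options, view_name, sql):
    """Load a DataFrame through Spark's DataSource API and query it with Spark SQL.

    `spark` is an existing SparkSession. `source_format` and `options` are
    placeholders: take the real values from the connector documentation.
    """
    # DataSource API: select the source by format name, pass its options, load.
    frame = spark.read.format(source_format).options(**options).load()

    # Register the DataFrame under a name so Spark SQL can reference it.
    frame.createOrReplaceTempView(view_name)

    # Run ordinary Spark SQL; the result is another DataFrame.
    return spark.sql(sql)
```

Because the result is a regular DataFrame, it can be passed on to other Spark libraries without any connector-specific handling.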
Together, MarkLogic and Spark provide significant benefits for building intelligent analytical applications. The MarkLogic Connector for Spark helps organizations get the most out of MarkLogic as the trusted source of durable data assets and Spark as the high-performance analytical framework.
To get started, follow along with the hands-on, step-by-step tutorial. To learn more about how you can configure the MarkLogic Connector for Apache Spark, please check out the documentation here. Apache Spark documentation is available here.