MarkLogic Connector for Apache Spark Now Available

Today, we are excited to announce the availability of the MarkLogic Connector for Apache Spark. Apache Spark has gained significant user adoption and is an important tool for complex data processing and analytics, especially when it involves machine learning and AI. By combining Spark with MarkLogic’s data persistence and governance capabilities, organizations can build a modern integration hub that is more consistent, powerful, and well-governed than Spark alone can provide.

To get started, users can download the MarkLogic Connector for Apache Spark here.

What is Apache Spark?

Apache Spark is an in-memory, distributed data processing engine for analytical applications, including machine learning, SQL, streaming, and graph processing. As a unified analytical tool, it is widely used by developers to build scalable data pipelines that span diverse data sources, including relational databases and NoSQL systems. Spark supports a variety of programming languages (such as Scala, Java, and Python), making it a tool of choice for data engineering and data science tasks.

Using Apache Spark with MarkLogic

While Apache Spark is widely used for analytical processing at scale, it does not include its own distributed data persistence layer. This is where MarkLogic Data Hub shines as a unified operational and analytical platform for integrating and managing heterogeneous data from multiple systems.

The combination of Apache Spark and MarkLogic enables organizations to modernize their data analytics infrastructure for faster time-to-insights while reducing cost and risk. Using the MarkLogic Connector for Apache Spark, developers can run Spark jobs for advanced analytics and machine learning directly on data in MarkLogic. This removes the ETL overhead that would otherwise be required when moving and wrangling data between separate operational and analytics systems. Instead, organizations can achieve a simpler architecture and speed up delivery of analytical applications that rely on durable data assets managed in a MarkLogic data hub.

Below are a few use cases for Spark with MarkLogic:

  • Scalable Data Ingestion: The MarkLogic Connector for Apache Spark makes it easy to implement Spark jobs that load data as-is while tracking provenance, lineage, and other metadata. With readily available connectors to diverse data sources, Spark facilitates both batch and streaming ingestion. It also provides rich data transformation capabilities (such as joins, filters, and unions) so developers can cleanse and consolidate data from multiple source systems before loading it into MarkLogic. Once data is loaded, MarkLogic has the capabilities needed to integrate, curate, and enrich source data into durable data assets for multiple use cases.
  • Advanced Analytics: Spark provides a rich ecosystem of machine learning and predictive analytics libraries, such as MLlib. Using the MarkLogic Connector for Spark, developers can now run advanced analytics and machine learning directly on the data in MarkLogic. They can also leverage MarkLogic’s multi-model querying capabilities and securely share fit-for-purpose data with Spark libraries (like streaming, SQL, and machine learning) for analytical processing. Another benefit is that MarkLogic’s distributed design can easily scale compute capacity so Spark jobs can process vast amounts of data, which matters because machine learning workloads can fluctuate heavily in their processing needs.

The MarkLogic Connector for Spark is compatible with Spark’s DataSource API, providing a seamless developer experience. The connector returns the data in MarkLogic as a Spark DataFrame that can quickly be processed using Spark SQL and other Spark APIs. Developers can leverage existing skills as they use Spark native libraries (like SQL, machine learning, and others) in a variety of programming languages (like Java, Scala, and Python) to build sophisticated analytics on top of MarkLogic.

Get Started

The combination of MarkLogic and Spark provides significant benefits for building intelligent analytical applications. The MarkLogic Connector for Spark ensures that organizations get the most out of MarkLogic as the trusted source of durable data assets and Spark as the high-performance analytical framework.

To get started, follow along with the hands-on, step-by-step tutorial. To learn more about how you can configure the MarkLogic Connector for Apache Spark, please check out the documentation here. Apache Spark documentation is available here.
