The Hadoop Distributed File System (HDFS) is the next generation of shared storage and the foundation of the Hadoop ecosystem, providing reliable file storage at massive scale on commodity hardware. Most applications, however, need fast, secure access to specific pieces of data, not to large files. In the latest version of MarkLogic Server, available now from MarkLogic Developer Labs, you can run the MarkLogic enterprise NoSQL database on top of HDFS. This brings ACID transactions, role-based security, full-text search, and the flexibility of a granular document data model to real-time applications, all within your existing Hadoop infrastructure.
Unify Your Operational and Analytic Workloads
Today you can configure MarkLogic to use direct-attached or shared SAN storage, and now you can also include Hadoop's distributed file system. This provides a real-time database for Hadoop that leverages HDFS for scalability, performance, and availability, enabling a fluid mix of data between operational and analytic workloads. Segregating data across different storage and computation tiers lets users optimize cost, performance, availability, and flexibility. Users can store data, indexes, and journals across a mixture of local (RAM, HDD, and SSD), SAN, and HDFS-based storage. This provides both secure, low-latency access to operational data and economical storage and processing of the remaining long tail.
Like HBase, But With Enterprise Capabilities Included
Just about anywhere you can reference the file system in MarkLogic, you can now reference HDFS. From the DBA's perspective, HDFS is just another file system: you can mount data and index partitions directly on HDFS. Architecturally, this is similar to the way that HBase works with Hadoop's distributed file system. MarkLogic has the added benefit of production-tested enterprise features, such as ACID transactions, replication, and full-text search, built directly into the database kernel.
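As a rough sketch of what "just another file system" means in practice, the XQuery Admin API call below creates a forest whose data directory points at an HDFS URI instead of a local path. The host name, HDFS port, and directory path here are hypothetical, and the exact URI form accepted by the technology preview may differ; treat this as an illustration, not a reference:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: Create a forest whose data directory lives on HDFS.
   "hdfs://namenode.example.com:8020/marklogic/forests" is a
   placeholder URI; substitute your own NameNode and path. :)
let $config := admin:forest-create(
  $config,
  "hdfs-forest",       (: forest name :)
  xdmp:host(),         (: host that will own the forest :)
  "hdfs://namenode.example.com:8020/marklogic/forests"
)
return admin:save-configuration($config)
```

Once created, the forest is attached to a database and managed exactly like a forest on local disk, which is the point: the storage tier changes, the administration model does not.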
Benefits of Running MarkLogic on Hadoop Distributed File System (HDFS)
- Real-time Hadoop applications: HDFS is great for high-throughput applications, such as MapReduce. With MarkLogic running on HDFS, Hadoop is extended to support low-latency applications as well, such as real-time search and analytics. Using HDFS as storage for a NoSQL database is not new, of course. However, MarkLogic is the first to support transactional updates, full-text search, a document data model, and enterprise security together in one integrated system.
- Database management system, not custom middleware: If you're building your own secondary indexes, replication, or transactions, you're doing work you shouldn't be focusing on. Having these low-level capabilities built into the database reduces custom code and one-off integrations, allowing your development and operations teams to focus on the application features and SLAs that differentiate your business, not the plumbing.
- Unstructured throughout: MarkLogic and Hadoop are both built for unstructured data. By combining the two you eliminate the need to map data to rows and columns for every type of analysis or service endpoint that your business users will dream up. This makes it much easier to integrate new and unanticipated data sources.
- Easily age your data and store it appropriately: By combining Hadoop and MarkLogic, users are able to index the most critical and frequently accessed data on high-performance infrastructure and allocate the rest to a low-cost archive for intermittent access and historical analysis.
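The data-aging pattern above can be sketched with the same Admin API: keep a hot forest on fast local storage and an archive forest on HDFS, both attached to one database. All names, paths, and the HDFS URI below are hypothetical placeholders for your own cluster, and this assumes the "Documents" database already exists:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: Hot tier: recent, frequently accessed data on local SSD. :)
let $config := admin:forest-create($config, "hot-current", xdmp:host(),
  "/mnt/ssd/marklogic/forests")
(: Cold tier: historical data on low-cost HDFS storage. :)
let $config := admin:forest-create($config, "archive-historical", xdmp:host(),
  "hdfs://namenode.example.com:8020/marklogic/archive")
(: Attach both forests to the same database, so queries span
   both tiers transparently. :)
let $db := xdmp:database("Documents")
let $config := admin:database-attach-forest($config, $db,
  admin:forest-get-id($config, "hot-current"))
let $config := admin:database-attach-forest($config, $db,
  admin:forest-get-id($config, "archive-historical"))
return admin:save-configuration($config)
```

Because both tiers belong to one database, applications query one logical data set while the operations team decides, forest by forest, which data earns the fast (and expensive) storage.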
A MarkLogic® Server technology preview, featuring Hadoop Distributed File System (HDFS) storage, is available from MarkLogic Developer Labs today.