As Big Data moves from descriptive analytics (batch) to predictive (interactive) and prescriptive (real-time) analytics, businesses are increasingly combining streaming data sources with historical batch data for machine learning and predictive modeling. In simple terms, descriptive analytics explains the current state of affairs, predictive analytics models what-if scenarios, and prescriptive analytics influences the outcome through data-driven action. New analytics applications can capture a business transaction in real time, while it is happening, and influence its outcome, resulting in direct business benefits. Use case examples include:
- Anti-Money Laundering
- Fraud Analytics
- Targeted Marketing
- Industrial Internet (IoT/IoE)
- Real-Time Manufacturing
- Patient Data Intelligence in Healthcare
- And we at SanDisk® use it for real-time analytics for our semiconductor manufacturing data.
From Lambda Architecture to SMACK
The ability to combine real-time and batch analytics has made Lambda Architecture popular. Lambda Architecture implementations commonly use HDFS, Scalding and HBase as building blocks for combining real-time analytics with batch data pipelines. However, the architecture carries the overhead of duplicating code and data across the separate pipelines, which makes it difficult to implement at scale.
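To make the duplication overhead concrete, here is a minimal sketch in plain Python (not HDFS, Scalding or HBase). The event data and the names `batch_totals`, `SpeedLayer` and `query` are hypothetical, chosen only for illustration. Note how the same aggregation has to be written twice, once for the batch path and once for the real-time path, and then merged at query time.

```python
from collections import Counter

# Hypothetical events: (user_id, amount). Illustrative data only.
historical = [("alice", 10), ("bob", 5), ("alice", 7)]
recent = [("bob", 3), ("carol", 8)]


def batch_totals(events):
    """Batch layer: recompute totals from the full historical data set."""
    totals = Counter()
    for user, amount in events:
        totals[user] += amount
    return totals


class SpeedLayer:
    """Speed layer: the same aggregation, re-implemented incrementally."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, user, amount):
        self.totals[user] += amount


def query(user, batch_view, realtime_view):
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view[user] + realtime_view[user]


batch_view = batch_totals(historical)
speed = SpeedLayer()
for user, amount in recent:
    speed.on_event(user, amount)

print(query("bob", batch_view, speed.totals))  # 8
```

Any change to the aggregation rule has to be made in both `batch_totals` and `SpeedLayer.on_event`, which is the maintenance burden described above.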
To overcome the limitations of Lambda Architecture, a Big Data pipeline was needed that can handle both batch and real-time streaming effectively. The new SMACK stack – Scala and its ecosystem of Spark, Mesos, Akka, Cassandra and Kafka – aims to achieve this. SMACK has emerged as an effective large-scale platform for both batch and stream processing.
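The idea of one pipeline serving both modes can be sketched in a few lines of plain Python. This is not Spark code; it is only loosely analogous to processing a stream as a series of small batches. The names `update_totals` and `micro_batches` and the sample events are assumptions made for the example.

```python
from collections import Counter
from itertools import islice


def update_totals(totals, events):
    """Single aggregation step shared by the batch and streaming paths."""
    for user, amount in events:
        totals[user] += amount
    return totals


def micro_batches(stream, size):
    """Group an event iterator into small lists (micro-batches)."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical events: (user_id, amount). Illustrative data only.
historical = [("alice", 10), ("bob", 5), ("alice", 7)]
live_stream = [("bob", 3), ("carol", 8), ("alice", 1)]

totals = update_totals(Counter(), historical)      # batch pass
for chunk in micro_batches(live_stream, size=2):   # streaming pass
    update_totals(totals, chunk)

print(dict(totals))  # {'alice': 18, 'bob': 8, 'carol': 8}
```

Because both passes call the same `update_totals` function, the business logic lives in one place rather than in two parallel pipelines.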
Streaming architecture with SMACK (Spark, Mesos, Akka, Cassandra and Kafka) Stack
For those of you who are not familiar with these tools and systems, here is a quick overview:
- Spark – a fast and general engine for distributed, large-scale data processing
- Mesos – a cluster resource management system that provides efficient resource isolation and sharing across distributed applications
- Akka – a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM
- Cassandra – a distributed, highly available database designed to handle large amounts of data across multiple datacenters
- Kafka – a high-throughput, low-latency distributed messaging system designed for handling real-time data feeds
Big Data Flash for Converged Data Platform
To build effective predictive models, converged stack systems need fast access to historical data along with real-time data streams. Flash-based data grids deliver significant benefits for these new data-driven architectures.
In March 2016, SanDisk established the “Big Data Flash” market category with the launch of the InfiniFlash System, delivering massive capacity with extreme performance and breakthrough economics (due to fab economics and a new flash form factor).
I’d like to talk about how the InfiniFlash system’s architecture and capabilities serve as an essential building block for Converged Data Platform architecture:
- Capture millions of events per second without losing events
- Faster batch ingest
- Scale easily
- Store data without the need for ETL (Extract, Transform, Load) using Avro or Protobuf formats
- Eliminate load through support for distributed messaging systems such as Kafka
- Process both real-time events and batch data effectively
- Feed in-memory processing to deliver insights with second and sub-second latency
- Software-defined data fabrics for data intensive workloads, which provide agility and scale
- Store multi-terabyte data sets for long periods
- Support high throughput batch and low latency real-time queries
- Handle disparate data sources, ‘bursty’ workloads
- Store data in a schema-free way
- Support HDFS and NoSQL databases (e.g. Cassandra, CouchDB, MemSQL, HBase)
- Scale to petabytes with Rackscale architecture
- Very low Annual Failure Rate (AFR)
- Provide enterprise readiness, lineage (audit logging), compliance (legal hold, etc.) and versioning (maintaining different point-in-time copies) by using disaggregation / shared storage
- Designed for failure, backups and patching from HDFS/S3
- All with cost-effective economics of less than $1/GB
- Support real-time queries with msec latencies
- Support batch/aggregate queries
- Support queries for HDFS and NoSQL
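The capture, process, store and query requirements listed above can be illustrated with a small in-memory sketch. It is a toy model under stated assumptions, not InfiniFlash, Kafka or Cassandra code: a `deque` stands in for a message log, a dictionary stands in for a NoSQL table, and the device names and fields are made up for the example.

```python
import json
from collections import defaultdict, deque

queue = deque()             # capture: stand-in for a message log such as Kafka
store = defaultdict(list)   # store: stand-in for a NoSQL table keyed by device


def capture(event):
    """Append the raw event as JSON; no upfront schema or ETL step."""
    queue.append(json.dumps(event))


def process():
    """Drain the queue and persist each event under its device key."""
    while queue:
        event = json.loads(queue.popleft())
        store[event["device"]].append(event)


def latest(device):
    """Real-time style query: most recent event for one key."""
    return store[device][-1]


def average(field):
    """Batch style query: aggregate one field across all stored events."""
    values = [e[field] for events in store.values() for e in events if field in e]
    return sum(values) / len(values)


# Hypothetical sensor readings; the second one carries an extra field.
capture({"device": "tool-1", "temp": 70.0})
capture({"device": "tool-2", "temp": 74.0, "pressure": 1.2})
capture({"device": "tool-1", "temp": 72.0})
process()

print(latest("tool-1")["temp"])  # 72.0
print(average("temp"))           # 72.0
```

The sketch shows why the storage layer has to serve two access patterns at once: `latest` is a low-latency lookup by key, while `average` scans everything that has been stored.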
Three Reasons to use InfiniFlash ‘Big Data Flash’ for Data-Intensive Converged Data Platforms
Whether you are an enterprise organization or a service provider, here are three reasons you should consider InfiniFlash for your Converged Data Platform:
1. Meeting all requirements for Capture, Process, Store and Query data pipelines
Traditional direct-attached storage deployments and pure HDD-based deployments cannot provide the performance and throughput at scale that Converged Data Platforms need. They also cannot match the CapEx and OpEx benefits at scale, nor the agility and enterprise readiness these platforms require.
The InfiniFlash system delivers 50x the performance, 5x the density, and 4x the reliability of traditional hard disk drives and was designed to easily scale up or scale out for demanding Big Data applications. The software-defined, flash-based data fabric can be used with multiple filesystems, including HDFS, Spectrum Scale, Lustre or Ceph, to cover a spectrum of needs. You can see how we leverage InfiniFlash Big Data Flash for our own flash-based data fabric for memory technologies in this webinar by SanDisk Chief Data Scientist, Janet George.
2. Worldwide Support
InfiniFlash is supported across the world by SanDisk and its partners. InfiniFlash is part of the TSA Net Support Community, ensuring tight SLAs, while our FlashStart™ capabilities ensure smooth installation and a superior customer experience.
3. Best in Class Ecosystem
We have partnered with leading software developers and hardware vendors to deliver greater choice and flexibility through a best-in-class ecosystem. Our ecosystem includes HDFS, leading Hadoop distros, Red Hat Ceph, Nexenta, IBM Spectrum Scale, Lustre and Cloudbyte, as well as vendors such as Cisco, Lenovo, Dell, Supermicro, Quanta and more. We're working closely with the open source community and have become a contributor and thought leader through our various undertakings. (You can learn more about our contributions of enterprise-class features to Open Source SCST here.)
Converged Data Platforms were established to meet the requirements of converging operational and analytical pipelines, and building these data platforms brings storage-level requirements in the capture, process, store and query phases. A Big Data Flash-based data fabric is an ideal storage-layer building block for converged platforms, delivering advantages at each phase of the data pipeline.
You can learn more about InfiniFlash on SanDisk.com and feel free to reach out with any questions in the comments below.
Shailesh has 20+ years of experience in IoT/IoE, Healthcare and Genome Sequencing, FSI and Enterprise Application verticals.