As Big Data moves from descriptive analytics (batch) to predictive (interactive) and prescriptive analytics (real-time), businesses are increasingly combining streaming data sources with historical batch data to build machine learning and predictive models. In simple terms, descriptive analytics describes the current state of affairs, predictive analytics models what-if scenarios, and prescriptive analytics influences the outcome through data-driven action. New analytics applications can capture a business transaction while it is happening and influence its outcome in real time, resulting in direct business benefits. Use case examples include:
Industrial Internet (IoT/IoE)
Patient Data Intelligence in Healthcare
And we at SanDisk® use it for real-time analytics for our semiconductor manufacturing data.
From Lambda Architecture to SMACK
The ability to combine real-time and batch analytics has made the Lambda Architecture popular. Lambda Architecture uses building blocks such as HDFS, Scalding, and HBase to combine real-time analytics with batch data pipelines. However, the architecture carries the overhead of duplicating code and data across separate pipelines, making it difficult to implement at scale.
To overcome the limitations of the Lambda Architecture, a Big Data pipeline was needed that could handle both batch and real-time streaming effectively. The new SMACK stack – Scala and its ecosystem of Spark, Mesos, Akka, Cassandra, and Kafka – aims to achieve this, and has emerged as an effective large-scale platform for both batch and stream processing.
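To make the duplication overhead concrete, here is a minimal, hedged sketch of the Lambda pattern described above. All names are illustrative stand-ins, not part of any real framework; the point is that the same aggregation logic must be written twice – once in the batch layer and once in the speed layer – and merged at query time.

```python
# Minimal illustration of the Lambda pattern: duplicated aggregation
# logic in the batch and speed layers, merged in a serving layer.
from collections import Counter

def batch_layer(historical_events):
    """Recompute counts over the full historical dataset (slow, complete)."""
    return Counter(e["key"] for e in historical_events)

def speed_layer(recent_events):
    """Count events that arrived since the last batch run (fast, partial)."""
    return Counter(e["key"] for e in recent_events)  # same logic, written twice

def query(batch_view, realtime_view, key):
    """Serving layer: merge both views to answer a query."""
    return batch_view[key] + realtime_view[key]

historical = [{"key": "sensor_a"}, {"key": "sensor_a"}, {"key": "sensor_b"}]
recent = [{"key": "sensor_a"}]

batch_view = batch_layer(historical)
realtime_view = speed_layer(recent)
print(query(batch_view, realtime_view, "sensor_a"))  # -> 3
```

Every change to the aggregation logic must be applied to both layers and kept consistent – exactly the maintenance burden that SMACK-style unified pipelines set out to remove.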
Streaming architecture with SMACK (Spark, Mesos, Akka, Cassandra and Kafka) Stack
For those of you who are not familiar with these tools and systems, here is a quick overview:
Spark – a fast and general engine for distributed, large-scale data processing
Mesos – a cluster resource management system that provides efficient resource isolation and sharing across distributed applications
Akka – a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM
Cassandra – a distributed, highly available database designed to handle large amounts of data across multiple datacenters
Kafka – a high-throughput, low-latency distributed messaging system designed for handling real-time data feeds
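The way these components fit together can be sketched, in very rough form, with nothing but the standard library: events are appended to a log (Kafka's role), drained and aggregated in micro-batches (Spark Streaming's role), and the results written to a key-value store (Cassandra's role). This is a hedged illustration of the data-flow pattern only, not the real APIs of any of these systems.

```python
# Stdlib-only sketch of the ingest -> process -> store flow a SMACK
# pipeline implements. Names and data are illustrative.
from collections import defaultdict, deque

log = deque()               # stands in for a Kafka topic partition
store = defaultdict(float)  # stands in for a Cassandra table

def produce(event):
    """Append-only ingest, like writing to a Kafka partition."""
    log.append(event)

def process_micro_batch(batch_size=2):
    """Drain up to batch_size events and aggregate, micro-batch style."""
    batch = [log.popleft() for _ in range(min(batch_size, len(log)))]
    for device, reading in batch:
        store[device] += reading  # persist the running aggregate

produce(("sensor_a", 1.5))
produce(("sensor_a", 2.5))
produce(("sensor_b", 4.0))
process_micro_batch()
process_micro_batch()
print(store["sensor_a"], store["sensor_b"])  # -> 4.0 4.0
```

In the real stack the same logical flow holds, but each stage is distributed, fault-tolerant, and scheduled across a cluster (which is where Mesos and Akka come in).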
Big Data Flash for Converged Data Platform
To build effective predictive models, converged stack systems need fast access to historical data along with real-time data streams. Flash-based data grids deliver significant benefits for these new data-driven architectures.
In March 2016, SanDisk established the “Big Data Flash” market category with the launch of the InfiniFlash System, delivering massive capacity with extreme performance and breakthrough economics (due to fab economics and a new flash form factor).
I’d like to talk about how the InfiniFlash system’s architecture and capabilities serve as an essential building block for Converged Data Platform architecture:
Capture millions of events per second without losing events
Faster batch ingest
Store data without the need for ETL (Extract, Transform, Load) using Avro or Protobuf formats
Eliminate load through support for distributed messaging systems such as Kafka
Process both real-time events and batch data effectively
Feed in-memory processing to deliver insights with second and sub-second latency
Software-defined data fabrics for data intensive workloads, which provide agility and scale
Store multi-terabyte datasets for long periods
Support high throughput batch and low latency real-time queries
Handle disparate data sources and ‘bursty’ workloads
Store data in a schema-free way
Support HDFS and NoSQL databases (e.g. Cassandra, CouchDB, MemSQL, HBase etc.)
Scale to petabytes with Rackscale architecture
Very low Annual Failure Rate (AFR)
Provide enterprise readiness, lineage (audit logging), compliance (legal hold, etc.), and versioning (maintaining different point-in-time copies) by using disaggregated/shared storage
Designed for failure, backups and patching from HDFS/S3
Three Reasons to use InfiniFlash ‘Big Data Flash’ for Data-Intensive Converged Data Platforms
Whether you are an enterprise organization or a service provider, here are three reasons you should consider InfiniFlash for your Converged Data Platform:
1. Meeting All Requirements for Capture, Process, Store, and Query Data Pipelines
Traditional direct-attached storage deployments and pure HDD-based deployments cannot provide the performance and throughput at scale needed by Converged Data Platforms. Nor can they match the CapEx and OpEx benefits at scale, or the agility and enterprise readiness these platforms demand.
The InfiniFlash system delivers 50x the performance, 5x the density, and 4x the reliability of traditional hard disk drives, and was designed to easily scale up or scale out for demanding Big Data applications. The software-defined, flash-based data fabric offers a choice of multiple filesystems, including HDFS, Spectrum Scale, Lustre, and Ceph, to address a spectrum of needs. You can see how we leverage InfiniFlash Big Data Flash for our own flash-based data fabric for memory technologies in this webinar by SanDisk Chief Data Scientist, Janet George.
2. Worldwide Support
InfiniFlash is supported worldwide by SanDisk and its partners. InfiniFlash is part of the TSA Net Support Community, ensuring tight SLAs, while our FlashStart™ capabilities ensure smooth installation and a superior customer experience.
3. Best in Class Ecosystem
We partnered with leading software developers and hardware partners to deliver greater choice and flexibility through a best-in-class ecosystem. Our software partners include HDFS, leading Hadoop distributions, Red Hat Ceph, Nexenta, IBM Spectrum Scale, Lustre, and CloudByte, alongside hardware vendors such as Cisco, Lenovo, Dell, Supermicro, Quanta, and more. We work closely with the open source community and have become a contributor and thought leader through our various undertakings. (You can learn more about our contributions of enterprise-class features to open source SCST here.)
Converged Data Platforms have been established to meet the requirements of converging operational and analytical pipelines, and there are corresponding storage-level requirements for building these data platforms across the capture, process, store, and query phases. A Big Data Flash-based data fabric is an ideal storage-layer building block for converged platforms, delivering advantages at each phase of the data pipeline.
You can learn more about InfiniFlash on SanDisk.com and feel free to reach out with any questions in the comments below.
Shailesh has over 20 years of industry experience, with significant expertise in the Internet of Things/Everything (IoT/IoE), Healthcare and Genome Sequencing, Financial Services Industry (FSI), and Enterprise Application verticals. At SanDisk, he is tasked with strategy, product planning, and execution for the HPC, Big Data, and 3rd Platform application portfolio. Shailesh has worked for companies including EMC, NetApp, Force10 Networks, Adaptec, Brocade, HP, and Digital, in product and solutions marketing, business development, and presales roles.
Shailesh earned his bachelor’s degree in Electrical Engineering from the University of Mumbai, followed by an M.S. and an MBA in Finance and General Management from San Jose State University, including coursework at UC Berkeley.