Big Data is big and complex, not only in the accumulation of information but also in its impact on business strategy. IDC projects that spending on business analytics will reach $89.6 billion by 2018. Using Big Data successfully has become critical for many organizations, and that includes formulating a platform strategy, be it a “Data Hub,” “Data Platform” or “Data Lake.”
Many who haven’t yet implemented a Big Data project are assessing their data strategy for 2016, while others are looking at current undertakings and examining new ways of leveraging analytics to improve business operations and increase revenue streams. The truth is that Big Data is hard to do. According to Gartner, through 2018, 70 percent of Hadoop deployments are predicted to fail to meet cost savings and revenue generation objectives due to skills and integration challenges. So how do you do Big Data ‘right’? Here are the most common Big Data pitfalls you should avoid:
Pitfall #1: Lacking Enterprise Platform or Data-Centric Architecture
Hadoop usually enters an organization as a prototype for a specific use case, then slowly becomes the center of gravity, attracting more data, and soon turns into a monster number-crunching engine led by a small group of “Data Scientists.” Enterprises must begin with an enterprise platform strategy and a data-centric architecture to break down the debilitating silos rampant in organizations of all sizes. Big Data requires the ability to process in parallel, with as little friction as possible, in a completely scalable and distributed environment. Unlike in traditional database systems or isolated application islands, data in a data-centric architecture or enterprise platform is not restrained, schema-bound, or locked.
Pitfall #2: Lacking the Vision for the “Data Lake”
The “Data Lake” is game-changing and transformational for an enterprise. It is a central destination for data and provides a much-needed unification of different types and kinds of data: structured, unstructured and semi-structured data, as well as internal, external and partner data. The Data Lake repository provides powerful benefits through the “economics of Big Data,” with costs to store and analyze data that can be 30x to 50x lower than in traditional setups. With automated rapid ingest mechanisms in place, the Data Lake can capture data “as-is,” in raw form, before any transformation or schema creation. The Data Lake plays a pivotal role in the journey towards connecting enterprise data together with seamless data access, iterative algorithm development and agile deployment.
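To make the idea of capturing raw data before any schema exists more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the in-memory list `raw_store` stands in for the Data Lake, and the names `ingest` and `read_with_schema` are hypothetical, not part of Hadoop or any other product. Data is stored exactly as it arrives, and a schema is applied only when someone reads it.

```python
import json

# Hypothetical stand-in for the Data Lake: raw lines are kept exactly as received.
raw_store = []


def ingest(raw_line):
    """Capture data as-is: no parsing, no validation, no schema."""
    raw_store.append(raw_line)


def read_with_schema(schema):
    """Apply a schema at read time.

    `schema` maps a field name to a function that casts the raw value.
    Lines that are not JSON objects are skipped by this reader, but they
    remain untouched in the raw store for other readers.
    """
    for line in raw_store:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(doc, dict):
            continue
        yield {name: cast(doc[name]) for name, cast in schema.items() if name in doc}


# Semi-structured, differently shaped, and unstructured data all land side by side.
ingest('{"user": "u1", "amount": "12.50", "note": "gift"}')
ingest('{"user": "u2", "clicks": 3}')
ingest("free-text log line")

purchases = list(read_with_schema({"user": str, "amount": float}))
print(purchases)
```

A different application could read the same raw lines with a different schema later, which is the point of deferring schema creation until the data is used.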
Pitfall #3: Not Planning for Data Growth or Levels of Maturity
When the Data Lake becomes the default data destination, governance and fine-grained security become pivotal from the very beginning. Metadata access and storage, along with data lineage and annotations, become built in. Raw data and various stages of transformed data can all live side by side without any conflict. Applications can use each other’s data via Hadoop. External data can be shielded or integrated based on explicit processing and analytics requirements, and variable data sets all live harmoniously on the Data Lake. The result is increased data availability, decreased time for application deployment, and unlimited scalability and growth.
Pitfall #4: Analyzing Small Samples of Data
Many hold the assumption that data doesn’t necessarily need to be united, and that one can work with small sample sets of data. This is a dangerous misconception: the results are often extrapolated to larger data sets without accounting for variance, which leads to misleading or, more likely, deeply skewed results. It’s often called the curse of small sample data set analysis. For example, when you work with a small sample data set, you might come across many outliers or anomalies. With the small sample alone, there is no way of knowing whether an anomaly is actually structural in the larger data set, or whether the outliers in fact form a pattern with a definite signature.
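The following Python sketch illustrates the point with synthetic data. Everything in it is an assumption made for illustration: the population is generated so that roughly 5 percent of values belong to a second, structural group centered near 500, while the rest sit near 100. The function names, seeds and threshold are hypothetical choices.

```python
import random


def make_population(n=100_000, rare_share=0.05, seed=42):
    """Synthetic data: most values cluster near 100, a structural minority near 500."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < rare_share:
            data.append(rng.gauss(500, 20))  # the second, structural group
        else:
            data.append(rng.gauss(100, 15))  # the dominant group
    return data


def count_high(values, threshold=300):
    """Count values that a naive analysis would flag as outliers."""
    return sum(1 for v in values if v > threshold)


population = make_population()
sample = random.Random(7).sample(population, 20)  # a small sample of 20 records

print("sample:     size", len(sample), "high values", count_high(sample))
print("population: size", len(population), "high values", count_high(population))
```

In a sample of 20, the high values show up as a handful of isolated points at most, which are easy to dismiss as anomalies. In the full data set, thousands of them form a tight cluster, which is a pattern with a definite signature rather than noise.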
Pitfall #5: Collecting Less Data and Relying on More Sophisticated Algorithms
Another misconception is that advanced and complex algorithms will solve all the problems. Life would be great if it were that simple. Because computers operate on logical processes, they will unquestioningly process unintended, even nonsensical, input data and produce undesired, often nonsensical, output. In information and computer science, feeding uncleansed data to complex algorithms is known as “garbage in, garbage out.” Missing or sparse data, null values, and human errors must all be cleansed. Avoid relying on unproven assumptions or weak correlations. Instead, collect as much data as possible and let the data speak for itself. This is very cost effective with the implementation of a data platform.
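As a rough illustration of cleansing before analysis, the Python sketch below filters out records with missing values, nulls, and likely human entry errors. The record layout (`sensor_id`, `temperature_c`), the plausible temperature range, and the function names are all hypothetical and chosen only for this example; real cleansing rules depend on the data at hand.

```python
import math


def is_clean(record, required=("sensor_id", "temperature_c")):
    """Return True only if the record has all required fields and a plausible reading."""
    for field in required:
        value = record.get(field)
        if value is None or value == "":
            return False  # missing or null value
    try:
        temp = float(record["temperature_c"])
    except (TypeError, ValueError):
        return False  # not a number, e.g. "n/a"
    if math.isnan(temp):
        return False
    # Assumed plausible range; values outside it are treated as entry errors.
    return -90.0 <= temp <= 60.0


def cleanse(records):
    """Keep only the records that pass every check."""
    return [r for r in records if is_clean(r)]


raw = [
    {"sensor_id": "a1", "temperature_c": 21.5},
    {"sensor_id": "a2", "temperature_c": None},    # null value
    {"sensor_id": "", "temperature_c": 19.0},      # missing identifier
    {"sensor_id": "a4", "temperature_c": "n/a"},   # placeholder text
    {"sensor_id": "a5", "temperature_c": 2150},    # likely a misplaced decimal point
    {"sensor_id": "a6", "temperature_c": "18.2"},  # numeric string, still usable
]

clean = cleanse(raw)
print([r["sensor_id"] for r in clean])
```

Without this step, an algorithm would treat a reading of 2150 degrees as just another data point, and no amount of sophistication downstream would correct for it.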
Building a Successful Big Data Strategy
Consider all of the above as motivational! Getting things right from the very start will greatly help your organization leverage Big Data more quickly and successfully.
Recently, I discussed some of the big challenges companies face when first analyzing their data and five key tips to building a successful Big Data strategy. You can stream the free webinar here!
Have experience with Big Data? What common pitfalls would you advise professionals to avoid as they build their Big Data Strategy? Share your knowledge with the community in the comments below.
Janet George is a Fellow and Chief Data Scientist at Western Digital. She is a technical leader with more than 15 years of experience in Big Data platforms, machine learning, distributed computing, compilers, and Artificial Intelligence. Previously, she served as managing director, chief scientist, and Big Data expert at Accenture Technology Labs. She has also served as head of Yahoo Labs Research Engineering, inventing next-generation platforms, cloud infrastructures, and machine learning for Big Data, and has worked at eBay and Apple. Janet holds a Bachelor’s and an Advanced Master’s degree with distinction in Computer Science and Mathematics, with a thesis focus on AI and ML.