An integrated and centralized data store model that enables stakeholders from throughout an organization to harvest and analyze data on the same platform continues to be the goal of many Global 2000 organizations as they strive to address the requirements of Big Data. However, with today’s decentralized, cloud-based storage systems and Global 2000 data storage requirements reaching the petabyte and even exabyte levels, massive centralized single data store infrastructures with single points of failure, such as Apache Hadoop, may not be the most effective long-term solution.
Driven by social media, mobile computing devices and cloud computing, the volume of Electronically Stored Information (ESI) is increasing at an accelerating rate. As a result, organizations not only face unprecedented challenges in managing the costs and the legal and compliance requirements associated with this increase in data, they are also under pressure to analyze all of this new data to uncover competitive advantage in product marketing, sales, Enterprise Resource Management (ERM) and even Human Resources (HR) management.
The rewards for successfully managing and leveraging what is now known as Big Data are well documented. At IBM’s Smarter Analytics event in March 2012, clients and partners presented success stories about how organizations are driving business value out of big data, analytics, and IBM Watson technology. The stories included:
- City of Dublin, Ireland, using thousands of data points from local transportation and traffic signals to optimize public transit and deliver information to riders.
- Seton Healthcare mining vast amounts of unstructured data captured in notes and dictation to get a more complete view of patients, including factors such as their support system, their access to transportation to and from appointments, and whether they have a primary care physician. Seton currently uses this information to construct programs that target treatments to the right patients, with the goal of minimizing hospitalizations in the way that best balances costs and benefits.
- WellPoint using IBM’s Watson technology to improve real-time decision-making by mining through millions of pages of medical information while doctors and nurses are face-to-face with patients.
Unfortunately, as organizations realize they have to do “something” to survive Big Data and therefore begin the journey through the Big Data Maturity Model, many inadvertently field redundant efforts at the department, division and other sub-corporate levels.
As an example, it wouldn’t be unusual for an organization attempting to address Big Data issues to have the following groups performing the same basic data harvesting, remote storage and analysis functions with different yet very similar technologies:
- A corporate Business Intelligence (BI) group within the Information Technology (IT) division to harvest and analyze data for internal business consumers
- A corporate Governance, Risk and Compliance (GRC) group within the office of the Chief Financial Officer (CFO) to harvest and analyze data to reduce risk and ensure compliance
- An eDiscovery group within the legal department to harvest data for computer forensics and Early Case Assessment (ECA)
What organizations should be striving for as they evolve up the Big Data Maturity Model is an integrated and centralized data store model that enables stakeholders from BI, GRC, eDiscovery and other groups as required to harvest and analyze data on the same platform.
Apache Hadoop, an open-source Big Data platform for developing and deploying distributed, data-intensive applications, has become an attractive solution among the Global 2000 for accomplishing this goal. However, with today’s decentralized, cloud-based storage systems and Global 2000 data storage requirements reaching the petabyte and even exabyte levels, massive centralized single data store infrastructures with single points of failure, such as Hadoop, may not be the most effective long-term solution.
The alternative solution is to match today’s decentralized, cloud-based virtual environments with virtual federated data stores. Technology vendors such as VMware are already investigating solutions.
A recent article posted on CRN.com states, “As server and storage virtualization become standard in the data center, Apache Hadoop and software defined networking are looming as VMware’s next big challenges.” Earlier this year, VMware launched an open-source project called Serengeti that includes a free deployment toolkit for deploying a Hadoop cluster on vSphere.
Other Big Data software startups are also addressing the requirement to provide a virtual data store with what I would characterize as Big Data middleware. One such vendor is Tarmin with its GridBank virtual software solution. GridBank enables users to create a virtual, object-based federated data store and provides a single management console to monitor and analyze rapidly growing, geographically dispersed unstructured data repositories.
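To make the idea of a virtual federated data store more concrete, here is a minimal Python sketch. It is only an illustration of the general pattern, not how GridBank, Serengeti or any other product works: the `Repository` and `FederatedStore` classes, the site names and the sample documents are all hypothetical, and in-memory dictionaries stand in for real, geographically dispersed repositories.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical stand-in for one physical store (a site, a cloud bucket, a NAS)."""
    name: str
    objects: dict[str, str] = field(default_factory=dict)

    def search(self, term: str) -> list[str]:
        # Return the keys of objects whose text contains the term (case-insensitive).
        return [k for k, text in self.objects.items() if term.lower() in text.lower()]


class FederatedStore:
    """One logical view over several repositories; the data stays where it lives."""

    def __init__(self, repositories: list[Repository]):
        self.repositories = {r.name: r for r in repositories}

    def search(self, term: str) -> list[tuple[str, str]]:
        # Fan the query out to every repository and merge the hits.
        hits = []
        for repo in self.repositories.values():
            hits.extend((repo.name, key) for key in repo.search(term))
        return sorted(hits)

    def get(self, repo_name: str, key: str) -> str:
        # Fetch a single object from the repository that holds it.
        return self.repositories[repo_name].objects[key]


# Illustrative (made-up) sites and documents.
dublin = Repository("dublin", {"contract-17": "Supplier contract, renewal due"})
austin = Repository("austin", {"memo-3": "Legal hold on supplier emails",
                               "report-9": "Quarterly sales figures"})
store = FederatedStore([dublin, austin])

# BI, GRC and eDiscovery users would all query through the same facade.
supplier_hits = store.search("supplier")
print(supplier_hits)
```

The point of the sketch is the design choice: stakeholders share one query interface, while each repository remains an independent store, so no single massive central copy of the data has to exist.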
An integrated and centralized data store model that enables stakeholders from BI, GRC and eDiscovery to harvest and analyze data on the same platform continues to be the goal of many Global 2000 organizations as they strive to address the requirements of Big Data. However, with the rapid increase in the amount of data and today’s decentralized, cloud-based storage systems, Virtual Federated Data Stores may be an attractive alternative to massive central data stores.