Information managers can expect data storage companies to drive significant campaigns around Big Data as we enter 2012. Yet storage is the least of anyone’s concerns, according to The Economist Intelligence Unit (EIU) report “Big Data: Harnessing a game-changing asset.” Information Governance in 2012 requires that a Data Science strategy and Data Science practitioners be added to all business teams.
The EIU report is based on a survey of 586 senior executives conducted in June 2011. The respondents are a cross-section of international business executives:
- 31% from North America
- 28% from Asia-Pacific
- 26% from Western Europe
When these executives were asked to “Indicate how problematic each of the following is in the management of data in your organization,” fewer than 20% indicated that storage was problematic. However, when those same executives were asked to rate reconciliation of data, more than 50% said it was problematic. (EIU Report Page 26)
Data reconciliation is a core practice of Data Science teams. In a recent interview, Jeff Hammerbacher, Chief Scientist at Cloudera, clarified the primary requirement for Data Science teams and team members. Mr. Hammerbacher says, “Data Science teams must be passionate about comparing, cleansing, and munging (French/Italian, meaning to eat or chew) data to expose actionable data elements as visualization.”
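As a rough illustration of what comparing and cleansing can mean in practice, the sketch below reconciles customer records from two hypothetical systems. The field names, the sample records, and the `normalize_email` and `reconcile` helpers are all assumptions made for this example; none of them come from the EIU report or the interview.

```python
def normalize_email(raw):
    """Cleansing step: trim whitespace and lowercase so equal addresses compare equal."""
    return raw.strip().lower()


def reconcile(system_a, system_b, key="email", field="country"):
    """Compare two lists of records and report where they disagree.

    Returns a dict with three sorted lists: keys found only in A,
    keys found only in B, and keys in both whose `field` values differ.
    """
    a = {normalize_email(r[key]): r for r in system_a}
    b = {normalize_email(r[key]): r for r in system_b}
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "conflicts": sorted(k for k in shared if a[k][field] != b[k][field]),
    }


# Hypothetical sample data standing in for a CRM export and a web sign-up log.
crm = [
    {"email": "Ann@Example.com ", "country": "US"},
    {"email": "bob@example.com", "country": "DE"},
]
web = [
    {"email": "ann@example.com", "country": "US"},
    {"email": "bob@example.com", "country": "FR"},
    {"email": "cy@example.com", "country": "JP"},
]

report = reconcile(crm, web)
print(report)
```

In this toy case only the conflicting and unmatched records need human attention, which is one way to read the difference between data that is merely collected and data that can be acted on.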
More than 50% of the firms surveyed in the EIU report identified data reconciliation as problematic. The report characterized the top five areas of investment as:
- Ensuring data accuracy and reliability
- Integrating data across the organization
- Developing higher levels of data security
- Implementing data analysis tools and software
- Expanding infrastructure to handle more data
Storage and security are not the solution
Companies that fail to execute a strategy around Big Data programs and Data Science teams will be left behind. Collecting terabytes of secure content is no replacement for the actionable data elements that exist within the collected content. For example, the EIU report quotes Wim Vriens of Levi Strauss as saying, “[after two years of data reconciliation and Data Science team building] we are only beginning to see the opportunities that this insight can bring to our brands and products.”
Levi Strauss and 50% of the respondents to the EIU survey indicated that data is a strategic asset to their company (EIU Report Page 27). However, there is a difference between collected data and actionable data. Actionable data is the result of data reconciliation. Collected data is the result of multiple systems creating data. Unfortunately, these systems are creating more data than can be managed using current people, processes and technology. Among the top five systems generating data for companies are (EIU Report Page 28):
- Web data
- Mobile usage (location-based information, mobile apps)
- Social media (Facebook, Twitter, blogs, etc.)
- Sensors (smart grid, manufacturing, etc.)
For companies to get on top of the data they collect, they must hire people with a Data Science perspective. However, when hiring people for Data Science jobs, educational and business experience must be considered holistically.
What is a data scientist?
Too much data is a Big Data opportunity, and we’re happy to have it. However, we need people to help us take advantage of it.
Big Data requires a clear strategy around data science. Companies of all sizes are facing growth in email, social media, web, sensor and other data. There is no doubt that data science is a cross-disciplinary effort, but it also requires a passion for connecting separate objects and datasets into discernible actions and visualizations.
Big Data opportunities will be managed by executing a strategy focused on hiring people with visual-spatial perspectives on tools, techniques, and content and information management. Do not expect storage and security products to address Big Data opportunities for businesses requiring actionable data.
“I’m a big fan of Hilary Mason, chief scientist at bit.ly, so I’ll cite her definition: a data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning.” (Daniel Tunkelang, Principal Data Scientist at LinkedIn)
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo – starry eyed explorers and skeptical detectives.” (Monica Rogati, Senior Data Scientist, LinkedIn)
“A data scientist is someone who analyzes an organization’s big data to discover actionable trends that lead to business results. Data scientists look at what questions business people need to ask to remain competitive. They work directly with C-level executives, advising them on how to drive maximum value from big data and integrate new information. In many ways, a data scientist serves as a change agent in today’s workforce, pushing organizational collaboration and information integration.” (Anjul Bhambhri, IBM’s vice president of Big Data Products, August 2011 definition)
Video interviews with Data Scientists
Jeff Hammerbacher, Chief Scientist, Cloudera
[...]ement, a data rat akin to a gym rat: someone who is generally looking for new data sets to compare, cleanse, and munge (manger/mangi/mung, that is, chew the data) to expose actionable data elements as a visualization, etc.
Five components of Jeff Hammerbacher’s Introduction to Data Science course at the University of California, Berkeley, in Spring 2011:
- Data collection and integration (munging) (1:31)
- Visualization design/dashboard design (1:47)
- Large scale experimentation (2:19)
- Causal inference and observation (3:02)
- Data products: how to deploy them and set a refresh cycle (3:56)
What is a data scientist? (1:35 through 4:10)
- Curiosity and a passion for really getting to an answer
- Core attributes are more about personality than pure skills; data jujitsu is the art of turning data into a product. Heuristics are more important than core skills of math, stats, etc.
How do you build a data science team? (4:20)
- A combined team, NOT an auxiliary team: data scientists must be embedded so that a sense of ownership of the data science lives within the team.