As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. The arrival of MAD (Magnetic, Agile, Deep) data analysis is a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, with profound consequences.
Standard business practices for large-scale data analysis are based on the notion of an “Enterprise Data Warehouse” (EDW) queried by “Business Intelligence” (BI) software. BI tools produce reports and interactive interfaces that summarize data via basic aggregation functions (e.g., counts and averages) over various hierarchical breakdowns of the data.
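As a minimal sketch of the kind of summary described above, the following Python fragment computes counts and averages at successive levels of a hierarchy. The sales records, the region/city hierarchy, and the `rollup` helper are hypothetical and chosen only for illustration; they do not come from any particular BI product.

```python
from collections import defaultdict

# Hypothetical sales records: (region, city, amount). Illustrative data only.
SALES = [
    ("West", "Seattle", 120.0),
    ("West", "Seattle", 80.0),
    ("West", "Portland", 50.0),
    ("East", "Boston", 200.0),
    ("East", "Boston", 100.0),
]


def rollup(records, depth):
    """Return {key: (count, average)}, grouping on the first `depth` hierarchy levels."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [row count, sum of amounts]
    for *levels, amount in records:
        key = tuple(levels[:depth])
        totals[key][0] += 1
        totals[key][1] += amount
    return {key: (n, total / n) for key, (n, total) in totals.items()}


by_region = rollup(SALES, 1)  # coarse breakdown: one row per region
by_city = rollup(SALES, 2)    # finer breakdown: one row per (region, city)
grand_total = rollup(SALES, 0)  # no breakdown: a single summary row
```

Moving from `by_region` to `by_city` corresponds to a drilldown, and the reverse direction to a rollup; in both cases only simple aggregates are computed.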
Traditionally, a carefully designed EDW is considered to have a central role in good IT practice. The design and evolution of a comprehensive EDW schema serve as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes. The resulting database serves as the repository of record for critical business functions.
The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system but also jealously controls access to ensure that executives can rely on a high quality of service. While the EDW approach is still valid, a number of factors are pushing towards a very different philosophy for large-scale data management in the enterprise.
First, storage is now so cheap that small subgroups within an enterprise can develop an isolated database of astonishing scale within their discretionary budget. The world’s largest data warehouse from just over a decade ago can be stored on fewer than 20 commodity disks priced at under $100 today.
Meanwhile, the number of massive-scale data sources in an enterprise has grown remarkably: massive databases arise today even from single sources like clickstreams, software logs, email and discussion forum archives, etc. Finally, the value of data analysis has entered common culture, with numerous companies showing how sophisticated data analysis leads to cost savings and even direct revenue. The end result of these opportunities is a grassroots move to collect and leverage data in multiple organizational units.
While this has many benefits in fostering efficiency and a data-driven culture, it adds to the force of data decentralization that data warehousing is supposed to combat.
In this changed climate of widespread, large-scale data collection, there is a premium on what is called MAD analysis skills. The acronym arises from three aspects of the new environment:
Magnetic: Traditional EDW approaches “repel” new data sources, discouraging their incorporation until they are carefully cleansed and integrated. Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.
Agile: Data warehousing orthodoxy is based on long-range, careful design and planning. Given growing numbers of data sources and increasingly sophisticated and mission-critical data analyses, a modern warehouse must instead allow analysts to easily ingest, digest, produce and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution.
Deep: Modern data analyses involve increasingly sophisticated statistical methods that go well beyond the rollups and drilldowns of traditional BI. Moreover, analysts often need to see both the forest and the trees in running these algorithms – they want to study enormous datasets without resorting to samples and extracts. The modern data warehouse should serve both as a deep data repository and as a sophisticated algorithmic runtime engine.
As noted by Varian, there is a growing premium on analysts with MAD skills in data analysis. These are often highly trained statisticians, who may have strong software skills but would typically rather focus on deep data analysis than on database management. They need to be complemented by MAD approaches to data warehouse design and database system infrastructure. These goals raise interesting challenges that differ from the traditional focus of data warehousing research and industry.