Big Data and Cybersecurity

Cyberspace and cybersecurity contain numerous problems in search of novel approaches that can facilitate dynamic, results-driven solution sets. Big data, if examined from a complex, multi-disciplinary perspective, offers a range of potential advantages to cyber offense and defense for public and private sector entities ranging from small businesses to the national security community. This post briefly highlights the foundations of a research push, still in its infancy, to assess the application of big data to national cybersecurity. While the focus is national cybersecurity writ large, the lessons learned are likely to matter to organizations and individuals as big data applications for cybersecurity become increasingly affordable.

Big data analysis as a concept is hard to pin down. Generally, it is considered to involve extremely large observational datasets generated through human or technical means in either structured or unstructured formats. Its defining characteristic is that, rather than sampling from a population as in conventional statistical analysis, big data analysis assumes that the data are themselves the population, or such a large proportion of it that the mechanisms of analysis differ. Rather than inferring from a sample to a population, the analyst looks to the population itself for novel insights into some form of action or behavior, machine or human. Moreover, big data is a relatively modern concept. The ability to aggregate, store, process and subsequently analyze the data relies on the computational power of modern computing devices; the analysis cannot be conducted by hand or through simple observation.

Big data are large, complex and most commonly unstructured. IBM identifies four dimensions associated with big data: Volume (scale of data), Velocity (analysis of streaming data), Variety (different forms of data) and Veracity (uncertainty of data).[1] The insights afforded by big data are of value when attempting to understand or solve multiple problem sets. Often the data exhaust (the data unneeded for initial analysis) is where the gems for unknown questions reside.[2] Because of its scale and broadly encompassing character, big data has the potential to facilitate answers to both known and unknown problems. Yet big data is not without problems, notably the misinterpretation of noise (i.e., error) resident within the data.

Big data is useful for multiple applications. First, novel applications of big data can help solve some cybersecurity problems. In particular, big data can help identify anomalous behavior patterns in network traffic or among human operators. This occurs by bolstering the analytic and machine learning techniques already employed by intrusion detection systems (IDS). Second, big data collected both within cyberspace and from sensors in various areas of operations can provide insight into both the human and technical terrain of a given area. In his recent book Data and Goliath, Bruce Schneier examines how the cost function associated with storing data has made it possible not just to collect large and diverse volumes of data, but also to store that data efficiently.[3] The data exhaust that every person generates is massive, and each click, purchase, facial scan, fingerprint and much more can help to build innovative, tailored information environments for everything from the purchasing of goods and services to the tracking of transnational terrorists. Examples of big data use are growing in ubiquity. Whether it is text analysis using CAPTCHA codes[4] or search analysis on flu or dengue fever,[5] big data is present and exploitable.
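
To make the idea of identifying anomalous patterns in network traffic concrete, here is a minimal sketch in Python. It assumes a hypothetical list of per-minute byte counts for a single host and flags values that sit far from a robust baseline (the median and the median absolute deviation). The function name, the threshold and the sample numbers are illustrative assumptions, not drawn from any particular IDS product.

```python
from statistics import median

def flag_anomalies(byte_counts, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    Uses the median and median absolute deviation (MAD), which are less
    distorted by the outliers being searched for than mean and stdev.
    """
    center = median(byte_counts)
    mad = median(abs(x - center) for x in byte_counts)
    if mad == 0:
        return []  # no spread in the data, nothing to score against
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    return [
        i for i, x in enumerate(byte_counts)
        if abs(0.6745 * (x - center) / mad) > threshold
    ]

# Hypothetical per-minute byte counts for a single host
traffic = [510, 495, 502, 498, 505, 9800, 501, 499]
print(flag_anomalies(traffic))  # prints [5]
```

A detector like this sees only one narrow stream of data. It can say that minute 5 is unusual, but not why.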

While big data is present and growing in ubiquity, the concerns associated with its use are pervasive. Privacy concerns about big data are being heard. In 2014 the President received a report on privacy and big data from the President's Council of Advisors on Science and Technology.[6] The report notes that the pervasiveness of data generation makes traditional notice and consent burdensome to individual users and instead recommends placing that burden on the organization. Yet here too privacy concerns arise when considering the types of organizations collecting and storing data. The report outlines priorities and indicates a recognition by the U.S. Government of the policy and legal challenges faced by both the public and the private sector with regard to the collection and analysis of data in large volumes. The field of study is growing, yet the impact of the use of big data on the public consciousness and discourse is real and persistent.[7]

Yet despite the recognition that big data challenges privacy, its ability to effect positive change might revolutionize aspects of cybersecurity and military operations, reducing costs and increasing efficiencies for both the cyber warrior and the boots-on-the-ground soldier. Below is just one of the potential applications, already being worked on by multiple actors, public and private, that offers potential benefits to national security.

A Smart IDS

Cybersecurity failures are not solely technical or human problems. Instead, they run the gamut from simple errors to a complex amalgam of human and computer interaction that results in undesirable outcomes. There is little doubt that, when functioning in the desired fashion, computer and human interactions can generate positive net benefits. Yet whether the error is physical, logical or human, and whether intentionally or unintentionally induced, the complexity of the problems can be overwhelming when they are considered in isolation.

Dumb IDS (constrained data-stream, anomaly-based or signature-based) that operate independently of data from other parts of an organization can collect, store and detect anomalous traffic patterns resident within a network.[8] These systems can offer extreme power and systemic security, yet as the needs, uses and goals of an organization change, their ability to adapt rapidly is limited to their purview of collection. Conventional IDS is a form of big data analysis, leveraging two, perhaps three, of the “V’s” of big data identified by IBM. IDS data have large volume and velocity (i.e., real-time streaming), but their variety and veracity are limited in scope. By informing IDS with project, human resources, market, weather, political and other data from multiple disciplines that bear directly on understanding the volume, type, origin and destination of data, it is possible to move from dumb (i.e., constrained-lens) to smart (i.e., multi-lens) dynamic analytic processes. Network security that moves beyond the network, and that reimagines how data are used by examining the causal mechanisms of failure behind the anomalous behavior identified in traditional IDS models, is complex. It requires modifications to existing cluster computing algorithms or the creation of entirely new ones, new visualization strategies to convey complex information to operators, legal and privacy considerations to avoid illegal collection, and policy frameworks to facilitate reasoned, rule-based approaches to the collection and analysis of the information generated.
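
The move from a constrained lens to a multi-lens view can be sketched as a simple enrichment step. The Python below is a minimal illustration only. It assumes hypothetical alert records from a network sensor and a hypothetical table of organizational context (for example, something derived from project or human resources data). Every name, field and value here is invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    host: str
    hour: int        # hour of day, 0-23
    bytes_out: int

# Hypothetical organizational context, e.g. drawn from project or HR systems:
# hosts with an approved large transfer and the hours it is expected to run.
APPROVED_TRANSFERS = {
    "build-server": range(1, 4),  # nightly backup, 01:00-03:59
}

def triage(alert: Alert) -> str:
    """Label an alert using context the network sensor alone cannot see."""
    window = APPROVED_TRANSFERS.get(alert.host)
    if window is not None and alert.hour in window:
        return "expected"     # anomaly explained by known activity
    return "investigate"      # no context accounts for the traffic

alerts = [
    Alert("build-server", 2, 9_000_000),
    Alert("build-server", 14, 9_000_000),
    Alert("hr-laptop", 2, 9_000_000),
]
for a in alerts:
    print(a.host, a.hour, triage(a))
```

The three alerts carry the same traffic volume, yet only the first is explained by known activity. That difference comes entirely from non-network data, which is the distinction drawn above between dumb and smart IDS.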

Multi-disciplinary Big Data

This blog post is meant to hint at the numerous potential research agendas available for study within the emerging branch of analytics known as big data. Big data applications are not isolated to one use case, and the potential applications are limited only by the creativity of those wishing to use it for innovative solutions. By examining and leveraging big data from social (i.e., behavioral/cognitive, legal, policy, cultural, historical, geographical) and technical (computer science, engineering, physics) perspectives, the resulting research is likely to incorporate multiple perspectives and yield more useful products and applications.

Footnotes

[1] “The Four V’s of Big Data.” Ibmbigdatahub.com. Accessed September 9, 2015. http://www.ibmbigdatahub.com/infographic/four-vs-big-data.

[2] Mayer-Schönberger, Viktor, and Kenneth Cukier. 2013. Big data: a revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt.

[3] Schneier, Bruce. 2015. Data and Goliath: the hidden battles to collect your data and control your world. New York: W.W. Norton & Company.

[4] See: Agarwal, Shivam, “Utilizing Big Data in Identification and Correction of OCR Errors”. http://digitalscholarship.unlv.edu/cgi/viewcontent.cgi?article=2915&context=thesesdissertations

[5] See: https://www.google.org/flutrends/about/

[6] See: https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf

[7] Brantly, Aaron F. “The Changed Conversation About Surveillance Online”. https://www.nditech.org/blog/2013/10/changed-conversation-about-surveillance-online

[8] See: https://en.m.wikipedia.org/wiki/Intrusion_detection_system for a basic explanation of IDS.

TheCyberDefenseReview@WestPoint.edu