To banish the ominous scenario of garbage in, evil out, we must reject bad stuff no matter how closely it resembles good. For that, we need to precisely define the good, ensure that it can be identified accurately, and handle it correctly. Formal methods come to the fore. If you call this a platitude, you are right. The hard part is to thoroughly buy into it and keep the faith in the face of myriad temptations and compulsions to do otherwise. Among the challenges are technical subtleties that lurk behind the innocuous adverb "thoroughly".
PDF has become a de facto standard for exchanging electronic documents, for visualization as well as for printing. However, it has also become a common delivery channel for malware, and previous work has highlighted features that lead to security issues. In our work, we focus on the structure of the format, independently of specific features. By methodically testing PDF readers against hand-crafted files, we show that the interpretation of PDF files at the structural level may cause some form of denial of service, or be ambiguous and lead to rendering inconsistencies among readers. We then propose a pragmatic solution that restricts the syntax to avoid common errors, and give a formal grammar for the restricted format. We explain how data consistency can be validated at a finer-grained level using a dedicated type checker. Finally, we assess this approach on a set of real-world files and show that our proposals are realistic.
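To give a flavor of the kind of syntactic restriction argued for above, the following sketch is a hypothetical, heavily simplified illustration (not the authors' grammar or tool): it accepts only one unambiguous spelling of two structural PDF elements and rejects everything else before any further processing.

```python
import re

# Hypothetical, simplified checks in the spirit of a restricted PDF syntax:
# accept exactly one spelling of each structural element, reject the rest.

OBJ_HEADER = re.compile(rb"^(\d{1,10}) (\d{1,5}) obj$")   # e.g. "12 0 obj"
XREF_ENTRY = re.compile(rb"^\d{10} \d{5} [nf]\r?\n$")     # e.g. "0000000017 00000 n"

def check_object_header(line: bytes) -> tuple[int, int]:
    """Accept exactly 'num gen obj' with single spaces; reject variants."""
    m = OBJ_HEADER.match(line.rstrip(b"\r\n"))
    if not m:
        raise ValueError(f"malformed object header: {line!r}")
    return int(m.group(1)), int(m.group(2))

def check_xref_entry(line: bytes) -> None:
    """Accept only the fixed-width cross-reference entry format."""
    if not XREF_ENTRY.match(line):
        raise ValueError(f"malformed xref entry: {line!r}")

# A sloppy header that some readers tolerate is rejected here.
check_object_header(b"12 0 obj\n")        # ok
# check_object_header(b"12  0 obj\n")     # raises: double space is ambiguous
```

The point of such a restriction is that every reader either accepts the same byte string the same way or rejects it, removing the rendering disagreements the abstract describes.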
In any software system, unprincipled handling of input data presents significant security risks. This is particularly true on mobile platforms, where the prevalence of applications written by amateur developers, combined with devices that hold a wealth of users' personal information, can lead to significant security and privacy concerns. Of particular concern is the so-called shotgun parser pattern, in which input recognition is intermixed with input processing throughout the code base. In this work, we take the first steps toward building a tool for identifying shotgun parsers in Android applications. By extending the FlowDroid framework for static taint analysis, we are able to quantify the spread of untrusted data through 55 applications selected from 15 categories on the Google Play store. Our analysis reveals that most untrusted input propagates only a relatively short distance within the application code; however, we also find several specific instances of very long data propagations. In addition to providing a first look at the "state of parsing" in a variety of Android applications, our work lays the groundwork for more precise shotgun parser signature recognition.
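The "shotgun parser" pattern is easiest to see in code. The following sketch is purely illustrative (Python rather than the paper's Android/Java setting, with invented names): the first handler interleaves recognition with processing, so tainted data travels far before being validated, which is exactly the long propagation distance the analysis measures; the second fully recognizes the input first.

```python
# Hypothetical illustration of the anti-pattern: recognition of untrusted
# input is interleaved with processing throughout the handler.
def handle_message_shotgun(raw: str, db) -> None:
    parts = raw.split("|")                 # partial recognition...
    db.log(parts[0])                       # ...already acting on tainted data
    if len(parts) > 2 and parts[2] == "admin":
        db.grant(parts[1])                 # validation scattered, easy to miss

# The LangSec-style alternative: fully recognize, then hand a typed,
# validated value to the processing code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    user: str
    role: str

def parse_message(raw: str) -> Message:
    parts = raw.split("|")
    if len(parts) != 3 or parts[2] not in {"admin", "user"}:
        raise ValueError("reject: not a well-formed message")
    return Message(user=parts[1], role=parts[2])

def handle_message(raw: str, db) -> None:
    msg = parse_message(raw)               # recognition happens once, up front
    if msg.role == "admin":
        db.grant(msg.user)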
Schema validation for the Extensible Markup Language (XML) corresponds to checking language acceptance and is therefore a first-line defense against attacks. However, extension points in XML Schema can render validation ineffective. Extension points are wildcards that are considered best practice for loose composition, but an attacker can also place arbitrary content at an extension point without rejection, e.g., for a signature wrapping attack. This paper presents an incremental learner that infers types and datatypes in mixed-content XML in terms of a datatyped XML visibly pushdown automaton as the language representation. An inferred automaton is free from extension points and capable of stream validation, e.g., in a monitoring component. Additional operations for unlearning and sanitization improve practical usability. The learner is guaranteed to converge to a good-enough approximation of the true language, and convergence can be measured in terms of mind changes between incremental steps. All algorithms have been evaluated in four scenarios, including a web service implemented with Apache Axis2 and Apache Rampart, against which XML attacks were simulated. In all scenarios, the learned representation had zero false positives and outperformed schema validation.
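To illustrate the stream-validation idea (not the paper's learner or automaton), the sketch below validates an XML stream against a hard-coded, hypothetical set of element types with an explicit stack, in the spirit of a visibly pushdown automaton: start tags push, end tags pop, and each type's allowed children are fixed, so there is no wildcard extension point to abuse.

```python
import io
import xml.etree.ElementTree as ET

ALLOWED_CHILDREN = {          # hypothetical, hard-coded "inferred" types
    "order": {"id", "items"},
    "items": {"item"},
    "item":  set(),
    "id":    set(),
}

def validate_stream(xml_bytes: bytes, root_type: str = "order") -> bool:
    stack: list[str] = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            if not stack:
                if elem.tag != root_type:
                    return False
            elif elem.tag not in ALLOWED_CHILDREN.get(stack[-1], set()):
                return False          # reject anything a wildcard would admit
            stack.append(elem.tag)
        else:
            stack.pop()
    return not stack

print(validate_stream(b"<order><id>1</id><items><item/></items></order>"))  # True
print(validate_stream(b"<order><signature/></order>"))                       # False
```

A real monitoring component would also check datatypes of text content and handle malformed XML, which the learner's representation covers and this toy check does not.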
Formal Language Theory for Security (LangSec) applies the tools of theoretical computer science to the problem of protocol design and analysis. In practice, most results have focused on protocol design, showing that by restricting the complexity of protocols it is possible to design parsers with desirable and formally verifiable properties, such as correctness and equivalence. When we consider existing protocols, however, many were not subjected to formal analysis during their design, and many are not implemented in a manner consistent with their formal documentation. Determining a grammar for such protocols is the first step in analyzing them, which places this problem in the domain of grammatical inference, for which a deep theoretical literature exists. In particular, although it has been shown that the higher-level classes of the Chomsky hierarchy cannot be generically learned, it is also known that certain subclasses of that hierarchy can be effectively learned. In this paper, we summarize some theoretical results for inferring well-known Chomsky grammars, with special attention to context-free grammars (CFGs) and their generated languages (CFLs). We then demonstrate that, despite negative learnability results in the theoretical regime, we can use long short-term memory (LSTM) networks to learn, with high accuracy, a grammar for URIs that appear in the Apache HTTP access logs of a particular server. We discuss these results in the context of grammatical inference, and suggest avenues for further research into the learnability of a subclass of the context-free grammars.
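For readers unfamiliar with the setup, a character-level LSTM over URIs can be sketched as below. This is a hypothetical PyTorch skeleton, not the paper's architecture, features, hyperparameters, or data: it merely shows the shape of a model that scores whether a URI looks like it belongs to the language of one server's access logs.

```python
import torch
import torch.nn as nn

VOCAB = {c: i + 1 for i, c in enumerate(
    "abcdefghijklmnopqrstuvwxyz0123456789/.-_?&=%~")}   # 0 is padding/unknown

class URIClassifier(nn.Module):
    def __init__(self, vocab_size=len(VOCAB) + 1, embed=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed, padding_idx=0)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, seq_len) of char ids
        emb = self.embed(x)
        _, (h, _) = self.lstm(emb)              # final hidden state summarizes the URI
        return torch.sigmoid(self.out(h[-1]))   # probability "in the language"

def encode(uri: str, max_len: int = 128) -> torch.Tensor:
    ids = [VOCAB.get(c, 0) for c in uri.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.tensor([ids])

model = URIClassifier()
print(model(encode("/index.html?id=42")).item())   # untrained, so roughly 0.5
```

Training such a model on positive examples from the logs (and suitable negatives) is what turns it into the approximate recognizer the abstract describes.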
We previously presented a theory of analysis for expressive low-level languages that is capable of proving non-interference. We now present an implementation of that analysis, an empirical evaluation of that implementation, and optimizations to improve its performance. In the course of developing an optimization, we provide an independence result for the taint-flow analysis that drives the tracking of information. In particular, we show that the taint-tracking can be derived from the results of a taint-free analysis. In addition to improving performance, this independence broadens the applicability of the underlying approach to information-flow analysis. Finally, we present metrics for our analysis on a suite of applications that demonstrate improvements in its performance.
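The independence idea can be pictured with a toy example (hypothetical; not the paper's analysis or representation): run a taint-free data-flow analysis once, record its value-flow edges, and then derive taint facts for any choice of sources by simple reachability over those edges, instead of re-running the analysis per source set.

```python
from collections import deque

# flow_edges: output of a (taint-free) data-flow analysis,
# mapping each variable to the variables its value flows into.
flow_edges = {
    "input":  {"buf"},
    "buf":    {"query", "log_msg"},
    "query":  {"db_call"},
    "config": {"log_msg"},
}

def derive_taint(sources: set[str]) -> set[str]:
    """Taint facts as reachability over a precomputed flow graph."""
    tainted, work = set(sources), deque(sources)
    while work:
        v = work.popleft()
        for w in flow_edges.get(v, ()):
            if w not in tainted:
                tainted.add(w)
                work.append(w)
    return tainted

print(derive_taint({"input"}))   # input, buf, query, log_msg, db_call
```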
We present a case study of applying the LangSec approach to DNP3, a popular ICS protocol, and of using the Hammer toolkit to build a validating proxy for DNP3, which can be thought of as an exhaustive-inspection, protocol-specific firewall. In the process we learned a great deal both about the protocol and about how it could be specified to avoid ambiguity in the standard, reduce the potential for vulnerabilities, and improve parser performance, all at the same time.
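The essence of a validating proxy is easy to state in code. The sketch below is a deliberately reduced, hypothetical illustration in Python: it is not a complete DNP3 link-layer parser and does not use the Hammer API; it only shows the policy of fully recognizing a frame before forwarding it and dropping anything that does not parse.

```python
import struct

def validate_frame(frame: bytes) -> bool:
    """Simplified header check; field layout is illustrative, not the full spec."""
    if len(frame) < 10:
        return False
    start, length, control, dest, src = struct.unpack_from("<HBBHH", frame)
    if start != 0x6405:        # the 0x05 0x64 sync bytes, read little-endian here
        return False
    if length < 5:             # length must at least cover the header fields
        return False
    # ... CRC and application-layer checks would follow in a real proxy
    return True

def proxy(frame: bytes, forward) -> None:
    if validate_frame(frame):
        forward(frame)         # only forward what fully parses
    # otherwise drop (or log); never "fix up" a malformed frame

print(validate_frame(bytes([0x05, 0x64, 0x05, 0xC4, 0x01, 0x00, 0x02, 0x00, 0x00, 0x00])))
```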
Checking the incoming payloads of a complex protocol is known to be a hard, error-prone task for the programmer. Yet, despite the widely shared advice that input-validation errors are a leading cause of vulnerabilities and must be avoided, exactly what the ICS/SCADA programmer should do to reliably avoid these errors is still unclear. There are many conditions to be checked: is the incoming data encoded, delimited, and encapsulated correctly? Do its values fit into correct ranges? Do they relate correctly to other incoming values? All of these must be checked at some point before the incoming data can be safely used, but in what order? How can you make sure no check has been missed? This session aims to show how more secure input-checking ICS code can be written with less effort and fewer errors.
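One answer to the ordering question is to layer the checks: syntax first (encoding, delimiting, encapsulation), then per-field ranges, then relations between fields, and only then use of the data. The sketch below is a minimal, hypothetical example of that ordering; the field names and limits are invented and do not come from any real ICS protocol.

```python
def parse_payload(raw: bytes) -> dict:
    # 1. Syntax: correct encoding and delimiting, or reject outright.
    try:
        fields = raw.decode("ascii").strip().split(",")
    except UnicodeDecodeError:
        raise ValueError("reject: not ASCII")
    if len(fields) != 3:
        raise ValueError("reject: wrong field count")
    setpoint, low_alarm, high_alarm = (int(f) for f in fields)  # non-numeric also rejects

    # 2. Ranges: each value fits its documented bounds.
    for name, value in [("setpoint", setpoint), ("low", low_alarm), ("high", high_alarm)]:
        if not 0 <= value <= 4095:
            raise ValueError(f"reject: {name} out of range")

    # 3. Relations: values consistent with one another.
    if not low_alarm <= setpoint <= high_alarm:
        raise ValueError("reject: setpoint outside alarm band")

    return {"setpoint": setpoint, "low": low_alarm, "high": high_alarm}

print(parse_payload(b"2048,100,4000"))    # accepted
```

Because every check lives in the one function that produces the validated value, "has any check been missed?" becomes a question about a single, reviewable piece of code rather than about the whole program.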
This report describes our experience in answering these questions and communicating the answers to developers and stakeholders in the ICS community.
Preliminary experiment design and research goals are presented for measuring the applicability of restricted computational complexity environments to general-purpose development efforts. The Linux kernel is examined through the lens of LangSec in order to gain insight into the make-up of the kernel code with respect to the Chomsky-hierarchy complexity class of the recognizer required for the input to each component. Manual analysis is assisted by LLVM passes and by comparison with the real-time Linux fork. This paper describes an ongoing effort whose goal is to justify further research in the field of restricted computational environments.
Security and privacy of computation, and the related concept of (deliberate) sharing, have, historically, largely been afterthoughts. In a traditional multi-user, multi-application web hosting environment, typically applications are public by default. Applications wishing to offer a notion of private resources must take it upon themselves to independently manage authentication and authorization of users, leading to difficult and disjointed notions of access and sharing. In such a context, LangSec-based vulnerabilities threaten catastrophic loss of privacy for all users of the system, likely even of non-vulnerable applications. This is a tragic state of affairs, but is thankfully not inevitable! We present the Sandstorm system, a capability-based, private-by-default, tightly-sandboxing, proactively secure environment for running web applications, complete with a single, pervasive sharing mechanism. Sandstorm, and capability systems, are likely of interest to the LangSec community: LangSec bugs are mitigated through the robust isolation imposed by the Sandstorm supervisor, and the mechanism of capability systems offers the potential to turn difficult authorization decisions into LangSec's bread and butter, namely syntactic constraints on requests: every well-formed request which can be stated is authorized. We present aspects of the Sandstorm system and show how those aspects have, by building systematic protection into several levels of the system, dramatically reduced the severity of LangSec bugs in hosted applications. To study the range of impact, we will characterize addressed vulnerabilities using MITRE's Common Weakness Enumeration (CWE) scheme.
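The slogan "every well-formed request which can be stated is authorized" can be made concrete with a small, hypothetical sketch (this is not Sandstorm's API): resources are only nameable through unguessable capability tokens that have been explicitly granted, so the authorization decision collapses into recognizing whether a request parses against the grant table.

```python
import secrets

class GrantTable:
    def __init__(self):
        self._caps: dict[str, str] = {}            # token -> resource id

    def grant(self, resource_id: str) -> str:
        token = secrets.token_urlsafe(32)           # unforgeable reference
        self._caps[token] = resource_id
        return token

    def resolve(self, token: str) -> str:
        # Recognition is the authorization check: an unknown token simply
        # fails to parse as a reference to anything.
        if token not in self._caps:
            raise PermissionError("no such capability")
        return self._caps[token]

table = GrantTable()
cap = table.grant("grain-42")       # shared with a user, e.g. via a link
print(table.resolve(cap))           # "grain-42"
```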
The point of Software-Defined Infrastructure is an infrastructure that is at once more flexible, controllable, and transparent to users and developers. One important characteristic of this infrastructure is that it is not owned or controlled by the user. At runtime, it is an opaque black box. Thus, it must have guaranteed properties of both performance and function. Infrastructure also has limited visibility and debuggability: it is hard to diagnose network problems, and it is hard to diagnose runtime issues on a remote system. Thus, programs which manipulate the infrastructure (e.g., orchestration systems, SDN applications, etc.) should have their infrastructure manipulations verified, to the extent that this is possible. We need to catch bugs statically, in both performance and correctness, to the extent that we can. Infrastructure configurations ought to be inherently verifiable. Verification of state-free systems is in NP; verification of finite-state systems is always in PSPACE, and many problems are in NP. It has been shown by a number of authors that OpenFlow rulesets are state-free, and their verification is therefore in NP. Similar arguments can be made for configuration layers and workflow engines, depending on precise semantics. These results imply that the underlying model of computation for configuration of software-defined networking, and for at least some elements of software-defined infrastructure, is state-free or, at worst, finite-state, and therefore that verification of these systems is relatively tractable. The large challenge before the community is then to design configuration models for software-defined infrastructure that preserve the precise and weak semantics of the implementation domain; offer appropriate abstractions of performance characteristics; and nonetheless retain usability and concision.
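Why state-freedom helps can be seen in miniature below. This is a hypothetical toy, not a real verifier or a real OpenFlow ruleset: because the table has no state, a property is just a predicate over (packet, rule table), so checking it is a search over header values rather than over program executions.

```python
from itertools import product

# Each rule: (in_port or None, dst_prefix or None) -> action; first match wins.
RULES = [
    ((2, "10.0.1"), "drop"),      # block port 2 from reaching the 10.0.1.* segment
    ((None, "10.0.1"), 7),
    ((None, None), 1),
]

def forward(in_port: int, dst: str):
    for (port, prefix), action in RULES:
        if (port is None or port == in_port) and (prefix is None or dst.startswith(prefix)):
            return action
    return "drop"

# Property: nothing arriving on port 2 is ever forwarded to output port 7.
def verify() -> bool:
    hosts = [f"10.0.{i}.{j}" for i, j in product(range(4), range(4))]
    return all(forward(2, dst) != 7 for dst in hosts)

print(verify())   # True for this table
```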
The Network Time Protocol and the NTP reference implementation are the de facto standard for synchronizing computer clocks on the Internet. In the past year, a number of vulnerabilities have been discovered which allow attackers to use NTP to arbitrarily control the clock of a remote system. In a number of cases, software engineering choices appear to have played a role in causing or obscuring the vulnerability. In this short presentation, we'll review several recent vulnerabilities and highlight the software engineering choices that may have played a role in their origin.
(Invited presentation)
Disclaimer: All opinions expressed are the speaker's personal opinions, not those of Cisco.
(Invited presentation)
How do we make LangSec into a part of a security software development lifecycle?