LangSec Workshop

at IEEE Security & Privacy, Thursday May 26, 2016

Abstracts

Keynote

Opening Keynote: Doug McIlroy, "Buy/by the Book Or Bye-Bye the Game"

To banish the ominous scenario of garbage in, evil out, we must reject bad stuff no matter how closely it resembles good. For that, we need to precisely define the good, assure that it can be identified accurately, and handle it correctly. Formal methods come to the fore. If you call this a platitude, you are right. The hard part is to thoroughly buy into it and keep the faith in the face of myriad temptations and compulsions to do otherwise. Among the challenges are technical subtleties that lurk behind the innocuous adverb "thoroughly".

Papers

Guillaume Endignoux, Olivier Levillain, Jean-Yves Migeon, "Caradoc: a Pragmatic Approach to PDF Parsing and Validation"

PDF has become a de facto standard for exchanging electronic documents, for visualization as well as for printing. However, it has also become a common delivery channel for malware, and previous work has highlighted features that lead to security issues. In our work, we focus on the structure of the format, independently of specific features. By methodically testing PDF readers against hand-crafted files, we show that the interpretation of PDF files at the structural level may cause some form of denial of service, or be ambiguous and lead to rendering inconsistencies among readers. We then propose a pragmatic solution: restricting the syntax to avoid common errors, for which we give a formal grammar. We explain how data consistency can be validated at a finer-grained level using a dedicated type checker. Finally, we assess this approach on a set of real-world files and show that our proposals are realistic.
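
The flavor of this approach can be sketched in a few lines: accept only a restricted object syntax, then run a separate consistency pass over the parse result. The Python toy below is a hypothetical illustration of that two-step structure, not Caradoc itself (the actual tool is written in OCaml, and real PDF syntax is far richer):

    # Hypothetical sketch: a restricted grammar for a sliver of PDF object
    # syntax, plus a finer-grained consistency check on indirect references.
    import re

    TOKEN = re.compile(r"""
        (?P<ref>\d+\s+\d+\s+R)          # indirect reference, e.g. "3 0 R"
      | (?P<int>[+-]?\d+)
      | (?P<name>/[A-Za-z][A-Za-z0-9]*)
      | (?P<dopen><<) | (?P<dclose>>>)
      | (?P<ws>\s+)
    """, re.VERBOSE)

    def tokenize(s):
        pos, out = 0, []
        while pos < len(s):
            m = TOKEN.match(s, pos)
            if not m:
                raise ValueError(f"offset {pos}: not in the restricted syntax")
            if m.lastgroup != "ws":
                out.append((m.lastgroup, m.group()))
            pos = m.end()
        return out

    def parse_object(toks, i=0):
        """object := ref | int | name | '<<' (name object)* '>>'"""
        kind, text = toks[i]
        if kind in ("ref", "int", "name"):
            return (kind, text), i + 1
        if kind == "dopen":
            i, entries = i + 1, {}
            while toks[i][0] != "dclose":
                key_kind, key = toks[i]
                if key_kind != "name":
                    raise ValueError("dictionary keys must be names")
                entries[key], i = parse_object(toks, i + 1)
            return ("dict", entries), i + 1
        raise ValueError(f"unexpected token {text!r}")

    def check_refs(obj, defined):
        """Second pass: every indirect reference must resolve."""
        kind, payload = obj
        if kind == "ref" and payload.split()[0] not in defined:
            raise ValueError(f"dangling reference {payload!r}")
        if kind == "dict":
            for value in payload.values():
                check_refs(value, defined)

    obj, _ = parse_object(tokenize("<< /Type /Page /Parent 3 0 R >>"))
    check_refs(obj, defined={"3"})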

Katherine Underwood, Michael Locasto, "In Search Of Shotgun Parsers In Android Applications"

In any software system, unprincipled handling of input data presents significant security risks. This is particularly true in the case of mobile platforms, where the prevalence of applications developed by amateur developers in combination with devices that hold a wealth of users' personal information can lead to significant security and privacy concerns. Of particular concern is the so-called shotgun parser pattern, in which input recognition is intermixed with input processing throughout the code base. In this work, we take the first steps toward building a tool for identification of shotgun parsers in Android applications. By extending the FlowDroid framework for static taint analysis, we are able to quantify the spread of untrusted data through 55 applications selected from 15 categories on the Google Play store. Our analysis reveals that on average, most untrusted input propagates a relatively short distance within the application code. However, we also find several specific instances of very long data propagations. In addition to providing a first look at the "state of parsing" in a variety of Android applications, our work in this paper lays the groundwork for more precise shotgun parser signature recognition.
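
A rough sketch of the propagation-distance metric: treat the analysis output as a def-use graph and measure how many hops untrusted data travels from a source before reaching a sink. The graph and statement names below are hypothetical Python stand-ins, not FlowDroid (which operates on Android bytecode via Soot):

    # Hypothetical def-use graph: statement -> statements its value flows into
    from collections import deque

    flows = {
        "getIntent()": ["s1"],
        "s1":          ["s2"],
        "s2":          ["validate()", "s3"],
        "s3":          ["query()"],          # a longer flow that is never validated
    }

    def propagation_depths(source):
        """Shortest hop count from the untrusted source to each statement."""
        depths, queue = {source: 0}, deque([source])
        while queue:
            node = queue.popleft()
            for succ in flows.get(node, []):
                if succ not in depths:
                    depths[succ] = depths[node] + 1
                    queue.append(succ)
        return depths

    depths = propagation_depths("getIntent()")
    print("max propagation distance:", max(depths.values()))   # 4 hops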

Harald Lampesberger, "An Incremental Learner for Language-Based Anomaly Detection in XML"

Schema validation for the Extensible Markup Language (XML) corresponds to checking language acceptance and is therefore a first-line defense against attacks. However, extension points in XML Schema can render validation ineffective. Extension points are wildcards and are considered best practice for loose composition, but an attacker can also place arbitrary content at an extension point without rejection, e.g., for a signature-wrapping attack. This paper presents an incremental learner that infers types and datatypes in mixed-content XML in terms of a datatyped XML visibly pushdown automaton as language representation. An inferred automaton is free from extension points and capable of stream validation, e.g., in a monitoring component. Additional operations for unlearning and sanitization improve practical usability. The learner is guaranteed to converge to a good-enough approximation of the true language, and convergence can be measured in terms of mind changes between incremental steps. All algorithms have been evaluated in four scenarios, including a web service implemented with Apache Axis2 and Apache Rampart, where XML attacks have been simulated. In all scenarios, the learned representation had zero false positives and outperformed schema validation.
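
The kind of representation the learner converges to can be caricatured as a visibly pushdown automaton over tag events: open tags push, close tags pop, and a transition table with no wildcards rejects wrapped content outright. The element names and table below are invented for illustration, not an inferred automaton:

    # Hypothetical stand-in for a learned, wildcard-free stream validator.
    ALLOWED_CHILDREN = {
        None:        {"envelope"},           # document root
        "envelope":  {"header", "body"},
        "header":    {"signature"},
        "body":      set(),                  # no children permitted here
        "signature": set(),
    }

    def stream_validate(events):
        stack = []                           # open tag = call (push), close tag = return (pop)
        for kind, name in events:
            parent = stack[-1] if stack else None
            if kind == "open":
                if name not in ALLOWED_CHILDREN.get(parent, set()):
                    return False             # injected content: no wildcard to hide behind
                stack.append(name)
            else:                            # "close"
                if not stack or stack[-1] != name:
                    return False             # mismatched nesting
                stack.pop()
        return not stack

    ok = [("open", "envelope"), ("open", "body"),
          ("close", "body"), ("close", "envelope")]
    wrapped = [("open", "envelope"), ("open", "body"), ("open", "signature"),
               ("close", "signature"), ("close", "body"), ("close", "envelope")]
    assert stream_validate(ok) and not stream_validate(wrapped)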

Sheridan Curley, Richard Harang, "Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec"

Formal Language Theory for Security (LangSec) applies the tools of theoretical computer science to the problem of protocol design and analysis. In practice, most results have focused on protocol design, showing that by restricting the complexity of protocols it is possible to design parsers with desirable and formally verifiable properties, such as correctness and equivalence. When we consider existing protocols, however, many of these were not subjected to formal analysis during their design, and many are not implemented in a manner consistent with their formal documentation. Determining a grammar for such protocols is the first step in analyzing them, which places this problem in the domain of grammatical inference, for which a deep theoretical literature exists. In particular, although it has been shown that the higher-level categories of the Chomsky hierarchy cannot be generically learned, it is also known that certain subcategories of that hierarchy can be effectively learned. In this paper, we summarize some theoretical results for inferring well-known Chomsky grammars, with special attention to context-free grammars (CFGs) and their generated languages (CFLs). We then demonstrate that, despite negative learnability results in the theoretical regime, we can use long short-term memory (LSTM) networks to learn a grammar for URIs that appear in Apache HTTP access logs for a particular server with high accuracy. We discuss these results in the context of grammatical inference, and suggest avenues for further research into learnability of a subclass of the context-free grammars.
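
For the shape of the experiment, a minimal character-level LSTM scorer looks roughly like the sketch below: train a next-character model on observed URIs, then treat high per-character negative log-likelihood as "outside the learned grammar". It assumes PyTorch; the corpus, network sizes, epoch count, and scoring policy are placeholder choices, not the paper's configuration:

    # Minimal character-level LSTM language model over URIs (assumes PyTorch).
    import torch
    import torch.nn as nn

    uris = ["/index.html", "/api/v1/users?id=42", "/static/css/site.css"]
    chars = sorted({c for u in uris for c in u})
    idx = {c: i for i, c in enumerate(chars)}

    class CharLSTM(nn.Module):
        def __init__(self, vocab, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(vocab, 32)
            self.lstm = nn.LSTM(32, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)
        def forward(self, x):
            h, _ = self.lstm(self.embed(x))
            return self.out(h)

    def encode(u):
        return torch.tensor([[idx[c] for c in u]])

    model = CharLSTM(len(chars))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(200):                      # tiny loop, illustration only
        for u in uris:
            x = encode(u)
            logits = model(x[:, :-1])         # predict each next character
            loss = loss_fn(logits.reshape(-1, len(chars)), x[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

    def score(u):
        """Mean per-character negative log-likelihood; high = anomalous."""
        x = encode(u)
        with torch.no_grad():
            logits = model(x[:, :-1])
        return loss_fn(logits.reshape(-1, len(chars)), x[:, 1:].reshape(-1)).item()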

Research Reports

Peter Aldous, Matthew Might, "A Posteriori Taint-tracking for Demonstrating Non-interference in Expressive Low-level Languages"

We previously presented a theory of analysis for expressive low-level languages that is capable of proving non-interference. We now present an implementation of that analysis, an empirical evaluation of that implementation, and optimizations to improve performance. In the course of developing an optimization, we provide an independence result for the taint-flow analysis that drives the tracking of information. In particular, we show that the taint-tracking can be derived from the results of a taint-free analysis. In addition to improving performance, this independence broadens the applicability of the underlying approach to information-flow analysis. Finally, we present metrics for our analysis on a suite of applications, demonstrating improvements in its performance.
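
The independence result can be caricatured in a few lines: run the analysis once with no taint labels, keep the value-flow edges it discovers, and recover taint afterwards as plain graph reachability from the sources. The program points and edges below are hypothetical:

    # Hypothetical value-flow edges recorded by a taint-free analysis pass.
    value_flow = {
        "read_input": {"x"},
        "x":          {"y", "len"},
        "const_0":    {"len"},
        "y":          {"output"},
    }

    def derive_taint(sources):
        """A posteriori taint: reachability over already-computed flows."""
        tainted, frontier = set(sources), list(sources)
        while frontier:
            node = frontier.pop()
            for succ in value_flow.get(node, ()):
                if succ not in tainted:
                    tainted.add(succ)
                    frontier.append(succ)
        return tainted

    # Taint facts fall out without re-running the underlying analysis.
    assert "output" in derive_taint({"read_input"})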

Sven Hallberg, Adam Crain, Meredith L. Patterson, Sergey Bratus, "The State of Hammer: Ready for ICS"

We present a case study of applying the LangSec approach to DNP3, a popular ICS protocol, and of using the Hammer toolkit to build a validating proxy for DNP3, which can be thought of as an exhaustive-inspection, protocol-specific firewall. In the process we learned a lot about the protocol and about how it could be specified to avoid ambiguity in the standard, reduce the potential for vulnerabilities, and improve parser performance, all at the same time.

Checking the incoming payloads of a complex protocol is known to be a hard, error-prone task for the programmer. However, despite all the widely shared advice that input-validation errors are a leading cause of vulnerabilities and must be avoided, exactly what the ICS/SCADA programmer should do to avoid these errors with certainty remains unclear. There are many conditions to be checked: is the incoming data encoded, delimited, and encapsulated correctly? Do its values fit into correct ranges? Do they relate correctly to other incoming values? All of these must be checked at some point before the incoming data can be safely used, but in what order? How can you make sure no check has been missed? This session aims to finally answer how more secure input-checking ICS code can be written with less effort and fewer errors.

This report will describe our experience in answering these questions and communicating the answers to developers and stakeholders in the ICS community.
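
The overall pattern of a validating proxy can be sketched independently of Hammer: fully recognize each frame, running every check in a fixed order, before a single byte is forwarded. The 4-byte header layout below is invented for illustration and is much simpler than real DNP3 framing (or the Hammer grammar the actual proxy uses):

    # Hypothetical exhaustive-inspection proxy for a made-up framed protocol.
    import struct

    def parse_frame(data: bytes):
        # 1. encoding/delimiting: fixed start byte, declared length must agree
        if len(data) < 4 or data[0] != 0x05:
            raise ValueError("bad framing")
        length, func = struct.unpack_from(">HB", data, 1)
        if len(data) != 4 + length:
            raise ValueError("length field disagrees with payload size")
        # 2. ranges: function code must be one we recognize
        if func not in {0x00, 0x01, 0x81}:
            raise ValueError("function code out of range")
        # 3. relations between values: e.g., responses must carry a payload
        payload = data[4:]
        if func == 0x81 and not payload:
            raise ValueError("response frame without payload")
        return {"func": func, "payload": payload}

    def proxy(frame: bytes, forward):
        msg = parse_frame(frame)    # every check happens here, in one place,
        forward(frame)              # before the frame crosses the boundary
        return msg

    proxy(b"\x05\x00\x01\x81\x2a", print)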

Jacob Torrey, Jonathan Miodownik, "Research Report: Analysis of Software for Restricted Computational Environment Applicability"

Preliminary experiment design and research goals are presented to measure the applicability of restricted computational complexity environments in general-purpose development efforts. The Linux kernel is examined through the lens of LangSec in order to gain insight into the make-up of the kernel code vis-à-vis the Chomsky-hierarchy complexity class of the recognizer required for input to each component. Manual analysis is assisted by LLVM passes and by comparison with the real-time Linux fork. This paper describes an ongoing effort, with the goal of justifying further research in the field of restricted computational environments.

Nathaniel Filardo, "Mitigating LangSec Problems With Capabilities"

Language-theoretic security focuses on correctness of handling hostile input data as a syntactic matter and suggests that formalizing the languages of interchange is an effective way to eliminate many classes of bugs in software. Herein, we hope to suggest that a similar tack could apply when considering the possible actions of a program. That is, we view an application as speaking a formal language to its hosting kernel. By constraining this language, so that fewer actions on fewer objects may be expressed, one may hope to eliminate or mitigate bugs, LangSec-related or otherwise, in applications. We argue that Sandstorm, a supervisor for secure, multi-user web applications, has indeed mitigated a number of vulnerabilities in software run under its control. We hope to show that its capability-based architecture, which provides excellent formal vocabulary for discussing the permissions of applications and users in the system, is worthy of attention within the LangSec community.
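
The constrained-language view can be made concrete with a toy capability object: the application's only storage vocabulary is the two methods it is handed, so whole classes of requests are simply unexpressible. The names below are hypothetical; Sandstorm's actual mechanism operates at the supervisor and Cap'n Proto RPC layer:

    # Hypothetical capability object: the only storage "sentences" an app
    # can utter are read(name) and write(name, data) within its own root.
    class GrainStorage:
        def __init__(self, root):
            self._root = root
        def read(self, name):
            self._check(name)
            with open(f"{self._root}/{name}", "rb") as f:
                return f.read()
        def write(self, name, data):
            self._check(name)
            with open(f"{self._root}/{name}", "wb") as f:
                f.write(data)
        def _check(self, name):
            if "/" in name or name.startswith("."):
                raise PermissionError("capability does not cover that object")

    def run_app(app, storage: GrainStorage):
        # There is no verb for "open an arbitrary path", so path-traversal
        # bugs in the app cannot be expressed, let alone exploited.
        app(storage)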

Work-in-progress, Invited presentations

Rick McGeer, "Declarative Verifiable SDI Specifications"

The point of Software-Defined Infrastructure is an infrastructure that is at once more flexible, controllable, and transparent to user and developer. One important characteristic of this infrastructure is that it is not owned or controlled by the user. At runtime, it is an opaque black box. Thus, it must have guaranteed properties of both performance and function. Infrastructure also has limited visibility and debuggability. It’s hard to diagnose network problems, and it’s hard to diagnose runtime issues on a remote system. Thus, programs which manipulate the infrastructure (e.g., orchestration systems, SDN applications, etc.) should have their infrastructure manipulations verified, to the extent that this is possible. We need to catch bugs statically to the extent that we can, performance and correctness both. Infrastructure configurations ought to be inherently verifiable. Verification of state-free systems is in NP; verification of finite-state systems is always in P-Space, and many problems are in NP. It has been shown by a number of authors that OpenFlow rulesets are state-free, and verification is therefore in NP. Similar arguments can be made for configuration layers and workflow engines, depending on precise semantics. These results imply that the underlying model of computation for configuration of software-defined networking and at least some elements of software-defined infrastructure are state-free or, at worst, finite-state, and therefore that verification of these systems is relatively tractable. The large challenge before the community is then to design configuration models for software-defined infrastructure that preserve the precise and weak semantics of the implementation domain; offer appropriate abstractions of performance characteristics; and nonetheless retain usability and concision.
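
As a worked micro-example of why state-free rulesets are tractable: a ruleset over a small header space is a pure function of the packet, so a safety property can be checked by searching for a witness header, an NP-style check. The bits, rules, and ports below are hypothetical:

    # Hypothetical first-match ruleset over a 4-bit header ('*' = wildcard).
    from itertools import product

    rules = [
        ("1*1*", "admin_port"),
        ("1***", "guest_vlan"),
        ("****", "drop"),
    ]

    def matches(pattern, header):
        return all(p in ("*", h) for p, h in zip(pattern, header))

    def forward(header):
        for pattern, action in rules:   # state-free: output depends only on header
            if matches(pattern, header):
                return action
        return "drop"

    # Safety property: no guest header (leading 1) may reach the admin port.
    witnesses = [h for h in map("".join, product("01", repeat=4))
                 if h[0] == "1" and forward(h) == "admin_port"]
    print("counterexamples:", witnesses)   # non-empty: the admin rule shadows guests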

Matthew Van Gundy, "Complex Paths and Derelict Sentinels: software engineering underpinnings of recent NTP vulnerabilities"

The Network Time Protocol and the NTP reference implementation are the de facto standard for synchronizing computer clocks on the Internet. In the past year, a number of vulnerabilities have been discovered which allow attackers to use NTP to arbitrarily control the clock of a remote system. In a number of cases, software engineering choices appear to have played a role in causing or obscuring the vulnerability. In this short presentation, we'll review several recent vulnerabilities and highlight the software engineering choices that may have played a role in their origin.

(Invited presentation)

Erik Bosman, "Some musings on the deduplication weird machine"

(Invited presentation)


Panel: Toward a LangSec Software Development Lifecycle

How do we make LangSec into a part of a security software development lifecycle?