NDSS 2018 Paper #141 Reviews and Comments
===========================================================================
Paper #141: Towards Scalable Cluster Auditing through Grammatical Inference
over Provenance Graphs


Review #141A
===========================================================================

Overall merit
-------------
2. Weak reject

Relative Ranking
----------------
2. Top 50% but not top 25% of reviewed papers

Reviewer expertise
------------------
3. Knowledgeable

Writing quality
---------------
3. Adequate

Paper summary
-------------
This paper describes Winnower, a system that processes provenance DAGs from
different machines in a cluster and detects anomalies in the DAG patterns.

Comments for author
-------------------
Winnower seems like a useful system that can find anomalous patterns in
provenance data when applications are replicated across a cluster, and it
runs with reasonable storage and computation overheads. While I find the
paper interesting at a high level, I do have several concerns.

To start, I am not fully convinced by the claim that existing systems do not
scale to the amount of data produced by audit systems. Provenance data is
just another type of data produced by a distributed system, and auditing is
just another application that processes that data; I therefore wonder
whether state-of-the-art big-data systems (such as Spark Streaming or graph
processing systems) would apply to this kind of data, too. In particular,
since the data considered in this paper forms a graph, popular graph
processing systems should be good candidates for ingesting and analyzing
provenance data. Did you consider this alternative before designing
Winnower? Would these systems meet the requirements considered in the paper,
or are they still not scalable enough? If the latter, I would also like to
see some preliminary numbers, which would make a much more convincing
argument for why a data reduction system is needed.

The key claim of the paper is that replicated applications should produce
similar provenance structures, so a deviation in provenance structure is
indicative of an attack. However, different replicas in the cluster handle
different requests, which may in turn trigger different execution paths and
result in different provenance structures. So I am not sure to what extent
the assumption holds. The evaluation does show that attacks show up as
anomalies; however, that section contains very little information about the
workloads used in the experiments. As far as I can see, the evaluation does
not use any popular benchmarks, but rather mostly uses artificial requests
as the 'normal workload' and a specially crafted attack as the 'attack
workload', so it is not very surprising that the attack workload is
considered anomalous. What if you tried Winnower on a large set of requests,
drawn from representative workloads, which can be very noisy in terms of
execution paths? Would the injected attack still show up as anomalous? Also,
what if the application is not replicated in the first place? Defending such
applications against attacks is important, too, but it is not clear whether
Winnower can provide such a defense.

I find the 'equivalence class' abstraction a bit confusing. How can Winnower
know which fields to remove and which to retain?
For instance, IP addresses are considered unimportant and are therefore
removed from the abstraction; the claim is that "While these details are
important when inspecting an individual node, they [are] not useful to an
administrator attempting to reason about a distributed system in
aggregate." But what if you have a DDoS attack, where the number of
distinct source IP addresses is a key indicator of the attack? As for
activity vertices, why is the program's MD5 hash not part of the label?
What if a program binary is substituted with a malicious version? At a high
level, I feel that the abstraction part requires more thought.

Coming back to the evaluation: is it really true that each workload
contains only 40 requests? If so, the scale of the experiment is extremely
small; and, as pointed out earlier, the experiment may not contain enough
noise/variation to truly evaluate how well the anomaly detection works.
Also, in the 'Computational cost' section, there is no number showing how
much overhead the provenance collection incurs compared to a baseline
system that does not rely on provenance at all. I would like to see what
the tradeoff is when applying a provenance-based approach to attack
detection (vs. offline log analysis, analyzing simple TCP counter
statistics, etc.).

Questions for authors’ response
-------------------------------
- How representative are the workloads used in the evaluation?
- How do you know which fields to remove/retain when building the
  abstraction?


Review #141B
===========================================================================

Overall merit
-------------
3. Weak accept

Relative Ranking
----------------
3. Top 25% but not top 10% of reviewed papers

Reviewer expertise
------------------
1. No familiarity

Writing quality
---------------
4. Well-written

Paper summary
-------------
Winnower addresses the problem of correlating logs across large numbers of
machines in a datacenter scenario. DFA learning is used to dramatically
compress the amount of data that needs to be stored and transmitted. A
prototype is implemented for Docker Swarm as an extension to the
pre-existing Linux auditd. Graph abstractions are built using equivalence
classes (e.g., workers have different IP addresses, but they all have some
IP address in certain locations), immutable classes (static strings that
are uninteresting), and removal classes (uninteresting data that can be
removed).

Comments for author
-------------------
# Pros

* Dramatic reduction in machine resource consumption to collect/correlate
  logs around some event.
* Possible reduction in the human time and effort required to make sense of
  a purported security event.
* Results look good for known attack types in the case study.

# Cons

* The attacker is in the TCB for this system. A sufficiently clever
  attacker may be able to forge or erase logs. (The paper states this
  explicitly, but I still perceive it to be a con.)
* In a large-scale system, workers are not identical. There is often a
  smearing of older and newer versions as features and bug fixes are
  canaried and then carefully rolled out. The set is finite and
  well-controlled, but the workers are not identical.

# Detailed comments

* This log collection and analysis problem exists for datacenter operators
  regardless of security intrusions. Software upgrades, hardware failures,
  configuration changes, hot weather, and many other things may affect
  functionality, and it is always good to know when there is a problem.
* My big issue with this system is that effectively detecting intrusions
  and cleaning up after them remains an unsolved problem. Thus, while
  Winnower may show me some interesting information, it is hard for me to
  gauge how much trust I would be willing to place in its output. From a
  research perspective, I think this is okay. However, it would be a very
  expensive experiment to conduct at scale; I think more evidence is needed
  that the results are useful.
* Quality software for use in distributed systems needs careful attention
  to the logs that it does generate. Such "bread crumbs" could be debugging
  lifelines just as easily as they could be a complete waste of storage,
  network, and compute resources. To the best of my knowledge, it remains
  mostly a black art to distinguish the two.


Response by Wajih Ul Hassan (Author) (500 words)
---------------------------------------------------------------------------
We thank all the reviewers for their valuable comments, and propose to
incorporate the following clarifications in our paper:

## Representative Workloads (Rev. A):
Our evaluation uses standard application workloads (apache-benchmark,
sysbench, FTPbench, Redis-benchmark), which are designed for
performance/load testing and include noise/variation to trigger different
execution paths (e.g., variable HTTP headers or SQL queries). We agree that
these workloads may not be representative of real-world request
heterogeneity. However, even for complex workloads, over time Winnower will
generate a high-confidence model of all valid execution paths, such that
anomalies will be visible due to their low confidence level. Note also that
these same benchmarks were used in the most relevant prior work
[51,52,53,54,70].

## Graph Abstraction (Rev. A):
Our technique targets non-deterministic and instance-specific fields such
as PIDs, timestamps, and hostnames; these fields are guaranteed to change
between worker nodes, even when the nodes are performing equivalent tasks.
Variance in these fields does not indicate a change in application
behaviour. Provenance graphs generated by auditd/SPADE have a total of 14
field/label types. We defined classification rules (i.e., equivalence,
immutable, removal) for each type based on the criterion above.
Essentially, equivalence-class fields facilitate a trade-off between the
fidelity and compression of the provenance model. As a consequence, we
agree that the model would not be helpful in explaining a DDoS attack.
However, unlike system intrusions, it is unlikely that audit logs would
need to be consulted to identify/diagnose a DDoS attack.

## Graph Processing (Rev. A):
Existing graph processing systems (e.g., Spark-GraphX) are complementary to
Winnower. These systems provide fast ingestion/querying, but do not provide
methods for deduplicating redundant audit data across nodes. Rather than
transmit raw audit streams to a central graph processing server (imposing
high network and storage overheads), we present techniques for distributed
graph processing and compression that allow each worker node to locally
eliminate redundancy prior to transmission. In a production environment, it
would be desirable to implement our graph abstraction/induction/parsing
algorithms using, e.g., Spark-GraphX's API. However, as our
proof-of-concept implementation already has low overhead (Figure 14), we
did not find any need to use a more sophisticated graph processing system.
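For concreteness, here is a minimal sketch of the kind of field
classification described in the Graph Abstraction note above. The rule
tables and field names below are hypothetical illustrations, not the
paper's actual rules or its 14 field/label types:

```python
# Minimal sketch of provenance-label abstraction (hypothetical fields/rules).
# Equivalence fields collapse to a class placeholder, immutable fields are
# kept verbatim, and removal fields are dropped entirely.

EQUIVALENCE = {"pid", "timestamp", "hostname", "remote_ip"}  # instance-specific
IMMUTABLE = {"exe_path", "syscall"}                          # behavior-defining
REMOVAL = {"audit_seq"}                                      # uninteresting

def abstract_label(label: dict) -> dict:
    """Map a concrete vertex/edge label to its abstracted form."""
    out = {}
    for field, value in label.items():
        if field in REMOVAL:
            continue                   # drop entirely
        elif field in EQUIVALENCE:
            out[field] = f"<{field}>"  # collapse to an equivalence class
        elif field in IMMUTABLE:
            out[field] = value         # keep verbatim
        else:
            out[field] = value         # default: retain, conservatively
    return out

# Two workers performing equivalent tasks now yield identical labels:
a = abstract_label({"exe_path": "/usr/sbin/httpd", "pid": 4242,
                    "remote_ip": "10.0.0.7", "syscall": "accept"})
b = abstract_label({"exe_path": "/usr/sbin/httpd", "pid": 977,
                    "remote_ip": "10.0.1.3", "syscall": "accept"})
assert a == b
```

Under rules of this shape, labels from workers performing equivalent tasks
collapse to the same abstract label, which is what enables cross-node
deduplication; the cost, as Review A notes, is that per-instance detail
such as distinct source IPs is lost.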
## MD5 Activity Labels (Rev. A):
Our techniques would work just as well if an MD5 hash were included in the
activity label. We chose not to include the hash because, if an attacker
changed the program binary, the change would already be visible in the
provenance graph of that machine. Further, the issue of mimicry attacks is
discussed in Section 3.D.

## Concurrent Requests (Rev. A):
By 40 concurrent requests, we mean the concurrency-level parameter, i.e.,
the number of requests sent simultaneously by the benchmark. The total
number of requests depends on how long the benchmark was run (the
"Duration" column in Tables 2 & 3). We will add a "total number of
requests" column to Tables 2 & 3 in the final version of the paper (e.g.,
ProFTPD had 28,564 total requests).

## Application Versions (Rev. B):
In the scenario where different versions of the same application run on
worker nodes, Winnower can maintain a separate provenance model for each
version to deal with any behavioural discrepancies.


Review #141C
===========================================================================

Overall merit
-------------
3. Weak accept

Relative Ranking
----------------
3. Top 25% but not top 10% of reviewed papers

Reviewer expertise
------------------
3. Knowledgeable

Writing quality
---------------
4. Well-written

Paper summary
-------------
This paper presents Winnower, a scalable provenance system designed for a
cluster environment. A provenance system aims to log every system event so
that the log can be analyzed, in real time or later, to detect and
reconstruct attacks for further investigation. To achieve this goal, a
comprehensive and detailed log is necessary, but such a log can bring
tremendous storage and network overhead, especially when an investigator
has to analyze system logs collected from tens or hundreds of machines,
e.g., a cluster environment. To solve the problem, Winnower models the
behaviors of individual cluster nodes using Deterministic Finite Automaton
(DFA) learning. Each cluster node sends a central analysis node its system
logs only when it observes anomalous behaviors that do not match its
trained model. This overall design is highly effective for a cluster
environment because, in such an environment, all nodes routinely carry out
similar operations, so their provenance models are almost the same.
Evaluation results show that, compared to the Linux audit system, Winnower
was able to reduce the storage/network overhead of cluster workloads by 98%
without information loss.

Comments for author
-------------------
* Strength
+ Winnower tackles an important problem for provenance systems spanning
  multiple machines: storage/network overhead. It solves this problem well
  within a particular environment, i.e., a cluster whose nodes almost
  always provide similar services.
+ Winnower is well motivated by a realistic system configuration: a Docker
  Swarm consisting of vulnerable containers.
+ Though they somewhat rely on experience and intuition, Winnower's
  induction rules for provenance graphs are well developed to avoid
  over-specification and over-generalization.

* Weakness
- Winnower's application is quite limited: mostly cluster environments.
  Many provenance systems are designed for enterprise computer systems, so
  Winnower's target environments differ considerably from theirs.
  More importantly, for a cluster environment, administrators can specify
  strict security policies (because cluster nodes usually do not need to
  run general applications) to prevent attacks, such that the necessity of
  a provenance system is questionable.
- Many rules and parameters are introduced without scientific reasoning or
  experiments. For example, why are the rules in Figs. 6 & 8 the best? What
  is the tradeoff between over-specification and over-generalization?
  Further, configuration parameters, such as an epoch size of 50 s and tau
  thresholds of 400, are simply introduced without any mathematical and/or
  experimental reasoning.
- Neither false positives nor false negatives are clearly explained. It
  would be highly interesting if the authors compared false positive and
  false negative rates while adjusting the tradeoff between
  over-specification and over-generalization. Currently, the paper only
  shows that Winnower was able to detect five hand-picked attack scenarios.

* Overall, I think that this paper is well written and attacks an important
  problem with a very interesting idea. However, I think that cluster
  auditing is too small a problem, and I am not sure whether a
  Winnower-like approach can be extended to cover more general
  environments, i.e., enterprise computer systems. Furthermore, I believe
  that, for a cluster environment, an administrator can define strict
  security policies, e.g., detailed SELinux rules, to prevent attacks,
  because each cluster node only needs to run specified applications. Do we
  need reactive security solutions, i.e., provenance systems, for such
  restricted environments?

* It seems that many decisions made to avoid over-specification and
  over-generalization were based on experience and intuition. It would be
  beneficial if the authors showed the tradeoff between over-specification
  and over-generalization by varying the graph induction rules. Also, it is
  unclear how the authors selected some important system parameters, e.g.,
  the epoch size and tau thresholds. Comparing results while varying such
  system parameters would be needed.

* Showing false positive and false negative rates is necessary. Currently,
  the paper only shows that Winnower detected five attack scenarios. How
  can we ensure that such results are representative? That is, can we say
  that the derived models do not suffer from overfitting? In addition, the
  example attack scenarios are quite "loud," e.g., spawning a root shell. I
  am not sure whether Winnower can handle stealthy attacks, e.g., fileless
  malware and DLL injection.

* I think that Winnower cannot deal with slow-and-steady attacks that try
  to gradually change a provenance model so that the final model treats the
  attack as normal behavior. To avoid this problem, Winnower may need
  retraining phases.

* In the evaluation, the paper omits the parsing computation cost because
  parsing is a linear-time operation. I am skeptical: linear-time
  operations can still take a long time, depending on constants and data
  size.

* Minor
- p5: filed -> field


Review #141D
===========================================================================

Overall merit
-------------
3. Weak accept

Relative Ranking
----------------
3. Top 25% but not top 10% of reviewed papers

Reviewer expertise
------------------
2. Some familiarity

Writing quality
---------------
3. Adequate

Paper summary
-------------
The paper presents Winnower, which leverages the provenance graphs common
across different nodes to perform network-wide auditing.
The core technique is DFA (deterministic finite automaton) learning [14].
Based on the heuristic that audited operations are similar across nodes,
each node can acquire provenance graph abstractions that are largely
shared. Monitoring functions are then performed over these multiple graphs:
each worker generates a behavior model, and the models are aggregated into
a global one that performs auditing and updates. The work deploys the
system in a cluster (one server with multiple VMs) and evaluates it against
standard application traffic, in terms of storage reduction (benefit) and
cost (computation and networking). Finally, it uses different attacks as
case studies to illustrate the provenance graphs generated.

Comments for author
-------------------
Strength:
+ Using graph abstractions, rather than the raw provenance data, is a
  promising and effective means of data compression without sacrificing
  much auditing capability.
+ Case studies on different attacks illustrate how the system works and
  validate its effectiveness.

However, some technical parts are not very clear. In particular:

(1) The graph abstraction seems to rely heavily on the assumption that
different audited nodes perform homogeneous operations and thus their
graphs can be easily aggregated. Is that true? In the work, the authors use
equivalence classes to represent the common parts. However, in the extreme
case where everything is equivalent, there seem to be no challenges to
address (and no benefits, either). To ensure equivalence, the authors
remove instance-specific details. My question is whether these removals
will discard the most important behaviors, which should be audited.

(2) Moreover, based on Fig. 6, the algorithm needs to create the graph
abstraction across different nodes in a recursive way. However, since each
node abstracts locally first, it is not clear whether the algorithm is
guaranteed to converge, or how fast it converges. Intuitively, it might
work well under the assumption that most provenance operations are common
or similar; it is unclear what happens if that assumption does not hold.

(3) Regarding the benefit of storage reduction, I feel that the current
evaluation is a little unfair. Compared simply with the raw data volume,
the graph clearly compresses the volume to be processed. However, for
storage, each node still needs to keep all the raw data to generate the
graph and perform the auditing, so I do not think the storage savings are
large. Instead, I feel that the benefit lies in reducing the network load:
as most computation is done locally and little at the aggregator (monitor),
there is no need to send a huge amount of raw data to the monitor node (as
in a centralized approach). However, along this line, have you considered
the effects of the recursive procedure? In the current evaluation, it seems
that only one transmission is needed. Does that imply that the graph
abstraction converges as soon as the monitor receives graphs from the
different nodes? Does that really hold in reality?
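For concreteness, the following is a minimal sketch of the style of DFA
learning and model aggregation that Reviews C and D describe: each worker
learns an automaton over its abstracted provenance event sequences, the
monitor merges the worker models into a global one, and sequences the
global model rejects are flagged as anomalous. All names and event strings
are hypothetical, and the sketch builds only a prefix-tree acceptor; real
grammatical-inference algorithms additionally merge states to generalize, a
step omitted here:

```python
# Minimal sketch (hypothetical): per-worker prefix-tree acceptors over
# abstracted provenance event sequences, merged at the monitor into a
# global model that rejects (flags) unseen behavior.

class Model:
    def __init__(self):
        self.transitions = {}  # (state, event) -> next state
        self.next_state = 1    # state 0 is the start state

    def learn(self, sequence):
        """Add one observed event sequence to the model."""
        state = 0
        for event in sequence:
            key = (state, event)
            if key not in self.transitions:
                self.transitions[key] = self.next_state
                self.next_state += 1
            state = self.transitions[key]

    def accepts(self, sequence):
        """True iff every transition in the sequence has been learned."""
        state = 0
        for event in sequence:
            state = self.transitions.get((state, event))
            if state is None:
                return False
        return True

def paths(model):
    """Enumerate the maximal event sequences encoded in a trie-shaped model."""
    children = {}
    for (src, event), dst in model.transitions.items():
        children.setdefault(src, []).append((event, dst))
    def walk(state, prefix):
        if state not in children:
            yield prefix
            return
        for event, nxt in children[state]:
            yield from walk(nxt, prefix + [event])
    yield from walk(0, [])

def merge(models):
    """Monitor-side aggregation: the union of all worker behaviors."""
    merged = Model()
    for model in models:
        for path in paths(model):
            merged.learn(path)
    return merged

# Workers learn normal behavior locally from abstracted audit streams...
w1, w2 = Model(), Model()
w1.learn(["accept", "read", "write", "close"])
w2.learn(["accept", "read", "write", "close"])

# ...and the monitor merges the models and flags deviations.
g = merge([w1, w2])
print(g.accepts(["accept", "read", "write", "close"]))  # True: normal
print(g.accepts(["accept", "exec:<shell>"]))            # False: anomalous
```

This union-style merge needs only a single transmission per worker; whether
the paper's recursive abstraction/induction behaves the same way is exactly
the convergence question Review D raises, and this sketch does not capture
that step.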