NDSS 2018 Paper #308 Reviews and Comments
===========================================================================
Paper #308 I'm Listening to your Location! Inferring User Location with Acoustic Side Channel


Review #308A
===========================================================================

Overall merit
-------------
2. Weak reject

Relative Ranking
----------------
3. Top 25% but not top 10% of reviewed papers

Reviewer expertise
------------------
1. No familiarity

Writing quality
---------------
3. Adequate

Paper summary
-------------
This paper presents an attack on location privacy that uses ENF signals leaking into audio streams to reveal the location of an end user. To build an ENF map, the authors used public streaming data, such as nature streams from around the world, to understand how ENF varies across different power grids and within the same power grid. The authors then conducted an experiment using raw audio streams, streams transmitted via Skype over a VPN, and audio streams over Tor. They found that they can predict which power grid (e.g., which country) an audio stream originated in and, within a power grid, likely locate the source to within a 140-mile radius.

Comments for author
-------------------
I found this paper quite interesting and approachable, even for someone without a strong background in signals. While I am not able to comment directly on the signal processing techniques used to build the map, the use of the data and the data sources seems novel. That being said, I find the presentation of the results a bit confusing at times, and the limited (and perhaps non-diverse?) number of audio streams is the biggest detractor from this paper. Given all that, the technique of using live video streams and the practical applications are big positives. Overall, though, I would like to see a clearer presentation of the source and test data as well as an increase in the number of experiments performed.
Below I outline some more specific comments:

Diversity of Live Video Streams Training: It would seem to me that the geographic diversity of the video streams would make a BIG difference in the accuracy of the map that underpins the attack. It would seem relevant to have that presented visually, as a map, so that we can get a sense of where these data points exist.

Impact of Training Data: The results are all based on the amount of test data, e.g., how many minutes of audio were sampled, but there are no experiments that consider situations with a lack of training data.

Diversity/Presentation of Test Data: For the audio streams being tested, it would be very helpful for the authors to display the sources used in testing visually, such as on a map. Additionally, it is not clear to me (maybe I missed it) whether the same audio stream was used in all scenarios, or how exactly the recording and transmission were carried out to ensure that ENF was captured properly in different locations. For example, did you have someone in each location play a fixed recording sample into a microphone to pick up the local ENF? This whole process needs some explanation.

Intra-Grid Estimates: I didn't follow how you got to the 140-mile radius value. Based on Figure 8, where you visualize the boundaries, these seem to cover wide areas and are concentric boundary lines. The estimation of accuracy here seems a bit tenuous. It would be great to visualize the attack. For example, show where the source was and what your model predicted.

Analysis of Negative Results: In Table III, many of the values across the different types have pretty much the same accuracy. This suggests to me that there is a subset of your data that is simply too hard to classify, or that has some other properties that make it non-conforming to your model. Either way, it would be good to understand why this isn't working, because it may provide insight into your process.

Questions for authors’ response
-------------------------------
Would it be possible to better visualize the training data, test data, and results, such as in maps? Can you better explain how the intra-grid estimations work and how you arrive at the accuracy number? What is driving the negative results? Why can't that small subset of samples be classified by your model?


Review #308B
===========================================================================

Overall merit
-------------
3. Weak accept

Relative Ranking
----------------
2. Top 50% but not top 25% of reviewed papers

Reviewer expertise
------------------
2. Some familiarity

Writing quality
---------------
2. Needs improvement

Paper summary
-------------
This work introduces a new technique to estimate world-wide location based on local distortion of the Electrical Network Frequency (ENF) pattern in audio data recorded on a device powered by an AC power grid. It uses several filtering and amplifying methods to extract an ENF trace from an audio stream and compares this trace to a pre-generated territorial ENF map. It provides a method to interpolate areas in this map that have no direct ENF base sample. It claims to be superior to previous solutions because it builds its database from publicly available audio feeds with known locations rather than using expensive equipment. It evaluates this technique on data from Europe and North and South America in the inter-grid case (with an accuracy of 90% for a 40-minute audio sample). For the intra-grid case, on data from the eastern power grid in the United States, it claims to achieve an accuracy of 76% for a 5-minute sample.

Comments for author
-------------------
Interesting approach. In particular, using publicly available audio sources to build the reference ENF map makes this approach scalable. It easily allows localization in real time. The evaluation itself could be more extensive.
Some parts, like the description of the decision boundary and its implications, were difficult to understand. The writing could be improved.

The abstract is misleading: specify that the video files need an audio channel, or that your approach only works for audio. It is not inherently clear how an anonymity network like Tor can distort audio data, and this statement therefore seems odd. You mention a 5-minute segment for the inter-grid estimation only in the abstract and nowhere else.

For the evaluation, the number of ENF anchor points seems to be very low. In the case of the intra-grid evaluation, I wasn't able to figure out the sample size of your test set, nor how you obtained the accuracy. Have you used cross-correlation? And how do you get an accuracy of 76.4% with 16 samples (17 works)? While the technique seems to work, the presented accuracy numbers are IMHO not reliable.

In the first section the authors claim to have constructed the "first interpolated global ENF map"; later on they mention that they only created the map for the eastern USA. This should be clarified in the paper.


Response by Author [Yoon Ji-Won] (488 words)
---------------------------------------------------------------------------
WRT R1’s comment “... better visualize the data…,” yes, we can improve visualization by plotting the training, testing, and result data on a map, as shown here: https://drive.google.com/open?id=0B9GAi874Y9EiTzlmSUtsLWY4XzA

WRT R1’s comment “... how the intra-grid estimations work and how you arrive at the accuracy number?” and R2’s comment “... description of the decision boundary …,” for clarity, we modified the definition of accuracy on page 7 to “The attack accuracy is defined as the probability that the actual location is included inside the k-th boundary.” We also changed “k/n decision boundary” to “the k-th boundary”.

WRT R1’s comment “...
you got to the 140 mile radius...” after calculating a similarity map for a given segment, the map is partitioned into $n = 17$ areas of equal size. Each area’s size is $S_i = S/n$, where $S = 341{,}754~\text{miles}^2$ (the total size of the US). We found that 76% accuracy is obtained if the test data is located within the 3rd boundary. Therefore, the size of the area within the 3rd boundary is $V = 3S_i = \frac{3}{17}S \approx 60{,}309~\text{miles}^2$. We calculated an approximate radius $R$ to represent the average distance in the area. Finally, since $V = \pi R^2$, we have $R = \sqrt{V/\pi} \approx 138.55$ miles.

WRT R1’s comment “... the negative results?...” an interesting observation is that the improvement in accuracy was not linear and had step-like changes, as shown in Table 3. We surmise that the tested ENF signals fall into three groups depending on their geographic origins. 76.4% of samples (group 1) were clearly identified even with a short segment because their origins were located very far from each other. About 6% of samples (group 2) were uniquely distinguished when the segment length was between 30 and 35 minutes, because their uniqueness emerged only with a longer segment. We failed to uniquely identify the remaining samples (group 3) even with a longer 40-minute segment. Much longer ENF signals would be needed to distinguish those samples.

WRT R1’s comment “... a lack of training data” and R2’s comment “the amount of ENF anchor points ...” we note that the accuracy of an inferred area can be improved by collecting more data.

WRT R1’s comment “... have someone in those locations play a fixed recording sample…,” we will analyze the case of ENFs recorded at two different locations with identical music or voice.

WRT R2’s comment “... TOR can distort audio data....,” data loss in Torfone occurs through Torfone’s codec, not through the Tor network.

WRT R2’s comment “...
in the abstract about a 5 minutes segment ...” the abstract is a bit confusing: the 5-minute segment actually refers to our intra-grid estimation results. We will clarify this.

WRT R2’s comment, “... the authors claim to have constructed...” we actually built a global ENF map, as shown in Figure 5. However, the intra-grid estimation was applied only to the eastern region of the US because we only had access to ground truth data for that region. We will clarify this.


Review #308C
===========================================================================

Overall merit
-------------
2. Weak reject

Relative Ranking
----------------
2. Top 50% but not top 25% of reviewed papers

Reviewer expertise
------------------
3. Knowledgeable

Writing quality
---------------
3. Adequate

Paper summary
-------------
In this paper, the authors propose a new side channel attack that uses an acoustic side channel to identify users’ locations from multimedia files (audio clips, videos, etc.). The paper introduces a cost-effective and reliable reference map for identifying the electric network frequency (ENF) embedded in multimedia files. Using this map, the physical location where a file was recorded can be inferred without installing any malicious software on the victim’s device. The paper also provides an evaluation of the proposed technique using online multimedia files from different popular websites (e.g., YouTube). Finally, the later part of the paper discusses the effectiveness of the attack over anonymity networks like Tor.

Comments for author
-------------------
Positives:
1. The presentation of a reference map of ENF which is cost-effective and reliable.
2. Introduction of a novel side channel attack to capture the recording location of multimedia files, with an accuracy rate of 76% for videos over 5 minutes and over 90% for videos over 40 minutes.
3.
Performing the attack over an anonymity network like Tor and also against a proxy server to show the effectiveness of the attack.

Some comments for improvement:
a. The paper claims in the Abstract and Introduction that there is no need to install any malicious software on the victim’s device. However, in Section 4, the paper uses a signal capturing application (service application) on the victim’s device to capture ENF signals. Although the application used for capturing ENF does not require any explicit location permission, it weakens the claim of “no need to install any application in victim’s device”.
b. In the threat model, the installed application that captures the ENF signal needs to connect to the attacker’s device via the Internet to send the captured ENF signal. This raises questions about the effectiveness of the attack, as different security mechanisms can be installed on the victim’s device to identify unauthorized network connections.
c. The paper reports 76% accuracy for multimedia files over 5 minutes. No evaluation is given for files shorter than 5 minutes. It is hard to see how the attack behaves across all file lengths.
d. The attack can be performed only if the user records with an AC-powered audio device, which is a big limitation as most new recording devices (smart phones, digital cameras, etc.) are DC powered. This may limit the applicability of the attack.
e. Online videos can be recorded and merged in different locations, so different segments of a video can carry different ENF depending on the location. The paper does not discuss this as a mitigation technique, nor how location inference could be performed on such a file.
f. Also, it would have been better to connect the results with the actual location data of the collected data, which is not clear from the paper.


Review #308D
===========================================================================

Overall merit
-------------
3. Weak accept

Relative Ranking
----------------
3.
Top 25% but not top 10% of reviewed papers

Reviewer expertise
------------------
2. Some familiarity

Writing quality
---------------
2. Needs improvement

Paper summary
-------------
The paper studies a particular type of side-channel attack to identify the location of users when they use multimedia streaming services like VoIP. The core technique is leveraging ENF signals. Such signals differ slightly across electricity grids and are modulated into the audio/video signals streamed by users. The authors first show how to fingerprint various grids based on the unique patterns of the ENF signals in various regions. Then, the authors show how such fingerprints can be used to identify users' locations using various signal processing and machine learning techniques. The authors evaluate the performance of their side channel attack through experiments on Tor and Skype.

Comments for author
-------------------
The paper studies an interesting type of acoustic side channel attack. The attack is possible because, if a microphone is connected to the power grid, the generated audio signal will be modulated with ENF signals that are unique to each grid.

The authors overcome interesting challenges to make the side channel attack practical. First, for the attack to be useful, the attackers need to have fingerprinted the ENF behaviour of many locations. A naive, impractical approach is to install ENF measurement equipment at various locations, but the paper suggests using public video streams that contain location tags to extract ENF signals and map them to different locations. The other technical challenge overcome by the authors is compensating for the various kinds of noise that the modulated audio signals undergo, such as VoIP coding and network jitter. The authors use various signal processing and learning techniques to compensate for such noise and offer a reasonable attack accuracy.

The attack still faces various challenges in practice. One is that ENF patterns change over time, and therefore the adversary will need to keep collecting ENF signals from geographically tagged video streams continuously. Also, based on the evaluations, the classifier needs to collect a long interval of audio from the target user, e.g., 40 minutes, in order to identify the location with high accuracy. Most typical VoIP or user-generated streams will likely be much shorter than this. Also, the attack works only if the victim uses a mains-powered audio recording device. I still find the paper interesting despite these limitations.

While the paper cites and introduces some of the past work on the topic, the distinction from such work is not quite clear. In particular, papers like 17, 23 and 25 also aim at extracting and mapping ENF signals from multimedia content. What is this paper's contribution beyond those works? Is it that it offers better accuracy? Or perhaps the possibility of detection over noisy network channels like Skype and Tor? Also, some of the signal processing techniques used by this paper seem to be similar to those used by previous systems.

Other comments:
- The evaluations do not show the impact of time on the ENF models. How dynamic are these models over time?
- The idea of interpolating ENF signals from neighboring ENF signals is quite interesting.
- The evaluations do not analyze the impact of various network conditions (like jitter, fraction of dropped packets, etc.) on the accuracy of the classifiers.
- I appreciate that the authors included a discussion of possible mitigation techniques.
It would be better if the authors had evaluated the impact of at least some of these countermeasures on their attack.
- The writing needs a lot of improvement.
- "Skype, which is one of the most widely used VoIP applications, recently updated its default application settings to use a proxy server to hide users’ IP addresses": I don't think Skype made this change to protect user privacy, but rather to have more control over the connections (though it also hides IP addresses).
- "without losing generosity."?

Questions for authors’ response
-------------------------------
The distinction from the cited previous works is not quite clear. In particular, papers like 17, 23 and 25 also aim at extracting and mapping ENF signals from multimedia content. What is this paper's contribution beyond those works? Is it that it offers better accuracy? Or perhaps the possibility of detection over noisy network channels like Skype and Tor? Also, some of the signal processing techniques used by this paper seem to be similar to those used by previous systems.
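

Arithmetic check
----------------
The figures quoted in the author response (the roughly 140-mile radius and the 76.4% accuracy) can be re-derived with a short sketch. This is only a consistency check on the numbers stated above, not the authors' code. The function names are hypothetical, and the second check assumes that accuracy means correctly located samples divided by total test samples, which the response does not state explicitly.

```python
import math

# Values quoted in the author response.
TOTAL_AREA_SQ_MI = 341_754  # S, the total map area as stated by the authors
NUM_AREAS = 17              # n, number of equal-size areas
BOUNDARY_INDEX = 3          # k, the boundary at which 76% accuracy is reported


def equivalent_radius(total_area, num_areas, k):
    """Area inside the k-th boundary and the radius of a circle of equal area."""
    covered = k * total_area / num_areas       # V = k * S / n
    return covered, math.sqrt(covered / math.pi)  # R from V = pi * R^2


def matching_hit_counts(num_samples, reported_pct, tol=0.1):
    """Hit counts h for which 100 * h / num_samples lies within tol of reported_pct.

    Assumption (not confirmed by the source): accuracy = hits / samples.
    """
    return [h for h in range(num_samples + 1)
            if abs(100 * h / num_samples - reported_pct) < tol]


covered_area, radius = equivalent_radius(TOTAL_AREA_SQ_MI, NUM_AREAS, BOUNDARY_INDEX)
print(f"area inside boundary {BOUNDARY_INDEX}: {covered_area:,.0f} sq mi")
print(f"equivalent radius: {radius:.2f} mi")

for n in (16, 17):
    print(f"{n} samples, hit counts matching 76.4%: {matching_hit_counts(n, 76.4)}")
```

Under that assumption, 13 hits out of 17 samples gives about 76.47%, while no hit count out of 16 samples comes within 0.1 percentage points of 76.4%, which matches Reviewer B's remark that "17 works".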