Detecting Android Malware Variants using Opcodes

Transcript

Univeristá degli studi del Sannio
DIPARTIMENTO DI INGEGNERIA
Corso di Laurea Magistrale in
Ingegneria Informatica
Master Thesis
in
Sicurezza delle reti e dei sistemi software
Detecting Android Malware Variants
using Opcodes Frequency Distribution
and Call Graphs Isomorphism Analysis
Supervisor:
Author :
Prof. Corrado Aaron
Visaggio
Antonio Pirozzi
mat. 399/37
Co-Supervisor:
Ing. Francesco Mercaldo
Academic Year 2013/2014
Dedicated to my Family. Dedicated to my parents Angelo and
Orsola, for all that i received from them, for their daily sacrifices,
for the entire life.
Dedicated to my aunt Loredana, always present in my life, a
guardian angel for the whole family, thanks for your support.
Dedicated to my uncle Vincenzo, my real older brother, undisputed
Master of Jazz.
Dedicated to Luciana, a special aunt. Dedicated to Joshua, the joy
of the family. Dedicated to my Grandparents, Antonio, Maddalena,
Italo, Maria unfortunately they are not there more, they would be
very proud of me.
Dedicated to my Grandmother Maria, my other mom, I will take
her forever in my heart, i know that you are always beside me.
Dedicated to my love Alessia, the best girl that I could never find,
thanks for making my life better.
Dedicated to Steve Jobs and Nikola Tesla, the greatest
revolutionaries, they made this world a better place.
i
Acknowledgements
First and foremost a Special Thanks to Prof. Corrado Aaron
Visaggio for his help and for the trust he has always placed in me.
Thanks also to Ing. Francesco Mercaldo for its continued
motivation and collaboration.
A special thanks to a special person, Tito. A special thanks to my
best friend Marco, he is always there.
A special thanks to my friend Angelo, we started this adventure
together, all those days passed in “Labis”, believing in our ideas . . .
A special thanks to Nicola, he gave me a place in his company, he
has always believed in me from day one, taught me day by day the
qualities that must have a good engineer.
A special thanks also to my colleagues Sonia Laudanna, Giovanni
Izzo, Fabio Giovine, Angelo Ciampa, Piero Uniti, special people
with whom i shared this long University journey.
A special thanks to all my University Professors, each of which has
been able to give me a great teaching for life.
ii
“Here’s to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs
in the square holes. The ones who see things differently. They’re not fond of rules.
And they have no respect for the status quo. You can praise them, disagree with them,
quote them, disbelieve them, glorify or vilify them. About the only thing you can’t do is
ignore them. Because they change things. They invent. They imagine. They heal. They
explore. They create. They inspire. They push the human race forward. Maybe they
have to be crazy. . . While some see them as the crazy ones, we see genius. Because the
people who are crazy enough to think they can change the world, are the ones who do.”
Steve Jobs
Abstract
Android platform starts became the universal front-end in the IoE and IoT, mobile
attacks will continue to grow rapidly as new technologies expand the attack surface.
Vendors, manufacturer, providers should extend vulnerability shielding and exploitprevention technologies and Anti-Malware vendors have to enhance their actual solutions, because new type of malware have a fileless payload that only runs in memory
and to circumvent detection as well it adopt more complex obfuscation techniques.
The idea behind this work, arises from the awareness that a more effective and holistic
anti malware approach have to first outline the phylogenesis, understand its evolution
and sophistication, their belonging semantics. This methodology move toward this direction, implementing a clone-detection heuristic for outline common payload components
in order to identify malware variants. Our Heuristic, is a contribution in the Malware
Analysis phase, not in the Detection phase, to well-understand Android Malware and
their evolution, to trace back a possible Malware descent.
To achieve these goals, we start from the analysis of the Opcodes Frequency Distribution, obtaining by similarities, the 10 nearest vectors from the Data-set (build from the
Android Drebin Project), then, an n-grams heuristic on the Adjacency Lists, detect
isomorphism features in the Call Graphs to identify payloads components as common
sub-graphs. Then we a re able to outline a possible genome for each malware family and
are able to define a possible descent for each malware variant, also multiple-descents,
proving the effectiveness of this methodology.
This work aims to lay the foundation of a new types of methodologies based on the
study of the payload philogenesy.
Contents
Acknowledgements
ii
Abstract
iv
Table of Contents
v
List of Figures
viii
List of Tables
xi
List of Listings
xiii
Abbreviations
xv
1 Introduction
1.1 Motivation and Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Android: A Security Overview
2.1 The Android Security Model . . . . . . . . . . . . . . . . . . . . . .
2.1.1 Android Security Program . . . . . . . . . . . . . . . . . . . .
2.1.2 Android Platform Security Architecture . . . . . . . . . . . .
2.1.2.1 Kernel-level Security . . . . . . . . . . . . . . . . . .
2.1.2.2 The Application Sandbox . . . . . . . . . . . . . . .
2.1.2.3 The System Partitions . . . . . . . . . . . . . . . . .
2.1.2.4 The Secure BootLoader and QFuses . . . . . . . . .
2.1.2.5 Cryptography . . . . . . . . . . . . . . . . . . . . .
2.1.2.6 Security-Enhanced Android and Samsung Knox . .
2.1.2.7 Secure IPC : INTENTS and BINDER . . . . . . . .
2.1.3 Android Application Security . . . . . . . . . . . . . . . . . .
2.1.3.1 The Android Permission Model and Protected APIs
2.1.3.2 Application Signing Permission Enforcement . . . .
2.2 Malicious Android apps . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 An Overview: Botnet, Data collectors and Madware . . . . .
2.2.2 Android Malware Characterization . . . . . . . . . . . . . . .
v
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
1
2
3
3
4
5
6
6
8
9
13
13
18
34
34
36
47
47
53
Contents
2.2.3
vi
2.2.2.1 Malware Installation
2.2.2.2 Activation . . . . .
2.2.2.3 Malicious Payload .
2.2.2.4 Permission Used . .
Evolutions and Challenges . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3 Malware Detection Methodologies
3.1 Malware Detection Techniques . . . . . . . . .
3.1.1 Signature-based Detection Techniques .
3.1.2 Anomaly-based Detection Techniques .
3.1.3 Application Permission Analysis . . . .
3.1.4 Cloud-based Detection Analysis . . . . .
3.2 Malware trasformation . . . . . . . . . . . . . .
3.3 Evaluating Android anti-malware Products . .
3.4 Malware Analysis using Call Graphs . . . . . .
3.4.1 Similarity Detection using Call Graphs .
3.4.2 Software Clones Taxonomy . . . . . . .
3.4.3 The Isomorphism Problem . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
55
58
58
60
65
.
.
.
.
.
.
.
.
.
.
.
67
68
69
72
73
78
80
85
91
92
94
101
4 Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis
102
4.1 The Data-Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.1.1 The Malicious Data-Set, Malware Variants . . . . . . . . . . . . . 104
4.1.1.1 Malware families . . . . . . . . . . . . . . . . . . . . . . . 104
4.1.2 The Trusted Data-set . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2 The multi-staged Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 The Relative Opcodes Frequency Distribution . . . . . . . . . . . . . . . . 112
4.4 Isomorphism Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.4.1 The Control Flow Graph . . . . . . . . . . . . . . . . . . . . . . . 142
4.4.2 The Function Call Graph . . . . . . . . . . . . . . . . . . . . . . . 143
4.4.3 Vector-Space Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.4.4 N-grams Analysis stage . . . . . . . . . . . . . . . . . . . . . . . . 156
4.4.4.1 N-grams Analysis for each of the 10 returned vectors . . 158
4.4.5 The Method-level similarity analysis stage. Type I and II clones
detection at method-level . . . . . . . . . . . . . . . . . . . . . . . 164
4.5 The Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.6 Real System implementation . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.6.0.1 DescentDroid.pm . . . . . . . . . . . . . . . . . . . . . . . 170
5 Experimental Phase
5.1 Focus of the Experiments . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Focus I: Performance evaluation : A MultiClass Classification Problem .
5.2.1 Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.2 Evaluations Detail . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.2.1 Classification Accuracy, Precision&Recall for each Malware Class. . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.3 Conclusions and Considerations . . . . . . . . . . . . . . . . . . .
5.3 Focus II: Detection Evaluation of Android Malware . . . . . . . . . . . .
173
. 173
. 173
. 174
. 175
. 178
. 187
. 188
Contents
5.3.1
5.3.2
5.3.3
vii
Characterization . . . . . . . . . . . . .
Evaluations Detail . . . . . . . . . . . .
5.3.2.1 I classifier: 0.99800 CosSim .
5.3.2.2 II classifier: 0.99500 CosSim .
5.3.2.3 III classifier: 0.99000 CosSim
Conclusions and Considerations . . . . .
6 Applications and future works
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
188
189
189
189
190
191
195
A Partial code of Descent tool. Opcodes Frequency Distribution Computation
197
Bibliography
214
List of Figures
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
2.19
2.20
2.21
2.22
2.23
2.24
2.25
2.26
2.27
2.28
2.29
2.30
2.31
2.32
2.33
Android Software Stack [1] . . . . . . . . . . . . . . . . . . . . . . . . .
Android Sandbox Mechanism [2] [3] . . . . . . . . . . . . . . . . . . . .
Applicazions sharing the same UID [2] [3] . . . . . . . . . . . . . . . . .
Motorola OMAP Secure Boot Chain [4] . . . . . . . . . . . . . . . . . .
Samsung Knox efuse warranty bit . . . . . . . . . . . . . . . . . . . . . .
HTC S-OFF NAND . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
SEAndroid partialc coverage [5] . . . . . . . . . . . . . . . . . . . . . . .
Samsung KNOX System Security Overview [6] . . . . . . . . . . . . . .
Samsung KNOX Secure Boot [7] . . . . . . . . . . . . . . . . . . . . . .
Samsung KNOX Application Container [6] . . . . . . . . . . . . . . . . .
Samsung KNOX Support CAC [6] . . . . . . . . . . . . . . . . . . . . .
Samsung KNOX Client Certificate Management (CCM) [7] . . . . . . .
Android simple form of IPC [8] . . . . . . . . . . . . . . . . . . . . . . .
Android Process IPC [9] . . . . . . . . . . . . . . . . . . . . . . . . . . .
Binder Communication [10] . . . . . . . . . . . . . . . . . . . . . . . . .
Binder Framework [10] . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Binder Permission Enforcement [10] . . . . . . . . . . . . . . . . . . . .
Binder Token Object Reference [10] . . . . . . . . . . . . . . . . . . . .
Display of permissions for applications [11] . . . . . . . . . . . . . . . .
Securing Activities with Custom permission [12] . . . . . . . . . . . . .
Securing Services with Custom permission [12] . . . . . . . . . . . . . .
Securing BroadcastReceiver with Custom permission [12] . . . . . . . .
Securing ContentProvider with Custom permission [12] . . . . . . . . . .
GMail AndroidManifest prior to 2.3.5 without signature-level enforcing .
GMail AndroidManifest after to 2.3.5 with signature-level enforcing . . .
Securing with URI permissions [12] . . . . . . . . . . . . . . . . . . . . .
Apps collect your Information [13] . . . . . . . . . . . . . . . . . . . . .
Apps collect your Information [13] . . . . . . . . . . . . . . . . . . . . .
Which apps are abusing [13] . . . . . . . . . . . . . . . . . . . . . . . . .
AdLibraries’ privacy scores [13] . . . . . . . . . . . . . . . . . . . . . . .
AdLibraries brought to you by malware [13] . . . . . . . . . . . . . . . .
An Update Attack from BaseBridge [14] . . . . . . . . . . . . . . . . . .
Comparison of the top 20 permission requested by malicious and bening
Android apps [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.34 An Overview of existing Android Malware (PART I: INSTALLATION
AND ACTIVATION) 1 of 2 [14] . . . . . . . . . . . . . . . . . . . . . .
2.35 An Overview of existing Android Malware (PART I: INSTALLATION
AND ACTIVATION) 2 of 2 [14] . . . . . . . . . . . . . . . . . . . . . .
viii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5
7
7
10
12
12
14
15
15
16
17
18
19
27
28
29
31
32
38
40
41
42
43
45
45
46
47
48
49
50
51
56
. 60
. 61
. 62
List of Figures
ix
2.36 An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS) 1 of 2 [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.37 An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS) 2 of 2 [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13
3.14
3.15
3.16
3.17
3.18
3.19
3.20
3.21
3.22
3.23
3.24
3.25
4.1
4.2
4.3
4.4
Malware detection Technologies Taxonomy [15] . . . . . . . . . . . . . . .
Dynamic Signature Extraction [15] . . . . . . . . . . . . . . . . . . . . . .
Top 20 requested permissions which has the most different requested rate
in different dataset. The ordinate is the difference between the requested
rate in malware dataset and the requested rate in benign dataset [16] . . .
Difference in the frequencies of 18 selected permission in malware and
benign .apk files [17] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
COMPARATIVE ANALYSIS OF BI-NORMAL SEPARATION AND MUTUAL INFORMATION FEATURE SELECTION METHOD [17] . . . .
TOP 5 PERMISSION COMBINATIONSWHEN K = 5 [18] . . . . . . . .
TOP 5 PERMISSION COMBINATIONSWHEN K = 6 [18] . . . . . . . .
Cloud-based malware protection techniques: (a) Paranoid Android and
(b) Crowdroid [19] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Before ProGuard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
After ProGuard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Before ProGuard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
String Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Before ProGuard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Code Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
An example of a junk code fragment [20] . . . . . . . . . . . . . . . . . . .
Av-test carried out in the 2014: Detection rates in the endurance test [21]
ANTI-MALWARE PRODUCTS EVALUATED [20] . . . . . . . . . . . .
MALWARE SAMPLES USED FOR TESTING ANTI-MALWARE TOOLS
[20] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Trasformation Keys [20] . . . . . . . . . . . . . . . . . . . . . . . . . . . .
DROIDDREAM TRANSFORMATIONS AND ANTI-MALWARE FAILURE [20] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
FAKEPLAYER TRANSFORMATIONS AND ANTI-MALWARE FAILURE [20] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
EVALUATION SUMMARY [20] . . . . . . . . . . . . . . . . . . . . . . .
“Android.Trojan.FakeInst.AS” from the FakeInstaller Malware Family [22]
Complete function call graph of “Android:RuFraud-C” from the malware
family FakeInstaller. Dark shading of nodes indicate malicious structures
identified by the SVM [22] . . . . . . . . . . . . . . . . . . . . . . . . . . .
Two corresponding methods in two app clones are from different markets.
The first method has one more function call to initialize several ads [23] .
69
71
74
75
75
76
77
79
81
81
82
82
82
82
83
86
87
87
87
88
88
89
93
93
94
Android Drebin logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
The Whole System Multi-staged Architecture . . . . . . . . . . . . . . . . 111
66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa sample, Opcode frequency Distribution . . . . . . . . . . . . . . . . . . . . . . 116
Opcodes Distribution extraction and Computation . . . . . . . . . . . . . 117
List of Figures
4.5
4.23
4.24
4.25
4.26
4.27
4.28
4.29
4.30
4.31
Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs
Relative Opcodes Frequency Distribution of the 1st Vector (red) . . . .
Relative Opcodes Frequency Distribution of the 2nd Vector (red) . . . .
Relative Opcodes Frequency Distribution of the 3rd Vector (red) . . . .
Relative Opcodes Frequency Distribution of the 4th Vector (red) . . . .
Relative Opcodes Frequency Distribution of the 10th Vector (red) . . .
Gappusin family, Opcodes classes Frequency Distribution . . . . . . . .
Android Dragon Quest game CFG in Gephi . . . . . . . . . . . . . . . .
Android Dragon Quest game FCG in Gephi . . . . . . . . . . . . . . . .
Structural data mapped in the Vector-Space . . . . . . . . . . . . . . . .
Adjacency Lists preparation and similarities search . . . . . . . . . . . .
Architecture. N-gram analysis . . . . . . . . . . . . . . . . . . . . . . . .
Common Subgraphs with the vector n. 9 Detected by n-grams analysis
stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Common Subgraphs with the vector n. 10 Detected by n-grams analysis
stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Architecture. Method-level similarity analysis stage . . . . . . . . . . . .
method.pl execution phase . . . . . . . . . . . . . . . . . . . . . . . . . .
AdjList.pl -freq execution phase . . . . . . . . . . . . . . . . . . . . . . .
AdjList.pl -adjFCG execution phase . . . . . . . . . . . . . . . . . . . .
AdjList.pl -adjFCG execution phase . . . . . . . . . . . . . . . . . . . .
Execution phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
DescentDroid Execution phase . . . . . . . . . . . . . . . . . . . . . . .
DescentDroid Execution phase . . . . . . . . . . . . . . . . . . . . . . .
DescentDroid Execution phase, with debug option . . . . . . . . . . . .
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
979 malware variants proper Classified from 1000 Malware samples . . .
Multiclass Classification, Precision in function to the size of the dataset
Multiclass Classification, Precision vs the size of the dataset . . . . . . .
Multiclass Classification Recall in function to the size of the dataset . .
Multiclass Classification, Recall vs the size of the dataset . . . . . . . .
3 classifiers on the ROC space . . . . . . . . . . . . . . . . . . . . . . . .
3 classifiers - Performance Comparison . . . . . . . . . . . . . . . . . . .
3 classifiers - FP/FN Comparison . . . . . . . . . . . . . . . . . . . . . .
4.6
4.7
4.8
4.9
4.10
4.11
4.12
4.13
4.14
4.15
4.16
4.17
4.18
4.19
4.20
4.21
4.22
x
. 120
. 122
. 124
. 126
. 129
. 131
. 133
. 135
. 137
.
.
.
.
.
.
.
139
141
142
144
145
147
156
. 162
.
.
.
.
.
.
.
.
.
.
163
164
167
168
169
169
170
171
171
172
.
.
.
.
.
.
.
.
175
183
184
185
186
193
193
194
List of Tables
2.1
2.2
Cryptocurrencies Mining Difficult Rate . . . . . . . . . . . . . . . . . . . . 52
Android Malware Families [14] . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1
4.2
4.3
Malware Drebin Project Families . . . . . . . . . . . . . . . . . . . . . . . 105
Dalvik Bytecode set Categories . . . . . . . . . . . . . . . . . . . . . . . . 113
66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa Relative Opcodes Frequency Distribution Vector . . . . . . . . . . . . . . . . 116
10 most similar vector by the Cosine Similarity . . . . . . . . . . . . . . . 118
Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs
Relative Opcodes Frequency Distribution of the 1st Vector (right) . . . . 119
Relative Opcodes Frequency Distribution of the 2nd Vector (right) . . . . 121
Relative Opcodes Frequency Distribution of the 3rd Vector (right) . . . . 123
Relative Opcodes Frequency Distribution of the 4th Vector (right) . . . . 125
Candidate apk CFG Adjacency List extracted and sorted . . . . . . . . . 148
Candidate apk FCG Adjacency List extracted . . . . . . . . . . . . . . . 151
Candidate apk FCG Adjacency List extracted . . . . . . . . . . . . . . . 152
n-grams on the Adjacency Lists of the 1st returned vector . . . . . . . . . 158
n-grams on the Adjacency Lists of the 2nd returned vector . . . . . . . . 158
n-grams on the Adjacency Lists of the 3rd returned vector . . . . . . . . . 158
n-grams on the Adjacency Lists of the 4th returned vector . . . . . . . . . 158
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11
4.12
4.13
4.14
4.15
4.16
4.17
4.18
4.19
4.20
4.21
4.22
4.23
4.24
xi
List of Tables
xii
4.25
4.26
4.27
4.28
4.29
4.30
n-grams on the Adjacency Lists of the 8th returned vector .
n-grams on the Adjacency Lists of the 9th returned vector .
n-grams on the Adjacency Lists of the 10th returned vector
Common Subgraph with the n. 9 vector . . . . . . . . . . .
Common Subgraph with the n. 10 vector . . . . . . . . . .
The Classifier Analysis stage 1 . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
159
160
160
162
163
165
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12
5.13
5.14
5.15
Results detail on 1000 malware samples, Synoptic table . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Precision and Recall of the MultiClass Classification Process . .
Different Classifiers of Opcodes Frequency Distribution Similarity
Confusion Matrix for 0.99800 . . . . . . . . . . . . . . . . . . . .
Performance measures for 0.99800 CosSim . . . . . . . . . . . .
Performance measures for 0.99800 CosSim . . . . . . . . . . . .
Performance measures for 0.99800 . . . . . . . . . . . . . . . . .
Accuracy of the three Classifiers . . . . . . . . . . . . . . . . . .
Recall of the three Classifiers . . . . . . . . . . . . . . . . . . . .
Precision of the three Classifiers . . . . . . . . . . . . . . . . . . .
F-Score of the three Classifiers . . . . . . . . . . . . . . . . . . .
FP/FN ratio of the three Classifiers . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
176
177
180
188
189
189
189
190
190
190
191
191
191
192
192
Listings
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
3.1
3.2
3.3
3.4
3.5
4.1
4.2
Intent-filter declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Explicit Intent declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Implicit Intent declaration [24] . . . . . . . . . . . . . . . . . . . . . . . . 23
BroadcastReceiver registration for incoming SMS . . . . . . . . . . . . . . 24
Intent Filter with priority . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Exported CellValidateService component [25] . . . . . . . . . . . . . . . . 25
CellValidateService Class [25] . . . . . . . . . . . . . . . . . . . . . . . . . 25
Exploit [25] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
IBinder.transact() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
IBinder.onTransact() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Client application acquiring wavelock on PowerManager service [26] . . . 32
PowerManager source code [26] . . . . . . . . . . . . . . . . . . . . . . . . 33
WhatsApp.SF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
RECEIVE SMS permission . . . . . . . . . . . . . . . . . . . . . . . . . . 37
ACCESS FINE LOCATION permission . . . . . . . . . . . . . . . . . . . 37
ACCOUNT MANAGER permission . . . . . . . . . . . . . . . . . . . . . 38
Securing Android components . . . . . . . . . . . . . . . . . . . . . . . . . 44
URI permissions example . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Kirin policy example [27] . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Type I Clone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Type II Clone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Type III Clone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Type IV Clone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Node properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde sample Adjacency List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
A.1 Partial code of Descent tool. Opcodes Frequency Distribution Computation198
xiii
List of Tables
A.12 Partial code of Descent tool.
xiv
Opcodes Frequency Distribution Computation209
Abbreviations
AIDL
Android Interface Definition Language
BYOD
Bring Your Own Device
CA
Certification Authority
DoD
Department of Defense
FIPS
Federal Information Processing Standard
FOTA
Firmware Over TheAir
FSA
Finite State Automaton
IoT
Internet of Things
IoE
Internet of Everything
MAC
Mandatory Access Control
MDM
Mobile Device Management
NSA
National Security Agency
ODE
On-device Data Encryption
OTA
Over TheAir
PDA
Push Down Automaton
TAN
Transaction Authentication Number
xv
Chapter 1
Introduction
1.1
Motivation and Problem
Every year, malware previsions, are ever more adverse. Mobile attacks will continue to
grow rapidly as new technologies expand the attack surface. New vulnerabilities did not
only reside on devices but also on platforms and apps. Installing malicious apps and
visiting malicious websites will no longer be the sole mobile infection vectors. Vendors,
manufacturer, providers should extend vulnerability shielding and exploit-prevention
technologies and Anti-Malware vendors have to enhance their actual solution because
new type of malware have a fileless payload that only runs in memory and to circumvent detection as well it adopt more complex obfuscation techniques. Exploit kits are
growing and will attack android platform, which starts became the universal front-end
in the IoE and IoT.
SmartTV, Boxee TV devices, biometric systems, Home Automation devices, Android
Car (Open Automotive Alliance) and the digital payments, ever more pervasive inside
our life.
1.2
Goals
This Master Thesis work doesn’ t want to be another methodology to Detecting Malware
on Android devices, wants to represent an inversion in the strategy, posing itself before
1
Chapter 1. Introduction
2
any existing Anti-Malware solution, to support them, to enhance them, to build signs for
them. The idea behind, arises from the awareness that a more effective and holistic anti
malware approach have to first outline the phylogenesis, understand its evolution and
sophistication, their belonging semantics. This methodology move toward this direction,
implementing a clone-detection heuristic for outline common payload components in
order to identify malware variants. Our Heuristic, is a contribution in the Malware
Analysis phase, not in the Detection phase, to well-understand Android Malware and
their evolution, to trace back a possible Malware descent. To achieve these goals, our
methodology start from the analysis of the Opcodes Frequency Distribution, obtaining
by similarities, the 10 nearest vectors from the Data-set (build from the Android Drebin
Project), then, an n-grams heuristic on the Adjacency Lists, detect isomorphism features
in the Call Graphs to identify payloads or malicious behaviours. In this way we a re
able to outline a possible genome for each malware families.
1.3
Thesis Overview
This Master Thesis is organized as follow. CHAPTER 2 (Android: A Security
Overview) start with a Security Overview of the Android Security model, describing its internals, focusing on the Security mechanism implemented by Google and by
vendors, in the second part of the chapter, will be described the Android Malware Characterization, malware evolutions and challanges. CHAPTER 3 (Malware Detection
Methodologies) focusing on the state of the art of Malware detection Methodologies
and its related unresolved problems, we also evaluate commercial mobile anti-malware
products for Android and test how resistant they are against various common obfuscation techniques. In the second part of the Chapter we exploring detection methodologies
based on Call-graph isomorphism analysis for malware variants detection. CHAPTER 4
(Detecting Android Malware Variants using Opcodes Frequency Distribution
and Call Graphs Isomorphism Analysis) provide a detailed analysis of our implemented methodology, based on Opcodes Distribution analysis and an n-grams heuristic to outline isomorphism features in the Call Graphs. CHAPTER 5 (Experimental
Phase) focus on the experimentation phase trying to confirm the value of this methodology. Finally CHAPTER 6 (Conclusions, Applications and future works) describes
possible applications of this methodology and purposes for the future.
Chapter 2
Android: A Security Overview
2.1
The Android Security Model
The project was founded in October 2003 at Palo Alto, CA by Andy Rubin working
on software for mobile phones. In the 2005 Google acquired Android Inc. and in the
2007 unveiled it by the Open Handset Alliance, a consortium of device manufactures
and operators lead by Google which intention is promoting open standards for mobile
devices. Android is a mobile platform based on a linux Kernel, designed, from the
very beginning, with openness and interoperability in mind, enable developers to create
engaging and innovative applications.
Android uses an open architecture which also allows third party manufacturers to install
their own custom NAND Flash to provide customizations for their device. This open
design can undermine the platform itself and it is necessary ensuring strong protection
mechanism in respect to some foundamental assets: users, developers, devices, HW
component, applications.
As a result, since the early stages of its development, Android is subject to a rigorous
Security Program. Following we analize how this program is composed, and will therefore
refer to the official documentation “Android Security Overview” [11] by Google.
3
Chapter 2. Android: A Security Overview
2.1.1
4
Android Security Program
The “Android Security Program” consist of 4 main elements:
Design Review
During the design phase of the Android Platform, each Major Feature is reviewed
by specialized Security Engineer and possibly integrated through Security Controls
in the Architecture.
Penetration Testing and Code Review
During the development of the Platform components, each component must pass
rigorous Security checks conducted by Android Security Team, by Google’s Information Security Engineering team and by indipendent Security Expert. The goal
of these assessments, is to identify weakness and vulnerabilities.
Open Source and Community Review
After the issuing of a brand new version of the Android Platform, it will be subject
to further review by the OpenSource Android Community, both on specific Board
Review such as the Kernel Community or the Google Play Forum.
Incidente Response
Security issues may also occours during the shipping phase, for this reason, Android
has a Security Response process. A Security expert team constantly monitoring
Android Security-specific communities, and after have identified potential securityissues, they trying to act immediately for mitigate vulnerabilities with OTA and
FOTA updates.
Following, referring to the “Android Security Overview” [11] documentation, will be
analyzed the Android Platform Security Architecture.
2.1.2
5
Android Platform Security Architecture
In order to protect their foundamental assets, Android provides the following Security
features:
• Kernel-level Security
• Application Sandbox
• The System Partitions and Secure Bootloaders.
• Cripthography
• Secure BootLoaders
• Security-Enhanced Android and Samsung Knox
• Secure IPC : INTENT e BINDER
• Application Signing Enforcement.
Following is presented the whole Android Platform Architecture, full of the key components for ensuring the Security and the Integrity:
Figure 2.1: Android Software Stack [1]
An Android application is written in java and for this are protected from standard
BufferOverflow exploit due to its implicit buond checking but some applications can
6
also access to C/C++ libraries via JNI the Java Native Interface. Developers use JNI
for performance reasons but most of C/C++ libraries are mapped to fixed memory
address in memory. All the application codes, running on top of the kernel is confined
and isolated by the Application Sandbox.
2.1.2.1
Kernel-level Security
Android is based on Linux Kernel Security mechanism in addition it includes some enforced IPC (Intents, BroadcastReceiver, Binder). Linux is a world-wide Operating System used in Security-sensitive and governative context, the kernel provides to Android
some Security Features including:
• A user-based permission model.
• Process Isolation and Sandboxing.
• Secure IPC.
• Network Security (iptables)
2.1.2.2
The Application Sandbox
The Sandbox is a sealed container which ensure isolation among Android applications,
running on top of the kernel, it is auditable. The kernel assigns to each applications, that
is executed, an unique ID, the UID, in this way the kernel ensure that each application
has its own data and no other application can read or modify it if the application itself
is considered untrusted. This isolation mechanism is based on the Linux user privileges,
normally processes will have different process space and different UID’s , it is a DAC
Access Control model:
7
Figure 2.2: Android Sandbox Mechanism [2] [3]
By default Android applications run in the Sandbox and doesn’t have any grant to
access resources, they can access to device resources by defining specific permissions in
their AndroidManifest file. All application components runs in the same Process Space
it is, however, possible to specify a process in which each compnent should run, the
android:process attribute of the <application> element, in the AndroidManifest
file serves to this. It is also possible to define android:process attribute so that
components in different applications run in the same process but also, in this case the
application must share their UID and must be signed with the same key. To share
the same UID among applications, first of all, it is necessary to define the UID in the
AndroidManifest file, specifying the android:sharedUserId [3] attribute then sign them
with the same developer’s private key, the signature on any resource is easily traced back
to it’s author.
Figure 2.3: Applicazions sharing the same UID [2] [3]
8
Android Application Sandbox protect against Memory Corruption exploits which could
compromise the entire System.
2.1.2.3
The System Partitions
The Android Platform’s internal memory is generally a solid-state (flash) memory.
Usually the bootloader has its own partition /boot, Recovery is another partition
/recovery and the system partition is under /system. Following a short description of Android standard partitions:
- /misc miscellaneous.
- /boot bootloader, filesystem root.
- /recovery holds the recovery program.
- /system OS, Kernel, OS libraries, application runtime, binary applications.
- /cache cache data.
- /data user application data/preferences.
To protect files use mode param :
• MODE PRIVATE
• MODE WORLD READABLE :
• MODE WORLD WRITEABLE :
It is important to note that application files, stored on internal storage, are secured by
linux UID mechanisms but application files stored on external storage like SD card are
not secured, it is not necessary to specify a permission for a third-part application to
read these files. For this reason, application files stored on external storage must be
encrypted.
2.1.2.4
9
The Secure BootLoader and QFuses
The bootloader is a little piece of code that runs before the OS start up.
This low-level code tells the device how to start-up and find the kernel image, the authentication process must guarantee the origin and the integrity of the software, it is
stored on a device non-volatile memory and vary from one device to another. Manufactures sign and securing their bootloaders in order to ensure that their OS image is
securely programmed into flash memory at the factory, to permit verification of the image and avoid tampering. Usually the implementation is OEM-dependent. The Secure
Bootloader verifies signatures of Kernel, Android Recovery and if the signature check
fails drop into Recovery mode and then the device was unable to boot. Following
some Secure Boot technologies:
Motorola OMAP Boot Chain
Motorola in their devices, but not only it (Motorola Droid Milestone, Nokia N900,
Motorola Defy but also Archos 43, Archos 70, Archos 101, LG Optimus Black, LG
Optimus Bright, etc.), used the OMAP Crypto Processor by Texas Instruments. The
JTAG port is locked down to prevent emulation and debug attacks.
• SHA1 hash of root public key stored in eFUSE.
• Boot ROM verification using either SHA-1 or SHA-256, and AES-128.
• NIST-800-22 certified random number generator.
• Signature verification on Kernel or Recovery.
• Boot modules encrypted.
Figure 2.4: Motorola OMAP Secure Boot Chain [4]
10
11
The Boot ROM verifies the keys stored in the mbmloader and the mbmloader signature,
mbmloader verifies signature of the mbm, mbm verifies the Linux BootLoader and the
Linux BootLoader verifies the Kernel image or the Recovery.
Qualcomm and QFuse
Most of Qualcomm devices such as HTC family (HTC One, HTC Desire, HTC M8,
HTC ONE X, etc.) using Qfuse driver. The Qfuse driver allow the one-time immutable
configuration of the device settings and for examples to storing cryptographic keys. The
value of each Qfuse is sensed at powerup and stored in a register, blowing Qfuses is
done by placing a value to a register and applying current to the fuse. This process
is irreversible. For instance, if the FORCE TRUSTED BOOT QFuse is blown, each
stage of the Boot Chain is verified and this means that at each Boot stage, only verified
bootloaders can run. The PBL “Primary Bootloader” verifies the signature of the SBL
“Second BootLoader”, the SBL verifies signature on REX/AMSS (baseband) and on
HBOOT (application bootloader) and then HBOOT verify Kernel image and finally
boots the OS. Stage2 bootloader set ups the network, memory, and everything else that
is required to run the kernel. When the kernel is finished loading, it starts the init process
The Qfuse technology is also implemented in Samsung Knox devices as a warranty flag
(Knox Warranty bit), in fact, once the device is flashed, the Knox warranty bit remain
set in a register by blowing an efuse and remain always blown. This technology provide
some Security Controls on the device by the OEM that, for example, may decide not to
release more updates for their devices as happens for most of Samsung Galaxy devices.
12
Figure 2.5: Samsung Knox efuse warranty bit
HTC S-ON/S-OFF
In the HTC devices, the NAND memory is secured (S-ON Security On) for writeprotection, passing from S-ON to S-OFF means disable the NAND’s Security checks.
Switching to S-OFF is possible to NAND repartition or to flashing the boot.img or the
Kernel.
Figure 2.6: HTC S-OFF NAND
2.1.2.5
13
Cryptography
Android provides developers, a set of cryptography API for the implementation of the
Cryptographics Security Standards such as AES, DES, SSL, RSA SHA. These APIs
are provided by the javax.crypto package. By the Android documentation [11], from
Android 4.0 it provide the KeyChain API for the management of Security Certs.
2.1.2.6
Security-Enhanced Android and Samsung Knox
SELinux is a technology developed by NSA in the 2000 and it is used in much government
and corporate environments for Data Security and for implementing the Bell-LaPadula
model. Samsung worked with NSA on Android SELinux porting by referring as SEAndroid (Security-Enhanced Android). SEAndroid implements a MAC policy over the
entire system, also on system processes (root, UID 0) enhancing the basic Linux DAC
model. From the Android 4.3 Aka KitKat the Sandbox is enforced introducing SEAndroid but is still in permissive-mode by default for compatibility reason. This mode only
log all policy violations but without intervening in the behaviour.
From Android 4.4 [28] SEAndroid works in enforcing-mode for the installd, netd, vold
and zygote root processes but for all others still works in permissive-mode [29]. In
enforcing-mode, each policy violations is logged and then blocked. As reported by the
document Validating Security-Enhanced Linux in Android [29].
Policies are defined in the external\sepolicy dir and it is possible to implementing
its own policy defining a policy file in the device\manufacturer\device-name\sepolicy
dir and after updating the BoardConfig.mk Makefile.
At this point after the Android rebuild, SEAndroid will be aware about new user-defined
policies, as OEM do with their devices, Google advice to conduct a policy rules validation process [29]. The “Policy Rules” are defined in the form: allow domains
types:classes permissions as white List model, where:
• Domain A label for a process or a process group.
14
• Type A label for a object or a set of objects.
• Class The Object’s type.
• Permission The action to perform (as Read and Write).
Those Security Enhancements make the exploiting much more difficult but, we have to
take some considerations. For instance, not the whole platform is Secured:
Figure 2.7: SEAndroid partialc coverage [5]
For example, the Policy Manager does not protect SurfaceFlinger which renders image
to the screen, hackers can capture raw content send to SurfaceFlinger. Another effective
way for hackers is to disable the Policy Management exploiting Linx Kernel Flaws.
A depth study is done on a proprietary technology: Samsung Knox. This technology allows the applications’ sealing and isolation implementing a MDM and BYOD
Enterprise-level technology.
Samsung Knox is composed by the following components [6]:
15
Figure 2.8: Samsung KNOX System Security Overview [6]
The Secure Boot technology prevents NAND flashing , Custom ROMs through a
digital cert verification process. The firmware images are loaded only if they went
cryptographically verified by a key stored in the device at factory time. The firmware
images that are allowed are that cryptographically signed by trusted keys then verifyed
by the HW. That is each stage of the boot process verifies the digital signature of the
next before executing it:
Figure 2.9: Samsung KNOX Secure Boot [7]
Samsung Knox use SE for Android Security Enhancements for Android) to implementing a MAC model. To implementing MAC model, the OS integrity must be
ensured, to address this, Samsung Knox use a layer called TIMA (TrustZone-based
Integrity Measurement Architecture) which uses the ARM TrustZone technology.
The TrustZone technology is a tamper-resistant sector of an ARM processor. During the
boot process, TrustZone save fingerprint called “measurements” from all bootloaders and
16
OS kernel, at system runtime ARM TrustZone apps on KNOX Workspace constantly
compares all measurements, in this way Trusted Boot is ensured. Critical security
decisions are made based on the compared results [7]. CPU and Memory are partitioned
into “containers” of secure and normal type. TIMA runs in secure mode and it can’t
be disabled while SEAndroid runs in normal mode. Based on Samsung Knox documentation [6], if TIMA were encounter an integrity violation, it will notify that violation
to the enterprise IT stuff via MDM. Samsung Knox besides implements an Application
Security mechanism which provide different “execution environnements” to the device,
each with its own launcher, applications, configurations ensuring isolation among these
containers forbids the standards IPC mechanisms and allowing the “inter-containers”
communications only through configuration policies.
Figure 2.10: Samsung KNOX Application Container [6]
Besides the Cryptoprocessor enable the ODE by means of which the IT administrators can encrypt all data on the device. The ODE use FIPS 140-2 certified Advanced
Encryption Standard (AES) NIST standard with an 256 bit AES Key, the encryption
key derives from an user supplied password through a Password-Based Key Derivation
Function 2 (PBKDF2) secure function. There is also a secure VPN support, KNOX
VPN, also FIPS 140-2 certified by which is possible to configuring per applications policies. Samsung Knox provide a full support to a governative utilization of the device
through:
17
• Boot Attestation Samsung sign the Certificates of the Government’s Agencies
by creating a 2 level CA, Knox Government root certificate, at this point, also
government’s agencies can build their own firmware image to load on the device:
• Smartcard - CAC Support
By DoD regulamentation compliance, Samsung Knox permit to store the digital
certs on the device. The browser, email and VPN clients can use credentials on the
Common Access Cards (CACs) to log in, if the enterprise IT admin has configured
this policy. CAC can be used for two-factor authentication on the device lock
screen.
Figure 2.11: Samsung KNOX Support CAC [6]
• Client Certificate Management (CCM) As reporte by the Knox documentation [7], IMA Client Certificate Management (CCM) enables storage and retrieval
of digital certificates, as well as other operations using those certificates such as
encryption, decryption, signing, verification, and so on. The certificates and associated keys are encrypted with a device-unique hardware key that can only be
decrypted from code running within ARM TrustZone. A default certificate is
provided for apps that do not require their own certificate. TIMA Real-time Kernel Protection (RKP) intercepts critical kernel events to stop attempt that could
undermine kernel integrity. TIMA also authtenticate kernel modules.
18
Figure 2.12: Samsung KNOX Client Certificate Management (CCM) [7]
• Certification&Validations Samsung Knox is compliant with the NIST FIPS
140-2 level 1 specifications for DAR Data at Rest and DIT Data in Transit. On
May 2, 2013 the Defense Information Systems Agency DISA approved the STIG
for Samsung KNOX Workspace drafted for the Mobile Operating System SRG.
2.1.2.7
Secure IPC : INTENTS and BINDER
Android use classical IPC mechanisms of the Unix-like System (but no SySV semaphores,
shared memory segments, message queues etc.), also implementing their own Secure and
flexibly IPC mechanisms within an exporting components mechanism. Android support
a form of IPC through Intent and ContentProvider [8]:
- Intents allow Asynchronous communication style among components.
- Point-to-point style, publish-subscribe style.
- Some components act as senders, other asreceivers
- All low-level communication are managed by Binder.
19
Figure 2.13: Android simple form of IPC [8]
An ovewrview of the main components of the Android Platform to better understand
communication mechanisms:
• Activities : Is an appication component by which the user can interact with
the application. Usually, a complete application is composed by more Activities
logically connected.
• BroadcastReceiver: Is an Android application component that allows to registrating for system or application’s events. For instance, an application can registrating for the system event ACTION BOOT COMPLETED which will thrown
when the OS will be launched. A BroadcastReceiver can be registered statically in
the AndroidManifest file or in a dynamic way through the Context.registerReceiver()
method. The class implementing the receiver must extends Context.registerReceiver()
class, then it is possible to trigger when a particular event occourred. For this to
work, the developer must define the relative permission in the AndroidManifest.
For instance, if we want to create a Broadcast Receiver for the android.intent.action.BOOT COMP
event will be necessary to define the android.permission.RECEIVE BOOT COMPLETED
permission, in this way, the user, will be aware, through the Manifest file, what
events application will listen.
20
• Services A service is a component that runs in the background to perform longrunning operations without needing to interact with the user.
• Intents An Intent is a message Object generated by the Intent class of android.content
package, that represents the “intention” to do something. It is a flexible and reliable asynchronous IPC.
It is composed by the following fields [30] :
– ComponentName: It is the component name that should receive the Intent
with the relative data (Data field ). It is used to define Explicit Intents. The
ComponentName class has different constructors : ComponentName(String
package name,String class name), ComponentName( Context package,String
class name ) e ComponentName( Context package, Class class) each of which
permit the identification of a particular component.
– Action: It is the action to take, the complete list can be found on the
ufficial Intent class documentation and it is described by this syntax: package name.intent.action.ACTION. An activity declare in it own Manifest, the
action that it would be able to manage through action elements. For instance ACTION MAIN to launch the main activity and ACTION EDIT for
the modification of data.
– Data: Represent the data and the type of these data. Activities will states
in the Manifest, the data type that they could manage, through the attribute
android:mimeType es: android:mimeType=image* augmenting the action description, in this case, an action ACTION EDIT on an image will be oriented
on a image manipulation.
– Category : Intent class permit to define categories, they will be first defined
in the AndroidManifest with category also specifying the android:category attribute. It is used to provide more informations on actions to take.
– Extras: This is an optional filed key:value based that it is useful to use when
we want to provide additional informations with the putExtras(Bundle
extras) methos, the receiver activity will extract the value with getExtras()
method.
21
– Flags: Flags permit some extra caractherization, for example when starting
an activity, it is also possible to use FLAG ACTIVITY SINGLE TOP for the
verification of the singleton of the activity’s instance. In the official Intent
class documentation [30] there is the complete list.
With Intents is possible to activate application’s inner components or components
exported by other applications. A broadcastIntent method can be used to directly
send to a listening BroadcastReceiver [30] or with startActivity method to launch
ana activity, it is possible to launch an activity, starting a service or thrown a
broadcast message.
It is possible to use 2 different form for intents: Explicit Intents or Implicit Intents. Explicit Intents directly calls the component they needs finding it through
a name, while Implicit Intents ask System to send the message for them. Implicit
Intents acts through a mechanism called Intent Resolution at run-time, when
creating an Implicit Intent Android search for an appropriate component looking
through Intent Filters declared insiede AndroidManifest of all installed applications. Once finding the right Intent fIlter, the Intent object will be delivered to
it, if instead more Intent-filters are compatible, at that point, a dialog-box will be
displayed to the user to choose the appropriate application to use for that action.
An Intent-filter is used to tell Android, of which Intents listening. Google specifies,
as good practice the utilizazion of Explicit Intents and recommend to not declare
Intent filters for services [30].
22
Listing 2.1: Intent-filter declaration
<activity android:name="ShareActivity">
<intent-filter>
<action android:name="android.intent.action.SEND"/>
<category android:name="android.intent.category.
DEFAULT"/>
<data android:mimeType="text/plain"/>
</intent-filter>
</activity>
An Activity without declared Intent-filter will only receive Explicit Intents. This
Intent-filter receive only ACTION SEND intents. By the official documentation
[30], from a Security perspective, would not be sufficient to avoid Intent-filter
utilization, because avoiding Intent-filters utilization, the application can’ t reply
to Implicit Intents but any other application can still use its component through
Explicit Intents knowing the exact component namespace. It is necessary, about
this consideration, to set to false the exported attribute:
Listing 2.2: Explicit Intent declaration
Intent i = new Intent(this, ActivityTwo.class); #Explicit
Intent
i.putExtra("Value1", "This value one for ActivityTwo ");
i.putExtra("Value2", "This value two ActivityTwo");
23
In this example, we see that, to invoke a component is necessary to just only know
its exact name.
Listing 2.3: Implicit Intent declaration [24]
Intent helpIntent=new Intent();
helpIntent.setAction(android.intent.action.ACTION_EDIT);
helpIntent.putExtras(message,userText);
helpIntent.setType(text/plain);
startActivity(helpIntent);
In this example, an Implicit Intent starts, the recipient component should have
text Editing capabilities. The Intent-resolution mechanism analyze the Action,
Data and Category fields of the Intent object [24] .
Intents Security
BroadcastIntent ed Exported Components
It is possible to use Broadcast Intent also through sendBroadcast(), sendOrderedBroadcast() methods, which Intents after will be “consumed” by receiver, they
could not arrive to all applications. Sending a Brodcast request is dangerous,
above all, if they contains sensible data, any evil application can intercept those
data, for avoiding this, it is convenient to specifying, in the Request Intent, nonnull permissions. In this way, only the right target application could receive that
Intent, for this reason, if Intent contains sensible data, is convenient to using Explicit Intents which calls directly the target recipient.
Broadcast Intents could be easily hijacked by third-part applications. Let’s see,
how it is possible. If we would create a stealth SMS hijacking application, first
define a BroadcastReceiver for intercept incoming SMS:
24
Listing 2.4: BroadcastReceiver registration for incoming SMS
<receiver android:name=".SMSReceiver">
<intent-filter>
<action android:name="android.provider.
Telephony.SMS_RECEIVED" />
</intent-filter>
</receiver>
When an SMS will be received, for avoid the default SMS application interception, we can set android:priority attribute in the Intent-filter for android.provider.
Telephony.SMS RECEIVED actions and give, as value, maximum priority, that is
100.
Listing 2.5: Intent Filter with priority
<receiver android:name=".SMSReceiver">
<intent-filter android:priority="100">
<action android:name="android.provider.
Telephony.SMS_RECEIVED" />
</intent-filter>
</receiver>
At this point, application first intercept the BroadcastIntent but after they’ ll continue to propagate and so that can be aborted in this way: this.abortBroadcast().
This evil behaviour is typical of many malware, just think that in that way it
is possible to intercept Bank or TAN SMS about banking operations. But the
not carefully use of BroadcastIntent, itsn’t the only way to introduce weakness in
the system, other weakness can be introduced by default exported components.
In fact, Services, by default were exported, unless the android:exported attribute
were not specified to “true”. It will be sufficient to read the AndroidManifest
and see if vulnerable services are exported, at that point if is also defined an Intent Filter inside service element, we could create an ad-hoc application which
exploits that exported component without specifying particular permissions inside
25
AndroidManifest. A great example of how to exploit this weakness i reported
on [25], the application in question is the famous GoSMS PRO. The application
has the CellValidateService component exported by default an with an Intent-filter
following defined:
Listing 2.6: Exported CellValidateService component [25]
<service android:name="com.jb.gosms.im.
CellValidateService">
<intent-filter>
<action android:name="com.jb..gosms.goim.
ACTION_CELL_VALIDATE"/>
</intent-filter>
</service>
Inside the CellValidateService code there is the following logic:
Listing 2.7: CellValidateService Class [25]
public void onStart(Intent paramIntent, int paramInt)
{.....
if("com.jb..gosms.goim.ACTION_CELL_VALIDATE".equals(paramIntent.
getAction()))
{String str1 = paramIntent.getStringExtra("phone")
....}
Code(str1,str3)
.....
public void Code(String paramString1, String paramString2){
SmsManager.getDefault().sendTextMessage(paramString1,null,
paramString2,null,null)
........}
}
26
At that point it is possible to exploit the weakness, forging an application without
authorization. That application will exploit the default exported CellValidateService component, forging an the com.jb.gosms. goim.ACTION CELL VALIDATE
Intent and after setting an extra field containing the telephone number as:
Listing 2.8: Exploit [25]
Intent intent= new Intent();
intent.setAction(com.xxx.ACTION_CELL_VALIDATE);
intent.putExtra("phone","3424325")
startService(intent);
• BINDER In android each single applications run inside a process space isolated,
for Security and Stability reasons, but anyway it is necessary that these components, communicating between them. For instance, an application needs for access
to a Service, at that point through an IPC interface, the service is requested ,
through a native library interface which interfaces with the driver in kernel-space.
This interaction, among services, applications and the kernel, occours in Android
through BINDER:
27
Figure 2.14: Android Process IPC [9]
The framework perform an ioctl() call to the /dev/binder fd, trasferring data
to the kernel (1), the Driver will find the Server Service, will copy data in th server
process space and will wake up a thread in the server process to evading request
(2). At this point, the server, after the Parcels data unmarshalling, will verify
that the Client must have the proper permission to execute the request and, if
it is necessary, it will send to the kernel, the authorization to use a resource (3),
waiting for a response (4). Subsequently, a libbinder library copy, loaded inside
the server process space, will do the response data marshalling and then will send
to Binder driver (5) which will send to the Client process (6) [9].
The android libc “bionic” does not support the SystemV IPC:
- no semaphores
- no message queues
- no shared memory segments
28
Binder is a Secure and Reliable IPC mechanism in Android. It derives from an
Open Source project called “OpenBinder” by Palm Inc. born under Linux and it
extends the IPC classical concept. Substantially, it is a mechanism which permit
to bind data and functions from an execution environments to another. Binder
born from the idea to make the IPC communication more performing, the communication does not done by memory copy but through a kenrnel memory space and
therefore shared among all processes. Binder is a driver and it’s task is to map
each memory address referenced by a process, in a kernel-memory address. The
communication model it used is client-server based, the Client begin the communication awaiting for a proxy from the Server. The Server will have a thread pool,
to server incoming Request, this client-server communication is possible through
Binder driver (/dev/binder ) in Kernel-space with ioctl() calls :
Figure 2.15: Binder Communication [10]
Processes cannot directly invoke read/write operations on other processes, only
the kernel does, for this reason, processes use Binder as intermediary. Binder
offers a C++ framework to create Remote Services. Developers can build its own
remote service using AIDL language, specifying the interface, which once compiled
provides Java classes for Proxy and Stub. Remote services are functionalities that
applications can export and share with other processes:
29
Figure 2.16: Binder Framework [10]
Binder is defined, from Application point of view, in the Binder.java class, implementation of IBinder interface. Each time that an Android process sends some
data to another, a “transaction” is defined, with the IBinder.transact() IBinder
method :
Listing 2.9: IBinder.transact()
public final boolean transact (int code, Parcel data, Parcel
reply, int flags)
This function allow the Client to transmit an action to do (code field) and some
data (data field) to work on, these data must be first “parcelized”, must be of
Parcel type (must implements Parcel interface) to permit the Marshalling and
Unmarshalling. The recipient process receive the call on the IBinder.onTransact()
method :
30
Listing 2.10: IBinder.onTransact()
protected boolean onTransact (int code, Parcel data, Parcel
reply, int flags)
Binder Security
Permission Enforcement
In the moment, in which a process build a Binder transaction, the calling process
UID and PID is registered into the kernel, which is possible to accessing to some
informations:
- android.os.Binder.getCallingPid() process PID .
- android.os.Binder.getCallingUid() process UID .
- PackageManager.getPackagesForUid(int uid) Application’s package name.
- PackageManager.getPackageInfo(String packageName, int flags),
Context.checkCallingOrSelfPermission(String permission) which
can return PackageManager.
PERMISSION GRANTED o PackageManager.
PERMISSION DENIED flags implementing Permission Enforcement.
Binder Reference Security
If 2 processes have to comminicate, the Client process must send data to kernel
(through an ioctl call) and a reference, to the Server process’ binder object. When
Server process, read the Reference from Parcel, with the Parcel.writeStrongBinder()
31
Figure 2.17: Binder Permission Enforcement [10]
method is sure that the Client Process has a particular Binder and the grants to
do it. This mechanism prevents fake client Processes to send ad-hoc Reference to
deceive Server Process, representing Binder References which not possess. Binder
References, called Binder Tokens are uniques and assigned by the driver, avoiding, in this way, false Client processes’s identities. With this mechanism, getting
a Binder Reference is not for sure.
Binder Tokens:
Each Binder objects maintains an unique identity across process boundaries, by
the Binder Kernel Driver which assigns an unique 32-bit Integer value to each
object in each transaction it sees, maintaining in this way a cross-process identity.
A Binder object reference can be either:
- A Virtual Memory address to a Binder Object in the same process.
- A 32 bit handle to a Binder object in another process.
On each transaction it sees, the Binder driver maps local addresses to remote
Binder handles and viceversa, the driver maintains this mapping between process.
The Binder unique-identity across process boundary rules is used as Security
Access Token.
32
Figure 2.18: Binder Token Object Reference [10]
Following we consider an application that want to acquire the wavelock on the
PowerManager Service [26]:
Listing 2.11: Client application acquiring wavelock on PowerManager service
[26]
public class MyActivity extends Activity {
private PowerManager.WakeLock wakeLock;
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
PowerManager pm = (PowerManager) getSystemService(
Context.POWER_SERVICE);
wakeLock = pm.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK
, "My Tag");
wakeLock.acquire();
}
@Override
protected void onDestroy() {
super.onDestroy();
wakeLock.release();} }
Following the Analysis of the PowerManager service source code to better understand the Binder Token mechanism [26]:
Listing 2.12: PowerManager source code [26]
public final class PowerManager {
// Our handle on the global power manager service.
private final IPowerManager mService;
public WakeLock newWakeLock(int levelAndFlags, String tag)
{
return new WakeLock(levelAndFlags, tag);
}
public final class WakeLock {
private final IBinder mToken;
private final int mFlags;
private final String mTag;
WakeLock(int flags, String tag) {
// Create a token that uniquely identifies this wake
lock.
mToken = new Binder();
mFlags = flags;
mTag = tag;
}
public void acquire() {
// Send the power manager service a request to acquire
a wake
// lock for the application. Include the token as part
of the
// request so that the power manager service can
validate the
// application’s identity when it requests to release
the wake
// lock later on.
mService.acquireWakeLock(mToken, mFlags, mTag);
}
public void release() {
// Send the power manager service a request to release
the
// wake lock associated with ’mToken’.
mService.releaseWakeLock(mToken);
}
}
33
34
1 The Client request an instance of PowerManager class in onCreate() method,
responsible for managing the device power state.
2 The Client application create and acquire a waveLock, then the PowerManager send the unique Binder Token of the wavelock in the acquire() request.
3 The Client Application releases the waveLock in onDestroy(). The PowerManager send the Binder Token of the waveLock as a part of the request.
When the PowerManager Service receive the request, it compares the tokens
against of all others waveLock it stored and release the waveLock only if it
finds a match ensuring that malicious applications cannot trick the PowerManager Service to release the waveLock of others applications.
A special kind of Binder Token is the Windows Token, the WindowsManager use
it to uniquely identifying Windows in the System. The WindowsManager require
that each applications pass their Windows Token as a part of the Request, in this
way, a malicious application cannot draw something on top of another application. If a token don’t match, the WindowManager throws a BadTokenException.
BinderTokens are used extensively in the Android Platform for security reasons to
achieve a secure component communication.
2.1.3
2.1.3.1
Android Application Security
The Android Permission Model and Protected APIs
Android applications are installed from a single apk (Android Application Package) file,
it is an archive file format, variant of the jar file format, more in general, it contains the
following resources:
• META-INF/
– MANIFEST.MF : The Manifest file.
– CERT.RSA: The application certificate.
– CERT.SF : The list of resources and the SHA-1 digest of the following Manifest resources:
35
Listing 2.13: WhatsApp.SF
Name: res/drawable-xhdpi-v4/mic_background_incoming_normal.
png
sha1-Digest: 8wEzOIIOCx4L0R/E4/3Dy6IjXDs=
Name: res/drawable-hdpi-v4/
common_signin_btn_icon_pressed_light.9.png
sha1-Digest: X4WAf4UWRpI4GENKGJqxId5qqTI=
Name: lib/armeabi-v7a/libcurve25519.so
sha1-Digest: cT2rnF/ZorIjUdANe5vZoJE0SfQ=
Name: res/drawable-hdpi-v4/ic_call_decline_normal.png
sha1-Digest: ZGoPYtfJXZH3VSpIBcqulYw6zO4=
• lib: this folder contains the compiled code for a specific library used in the application (armeabi, armeabi-v7a, x86, mips)
• Resource files.
• AndroidManifest.xml : The AndroidManifest.xml tells the system the definitions
of all high-level component (activities, services, broadcast receivers, and content
providers ). This is very important for the Application Security of the system and
how users understand the application, because it specifies which permissions the
application requires.
• classes.dex : All the classes compiled for the Dalvik Virtual Machine in the dex
format.
Referring to [11], all Application in Android runs in an Application Sandbox in a
privilege-separated way and for this reason, ti can only access a limited set of System
Resources, since the system manage the accesses to the resources. The System provide
for this, a security fine-graned mechanism that can restrict operations that application
can perform. Applications must explicitly share their resources and data statically and
explicitely declare permissions they need in the AndroidManifest file. Android has no
mechanism to grant permissions at run-time.
2.1.3.2
36
Application Signing Permission Enforcement
All apk’s must be signed with a certificate whose private key is owned by the developer.
All certificates are not signed by a CA but are self-signed with the developer’s private
key. This process is called “Application Signing” which permit the grant or deny when
an Application request the same identity of another application. When an application
is installed, the system give an unique UID valid for all the application’s life cycle on
that device, on another device the UID could be different for the same application. The
System enforce Security at process level, in fact, 2 distinct apps, signed with 2 different
private keys cannot run in the same process space with the same UID. To permit this it
is first necessary to define, inside the AndroidManifest.xml, the shareUserId attribute
which specifies the name of a user ID that will be shared to other apps. Setting this value,
each application will be given the same UID but for this to work the applications must
be signed with the same private key. Normally each application data will be assigned
the same application’s UID, in this way the data cannot be accessible (in read/write)
by other applications (with different UIDs), if the application share the same UID with
others, after a signing process with the same key, also share data and resources.
Enforcing Permissions
An Android Application has no permission by default, it cannot do anything that impact the data and protected resources of the device. To grant a permission, an application must include some permission declarations inside the AndroidManifest file with
the ¡uses-permission¿ tag. For instance, following some example of specific permission
declarations.
Allow an application to receive and read SMS:
37
Listing 2.14: RECEIVE SMS permission
<manifest xmlns:android="http://schemas.android.com/apk/res/
android"
package="com.android.app.myapp" >
<uses-permission android:name="android.permission.
RECEIVE_SMS" />
...
</manifest>
Allow an application to access fine graned location through the GPS, cell towers and
Wifi:
Listing 2.15: ACCESS FINE LOCATION permission
android"
ACCESS_FINE_LOCATION" />
...
</manifest>
Allows applications to call into AccountAuthenticators:
38
Listing 2.16: ACCOUNT MANAGER permission
android"
ACCOUNT_MANAGER" />
...
</manifest>
Permissions requested by an application are granted by the user explicitely by the package installer at install time and no checks are done while the application is running. If
the application while running request a particular features, for which does not have permission, the result is a SecurityException means permissions failures, being thrown back
to the application and being logged. Prior to installation of any application, the user is
shown a clear message about the different permissions the application is requesting, it is
also possible to show the application’s permission after installation, through “Settings”
menu:
Figure 2.19: Display of permissions for applications [11]
39
Android sandboxes all applications by default , for this reasons, application must explicitely share resource and data, they do it declaring permissions they need, in the AndroidManifest . When an app is installed, it is assigned a unique UID with permissions
associated with it, Android permissions are typically implemented by mapping them
to Linux groups, all components inside the app have the same permissions [31]. When
the app includes some third-part components such as in-app advertisements libraries,
plug-ins, social network APIs ect. also these components inherite those permissions.
Protecting Security-sensitive application’s IPC with Signature-Protection
level
Android permissions fall into four levels.
• Normal: These permissions cannot impart real harm to the user, (e.g. change the
wallpaper) and, while apps need to request them, they are automatically granted.
• Dangerous: These can impart real harm (e.g. call numbers, open Internet connections, etc) and apps need to request them with user confirmation.
• Signature: These are automatically granted to a requesting app if that app is
signed by the same certificate (so, developed by the same entity) as that which
declared or created the permission. This level is designed to allow apps that are
part of a suite, or otherwise related, to share data.
• Signature System: Same as Signature, except that the system image gets the
permissions automatically as well. This is designed for use by device manufacturers
only.
In Android is possible to use standard permission for the whole application or use permission only on the desidered component. It is also possible to create Custom Permissions
used by developers to restrict access to various services/components and any application
40
that interacts with another application’s component would need to possess the required
permission [12]. High-level permissions restricting access to entire components of the
system or application can be applied through your AndroidManifest file. All that this
requires is including an android:permission attribute on the desired component:
• Activity permissions restrict who can start the associated activity:
Figure 2.20: Securing Activities with Custom permission [12]
41
• Service permissions restrict who can start or bind the associated services. To protect whole service permissions are defined in the AndroidManifest, to protect individual methods (fine-graned approache) we can use Context::check*Permission()
set of APIs : checkCallingPermission() or checkCallingOrSelfPermission():
Figure 2.21: Securing Services with Custom permission [12]
42
• BroadcastReceiver permissions restrict who can send broadcasts to the associated receiver, a permission failure, not thrown an Exception, will just not deliver
the intent. Must considering who can receive my broadcasts:
Figure 2.22: Securing BroadcastReceiver with Custom permission [12]
43
• ContentProvider permissions estrict who can access the data in a ContentProvider. IT also has an additional security facility available called URI permissions:
Figure 2.23: Securing ContentProvider with Custom permission [12]
44
In this way is possible to secure fine-graned Android components:
Listing 2.17: Securing Android components
<application
android:permission="com.permission.XXX"
<activity
android:exported="false"
<service
android:permission="com.permission.XXX"
...
To protect security-sensitive IPC, in the case for example of ContentProviders, the developer could use <permission>. It is possible, for example, as cited by the official
documentation, to consider using the Signature Protection level on permissions for
IPC between applications by a single developer. Each permission has an associated “protection level” that indicates how the system proceeds when deciding whether to grant or
deny the permission. Signature permissions only allow access by applications signed by
the same developer, it is necessary first to define the android:protectionLevel attribute of <permission> element, with the “signature” value, the system grants only
if the requesting application is signed with the same certificate as the application that
declared the permission. Let’s we analyze the GMail app case. Gmail app by Google,
had a weakness until the 2.3.5. version, it does not require signature level permission for
reading mail, this means that another third-part (evil) application can read email from
Gmail ContentProvider just defining appropriate permissions on its Manifest. The enforced permission is com.google.android.gm. permission.READ GMAIL which includes,
since ver. 2.3.5 the signature-level enforcing:
Figure 2.24: GMail AndroidManifest prior to 2.3.5
without signature-level enforcing
45
Figure 2.25: GMail AndroidManifest after to 2.3.5
with signature-level enforcing
Protecting Security-sensitive application’s IPC with URI permissions
To protect a ContentProvider, is possible to define a URI permission. Permissions are for
a specific content URI and for the the whole Provider itself. When starting an activity or
returning a result to an activity, the caller can set Intent.FLAG GRANT READ URI PERMISSION
or Intent.FLAG GRANT WRITE URI PERMISSION for example, this grants the receiving activity permission to access specific data URI in the Intent. This fine-graned,
per URI mechanism is possible through the android:grantUriPermissions attribute or
¡grant-uri-permissions¿ tag. First decide what URIs you want to dynamically grant
access to :
- All data: android:grantUriPermissions=“true”
- Specific URIs: android:grantUriPermissions=“false” :
Listing 2.18: URI permissions example
<provider
android.name=‘‘com.example.test.MailProvider"
android.authorities=‘‘com.example.test.mailprovider"
android.readPermission=‘‘com.example.test.permission.DB_READ
"
android.writePermission=‘‘com.example.test.permission.
DB_WRITE">
<grant-uri-permission android:path=‘‘/attachments/" />
</provider>
In this way it is possible for third-part application to use in R/W only
the attachments of the ContentProvider of the mail app and not the
whole emails data.
Figure 2.26: Securing with URI permissions [12]
46
2.2
47
Malicious Android apps
“ There is a massive growth in the volume of malware families
and samples...”
from Symantec Mobile Adware and Malware Analysis report
2.2.1
An Overview: Botnet, Data collectors and Madware
Android has become the leading smart phone Operating System in the world. Android
continues to dominate the global smartphone market, with over 255 million units shipped
and nearly 85% of the market share in the second quarter of 2014 as IDC reported and
for this reason has become a favorite target for cybercriminals. Cybercriminals have
started to use the internet marketing also to promote and sell their services on the black
market. New schemes are based on make money capturing SMS messages used for online
banking logins, or by sending premium-rate SMS messages without the users’ knowledge.
For instance, users downloaded pirated copies of paid Android apps, without cautions,
that which were infected with trojans or through social engineering techniques such as
fake antivirus, which trick users into paying to get rid of non-existent malware. McAfee,
on its “mobile consumer threats report 2014” finds that privacy-invading apps dominate
the landscape, and many leveraging adlibraries that go hand in hand with malware.
Figure 2.27: Apps collect your Information [13]
48
many apps use your unique device identifier (IMEI, MEID, ESN, or IMSI) to track
you—and their bot clients—through your device. Watch for an app requesting the
READ PHONE STATE permission [13]. From the McAfee Report emerging that new mobile malware analyzied collect aggressively personal data than regular app:
Figure 2.28: Apps collect your Information [13]
49
Interesting from the report is also which apps are abusing (releting to the period
May/December 2013):
Figure 2.29: Which apps are abusing [13]
Following will be shown how cybercriminals can make money from mobile malware.
For instance, and android compromised device can partecipate in dangerous botnet
activities such launching DDos attacks, click fraud or to send SMS to premium rate
numbers. A compromised device can also theft data such as Acount details, call logs,
steling IMEI etc.. can also steal your money tealing transaction authentication numbers
(TANs) or through Extortion via ransomware or with Fake antivirus. But can also be
used to monitor users’ activities through audio, camera, GPS etc. and behind all these
activities there are billionaire costs for data breaches for companies and governments
and billionaire earnings for cybercriminals.
50
Aggressives AdLibraries: MADWARE
Based on the Symantec report [32], Madware refers to apps that use aggressive ad
libraries. There are at least 65 known ad libraries and over 50 percent of them are
classified as aggressive libraries. The percentage of madware on Google Play is steadily
increasing, reaching over 23 percent in the first half of 2013. The data collected enables
retailers to target you with ads and promotions based upon your location, the more
precise the targeting, the more personal data takes. Based on the Mcafee report [13]
some AdLibraries are more aggressive than the others. The AdLibrary Ledbolt it is
demonstered to be to most aggressive ones to invading the privacy :
Figure 2.30: AdLibraries’ privacy scores [13]
51
But based on the Mcafee report Analysis [13] we can notice that invasive data collection
and malware don’t correlate perfectly, in fact some ad libraries were not used very often
by malware authors and some other ones are only brought by malware :
Figure 2.31: AdLibraries brought to you by malware [13]
Cryptocurrency Mining
According to the Sophos mobile threath report, there is another trend started in 2012,
52
the use of mobile botnets to “mining” Bitcoins. Minings is a way to earn Bitcoins and
and because mining is incredibly resource-intensive, the process can severely run down
a phone’s battery, eat through a data plan by periodically downloading what is known
as a block chain. For this purpose, from May 2012 to February 2013, and after for
other 3 weeks in April 2013, the mobile clients infected by ZeroAccess Botnet were
enrolled with this purpose: earnings Bitcoins. But for an unknow reason, ZeroAccess
Botnet stop to provide this functionality. But Lookout’s researchers calculated that if
you’re mining for 24 solid hours on a Samsung Galaxy SIII, you’d only earn .00000007
Bitcoin or $0.00004473. In order to make just one Bitcoin in a day, Lookout says you’d
need 14,285,714 phones working full-tilt simultaneously. Marc Rogers, Principal Security
Researcher at Lookout, explains: “In order to control the rate at which new digital coins
are minted, the software that runs the currency sets a difficulty rate which governs just
how much processing power you need to expend in order to solve the blockchain and
get new coins. The difficulty for Bitcoin is so tough right now that a recent mining
experiment using 600 quadcore servers was only able to generate 0.4 bit coins”.
Table 2.1: Cryptocurrencies Mining Difficult Rate
Currency
Bitcoin
Litecoin
Dogecoin
Casinocoin
Difficulty
4,250,217,919.87
5,162.40
1,151.55
1.62557361
For the difficulty of rate of mining, malware such as CoinKrypt is currently targeting
the likes of Litecoin, Dogecoin, and Casinocoin, but not Bitcoin. It is almost one million
times easier to mine Litecoin than Bitcoin, and over 3.5 million times easier to mine
Dogecoin.
2.2.2
53
Android Malware Characterization
A great study of Android Malware Characterization is “Dissecting Android Malware:
Characterization and Evolution” [14]. Following , is provided a malware characterization
based on this research work. In this paper, the authors focus on the Android platform
and aim to systematize or characterize existing Android malware. The resesearch collect
more than 1,200 malware samples covering the majority of malware families:
54
Table 2.2: Android Malware Families [14]
Family
FakePlayer
GPSSMSSpy
TapSnake
SMSReplicator
Geinimi
ADRD
Pjapps
BgServ
DroidDream
Walkinwat
zHash
DroidDreamLight
Endofday
Zsone 12
BaseBridge
DroidKungFu1
GGTracker
jSMSHider
Plankton
YZHC
Crusewin
DroidKungFu2
GamblerSMS
GoldDream
HippoSMS
Lovetrap
Nickyspy
SndApps
Zitmo
CoinPirate
DogWars
DroidKungFu3
GingerMaster
NickyBot
RogueSPPush
AnserverBot
Asroot
DroidCoupon
DroidDeluxe
Gone60
Spitmo
BeanBot
DroidKungFu4
DroidKungFuSapp
DroidKungFuUpdate
FakeNetflix
Jifake
KMin
RogueLemon
Discovered
2010-08
2010-08
2010-08
2010-11
2010-12
2011-02
2011-02
2011-03
2011-03
2011-03
2011-03
2011-05
2011-05
2011-05
2011-06
2011-06
2011-06
2011-06
2011-06
2011-06
2011-07
2011-07
2011-07
2011-07
2011-07
2011-07
2011-07
2011-07
2011-07
2011-08
2011-08
2011-08
2011-08
2011-08
2011-08
2011-09
2011-09
2011-09
2011-09
2011-09
2011-09
2011-10
2011-10
2011-10
2011-10
2011-10
2011-10
2011-10
2011-10
2.2.2.1
55
Malware Installation
By analyzing malware samples in that collection, the authors categorize existing ways
Android malware use to install onto user phones and generalize them into three main social engineering-based techniques, repackaging, update attack, and drive-by download. These techniques are not mutually exclusive.
Repackaging
Repackaging is one of the most common techniques malware authors use to piggyback
malicious payloads into popular applications. Malware authors find and download popular apps, disassemble them, insert the malicious payload , repackage the app, signed
it with a new Key and then submit the new app on the official or alternative Android
Market. The payload should be initiated by an event such as the application starting
or a phone call being received. In this research, among the 1260 malware samples,
1083 of them (or 86.0%) are repackaged. Also, possibly due to the attempt to hide
piggybacked malicious payloads, malware authors tend to use the class-file names which
look legitimate and benign. For instance, AnserverBot malware uses a package name
com.sec.android.provider.drm for its payload, which looks like a module that
provides legitimate DRM functionality. One malware Family uses a publicly available
private key that is distributed in the Android Open Source Project (AOSP). The current
Android security model allows the apps signed with the same platform key of the phone
firmware to request the permissions and ,for example, installation of additional apps
without user intervention [14].
Update Attack
Instead of enclosing the Payload as a whole, like in the Repackaging Techniques, it
only includes an update component which donwloads the malicious payload at runtime.
56
In the Research [14] dataset, there are four malware families : BaseBridge, DroidKungFuUpdate, AnserverBot, and Plankton, that adopt this attack. For an example, a
BaseBridge-infected app it will check for updates and an update dialog box is displayed
to the user and if the user accepts, an “updated” version with the malicious payload will
then installed. The DroidKungFuUpdate malware notify the users through a third-party
library that provides the notification functionality. This 2 examples works updating the
whole app but AnserverBot and Plankton updates only some components without user
approval. AnserverBot donwload its payload from the C&C while Plankton direct fetch
the component from a remote server.
Figure 2.32: An Update Attack from BaseBridge [14]
Drive-by Download
This kind of malware is installed enticing users to download “rich” or “interestings”
apps. The GGTracker malware starts from its in-app advertisement, in particular if a
57
user click on this advertisement, will be redirected to a malicious web site for example
claims to be analyzing the battery usage of user’s phone and will redirect the user to
one fake Android Market to download an app claimed to improve battery efficiency
[14]. Then the downloaded app is a malware sending premim-rate number SMS. Spitmo
and ZitMo are porting of a PC malware Spyeye and Zeus, when a user is doing online
banking with a compromised PC, the user will be redirected to download a particular
smartphone app, which is claimed to better protect online banking activities. However,
the downloaded app is actually a malware, which can collect and send mTANs or SMS
messages to a remote server. These two malware families rely on the comprised desktop
browsers to launch the attack. Thus attacker has ability to do some bank transaction
Others
The Research dataset [14] consist of 1083 repackaged apps which leaves alone 177 apps.
They organize these remaining apps into spyware or including malicious functionalities.
For instance GPSSMSSpy is an example that listens to SMS-based commands to record
and upload the victim’s current location. FakeNetflix isn’t the real NetFlix app but
disguises the user with the same UI of the real app stealing Netflix users credential.
FakePlayer is another example that masquerades as a movie player but instead send
SMS to Premium-rate Numbers or RogueSPPush an astrology that will automatically
subscribe to premium-rate services by hiding confirmation SMS messages [14].
2.2.2.2
58
Activation
Android Malware relies on system-wide events to trigger its malicious payload. BOOT COMPLETED
is the most interested one to existing Android malware, in that dataset, 29 (with 83.3% of
the samples) malware families listen to this event. Geinimi listen for this event to bootstrap the background service com.geinimi.AdService, the SMS RECEIVED comes
second with 21 malware families. For examples, zSone listens to this event and intercepts or removes all SMS message from particular originating numbers such as “10086”
and “10010”. Some malware samples directly hijack the entry activity of the host apps
which will be triggered when the user clicks the app icon on the home screen or an intent
with action ACTION MAIN is received by the app [14].
2.2.2.3
Malicious Payload
Android Malware can be also characterized by their payload, we have four different categories : privilege escalation, remote control, financial charges, and personal information
stealing.
Privilege Escalation
Android is a complex system built not only on Linux Kernel but also with more than 90
libraries such as WebKit, OpenSSL, SQLLite etc. This complexity introduces software
vulnerabilities that can be potentially exploited to facilitate the execution of the payloads, for instance DroidKungFu does not embed the root exploit but first encrypt the
exploit as an asset image and then at runtime decrypt this file and execute it, in fact
where DroidKungFu was discovered no existing mobile antivirus was able to discover it.
59
Remote Control
There are 1, 172 samples (93.0%) that turn phones into bot. They receive command
from HTTP traffic from C&C servers, and most of them also encrypt C&C addresses.
DroidKungFu3 employs the standard AES encryption scheme to hide its C&C servers,
Geinimi similarly applies DES encryption scheme to encrypt its communication to the
remote C&C server. Often C&C server are hosted under domains controlled by attackers
but there are cases where servers are hosted on public clouds, for instance, C&C servers
of Plankton malware are hosted on Amazon cloud.
Financial Charge
Another motivation behind malware infections is financial charge. One way is to subscribe to attacker-controlled Premium-rate services and silently sending SMS to them.
For Instance, FakePlayer sends SMS message “798657” to multiple premium-rate numbers in Russia, GGTracker silently subscribe users to a premium-rate service in US and
zSone malware sends premium-rate number SMS to a service located in China, all without user’s consent. The 4.4% of the dataset send SMS to premium-rate numbers. To
succesfully subscribe and activate premium-rate services, users must confirm an SMS
sent by the Provider, to avoid users being notified, for instance RogueSPPush will automatically reply “Y” to such incoming messages in the background, GGTracker will
reply “YES” to one premium number, 99735, to active the subscribed service.
Information Collection
Some malware harvesting various information on the infected phones like sms, contacts,
user accounts. In the dataset there are 13 malware families (138 samples) that collect
SMS messages, 15 families (563 samples) gather phone numbers, and 3 families (43
samples) obtain and upload the information about user accounts [14]. For instance,
SndApps malware collect users’ email addresses and then send them to a remote server,
both Zitmo and Spitmo attempt to intercept SMS verification messages and then upload
them to a remote server to perform some fraud actions.
2.2.2.4
60
Permission Used
Malware whitout root privileges, relies on the Android Permission model. It is interesting to compare top permissions requested by malicious android apps with top
permissions required by benign ones. For instance INTERNET, READ PHONE STATE,
ACCESS NETWORK STATE, and WRITE EXTERNAL STORAGE permissions are widely requested in both malicious and benign apps. But malicious apps tend to request more frequently on the SMS-related permissions, such as READ SMS, WRITE SMS, RECEIVE SMS,
and SEND SMS. Most malware request the RECEIVE BOOT COMPLETED permission.
This number is five times of that in benign apps, due tot he fact that malware often
starts background services without user’s intervention. Another interesting observation
is that malicious apps tend to request more permissions than benign ones [14].
Figure 2.33: Comparison of the top 20 permission requested by malicious and bening
Android apps [14]
Figure 2.34: An Overview of existing Android Malware (PART I: INSTALLATION AND ACTIVATION) 1 of 2 [14]
61
Figure 2.35: An Overview of existing Android Malware (PART I: INSTALLATION AND ACTIVATION) 2 of 2 [14]
62
Figure 2.36: An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS) 1 of 2 [14]
63
Figure 2.37: An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS) 2 of 2 [14]
64
2.2.3
65
Evolutions and Challenges
Based on the study “Dissecting Android Malware: Characterization and Evolution”
[14] we discovered that, in the rapid growth of Android Malware there are some variants
that make use of some stealth and advanced techniques to avoid detection demonstrating
Malware evolution. For Instance, let analyze the DroidKunFu case with all its variants.
The first version of this malware was detected in the June 2011 and , at that time, was
considered one of the most complex Android Malware, some months later, researchers
found other variants of the DroidKungFu Malware, in October 2011 they discovered the
4 version DroidKungFu4 and some months later again they discovered the fifth variant
DroidKungFuSapp. Also another variant was discovered called DroidKungFuUpdate.
Between this six variant, four of them contains encripted root exploits, they are located
under /assets dir and then executed at Runtime. Any different variants use a different
key to encrypt the exploit to avoid detection. For instance, DroidKungFu1 the file containing the root exploits is named “ratc” while in DroidKungFu2 and in DroidKungFu3
the file is named “myicon”. All variants of DroidKungFu malware receives command
from C&C servers, but changes the way to store the C&C addresses. In DroidKungFu1
the C&C server addresses are stored in plaintext in a Java Class, in DroidKungFu2 server
addresses are stored in plaintex in a Native Class, in DroidKungFu3 the addresses are
stored encrypted in a Java Class and in DroidKungFu4 they are encrypted in a Native
Class. If the rooting is succesfull, a Shadow app will be installed then DroidKungFu
can access arbitrary files in the phone and have the capability to install or remove any
packages. One built-in payload of DroidKungFu is to install. In DroidKungFu1, the
embedded app will show a fake Google Search icon but in DroidKungFu2 the embedded app is encrypted and not displaying any icon on the phone.// To make Detection
more diffucult, DroidKungFu use encryption to hide C&C server addresses , constant
strings, native payloads and shadows apps, also rapidly changes encryption keys, extensively use JNI interfaces to make more diffucult the static analysis. For instance,
DroidKungFu2 and DroidKungFu4 uses JNI to communicate with remote servers and
fetch commands [14]. There are some mechanism implemented by modern Malware
which make the analysis very challenging. Another example is AnserverBot malware,
in particular, it contains the name of 3 mobile AV Software com.qihoo360.mobilesafe,
com.tencent.qqpimsecure and com.lbe.security, and attempts to match them with those
installed apps on the phone and if one of the 3 AV is detected, AnserverBot stop itself
66
unexpectedly.
New Android Malware thwarte Software Detection using obfuscation through Encryption or dynamic loading or modification of code , in this way an application’s binary
code before and after an execution can be different. For instance, malware authors
can use native code using JNI executed directly by the processor and not by the DVM
making possible dynamic code manipulation. Android vulnerabilities are used by criminals to enhance the rights of malicious applications, which extends their capabilities
and makes it more difficult to remove malicious programs. For example, to bypass Android Integrity check, malware uses the “Master Key Vulnerability” embedding unsigned
executable files in Android installation packages.
Chapter 3
Malware Detection
Methodologies
In this Chapter we discuss about a number of Detection Methodologies available in literature, as we can see, there are a substantial amount of still unsolved problems. One
primary line of defense is given by the Security Architecture of the device, for example
by the permissions system which restrict apps privileges. The user can grant or not the
privileges to the app and sometimes without fully understanding what each permission
means. This behaviour can lead to serious risks like data Exfiltration, data leakage such
as one’s location or the list of contacts or the SMS. If a malware get it way into a device
it is not clear how it is possible to detect its presence [33]. Following we also analyzing AV technologies and their limit to Detect Malware. Signature-based antimalware
techniques have some limitations in fact, they can only Detect Malware where Signature is available and are less efficient to detect polymorphic and metamorphic viruses.
Anomaly-based approaches are promising but are not efficintly possible to adapt to
smart devices [15]. More recent technologies aren’t just signature based but uses an hybrid approach. We also evaluate commercial mobile anti-malware products for Android
and test how resistant they are against various common obfuscation techniques. Some
papers goes in the direction of exploring Call-Graph for Malware variant Analysis. The
real challenge of using graphs is to show similarity in polynomial time. Given 2 Graph,
they are isometric if they share the same structure and same functional properties, but
the isomorphism problem take a NP time. To solve this, in literature there are some
interesting researches that uses features of the Graph instead of the whole inter o intra
67
Chapter 3. Malware Detection Methodologies
68
procedural Graph that gratly simplify the isomorphism problem. Considering that malicious functionality of an Android application often concentrates on only a small number
of its functions and second, considering the malware landscape of payload reuse.
3.1
Malware Detection Techniques
Following we analyze limits and weakness of the techniques available in literature, we will
refer to the “Detection of Mobile Malware in the Wild ” [19] and “Smartphone Malware
Detection: From a Survey Towards Taxonomy” paper [15].
A Malware Detector is a system responsible to determine if a program has a malicious
behaviour. In this paper [15] they classified Smartphone malware detection techniques
according to well defined three rules:
Reference Behaviour Rule
Malware Detectors identify malicious behaviours using Reference Behaviour, according to that they classify detection techniques in 2 main group: Signature-based and
Anomaly-based. Signature-based tecniques use known malware signature or pattern while
Anomaly-based techniques use a “normal” behaviour defined in the “training” phase and
use this normal model to identify malicious programs [15].
Analysis approache Rule
According to this Rule, there are 2 different Analysis approaches: Static-Analysis and
Dynamic Analyisis. Static Analysis studies program behaviour without executing it.
This kind of Analysis is characterized by some stages: the app unpacking, disassembling
and analysis with fuatures extraction. Dynamic Analysis studies program behaviour
executing it in an isolated environment such as a VM or an emulator, collecting at runtime some information such as events, system calls and monitoring program behaviours
Malware Behaviour Representation Rule
Malware detection techniques use a variety of program behaviour representations, for
example: hash values, hexadecimal bytes, sequence of run-time events and structural
69
properties. Malware Detection techniques can be classified into Static-signature and
Behaviour-signature, behaviour-signature can also be classified into Static-behaviour and
Dynamic-behaviour [15].
Figure 3.1: Malware detection Technologies Taxonomy [15]
3.1.1
Signature-based Detection Techniques
These Detection techniques model malicious behaviours of the malware in form of Signature, are unique values that indicate the presence of malicious code. These Signatures
are then stored in a database and are used in the Detection process. To address effectiveness, the Signature Database must be regularly updated and this could lead to some
“gap” where malware is not recognized because the Signature Database it isn’t updated
yet. The Signature could be Static or Behaviour according to the Malware Behaviour
Representation Rule.
70
Static-Signature based technique
The most common used Signature could be the series of bytes in the file or a cryptographic hash of the file or its sections. This technique is the one used by commercial
AV software. For instance, a most common used Signature in this technique is the Byte
Signature of the malware consisting in the sequence of Hexadecimal/Bytecode very
common in that malware. Another Signature is the Hash Signature created by a hash
function on particular instruction like MD5 or SHA-1 but the main drawback is that
if a string or instruction change a bit, the Hash Signature changes.// Static-Signature
based techniques [15] are very efficient and rielable in the Detection of new Malware and
helps malware Detectors to run fuster consuming few resources. This techniques can
be circumvented by using packers, obfuscation techniques, self-modifying code, means
recycling malware with a sifferent signature.
Behaviour-Signature based technique
Behaviour Signature use semantic and Dynamic interpretation to Detect Malware and
it is resilient to detect self-modifying code, encrypted or packed code. Analysis Approach Rule classify behaviour signature into static behaviour signature and dynamic
behaviour signature. The static behaviour signature technique is based on static code
analisys, extraacting features and information in a given executable file without executing them. The main advantages are for example, the ability to detects entire family
of malware variants with one signature enumarating a priori all the execution paths,
the limits are that the feature extraction could be a computational expansive process
because the disassembling and after applying complex classification algorythm. While
the dynamic behaviour signature technique is based on the dynamic features extraction
and monitoring the “real” behaviour of the malware to model a Behaviour Signature to
compare in the Detection Process:
71
Figure 3.2: Dynamic Signature Extraction [15]
Dai et al [34] proposed a technique that use API interception to build a FSA and a
PDA is used to describe the code samples candidates to be analyzed, this technique
matches malicious signature by computing the interception between FSA and PDA and
is able to detect malware also after packing. Bose et al [35] proposed an approach that
model the application behaviours using temporal logic of causal knowledge (TLCK), The
behaviour monitor agent monitors the application behaviour to construct behaviour signatures online from APIs calls. They use a machine learning classifier of type SVM to
detect malware, the technique tested on a symbian phone has more than 96% accuracy.
In another work, Kim et al [36] proposed a signature-based power detection approache,
a power monitor agent monitors the mobile device and taking samples of power consumption which are used to build a power consumption models then the data analyser
receives the power consumption history from the monitor agent and extract a unique
pattern to drive a power signature. This power Signature is then compared with other
signature contained in a database, it has high rate (up to 95%) of detection accuracy and
high true positive rate (99%) in classifying mobile malwares. Also Dixon and Mishra
[37] proposed a power-signature based malicious code detection technique, modeling individual power consumption profiles, focuses on abnormalities in power consumption
introduced by malware.
3.1.2
72
Anomaly-based Detection Techniques
Anomaly-based Detection Techniques are based upon 2 stages: The training stage and
the classification stage. During the training stage, a profile of a system normal behavivour is constructed, during the classification stage, these “normal” profiles are compared within the profiles of other applications to Detect deviation from these profiles.
This technique can also detect 0-Day attacks but on the other side, it is a very resource consuming technique in term of memory usage, time, CPU computation, and an
inadeguate profile of the “normal” behaviour can leads to high false positives rate.
Dynamic Anomaly-based technique
In the Dynamic Anomaly-based technique, the “normal” behaviour are modeled from
the informations and traces extracted during program execution. For instance, Buennemeyer et al [38] propose a techinque based on the battery constraints to detect malware
activities. The sensing system constantly monitor device power consumption for detect
anomalies when the values exceding the system’s dynamic threshold value. There are
other research works in literature using the Dynamic Anomaly-based technique, for instance, Schmidt et al [39] proposed a framework to monitor smartphone by collecting
dynamic features describing the system state such as CPU usage, the number of running processes and send all these informations to the Remote Anomaly Detection System
(RADS), a system which collect and store inside a db the received features and running
some machine learning algorythms to classify the “normal” or “malicious” behviour
Static Anomaly-based technique
Static Anomaly-based technique, the static program features such as execution paths,
structural properties are extracted and examinated to find the “anomaly” behaviour.
This technique detect malware before its execution. There are some interesting research
works in literature , for instance, Schmidt et al [40] have extracted static information
from Linux ELF file such as function calls, the function call are then compared with malware samples for the classification using Decision Tree learner (DTL), Nearest Neighbor
(NN) algorithm, and Rule Inducer (RI) with 96% detection-accuracy and with 10% false
positives.
3.1.3
73
Application Permission Analysis
There are also some works focusing on application permissions. As we have seen, they
play an important role in Android Security model. Will be cited an application certification framework for Android, Kirin [27], it extracts its security configurations and checks
them against the security policy rule that it already has, If an application fails to pass
all the security policy rules, Kirin can either delete it or alert the user. For instance,
Listing 3.1: Kirin policy example [27]
An application must not have PHONE_STATE,
RECORD_AUDIO, and INTERNET permission labels
This rule ensures that an application doesn’t record audio or access the Internet while
the user is talking on the phone. All the security rules are encoded with the KIRIN
SECURITY LANGUAGE (KSL). Testing Kirin with 311 popular applications revealed
that the rules flagged 12 applications with eavesdropper-like characteristics. Another
interesting work is those by Barrera [41], with self-organizing maps (SOMs) to visualize
the relationship between the applications and the permissions requested but without
focusing on their implications. Barrera et al. discovered that 93% of the free and 82%
percent of the paid applications had at least one dangerous permission request. In
[16] Liu et al. proposed a two-layered permission based detection scheme for detecting
malicious Android applications based on a decision-tree. The feature they used are
Permission Pairs and a J48 classifier to classify the applications, concluding that a
permission-based malware detecting mechanism can be used as a filter in the first step
to identify malicious applications with its high detection rate.
74
Figure 3.3: Top 20 requested permissions which has the most different requested
rate in different dataset. The ordinate is the difference between the requested rate in
malware dataset and the requested rate in benign dataset [16]
In [17] they proposed a static analysis of android malware files by mining prominent
permissions. Permissions are extracted from apk’s and then is divided into train and
test set. During data preprocessing phase, from the training samples they determine
all those permissions that are used by both benign as well as malware applications and
filter out features that have minimum probability in identifying the target classes. From
these synthesized permissions, they create four categories of permission sets [17] :
- Malware and benign features (M ∪ B)
- Common to malware and benign features (M ∩ B)
- Discriminant malware features (M k B)
- Discriminant benign features (B k M)
The Feature selection is performed to extract a subset of k best features from a set of
large feature space consisting of n features. They used Bi-Normal separation and Mutual
Information (MI) as the feature selection methods, with a feature lenght of 15 features.
75
Figure 3.4: Difference in the frequencies of 18 selected permission in malware and
benign .apk files [17]
As evaluations reported, this method is useful also for initial malware classification.
Figure 3.5: COMPARATIVE ANALYSIS OF BI-NORMAL SEPARATION AND
MUTUAL INFORMATION FEATURE SELECTION METHOD [17]
In [18], Liang and Du proposed a malware detection scheme based on permission combinations. They developed a tool called k-map which inspects the combinations of k (k≥1)
permissions instead of a single permission, K-map iteratively reads application package
files as input and calculates the permission request frequency in the form of k (k = 1, 2,
3, . . . ) permission combinations [18]. K-map requires two kinds of datasets to generate
the permission maps, i.e., malware samples and benign application samples. The rule
76
sets selection process is to extract the permission combinations that are requested more
often by malwares than by benign Apps:
Figure 3.6: TOP 5 PERMISSION COMBINATIONSWHEN K = 5 [18]
Figure 3.7: TOP 5 PERMISSION COMBINATIONSWHEN K = 6 [18]
77
78
Another interesting works in literature is [42]. Bartel, Klein and Le Traon starts from
the failures of the static analysis approaches presenting and advanced class-hierarchy and
field-sensitive set of analyses to extract mapping between API methods of the framework
and the permissions they require. For end-user applications, their evaluation revealed
that 94/742 and 35/679 applications crawled from Android application stores indeed
suffer from permission gaps (permissions declared but not used) [42].
3.1.4
Cloud-based Detection Analysis
Because of limited computational power and energy sources, smartphones don’t carry
fully featured security mechanisms. There are some cloud-based approaches in literature,
they move Security Analysis on the cloud hosting numerous replicas of Android Emulator to perform some behavioural-based analisys. For instance, Paranoid Android (PA),
consist of a tracer, located in the smartphone, that records all the necessary information required to replay the mobile application’s execution. This execution information
are transmitted to a cloud-based replayer which replay the application execution and
performs some security checks such as dynamic analisys, memory scanners, system call
anomaly detection and similar. PA consist also of a proxy to cache informations preventing in this way , the phone to ritransmitting the data back to replayer. PA incurs some
significant overhead, increases the CPU load by 15 percent, and consumes 30 percent
more energy during heavyweight tasks, for example, tracing a single system call such
as read() takes 0.7 milliseconds in the user space, but less than 0.1 millisecond in the
kernel [19]. Another example is Crowdroid, it consist of a moible lightweight application
that monitor syscall and then send them to the cloud where is implemented a machine
learning clustering technique which determinate the malicious or benign app nature.
If the sample size is very small, the false positive rate is high. An important privacy
implication of this techniques is that users have to send their applications behaviour to
third part.
Figure 3.8: Cloud-based malware protection techniques: (a) Paranoid Android and
(b) Crowdroid [19]
79
3.2
80
Malware trasformation
In this paragraph it is addressed the trasformation that may be applied to malware
sample to avoid detection, preserving their malicious behaviour. This is an important
test by which is possible to evaluate the effectiveness of Android anti-malware tools.
The transformations are classified on the basis of the work “Catch Me If You Can:
Evaluating Android Anti-Malware Against Transformation Attacks” [20], in this work,
authors classify malware transformation by 3 categories:
- trivial which do not require code level changes.
- DSA which result in variants that can still be detected by static analysis.
- NSA which can render malware undetectable by static analysis.
Let’s examine each transformation class [20]:
trivial
Repacking
Android applications are signed jar files, they can be unzipped and then reassembled
with the tools available in the Android SDK. Once repacked, android application must
be signed but the original developer’s key is not available, and for this are signed with a
custom key. Each detection methodology that match the developer’s key or match the
checksum of the entire package are ineffective against this kind of transformation.
Disassembling and Reassembling
The compiled Dalvik Bytecode is presented under the dex format. A dex file, can be
disassembled and then reassembled and the various items (classes, methods, strings, and
so on) inside it can be re-arranged, re-ordered, for this reason, methods that check the
whole dex file can be avoided by this transformation.
81
Changing Package Name
Each application is identified by a unique package name defined in the android Manifest
file. We can change this package name in everything we want. Methods based upon
Package Name detection can be avoided by this simple transformation.
Transformation Attacks Detectable by Static Analysis (DSA)
Identifier Renaming
Classes, methods and fields in bytecode can be renamed. For instance, the ProGuard
tool provide identifier renaming. The ProGuard tool shrinks, optimizes, and obfuscates
your code with the result of a smaller apk file harder to reverse engineer because by
default, compiled bytecode still contains a lot of debugging information: source file
names, line numbers, field names, method names, argument names, variable names, etc.
Figure 3.9: Before
ProGuard
Figure 3.10: After ProGuard
ProGuard in this situation removes the human-readable static member field in smali
assembly code.
Data Encoding
The Dex file contains all the strings used in the code, these strings can be used as a
“signature” against malware, for this reason, malware authors can encode strings, for
instance:
82
Figure 3.11: Before
ProGuard
Figure 3.12: String Encoding
Call Indirections
Call indirections is an advanced trasformation in which authors manipulate the control
flow of the application by inserting dummy methods which call the proper methods
and this can be done for all the method calls. In this way malware authors can avoid
detection methodologies based on Call Graph-based signatures.
Code Reordering
This method consist of change the order of the instruction by inserting goto statements
preserving the program execution flow. Following an example of Code Reordering from
the paper [20] :
Figure 3.13: Before
ProGuard
Figure 3.14: Code Reordering
Junk code Insertion
This method consist of inserting code sequences that not affects the program semantics.
This method can avoid detection methodologies based on analyzing particular bytecode
sequences. The junk code can be just a sequence of nop instruction.
83
Figure 3.15: An example of a junk code fragment [20]
Encrypting Payloads and Native Exploits
In Android, native code is accessed, as native libraries, through the JNI bridge. Some
malware, for example DroidDream pack its payload encryptedin a non standard location in the application package. This payload is decripted at runtime. These payload
can also be stored encrypted. As the paper cite [20], payload and exploit encryption
are categorized in DSA because a static detection is still possible based on the main
application’s bytecode.
Composite Transformation
Obviously these transformation can be combined togheter to obtain more sophisticated
transformations.
Transformation Attacks Non-Detectable by Static Analysis (NSA)
As cited by the paper [20], these transformation can break all kind of static analysis, it
is still possible to perform some behavioral analysis or emulation-based.
Reflection
Reflection is commonly used by programs which require to examine or modify the runtime behavior of applications running. It can be used by malware authors, for example,
to avoid method-based detection techniques by calling method using the name of the
method. A subsequent encryption of the method name can make it impossible for any
static analysis to recover the call.
84
Bytecode Encryption
Bytecode Encryption is used mainly to avoid static analysis. An encrypted dex file is
included in the application package and then, when an application component is created
it first call a decryption routine that decrypt the dex file and loads it via user-defined
class loader. For instance, in Android the DexclassLoader provides the functionality to
load arbitrary dex files. Parts of the encryption keys may even be fetched remotely [20].
3.3
85
Evaluating Android anti-malware Products
Google’s chief security engineer for Android, Adrian Ludwig, told journalists that most
users shouldn’t bother with anti-virus. Ludwig reportedly said:
“ I don’t think 99% plus users even get a benefit from [antivirus]. There’s certainly no reason that they need to install
something in addition to [the security we provide]. . . If I were
to be in a line of work where I need that type of protection it
would make sense for me to do that. [But] do I think the average user on Android needs to install [anti-virus]? Absolutely
not.
In this paragraph we evaluate some commercial Android anti-malware products to measure the effectiveness of that solutions. This evaluation is important to understand the
actual limitation and take inspiration in the development of new solutions. There are
various AV products for the Android platform, before going to evaluate some of these
products, we have to understand their limiting factors.
The Android platform, instead of others OS , where the software is considered as
“trusted” a priori, however, a file system based sandbox ensures that each installed
app may only access its own data and not that of the user or other apps, unless explicitly permitted by the user. For this, an Android AV cannot list for directories contents,
file-system analysis is not possible and also hooking is not allowed by the sandbox. Thus,
dynamically downloaded code will not be found by AV.
Android AV applications cannot remove malware or place malware into quarantine.
Users have to remove manually by try forcing uninstall by the application manager or
often is required to the reinstall the device’s software image, this is necessary if malware
gains elevated privileges and installs itself to the system partition, which is only readable
by all other software.
To evaluate Android AV products is important to refer to indipendent Test Lab such
as AV-test at www.av-test.org . In one of their latest endurance test, AV-test [21], examined 36 Android Security apps, some of them are free to use but others require users
86
to pay an annual fee for some extra premium functions. A total of 8 apps achieved
the maximum possible score of 13 points in the endurance test. With regards to the
last year (2013), not even one of the apps tested was able to achieve 100 percent in the
Protection category, one year on, the results are much better: 14 apps in the Protection
category with a detection rate of 100 percent:
Figure 3.16: Av-test carried out in the 2014: Detection rates in the endurance test
[21]
In the [20] of the 2014 there is an interesting study on Android Anti-Malware products
evaluated on the basis of the different kind of trasformation discussed before. Following
we reported their results. The product evaluated are:
87
Figure 3.17: ANTI-MALWARE PRODUCTS EVALUATED [20]
The malware samples used for testing are:
Figure 3.18: MALWARE SAMPLES USED FOR TESTING ANTI-MALWARE
TOOLS [20]
The Evaluation are performed after trasforming a malware with one (or combination)
of the following:
Figure 3.19: Trasformation Keys [20]
88
Following the Detection results about a DroidDream sample, the “X” represent a failure
in the Detection:
Figure 3.20:
DROIDDREAM TRANSFORMATIONS AND ANTI-MALWARE
FAILURE [20]
Following the Detection results about a FakePlayer sample, the “X” represent a failure
in the Detection:
Figure 3.21: FAKEPLAYER TRANSFORMATIONS AND ANTI-MALWARE FAILURE [20]
Following the Evaluation Summary of all solutions:
89
Figure 3.22: EVALUATION SUMMARY [20]
The most important finding in that paper, show that all the anti-malware products
evaluated are susceptible to common evasion techniques and may defeat to even
trivial transformations not involving code-level-changes.
More specifically, their findings include:
• Finding 1: All the anti-malware product evaluated are vulnerable to common
transformation. For example, Geinimi variants have encrypted strings, the DroidKungFu malware uses encrypted exploit code. Transformation like Identifier Renaming or Data Encryption are easily available with free or commercial tools.
The most important consideration is that such signatures do not capture
semantic properties of malware such as data and control flow.
• Finding 2: At least 43% signatures are not based on codelevel artifacts. These
signatures are based on file-name, dex checksum, information obtained by the
PackageManager API or like AVG only by the content of the Manifest File
• Finding 3: 90% of signatures do not require static analysis of bytecode. Only one
of ten anti-malware tools was found to be using static analysis. Names of classes,
methods, and fields, and all the strings and array data are stored in the classes.dex
file as they are and hence can be obtained by content matching [20].
• Finding 4: Anti-malware tools have evolved towards content-based signatures over
the past one year. Last year, 45% of the signatures were evaded by trivial transformations. their present results show a marked decrease in this fraction to 16%.
New solutions moved to content-based matching such as matching Identifier and
Strings.
90
Their results demonstrate that new signatures still lack resilience against polymorphic malware.
3.4
91
Malware Analysis using Call Graphs
Christodorescu et al. [43] describe a technique for semantics based detection, to overcome the limitations of the Pattern Matching Approach that ignore the semantic of
the instruction. The Pattern-matching is a pure syntactic and ignores the semantics
of instruction. A Common technique used by malware to evade Detectors is Program
Obfuscation, for example a virus can morph by encrypting its malicious payload and
decrypting it at runtime. Polymorphic viruses can also obfuscate its decryption loop
by code transposition, Metamorphic instead avoid Detection by obfuscating the entire
malware and when they replicate, they change their code in different ways.
Their algorithm [43] first unify nodes in a given program with nodes in a Signature Template, the signature template abstracts data flows and control flows, which are semantics
properties of a program. It is a Data-Flow based technique and it is not vulnerable to
any of the transformations, have a very low false positive rate, as authors cited but developing signature templates manually may be challenging. The use of code Encryption
and reflection can still defeat also this kind of Detection because they hide the edges in
the call graph. To address these limitations, the use of real-time Dynamic Analysis is
essential.
Malware Classification and Detection problems can be mean the detecting of novel instances of malware or the detecting of new malware variants. The detection of novel
instances of malware can be addressed by statistical machine learning approache while,
the detection of new malware variants is based upon the concept of ”similarity” or ”clone
Detection” against a repository of malware samples. Polymorphic malware variants, use
code modification, and obfuscation to hide themself, this means that, considering more
instances of polymorphic malware, the content at byte-level or the content at instruction level may change. This drive to an important limitation for static-based approaches,
but, it is observed that the Control Flow is more invariant in polymorphic
malware [44]. Control Flow is the Execution Path a Program may take, the Call
Graph represent the inter-procedural Control Flow (Function Call Graph) while
the intra-procedural control flow (Control Flow Graph) is represented by a set of
control flow graphs with one graph per procedure.
A Call Graph for a program is a set of nodes and edges such that there is one node for
92
each procedure and if procedure c may call procedure p, then there is an edge from c to
p (place where a procedure is invoked). A Control-Flow Graph is defined as a directed
Graph G = (V,E) in which vertices u,v ∈ V represent basick block and an edge e ∈ E :
u → v represent a possible Control Flow from u to v. A basic block consist of a series of
instruction without jump instruction in the middle while directed edges between blocks
represent jump in the Control Flow such as call, return instructions or unconditional
jumps.
The use of Reflection hides the edges in the Call Graph and if the methods name are
also encrypted these edges are opaque to static analysis [44].
3.4.1
Similarity Detection using Call Graphs
Future attempts to define valid and reliable approaches for the Detection of Malware,
may take into account the limitation of static-based analysis, some papers goes in the
direction of exploring Call-Graph for Malware variant Analysis. The real challenge of
using graphs to detect code clones is that the byte code streams contain no higher level
semantic knowledge about the code, making this approach vulnerable to code modifications.
Structural Malware detection starts from 2 consideration : first, malicious functionality
of an Android applicaton often concentrates on only a small number of its functions
and second, similar malicious code is often found throughout the malware landscape as
attackers reuse existing code to infect different applications [22].
Research must investigate methods that make using graphs feasible for large scale malware detection [? ], considering that the ability to cluster similar samples together will
make more generic detection techniques possible [45] and it made possible to implements
a more generic malware Detection schemes. In the [45] they compute pair-wise graph
similarity scores via graph matchings which approximately minimize the graph edit distance. Experiments show that it is indeed possible to accurately detect malware families
via call graph clustering [45]. In general, graph matching involves discovering structural
isomorphism between graphs through one of the following techniques [45]:
93
• Finding Graph Isomorphisms.
• Detecting Maximum Common Subgraph.
• Finding Minimum Graph Edit Dinstances.
Figure 3.23: “Android.Trojan.FakeInst.AS” from the FakeInstaller Malware Family
[22]
Figure 3.24: Complete function call graph of “Android:RuFraud-C” from the malware
family FakeInstaller. Dark shading of nodes indicate malicious structures identified by
the SVM [22]
Call Graph Isomorphism Analysis is also used for Clone Detection, Software Plagiarism
and Software Vulnerabilities in general. In the paper [23] they implement a method
for the Detection of so called “purpose-added” apps, to achieve both goals, they use a
geometry characteristic, called centroid, of dependency graphs to measure the similarity
between method(code fragments) in two apps. Then they synthesize the method-level
similarities and draw a Y/N conclusion on app (core functionality) cloning. In the [23],
they use the so called “centroid” instead of the whole CFG that can be viewed as the
94
“mass center” of the CFG. For achieving performance, they only check the Centroids,
under the consideration that a small change in a method means a small change in
its Centroid. They define the Centroid Difference Degree (CDD) and the Methods
Difference Degree MDD metrics for methods comparison. They use the longest common
subsequence LCS to compute the similarity between methods. For two methods m1 and
m2, they view each method as a sequence of opcodes. Secondly, adding one node in
small CFGs (with less than 4 nodes) may change the centroids a lot. But for big CFGs
(with 4 nodes or above), centroid-based approach is effective [23].
Figure 3.25: Two corresponding methods in two app clones are from different markets.
The first method has one more function call to initialize several ads [23]
3.4.2
Software Clones Taxonomy
An important clone taxonomies can be found in [46] , as they stated, two code fragments
can be similar based on the similarity of their program text or they can be similar in
their functionalities without being textually similar. They consider clone types based
on the kind of similarity two code fragments can have.
• Textual Similarity: Based on the textual similarity we distinguish the following
types of clones
95
- Type I: Identical code fragments except for variations in whitespace (may
be also variations in layout) and comments.
- Type II: Structurally/syntactically identical fragments except for variations
in identifiers, literals, types, layout and comments.
- Type III: Copied fragments with further modiØcations. Statements can be
changed, added or removed in addition to variations in identiØers, literals,
types, layout and comments
• Functional Similarity: If the functionalities of the two code fragments are identical
or similar i.e., they have similar pre and post conditions, we call them semantic
clones and referred as Type IV clones.
- Type IV: Two or more code fragments that perform the same computation
but implemented through diÆerent syntactic variants.
It is important to consider that each clone type represent a different level in the sophistication of the Detection Process, increasing from type I to type IV. The detection of
Type IV clones is the hardest even after having a great deal of background knowledge
about the program construction and software design
96
Type I Clones
In Type I clones, a copied code fragment is the same as the original, there might be
some variations in whitespace (blanks, new line(s), tabs etc.), comments and/or layouts.
Type I is widely know as Exact clones. They are “textually similar”:
Listing 3.2: Type I Clone
if (a >= b) {
c = d + b; // Comment1
d = d + 1;}
else
c = d - a; //Comment2
An exact copy clone of this original copy could be as
follows:
if (a>=b) {
// Comment1’
c=d+b;
d=d+1;}
else // Comment2’
c=d-a;
97
Type II Clones
Type II clone is a code fragment that is the same as the original except for some possible
variations about the corresponding names of user-defined identifiers (name of variables,
constants, class, methods and so on), types, layout and comments [46].
Listing 3.3: Type II Clone
if (a >= b) {
d = d + 1;}
else
A Type II clone for this fragment can be as follows:
if (m >= n)
{ // Comment1’
y = x + n;
x = x + 5; //Comment3
}
else
y = x - m; //Comment2’
98
Type III Clones
In Type III clones, the copied fragment is further modified with statement(s) changed,
added and/or deleted.
Listing 3.4: Type III Clone
if (a >= b) {
d = d + 1;}
else
If we now extend this code segment by adding a statement e =
1 then we can get,
if (a >= b) {
e = 1; // This statement is added
d = d + 1; }
else
99
Type IV Clones
Type IV . Functional similarity reflects the degree to which the components act alike,
i.e., captures similar functional properties and similarity assessment methods rely on
matching of pre/post-conditions clones are the results of semantic similarity between
two or more code fragment [46].
Listing 3.5: Type IV Clone
Fragment 1:
int i, j=1;
for (i=1; i<=VALUE; i++)
j=j*i;
Now consider the following code fragment 2, which is
actually a recursive function that calculates the
factorial of its argument n.
Fragment 2:
int factorial(int n) {
if (n == 0) return 1 ;
else return n * factorial(n-1) ;
}
This Type IV semantic clones although one is a simple code fragment and another is a recursive function with no lexical/syntactic/structural similarities between the statements
of the two fragments.
100
Centroid-based approach [23] although is extremely effective to detect Type 1, Type
2 and Type 3 clones, it may not be effective to detect Type 4 clones. Considering
Type 1, Type 2 and Type 3 clones the Function Call Graph would remain the same, also
changing the Identifiers name, the Control Flow remain the same. Also adding / deleting
some instructions would remain the same some sequences of opcodes (Opcodes Common
Sequences). About Type 4 clones, they are difficult to Detect by Static Analysis tools
but, as we can see further, they may be irrelevant for Malware Analysis.
3.4.3
101
The Isomorphism Problem
An exact graph isomorphism for two graphs, G and H [47] is a bijective function f(v)
that maps the vertices V(G) to V(H) such that for all i,j ∈ V(G), i,j ∈ E(G) if and
only if (f(i) ,f(j)) ∈ E(H).
Graph Isomorphism captures the notion that some objects have “the same structure”,
two isomorphic graphs enjoy the same graph theoretical properties. Unfortunately the
graph isomorphism problem is one of few standard problems in computational complexity theory belonging to NP, but not known to belong to either of its well-known (and,
if P 6≡ NP, disjoint) subsets: P and NP-complete. It is known to be NP-complete. For
this reason detecting the largest common subgraph for a pair of graphs is closely related
to graph isomorphism as it attempts to find the largest induced subgraph of G which
is isomorphic to a subgraph in H [45] and a misure of Minimum Graph Edit Dinstance
could be a similarity measure.
For a definition of Graph Similarity we refer to that in [45]:
(Graph similarity): The similarity σ(G, H) between two graphs G and H indicates the
extent to which graph G resembles graph H and vice versa. The similarity σ(G, H) is
a real value on the interval [0,1], where 0 indicates that graphs G and H are identical
whereas a value 1 implies that there are no similarities. . It is important to considerate
what “similarity” means for call Graph in order to bring out relevant strucural properties
between malware samples. The real challenge of using graphs is to show similarity in
polynomial time. Given 2 Graph, they are isometric if they share the same structure
and same functional properties, but the isomorphism problem take a NP time. To
solve this, in literature there are some interesting researches that uses features of the
Graph instead of the whole inter o intra procedural Graph that gratly simplify the
isomorphism problem. These features are the Centroid Difference Degree or Methods
Difference Degree [23], the count of common identical substructures in two graphs [22]
with the neighborhood hash graph kernel function, q-grams and k-subgraphs to create
features vectors [48].
Chapter 4
Detecting Android Malware
Variants using Opcodes
Frequency Distribution and Call
Graphs Isomorphism Analysis
In this Chapter it will be described our contribute in the Android Malware Analysis.
Our contribution is placed before actual Malware Detection Methodologies, in fact it will
not be another method for Detecting Android Malware but it will be a new methodology to study Malware, Detecting Malware variants, identify similarities accross malware
families and identify similarities at method-level in a method-reuse scenarios . This
methodology is placed before any existing Malware detection Methodologies to enhance
actual solutions or for example to build Malware “signs”, to track malware families evolution.
It is not always possible to identify the malware payload, due to advanced obfuscation
techniques applied on the apk in a code-reuse scenario or more in general variation in
its structure. But however is possible to identify invariants characteristic components.
And this is our direction.
Our methodology is, however, guided by the Opcode Frequency Distribution Analysis,
in fact all the analysis stage starts from the 10 more similar returned vectors. Will be
described this methodology, which is realized across two directions. As first direction we
102
Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution
103
compute the relative Opcodes frequency distribution as a measure to identify software
clones. With this detection methodology we can implement a more resilient analysis to
detect type I, Type II, Type III code clones. The realized methodology can perform
the Opcodes Frequency Distribution Analysis at Method-level also, in order to conduct
code-clone detection at more fine-graned level, in this way is it possible to identify Type
I and type II code clones at method-level and this approach seems to be quite effective
considering Code-Reuse Scenarios in the Malware Land.
Second, after extracting the Inner Function Call Graph and the Inner Control Flow
Graph our methodology tend to identifying isomorphism features between Android Malicious Applications, common subgraphs of both CFG and FCG to detecting code-clones.
The call Graph are then mapped in a Vector Space with the use of the Adjacency Lists
and after, these lists will be compared in an n-grams analysis stage to outline structural
similarities in the Control Flow. Finally a “Similarity Score” will be computed by the
Classifier, and assessed in the Experimental phase to confirm the value of this methodology.
When we talk about Type X clones, we refer to code-clones at bytecode-level, because,
we know , that a small change in an high-level code means a big change at low-level,
both for optimization reason or for high-level API. Also for this reason we starting our
methodology from the Opcodes Frequency Distribution, that it is more robust to detect
changes at low-level.
104
4.1
4.1.1
The Data-Set
The Malicious Data-Set, Malware Variants
Figure 4.1: Android Drebin logo
The system extracts structural properties from the candidate apk and then measures
the distance of these properties against a Malware Database build above the Android
Drebin (http://user.informatik.uni-goettingen.de/ darp/drebin/ ) [49] [50] data. The
dataset contains 5,560 applications from 179 different malware families. The samples
have been collected in the period of August 2010 to October 2012.
Each sample has as unique Id its own sha256 sum , each sample has a different id, this
means that each sample can be considered a malware variant inside that family. We
have retrieved 4544 apk‘s of 178 different well-know family from the Drebin db and then
extracting structural properties like Inners CFG/FCG, Opcodes Frequency Distribution
and method-level Opcodes-frequency Distribution. All these data are stored in a local
database , which is considered our starting dataset.
Lets first examinate the data-set in terms of its composition and then we’ll take a look
on our builded local db and how it was structured.
4.1.1.1
Malware families
The Malware samples used to build our local dataset belong to 178 well-know different
malware families:
105
Table 4.1: Malware Drebin Project Families
Stealthcell
Koomer
GGtrack
Adsms
YcChar
FaceNiff
Nickspy
Ksapp
SMSZombie
FakeDoc
Loicdos
Stealer
FakeFlash
RediAssi
Anti
Copycat
DroidRooter
Geinimi
SeaWeth
Acnetdoor
Tesbo
SpyBubble
FarMap
AccuTrack
Nandrobox
Kmin
EICAR-Test-File
UpdtKiller
Flexispy
Spy.GoneSixty
Fakengry
Glodream
Moghava
FinSpy
SpyMob
Proreso
Exploit.RageCage
Sakezon
Generic
Zsone
TrojanSMS.Hippo
106
LifeMon
Yzhc
Saiva
SMSSend
SmForw
SMSBomber
Lypro
RootSmart
Ackposts
Spitmo
Dabom
Fjcon
Updtbot
Rooter
EWalls
Maxit
Pirates
Coogos
Lemon
Fakelogo
Dogowar
Mobinauten
Penetho
Nisev
DroidDream
Ansca
GPSpy
JS Exploit-DynSrc
TheftAware
NickyRCP
Netisend
Placms
Stiniter
Antares
Typstu
Bosm
Mobsquz
FoCobers
CgFinder
JSmsHider
Pirater
RuFraud
Loozfon
BeanBot
Gamex
Sonus
107
Vdloader
SpyPhone
Nyleaker
FakeInstaller
Aks
Adrd
Whapsni
Zitmo
FakeNefix
CrWind
Qicsom
FakeRun
Iconosys
DroidSheep
Gonca
Gappusin
Xsider
Cawitt
Fidall
Cosha
Mania
Jifake
CellShark
SMSreg
ExploitLinuxLotoor
SuBatt
MobileTx
Bgserv
Fauxcopy
Gasms
Imlog
Kidlogger
Tapsnake
Hispo
MTracker
QPlus
Dougalek
Fatakr
RATC
Plankton
Steek
DroidKungFu
Hamob
Mobilespy
Booster
108
Spyset
CellSpy
PJApps
Fsm
Smspacem
Gapev
Arspam
Dialer
Spy.ImLog
TrojanSMS.Boxer.AQ
Fujacks
Trackplus
TigerBot
Ceshark
Anudow
Raden
SheriDroid
Biige
SendPay
TrojanSMS.Stealer
SmsWatcher
BaseBridge
MMarketPay
Maistealer
GinMaster
Sdisp
FakeTimer
Replicator
Kiser
PdaSpy
Ssmsp
TrojanSMS.Denofow
GlodEagl
SafeKidZone
Opfake
SmsSpy
SpyHasb
FakePlayer
Fakeview
Foncy
Boxer
Vidro
Gmuse
Spyoo
SerBG
Luckycat
109
4.1.2
The Trusted Data-set
However, regarding the “Trusted set”, we have selected about 1000 apk’s from the Google
Play Store in a more or less randomly way. By this channel, we are ensured that each
apk is “trusted” or at least it should be, this because google to keep malicious apps
off the official Android app store, submit the application to different detection tools.
One of these services is called Bouncer, it quietly and automatically scans apps (both
new and previously uploaded ones) and developer accounts in Google Play with its
reputation engine and cloud infrastructure. At Summercon 2012, Charlie Miller and
Jon Oberheide gave a talk on how to be possible to fingerprint Android’s Bouncer, they
submitted an Android app which had shell code included that allowed them to poke
around Bouncer while the submitted app was being analyzed. This code also connected
back and reported some interesting findings like Bouncer ip range, platform etc etc. After
knowing that Bouncer can be easily fingerprinted, it is not difficult to figure a malware
that can be avoid Bouncer Detection, for instance , this can be achieved through 2
different techniques:
• Delayed attack: The application can include malicious payloads in the submitted
app but behave benign when it is running in Bouncer. Once it gets onto a user’s
device, then starts to run malicious code.
• Update attack: No malicious code needs to be included in the initial installer.
Once the application passes Bouncer’s check and gets installed on a real user’s
device, then the application can either download additional malicious code to run
or connect to its remote control and command (C&C) server to upload stolen data
or receive further commands.
Everyway we consider as “Trusted” application provided by the Google Play Store. This
could be interesting for this study, for example to find malware variants between Google
Play Store applications.
110
4.2
The multi-staged Architecture
In this section i’ ll start describing the whole System Architecture with a top-down
approach starting from the system general overview and after focusing the attention on
the individual elements.
It is a multi-staged Architecture which each stage acts as a separate Analysis stage, the
System is realized in Software with some perl scripts.
Each Analysis stage extract and store metrics in a database for both providing measures
to the next stage or to any other further data-mining in a malware analysis environment.
As we’ll see, the system is composed by 2 fundamental parts, the first is the extraction
of the Opcodes Relative Frequency Distribution, by this, the system return the top 10
Vectors more similar than the candidate vector using the cosine similarity, after that,
these 10 returned vectors are used for the second part. The second part consist in the
n-gram analysis of the CFG and the FCG to find similar or cloned pattern, first the
Call Graphs are extracted and then are mapped in the vector-space to better compute
similarities. Finally, based on these measures, a classifier compute the final similarity
taking into account these computed “distances”. Following is provided the System
Architecture:
Figure 4.2: The Whole System Multi-staged Architecture
111
112
4.3
The Relative Opcodes Frequency Distribution
Our Heuristic, as we will see, start from the Relative Opcodes Frequency Distribution, for this we have first extracted and stored the opcodes frequency distribution
for each apk and then dividing it for the total number of Opcodes in the apk. For the
Opcodes Distribution we referred to the Official Dalvik Bytecode Set Table available
at source.android.com/devices/tech/dalvik/dalvik-bytecode.html. For each kind of operations, are assigned a category, for example, we have grouped all the operations on array
in the category arrayop or for example, we have grouped all the binary operation in the
category binop etc.. Following is the Bytecode set categories Table we have defined:
113
Table 4.2: Dalvik Bytecode set Categories
Category
nop op
moveobject
moveexception
move
Opcodes
nop
move-object, move-object/from16,
move-object/16
move-exception
return op
move, move/from16, move/16,
move-wide,
move-wide/from16,
move-wide/16, move-result, moveresult-wide, move-result-object.
return
returnobject
return-object
returnvoid
returnwide
return-void
return-wide
monitorenter
monitor-enter
monitor-exit
monitor-exit
goto
goto
goto16
goto/16
goto32
goto/32
invokevirtual
invoke-virtual
invokesuper
invoke-super
invokedirect
invoke-direct
invokestatic
invoke-static
invokeinterfaces
invoke-interface
Description
Waste cycles.
Move the contents of one objectbearing register to another.
Save a just-caught exception into
the given register. This must be
the first instruction of any exception
handler whose caught exception is
not to be ignored, and this instruction must only ever occur as the first
instruction of an exception handler.
move category includes all move/*,
all move-wide/* and all the moveresult/*
Return from a single-width (32-bit)
non-object value-returning method.
Return from an object-returning
method.
Return from a void method.
Return from a double-width (64-bit)
value-returning method.
Acquire the monitor for the indicated object.
Release the monitor for the indicated object.
Unconditionally jump to the indicated instruction.
Unconditionally jump to the indicated instruction with 16 bit offset.
Unconditionally jump to the indicated instructionwith 32 bit offset.
invoke-virtual is used to invoke a
normal virtual method (a method
that is not private, static, or final,
and is also not a constructor).
invoke-super is used to invoke the
closest superclass’s virtual method.
invoke-direct is used to invoke a
non-static direct method (that is, an
instance method that is by its nature non-overridable, namely either
a private instance method or a constructor).
invoke-static is used to invoke a
static method (which is always considered a direct method).
invoke-interface is used to invoke
an interface method, that is, on
an object whose concrete class isn’t
known, using a method id that
refers to an interface.
114
packedswitch
packed-switch
sparseswitch
sparse-switch
arrayop
aget, aget-wide, aget-object, agetboolean, aget-byte, aget-char, agetshort, aput, aput-wide, aput-object,
aput-boolean, aput-byte, aput-char,
aput-short
if-eq, if-ne, if-lt, if-ge, if-gt, if-le
iftest
instanceop
staticop
iftestz
cmpop
unop
binop
throw
iget, iget-wide, iget-object, igetboolean, iget-byte, iget-char, igetshort, iput, iput-wide, iput-object,
iput-boolean, iput-byte, iput-char,
iput-short
sget, sget-wide, sget-object, sgetboolean, sget-byte, sget-char, sgetshort, sput, sput-wide, sput-object,
sput-boolean, sput-byte, sput-char,
sput-short
if-eqz, if-nez, if-ltz, if-gez, if-gtz, iflez
cmpl-float (lt bias), cmpg-float (gt
bias), cmpl-double (lt bias), cmpgdouble (gt bias), cmp-long
neg-int, not-int, neg-long, not-long,
neg-float, neg-double, int-to-long,
int-to-float, int-to-double, long-toint, long-to-float, long-to-double,
float-to-int, float-to-long, float-todouble, double-to-int, double-tolong, double-to-float, int-to-byte,
int-to-char, int-to-short
add-int, sub-int, mul-int, div-int,
rem-int, and-int, or-int, xor-int,
shl-int, shr-int, ushr-int, add-long,
sub-long, mul-long, div-long, remlong, and-long, or-long, xor-long,
shl-long, shr-long, ushr-long, addfloat, sub-float, mul-float, div-float,
rem-float, add-double, sub-double,
mul-double,
div-double,
remdouble, binop/2addr, binop/lit16,
binop/lit8
throw
Jump to a new instruction based on
the value in the given register, using
a table of offsets corresponding to
each value in a particular integral
range, or fall through to the next
instruction if there is no match.
Jump to a new instruction based on
the value in the given register, using an ordered table of value-offset
pairs, or fall through to the next instruction if there is no match.
Perform the identified array operation at the identified index of the
given array, loading or storing into
the value register.
Branch to the given destination if
the given two registers’ values compare as specified.
Perform the identified object instance field operation with the identified field, loading or storing into
the value register.
Perform the identified object static
field operation with the identified
static field, loading or storing into
the value register.
Branch to the given destination if
the given register’s value compares
with 0 as specified.
Perform the indicated floating point
or long comparison.
Perform the identified unary operation on the source register, storing
the result in the destination register.
Perform the identified binary operation on the two source registers,
storing the result in the first source
register.
Throw the indicated exception.
115
After defined these categories, we have extracted, for each apk, all the Dalvik Opcodes and then creating a fixed-size vector with size 29 where each element is a defined category C=c1,c2,c3,c4.... and the corresponding value is the relative frequency
of that category (a density measure). At the end, we have stored in a local db, for
each apk, a characterizing vector, representing the Relative Opcodes Frequency Distribution. Following we provide an example of a characterizing vector for a malware
sample (66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa) of the
FakeDoc family:
Figure 4.3: 66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa sample, Opcode frequency Distribution
Table 4.3: 66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa Relative Opcodes Frequency Distribution Vector
116
nop op
0
move
0.01702
moveexception
0.00868
moveobject
0.05488
return op
0.02362
returnobject
0.05974
returnvoid
0.14345
returnwide
0.00764
monitorenter
0.00069
monitorexit
0
goto
0.14484
goto16
0.03161
goto32
0
invokevirtual
0.03890
invokesuper
0
invokedirect
0.00834
invokestatic
0.03334
invokeinterfaces
0.03091
packedswitch
0.01181
sparsewitch
0.00208
arrayop
0.00243
iftest
0.03091
instanceop
0.00660
staticop
0.01111
iftestz
0.27405
cmpop
0
unop
0
binop
0.00174
throw
0.02431
117
The Relative Opcodes Frequency Distribution is a metric for characterize applications
and it is our starting point in the Architecture. In fact, as first thing, the Opcodes
distribution of an apk candidate is extracted and then compared with those inside our
local db before extracted and a vector of 10 most similar samples will be returned
measuring the Cosine Similarity in SW. Cosine similarity is a measure of similarity
between two vectors of an inner product space that measures the cosine of the angle
between them, given two vectors of attributes, A and B, the cosine similarity, cos(θ), is
represented as:
A·B
cos(θ) = kAkkBk
In the case of information retrieval, the cosine similarity of two documents or text in
general will range from 0 to 1, the angle between two term frequency vectors cannot be
greater than 90◦ .
This vector contains the 10 more similar apk, respect to the candidate, in terms of
Relative Opcodes frequency Distribution:
Figure 4.4: Opcodes Distribution extraction and Computation
We also computed the Relative Opcodes Frequency Distribution for each method included
in the apk. In this way we are able to identify similarities at method-level in a more
fine-graned way, identifying clones at method level. This metrics is also stored in a local
db.
118
The method-level clone detection reported on Architecture, is another methods to find
structural similarities at method-level but we can discuss about it in the next paragraph.
Let’s show now a real case of the similarities detection methodology we implement. The
real case starts from the candidate apk we want to analyze, and we select a malicious
sample from the family Gappusin. Analyzing the sample, our methodology start the
Opcodes Frequency Distribution analysis and returning the 10 most similar vector by
the Cosine Similarity:
Table 4.4: 10 most similar vector by the Cosine Similarity
1
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
Sim: 1
Family: Gappusin
2
85f5513112800588fcd3b3cb59a4eeb2b25f53b1b3070a9e3a7d447a5cd45281
Sim: 1
Family: Gappusin
3
4465602bd2c37dff6819798831828ac0cbb70614874afb179a8efbf26e538411
Sim: 1
Family: Gappusin
4
40ec5ac7d23316b2d3ae674efcb9a48073ac08082aeaa8f9d5f4945d2e1ae4d3
Sim: 1
Family: Gappusin
5
8f617e67ea6bdc4daf680875de10b3df79ec141780b1ff7604155a66021fee76
Sim: 1
Family: Gappusin
6
622cdedc202c2876e2e9b4c949100963b9b74b7d3977c31d15a0bb10ab2a7c97
Sim: 1
Family: Gappusin
7
12780c0ed735628dc21a5c0e4783e82c017a6ad999fd74e29420f5c1a598ca6c
Sim: 1
Family: Gappusin
8
212aa193b9f5a688a650366970c02987c99d07f6b1e50f04c510bfb74f7062f1
Sim: 1
Family: Gappusin
9
da07316645455a02b2f9eda79662c585aee96e9ab06aeb139fb895b9ea89238a
Sim: 0.99890
Family: Gappusin
10
2a1582f742edeb2ba7d0a90c0bf4358643612a38544d13a297c2ea5f6b7d86ce
Sim: 0.99728
Family: Gappusin
We can notice that all the 10 returned vectors returned belong to the Gappusin Family as expected, these vectors are more close to the candidate according to the Opcodes
Frequency Distribution Similarity measured with the Cosine Similarity metric. This is
a whole apk-level coarse grained analysis which take into account only affinity with the
Opcodes Frequency Distribution.
These first analysis show us that most probably, the candidate apk belong to the Gappusin Family. As we can see from the previous table, the Sim score is the Similarity Score
for the Opcodes Frequency Distribution measured by the Cosine Similarity. Following
we are going to compare the Opcodes Frequency Distribution between the candidate
apk and each of the 10 returned vector, following are showed first the real frequency for
each kind of opcode and the related istogram:
119
Table 4.5: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs
Relative Opcodes Frequency Distribution of the 1st Vector (right)
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
120
Figure 4.5: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 1st Vector
(red)
121
Relative Opcodes Frequency Distribution of the 2nd Vector (right)
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
122
Figure 4.6: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 2nd Vector
(red)
123
Relative Opcodes Frequency Distribution of the 3rd Vector (right)
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
124
Figure 4.7: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 3rd Vector
(red)
125
Relative Opcodes Frequency Distribution of the 4th Vector (right)
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
126
Figure 4.8: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 4th Vector
(red)
127
As we can notice, the first 4 returned vectors are the same, in fact, they have a Similarity
Score of 1 (as we can see on the previous table) this means for the Cosine Similarity that
they are the same. In the next page, we show the others remaining 6 vectors returned
which has anyway an high similarity Score (but not 1), this means that the vectors are
not the same but they are very similar to the Candidate in terms of Opcodes Frequency
Distribution. The vectors from 1 to 4 are the same sample, while the vector num 5 is a
little bit dissimilar but it belong to the same Zitmo Family, this means most probably
that these num 5 vector represent a sample of that family a little bit dissimilar sharing
the 99% of the Opcodes, as in a Code-Reuse scenario.
128
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
129
(red)
130
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
131
(red)
132
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
133
(red)
134
nop op
0
nop op
0
move
0.02791
move
0.02791
moveexception
0
moveexception
0
moveobject
0.00465
moveobject
0.00465
return op
0.06047
return op
0.06047
returnobject
0.05581
returnobject
0.05581
returnvoid
0.16279
returnvoid
0.16279
returnwide
0.00465
returnwide
0.00465
monitorenter
0
monitorenter
0
monitorexit
0
monitorexit
0
goto
0.15349
goto
0.15349
goto16
0.02326
goto16
0.02326
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.00465
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00465
invokestatic
0.04651
invokestatic
0.04651
invokeinterfaces
0.00930
invokeinterfaces
0.00930
packedswitch
0
packedswitch
0
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0
iftest
0.06512
iftest
0.06512
instanceop
0
instanceop
0
staticop
0
staticop
0
iftestz
0.28372
iftestz
0.28372
cmpop
0
cmpop
0
unop
0
unop
0
binop
0
binop
0
throw
0.00465
throw
0.00465
135
(red)
136
nop op
0
nop op
0
move
0.02791
move
0.03744
moveexception
0
moveexception
0.00399
moveobject
0.00465
moveobject
0.01348
return op
0.06047
return op
0.02546
returnobject
0.05581
returnobject
0.09636
returnvoid
0.16279
returnvoid
0.15876
returnwide
0.00465
returnwide
0.00100
monitorenter
0
monitorenter
0.00100
monitorexit
0
monitorexit
0.00150
goto
0.15349
goto
0.15277
goto16
0.02326
goto16
0.03495
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.04693
invokesuper
0
invokesuper
0
invokedirect
0.00465
invokedirect
0.00899
invokestatic
0.04651
invokestatic
0.01398
invokeinterfaces
0.00930
invokeinterfaces
0.01248
packedswitch
0
packedswitch
0.00599
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0.00300
iftest
0.06512
iftest
0.04943
instanceop
0
instanceop
0.02147
staticop
0
staticop
0.01947
iftestz
0.28372
iftestz
0.25961
cmpop
0
cmpop
0
unop
0
unop
0.00050
binop
0
binop
0.0050
throw
0.00465
throw
0.00399
137
(red)
138
nop op
0
nop op
0
move
0.02791
move
0.03096
moveexception
0
moveexception
0.00593
moveobject
0.00465
moveobject
0.00791
return op
0.06047
return op
0.02141
returnobject
0.05581
returnobject
0.08531
returnvoid
0.16279
returnvoid
0.16634
returnwide
0.00465
returnwide
0.00132
monitorenter
0
monitorenter
0.00132
monitorexit
0
monitorexit
0.00132
goto
0.15349
goto
0.16271
goto16
0.02326
goto16
0.02931
goto32
0
goto32
0
invokevirtual
0.00465
invokevirtual
0.04447
invokesuper
0
invokesuper
0.00033
invokedirect
0.00465
invokedirect
0.01812
invokestatic
0.04651
invokestatic
0.01581
invokeinterfaces
0.00930
invokeinterfaces
0.01252
packedswitch
0
packedswitch
0.00527
sparsewitch
0
sparsewitch
0
arrayop
0
arrayop
0.00296
iftest
0.06512
iftest
0.04282
instanceop
0
instanceop
0.03195
staticop
0
staticop
0.02075
iftestz
0.28372
iftestz
0.26219
cmpop
0
cmpop
0
unop
0
unop
0.00066
binop
0
binop
0.00099
throw
0.00465
throw
0.00428
139
(red)
140
In an attemept to place this methodology before the Detection process, a most interesting analysis is that of, choosing a malware family, by going to identify the Dalvik
instructions more representative for that family. This is possible by first examing the
distribution of each Opcode category. For this, we choose to examinate the Gappusin
malware family. We can observe that Opcodes Classes like “iftestz”, “goto”, “return*”
are instructions that are present in each method of eack known application, but certain types of instructions rarely occur. Are the rare opcodes classes that can be more
representive of each malware family:
141
Figure 4.15: Gappusin family, Opcodes classes Frequency Distribution
142
4.4
Isomorphism Analysis
As we also mentioned, we starting retrieving the malicious apk’s from the Android
Drebin dataset, at this point we extracted some metric from each apk and store them
in a local database, following are showed the relevant informations we decided to extract
and store. For detecting code clones, we implemented some heuristics for the so called
Isomorphism Analysis, to detect the isomorphism ratio between malware samples.
4.4.1
The Control Flow Graph
As first step of this Heuristic, we extracted the Control Flow Graph of each application
in our Data-set but without considering external imports, this choice was made with the
intention of considering only internal structural properties because our study focuses on
Similarities among Malware variants. Considering only internal structural properties of
an application and not External or third-part libraries means less Entropy to Analyze
with the assumption that external libraries are variant.
For extracting CFG we used the AndroGuard suite exporting the CFG in gexf format.
After, all the extracted CFG’s are stored in a local filesystem for further analysis. We
used the Gephi Software to visualize and interact with the CFG:
Figure 4.16: Android Dragon Quest game CFG in Gephi
143
For each node, we have some useful informations like the Class, The Method Name and
the specific Opcode it implements:
Listing 4.1: Node properties
<node id="15" label="Ljp/waraerudouga/HttpPostTask;-access$3-BB@0x0-(Ljp/
waraerudouga/HttpPostTask; Ljava/lang/String;)V">
<att type="string" name="classname" value="Ljp/waraerudouga/HttpPostTask
;"/>
<att type="string" name="name" value="access$3"/>
<att type="string" name="descriptor" value="(Ljp/waraerudouga/HttpPostTask;
Ljava/lang/String;)V"/>
<att type="integer" name="offset" value="0"/>
<att type="string" name="node.label" value="access$3\nreturn-void"/>
After extracting the CFG and the FCG for the Dataset, we’ ll map these Graphs in the
Vector-Space using Adjacency Lists to detect Android malware using machine learning
techniques.
After extracting the CFG’s we are able to identify “common patterns” in the Control
Flow, for the identification we first mapping the CFG in the Vector-space and then
applying some n-grams analysis. In this way we are able to identify Code Clones or
everyway “common patterns” as a proof of Code Reuse. More n-grams are founded and
more the Similarity Score is increased.
4.4.2
The Function Call Graph
As we have already see for the CFG, same thing is for the Function Call Graph. We
extracted the Function Call Graph of each application in the Data-set but without
considering external imports, using the Androguard Framework.
Each FCG is then storead in a local db for further analysis. It is also possible to manually
analyze it with Gephi tool. Following we can observe a particular of the Call Graph of
the Dragon quest game for Android:
144
Figure 4.17: Android Dragon Quest game FCG in Gephi
Each FCG is extracted and then stored in a local db for further analysis. Our research
also use the Call Graph structural properties to found similarities in malware variants,
same thing of the CFG, in a Code-Reuse scenario, it is possible that some methods
are cloned or some behaviors are cloned between malware families. We then apply an
n-gram analysis to identify “Common Pattern” in the Call Graph. Then the similarity
score is increased more n-gram pattern will be found.
4.4.3
Vector-Space Mapping
To better and efficiently compute the “similarity” between Call Graphs, we introduce a
Vector transformation. Also for the Opcodes Distribution we have introduces a trasformation in the Vector Space.
We mapped all the CFG’s and FCG’s in the Vector Space with the use of the so called
Adjacency Lists. In graph theory and computer science, an adjacency list representation of a graph is a collection of unordered lists, one for each vertex in the graph.
Each list describes, for each vertex, the set of its neighbors. The use of an Adjacency
List instead of an Adjacency Matrix is because for a sparse graph an adjacency list
is significantly more space-efficient than an adjacency matrix. The space usage of the
adjacency list is proportional to the number of edges and vertices in the graph, while
for an adjacency matrix stored in this way the space is proportional to the square of
the number of vertices. Also in regard to the Architecture, we can notice that, before
each Similarity Computation, the structural data (CFG, FCG, Opcodes Distribution
Frequency) are mapped in the Vector-space:
145
Figure 4.18: Structural data mapped in the Vector-Space
In regards to the Architecture, firts of all, the system returns the 10 most similar Vectors
respect to the candidate apk, of these, it map from the Flow Graphs the relative 10 Adjacency Lists to provide them to the next stage n-gram analysis. The Building of the Adjacency List is done in Central Memory using hash map data structure, only when it was
complete, it was stored on the local storage. Following we show the construction of our
Adjacency List for malware sample 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975
as you can notice, it is a little bit different respect to a “right” Adjacency List, in fact on
the left we have the node id while, on the right we have the node value (Dalvik Opcode)
followed by the other node connected to it (value).
146
Listing 4.2: 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde sample Adjacency List
’127’ => ’goto,return’,
’1801’ => ’goto/16,goto’,
’1648’ => ’if-eqz,invoke-virtual,goto’,
’882’ => ’goto,goto’,
’1050’ => ’if-eqz,goto,move’,
’1178’ => ’if-eqz,if-eqz,if-nez’,
’898’ => ’if-eqz,const/4,goto’,
’798’ => ’goto,return-object’,
’1387’ => ’if-lez,if-lez,return-object’,
’1031’ => ’invoke-virtual,return-void’,
’512’ => ’move,return’,
’463’ => ’if-nez,move-result-object,if-eqz’,
’517’ => ’goto,return’,
’451’ => ’move-object,return-object’,
’1785’ => ’goto,return-void’,
’1364’ => ’if-eqz,if-eqz,return-void’,
’1408’ => ’if-nez,iput-object,if-eqz’,
’1250’ => ’sput-boolean,if-eqz’,
’1363’ => ’iput-boolean,if-eqz’,
’722’ => ’if-eqz,if-eqz,goto’,
’634’ => ’iput-object,invoke-direct/range’,
’578’ => ’goto,return-void’,
’1630’ => ’if-eqz,iput-object,if-eqz’,
’1244’ => ’if-eqz,if-eqz,goto’,
’1638’ => ’if-eqz,iput-object,return-void’,
’2144’ => ’invoke-virtual,return-void’,
’742’ => ’monitor-exit,return-void’,
’378’ => ’goto,if-eqz’,
’889’ => ’goto,if-lt’,
’350’ => ’invoke-interface,goto/16’,
’1969’ => ’if-le,invoke-virtual/range,goto/16’,
’1233’ => ’if-eqz,if-lez,goto/16’,
’540’ => ’if-eqz,if-ne,goto’,
’1615’ => ’if-eqz,if-nez,if-eqz’,
’1162’ => ’if-nez,sput-object,if-nez’,
...
147
We have to make a consideration, we don’t have the needs to reconstructing the whole
graph, our study can only take into consideration “identical” Structural Connections,
in fact, the Adjacency List of the Candidate apk is created and then , the Adjacency
List of each of 10 similar vector returned by the system, is extracted. At this point
the Adjacency Lists are sorted, and after that the system starts the Similarity Search
to identify “common” shared structures. This analysis is conducted for both CFG and
FCG.
Figure 4.19: Adjacency Lists preparation and similarities search
148
Following we present the Candidate apk CFG Adjacency List extracted and sorted,
ready for the n-gram Analisys stage:
Table 4.15: Candidate apk CFG Adjacency List extracted and sorted
return-void,goto
check-cast,if-ne,
const-string,if-eqz,
const-string,if-nez,
const-string,return-object,
const/,check-cast,
const/,goto,
const/,if-eq,
const/,if-lt,
const/,if-nez,
const/,move-object,
const/,return,
const/,return-object,
goto,const/,
goto,goto,
goto,if-eq,
goto,if-eqz,
goto,if-gtz,
goto,if-lt,
goto,if-nez,
goto,move,
goto,return,
goto,return-object,
goto,return-void,
goto/,if-lt,
goto/,return,
goto/,return-object,
goto/,return-void,
if-eq,goto,if-eq,
if-eq,if-eqz,return-void,
if-eq,if-lt,return,
if-eqz,const/,goto,
if-eqz,const/,return-void,
if-eqz,goto,goto,
if-eqz,goto,if-eqz,
if-eqz,goto/,goto/,
if-eqz,goto/,return-void,
149
if-eqz,if-eq,return-void,
if-eqz,if-eqz,if-eqz,
if-eqz,if-eqz,return-void,
if-eqz,if-ge,goto,
if-eqz,if-le,if-eqz,
if-eqz,if-ne,if-eqz,
if-eqz,if-ne,invoke-static,
if-eqz,if-ne,return,
if-eqz,if-nez,goto/,
if-eqz,if-nez,if-eqz,
if-eqz,if-nez,invoke-static,
if-eqz,if-nez,move,
if-eqz,if-nez,return,
if-eqz,if-nez,return-object,
if-eqz,if-nez,return-void,
if-eqz,invoke-direct,if-eqz,
if-eqz,invoke-static,goto,
if-eqz,invoke-static,if-eqz,
if-eqz,invoke-virtual,if-eqz,
if-eqz,invoke-virtual/range,if-eqz,
if-eqz,move,goto,
if-eqz,return-void,if-nez,
if-ge,goto,const/,
if-gtz,return-object,goto,
if-le,goto,if-eqz,
if-lez,move-result-object,return-object,
if-lt,goto,goto,
if-lt,if-nez,goto/,
if-lt,return,if-eq,
if-ne,if-eqz,move,
if-ne,if-nez,if-eqz,
if-ne,invoke-static,goto,
if-ne,invoke-static,if-eqz,
if-ne,move,goto,
if-ne,throw,return,
if-nez,const-string,goto,
if-nez,const-string,if-eqz,
if-nez,const-string,if-nez,
150
if-nez,const-string,return-object,
if-nez,const/,goto,
if-nez,const/,if-eqz,
if-nez,goto/,return-object,
if-nez,if-eqz,goto,
if-nez,if-eqz,if-eqz,
if-nez,if-eqz,if-ne,
if-nez,if-eqz,if-nez,
if-nez,if-eqz,invoke-virtual/range,
if-nez,if-eqz,return-void,
if-nez,if-ne,return,
if-nez,if-nez,goto,
if-nez,invoke-interface,return,
if-nez,invoke-interface,return-object,
if-nez,invoke-static,goto,
if-nez,invoke-static,move,
invoke-direct,if-eqz,
invoke-interface,return,
invoke-interface,return-object,
invoke-static,const/,
invoke-static,if-eqz,
invoke-static,move,
invoke-static,return,
invoke-static,return-void,
invoke-virtual,return,
invoke-virtual/range,if-eqz,
move,goto,
move,if-eqz,
move,if-gtz,
move,if-nez,
move,return,
move-object,const/,
move-result-object,return-object,
throw,goto,
151
Following we present the Candidate apk FCG Adjacency List ready for the n-grams
Analisys stage:
Table 4.16: Candidate apk FCG Adjacency List extracted
sendNotify
awardPointsHelper
closeOfDialog
showPushDialog
onDestroy
isValid
FinishApplicationRunnable
d
Initialize
drawGif
showOffers forTab
updateOnlineConfig
handleConnectResponse
bindNewApp
g
access$300
getPushAdDateFromServer
getDisplayAdDataFromServer
e
onError
access$100
access$500
getAnimationList2
¡init¿
onSurfaceDestroyed
drawFrame
onPostExecute
flush
q
b
¡init¿,a
¡init¿
submit
initNotityData,¡init¿,¡init¿,getInstanceNoConnect,
notify receiver
access$0
getConfigParams, isEn, edit, get
¡init¿, a
j, h, i, k, n, p, l, showUpdateDialog, ¡init¿, ¡init¿
getWapsInfo, ¡init¿, edit
access$2, access$2
showOffers forTab
a, f
buildDocument, getNodeTrimValue, getNodeTrimValue,
getNodeTrimValue, getNodeTrimValue
getNewAppInfo
n, h, a
getDownload
¡init¿
¡init¿
A, f, r, i, g, h, a, a, h, k, f, f, h, g
a, ¡init¿
updateResultsInUi
buildResponse
getAnimation
¡init¿, ¡init¿, ¡init¿, ¡init¿, updateOnlineConfig, update, setUpdateOnlyWifi, getConfigParams, a, ¡init¿, onSharedPreferenceChanged, ¡init¿, ¡init¿, a, ¡init¿, ¡init¿, ¡init¿,
¡init¿, a, ¡init¿, a, ¡init¿, ¡init¿, ¡init¿, getWapsInfo, ¡init¿,
¡init¿, ¡init¿, get, edit, ¡init¿, ¡init¿, ¡init¿, getInstance,
getInstance, getPoints, getParams, getParams, getParams,
¡init¿
access$0
access$0, drawGif, drawTouchPoint, drawGif
a
e
¡init¿, a
a, c, a, a, a, h, p, k, n, a, b, d, b, c, c, b, h, ¡init¿, ¡init¿, b,
a, d, f, d, a, a, a, a, a, ¡init¿, ¡init¿
152
Table 4.17: Candidate apk FCG Adjacency List extracted
DisplayAd
onVisibilityChanged
showFeedback
pushInfo
access$2100
access$800
getNetConfig
showADS
handleMessage
getAnimation
getParams
getPushAd
showNewApp
updateResultsInUi
getWAPS ID
doInBackground
startRotation
onCreate
getView
access$2700
getAnimationList
handleGetPointsResponse
onReceivedError
handleSpendPointsResponse
onResume
access$2600
access$2400
getPointsHelper
ExitDialog
onPause
startAnimation
getDownload
update
¡init¿, DisplayAd, showADS
access$0, drawFrame
showFeedback forTab
¡init¿, get, getInstance, setPushAudio, getInstance, getPushAd
UpdateDialog
getConfigParams, getConfigParams, get, get, edit
getInstance, getDisplayAd
a, b, c, a
¡init¿
initMetaData, getInstance, getParams
getPushAdDateFromServer, getPushAd, getInstance, getPushAd
¡init¿, ¡init¿, ¡init¿, bindNewApp, ¡init¿
¡init¿, startAnimation, ¡init¿
getInstance, getWapsId
a, a, a, a, a, a, a, a, a
¡init¿
a, a, b, a, a, a, a, ¡init¿, c, f, g, ¡init¿, ¡init¿, a, i, ¡init¿, initMetaData, initNotityData, getAlreadyInstalledPackages,
showNewApp, Initialize, ¡init¿, ¡init¿, ¡init¿, getInstance,
getInstanceNoConnect, notify receiver, getInstanceNoConnect, notify receiver, ¡init¿
access$1500, access$1500, getARGB, getARGB, getARGB,
getARGB, getARGB, getARGB, getARGB, getARGB,
getARGB, getARGB, getARGB, getARGB, getARGB,
getARGB, getARGB, getARGB
sendNotify
getAnimation
buildDocument, getNodeTrimValue, getNodeTrimValue,
getNodeTrimValue
access$502
buildDocument, getNodeTrimValue, getNodeTrimValue
onResume, a, b, onResume, ¡init¿, onResume, getInstance,
getPoints
handleAwardPointsResponse
¡init¿
¡init¿, ¡init¿
onPause, ¡init¿, onPause
getAnimationList2, getAnimationList
¡init¿, ¡init¿, ¡init¿, a, b
a, p, a, a, b
153
access$400
access$3000
shouldOverrideUrlLoading
getDisplayAdResponseFailed
onPageFinished
a
initFeedback
getValid
drawCube
access$2000
buildResponse
getPoints
toMD5
onPreferenceChange
enterPage
showMessage
onPageStarted
t
buildResponse
i, access$202, access$200, getInstanceNoConnect, package receiver
w
access$600
b, b, a, a, a, a, A, o, c, b, j, a, r, b, c, b, l, a, a, a, l,
a, b, d, b, a, a, c, o, s, k, p, a, a, a, a, ¡init¿, a, b, a,
e, a, a, a, a, a, a, a, a, ¡init¿, a, a, a, b, a, a, a, a, a, a, a,
¡init¿, ¡init¿, access$300, access$100, access$200, access$400,
access$500, ¡init¿, ¡init¿, access$100, access$2500, access$300, access$2600, access$2200, access$300, access$100,
access$2100, access$2200, access$1600, access$1000, access$1000, access$1000, access$300, access$1700, access$100, access$1800, access$1600, access$1900, access$2000, access$300, access$400, access$100, access$2300,
access$300, access$2400, access$2200, access$100, access$200, access$200, access$300, access$400, access$500,
access$600, access$600, access$700, access$100, access$800,
access$900, access$1000, access$1000, access$1100, access$1202, access$1200, access$1400, access$1500, access$100, access$902, access$1500, access$100, access$902,
access$2900, access$300, access$3000, getInstanceNoConnect, notify receiver, getInstanceNoConnect, notify receiver
a, a
get
drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine,
drawLine
toMD5
¡init¿, a, ¡init¿, getNodeTrimValue, getNodeTrimValue,
getNodeTrimValue, getNodeTrimValue, decodeBase64,
getNodeTrimValue, getNodeTrimValue, getNodeTrimValue,
getNodeTrimValue, getView, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue,
decodeBase64
getPointsHelper
toHexString
access$0, isValid
onEvent
¡init¿
access$600
r
154
getInstance
showNotifyList
s
c
showMore
onCreateEngine
notifyReceiverHelper
onDownloadStart
getConfigParams
onReceive
openFeedbackActivity
onSurfaceChanged
getWapsInfo
onItemClick
submit
access$2800
packageReceiverHelper
onClick
onFeedbackReturned
showOffers
spendPointsHelper
notify receiver
run
getDisplayAd
¡init¿, a, ¡init¿, a, ¡init¿, a, ¡init¿, a, ¡init¿, ¡init¿, ¡init¿,
¡init¿, ¡init¿, ¡init¿, ¡init¿, ¡init¿
bindNewApp
r
b, b, a, a, d, e, c, ¡init¿, ¡init¿, ¡init¿
showMore
¡init¿
¡init¿
¡init¿, access$100, access$200, access$300
r
a, getWapsInfo, pushInfo, getWapsInfo, pushInfo,
getInstanceNoConnect, package receiver, getInstanceNoConnect, package receiver
a, a
drawFrame
¡init¿
access$1100
¡init¿
getPushAd
¡init¿
a, b, c, a, a, a, d, ¡init¿, access$0, access$0, access$0, access$0, access$0, l, m, a, a, a, a, i, a, ¡init¿, c, ¡init¿, access$700, access$700, access$800, access$900, access$1000,
access$1000, access$900, access$1000, access$1000, access$1202, access$1202, access$1300, access$1300, access$1400, access$1400, access$900, access$400, access$400,
access$4, access$5, access$8, access$2, access$7, access$5,
access$2, access$7, access$7, access$600, access$600, access$600, access$700, access$700, access$600, access$700,
access$900, access$600, access$000, getInstance, spendPoints, getInstance, showOffers, access$1400, access$3100,
access$1400, access$502, access$502, getInstanceNoConnect, package receiver
a
showOffers
¡init¿
notifyReceiverHelper
¡init¿, access$0, a, a, a, a, a, a, b, c, ¡init¿, a,
drawFrame, d, e, b, access$002, access$000, showADS, access$100, access$0, ¡init¿, access$0, access$1, showMessage,
access$2, showMessage, access$2, showMessage, access$2,
showMessage, access$4, access$5, showMessage, access$3,
showMessage, access$4, access$5, access$6, access$6, access$5, access$7, access$5, edit, access$1400, access$1400,
access$2700, access$2800
155
h
setOpenGLContext
onEvent
f
getInstanceNoConnect
i
setDefaultReportPolicy
showMore forTab
UpdateDialog
¡clinit¿
spendPoints
awardPoints
showUpdateDialog
initMetaData
onOffsetsChanged
access$1100
package receiver
finalize
m, n, a, a, e
a
onEvent, onEvent, onEvent, a, ¡init¿
b, a, b, s, t, n, ¡init¿, ¡init¿
¡init¿, a, ¡init¿, ¡init¿
b, a, a, j, a, b, p, q
a
showMore forTab
buildDocument, getNodeTrimValue, getNodeTrimValue
¡init¿, ¡init¿
¡init¿
spendPointsHelper
awardPointsHelper
g, b, d
getWapsId, getWapsId, toMD5
access$1, drawFrame
loadApps
packageReceiverHelper
¡init¿
156
4.4.4
N-grams Analysis stage
After the preparation of the Adjacency Lists, starts the real Isomorphism Analysis
on Call Graphs. Until this point we have extracted some structural properties of the
apk, mapped them in the Vector-space to better and efficiently represent them, now we
need to identify and quantify the common shared sub-structures in the CFG and in the
FCG, after that phase a final similarities score is computed. Lets to analyze this phase.
An n-gram is a contiguous sequence of n items from a given sequence of text or speech,
with the sorted Adjacency Lists, it is time to detect the contiguous sequence of the CFG
or the contiguous sequence of the Call graph in common. In this way we are able to
Identify common subgraph, if the candidate apk has a valuable “portion” of its graph
identical to one of the 10 most similar vector returned (10 most similar apk returned)
we can consider this portion as malicious and probably can be considered as a Payload
Portion.
We can starting considering a 7-gram analysis on the CFG Adjacency Lists and 4gram analysis on the FCG Adjacency Lists. The choiche of these n value in this ngram analysis is a purely empirical choiche, we have noticed that a small gram sequence
conduct to a much more number of false positive, at this point we decided to choiche a
major n to only consider “relevant” portion of the graph. Following is represented the
system architecture until the n-gram phase analysis, omit now a second methodology
for the identification of similarities, the method-level clone detection:
Figure 4.20: Architecture. N-gram analysis
157
Following are showed the results of the n-gram analysis phase both for the CFG and the
FCG Adjacency Lists of each returned vector. This is the result of the n-gram analysis
for the CFG Adjacency List:
158
4
3
2
1
Table 4.18: n-grams on the Adjacency Lists of the 1st returned vector
N-grams Analysis for each of the 10 returned vectors
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 40ec5ac7d23316b2d3ae674efcb9a48073ac08082aeaa8f9d5f4945d2e1ae4d3
Sim: 1 Family:Gappusin CFG 7-Grams Total: 63 FCG 4-Grams Total: 25
Table 4.21: n-grams on the Adjacency Lists of the 4th returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 4465602bd2c37dff6819798831828ac0cbb70614874afb179a8efbf26e538411
Table 4.20: n-grams on the Adjacency Lists of the 3rd returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 85f5513112800588fcd3b3cb59a4eeb2b25f53b1b3070a9e3a7d447a5cd45281
Table 4.19: n-grams on the Adjacency Lists of the 2nd returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
4.4.4.1
159
8
7
6
5
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 212aa193b9f5a688a650366970c02987c99d07f6b1e50f04c510bfb74f7062f1
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 12780c0ed735628dc21a5c0e4783e82c017a6ad999fd74e29420f5c1a598ca6c
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 85f5513112800588fcd3b3cb59a4eeb2b25f53b1b3070a9e3a7d447a5cd45281
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 8f617e67ea6bdc4daf680875de10b3df79ec141780b1ff7604155a66021fee76
160
10
9
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde 2a1582f742edeb2ba7d0a90c0bf4358643612a38544d13a297c2ea5f6b7d86ce
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde da07316645455a02b2f9eda79662c585aee96e9ab06aeb139fb895b9ea89238a
161
The n-grams analysis has detected the “common (maximum) subgraph” of the Candidate
with known instances of some malware families. This behavior means only one thing,
that candidate adopted or in a proper manner cloned one or more malware behaviours.
As stated by the previous tables, the first 8 returned vector have a Similarity Score of “1”
about the Opcodes Frequency Distribution, this means that the first 8 vector are “almost
the same”. The n-gram analysis return the whole adjacency list both for the CFG and
FCG, the vector #1 is the same sample, with the name 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde. In fact all of these belong to the Gappusin
Family, the vector #9 have a bit less similarity Score than the first 8 (0.99890) but
still remains an high score. This result show that the #9 vector is enough similar to
the candidate, in fact, the n-gram analysis stage return 18 entry on 64 for the CFG
adjacency list and 3 entry on 28 for the FCG adjacency list, this means that vector
#9 has some behaviour in common with the candidate, or in other words, there is a
portion of the vector #9 cloned by the candidate. Same thing also for the vector #10.
Candidate have almost the whole CFG/FCG in common with vectors from 1 to 8, about
vector #9 it share a common subgraph for the FCG:
162
Table 4.28: Common Subgraph with the n. 9 vector
access$2100:
access$2400:
access$2600:
access$2700:
access$2800:
access$3000:
access$300:
access$400:
handleAwardPointsResponse:
handleConnectResponse:
handleGetPointsResponse:
handleMessage:
sendNotify
getPushAd
buildResponse
getDownload
buildDocument,
getNodeTrimValue,
getNodeTrimValue
buildDocument,
getNodeTrimValue,
getNodeTrimValue, getNodeTrimValue,
getNodeTrimValue
buildDocument,
getNodeTrimValue,
getNodeTrimValue, getNodeTrimValue
a,b,c,a
Following is shown “Common Subgraphs” detected by n-grams Analysis stage.
Figure 4.21: Common Subgraphs with the vector n. 9 Detected by n-grams analysis
stage
163
about vector #10 it share a common subgraph for the FCG:
Table 4.29: Common Subgraph with the n. 10 vector
access$2100:
access$2400:
access$2600:
access$2700:
access$2800:
access$3000:
access$300:
access$400:
getAnimationList2:
getAnimationList:
getConfigParams:
getDisplayAd:
onError:
onEvent:
onFeedbackReturned:
onItemClick:
startAnimation:
startRotation:
submit:
t:
sendNotify
getPushAd
buildResponse
getDownload
getAnimation
getAnimation
r
a,¡init¿
onEvent,onEvent,onEvent,a,¡init¿
a
access$1100
getAnimationList2,getAnimationList
¡init¿
¡init¿
r
Figure 4.22: Common Subgraphs with the vector n. 10 Detected by n-grams analysis
stage
164
4.4.5
The Method-level similarity analysis stage. Type I and II clones
detection at method-level
It was also implemented a more fine-graned analysis stage called The Method-level
similarity analysis stage. The idea behind is to identify clones at method-level
rather then at class level. With this analysis we can identify same reused behaviours
from a well-know malware family. This analysis stage is implemented in parallel with
the n-gram analysis stage as we can see from a slice of the architecture:
Figure 4.23: Architecture. Method-level similarity analysis stage
After the vectors are returned, 2 distinct fundamental analysis stage are implemented:
the n-gram analysis stage, we appropriately analyzed previously, for the identification of
common subgraphs, and the Method-level similarity analysis stage for the Identification
of code clones at method-level. Both these 2 analysis stage provide a contribute for the
next stage classifier for the computation of the final similarity scores.
Let’s examinate this stage, as first thing, for each identified method in the candidate,
an Opcode Frequency Distribution for that method, is computed and then compared
with those of each returned vector to identify methods clone. The choice of compare
an opcode distribution instead of the methods name is dictated by the fact that, in
a code-reuse scenario, methods name or more in general “identifiers” can be changed
or encrypted, while an identical opcodes distribution still remains indicative of a code
clone. This analysis is resilient at least for type I, and type II Clones but not for type
III and type IV, in fact, type I and II still have the same opcodes distribution, not the
165
same for type III and IV.
Could make it even more resilient to type III and Type IV code clones with a little
improvment, computing a similarity or distance score for the opcodes distribution, in
fact, by the way, the opcodes distributions are strictly compared entirely and if (in case
of Type III or IV code clone) a little variation is present, the result is “no match” for
that distribution.
4.5
The Classifier
At this point, each analysis stage products some score, a similarity score indicating
the “isomorphism ratio” between samples, the ratio of the portion cloned by the candidate. Based on this portion we can attribute our candidate to one of the malware
families present in the data-set, or more than one malware family. This task is under
the responsability of the Classifier module. As we can see from the Architecture, the
last stage is the Classifier, which is responsible to acquire metrics for each stage and
provide a final Similarity Score which represent memberships to one or more than one
malware families. The Detection of the malware variants is a threshold analysis based
on isomorphism ratio between malware samples. Let’s analyze our implementation of
the Classifier. At this phase, we will have 4 different metrics normalized, for each of the
most similar vectors returned:
Table 4.30: The Classifier Analysis stage 1
Relative Opcodes Frequency Distribution Sim. Score
# of 7-grams CFG
# of 4-grams FCG
# of Methods Similarity Ratio
Our Classifiers consist of formulas embedded in SW that take into accounts these 4
score, we can also consider that the classifier can be well calibrated basing on experimental phase, this flexyble architecture can permit us to also have other Classifiers. We
make some considerations about. Our methodology is, however, guided by the Opcode
Distribution Frequency Analysis, because each of the following analysis stage take into
166
consideration only the 10 returned vectors (most similar vectors), starting from these,
the n-gram analysis is useful to detect code-clones and the Method-level similarity analysis stage show us the # of methods cloned. The Formula have to take into account also
the # of n-grams detected in the Call Graphs, if there is no one n-gram token both for
CFG or for the FCG, then the Similarity Score is 0, this means there is no similarities
between the candidate and one of the malware samples present in the dataset. Our
Classifier consist of 2 similarity scores following defined
The I Similarity score is (the values are first normalized):
I Sim. Score = max(# of 7-grams CFG , # of 4-grams FCG ) ∗ methodSimRatio
In this formula, we take into account that at least n-grams for CFG or FCG exists,
the max value of the n-grams is then multiplied with the methodSimRatio [0,1]. In this
formula we doesn’t take into account the Relative Opcodes Frequency Distribution Sim.
Score because, the 10 returned vectors are yet sorted by this score. Let we analyze the
2 Similarity Score.
The II Similarity score is (the values are first normalized):
II Sim. Score = OpCodeSimScore ∗ (# of 7-grams CFG) ∗ (# of 4-grams FCG
) ∗ methodSimRatio
In this case we multiplied all the scores, this Similarity Score is close to 1 when all the
scores are close to 1 (1 when the app is exactly a clone). The goodness of these scores
will be evaluated in the experimental Chapter.
167
4.6
Real System implementation
The System consist of a some perl scripts:
• method.pl which is responsible to extract and store the Opcodes Frequency Distribution at method-level.
• AdjList.pl which is responsible to extract and store the Adjacency Lists for both
CFG and Call Graph and also for the extraction of the Opcodes Frequency Distribution.
• DescentDroid.pm which is the core framework. It take the candidate apk in input and perform all the analysis based on Opcodes Frequency Distribution and
Isomorphism between Call Graphs.
Let we analyze the usage of these tools.
method.pl
It is necessary to execute the perl script as perl method.pl with no arguments. It take all
the malicious samples in the dataset and compute the method-level Opcodes Frequency
Distribution and store it in the method level sim table. This is a screenshot of the script
execution:
Figure 4.24: method.pl execution phase
168
As we can see from the screenshot, the yellow underlined is the sample id, the sample
under analysis in that moment, the red underlined is the family membership and the blu
underlined is the method on which the tool compute the Opcode Frequency Distribution.
Below each entry, are the opcodes of that method.
AdjList.pl
The script takes an argument, we have to specify if we want to extract the Opcodes
Frequency Distribution or the relative Adjacency Lists.
perl AdjList.pl -freq for the frequency distribution computation or perl AdjList.pl -adjxxx
for the adjacency lists extraction.
Figure 4.25: AdjList.pl -freq execution phase
with the flag -adjCFG or -adjFCG it is possible howeverto extract the related AdjacencyList:
169
Figure 4.26: AdjList.pl -adjFCG execution phase
The yellow underlined is the sample id under analysis, and the red part is the relative
extracted Adjacency List for the Call Graph.
Figure 4.27: AdjList.pl -adjFCG execution phase
170
The yellow underlined is the sample id qhile, the red part is the relative CFG Adjacency
List. On the left, the blu underlined, it’s the nodeid.
4.6.0.1
DescentDroid.pm
These script are responsible for the extraction of the metrics from the dataset and for
the storage of the same. In this way we are able to build a local database of signatures,
a kind of Gold Standard for our further analysis. The DescentDroid.pm is the module
where we implemented our methodology, in fact it takes a candidate apk as input and
then classify it.
Following are shown some screenshot captured during the execution of the system:
Figure 4.28: Execution phase
It is also possible to enable a debug option to show also portions of the adjacency Lists
shared:
171
Figure 4.29: DescentDroid Execution phase
Figure 4.30: DescentDroid Execution phase
172
Figure 4.31: DescentDroid Execution phase, with debug option
Chapter 5
Experimental Phase
5.1
Focus of the Experiments
Before defining a strategy in the experimental phase, we have to first take into account
that the idea behind this methodology is to detect a probably malware variant in the
candidate apk. We have to apply this methodology on a set of malware samples choosen
randomly and before we trained our dataset.
The main and first focus is, to evaluate, the Accuracy and the Precision& Recall on
1000 malware samples, also evaluating the Precision& Recall of the Classification as a
MultiClass Classification Problem, and also evaluating the quality of the informations
produced. These “informations” can be considered also as “payload presence signs”.
A secondary focus, is that of evaluate the Effectiveness in Detecting Android Malware,
in terms of Total Precision & Recall. We use the Weka tool. Weka is a collection
of machine learning algorithms for data mining tasks, it contains tools for data preprocessing, classification, regression, clustering, association rules and visualization.
5.2
Focus I: Performance evaluation : A MultiClass Classification Problem
Our main focus is to evaluate the Accuracy as the overall correctness of the Malware
Classification process and the Precision& Recall of the Classifier for each class. We set
up the problem as a MultiClass Classification Problem, where for a given input instance,
173
Chapter 5. Experimental Phase
174
there is exactly one correct output class which is selected out of several classes. Our
evaluation will be also qualitative, in the sense that we evaluate the type and the quality
of information produced in the perspective in which we can also use this methodology
to build “signs” that other tools can use.
In this evaluation we trying to response these Research Questions:
• RQ 1 ) What is the Classification Accuracy of the proposed methodology
?
• RQ 2 ) What is the Classification Effectiveness of the proposed methodology ?
• RQ 3 ) Is there a relationship between the effectiveness of Classification
and the size of the trained dataset ?
5.2.1
Characterization
We consider 1000 malicious samples taken randomly from drebin project before the step
of training. In this way we are sure that these 1000 samples used are malware variants
of well-know malware families. Then we submitted, with a batch script, all these 1000
samples to the DescentDroid tool and after collect and store all the extracted metrics
in a local db we are able to start Classifing malware variants. At this point we evaluate
the Malware Classifier Accuracy and the Precision& Recall of the Classifier for each
class. This is the characterization of the dataset used:
5.2.2
175
Evaluations Detail
Starting from 1000 malware variants samples , DescentDroid is able to proper Classify
979 malware samples:
Figure 5.1: 979 malware variants proper Classified from 1000 Malware samples
Following is showed a portion of the table reporting these results, the table is composed
by the these fields:
• id : The id of the malicious being analyzed.
• similar : The most similar apk in the dataset.
• family : The family membership.
• OpcodeSim : The Relative Opcodes Frequency Distribution Similarity.
• CFG 7gramsC : The total counts of 7-grams identified on the CFG.
• FCG 4gramsC : The total counts of 4-grams identified on the FCG.
• methodLSimRatio :The Similarity Score at method-level (Indicative of the # of
methods cloned).
176
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
ID
similar
8171962cbfe3dd358215be31644bcec8126995adbd17460f6280e080c3693af3
89653e3db8ed264d6cb8c0456410ff15a26e2f3452b1e0101294e845069bb331
552557183355e34353ab93e7de7fe281132cbc422eefdefcdadcb3af2235c631
b62d69ed54c99ed9a3ad5c9109ec384f3757273f9762afeb151dfcfb860dec24
9631d73c6d30ac906849b2ed9c0e42e1526f6a502690fa7f93888a70320c307d
756826a907cb774d9c71bdd63f8a3c3b30e38820f8811d6b38542d20c9d6ba70
1418642f008b0b623ae0cc627bf3c5214b82a376de21db9fd69cbe18cc211e5d
53bc8bb25e4ec65c6109ae763e0fc22812d56b4f5338f490405fb72e68515552
a88bdbd2170e2e9d1e20d1276fbe8c5cea75746c651e9f77170ed5f5480be5a7
efc6a641cb2a7da75daa60c8c8f2bf4fa4fef41ce640c294c837683374c6d14d
fd08cb6e2c865af562e7ea171ede7a9d1165a189bb305f9f7102d58e7dd678d6
c6a25935de07fbfb3424d429b1c942e296b61849b839530f0eede3c52c066290
82aca2743ad212d945af1c56f52a01f1fd9e603baee3841f4d78202bd5732d4c
cc25605a7362de48f70b288f851cb12f1f361f66ff5203f09ddc1cd8468c36d0
484804147b90b5ad68e86f35fffe156638a6aae2a783ff2f4c49f2137de9026a
b387fee6de2400fb045b100437f82bcaf10dfa605eedebe02d854b59e51bc122
11e78b279cf47f46ec5290e7b79ed72b8992d73184a0fd10c4bdc39960613546
4269e9f03f43d82df992b417e961554caecb80d06ca3b0c1b847a09fd257901f
90e0253334786c6a99be569a7e62ca82fc50fce56045320e2473dd840d67096a
6f3908828424cbbe2e4fc8f1d1d67e538f8813a22942d9a7eb9ee5f74173c553
family
FakeInstaller
Nyleaker
DroidKungFu
Jifake
Plankton
Nickspy
FakeInstaller
Adrd
Jifake
Opfake
Opfake
FakeInstaller
FakeInstaller
ExploitLinuxLotoor
Yzhc
Gappusin
Yzhc
DroidKungFu
FakeTimer
Gonca
Table 5.1: Results detail on 1000 malware samples, Synoptic table
0.99997
0.99582
1.00012
0.96858
0.99999
1.00002
1.00001
1.00012
0.99999
1.00505
1.00000
1.00001
0.99999
0.99820
1.00011
0.99995
1.00001
1.00013
1.00002
0.99921
OpcodeSim
9
5
45
1
3
24
8
66
4
12
23
18
18
14
88
43
2
26
3
12
CFG 7gramsC
1
4
13
0
82
3
2
15
0
0
4
12
5
2
33
18
36
29
0
1
FCG 4gramsC
0.10000
0.10000
1.00000
0.10000
0.54559
1.00000
1.00000
0.24419
1.00000
1.00000
1.00000
1.00000
1.00000
0.10000
1.00000
0.49315
1.00000
1.00000
1.00000
0.10000
methodLSimRatio
177
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
ID
similar
59d7632c2b19adcd2bab55a37b13fe895ae3d98c335ea3d8f0c359a92c7e2ef4
d3c97661814502dd8d192a020e6bfbc6e9842eb6d39f0a9dc738053825d72274
aa49610e023b1026c924ea1a523ef5c63790c4e5c236437e14ff8a4f988f8d08
76003cc3dfb3de1635d1459e3bebfedd60bb05abaa410c76675bcfb187ee5015
39545b4d7e832ee5f11f9ea141b510f82eaa7fb777808b9be5b4e05fa459e9e2
8e662a20ad2a049b9b11c770d47c2db0c555ae2a863c7eef57a394dae1d40f6b
ec06c6f995ae0702802a14336bf563c6c1c196f06e7d8488ef16d2bc94d42c8b
a3615dd98e1301f9bf35304cb3253855c455982423a58732ad27beed99b2bda1
7fda6a527f1e9a2dbb918d902a856fc8cb9bf66fb75cd07e51feebd87642db2c
053a5a6b5de01ac16784db09422c3c462bf208f7cbd4cde4b5244393f162e357
9943b22719f7b9543c7fa8f1e03e791d99257689c4d9dc44b01268aad883ed56
5208c091b315c25975719f7bc0a7cc01a62287843eb6299a454605f9bdfd3a91
ba48a05e03073cdbd8230ba9194eb1866d9f790ab0f10d949295e5479825b927
e4eb02b2d64d33e4c0536406bfc9a6d8fcc6b5237642d92333ee3e089bd82723
15e800e9abb449b50372b9b9807b585e546f8efd84c3182e670107bca8c74dde
563e5a2db3a05703c9a1006d726fa8ed2ccabed66a431d180409fb8b5f19406d
4fb6bcac98872b0ee94d3ca258069b6b02cafcb0980c5a44482a950b2ab62b45
Table 5.2
family
FakeFlash
Opfake
Opfake
Sakezon
Opfake
DroidKungFu
Opfake
Plankton
Opfake
Stealer
Opfake
FakeInstaller
FakeInstaller
FakeInstaller
DroidKungFu
Opfake
Opfake
Vidro
Opfake
ExploitLinuxLotoor
1.00001
1.00013
1.00003
1.00002
0.98758
0.99856
1.00015
0.99441
1.00000
1.00004
1.00063
0.99999
1.00000
0.99999
1.00016
0.99996
1.00000
0.99999
0.99997
1.00004
OpcodeSim
25
23
24
14
1
9
24
0
23
22
37
14
19
19
25
20
23
10
20
54
CFG 7gramsC
3
2
2
2
0
7
2
0
4
2
1
8
2
12
39
2
4
7
1
16
FCG 4gramsC
1.00000
1.00000
1.00000
1.00000
0.10000
0.10000
1.00000
0.10000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
0.10000
1.00000
1.00000
1.00000
1.00000
methodLSimRatio
178
From these results, we are able to state that each analyzed sample, is been traced back
to a most similar malware in our dataset (builded from Drebin dataset) and therefore
to its own family as the direct consequence of the Classification process. This Classification process is guided by some metrics that we can also extracted as the direct
consequence of the presence of the payload. Following we evaluate the Accuracy, and
the Precision&Recall of this methodology to proper classify malware.
5.2.2.1
Classification Accuracy, Precision&Recall for each Malware Class.
In this section we provide an evaluation about the Accuracy and the Precision& Recall to proper identify the real malware family memberships. For this we first calculate
the Accuracy of the Classification.
Accuracy
Is the overall correctness of the model and is calculated as the sum of Correct Classifications divided by the total number of classifications:
Accuracy =
tp+tn
tp+f p+tn+f n
and then:
979
Accuracy = 1000
= 97.9%
Then we measure the Precision of the model for each class (for each malware family):
Precision
Precision is a measure of the accuracy provided that a specific class has been predicted.
It is defined by:
tp
Precision = tp+f
p
179
For instance, for the class A:
P A
Precision = T otalPTredictedasA
Recall
Recall is a measure of the ability of a prediction model to select instances of a certain
class from a data set. Measures to which extent classes are scattered across clusters: It
is defined by the formula:
tp
Recall = Sensitivity = tp+f
n
For instance, for the class A:
TP A
Recall = Sensitivity = T OT ALGOLDF
ORA
Now we try to address the RQ 3, relating the size of the family sample for each class,
about our trained dataset, to their relative Precision&Recall. This evaluation try to underline a possible relationships between the size of the trained dataset and the precision
or sensitivity in the Classification Process, if it is so, we have a good degree of freedom
in enhancing this Process.
180
Table 5.3: Precision and Recall of the MultiClass Classification Process
Class
AccuTrack
Adrd
Adsms
Aks
Anti
BaseBridge
BeanBot
Biige
Boxer
Ceshark
Coogos
Copycat
Cosha
Dabom
Dialer
Dougalek
DroidDream
DroidKungFu
DroidRooter
DroidSheep
ExploitLinuxLotoor
FaceNiff
FakeDoc
FakeFlash
FakeInstaller
FakePlayer
FakeRun
FakeTimer
Fakelogo
Fakengry
FarMap
Fatakr
Fauxcopy
FinSpy
Fjcon
Flexispy
FoCobers
Fsm
GPSpy
Gamex
Gapev
Gappusin
Geinimi
Generic
GinMaster
Glodream
Gmuse
Gonca
Hamob
Hispo
TotSamples
2
3
1
2
1
3
2
2
3
2
2
3
3
1
1
3
13
180
2
4
16
1
3
1
191
3
18
3
4
2
1
3
1
1
2
1
3
1
1
1
1
8
25
1
23
6
1
2
3
1
Precision
1.00000
0.28571
1.00000
1.00000
1.00000
0.23077
1.00000
1.00000
0.50000
1.00000
0.00000
1.00000
1.00000
1.00000
1.00000
1.00000
0.90909
0.95954
0.95954
1.00000
0.78571
1.00000
0.60000
1.00000
0.98413
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
0.75000
1.00000
1.00000
1.00000
1.00000
0.75000
1.00000
1.00000
0.53333
0.83333
1.00000
1.00000
1.00000
1.00000
Recall
1.00000
0.66667
1.00000
1.00000
0.00000
1.00000
1.00000
1.00000
0.33333
1.00000
0.00000
0.66667
1.00000
1.00000
1.00000
1.00000
0.76923
0.92222
0.00000
1.00000
0.68750
1.00000
1.00000
1.00000
0.97382
0.33333
1.00000
1.00000
0.75000
0.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
0.00000
0.75000
0.72000
1.00000
0.69565
0.83333
1.00000
1.00000
1.00000
0.00000
Class
Iconosys
Imlog
JSmsHider
Jifake
Kidlogger
Kiser
Kmin
Koomer
Ksapp
Lemon
LifeMon
Luckycat
Mania
MobileTx
Mobilespy
Mobinauten
Moghava
Nandrobox
Nickspy
NickyRCP
Nisev
Nyleaker
Opfake
PdaSpy
Penetho
Placms
Plankton
Proreso
QPlus
Raden
RediAssi
RootSmart
Rooter
SMSZombie
SMSreg
Sakezon
SeaWeth
SendPay
SerBG
SheriDroid
SmsWatcher
Spitmo
SpyBubble
SpyHasb
SpyMob
181
TotSamples
13
6
1
4
2
2
19
1
1
2
1
1
1
19
3
1
1
3
2
1
1
3
188
1
3
3
44
1
1
2
1
2
1
2
12
2
2
8
3
1
1
3
1
3
1
Precision
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
0.95000
0.00000
0.00000
0.00000
1.00000
1.00000
1.00000
0.95000
0.66667
0.66667
0.66667
0.75000
1.00000
1.00000
0.33333
1.00000
0.98947
1.00000
1.00000
1.00000
0.82353
1.00000
1.00000
1.00000
1.00000
0.66667
1.00000
1.00000
0.83333
0.66667
1.00000
0.88889
0.50000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
Recall
1.00000
1.00000
0.00000
1.00000
0.50000
1.00000
1.00000
0.00000
0.00000
0.00000
1.00000
1.00000
1.00000
1.00000
0.66667
0.00000
0.00000
1.00000
0.50000
1.00000
1.00000
0.33333
1.00000
1.00000
1.00000
1.00000
0.95455
1.00000
1.00000
0.50000
1.00000
1.00000
1.00000
1.00000
0.83333
1.00000
0.50000
1.00000
1.00000
1.00000
1.00000
1.00000
0.00000
1.00000
1.00000
Class
SpyPhone
Spyoo
Spyset
Stealer
Stealthcell
Steek
Stiniter
Tapsnake
TigerBot
Trackplus
TrojanSMS.Denofow
TrojanSMS.Hippo
Typstu
Vdloader
Vidro
Xsider
Yzhc
Zitmo
Zsone
182
TotSamples
2
1
2
2
2
3
2
1
1
2
1
2
2
3
2
2
10
2
2
Precision
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
1.00000
0.00000
1.00000
0.33333
1.00000
0.66667
1.00000
1.00000
0.50000
Recall
0.00000
1.00000
1.00000
1.00000
0.50000
1.00000
0.50000
1.00000
1.00000
0.50000
1.00000
0.00000
1.00000
0.33333
1.00000
1.00000
0.90000
1.00000
1.00000
183
Figure 5.2: Multiclass Classification, Precision in function to the size of the dataset
184
Figure 5.3: Multiclass Classification, Precision vs the size of the dataset
185
Figure 5.4: Multiclass Classification Recall in function to the size of the dataset
186
Figure 5.5: Multiclass Classification, Recall vs the size of the dataset
5.2.3
187
Conclusions and Considerations
The I Focus experimentation, highlight us an important result, this methodology is
effective and reliable to proper Classify Android Malware Variants, reliable because the
final precision of the Classification process does not depend upon the number of samples
in the training set. Now we can give an answer to each Research Questions:
• RQ 1 )What is the Classification Accuracy of the proposed methodology
?
As we have seen , the Total Accuracy in the Classification Process is about 97.9%.
• RQ 2 ) What is the Classification Effectiveness of the proposed methodology ? We have before reported the results table with the Precision&Recall for
each class and in most cases they tend to 1. For the Classification, this methodology seems to be very accurate and sensible.
• RQ 3 ) Is there a relationships between the effectiveness of Classification
and the size of the trained dataset ? There isn’t a direct relationships between
the Precision or Recall and the size of samples in the trained dataset. We can state
that, the Precision and the Recall are 1 in the case with a family with few samples.
It is important for the future trying to address other kinds of possible relationships
to improve this methodology. This is a synonimous of reliability because, this
means that we can cluster a malware also if we have few samples of that family in
our training dataset.
5.3
188
Focus II: Detection Evaluation of Android Malware
The Focus II is focusing on the evaluation about the Detection of Malware in terms of
Accuracy and Precision&Recall. First and foremost we conduct these experiments with
3 different classifiers for the Opcodes Frequency Distribution Similarity. Then we
trying to responde these following Research Questions for each classifier:
• RQ 1) What is the Detection Accuracy of the proposed methodology ?
• RQ 2) What is the Detection Effectiveness of the proposed methodology
?
• RQ 3) As can be satisfactory the extracted data in terms of quality
(F-Score, FP/FN rate) ?
5.3.1
Characterization
We consider now a 2-class problem defining the following values:
Table 5.4: Different Classifiers of Opcodes Frequency Distribution Similarity
I classifier
0.99800 CosSim
II classifier
0.99500 CosSim
III classifier
0.99000 CosSim
For each values, after build their confusion matrix, it is possible to estimate:
• TPR : True Positive Rate. Measures the proportion of actual positives which are
correctly identified. Also defined as Recall
• FPR : False Positive Rate. The probability of falsely rejecting the null hypothesis.
• PPV : Positive Predictive Values . The probability that samples with a positive
test truly is a malware. Also defined as Precision.
• ACC : Accuracy. Is a measure of the sistematic error, the proximity of measurement results to the true value.
189
• F1-score : Is a measure of a test’s accuracy. It considers both the precision p and
the recall r of the test to compute the score.
• FP/FN : Is the ratio between False Positives and False Negatives, useful to establish the performance limits of the classifier.
5.3.2
5.3.2.1
Evaluations Detail
I classifier: 0.99800 CosSim
For the first classifier we obtained the following confusion matrix:
Table 5.5: Confusion Matrix for 0.99800
Negative Cases
Positive Cases
Predicted Negative
TN: 46.5%
FN: 7.15%
Predicted Positive
FP: 3.5%
TP: 42.85%
Table 5.6: Performance measures for 0.99800 CosSim
Measure
TPR
FPR
PPV
ACC
F1-score
FP/FN
5.3.2.2
value
85.7%
0.7%
92.4%
89.4%
0.889
0.489
II classifier: 0.99500 CosSim
For the second classifier we obtained the following confusion matrix:
Negative Cases
Positive Cases
Predicted Negative
TN: 45%
FN: 5.85%
Predicted Positive
FP: 5%
TP: 44.15%
190
Table 5.8: Performance measures for 0.99800 CosSim
Measure
TPR
FPR
PPV
ACC
F1-score
FP/FN
5.3.2.3
value
88.3%
10%
89.8%
89.1%
0.890
0.854
III classifier: 0.99000 CosSim
For the third classifier we obtained the following confusion matrix:
Negative Cases
Positive Cases
Predicted Negative
TN: 41.8%
FN: 5.4%
Predicted Positive
FP: 8.2%
TP: 44.6%
Table 5.10: Performance measures for 0.99800
Measure
TPR
FPR
PPV
ACC
F1-score
FP/FN
value
89.2%
16.4%
84.4%
86.4%
0.867
1.515
5.3.3
191
Conclusions and Considerations
• RQ 1 ) What is the Detection Accuracy of the proposed methodology ?
Table 5.11: Accuracy of the three Classifiers
Classifier
0.99800 CosSim
0.99500 CosSim
0.99000 CosSim
value
89%
89%
86%
• RQ 2 ) What is the Detection Effectiveness of the proposed methodology
?
Table 5.12: Recall of the three Classifiers
Classifier
0.99800 CosSim
0.99500 CosSim
0.99000 CosSim
value
85%
88%
89%
It is possible to verify that defining a low threshold for the Opcodes Frequency
Distribution, (i.e the 0.99000), more samples we are able to identify as malware.
Table 5.13: Precision of the three Classifiers
Classifier
0.99800 CosSim
0.99500 CosSim
0.99000 CosSim
value
92%
90%
84%
More the threshold is high (i.e. the 0.99800 classifier), more the data extracted
corresponds to the exact data.
• RQ 3) As can be satisfactory the extracted data in terms of quality
(F-Score, FP/FN rate) ?
192
Table 5.14: F-Score of the three Classifiers
Classifier
0.99800 CosSim
0.99500 CosSim
0.99000 CosSim
value
88%
89%
87%
We obtained values of the F1-Score between 87% (0.99000) and 89% (0.99500),
we believe that our defined classifiers are still satisfactory.
Table 5.15: FP/FN ratio of the three Classifiers
Classifier
0.99800 CosSim
0.99500 CosSim
0.99000 CosSim
value
49%
85%
151%
About the FP/FN ratio, we can state that the II classifier is the best choiche
because the ratio remains below the value 1. It is still a good trade-off.
Then evaluate the 3 classifiers on the ROC space:
Figure 5.6: 3 classifiers on the ROC space
A Performance comparison of the 3 classifiers:
Figure 5.7: 3 classifiers - Performance Comparison
193
194
The FP/FN ratio comparison:
Figure 5.8: 3 classifiers - FP/FN Comparison
Only about the 0.99000 CosSim classifier we have a ratio greater than 1, this means
that, in that case, the number of False Positives is greater than the number of False
Negatives. This result means that we can opt for a good classifier both the 0.99800 and
the 0.99500.
Chapter 6
Applications and future works
This Master Thesis work is placed in the context of a Malware Analysis Research Project.
Our defined methodology, based on a code-clone heuristic is based upon a new approach
for the Malware Analysis, representing an inversion in the strategy, placing it before any
Anti-malware solution.
Our approach, use an heuristic which is based on the extraction of the Opcodes Frequency Distribution for each sample and then a similarity function can measure the
distance towards our trained dataset, then we extract the Common Subgraphs in both
the CFG and FCG to measure the Isomorphism between malware. After a classifier is
able tocategorize our candidate apk in the right class.
This method is able to outline, for each candidate malware, a possible genome, and
possible descents based on the inherited portions of payload. And above all it is resilient
to Android Common Obfuscation Techniques.
About the possible applications, It is possible to place this methodology before any existing Anti Malware Detection solution for example for build signs, to pre-filter malware,
to correlate malware informations. It is also possible to put it after an Anti Malware
Detection Solution to proper classify malware and study its philogenesys.
In conclusion, we can assert of realizing a good methodology, with a good predictive
195
Chapter 6. Applications and future works
196
power in detecting Android Malware and with an high predictive power in classifying
android malware.
A big disvantage of this methodology is that isn’t scalable at the moment, this means
that we need a big computation time for the analysis of apk bigger than 100MB. But in
this moment we want to focus on the goodness of the method, focusing on the Accuracy
rather than the computational time.
In the optics to carry on this Research Project, we need ,for example :
• Work on the scalability.
• Updates and expands our trained dataset. Actually based on the Drebin Project.
• Compute the Opcodes Frequency Distribution distance with other similarity scores
than the Cosine Similarity.
• Outline common payload portions between malware families.
• Study the effectiveness against common Opcodes Trasformations.
197
Appendix A. Partial code of Descent tool. Opcodes Frequency Distribution
Computation
Appendix A
Partial code of Descent tool.
Opcodes Frequency Distribution
Computation
Listing A.1: Partial code of Descent tool. Opcodes Frequency Distribution Computation
%relFreqOpcodes = ("nop_op", 0 , "move", 0
,"move_exception", 0
,"move_object", 0 , "return_op", 0 , "return_object", 0, "
return_void", 0 , "return_wide", 0 , "monitor_enter", 0 , "
monitor_exit", 0 , "goto_op", 0 , "goto16", 0 , "goto32", 0 ,
"invoke_virtual", 0 , "invoke_super", 0 , "invoke_direct", 0
, "invoke_static", 0 , "invoke_interface", 0 , "
packed_switch", 0 , "sparse_switch", 0 , "arrayop", 0 , "
iftest", 0 , "instanceop", 0 , "static_op", 0 , "iftestz", 0
, "cmpop", 0 , "unop", 0 , "binop", 0 , "throw", 0 );
%relFreqOpcodes2 = ("nop_op", 0 , "move", 0
,"move_exception", 0
,"move_object", 0 , "return_op", 0 , "return_object", 0, "
return_void", 0 , "return_wide", 0 , "monitor_enter", 0 , "
monitor_exit", 0 , "goto_op", 0 , "goto16", 0 , "goto32", 0 ,
"invoke_virtual", 0 , "invoke_super", 0 , "invoke_direct", 0
, "invoke_static", 0 , "invoke_interface", 0 , "
packed_switch", 0 , "sparse_switch", 0 , "arrayop", 0 , "
iftest", 0 , "instanceop", 0 , "static_op", 0 , "iftestz", 0
, "cmpop", 0 , "unop", 0 , "binop", 0 , "throw", 0 );
198
Computation
my %L_FCG= ();
my %topTen= ();
my %opCFG;
my %opFCG;
my %nodeOP = ();
my @edges = $dom->getElementsByTagName("edge");
my @nodeId;
my %method = ();
for (keys %L_FCG)
{delete $L_FCG{$_};}
for (keys %topTen)
{delete $topTen{$_};}
for (keys %nodeOP)
{delete $nodeOP{$_};}
for (keys %method)
{delete $method{$_};}
my %CFGcounter= ();
for (keys %CFGcounter)
{delete $CFGcounter{$_};}
my %FCGcounter= ();
for (keys %FCGcounter){delete $FCGcounter{$_};}
my %totalFCG= ();
for (keys %totalFCG)
{delete $totalFCG{$_};}
my %totalCFG= ();
for (keys %totalCFG)
{delete $totalCFG{$_};}
%methodSimRatio= ();
for (keys %methodSimRatio)
{delete $methodSimRatio{$_};}
%methodSimRatio2= ();
for (keys %methodSimRatio2)
{delete $methodSimRatio2{$_};}
my %methodSim= ();
for (keys %methodSim)
{delete $methodSim{$_};}
199
Computation
foreach my $a (@nodes){
my @att;
#print $a->getAttribute(’id’)."\n";
push(@nodeId,$a->getAttribute(’id’));
@att= $a->getChildnodes();
foreach(@att){
if(($_->nodeName) eq ’att’){
if($_->getAttribute(’name’) eq ’name’){
$tempName=$_->getAttribute(’value’);}
elsif(($_->getAttribute(’name’) eq ’node.label’) and ($_->
getAttribute(’value’) !˜ /ˆL/)
){
$L_FCG{$a->getAttribute(’id’)}=$tempName;
$totalOpcodes++;
my $opcode = $_->getAttribute(’value’);
$opcode =˜ s/ˆ.*\\n// ;
$method{$tempName}.=",".$opcode;
if($opcode =˜ m/nop/gi){$relFreqOpcodes{nop_op}++}
elsif(($opcode =˜ m/ˆmove-exception$/)){$relFreqOpcodes{’
move_exception’}++}
elsif(($opcode =˜ m/ˆmove-object/)){$relFreqOpcodes{’move_object
’}++}
elsif((($opcode !˜ m/ˆmove-exception$/) or ($opcode !˜ m/ˆmoveobject$/)) and ($opcode =˜ m/ˆmove/)){$relFreqOpcodes{’move
’}++}
elsif($opcode =˜ m/ˆreturn$/gi){$relFreqOpcodes{’return_op’}++}
elsif($opcode =˜ m/ˆreturn-object$/gi){$relFreqOpcodes{’
return_object’}++}
elsif($opcode =˜ m/ˆreturn-void$/gi){$relFreqOpcodes{’return_void
’}++}
elsif($opcode =˜ m/ˆreturn-wide$/gi){$relFreqOpcodes{’return_wide
’}++}
elsif($opcode =˜ m/ˆmonitor-enter$/gi){$relFreqOpcodes{’
monitor_enter’}++}
elsif($opcode =˜ m/ˆmonitor-exit$/gi){$relFreqOpcodes{’
monitor_exit’}++}
elsif($opcode =˜ m/ˆgoto$/gi){$relFreqOpcodes{’goto_op’}++}
elsif($opcode =˜ m/ˆgoto\/16$/gi){$relFreqOpcodes{’goto16’}++}
elsif($opcode =˜ m/ˆgoto\/32$/gi){$relFreqOpcodes{’goto32’}++}
200
Computation
elsif($opcode =˜ m/învoke-virtual$/gi){$relFreqOpcodes{’
invoke_virtual’}++}
elsif($opcode =˜ m/învoke-super$/gi){$relFreqOpcodes{’
invoke_super’}++}
elsif($opcode =˜ m/învoke-direct$/gi){$relFreqOpcodes{’
invoke_direct’}++}
elsif($opcode =˜ m/învoke-static$/gi){$relFreqOpcodes{’
invoke_static’}++}
elsif($opcode =˜ m/învoke-interface$/gi){$relFreqOpcodes{’
invoke_interface’}++}
elsif($opcode =˜ m/ˆpacked-switch$/gi){$relFreqOpcodes{’
packed_switch’}++}
elsif($opcode =˜ m/ˆsparse-switch$/gi){$relFreqOpcodes{’
sparse_switch’}++}
elsif(($opcode =˜ m/âget$/) or ($opcode =˜ m/âget-wide$/)
or (
$opcode =˜ m/âget-object$/) or ($opcode =˜ m/âget-boolean$
/)
or ($opcode =˜ m/âget-byte$/) or ($opcode =˜ m/âget-
char$/)
or ($opcode =˜ m/âget-short$/) or ($opcode =˜ m/ˆ
aput$/)
or ($opcode =˜ m/âput-wide$/) or ($opcode =˜ m/ˆ
aput-object$/)
or ($opcode =˜ m/âput-boolean/) or ($opcode
=˜ m/âput-byte$/)
or ($opcode =˜ m/âput-char$/) or (
$opcode =˜ m/âput-short$/)){$relFreqOpcodes{’arrayop’}++}
elsif(($opcode =˜ m/îf-eq$/) or ($opcode =˜ m/îf-ne$/)
$opcode =˜ m/îf-lt$/) or ($opcode =˜ m/îf-ge$/)
or (
or (
$opcode =˜ m/îf-gt$/) or ($opcode =˜ m/îf-le$/)){
$relFreqOpcodes{’iftest’}++}
elsif(($opcode =˜ m/îget$/) or ($opcode =˜ m/îget-wide$/)
or (
$opcode =˜ m/îget-object$/) or ($opcode =˜ m/îget-boolean$
/)
or ($opcode =˜ m/îget-byte$/) or ($opcode =˜ m/îget-
char$/) or ($opcode =˜ m/îget-short$/) or ($opcode =˜ m/ˆ
iput$/)
or ($opcode =˜ m/îput-wide$/) or ($opcode =˜ m/ˆ
iput-object$/) or ($opcode =˜ m/îput-boolean$/) or ($opcode
=˜ m/îput-byte$/)
or ($opcode =˜ m/îput-short$/) or (
$opcode =˜ m/îput-char$/)){$relFreqOpcodes{’instanceop’}++}
elsif(($opcode =˜ m/ˆcmpl-float/g) or ($opcode =˜ m/ˆcmpg-float/g
)
or ($opcode =˜ m/ˆcmpl-double/g) or ($opcode =˜ m/ˆcmpg-
double/g)
cmpop’}++}
or ($opcode =˜ m/ˆcmp-long/g)){$relFreqOpcodes{’
201
Computation
202
elsif(($opcode =˜ m/ˆsget$/) or ($opcode =˜ m/ˆsget-wide$/)
or (
$opcode =˜ m/ˆsget-object$/) or ($opcode =˜ m/ˆsget-boolean$/)
or ($opcode =˜ m/ˆsget-byte$/) or ($opcode =˜ m/ˆsget-char$/)
or ($opcode =˜ m/ˆsget-short$/) or ($opcode =˜ m/ˆsput$/)
or (
$opcode =˜ m/ˆsput-wide$/) or ($opcode =˜ m/ˆsput-object$/)
or
($opcode =˜ m/ˆsput-boolean/) or ($opcode =˜ m/ˆsput-byte$/)
or
($opcode =˜ m/ˆsput-char$/) or ($opcode =˜ m/ˆsput-short$/)){
$relFreqOpcodes{’static_op’}++}
elsif(($opcode =˜ m/îf-eqz$/) or ($opcode =˜ m/îf-nez$/)
$opcode =˜ m/îf-ltz$/) or ($opcode =˜ m/îf-gez$/)
or (
or ($opcode
=˜ m/îf-gtz$/) or ($opcode =˜ m/îf-lez$/)){$relFreqOpcodes{’
iftestz’}++}
elsif(($opcode =˜ m/ˆneg-int$/)
or ($opcode =˜ m
/ˆnot-int$/) or ($opcode =˜ m/ˆneg-long$/)
or ($opcode =˜ m/ˆ
not-long$/) or ($opcode =˜ m/ˆneg-float$/)
neg-double$/) or ($opcode =˜ m/înt-to-long$/)
or ($opcode =˜ m
/înt-to-float$/) or ($opcode =˜ m/înt-to-double$/)
or (
$opcode =˜ m/ˆlong-to-int/) or ($opcode =˜ m/ˆlong-to-float$/)
or ($opcode =˜ m/ˆlong-to-double$/) or ($opcode =˜ m/ˆfloat-toint$/) or ($opcode =˜ m/ˆfloat-to-long$/)
float-to-double$/) or ($opcode =˜ m/ˆdouble-to-int$/)
or (
$opcode =˜ m/ˆdouble-to-long/) or ($opcode =˜ m/ˆdouble-tofloat$/)
or ($opcode =˜ m/înt-to-byte$/) or ($opcode =˜ m/înt
-to-char$/) or ($opcode =˜ m/înt-to-short$/)){$relFreqOpcodes{’
unop’}++}
elsif(($opcode =˜ m/âdd-int/) or ($opcode =˜ m/ˆsub-int/)
$opcode =˜ m/ˆmul-int/) or ($opcode =˜ m/ˆdiv-int/)
=˜ m/ˆrem-int/) or ($opcode =˜ m/ând-int/)
or-int/) or ($opcode =˜ m/ˆxor-int/)
/) or ($opcode =˜ m/ˆshr-int/)
($opcode =˜ m/âdd-long$/)
or (
or ($opcode
or ($opcode =˜ m/ˆshl-int
or ($opcode =˜ m/ûshr-int/) or
or ($opcode =˜ m/ˆsub-long$/) or (
$opcode =˜ m/ˆmul-long/) or ($opcode =˜ m/ˆdiv-long$/)
or (
$opcode =˜ m/ˆrem-long$/) or ($opcode =˜ m/ând-long$/)
$opcode =˜ m/ôr-long/) or ($opcode =˜ m/ˆxor-long/)
or (
or (
$opcode =˜ m/ˆshl-long/) or ($opcode =˜ m/ˆshr-long/) or (
$opcode =˜ m/ûshr-long/) or ($opcode =˜ m/âdd-float/) or (
$opcode =˜ m/ˆsub-float/) or ($opcode =˜ m/ˆmul-float/)
or (
$opcode =˜ m/ˆdiv-float/) or ($opcode =˜ m/ˆrem-float/)
or (
$opcode =˜ m/âdd-double/) or ($opcode =˜ m/ˆsub-double/)
or (
$opcode =˜ m/ˆmul-double/) or ($opcode =˜ m/ˆdiv-double/) or (
$opcode =˜ m/ˆrem-double/)){$relFreqOpcodes{’binop’}++}
Computation
##################### Extracting Adjacency Lists for CFG and FCG
#####################################
print "\n\t > Extracting Adjacency Lists for CFG and FCG ...\n";
foreach my $e (@edges){
my($label) = $e->getAttribute(’label’);
if($label =˜ m/$cfg$/g){
my $S= $e->getAttribute(’source’);
$edgeCFG{$S}.=",".$e->getAttribute(’target’);
if(!exists $opCFG{$S}){$opCFG{$S}=$nodeOP{$e->getAttribute(’
source’)}}
$opCFG{$S}.=",". $nodeOP{$e->getAttribute(’target’)}
}elsif($label =˜ m/$fcg$/g){
my $S= $e->getAttribute(’source’);
$edgeFCG{$L_FCG{$S}}.=",".$L_FCG{$e->getAttribute(’target’)};
#$opFCG{$S}.=",". $nodeOP{$e->getAttribute(’target’)}
}}
my $json_CFG = encode_json(\%opCFG);
my $json_FCG = encode_json(\%edgeFCG);
$json_CFG=˜s/(\d+)//g;
my @opcfg=split(’:’,$json_CFG);
my @tempopcfg= sort @opcfg;
@opcfg="";
@opcfg= uniq(@tempopcfg);
my %hash= ();
for (keys %hash)
{delete $hash{$_};}
my %relFreqOpcodes01= ();
for (keys %relFreqOpcodes01)
{delete $relFreqOpcodes01{$_};}
203
Computation
##############################################################
my $methodsTotal=0;
foreach my $key ( keys %method )
{
$methodsTotal++;
my %relFreqOpcodes00= ();
for (keys %relFreqOpcodes00)
{delete $relFreqOpcodes00{$_};}
%relFreqOpcodes00 = ("nop_op", 0 , "move_object", 0 ,"
move_exception", 0 ,"move", 0
, "return_op", 0 , "
return_object", 0, "return_void", 0 , "return_wide", 0 , "
monitor_enter", 0 , "monitor_exit", 0 , "goto_op", 0 , "
goto16", 0 , "goto32", 0 , "invoke_virtual", 0 , "
invoke_super", 0 , "invoke_direct", 0 , "invoke_static", 0 ,
"invoke_interface", 0 , "packed_switch", 0 , "sparse_switch",
0 , "arrayop", 0 , "iftest", 0 , "instanceop", 0 , "
static_op", 0 , "iftestz", 0 , "cmpop", 0 , "unop", 0 , "
binop", 0 , "throw", 0 );
my @List="";
@List= split(’,’,$method{$key});
shift @List; foreach(@List){#print "XXXXX ".$_ ."\n";
if($_ =˜ m/nop/gi){$relFreqOpcodes00{’nop_op’}++}
elsif(($_ =˜ m/ˆmove-exception$/)){$relFreqOpcodes00{’
move_exception’}++}
elsif(($_ =˜ m/ˆmove-object/)){$relFreqOpcodes00{’move_object
’}++}
elsif((($_ !˜ m/ˆmove-exception$/) or ($_ !˜ m/ˆmove-object$/))
and ($_ =˜ m/ˆmove/)){$relFreqOpcodes00{’move’}++}
elsif($_ =˜ m/ˆreturn$/gi){$relFreqOpcodes00{’return_op’}++}
elsif($_ =˜ m/ˆreturn-object$/gi){$relFreqOpcodes00{’
return_object’}++}
elsif($_ =˜ m/ˆreturn-void$/gi){$relFreqOpcodes00{’return_void
’}++}
elsif($_ =˜ m/ˆreturn-wide$/gi){$relFreqOpcodes00{’return_wide
’}++}
elsif($_ =˜ m/ˆmonitor-enter$/gi){$relFreqOpcodes00{’
monitor_enter’}++}
elsif($_ =˜ m/ˆmonitor-exit$/gi){$relFreqOpcodes00{’monitor_exit
’}++}
elsif($_ =˜ m/ˆgoto$/gi){$relFreqOpcodes00{’goto_op’}++}
204
Computation
elsif($_ =˜ m/ˆgoto\/16$/gi){$relFreqOpcodes00{’goto16’}++}
elsif($_ =˜ m/ˆgoto\/32$/gi){$relFreqOpcodes00{’goto32’}++}
elsif($_ =˜ m/învoke-virtual$/gi){$relFreqOpcodes00{’
invoke_virtual’}++}
elsif($_ =˜ m/învoke-super$/gi){$relFreqOpcodes00{’invoke_super
’}++}
elsif($_ =˜ m/învoke-direct$/gi){$relFreqOpcodes00{’
invoke_direct’}++}
elsif($_ =˜ m/învoke-static$/gi){$relFreqOpcodes00{’
invoke_static’}++}
elsif($_ =˜ m/învoke-interface$/gi){$relFreqOpcodes00{’
invoke_interface’}++}
elsif($_ =˜ m/ˆpacked-switch$/gi){$relFreqOpcodes00{’
packed_switch’}++}
elsif($_ =˜ m/ˆsparse-switch$/gi){$relFreqOpcodes00{’
sparse_switch’}++}
205
Computation
elsif(($_ =˜ m/âget$/) or ($_ =˜ m/âget-wide$/)
aget-object$/) or ($_ =˜ m/âget-boolean$/)
aget-byte$/) or ($_ =˜ m/âget-char$/)
short$/) or ($_ =˜ m/âput$/)
$_ =˜ m/âput-object$/)
=˜ m/âput-byte$/)
or ($_ =˜ m/ˆ
or ($_ =˜ m/ˆ
or ($_ =˜ m/âget-
or ($_ =˜ m/âput-wide$/) or (
or ($_ =˜ m/âput-boolean/) or ($_
or ($_ =˜ m/âput-char$/) or ($_ =˜ m/ˆ
aput-short$/)){$relFreqOpcodes00{’arrayop’}++}
elsif(($_ =˜ m/îf-eq$/) or ($_ =˜ m/îf-ne$/)
lt$/) or ($_ =˜ m/îf-ge$/)
or ($_ =˜ m/îf-
or ($_ =˜ m/îf-gt$/) or ($_ =˜
m/îf-le$/)){$relFreqOpcodes00{’iftest’}++}
elsif(($_ =˜ m/îget$/) or ($_ =˜ m/îget-wide$/)
iget-object$/) or ($_ =˜ m/îget-boolean$/)
or ($_ =˜ m/ˆ
or ($_ =˜ m/ˆ
iget-byte$/) or ($_ =˜ m/îget-char$/) or ($_ =˜ m/îgetshort$/) or ($_ =˜ m/îput$/)
or ($_ =˜ m/îput-wide$/) or (
$_ =˜ m/îput-object$/) or ($_ =˜ m/îput-boolean$/) or ($_
=˜ m/îput-byte$/)
or ($_ =˜ m/îput-short$/) or ($_ =˜ m/ˆ
iput-char$/)){$relFreqOpcodes00{’instanceop’}++}
elsif(($_ =˜ m/ˆcmpl-float/g) or ($_ =˜ m/ˆcmpg-float/g)
=˜ m/ˆcmpl-double/g) or ($_ =˜ m/ˆcmpg-double/g)
or ($_
or ($_ =˜ m
/ˆcmp-long/g)){$relFreqOpcodes00{’cmpop’}++}
elsif(($_ =˜ m/ˆsget$/) or ($_ =˜ m/ˆsget-wide$/)
sget-object$/) or ($_ =˜ m/ˆsget-boolean$/)
sget-byte$/) or ($_ =˜ m/ˆsget-char$/)
short$/) or ($_ =˜ m/ˆsput$/)
$_ =˜ m/ˆsput-object$/)
=˜ m/ˆsput-byte$/)
or ($_ =˜ m/ˆ
or ($_ =˜ m/ˆ
or ($_ =˜ m/ˆsget-
or ($_ =˜ m/ˆsput-wide$/) or (
or ($_ =˜ m/ˆsput-boolean/) or ($_
or ($_ =˜ m/ˆsput-char$/) or ($_ =˜ m/ˆ
sput-short$/)){$relFreqOpcodes00{’static_op’}++}
elsif(($_ =˜ m/îf-eqz$/) or ($_ =˜ m/îf-nez$/)
-ltz$/) or ($_ =˜ m/îf-gez$/)
or ($_ =˜ m/îf
or ($_ =˜ m/îf-gtz$/) or ($_
=˜ m/îf-lez$/)){$relFreqOpcodes00{’iftestz’}++} elsif(($_
=˜ m/ˆneg-int$/)
long$/)
or ($_ =˜ m/ˆnot-int$/) or ($_ =˜ m/ˆneg-
or ($_ =˜ m/ˆnot-long$/) or ($_ =˜ m/ˆneg-float$/)
or ($_ =˜ m/ˆneg-double$/) or ($_ =˜ m/înt-to-long$/)
or (
$_ =˜ m/înt-to-float$/) or ($_ =˜ m/înt-to-double$/)
or (
$_ =˜ m/ˆlong-to-int/) or ($_ =˜ m/ˆlong-to-float$/)
or ($_
=˜ m/ˆlong-to-double$/) or ($_ =˜ m/ˆfloat-to-int$/) or ($_
=˜ m/ˆfloat-to-long$/)
or ($_ =˜ m/ˆfloat-to-double$/) or (
$_ =˜ m/ˆdouble-to-int$/)
or ($_ =˜ m/ˆdouble-to-long/) or (
$_ =˜ m/ˆdouble-to-float$/)
or ($_ =˜ m/înt-to-byte$/) or (
$_ =˜ m/înt-to-char$/) or ($_ =˜ m/înt-to-short$/)){
$relFreqOpcodes00{’unop’}++}
206
Computation
elsif(($_ =˜ m/âdd-int/) or ($_ =˜ m/ˆsub-int/)
mul-int/) or ($_ =˜ m/ˆdiv-int/)
$_ =˜ m/ând-int/)
int/)
or ($_ =˜ m/ˆ
or ($_ =˜ m/ˆrem-int/) or (
or ($_ =˜ m/ôr-int/) or ($_ =˜ m/ˆxor-
or ($_ =˜ m/ˆshl-int/) or ($_ =˜ m/ˆshr-int/)
=˜ m/ûshr-int/) or ($_ =˜ m/âdd-long$/)
or ($_
or ($_ =˜ m/ˆsub-
long$/) or ($_ =˜ m/ˆmul-long/) or ($_ =˜ m/ˆdiv-long$/)
($_ =˜ m/ˆrem-long$/) or ($_ =˜ m/ând-long$/)
or-long/) or ($_ =˜ m/ˆxor-long/)
or
or ($_ =˜ m/ˆ
or ($_ =˜ m/ˆshl-long/) or
($_ =˜ m/ˆshr-long/) or ($_ =˜ m/ûshr-long/) or ($_ =˜ m/ˆ
add-float/) or ($_ =˜ m/ˆsub-float/) or ($_ =˜ m/ˆmul-float/)
or ($_ =˜ m/ˆdiv-float/) or ($_ =˜ m/ˆrem-float/)
=˜ m/âdd-double/) or ($_ =˜ m/ˆsub-double/)
or ($_
or ($_ =˜ m/ˆ
mul-double/) or ($_ =˜ m/ˆdiv-double/) or ($_ =˜ m/ˆremdouble/)){$relFreqOpcodes00{’binop’}++}
##########################################
my $sp1= sprintf("%.5f",($relFreqOpcodes00{’nop_op’}/
$totalOpcodes));
$relFreqOpcodes01{$key}.= $sp1; $relFreqOpcodes01{$key}.=", ";
my $sp4= sprintf("%.5f",($relFreqOpcodes00{’move’}/$totalOpcodes)
);
my $sp3= sprintf("%.5f",($relFreqOpcodes00{’move_exception’}/
$totalOpcodes));
my $sp2= sprintf("%.5f",($relFreqOpcodes00{’move_object’}/
$totalOpcodes));
my $sp5= sprintf("%.5f",($relFreqOpcodes00{’return_op’}/
$totalOpcodes));
my $sp6= sprintf("%.5f",($relFreqOpcodes00{’return_object’}/
$totalOpcodes));
207
Computation
my $sp7= sprintf("%.5f",($relFreqOpcodes00{’return_void’}/
$totalOpcodes));
my $sp8= sprintf("%.5f",($relFreqOpcodes00{’return_wide’}/
$totalOpcodes));
my $sp9= sprintf("%.5f",($relFreqOpcodes00{’monitor_enter’}/
$totalOpcodes));
my $sp10= sprintf("%.5f",($relFreqOpcodes00{’monitor_exit’}/
$totalOpcodes));
my $sp11= sprintf("%.5f",($relFreqOpcodes00{’goto_op’}/
$totalOpcodes));
my $sp12= sprintf("%.5f",($relFreqOpcodes00{’goto16’}/
$totalOpcodes));
my $sp13= sprintf("%.5f",($relFreqOpcodes00{’goto32’}/
$totalOpcodes));
my $sp14= sprintf("%.5f",($relFreqOpcodes00{’invoke_virtual’}/
$totalOpcodes));
my $sp15= sprintf("%.5f",($relFreqOpcodes00{’invoke_super’}/
$totalOpcodes));
my $sp16= sprintf("%.5f",($relFreqOpcodes00{’invoke_direct’}/
$totalOpcodes));
my $sp17= sprintf("%.5f",($relFreqOpcodes00{’invoke_static’}/
$totalOpcodes));
my $sp18= sprintf("%.5f",($relFreqOpcodes00{’invoke_interface’}/
$totalOpcodes));
my $sp19= sprintf("%.5f",($relFreqOpcodes00{’packed_switch’}/
$totalOpcodes));
208
Computation
my $sp20= sprintf("%.5f",($relFreqOpcodes00{’sparse_switch’}/
$totalOpcodes));
my $sp21= sprintf("%.5f",($relFreqOpcodes00{’arrayop’}/
$totalOpcodes));
my $sp22= sprintf("%.5f",($relFreqOpcodes00{’iftest’}/
$totalOpcodes));
my $sp23= sprintf("%.5f",($relFreqOpcodes00{’instanceop’}/
$totalOpcodes));
my $sp24= sprintf("%.5f",($relFreqOpcodes00{’static_op’}/
$totalOpcodes));
my $sp25= sprintf("%.5f",($relFreqOpcodes00{’iftestz’}/
$totalOpcodes));
my $sp26= sprintf("%.5f",($relFreqOpcodes00{’cmpop’}/
$totalOpcodes));
my $sp27= sprintf("%.5f",($relFreqOpcodes00{’unop’}/$totalOpcodes
));
my $sp28= sprintf("%.5f",($relFreqOpcodes00{’binop’}/
$totalOpcodes));
my $sp29= sprintf("%.5f",($relFreqOpcodes00{’throw’}/
$totalOpcodes));
209
Computation
my $sth = $dbh->prepare( "SELECT * FROM opcodes_freq_distribution
");
$sth->execute;
while (@data = $sth->fetchrow_array()) {
my $id = $data[0];
my $nop_op = $data[2];
my $move = $data[3];
my $moveexception = $data[4];
my $moveobject = $data[5];
my $return_op = $data[6];
my $returnobject = $data[7];
my $returnvoid = $data[8];
my $returnwide = $data[9];
my $monitorenter = $data[10];
my $monitorexit = $data[11];
my $goto = $data[12];
my $goto16 = $data[13];
my $goto32 = $data[14];
my $invokevirtual = $data[15];
my $invokesuper = $data[16];
my $invokedirect = $data[17];
my $invokestatic = $data[18];
my $invokeinterfaces = $data[19];
my $packedswitch = $data[20];
my $sparseswitch = $data[21];
my $arrayop = $data[22];
my $iftest = $data[23];
my $instanceop = $data[24];
my $staticop = $data[25];
my $iftestz = $data[26];
my $cmpop = $data[27];
my $unop = $data[28];
my $binop = $data[29];
my $throw = $data[30];
$hash{$id}= $nop_op.",".$move.",".$moveexception.",".$moveobject
.",".$return_op.",".$returnobject.",".$returnvoid.",".
$returnwide.",".$monitorenter.",".$monitorexit.",".$goto.",".
$goto16.",".$goto32.",".$invokevirtual.",".$invokesuper.",".
$invokedirect.",".$invokestatic.",".$invokeinterfaces.",".
$packedswitch.",".$sparseswitch.",".$arrayop.",".$iftest.",".
$instanceop.",".$staticop.",".$iftestz.",".$cmpop.",".$unop
.",".$binop.",".$throw;
210
Computation
$sth->finish();
$dbh->commit or die $DBI::errstr;
foreach my $key ( keys %hash )
{
my @opc = split(’,’,$hash{$key});
$relFreqOpcodes2{’nop_op’}=$opc[0];
$relFreqOpcodes2{’move’}=$opc[1];
$relFreqOpcodes2{’move_exception’}=$opc[2];
$relFreqOpcodes2{’move_object’}=$opc[3];
$relFreqOpcodes2{’return_op’}=$opc[4];
$relFreqOpcodes2{’return_object’}=$opc[5];
$relFreqOpcodes2{’return_void’}=$opc[6];
$relFreqOpcodes2{’return_wide’}=$opc[7];
$relFreqOpcodes2{’monitor_enter’}=$opc[8];
$relFreqOpcodes2{’monitor_exit’}=$opc[9];
$relFreqOpcodes2{’goto_op’}=$opc[10];
$relFreqOpcodes2{’goto16’}=$opc[11];
$relFreqOpcodes2{’goto32’}=$opc[12];
$relFreqOpcodes2{’invoke_virtual’}=$opc[13];
$relFreqOpcodes2{’invoke_super’}=$opc[14];
$relFreqOpcodes2{’invoke_direct’}=$opc[15];
$relFreqOpcodes2{’invoke_static’}=$opc[16];
$relFreqOpcodes2{’invoke_interface’}=$opc[17];
$relFreqOpcodes2{’packed_switch’}=$opc[18];
$relFreqOpcodes2{’sparse_switch’}=$opc[19];
$relFreqOpcodes2{’arrayop’}=$opc[20];
$relFreqOpcodes2{’iftest’}=$opc[21];
$relFreqOpcodes2{’instanceop’}=$opc[22];
$relFreqOpcodes2{’static_op’}=$opc[23];
$relFreqOpcodes2{’iftestz’}=$opc[24];
$relFreqOpcodes2{’cmpop’}=$opc[25];
$relFreqOpcodes2{’unop’}=$opc[26];
$relFreqOpcodes2{’binop’}=$opc[27];
$relFreqOpcodes2{’throw’}=$opc[28];
211
Computation
my $s1 = sprintf("%.5f",($relFreqOpcodes{’nop_op’}/$totalOpcodes)
);
my $s2 = sprintf("%.5f",($relFreqOpcodes{’move’}/$totalOpcodes)
);
my $s3 = sprintf("%.5f",($relFreqOpcodes{’move_exception’}/
$totalOpcodes));
my $s4 = sprintf("%.5f",($relFreqOpcodes{’move_object’}/
$totalOpcodes));
my $s5 = sprintf("%.5f",($relFreqOpcodes{’return_op’}/
$totalOpcodes));
my $s6 = sprintf("%.5f",($relFreqOpcodes{’return_object’}/
$totalOpcodes));
my $s7 = sprintf("%.5f",($relFreqOpcodes{’return_void’}/
$totalOpcodes));
my $s8 = sprintf("%.5f",($relFreqOpcodes{’return_wide’}/
$totalOpcodes));
my $s9 = sprintf("%.5f",($relFreqOpcodes{’monitor_enter’}/
$totalOpcodes));
my $s10 = sprintf("%.5f",($relFreqOpcodes{’monitor_exit’}/
$totalOpcodes));
my $s11 = sprintf("%.5f",($relFreqOpcodes{’goto_op’}/
$totalOpcodes));
my $s12 = sprintf("%.5f",($relFreqOpcodes{’goto16’}/
$totalOpcodes));
my $s13 = sprintf("%.5f",($relFreqOpcodes{’goto32’}/
$totalOpcodes));
my $s14 = sprintf("%.5f",($relFreqOpcodes{’invoke_virtual’}/
$totalOpcodes));
my $s15 = sprintf("%.5f",($relFreqOpcodes{’invoke_super’}/
$totalOpcodes));
my $s16 = sprintf("%.5f",($relFreqOpcodes{’invoke_direct’}/
$totalOpcodes));
my $s17 = sprintf("%.5f",($relFreqOpcodes{’invoke_static’}/
$totalOpcodes));
my $s18 = sprintf("%.5f",($relFreqOpcodes{’invoke_interface’}/
$totalOpcodes));
my $s19 = sprintf("%.5f",($relFreqOpcodes{’packed_switch’}/
$totalOpcodes));
my $s20 = sprintf("%.5f",($relFreqOpcodes{’sparse_switch’}/
$totalOpcodes));
212
Computation
my $s21 = sprintf("%.5f",($relFreqOpcodes{’arrayop’}/
$totalOpcodes));
my $s29 = sprintf("%.5f",($relFreqOpcodes{’throw’}/
$totalOpcodes));
$sum += $s1 * $relFreqOpcodes2{’nop_op’};# $sum += $s2 *
$relFreqOpcodes2{’move’};# $sum += $s3 * $relFreqOpcodes2{’
move_exception’};#
$sum += $s4 * $relFreqOpcodes2{’move_object’};# $sum += $s5 *
$relFreqOpcodes2{’return_op’};#
$sum += $s6 * $relFreqOpcodes2{’return_object’};# $sum += $s7 *
$relFreqOpcodes2{’return_void’};#
$sum += $s8 * $relFreqOpcodes2{’return_wide’};# $sum += $s9 *
$relFreqOpcodes2{’monitor_enter’};#
$sum += $s10 * $relFreqOpcodes2{’monitor_exit’};# $sum += $s11 *
$relFreqOpcodes2{’goto_op’};#
$sum += $s12 * $relFreqOpcodes2{’goto16’};# $sum += $s13 *
$relFreqOpcodes2{’goto32’};#
$sum += $s14 * $relFreqOpcodes2{’invoke_virtual’};# $sum += $s15
* $relFreqOpcodes2{’invoke_super’};# $sum += $s16 *
$relFreqOpcodes2{’invoke_direct’};#
$sum += $s17 * $relFreqOpcodes2{’invoke_static’};# $sum += $s18 *
$relFreqOpcodes2{’invoke_interface’};# $sum += $s19 *
$relFreqOpcodes2{’packed_switch’};#
$sum += $s20 * $relFreqOpcodes2{’sparse_switch’};# $sum += $s21 *
$relFreqOpcodes2{’arrayop’};#
$sum += $s22 * $relFreqOpcodes2{’iftest’};# $sum += $s23 *
$relFreqOpcodes2{’instanceop’};#
$sum += $s24 * $relFreqOpcodes2{’static_op’};# $sum += $s25 *
$relFreqOpcodes2{’iftestz’};#
$sum += $s26 * $relFreqOpcodes2{’cmpop’};# $sum += $s27 *
$relFreqOpcodes2{’unop’};#
$sum += $s28 * $relFreqOpcodes2{’binop’};# $sum += $s29 *
$relFreqOpcodes2{’throw’};#
213
Bibliography
1. Android software stack.
http://commons.wikimedia.org/wiki/File:
Android-System-Architecture.svg. Accessed: 2014-10-12.
2. Android sandbox mechanism. http://CEnriqueOrtiz.com. Accessed: 201410-12.
3. Android sandbox mechanism.
http://www.ibm.com/developerworks/
library/x-androidsecurity/. Accessed: 2014-10-12.
4. Menelkir D. motorola milestone boot chain. http://menelkir.itroll.org/
2011/03/motorola-milestone-boot-chain.html. Accessed: 2014-10-23.
5. Android: Is it secure enough? http://www.sierraware.com/SE_Android_
Security_Integrity_Management.pdf. Accessed: 2014-11-08.
6. White paper : An overview of samsung knox. http://www.samsung.com/se/
business-images/resource/2013/samsung-knox-an-overview/%7B3%
7D/Samsung_KNOX_whitepaper-0-0-0.pdf. Accessed: 2014-10-14.
7. Samsung knox technical overview.
https://www.samsungknox.com/en/
products/knox-workspace/technical. Accessed: 2014-10-25.
8. Ti android gingerbread 2.3.4 devkit 2.1 portingguides. http://processors.
wiki.ti.com/index.php/TI-Android-GingerBread-2.3.4-DevKit-2.
1_PortingGuides. Accessed: 2014-10-18.
9. Nitay Artenstein and Idan Revivo. Man in the binder: He who controls ipc, controls
the droid. BlackHat eu ’14, September 2014.
10. Deep dive into binder. https://thenewcircle.com/s/post/1340/Deep_
Dive_Into_Binder_Presentation.htm. Accessed: 2014-10-18.
214
Bibliography
215
11. Android security overview. https://source.android.com/devices/tech/
security/. Accessed: 2014-10-12.
12. Jeff Six. An in depth introduction to the android permission model. OWASP
pubblication presented at AppSecDC 2012, April 2012.
13. Mcafee 2014 mobile security consumer trends report. http://www.mcafee.com/
sg/resources/reports/rp-mobile-security-consumer-trends.pdf.
Accessed: 2014-10-31.
14. Yajin Zhou and Xuxian Jiang. Dissecting android malware: Characterization and
evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy,
SP ’12, pages 95–109, Washington, DC, USA, 2012. IEEE Computer Society.
15. A. Amamra, C. Talhi, and J. Robert. Smartphone malware detection: From a survey
towards taxonomy. In Malicious and Unwanted Software (MALWARE), 2012 7th
International Conference on, pages 79–86, Oct 2012.
16. Xing Liu and Jiqiang Liu. A two-layered permission-based android malware detection scheme. In Proceedings of the 2014 2Nd IEEE International Conference on
Mobile Cloud Computing, Services, and Engineering, MOBILECLOUD ’14, pages
142–148, Washington, DC, USA, 2014. IEEE Computer Society.
17. A.M. Aswini and P. Vinod. Droid permission miner: Mining prominent permissions
for android malware analysis. In Applications of Digital Information and Web Technologies (ICADIWT), 2014 Fifth International Conference on the, pages 81–86, Feb
2014.
18. Shuang Liang and Xiaojiang Du. Permission-combination-based scheme for android
mobile malware detection. In Communications (ICC), 2014 IEEE International
Conference on, pages 2301–2306, June 2014.
19. M. Chandramohan and Hee Beng Kuan Tan. Detection of mobile malware in the
wild. Computer, 45(9):65–71, Sept 2012.
20. V. Rastogi, Yan Chen, and Xuxian Jiang. Catch me if you can: Evaluating android
anti-malware against transformation attacks. Information Forensics and Security,
IEEE Transactions on, 9(1):99–108, Jan 2014.
Bibliography
216
21. Av-test report ”36 security apps for android are put under constant
fire”.
http://www.av-test.org/en/news/news-single-view/
36-security-apps-for-android-are-put-under-constant-fire/?=.
Accessed: 2014-12-05.
22. Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. Structural detection of android malware using embedded call graphs. In Proceedings of the 2013
ACM Workshop on Artificial Intelligence and Security, AISec ’13, pages 45–54, New
York, NY, USA, 2013. ACM.
23. Kai Chen, Peng Liu, and Yingjun Zhang. Achieving accuracy and scalability simultaneously in detecting application clones on android markets. In Proceedings of the
36th International Conference on Software Engineering, ICSE 2014, pages 175–186,
New York, NY, USA, 2014. ACM.
24. Android
intent.
http://luca-petrosino.blogspot.it/2011/03/
intent-e-intent-filter.html. Accessed: 2014-10-15.
25. Daoyuan W. On the feasibility of automatically generating android component
hijacking exploits. HitCon ’14, (21), August 2014.
26. Binders windows tokens. http://www.androiddesignpatterns.com/2013/
07/binders-window-tokens.html. Accessed: 2014-10-18.
27. William Enck, Machigar Ongtang, and Patrick McDaniel. On lightweight mobile
phone application certification. In Proceedings of the 16th ACM Conference on
Computer and Communications Security, CCS ’09, pages 235–245, New York, NY,
USA, 2009. ACM.
28. Security enhancements in android 4.4.
https://source.android.com/
devices/tech/security/enhancements44.html. Accessed: 2014-10-13.
29. Validating security-enhanced linux in android. https://source.android.com/
devices/tech/security/se-linux.html. Accessed: 2014-10-13.
30. Android intent. https://developer.android.com/reference/android/
content/Intent.html. Accessed: 2014-10-14.
Bibliography
217
31. Yifei Wang, Srinivas Hariharan, Chenxi Zhao, Jiaming Liu, and Wenliang Du. Compac: enforce component-level access control in android. In CODASPY’14, pages
25–36, 2014.
32. Symantec
2013
mobile
adware
and
malware
analysis
report.
http:
//www.symantec.com/content/en/us/enterprise/media/security_
response/whitepapers/madware_and_malware_analysis.pdf.
Ac-
cessed: 2014-11-04.
33. G. Suarez-Tangil, J.E. Tapiador, P. Peris-Lopez, and A. Ribagorda. Evolution,
detection and analysis of malware for smart devices. Communications Surveys Tutorials, IEEE, 16(2):961–987, Second 2014.
34. Shuaifu Dai, Yaxin Liu, Tielei Wang, Tao Wei, and Wei Zou. Behavior-based malware detection on mobile phone. In Wireless Communications Networking and Mobile Computing (WiCOM), 2010 6th International Conference on, pages 1–4, Sept
2010.
35. Abhijit Bose, Xin Hu, Kang G. Shin, and Taejoon Park. Behavioral detection of
malware on mobile handsets. In Proceedings of the 6th International Conference on
Mobile Systems, Applications, and Services, MobiSys ’08, pages 225–238, New York,
NY, USA, 2008. ACM.
36. Hahnsang Kim, Joshua Smith, and Kang G. Shin. Detecting energy-greedy anomalies and mobile malware variants. In Proceedings of the 6th International Conference
on Mobile Systems, Applications, and Services, MobiSys ’08, pages 239–252, New
York, NY, USA, 2008. ACM.
37. B. Dixon and S. Mishra. Power based malicious code detection techniques for smartphones. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on, pages 142–149, July 2013.
38. T.K. Buennemeyer, T.M. Nelson, L.M. Clagett, J.P. Dunning, R.C. Marchany, and
J.G. Tront. Mobile device profiling and intrusion detection using smart batteries.
In Hawaii International Conference on System Sciences, Proceedings of the 41st
Annual, pages 296–296, Jan 2008.
39. Aubrey-Derrick Schmidt, Frank Peters, Florian Lamour, and Sahin Albayrak. Monitoring smartphones for anomaly detection. In Proceedings of the 1st International
Bibliography
218
Conference on MOBILe Wireless MiddleWARE, Operating Systems, and Applications, MOBILWARE ’08, pages 40:1–40:6, ICST, Brussels, Belgium, Belgium, 2007.
ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications
Engineering).
40. A.-D. Schmidt, R. Bye, H.-G. Schmidt, J. Clausen, O. Kiraz, K.A. Yuksel, S.A.
Camtepe, and S. Albayrak. Static analysis of executables for collaborative malware detection on android. In Communications, 2009. ICC ’09. IEEE International
Conference on, pages 1–5, June 2009.
41. David Barrera, H. Güneş Kayacik, Paul C. van Oorschot, and Anil Somayaji. A
methodology for empirical analysis of permission-based security models and its application to android. In Proceedings of the 17th ACM Conference on Computer and
Communications Security, CCS ’10, pages 73–84, New York, NY, USA, 2010. ACM.
42. A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Static analysis for extracting
permission checks of a large scale framework: The challenges and solutions for analyzing android. Software Engineering, IEEE Transactions on, 40(6):617–632, June
2014.
43. M. Christodorescu, S. Jha, S.A. Seshia, D. Song, and R.E. Bryant. Semantics-aware
malware detection. In Security and Privacy, 2005 IEEE Symposium on, pages 32–46,
May 2005.
44. Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Giovanni
Vigna. Polymorphic worm detection using structural information of executables. In
Proceedings of the 8th International Conference on Recent Advances in Intrusion
Detection, RAID’05, pages 207–226, Berlin, Heidelberg, 2006. Springer-Verlag.
45. Joris Kinable and Orestis Kostakis. Malware classification based on call graph
clustering. J. Comput. Virol., 7(4):233–245, November 2011.
46. Chanchal Kumar Roy and James R. Cordy. A survey on software clone detection
research. SCHOOL OF COMPUTING TR 2007-541, QUEEN’S UNIVERSITY,
115, 2007.
47. D. B. West, editor. Introduction to Graph Theory (2nd Ed.). Prentice Hall, 2000.
Bibliography
219
48. S. Cesare and Yang Xiang. Malware variant detection using similarity search over
sets of control flow graphs. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2011 IEEE 10th International Conference on, pages 181–
189, Nov 2011.
49. Daniel Arp, Michael Spreitzenbarth, Malte Huebner, Hugo Gascon, and Konrad
Rieck. Drebin: Efficient and explainable detection of android malware in your
pocket. 21th Annual Network and Distributed System Security Symposium (NDSS),
2014.
50. Michael Spreitzenbarth, Florian Echtler, Thomas Schreck, Felix C. Freling, and
Johannes Hoffmann. Mobilesandbox: Looking deeper into android applications.
28th International ACM Symposium on Applied Computing (SAC), 2013.

Detecting Android Malware Variants using Opcodes

Transcript

Documenti analoghi

Mobile Malware Report Q4 2015

G Data Mobile Malware Report H1/2013

Palo Alto Networks

PDF

1Q 2013 security roundup

Safeguarding Against Pinkslipbot

Chapter 4 Kernel Initialization (init)

Frazier (1965) proposed that “the ability to transmit Pierce`s disease

About "Correlations"

Note - IT eBooks