Advances in Intelligent Systems and Computing 887
Kohei Arai Supriya Kapoor Rahul Bhatia Editors
Advances in Information and Communication Networks Proceedings of the 2018 Future of Information and Communication Conference (FICC), Vol. 2
Advances in Intelligent Systems and Computing Volume 887
Series editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results.
Advisory Board

Chairman
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
e-mail: [email protected]

Members
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
e-mail: [email protected]
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
e-mail: [email protected]
Hani Hagras, School of Computer Science & Electronic Engineering, University of Essex, Colchester, UK
e-mail: [email protected]
László T. Kóczy, Department of Information Technology, Faculty of Engineering Sciences, Győr, Hungary
e-mail: [email protected]
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
e-mail: [email protected]
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
e-mail: [email protected]
Jie Lu, Faculty of Engineering and Information, University of Technology Sydney, Sydney, NSW, Australia
e-mail: [email protected]
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
e-mail: [email protected]
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
e-mail: [email protected]
Ngoc Thanh Nguyen, Wrocław University of Technology, Wrocław, Poland
e-mail: [email protected]
Jun Wang, Department of Mechanical and Automation, The Chinese University of Hong Kong, Shatin, Hong Kong
e-mail: [email protected]
More information about this series at http://www.springer.com/series/11156
Kohei Arai • Supriya Kapoor • Rahul Bhatia
Editors
Advances in Information and Communication Networks Proceedings of the 2018 Future of Information and Communication Conference (FICC), Vol. 2
Editors

Kohei Arai
Faculty of Science and Engineering, Saga University, Saga, Japan

Supriya Kapoor
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK

Rahul Bhatia
The Science and Information (SAI) Organization, New Delhi, India
ISSN 2194-5357            ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-3-030-03404-7    ISBN 978-3-030-03405-4 (eBook)
https://doi.org/10.1007/978-3-030-03405-4
Library of Congress Control Number: 2018959425

© Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
On behalf of the organizing committee of the Future of Information and Communication Conference (FICC), it is an honour and a great pleasure to welcome you to FICC 2018, which was held from 5 to 6 April 2018 in Singapore. The Conference is organized by SAI Conferences, a group of annual conferences produced by The Science and Information (SAI) Organization, based in the UK.

Digital services are changing the lives of people across the globe, and information and communication technology plays an increasingly significant role in our society. The Future of Information and Communication Conference (FICC) 2018 focuses on opportunities for researchers from all over the world. Bringing together experts from both industry and academia, the Conference delivers a programme of the latest research contributions and future visions (inspired by the issues of the day) in the field and their potential impact across industries.

FICC 2018 attracted a total of 361 submissions from pioneering academic researchers, scientists, industrial engineers and students from all around the world. These submissions underwent a double-blind peer-review process. Of those 361 submissions, 104 (including 9 poster papers) were selected for inclusion in these proceedings. They cover several hot topics, including ambient intelligence, communication, computing, data science, intelligent systems, Internet of things, machine learning, networking, security and privacy. The Conference, held over two days, hosted paper presentations, poster presentations as well as project demonstrations.

Many thanks go to the keynote speakers for sharing their knowledge and expertise with us and to all the authors who have spent the time and effort to contribute significantly to this Conference. We are also indebted to the organizing committee for their great efforts in ensuring the successful implementation of the Conference. In particular, we would like to thank the technical committee for their constructive and enlightening reviews of the manuscripts.
We are pleased to present these proceedings as the published record of the Conference. Our sincere thanks go to all the sponsors, press, print and electronic media for their excellent coverage of the Conference. We hope to see you in 2019 at our next Future of Information and Communication Conference, with the same enthusiasm, focus and determination.

Kohei Arai
Contents
Hybrid Data Mining to Reduce False Positive and False Negative Prediction in Intrusion Detection System . . . 1
Bala Palanisamy, Biswajit Panja, and Priyanka Meharia

Adblock Usage in Web Advertisement in Poland . . . 13
Artur Strzelecki, Edyta Abramek, and Anna Sołtysik-Piorunkiewicz

Open Algorithms for Identity Federation . . . 24
Thomas Hardjono and Alex Pentland

A Hybrid Anomaly Detection System for Electronic Control Units Featuring Replicator Neural Networks . . . 43
Marc Weber, Felix Pistorius, Eric Sax, Jonas Maas, and Bastian Zimmer

Optimizing Noise Level for Perturbing Geo-location Data . . . 63
Abhinav Palia and Rajat Tandon

Qualitative Analysis for Platform Independent Forensics Process Model (PIFPM) for Smartphones . . . 74
F. Chevonne Thomas Dancer

Privacy Preserving Computation in Home Loans Using the FRESCO Framework . . . 90
Fook Mun Chan, Quanqing Xu, Hao Jian Seah, Sye Loong Keoh, Zhaohui Tang, and Khin Mi Mi Aung

Practically Realisable Anonymisation of Bitcoin Transactions with Improved Efficiency of the Zerocoin Protocol . . . 108
Jestine Paul, Quanqing Xu, Shao Fei, Bharadwaj Veeravalli, and Khin Mi Mi Aung

Walsh Sampling with Incomplete Noisy Signals . . . 131
Yi Janet Lu
Investigating the Effective Use of Machine Learning Algorithms in Network Intruder Detection Systems . . . 145
Intisar S. Al-Mandhari, L. Guan, and E. A. Edirisinghe

Anonymization of System Logs for Preserving Privacy and Reducing Storage . . . 162
Siavash Ghiasvand and Florina M. Ciorba

Detecting Target-Area Link-Flooding DDoS Attacks Using Traffic Analysis and Supervised Learning . . . 180
Mostafa Rezazad, Matthias R. Brust, Mohammad Akbari, Pascal Bouvry, and Ngai-Man Cheung

Intrusion Detection System Based on a Deterministic Finite Automaton for Smart Grid Systems . . . 203
Nadia Boumkheld and Mohammed El Koutbi

The Detection of Fraud Activities on the Stock Market Through Forward Analysis Methodology of Financial Discussion Boards . . . 212
Pei Shyuan Lee, Majdi Owda, and Keeley Crockett

Security Enhancement of Internet of Things Using Service Level Agreements and Lightweight Security . . . 221
Shu-Ching Wang, Ya-Jung Lin, Kuo-Qin Yan, and Ching-Wei Chen

Enhancing the Usability of Android Application Permission Model . . . 236
Zeeshan Haider Malik, Habiba Farzand, and Zahra Shafiq

MUT-APR: MUTation-Based Automated Program Repair Research Tool . . . 256
Fatmah Y. Assiri and James M. Bieman

GPU_MF_SGD: A Novel GPU-Based Stochastic Gradient Descent Method for Matrix Factorization . . . 271
Mohamed A. Nassar, Layla A. A. El-Sayed, and Yousry Taha

Effective Local Reconstruction Codes Based on Regeneration for Large-Scale Storage Systems . . . 288
Quanqing Xu, Hong Wai Ng, Weiya Xi, and Chao Jin

Blockchain-Based Distributed Compliance in Multinational Corporations' Cross-Border Intercompany Transactions . . . 304
Wenbin Zhang, Yuan Yuan, Yanyan Hu, Karthik Nandakumar, Anuj Chopra, Sam Sim, and Angelo De Caro

HIVE-EC: Erasure Code Functionality in HIVE Through Archiving . . . 321
Aatish Chiniah and Mungur Utam Avinash Einstein
Self and Regulated Governance Simulation . . . 331
Hock Chuan Lim

Emergency Departments . . . 341
Salman Alharethi, Abdullah Gani, and Mohd Khalit Othman

Benchmarking the Object Storage Services for Amazon and Azure . . . 359
Wedad Ahmed, Hassan Hajjdiab, and Farid Ibrahim

An Improvement of the Standard Hough Transform Method Based on Geometric Shapes . . . 369
Abdoulaye Sere, Frédéric T. Ouedraogo, and Boureima Zerbo

Recognition of Fingerprint Biometric System Access Control for Car Memory Settings Through Artificial Neural Networks . . . 385
Abdul Rafay, Yumnah Hasan, and Adnan Iqbal

Haar Cascade Classifier and Lucas–Kanade Optical Flow Based Realtime Object Tracker with Custom Masking Technique . . . 398
Karishma Mohiuddin, Mirza Mohtashim Alam, Amit Kishor Das, Md. Tahsir Ahmed Munna, Shaikh Muhammad Allayear, and Md. Haider Ali

Modified Adaptive Neuro-Fuzzy Inference System Trained by Scoutless Artificial Bee Colony . . . 411
Mohd Najib Mohd Salleh, Norlida Hassan, Kashif Hussain, Noreen Talpur, and Shi Cheng

Conditional Image Synthesis Using Stacked Auxiliary Classifier Generative Adversarial Networks . . . 423
Zhongwei Yao, Hao Dong, Fangde Liu, and Yike Guo

Intelligent Time Series Forecasting Through Neighbourhood Search Heuristics . . . 434
Murphy Choy and Ma Nang Laik

SmartHealth Simulation Representing a Hybrid Architecture Over Cloud Integrated with IoT . . . 445
Sarah Shafqat, Almas Abbasi, Tehmina Amjad, and Hafiz Farooq Ahmad

SDWM: Software Defined Wi-Fi Mobility for Smart IoT Carriers . . . 461
Walaa F. Elsadek and Mikhail N. Mikhail

Smart Eco-Friendly Traffic Light for Mauritius . . . 475
Avinash Mungur, Abdel Sa'd Bin Anwar Bheekarree, and Muhammad Bilaal Abdel Hassan

Logistics Exceptions Monitoring for Anti-counterfeiting in RFID-Enabled Supply Chains . . . 488
Xiaoming Yao, Xiaoyi Zhou, and Jixin Ma
Precision Dairy Edge, Albeit Analytics Driven: A Framework to Incorporate Prognostics and Auto Correction Capabilities for Dairy IoT Sensors . . . 506
Santosh Kedari, Jaya Shankar Vuppalapati, Anitha Ilapakurti, Chandrasekar Vuppalapati, Sharat Kedari, and Rajasekar Vuppalapati

An IoT System for Smart Building . . . 522
Khoumeri El-Hadi, Cheggou Rabea, Farhah Kamila, and Rezzouk Hanane

Aiding Autobiographical Memory by Using Wearable Devices . . . 534
Jingyi Wang and Jiro Tanaka

Intelligent Communication Between IoT Devices on Edges in Retail Sector . . . 546
M. Saravanan and N. C. Srinidhi Srivatsan

A Survey on SDN Based Security in Internet of Things . . . 563
Renuga Kanagavelu and Khin Mi Mi Aung

Cerebral Blood Flow Monitoring Using IoT Enabled Cloud Computing for mHealth Applications . . . 578
Beulah Preethi Vallur, Krishna Murthy Kattiyan Ramamoorthy, Shahnam Mirzaei, and Shahram Mirzai

When Siri Knows How You Feel: Study of Machine Learning in Automatic Sentiment Recognition from Human Speech . . . 591
L. Zhang and E. Y. K. Ng

A Generic Multi-modal Dynamic Gesture Recognition System Using Machine Learning . . . 603
G. Gautham Krishna, Karthik Subramanian Nathan, B. Yogesh Kumar, Ankith A. Prabhu, Ajay Kannan, and Vineeth Vijayaraghavan

A Prediction Survival Model Based on Support Vector Machine and Extreme Learning Machine for Colorectal Cancer . . . 616
Preeti, Rajni Bala, and Ram Pal Singh

Sentiment Classification of Customer's Reviews About Automobiles in Roman Urdu . . . 630
Moin Khan and Kamran Malik

Hand Gesture Authentication Using Depth Camera . . . 641
Jinghao Zhao and Jiro Tanaka

FSL-BM: Fuzzy Supervised Learning with Binary Meta-Feature for Classification . . . 655
Kamran Kowsari, Nima Bari, Roman Vichr, and Farhad A. Goodarzi

A Novel Method for Stress Measuring Using EEG Signals . . . 671
Vinayak Bairagi and Sanket Kulkarni
Fog-Based CDN Architecture Using ICN Approach for Efficient Large-Scale Content Distribution . . . 685
Fatimah Alghamdi, Ahmed Barnawi, and Saoucene Mahfoudh

Connectivity Patterns for Supporting BPM in Healthcare . . . 697
Amos Harris and Craig Kuziemsky

A Systematic Review of Adaptive and Responsive Design Approaches for World Wide Web . . . 704
Nazish Yousaf, Wasi Haider Butt, Farooque Azam, and Muhammad Waseem Anwar

Quantum Adiabatic Evolution with a Special Kind of Interpolating Paths . . . 718
Jie Sun, Songfeng Lu, and Chao Gao

Extraction, Segmentation and Recognition of Vehicle's License Plate Numbers . . . 724
Douglas Chai and Yangfan Zuo

A Comparison of Canny Edge Detection Implementations with Hadoop . . . 733
Josiah Smalley and Suely Oliveira

Analysis and Prediction About the Relationship of Foreign Exchange Market Sentiment and Exchange Rate Trend . . . 744
Wanyu Du and Mengji Zhang

Effectiveness of NEdT and Band 10 (8.3 µm) of ASTER/TIR on SSST Estimation . . . 750
Kohei Arai

Raspberry Pi Based Smart Control of Home Appliance System . . . 759
Cheggou Rabea, Khoumeri El-Hadi, Farhah Kamila, and Rezzouk Hanane

Author Index . . . 771
Hybrid Data Mining to Reduce False Positive and False Negative Prediction in Intrusion Detection System

Bala Palanisamy, Biswajit Panja, and Priyanka Meharia

Department of Computer Science, Eastern Michigan University, Ypsilanti, MI 48197, USA
{bpalanis,bpanja,pmeharia}@emich.edu
Abstract. This paper proposes an approach based on data mining and machine learning methods for reducing the false positive and false negative predictions produced by existing Intrusion Detection Systems (IDS). It describes our proposal for building a strong, intelligent intrusion detection system that can protect data and networks from potential attacks, with any detected activity or violation reported to an administrator or collected centrally. We discuss different data mining methodologies and recommend approaches that can be combined to enhance the security of the system. The approach reduces the overhead on administrators, who can be less concerned about the alerts because they have already been classified and filtered, with fewer false positive and false negative alerts. We use the KDD-99 IDS dataset for a detailed analysis of the procedures and algorithms that can be implemented.

Keywords: Intrusion Detection Systems · Intrusion detection · Data mining · Anomaly detection · SVM · KNN · ANN
1 Introduction

With rapid developments and innovations in computer technology and networks, the number of people using technology to commit cyber-attacks is also increasing. In order to prevent this, we must take preventive measures to stop these crimes and stay secure. A strong computer system can prevent potential attacks by having a good Intrusion Detection System (IDS) in place. Intrusion Detection Systems are used to preserve data availability over the network by detecting patterns of known attacks which are defined by experts. These patterns are usually defined by a set of rules which are validated with a set of commonly occurring events and probable intrusion sequences. There are many Intrusion Detection Systems available in the market, based on which environment and system they are used for. IDSs can be used in small home networks or in huge organizations which have large systems in multiple locations across the globe. Some of the well-known IDSs are Snort, NetSim, AIDE, Hybrid IDS, Samhain, etc. The Internet has become an indispensable tool for exchanging information among users and organizations, and security is an essential aspect in this type of
communication. IDSs are often used to sniff network packets to provide a better understanding of what is happening in a particular network. Two mainstream options for IDSs are (1) host-based IDSs and (2) network-based IDSs. Correspondingly, the detection methods used in IDS are anomaly based and misuse based (also called signature or knowledge based), each having its own advantages and restrictions. In misuse-based detection, data gathered from the system is compared to a set of rules or patterns, also known as signatures, that describe network attacks. The core difference between the two techniques is that an anomaly-based IDS uses collections of data containing examples of normal behavior and builds a model of familiarity, so any action that deviates from the model is considered suspicious and is classified as an intrusion; in misuse-based detection, attacks are represented by signatures or patterns. However, the misuse-based approach does not contribute much in terms of zero-day attack detection. The main issue is how to build permanent signatures that cover all the possible variations and non-intrusive activities in order to lower the false-negative and false-positive alarms.

An Intrusion Detection System monitors a computer system or network for any malicious activity, misuse, or policy violation. Any detected violation is typically reported to an administrator or collected centrally. The IDS consolidates events from different sources and uses alerting methods to distinguish malicious activity from false alarms. One way of building a highly secure system is to use a good Intrusion Detection System that recognizes illegal use of the system. These systems work by monitoring strange or suspicious actions that are likely indicators of malicious activity. Although an extensive variety of approaches is used to secure data in today's networked environment, these systems regularly fail. Early recognition of such events is the key to recovering lost or affected data without major availability issues. As of now, IDSs produce numerous false and redundant alerts. Our approach is to reduce the number of false positive and false negative alerts from the Intrusion Detection System by improving the system's efficiency and precision. In many scenarios a global rule set will not be applicable to some of the common usage patterns of a particular environment. Separating authorized and unauthorized actions is a difficult problem: signatures pointing to an intrusion may also correspond to authorized system use, bringing about false alerts. This is especially troublesome when implementing sector-specific Intrusion Detection Systems.

Here we propose the design of an intelligent Intrusion Detection System which uses data mining techniques to detect and identify the possible threats among the threats suggested by a currently implemented Intrusion Detection System, reducing the false positive and false negative alerts. We apply data mining techniques such as Decision Trees, the Naïve Bayes probability classifier, Artificial Neural Networks (ANN), and the K-Nearest Neighbors algorithm (KNN). Data mining is helpful in distinguishing intrusive behavior from typical normal behavior, and it allows normal actions to be differentiated from illegal or unusual acts as new data continuously arrives. The administrators can be less concerned about the alerts as they have already been classified and filtered.
Our approach also allows rapid addition and acquisition of knowledge, which is helpful for the maintenance and operational administration of the Intrusion Detection System.
Today’s developments in technology have created huge breakthroughs in internet networks which have also increased computer related crime. Thus there is a huge demand for developments in cyber security to preserve our data, detect any security threats and make corrections to enhance the security over the sharing networks. We must make our environments secure with use of secured firewall and antivirus software, and a highly sophisticated Intrusion Detection System. Some of the common types of attacks in a secured network are Probe or Scan, Remote to Local, User to Root, and Denial of Service attacks. Organizations make use of IDSs to preserve data availability from these attacks. The current Intrusion Detection Systems send a security alert whenever any sequence of events occurs that is similar to the ruleset defined for the common generic environment. Large organizations which use these systems quite often receive numerous alerts which are false alarms, because it is of a generic type, which may be the organization’s common and typical work sequence. This means the administrators must validate a large number of sequences. For example: An online marketing company is vulnerable of internet attacks and is also entirely dependent on the live values of the items. Due to this reason they face problem of security alerts raised very often saying the possibility of an attack on network. But the heavy user traffic is a typical part of their operation during starting and end of the day scenario. Having a generic Intrusion Detection System in place causes an overhead for the network administrators due to frequent false positive and false negative alarms and also they do not have an ability to identify new unknown attacks. So there is a need for an Intelligent Intrusion Detection System which can overcome this problem with less false positive and false negative cases. In this proposed approach we have applied data mining techniques to common generic predictions made by the Intrusion Detection system. This allows us to further analyze the current system and try to classify and cluster our predictions and improve them on the system events. Using such a system will create an intelligent IDS, which can provide more stable predictions and the flexibility of having customizable systems based on specific business sectors of an organizations, allowing it to better identify a possible attack on the system. In order to build a data mining machine learning system, we must have the data evidently which has clearly labeled or collected data from the Intrusion Detection System. IDSs usually have a huge amount of data available, regarding type of attack, current network traffic, source port, destination port, type of service, region, traffic sequence number, acknowledge number, message length, activity time, etc. This data must be collected on normal working scenarios over a period of time and can then be used to identify events and determine whether they are regular traffic or an actual intrusion and attack on the system. Data mining operations are done by two approaches in the proposed approach: supervised and unsupervised learning. Supervised learning, also known as classification, is a method that uses training data obtained from observations and measurements classify events. This requires clearly labeled data indicating the operation type and class of severity of the action which triggered event. 
Unsupervised learning, also known as clustering, is a method used when the training data is unlabeled and consists only of observations and measurements. It identifies clusters in data and uses a
technique called clustering to find patterns in high-dimensional unlabeled data. It is an unsupervised pattern discovery method in which the data is grouped together according to a similarity measure. For analyzing our proposed approach, we have used the KDD 99 IDS dataset. The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs, with the objective of surveying and evaluating research in intrusion detection. It provided a standard set of data to be audited, which included a wide variety of intrusions simulated in a military network environment. The 1999 KDD intrusion detection contest uses a version of this dataset. Attacks in the dataset fall into four main categories:

• DOS: denial-of-service, e.g. SYN flood;
• R2L: unauthorized access from a remote machine, e.g. password guessing;
• U2R: unauthorized access to local super user (root) privileges, e.g. various "buffer overflow" attacks;
• Probing: surveillance and other probing, e.g. port scanning.
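To make the category scheme concrete, the short Python sketch below groups the raw attack labels of the KDD-99 training data into these four categories plus the normal class. The label list follows the publicly documented KDD-99 training set; the mapping itself is an illustration added here, not part of the original study.

```python
# Illustrative grouping of KDD-99 training labels into the four attack
# categories plus the normal class.
ATTACK_CATEGORY = {
    "normal": "Normal",
    # DoS
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    # Probe
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # R2L
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # U2R
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}

def to_category(label: str) -> str:
    """Map a raw KDD-99 label such as 'smurf.' to one of the five classes."""
    return ATTACK_CATEGORY.get(label.rstrip("."), "Unknown")
```

Labels that are not in the table fall back to "Unknown", which makes unexpected labels in the corrected test set easy to spot.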
2 Related Work

Xu et al. [1] discussed privacy-preserving data mining (PPDM), where the basic idea is to perform data mining efficiently without compromising the data. Protected data such as ID card numbers and cellphone numbers should not be used for mining, and disclosing such data in the results would violate confidentiality. The data provider is the source of the data used in the data mining task and must handle the sensitive data, which it may sometimes hide; the data collector has a major role in protecting sensitive data. The authors suggest using association rule mining or decision tree mining to predict customer buying habits by finding associations between the different items customers buy, or to make random substitutions using decision trees.

Yu et al. [2] propose an approach for crime prevention that predicts the hotspots and times where crime can happen by applying data mining classification techniques to existing crime data. Classification is performed on the data set to identify hotspots and "heating up" locations in a grid, which are classified as reflecting either sociological norms or social signals. The predictions are based on the probability of crime incidents that happened in the previous month or pattern, and the variation between hotspots and cold spots is evaluated from the number of true positives and false positives and negatives.

Hajian et al. [3] discussed how to apply data mining techniques effectively to the datasets commonly used in intrusion detection system (IDS) applications while avoiding the attributes that cause discrimination (gender, age, race, etc.). They combine anti-discrimination with cyber security and propose ideas for preventing discriminatory judgments based on these data features. Discrimination discovery is the identification of the discriminating factors that influence decisions and the degree of discrimination contributed by each factor; the probability of occurrence of a decision is derived as a support and confidence factor for each discriminating attribute in the dataset.
Xu et al. [4] concentrate on the identification of malware and security violations in mobile applications using data mining. They propose a solution based on a cloud computing platform with data mining that analyzes Android apps using ASEF, an automated tool which works as a virtual machine and can install applications on it, and the SAAF analysis method, which provides information on applications. They describe ways to analyze the static and dynamic behavior patterns of applications and then apply machine learning to analyze and classify the Android software by monitoring what happens at the kernel level. They used PART, Prism and nNb classifiers to identify whether the software is original or malicious.
3 Data Normalization and Preprocessing

Data from real-time systems is often largely unformatted, and in the case of network data it usually consists of multiple formats and ranges. It is therefore necessary to convert the data into a uniform range, which helps the classifier make better predictions. Data cleaning is required to filter accurate information from invalid or unimportant data and to distinguish important data fields from insignificant ones [5]. In cases where organizations use multiple systems, we must remove redundant data from the data set consolidated from the different systems. Missing value estimation must also be performed for fields which are not available. In many cases, when we collect data from the Intrusion Detection System, certain data fields will be missing depending on the particular sequence of events or scenario. Missing value estimation can be used to estimate probable values for those missing fields; however, columns with too many missing values are unlikely to carry much useful information. Consequently, columns whose number of missing values exceeds a given threshold can be removed; the stricter this threshold is, the more data is removed.

The KDD Cup99 dataset is available in three different files: the KDD Full Dataset, which contains 4,898,431 instances; the KDD Cup 10% dataset, which contains 494,021 instances; and the KDD Corrected dataset, which contains 311,029 instances. Each sample of the dataset represents a connection between two network hosts according to network protocols. The connection is described by 41 attributes, 38 of which are continuous or discrete numerical quantities and 3 of which are categorical. Each sample is labeled as either a normal sequence or a specific attack. The dataset contains 23 class labels, 1 for normally occurring events and the remainder for different attacks. The 22 attack labels fall into the four attack categories: DOS, R2L, U2R, and Probing.

Preprocessing this data consists of two steps: removing duplicate records and normalizing the data. In the first step, we removed 3,823,439 duplicate records from the 4,898,431 instances of the KDD Full Dataset, leaving 1,074,992 distinct records. Similarly, 348,435 duplicate records were removed from the KDD Cup 10% dataset's 494,021 instances, obtaining 145,586 distinct records, a reduction of about 70%. Data normalization was performed by substituting numerical values for the columns pertaining to protocol type, service, flag, and land. We replaced the different types of attack classes with the 4 main classes: DOS, R2L, U2R and Probing. We then performed min–max normalization based on the values in the respective columns, as the data was discontinuous with different ranges for each column. The attributes were then scaled to a range of 0–1.
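The preprocessing just described (duplicate removal, substitution of the symbolic columns, mapping of the attack labels to the four categories, and min–max scaling) could be reproduced roughly as follows with pandas and scikit-learn. The file names and the parsing of kddcup.names for the 41 column names are assumptions about a local copy of the dataset, not details given in the paper.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# kddcup.names ships with the dataset; its first line lists the classes and
# every following line looks like "duration: continuous.".
with open("kddcup.names") as f:
    lines = f.read().splitlines()[1:]
col_names = [line.split(":")[0] for line in lines] + ["label"]

df = pd.read_csv("kddcup.data_10_percent", header=None, names=col_names)

# Step 1: remove duplicate records (494,021 -> 145,586 for the 10% file).
df = df.drop_duplicates()

# Step 2: collapse the 22 specific attack labels into the four main classes,
# reusing the to_category() helper sketched in the introduction.
df["label"] = df["label"].map(to_category)

# Step 3: substitute numeric codes for the symbolic columns.
for col in ["protocol_type", "service", "flag"]:
    df[col] = df[col].astype("category").cat.codes

# Step 4: min-max normalise every feature column to the [0, 1] range.
feature_cols = df.columns.drop("label")
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
```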
4 Feature Selection

The KDD dataset [9, 10] contains 41 features, which have to be studied and selected in order to obtain better performance by reducing the false positive and false negative errors in the prediction and improving the accuracy. We can identify the most important features and the least important ones to improve our accuracy and prediction, as sketched below. We first noted that several columns, including Num_outbound_cmds and Is_hot_login, were zero valued for all the records, so these columns could be omitted when performing the data analysis. We also studied the entropy of each column to find the features for which the class is most relevant, and selected the important features which were good indicators for identifying the class of the attacks.
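Both checks can be automated. The sketch below drops constant (all-zero) columns and then ranks the remaining attributes by mutual information with the class label, which is used here as a stand-in for the entropy study described in the text; keeping the top 20 attributes is an arbitrary illustrative cut-off.

```python
from sklearn.feature_selection import mutual_info_classif

X = df.drop(columns=["label"])
y = df["label"]

# Columns that take a single value on every record carry no information.
constant_cols = [c for c in X.columns if X[c].nunique() <= 1]
X = X.drop(columns=constant_cols)

# Rank attributes by how much information they share with the class label.
mi = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(X.columns, mi), key=lambda item: item[1], reverse=True)
selected = [name for name, score in ranking[:20]]
```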
5 Dimension Reduction of Data

Having a large dataset makes predictability an issue, as many of the data fields have different scales of variability. Hence, in order to make the parameter rescaling easier, we make use of dimensional reduction. Figure 1 shows the architecture of the network intrusion detection system.
Fig. 1. Network intrusion detection system data mining process diagram.
5.1 Principal Component Analysis
Principal component analysis is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of n coordinates. As a result of the transformation [6], the first component has the largest possible variance, and each succeeding component has the highest possible variance under the
constraint that it is orthogonal to the preceding components. Keeping only the first m < n components reduces the data dimensionality while retaining most of the variation in the data.
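In scikit-learn terms the transformation might look like the following; retaining enough components to explain 95% of the variance is an illustrative threshold, not a figure reported in the paper.

```python
from sklearn.decomposition import PCA

# Keep the smallest m < n orthogonal components that still explain 95%
# of the variance of the normalised, selected features.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X[selected])

print(pca.n_components_, "components retained")
```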
5.2 Random Forests/Ensemble Trees
Random forests/ensemble trees are another possible approach to dimensionality reduction. A large number of decision trees is built against a target attribute, and the usage statistics at each node are used to locate the most informative subset of features. In particular, we can generate a large set of decision trees, with every tree being trained on a small portion of the total number of attributes. If an attribute is frequently chosen as the best split, it is most likely an informative feature to retain. A score computed from the attribute usage statistics in the random trees shows, relative to the other attributes, which are the most predictive ones.

Once the preparation of the data is finished we can apply supervised classification. This paper proposes to use well-known data mining classification algorithms such as Artificial Neural Networks (ANN) [7–9], Decision Tree learning (DT), the Naïve Bayes Classifier (NBC), Support Vector Machines (SVM), and the Nearest Neighbor algorithm (KNN). There are significant reasons for combining multiple methods in our analysis: above all, a prediction supported by several methods gives better performance and accuracy and reduces the false positive and false negative cases. Once we have the prepared data, we separate it into a training dataset and a testing dataset. Training datasets are used to learn the patterns of the system in order to classify or cluster the data, and testing datasets are used to test the accuracy of the trained system. The procedure is as follows:

Step 1: Convert the symbolic attributes protocol, service, and flag to numerical ones.
Step 2: Normalize the data to [0, 1].
Step 3: Separate the instances of the 10% KDD training dataset into five categories: Normal, DoS, Probe, R2L, and U2R.
Step 4: Apply modified K-means on each category and create new training datasets.
Step 5: Train SVM and the other classifiers with these new training datasets.
Step 6: Test the model with the corrected KDD dataset.

The outcomes of a classifier are described by the following quantities:

TN (True Negatives): the number of normal events successfully labeled as normal.
FP (False Positives): the number of normal events being predicted as attacks.
FN (False Negatives): the number of attack events incorrectly predicted as normal.
TP (True Positives): the number of attack events correctly predicted as attacks.
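Since the goal is explicitly to drive the last two error counts down, it is convenient to compute them directly. The helper below collapses the multi-class predictions into a binary normal-versus-attack view, which is a simplification introduced here for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def fp_fn_report(y_true, y_pred):
    """False positive rate, false negative rate and accuracy for a
    normal-vs-attack view of the predicted categories."""
    true_attack = np.asarray(y_true) != "Normal"
    pred_attack = np.asarray(y_pred) != "Normal"
    tn, fp, fn, tp = confusion_matrix(true_attack, pred_attack,
                                      labels=[False, True]).ravel()
    return {
        "false_positive_rate": fp / (fp + tn),  # normal events flagged as attacks
        "false_negative_rate": fn / (fn + tp),  # attacks predicted as normal
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```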
6 Classification Process

6.1 Naïve Bayesian Classifier
The Naïve Bayesian classifier uses Bayes' theorem to compute the posterior probability of one event given another. It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. Given a case, a Bayesian classifier can predict the probability that the tuple belongs to a specific class. A Bayesian network is a probability-based graphical model that represents the variables and the relationships between them: the network is built with nodes representing the discrete or continuous random variables and directed links as the connections between them, forming a directed acyclic graph. Bayesian networks are built using expert knowledge or using efficient algorithms that perform inference. When we use Naive Bayes, we must assume the dataset fields and features are independent for each event; the method is attractive because it is based purely on probability and can be represented as a Bayesian network by combining graph theory and probability. Bayesian networks can be used for learning the inference between real-time intrusion scenarios and common regular scenarios.
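A minimal Naive Bayes baseline over the selected, normalised features might look as follows; the Gaussian variant and the 70/30 split are assumptions, since the paper does not state which variant or split it used.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(
    X[selected], y, test_size=0.3, stratify=y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes:", fp_fn_report(y_test, nb.predict(X_test)))
```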
6.2 Support Vector Machine
The Support Vector Machine (SVM) is a well-known classification method. SVM uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new dimension, SVM searches for an optimal separating hyperplane by using support vectors and margins. The SVM is a classifier based on finding a separating hyperplane in the feature space between two classes in such a way that the distance between the hyperplane and the nearest data points of each class is maximized. The approach relies on minimizing the classification risk rather than on optimal classification. SVMs are known for their generalization ability and are especially valuable when the number of features m is high and the number of data points n is low (m greater than n). When the two classes are not separable, slack variables are added and a cost is assigned to the overlapping data points. Although SVM is among the faster algorithms, it is not stable when the number of attributes is very high. We perform SVM training by selecting the features so as to maximize the prediction quality, picking the features that are shared between the attack cases and the normal cases. Here, after multiple levels of normalization of the KDD dataset, we classified the complete dataset and were able to obtain 96% accuracy in predicting whether a record is a normal or an intrusion scenario.
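A corresponding SVM stage is sketched below. The RBF kernel and default regularisation are assumptions (the paper does not report kernel or parameter choices), and training on the full deduplicated set can be slow, so subsampling may be needed in practice.

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print("SVM:", fp_fn_report(y_test, svm.predict(X_test)))
```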
6.3 Artificial Neural Networks
Artificial Neural Networks (ANN) are a well-known methodology inspired by how the neurons in our brain work. An ANN consists of interconnected artificial neurons capable of performing specific computations on their inputs. The input data activates the neurons in the first layer of the network, whose output is passed to the next layer of neurons; every layer passes its output onwards, and the final layer yields the result. Layers between the input and output nodes are referred to as hidden layers. When an Artificial Neural Network is used as a classifier, the output nodes produce the final classification. ANNs can create nonlinear models, and the back-propagation feature of the ANN makes it possible to learn functions such as exclusive-OR logic. With recurrent, feed-forward, and convolutional variants, ANNs are gaining in popularity again and have won many recent competitions; since the advanced versions of ANNs require much additional processing power, they are ordinarily implemented on graphics processing units. When an ANN was applied to the 10% KDD dataset, we were able to obtain an accuracy of 94% when trained with the complete dataset.
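A comparable feed-forward network can be built with scikit-learn's MLPClassifier; the single hidden layer of 64 units and the iteration limit are illustrative choices, as the paper does not describe its network topology.

```python
from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                    max_iter=300, random_state=0)
ann.fit(X_train, y_train)
print("ANN:", fp_fn_report(y_test, ann.predict(X_test)))
```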
6.4 K-Nearest Neighbours
KNN is a non-parametric algorithm which can be used for classification and regression. It assigns a sample to the class of its closest neighbors, the output being a particular class label. KNN is instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm [11, 12] is a relatively simple machine learning algorithm. In the training phase it stores only the feature vectors and their class labels, and at prediction time the most frequent class among the neighbors determines the decision. The algorithm can be varied by selecting different values of k, the number of neighbors considered for a particular sequence of data. When KNN was applied to the 10% KDD dataset, we were able to obtain an accuracy of 70% when trained with the complete dataset with parameter k = 5.
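The corresponding k-NN run, with k = 5 as stated above, could be written as:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5, as in the experiment above
knn.fit(X_train, y_train)
print("KNN:", fp_fn_report(y_test, knn.predict(X_test)))
```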
6.5 Decision Trees
A decision tree is a simple algorithm which resembles a tree, branching on the connections between attributes and classes. A sample is classified by testing its attribute values against the branches of the tree. When constructing the decision tree, at every node we must pick the attribute that most effectively splits its set of cases into sub-nodes; the attribute with the highest normalized information gain is placed at the root, and the leaf nodes carry the class decisions. Decision trees are a common representation that offers high classification precision and a straightforward implementation. Their main drawback is that for data including categorical variables with different numbers of levels, information gain values are biased towards attributes with a high number of branches. The decision tree is built by improving the information gain at every branch, resulting in a ranking of the data.

Once the data is classified, we must compare the results. Our proposal is to produce comparison graphs of the accuracy, the classification limits, the time taken for a model with known data and the time taken for a model with an unknown sequence of data. Several models have been proposed to design multi-level IDSs. Some models classified DoS and Probe attacks at the first level, the Normal category at the second level, and both R2L and U2R categories at the third or last level. By contrast, other models classified Normal, DoS, and Probe at the first level and R2L and U2R at the second level. Figure 2 shows the scatter plot and Fig. 3 the parallel coordinate plot of the data; Table 1 shows that SVM is more accurate than KNN and ANN.
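For completeness, an information-gain-based decision tree in the same framework; the depth limit is an illustrative guard against overfitting rather than a value taken from the paper.

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" corresponds to splitting on information gain.
dt = DecisionTreeClassifier(criterion="entropy", max_depth=12, random_state=0)
dt.fit(X_train, y_train)
print("Decision tree:", fp_fn_report(y_test, dt.predict(X_test)))
```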
Fig. 2. Scatter plot of results of KDD 10% Dataset where normal sets are in blue color and intrusion sets are in red color.
Fig. 3. Parallel coordinate plot of data value distribution of 39 different features.
Table 1. Accuracy of KNN, ANN, SVM

KDD 10% Dataset results with each algorithm | KNN | ANN | SVM
Accuracy                                    | 70% | 94% | 96%
7 Conclusion

In this paper we have discussed data mining techniques which can be applied to build a strong and intelligent Intrusion Detection System. We have mainly targeted data mining procedures that reduce false positive and false negative alerts, which lowers the overhead of a network administrator in an organization. Collecting the training dataset is also very difficult, and it is a crucial task for building the system. Even though our methodology has not yet been proven against real cybercrimes and attacks, this approach will help us to build custom, sector-based Intrusion Detection Systems rather than a generic system which can be very confusing to handle. The ideas which we have presented are introductory and work in progress, so significant further work is needed to detail procedures for normalizing the data and for filtering out unrelated details of some specific attack scenarios. There are also additional methods for reducing the dimensionality of the data that can be explored. Much also depends on the type of intrusion detection system, as different systems produce reports in a variety of formats, and we still have to examine how each report can be extracted into a data set that can be used for mining. The classification methods Naïve Bayes and Decision Trees produce straightforward results; for K-Nearest Neighbors, however, there is the difficulty of selecting a proper value of k, which may require a large number of test cases to identify, and similar difficulties apply to the support vector and artificial neural network methods.
References

1. Xu, L., Jiang, C., Wang, J., Yuan, J., Ren, Y.: Information security in big data: privacy and data mining. IEEE Access 2, 1149–1176 (2014)
2. Yu, C.H., Ward, M.W., Morabito, M., Ding, W.: Crime forecasting using data mining techniques. In: 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 779–786. IEEE (2011)
3. Hajian, S., Domingo-Ferrer, J., Martinez-Balleste, A.: Discrimination prevention in data mining for intrusion and crime detection. In: 2011 IEEE Symposium on Computational Intelligence in Cyber Security (CICS), pp. 47–54. IEEE (2011)
4. Xu, J., Yu, Y., Chen, Z., Cao, B., Dong, W., Guo, Y., Cao, J.: Mobsafe: cloud computing based forensic analysis for massive mobile applications using data mining. Tsinghua Sci. Technol. 18(4), 418–427 (2013)
5. Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C.: Data mining for credit card fraud: a comparative study. Decis. Support Syst. 50(3), 602–613 (2011)
6. Hu, Y., Panda, B.: A data mining approach for database intrusion detection. In: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 711–716. ACM (2004)
7. Ravisankar, P., Ravi, V., Rao, G.R., Bose, I.: Detection of financial statement fraud and feature selection using data mining techniques. Decis. Support Syst. 50(2), 491–500 (2011)
8. Erskine, J.R., Peterson, G.L., Mullins, B.E., Grimaila, M.R.: Developing cyberspace data understanding: using CRISP-DM for host-based IDS feature mining. In: Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research, p. 74. ACM (2010)
9. Lee, W., Stolfo, S.J., Mok, K.W.: A data mining framework for building intrusion detection models. In: Proceedings of the 1999 IEEE Symposium on Security and Privacy, pp. 120–132. IEEE (1999)
10. Feng, W., Zhang, Q., Hu, G., Huang, J.X.: Mining network data for intrusion detection through combining SVMs with ant colony networks. Future Gener. Comput. Syst. 37, 127–140 (2014)
11. Portnoy, L., Eskin, E., Stolfo, S.: Intrusion detection with unlabeled data using clustering. In: Proceedings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001) (2001)
12. Julisch, K., Dacier, M.: Mining intrusion detection alarms for actionable knowledge. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 366–375. ACM (2002)
Adblock Usage in Web Advertisement in Poland

Artur Strzelecki, Edyta Abramek, and Anna Sołtysik-Piorunkiewicz

Department of Informatics, University of Economics in Katowice, Katowice, Poland
{artur.strzelecki,edyta.abramek,anna.soltysik-piorunkiewicz}@ue.katowice.pl
Abstract. Research concerning users blocking advertisements constitutes a new research area both in the scope of analysis of collected data regarding that topic, determinants concerning users blocking advertisements and IT tools. The paper refers to this and systematizes knowledge in the scope of types of online advertisements and methods for blocking them using an adblock, and it identifies reasons and main categories of reasons for users blocking advertisements. The research presented in the paper was confronted with results of an analysis of application of adblocks. The obtained results will facilitate conducting further, more thorough research. Considerations included in the paper can constitute a set of recommendations for publishers displaying advertisements on websites and they can be useful for drawing conclusions and preparing guidelines for projects supporting sustainable development in the scope of online advertising. Keywords: Adblock
· Ads blocking · Web advertisement
1 Introduction

During the early years of development of the internet, advertisements displayed on websites were not considered as invasive. Usually, they were presented in the form of static or dynamic banners which included graphic designs. As technology developed and IT solutions became more available, the manner in which advertisements were shown on websites also developed. Currently, there are numerous new advertising formats available, most of which are considered as invasive, i.e. they affect the reception of content displayed in a web browser. Such advertisement formats include automatically displayed advertisements, formats of pictures covering the entire screen or expandable graphic advertisements.

Development of modern technologies on the internet, in particular tracking the activity of users of web browsers and displaying advertisements for such users, causes an increase of users' social interest in the development and use of tools enabling prevention of tracking and display of advertisements. This prevents online content from being surrounded with advertisements which may distract the reader and, at the same time, scripts tracking web browser user's activity are turned off or their operation is limited. These tools ensure that consumption of online content remains attractive,
websites are not loaded with advertising items and user's privacy is better protected. The conducted study is aimed at getting acquainted with the activity of web browser users in the scope of use of software which blocks the display of advertisements and limits the tracking of user's activity.

Additional programs which may be installed as an extension of the basic set of functions offered by a web browser, adding the function of blocking unwanted advertisements, are becoming increasingly popular among users. Users use such programs not only to block displayed advertisements but also to improve the protection of data stored in a web browser and to increase the comfort of use and the performance of web browsers.

Thus, the empirical objective of the paper is to identify the reasons for blocking online advertisements. The methodological objective of the paper is to identify IT technologies supporting advertisement blocking and criteria for their selection. The practical objective of the paper is to study adblock usage in Poland based on research. Three research questions were formulated:

• RQ1: What are the reasons for users blocking advertisements?
• RQ2: What are the methods for blocking web advertisements?
• RQ3: What is the state of adblock usage in Poland?

The objectives of the paper were achieved as a result of application of the following research methodology: a study of the literature, comparative analyses of reports and a survey analysis. First, we introduce the theoretical justification of the ads blocking problem and systematize knowledge in the scope of reasons for blocking online advertisements using an adblock. Second, we identify methods and tools for blocking advertisements in web browsers. Adblock programs offer a broad range of options for adjusting them to user's preferences and creating one's own set of functions which are launched by default as well as preferences for blocking advertisements. Many users use the possibility to configure advertisement blocking programs and adjust their functions to their preferences. Third, we analyze a dataset collected through a survey. There were 774 participants in the exploratory research project. The participants responded to 14 questions in a survey concerning the use of software for blocking advertisements (adblock programs). The paper contains a literature review, a presentation of the methodology used for the study and an analysis of the obtained results. The paper includes final conclusions from the conducted study and directions for further research in the area of blocking displayed advertisements and privacy protection.
2 Related Work

Over the course of recent years there were several studies conducted with regard to the use of adblock programs by internet users. They involved, inter alia, an analysis of legal measures which can be undertaken by content publishers who are fighting off the activity of adblock programs. Content publishers lose advertising income which they could have
generated if their advertisements had been displayed or clicked by users [11, 28]. Researchers have also examined the possibility of expanding the functions of extensions for blocking advertisements [13], as well as the influence of ad blocking on battery performance in mobile devices [16] and on the reduction of internet connection load [15]. Other work has studied how content publishers detect the use of adblock programs and how many of them do so [14]. Research also shows that adblock programs can be used to reduce the tracking of personal data [3] and that they do not leave any trace of their use [17]. Some researchers have conducted surveys of students at a single university [19], while others carried out extensive research involving numerous countries and thousands of respondents [21, 22]. A state-of-the-art analysis was conducted as early as 2009; however, much has changed since then with regard to advertisement display technology and the development of user tracking scripts [20]. The advertisements characterized above, despite their great popularity, are no longer as effective as they used to be during the initial period of development of the internet. This is mainly because users have become resistant to various “persistent” forms of promotion of products or services. Reaching potential clients with advertisements is becoming more complicated, and creators of advertisements must take into account limited access through the online marketing channel. Poland is one of the leaders in the global ranking of countries with the highest percentage of internet users blocking advertisements with plugins containing adblock software. A less aggressive solution than the previously used, conventional advertisements, which addresses the needs of advertisers, is native advertising, i.e. interesting and unobtrusive advertising material intertwined with the content of an article posted on a blog, in social media or on a website.
3 Systematization
3.1 Identification of Factors Determining the Adblock Usage
Blocking advertisements is a complex problem in the contemporary world, which requires the following: • Firstly, aspects connected with privacy [3], security [30], user experience (pace of work, comfort, quality and other factors) [15, 19] as well as economic aspects should be integrated. • Secondly, the balance between these aspects and advertisements should be analyzed. In this context it is worth invoking the theory of sustainable development (the term was first defined in the report of the UN World Commission on Environment and Development, Our Common Future, in 1987 [18]), because it is not about publishers and users “getting in each other’s way”, but rather about them starting to cooperate with each other. Their common good should motivate such activities. Numerous discussions among internet users include descriptions of cases [10] in which website publishers tried to prevent users from blocking advertisements, e.g.
• On the basis of legal regulations (online publishers wanted to introduce a ban on blocking advertisements with ad blocking programs by invoking the right to maintain the integrity of their work or, in other words, the right to inviolability of the content and form of their work; the act of blocking advertisements was interpreted as introducing changes to the work) [11, 28]. • By establishing cooperation with the creators of programs for blocking advertisements (the cooperation allowed the publishers to create advertisements which users were unable to filter). To use an adblock, it needs to be installed first, and then a filter subscription must be added. In line with the sustainable development principle, the following rules should thus be adopted: • Publishers, users and advertisers should jointly attempt to guarantee the best solutions by ensuring that none of the parties constitutes a threat to the others. • Publishers should draw on users’ knowledge and experience and propose advertisements, if these constitute their source of income, but they need to ensure that the advertisements do not disturb the user’s work and should allow users to decide what they want to view and in which form. • Appropriate conditions should be ensured so that users can perceive advertisements differently than they do now (as invasive, ubiquitous, badly targeted, uninteresting, harassing and heavy on processing power). The relationships between the factors of the ad blocking problem are nonlinear and take the form of feedback loops. On the one hand, there are website owners who want users to refrain from blocking advertisements. They offer free access to their content but expect users to view advertisements alongside the primary content. As the number of views of the publisher’s website grows, advertising income grows as well (positive feedback). Viewing advertisements can in that case be considered a type of online currency: advertisements allow publishers to generate income. However, some users do not want to view advertisements, and thus they block them. If users block advertisements, publishers are forced to introduce paid access to content (negative feedback). As the number of website visitors drops, the website’s popularity decreases, and publishers receive fewer advertising orders. To undertake activities aimed at compensating for the negative feedback, one should identify the reasons for blocking advertisements. According to the PageFair 2017 report [22], the main reason for blocking advertisements with an adblock was security (Fig. 1). Women usually mentioned that they were afraid of viruses and malware, whereas men claimed that the greatest nuisance was the interference of advertising with continuous browsing of online content. Over 70% of respondents chose more than one reason as “the most important one” for their use of an adblock. Apart from security and interference, users’ motivations did not differ significantly by age. The largest shares of internet users in Europe who block advertisements are in Greece, Poland and Germany (Fig. 2).
Fig. 1. Motivation behind adblock usage [22].
Fig. 2. Usage of ad blocking software in selected European countries [21, 22].
According to PageFair’s reports [21, 22], the index for Poland amounted to 34.9% in 2015 and remained at the level of 33% in 2017. Apparently, the reason for this phenomenon is that websites are overloaded with advertisements. Users who do not like advertisements simply block them in advance; others indicate protection of their privacy and security as the reason for doing so. The general conclusion of the reports was that the growing use of adblock programs is fuelled by particular problems with the way publishers deliver online advertisements rather than by digital advertising itself. Interestingly, the reports indicate that users do not mind advertisements as such; they are bothered rather by their aggressive form, such as a sudden sound, or an advertisement suddenly covering the browsed content that cannot be skipped or closed. The obtained results allow us to evaluate which problems are most frequently faced by users when advertisements are displayed. Users increasingly care about the privacy of their data and the confidentiality of their online activities, and therefore personalised advertisements are perceived as a threat, while aggressive advertisements, or those which expose users to additional costs (consuming data transmission packages), are simply blocked. Therefore, it is necessary to look for new ways to encourage users to stop blocking advertisements.
3.2 Methods for Blocking Online Advertisements
To counteract the growing trend of invasive advertising, software developers have independently started creating additional solutions extending existing basic web browser functions [13]. Their objective is to block advertisements displayed on the currently opened website. Such web browser extensions are configured by default (without any modification required on the part of the user) so that they do not block some formats of advertisements displayed on websites. This concerns advertisements which are referred to as nonintrusive [12]. Nonintrusive advertisements meet certain criteria regarding the location, contrast and size of the advertising unit [29]. Location means that the advertisement cannot disturb the natural reading process; the advertisement must be above, beside or under the primary content. Contrast means that an advertisement must stand out from the remaining content and be recognizable as an advertisement; it should be marked with the word “advertisement” or its equivalent. Size means that the amount of space occupied by an advertisement above the fold cannot exceed 15%, and the surface occupied by advertisements below the fold can amount to 25% at most [5] (a simple check of this size criterion is sketched after the list below). An example of nonintrusive advertisements is advertisements in search engine results. These advertisements appear only after a user enters a query in the search engine and receives results, and such a set of advertisements is usually connected with the query entered by the user. Large companies providing advertisements (e.g. Google) pay for such advertising units not being blocked in their advertising networks. Other large companies which pay for being included on the list of acceptable advertisements are Microsoft, Amazon and the Taboola advertising network. One of these companies stated anonymously that, in return for its advertisements not being blocked, it pays approximately 30% of what it earns on their display [9]. The most common web browser plugins which block advertisements currently include AdBlock, Adblock Plus, uBlock and uBlock Origin. The common term for such web browser plugins is adblock software. • AdBlock [2] is currently developed by a group of programmers and maintained using user donations. AdBlock is available for users of the following web browsers: Chrome, Safari, Opera, Firefox and Edge. Currently, it is used by 40 million users. • Adblock Plus [1] is developed by Eyeo GmbH and is available for the following web browsers: Firefox, Chrome, Opera, Safari, Internet Explorer and their Android and iOS versions. The Adblock Plus extension does not block nonintrusive advertisements by default. The extension is active on over 100 million devices. • uBlock Origin [25] is developed by the creator who was previously involved in the creation of uBlock. uBlock Origin is currently used by 10 million users. The authors of the project do not accept any donations for its development. uBlock Origin blocks advertisements as well as tracking scripts and malicious websites [17]. Its features include engaging less processing power compared to Adblock Plus [26]. • uBlock [27] is currently developed by a group of programmers and its income is derived from user donations. uBlock is available for Chrome, Safari and Firefox users. Its features include not only blocking advertisements, but also malicious scripts.
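As a rough illustration of the size criterion described above, the following Python sketch (our own simplification, not taken from any Acceptable Ads implementation) checks whether an ad layout stays within the 15% above-the-fold and 25% below-the-fold limits [5]; the function name and pixel-area inputs are assumptions made for the example.

```python
# Illustrative sketch (not part of any adblock product): checks the acceptable-ads
# size criterion described above, assuming ad and page areas are known in pixels.

ABOVE_FOLD_LIMIT = 0.15  # max share of the visible area above the fold [5]
BELOW_FOLD_LIMIT = 0.25  # max share of the area below the fold [5]

def meets_size_criterion(ad_area_above: int, visible_area_above: int,
                         ad_area_below: int, page_area_below: int) -> bool:
    """Return True if ad coverage stays within the 15%/25% thresholds."""
    share_above = ad_area_above / visible_area_above if visible_area_above else 0.0
    share_below = ad_area_below / page_area_below if page_area_below else 0.0
    return share_above <= ABOVE_FOLD_LIMIT and share_below <= BELOW_FOLD_LIMIT

# Example: a 120,000 px^2 banner in a 1,000,000 px^2 viewport (12%) and a
# 200,000 px^2 unit in a 1,000,000 px^2 area below the fold (20%) passes.
print(meets_size_criterion(120_000, 1_000_000, 200_000, 1_000_000))  # True
```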
Thus, it was assumed that the term adblock denotes the group of programs for blocking online advertisements (Fig. 3). Additionally, Table 1 presents key data about these programs and the types of blocking they apply, which are the subject of the considerations in the present paper [8, 23–25].
Fig. 3. Adblock tree.
Table 1. Key data about most popular adblocks

                      AdBlock         Adblock Plus      uBlock          uBlock Origin
Founder               Developers      Eyeo GmbH         Developers      Raymond Hill
Supported browsers    Firefox,        Firefox, Opera,   Firefox,        Firefox, Chrome,
                      Opera, Chrome,  Chrome, Safari,   Chrome,         Safari, Edge
                      Safari, Edge    Edge and others   Safari
Financing             Donation        Advertisers       Donation        Without payments
Block type: Ads       (+)             (+)               (+)             (+)
Block type: Malware   (−)             (−)               (+)             (+)
Block type: Privacy   (−)             (−)               (+)             (+)
Users                 40 m            100 m             640 k           10 m

(+) active by default; (−) not active by default
The common feature of all the presented programs is the use of lists of filtered resources and URLs. The lists, containing information about the types and names of advertisements, their points of origin and many other filtered elements, are created and updated by a community centered around the topic of blocking advertisements. A single list can be used by several adblocks; an example is EasyList [7], which provides sets of advertisement filters for the aforementioned programs. Adblocks also enable users to create their own sets of blocked resources and URLs, as well as to make them available to other users. They are made available by creating files and
publishing them on the internet. Examples of such individual sets are AlleBlock, a set of filters for the largest auction and trading platform in Poland [4], and the list published under the Certyficate.IT project, which filters advertisements displayed on the most popular portals in Poland [6].
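To make the filter-list mechanism more concrete, the following Python sketch shows a heavily simplified version of how rules from such a list could be matched against request URLs. It is our own illustration: real EasyList syntax is much richer (exception rules, element hiding, resource-type options), and the example rules and helper names are invented.

```python
# Simplified sketch of filter-list matching. Real adblock filter syntax (EasyList)
# is far richer; this only illustrates the basic idea of blocking requests whose
# URLs match community-maintained rules.
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a basic Adblock-style rule into a regular expression."""
    # '||' anchors the rule to a domain boundary; '^' is a separator placeholder;
    # '*' is a wildcard. Everything else is matched literally.
    pattern = re.escape(rule)
    pattern = pattern.replace(re.escape("||"), r"^https?://([a-z0-9-]+\.)*")
    pattern = pattern.replace(re.escape("^"), r"[/?:]")
    pattern = pattern.replace(re.escape("*"), r".*")
    return re.compile(pattern)

# A tiny, made-up subscription standing in for a community-maintained list.
FILTER_LIST = [rule_to_regex(r) for r in
               ["||ads.example.com^", "/banners/*", "||tracker.example.net^"]]

def is_blocked(url: str) -> bool:
    return any(rx.search(url) for rx in FILTER_LIST)

print(is_blocked("https://ads.example.com/728x90.js"))   # True  -> request dropped
print(is_blocked("https://news.example.org/article/17"))  # False -> request allowed
```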
4 Dataset and Results
The authors conducted a study in Poland concerning the reasons for using adblock programs. The introduction to the study was based on interviews, whose objective was to assess selected aspects of the use of advertisement blocking programs. The increasing interest in advertisement blocking and in limiting the exposure of users’ personal data makes it an excellent research subject. During the first stage of the study the authors developed empirical assumptions for the survey and analyzed the current state of adblock programs and the possibilities for their configuration and effective use. During the second stage of the study the authors focused on examining the reasons for using adblock programs and the ways in which they are used in Poland. In order to become acquainted with the activity of Polish users of programs for blocking advertisements in web browsers, a questionnaire was developed. The questionnaire served as the basis for a survey in which 774 respondents participated. An invitation to fill in the survey was posted on several profiles and groups in social media and was also sent by e-mail. Having clicked the hyperlink in the invitation, participants were transferred to a website with the questionnaire. The survey started with the following question: Are you using an adblock? For users who responded negatively to this qualifying question, this was the end of their involvement in the study. The remaining users were asked to indicate the way in which they use the adblock, the reasons for using it, and their level of acceptance of advertisements being displayed on the internet. The users also provided their age, sex and level of education. The subsequent part of the study involved the responses given by 596 respondents. As many as 56.1% of respondents in the group were men, and the majority of respondents were aged 18–24. Respondents aged 25–34 placed second, comprising 32.8% of respondents, and 9% of respondents were aged 35–44. As many as 55.3% of respondents had higher education, while 44.5% indicated high school education. The study group included 77% of users who indicated that they are using an adblock program. Everyone in that group has such a program on their personal computer and, moreover, one fourth of adblock users have installed it on their mobile device, such as a smartphone or a tablet, as well. A substantial majority of adblock users keep its default settings after installation, whereas 32.9% of users configure the program on their own by adding filtering lists and turning on additional functions. As many as 87.1% of adblock users actively turn it on and off. The reason for users turning off the adblock program is mostly a need for temporary access to content which is unavailable in the browser due to the adblock’s operation. Having gained access to the relevant content, the users turn the adblock on again. The second reason for turning adblock off is
turning it off permanently for a particular website or adding the website to a list of exceptions. As many as 86.7% said that they have come across an adblock wall while using a web browser. After seeing the adblock wall, 44.5% of them leave the website, while 55.5% decide to turn their adblock off or add the website to the exceptions and use the content provided by the website. Users in the analyzed group came to know about the existence and purpose of adblock programs in numerous ways. As many as 31.5% of them heard about them from a friend or a relative, 22.7% read about the possibilities offered by adblock programs on the internet, e.g. in a news item, and 8.7% found an adblock while browsing popular browser extensions. One third of respondents did not remember how they learned about the adblock. Study participants most commonly (50.3%) indicated that advertisements disturbing the reception of online content were the reason for using an adblock. The other important reason for using an adblock (37.8%) was too many advertisements being displayed on a single website. Other indicated reasons included websites loading slowly, protection against viruses and malware, and privacy protection against tracking of users’ online activity. The question asked respondents to indicate one, main reason for using an adblock, but some of them used the comments section of the question to explain that other listed reasons also contributed to them using an adblock.
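The reported proportions can be cross-checked with a few lines of arithmetic. The sketch below is our own calculation and assumes that the 77% adblock-usage share refers to the 596 respondents whose answers were analyzed further out of all 774 participants.

```python
# Quick consistency check of the reported survey figures (our own calculation).
participants = 774
adblock_users = 596

share = adblock_users / participants
print(f"{share:.1%}")  # 77.0% -- matches the reported 77%

# Of the 86.7% who saw an adblock wall, 44.5% leave and 55.5% whitelist/disable,
# i.e. roughly 38.6% and 48.1% of all adblock users, respectively.
saw_wall = 0.867
print(f"{saw_wall * 0.445:.1%}, {saw_wall * 0.555:.1%}")  # 38.6%, 48.1%
```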
5 Discussion
An adblock program constitutes a response to the constantly growing trend of non-acceptance and rejection of advertisements. Currently, instead of consciously or subconsciously ignoring advertisements, users are able to install software which prevents advertisements from being displayed. In many cases advertisements are not downloaded at all, which saves data transfer capacity, reduces the use of processing power, positively affects users’ sense of security and limits disturbances within the content published on websites. At present there is no legal basis for stopping the operation of advertisement blocking programs. Publishers’ reaction to this is to create new advertisement formats, the main feature of which is that they cannot be skipped. Considering the nature of the internet, there are no legal measures for forcing users to view advertisements in which they are not interested, and it is likely that a product will be created to neutralize each newly developed technology. Publishers and internet users will remain on opposing sides of advertising until advertisements are accepted by users and content publishers start using advertising which is acceptable to users. One of the main conclusions of the study was the discovery of a significant difference between the percentage of the population using an adblock program according to PageFair’s study and according to our own study. According to PageFair, 33% of internet users in Poland in 2017 were using such a program. Our study conducted in April 2017 indicated that the percentage of people using an adblock is more than twice as high and amounts to 77%.
Such a significant difference may have been caused by inaccuracy of the data collected by PageFair, which is partly estimated on the basis of publicly available data. The result obtained in our study confirms the result obtained during the interviews, in which most users declared that they are using an adblock. Such a high percentage of people using an adblock is also reflected in the actions undertaken by advertisement publishers, who are increasingly employing scripts which detect adblock functionality and limit access to content. Another important conclusion is users’ awareness that advertisement publishers want to hide content from them. This is why users actively turn adblock programs on and off or add websites to the list of exceptions, which confirms a high level of user awareness in terms of independently configuring adblock programs and using them purposefully.
6 Conclusion
The paper shows the reasons for using adblock programs and presents the currently most popular programs of that type as well as their characteristics. In addition, a survey study was conducted involving a group of Polish internet users, which shows that they actively and consciously block the display of advertisements in their web browsers. In this study we show how adblock software technologies are applied in the Polish environment. We observe a strong desire to block advertisements and well-known ways of doing so. There are also many solutions adapted in Poland to improve adblock accuracy and coverage. Future directions of research will concern the problem of the intrusiveness of advertisements and ways of decreasing it. As mentioned in the paper, users are not against advertisements being displayed—they oppose the way in which they appear on their screen, e.g. by suddenly covering the viewed content. Further research will also be aimed at a quantitative analysis of the elements blocked on websites, such as unwanted advertisements, scripts tracking user activity and malicious activities (malware).
References
1. Adblock Plus. https://adblockplus.org. Accessed 7 July 2017
2. AdBlock. https://getadblock.com. Accessed 7 July 2017
3. Ajdari, D., Hoofnagle, C., Stocksdale, T., Good, N.: Web privacy tools and their effect on tracking and user experience on the internet (2013)
4. AlleBlock. https://tarmas.pl/alleblock. Accessed 7 July 2017
5. Allowing acceptable ads in Adblock Plus. https://adblockplus.org/en/acceptable-ads. Accessed 7 July 2017
6. CertyficateIT. https://www.certyficate.it. Accessed 7 July 2017
7. EasyList. https://easylist.to. Accessed 7 July 2017
8. eyeo GmbH. https://eyeo.com/en/press#stats. Accessed 7 July 2017
9. Google, Microsoft and Amazon pay to get around ad blocking tool. https://www.ft.com/content/80a8ce54-a61d-11e4-9bd3-00144feab7de. Accessed 7 July 2017
10. Haddadi, H., Nithyanand, R., Khattak, S., Javed, M., Vallina-Rodriguez, N., Falahrastegar, M., Murdoch, S.J.: The adblocking tug-of-war. Login USENIX Mag. 41(4), 41–43 (2016)
11. Hemmer, J.L.: The internet advertising battle: copyright laws use to stop the use of adblocking software. Temp. J. Sci. Technol. Environ. Law 24, 479–497 (2005)
12. Ming, W.Q., Yazdanifard, R.: Native advertising and its effects on online advertising. Glob. J. Hum. Soc. Sci. E Econ. 14(8), 11–14 (2014)
13. Nock, R., Esfandiari, B.: On-line adaptive filtering of web pages. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds.) Knowledge Discovery in Databases: PKDD 2005. Lecture Notes in Computer Science, vol. 3721. Springer, Berlin (2005)
14. Nithyanand, R., et al.: Adblocking and counter blocking: a slice of the arms race. In: FOCI (2016)
15. Pujol, E., Hohlfeld, O., Feldmann, A.: Annoyed users: ads and ad-block usage in the wild. In: Proceedings of the 2015 Internet Measurement Conference (IMC 2015), pp. 93–106. ACM, New York (2015)
16. Rasmussen, K., Wilson, A., Hindle, A.: Green mining: energy consumption of advertisement blocking methods. In: Proceedings of the 3rd International Workshop on Green and Sustainable Software, pp. 38–45. ACM (2014)
17. Rens, W.: Browser forensics: adblocker extensions. http://work.delaat.net/rp/2016-2017/p67/report.pdf. Accessed 8 Apr 2017
18. Report of the World Commission on Environment and Development: Our Common Future. http://www.un-documents.net/wced-ocf.htm. Accessed 8 Apr 2017
19. Sandvig, J.Ch., Bajwa, D., Ross, S.C.: Usage and perceptions of internet ad blockers: an exploratory study. Issues Inf. Syst. 12, 59–69 (2011)
20. Singh, A.K., Potdar, V.: Blocking online advertising - a state of the art. In: ICIT 2009 IEEE International Conference on Industrial Technology (2009)
21. The cost of ad blocking. https://pagefair.com/blog/2015/ad-blocking-report/. Accessed 7 July 2017
22. The state of the blocked web. https://pagefair.com/blog/2017/adblockreport/. Accessed 7 July 2017
23. uBlock Origin. https://addons.mozilla.org/en-us/firefox/addon/ublock-origin/statistics/. Accessed 8 Apr 2017
24. uBlock Origin. https://chrome.google.com/webstore/detail/ublock-origin/cjpalhdlnbpafiamejdnhcphjbkeiagm. Accessed 7 July 2017
25. uBlock Origin. https://github.com/gorhill/uBlock. Accessed 7 July 2017
26. uBlock vs. ABP: efficiency compared. https://github.com/gorhill/uBlock/wiki/uBlock-vs.ABP:-efficiency-compared. Accessed 7 July 2017
27. uBlock. https://www.ublock.org. Accessed 7 July 2017
28. Vallade, J.: Adblock plus and the legal implications of online commercial-skipping. Rutgers Law Rev. 61, 823–853 (2008)
29. Walls, R.J., Kilmer, E.D., Lageman, N., McDaniel, P.D.: Measuring the impact and perception of acceptable advertisements. In: Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pp. 107–120. ACM (2015)
30. Wills, C.E., Uzunoglu, D.C.: What ad blockers are (and are not) doing. In: 2016 Fourth IEEE Workshop Hot Topics in Web Systems and Technologies (HotWeb), pp. 72–77. IEEE (2016)
Open Algorithms for Identity Federation
Thomas Hardjono and Alex Pentland
Connection Science and Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
[email protected],
[email protected]
Abstract. The identity problem today is a data-sharing problem. Today the fixed attributes approach adopted by the consumer identity management industry provides only limited information about an individual, and therefore, is of limited value to the service providers and other participants in the identity ecosystem. This paper proposes the use of the Open Algorithms (OPAL) paradigm to address the increasing need for individuals and organizations to share data in a privacy-preserving manner. Instead of exchanging static or fixed attributes, participants in the ecosystem will be able to obtain better insight through a collective sharing of algorithms, governed through a trust network. Algorithms for specific datasets must be vetted to be privacy-preserving, fair and free from bias.
Keywords: Digital identity · Open algorithms · Data privacy · Trust networks
1 Introduction
The Open Algorithms (OPAL) paradigm seeks to address the increasing need for individuals and organizations to share data in a privacy-preserving manner [1]. Data is crucial to the proper functioning of communities, businesses and government. Previous research has indicated that data increases in value when it is combined. Better insight is obtained when different types of data from different areas or domains are combined. These insights allow communities to begin addressing the difficult social challenges of today, including better urban design, containing the spread of diseases, detecting factors that impact the economy, and other challenges of the data-driven society [2,3]. Today there are a number of open challenges with regards to the data sharing ecosystem: • Data is siloed: Today data is siloed within organizational boundaries, and the sharing of raw data with parties outside the organization remains unattainable, either due to regulatory constraints or due to business risk exposures. • Privacy is inadequately addressed: The 2011 World Economic Forum (WEF) report [4] on personal data as a new asset class finds that the current ecosystems that access and use personal data are fragmented and inefficient.
For many participants, the risks and liabilities exceed the economic returns and personal privacy concerns are inadequately addressed. Current technologies and laws fall short of providing the legal and technical infrastructure needed to support a well-functioning digital economy. The rapid rate of technological change and commercialization in using personal data is undermining end-user confidence and trust. • Regulatory and compliance: The introduction of the EU General Data Protection Regulations (GDPR) [5] will impact global organizations that rely on the Internet for trans-border flow of raw data. This includes cloud-based processing sites that are spread across the globe. Similarly, today there are a number of challenges in the identity and access management space, notably in the area of consumer identity: • Identity tied to specific services: Most digital “identities” (namely, identifier strings such as email addresses) are created as an adjunct construction to support access to specific services on the Internet. This tight coupling between digital identifiers and services has given rise to the unmanageable proliferation of user-accounts on the Internet. • Massive duplication of data: Together with the proliferation of user-accounts comes the massive duplication of personal attributes across numerous service providers on the Internet. These service providers are needlessly holding the same set of person-attributes (e.g. name, address, phone, etc.) associated with a user. • Lack or absence of user control: In many cases users have little knowledge about what data is collected by service provider, how the data was collected and the actions taken on the data. As such, end-users have no control over the other usages of their data beyond what was initially consented to. • Diminishing trust in data holders or custodians: The laxity in safeguarding user data has diminished social trust on the part of users in entities which hold their data. Recent attacks on data repositories and theft of massive amounts of data (e.g. Anthem hack [6], Equifax attack [7], etc.) illustrates this ongoing problem. • Misalignment of incentives: Today customer-facing service providers (e.g. online retail) have access only to poor quality user data. Typically such data is obtained from data aggregators who in turn collate an incomplete picture of the user through various back-channel means (e.g. “scraping” various Internet sites). The result is a high cost to service providers for new customer on-boarding, coupled with low predictive capabilities of the data. The identity problem today is in reality a data-sharing problem. The overall goal of this paper is to provide an alternate architecture for identity management based on the open algorithms paradigm. Key to this approach is the notion of sharing information in a privacy-preserving manner based on vetted algorithms, instead of the exchange of fixed attributes approach that has prevailed in the identity industry for the past two decades. The remainder of the paper is arranged as follows. Section 2 provides a brief history and overview of identity management and federation, providing some
definitions of the entities and their functions. Readers familiar with the IAM industry and the current identity federation landscape can skip this section. Section 3 provides further detail on the concepts and principles underlying the open algorithms paradigm. Section 4 addresses the open algorithms paradigm in the context of identity federation, while Sect. 5 briefly discusses the need for a legal trust framework for sharing algorithms. Section 6 briefly touches on the topic of subject consent. The paper closes with a discussion regarding possible future directions for the open algorithms paradigm.
2 Identity Federation and Attribute Sharing: A Brief History
Today Identity and Access Management (IAM) represents a core component of the Internet. IAM infrastructures are an enabler which allows organizations to achieve their goals. Enterprise-IAM (E-IAM) is already a mature product category [8], and E-IAM systems are well integrated into other enterprise infrastructures, such as directory services for managing employees and corporate assets. The primary goal of E-IAM systems is to authenticate and identify persons (e.g. employees) and devices as they enter the enterprise boundary, and to enforce access control driven by corporate policies. In the case of Consumer-IAM (C-IAM) systems, the primary goal is to reduce friction between the consumer and the online service provider (e.g. merchant) through a mediated authentication process, using a trusted third party referred to as the Identity Provider.
Fig. 1. Overview of Web Single Sign On (Web-SSO).
2.1 Web Single Sign-On and Identity Providers
Historically, the notion of the identity provider entity emerged in the late 1990s in response to the growing need to aid users in accessing services online. During this early period, a user would typically create an account and credentials at every new service provider (e.g. online merchant). This cumbersome approach, which is still in practice today, has led to a proliferation of accounts and duplication of the same user attributes across many service providers. The solution that emerged became what is known today as Web Single Sign-On (Web-SSO) [9]. The idea is that a trusted third party referred to as the Identity Provider (IdP) would mediate the authentication of the user on behalf of the service provider. This is summarized in Fig. 1. When a user visits a merchant website (Step 1), the user would be redirected to the IdP to perform authentication (Steps 2–4). After the completion of a successful authentication event, the IdP would redirect the user’s browser back to the calling merchant (Step 5), accompanied by an IdP-signed assertion stating that the IdP has successfully authenticated the user. The merchant would then proceed to transact with the user (Step 6). In order to standardize this protocol behavior, in 2001 an alliance of over 150 companies and organizations formed an industry consortium called the Liberty Alliance Project. The main goal of this consortium was to “establish standards, guidelines and best practices for identity management” [10]. Several significant outcomes for the IAM industry resulted from Liberty Alliance, two of which were: (a) the standardization of the Security Assertions Markup Language (SAML2.0) [11], and (b) the creation of a widely used open source SAML2.0 server implementation called Shibboleth [12]. Today SAML2.0 remains the predominant Web-SSO technology used within Enterprise IAM, which is directly related to the type of authentication protocol dominant in enterprise directory services [13–15]. In the Consumer IAM space, the growth of the web-applications industry has spurred the creation of the OAuth2.0 framework [16] based on JSON web tokens. Similar to the SSO pattern, the purpose of this framework is to authenticate and authorize an “application” to access a user’s “resources”. The OAuth2.0 model follows the traditional notion of delegation where the user as the resource-owner authorizes an application, such as a Web Application or Mobile Application, to access the user’s resources (e.g. files, calendar, other accounts, etc.). In contrast to SAML2.0, which requires the user to be present at the browser to interact with the service provider, in OAuth2.0 the user can disconnect after he or she authorizes the application to access his or her resources. In effect the user is delegating control to the application (to perform some defined task) in the absence of the user. Although the OAuth2.0 design as defined in [16] lacks details for practical implementation and deployment, a fully fledged system is defined in the OpenID-Connect 1.0 protocol [17] specification based on the OAuth2.0 model. It is this OpenID-Connect 1.0 protocol (or variations of it) which is today deployed by the major social media platforms in the Consumer-IAM space.
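The central element of Step 5 above is an assertion signed by the IdP that the relying party can verify. The following Python sketch is our own simplification of that idea: real deployments use SAML 2.0 assertions or OpenID-Connect ID tokens with asymmetric signatures, whereas here a shared-secret HMAC and made-up claim values stand in so the example stays self-contained.

```python
# Simplified sketch of an IdP-signed assertion and its verification by the merchant.
import base64, hashlib, hmac, json, time

IDP_KEY = b"demo-shared-secret"  # illustrative only; real IdPs use asymmetric keys

def issue_assertion(subject: str, audience: str) -> str:
    claims = {"iss": "https://idp.example.org", "sub": subject,
              "aud": audience, "iat": int(time.time())}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(IDP_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def verify_assertion(token: str, expected_audience: str) -> dict:
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(IDP_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload.encode()))
    if claims["aud"] != expected_audience:
        raise ValueError("assertion issued for a different relying party")
    return claims

# The relying party (merchant) checks what the IdP asserted about the user.
token = issue_assertion("alice", "https://shop.example.com")
print(verify_assertion(token, "https://shop.example.com")["sub"])  # alice
```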
Fig. 2. Identity federation and attribute sharing.
2.2 Identity Federation
The notion of federation among identity providers arose from a number of practical needs, one being the scaling of services. The idea is quite straightforward and extends the previous scenario of a user seeking access to a service provider or relying party (i.e. online merchant). This is shown in Fig. 2. The problem was simply the following: when the relying party directs the user to an IdP with whom the relying party has a business relationship, there is a chance that the user will be unknown to that IdP. As such, the solution is for a group of IdPs to “network together” in such a way that when one IdP is faced with an unknown user, that IdP can inquire with other IdPs in the federation. The federation model opens up other interesting possibilities, including the introduction of the so-called attribute provider (AtP) entity, whose primary task is to issue additional useful assertions about the user. More formally, the primary goal of a federation among a group of identity providers (IdP) is to share “attribute” information (assertions) regarding a user [18,19]:
• An Attribute Provider (AtP) is a third party trusted as an authoritative source of information and responsible for the processes associated with establishing and maintaining identity attributes. An Attribute Provider asserts trusted, validated attribute claims in response to attribute requests from Identity Providers and Relying Parties. Examples of Attribute Providers include a government title registry, a national credit bureau, or a commercial marketing database. • An Identity Federation is the set of technology, standards, policies, and processes that allow an organization to trust digital identities, identity attributes, and credentials created and issued by another organization. A federated identity system allows the sharing of identity credentials issued, and identity information asserted, by one or more Identity Providers with multiple Relying Parties (RP). • A Relying Party or Service Provider (RP) is system entity that decides to take an action based on information from another system entity. For example, a SAML relying party depends on receiving assertions from an authoritative asserting party (an identity provider) about a subject [19]. Although the federated identity model using the SSO flow pattern remains the predominant model today for the consumer space, there are a number of limitations of the model – both from the consumer privacy perspective and from the service providers business model perspective: • Identity management as an adjunct service: Most large scale consumer-facing identity services today are a side function to another more dominant service (e.g. free email service, free social media, etc.). • Limited access to quality data: Service providers and relying parties are seeking better insights into the user, beyond the basic attributes of the user. However, other than the major social media providers today, the relying parties today do not have access to rich data regarding the user. Today the scope of attributes or claims being exchanged among federated identity providers is fairly limited (e.g. user’s name, email, telephone, etc.) and thus of little value to the relying party. An example of the list of claims can be found in the OpenID-Connect 1.0 core specifications (Section 5.1 on Standard Claims in [17]) for federations deploying the OpenID-Connect architecture. • Limited user control and consent: Today the user is typically “out of the loop” with regards to consent regarding the information being asserted to by an identity provider or attribute provider. Over the past few years several efforts have sought to address the issue of user control and consent (e.g. [20–22]). The idea here is that not only should the user explicitly consent to his or her attributes being shared, the underlying identity system should also ensure that only minimal disclosure is performed for a constrained use among justifiable parties [23,24].
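The "limited access to quality data" point above can be made concrete by looking at what a federated IdP typically asserts today. The sketch below uses claim names from the OpenID-Connect standard-claims vocabulary cited above [17]; the values are invented for illustration.

```python
# Illustration of the "fixed attributes" exchanged today. Claim names follow the
# OpenID-Connect standard-claims vocabulary [17]; the values are invented. Note how
# little insight such a static payload gives a relying party about the subject.
static_claims = {
    "sub": "248289761001",          # opaque subject identifier
    "name": "Jane Doe",
    "given_name": "Jane",
    "family_name": "Doe",
    "email": "jane.doe@example.com",
    "email_verified": True,
    "phone_number": "+1 555 0100",
    "address": {"locality": "Warsaw", "country": "PL"},
}

# An OPAL-style response, by contrast, would carry the signed result of a vetted
# algorithm run against the subject's data (see Sect. 3), rather than the raw
# attributes themselves.
```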
3 Open Algorithms: Key Concepts and Principles
The concept of Open Algorithms (OPAL) evolved from several research projects over the past decade within the Human Dynamics Group at the MIT Media Lab. From various research results it was becoming increasingly apparent that an individual’s privacy could be affected through the correlation of just small amounts of data [25,26]. One noteworthy seed project was OpenPDS, which sought to develop further the concept of personal data stores (PDS) [27,28] by incorporating the idea of analytics on personal data and the notion of “safe answers” as the norm for responses generated by a personal data store. However, beyond the world of personal data stores there remain pressing challenges around how large data stores are to secure their data, safeguard privacy and comply with regulations (e.g. GDPR [5]) – while at the same time enabling productive collaborative data sharing. The larger the data repository, the more attractive it becomes to hackers and attackers. As such, it became evident that the current mindset of performing data analytics at a centralized location needed to be replaced with a new paradigm for thinking about data sharing in a distributed manner. The OPAL paradigm provides a useful framework within which industry can begin finding solutions for these constraints.
3.1 OPAL Principles
The following are the key concepts and principles underlying the open algorithms paradigm [1]: • Moving the algorithm to the data: Instead of pulling raw data into a centralized location for processing, it is the algorithms that should be sent to the data repositories and be processed there. • Raw data must never leave its repository: Raw data must never be exported from its repository, and must always be under the control of its owner. • Vetted algorithms: Algorithms must be vetted to be “safe” from bias, discrimination, privacy violations and other unintended consequences. The data owner (data provider) must ensure that the algorithms which it publishes have been thoroughly analyzed for safety and privacy-preservation. • Provide only safe answers: When executing an algorithm on a dataset, the data repository must always provide responses that are deemed “safe” from a privacy perspective. Responses must not release personally identifying information (PII) without the consent of the user (subject). This may imply that a data repository return only aggregate answers. • Trust networks (data federation): In a group-based information sharing configuration – referred to as Trust network for data sharing federation – algorithms must be vetted collectively by the trust network members. Individually, each member must observe the OPAL principles and operate on this basis. The operational aspects of the federation should be governed by a legal trust framework (see Sect. 5).
• Consent for algorithm execution: Data repositories that hold subject data must obtain explicit consent from the subject when this data is to be included in a given algorithm execution. This implies that as part of obtaining a subject’s consent, the vetted algorithms should be made available and understandable to subjects. Consent should be unambiguous and retractable (see Article 7 of GDPR [5]). Standards for user-centric authorization and consent-management [20,22] exist today in the identity industry and can serve as the basis for managing subject consent in data repositories. • Decentralized data architectures: By leaving raw data in its repository, the OPAL paradigm points towards a decentralized architecture for data stores [26]. Decentralized data architectures based on standardized interfaces/APIs should be applicable to personal data stores as legitimate endpoints. That is, the OPAL paradigm should be applicable regardless of the size of the dataset. New architectures based on Peer-to-Peer (P2P) networks should be employed as the basis for new decentralized data stores [29]. Since data is a valuable asset, the proper design of a decentralized architecture should aim at increasing the resiliency of the overall system against attacks (e.g. data theft, poisoning, manipulations, etc.). New distributed data security solutions based on secure multi-party computation (e.g. MIT Enigma [30]) and homomorphic encryption should be further developed. Additionally, a decentralized service architecture should enhance distributed data stores. Such a service architecture should introduce automation in the provisioning and deprovisioning of services through the use of smart contracts technology [31]. It is important to note that the term “algorithm” itself has been left intentionally undefined. This is because each OPAL deployment instance must have the flexibility to define the semantics and syntax of its algorithms. In the case of a community of data providers organized under a trust network, the members must collectively agree on the semantics and syntax in the operational sense. Such a definition should be a core part of the legal trust framework underlying the federated community.
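As a concrete illustration of the "move the algorithm to the data" and "provide only safe answers" principles listed above, the following Python sketch shows a data provider running a vetted aggregate query locally and refusing to answer when the cohort is too small. The threshold, field names and function names are our own illustrative choices, not part of the MIT OPAL implementation.

```python
# Hedged sketch: the repository runs a vetted aggregate query locally; raw records
# never leave this module, and only a "safe" aggregate (or a refusal) is returned.
MIN_COHORT_SIZE = 10  # assumed privacy threshold for this example

_RAW_RECORDS = [
    {"subject": "s1", "city": "Warsaw", "monthly_spend": 310.0},
    {"subject": "s2", "city": "Warsaw", "monthly_spend": 450.5},
    {"subject": "s3", "city": "Krakow", "monthly_spend": 220.0},
    # ... many more records held only inside the data provider
]

def avg_spend_by_city(city: str) -> dict:
    """A vetted algorithm: average spend for one city, aggregate-only."""
    cohort = [r["monthly_spend"] for r in _RAW_RECORDS if r["city"] == city]
    if len(cohort) < MIN_COHORT_SIZE:
        return {"status": "refused", "reason": "cohort below privacy threshold"}
    return {"status": "ok", "cohort_size": len(cohort),
            "average_spend": sum(cohort) / len(cohort)}

print(avg_spend_by_city("Warsaw"))  # refused: only 2 records in this toy dataset
```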
3.2 OPAL Query-Response Model
From a technological perspective, the OPAL model is fairly simple to understand (see Fig. 3). A querier entity (e.g. person or organization) that wishes to obtain information from a given data provider selects one or more vetted algorithms (Step (a)). In the MIT software implementation of OPAL, each algorithm is encapsulated in a “template” format that contains the algorithm description, its algorithmID, the identifier of the intended (target) repository, the data-schema, and other parameters (e.g. cost to querier). The algorithm template itself is digitally signed (e.g. by the data provider or algorithm author) to ensure the source-authenticity of the template.
Fig. 3. Overview of open algorithms (OPAL) architecture.
For a given algorithm, the Querier uses this template to construct an “OPAL contract” (a JSON data structure) which contains, among other things, the desired Algorithm-ID. The contract is digitally signed by the Querier and sent to the target data repository. This is shown as Step (b) in Fig. 3. Optionally, the querier may attach payment in order to remunerate the data repository. The data repository validates the signature on the OPAL contract, checks the identity of the Querier and executes the algorithm (corresponding to the Algorithm-ID) on the target dataset in its back-end. The results are then placed into an “OPAL contract-response” (another JSON data structure) by the data provider, digitally signed and returned to the Querier. This is shown as Step (c) in Fig. 3. Optionally, if confidentiality of the query/response is required, the relevant entries in the OPAL contract and contract-response can be encrypted.
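The following Python sketch illustrates the contract/contract-response exchange just described. It is not the MIT implementation: the JSON field names are our own rendering, and an HMAC over the serialized body stands in for the digital signatures used in practice.

```python
# Minimal sketch of the OPAL contract / contract-response exchange (Steps (b)-(c)).
import hashlib, hmac, json

QUERIER_KEY = b"querier-demo-key"    # assumed shared keys, for the sketch only
PROVIDER_KEY = b"provider-demo-key"

def sign(obj: dict, key: bytes) -> dict:
    body = json.dumps(obj, sort_keys=True).encode()
    return {"body": obj, "sig": hmac.new(key, body, hashlib.sha256).hexdigest()}

# Step (b): the querier builds and signs a contract referencing a vetted algorithm.
contract = sign({"algorithm_id": "avg-spend-by-city-v1",
                 "repository_id": "telco-dp-01",
                 "querier_id": "relying-party.example"}, QUERIER_KEY)

# Step (c): the data provider verifies the querier's signature, runs the algorithm
# locally, and returns a signed contract-response carrying only the safe answer.
def execute(contract: dict) -> dict:
    body = json.dumps(contract["body"], sort_keys=True).encode()
    assert hmac.compare_digest(
        contract["sig"], hmac.new(QUERIER_KEY, body, hashlib.sha256).hexdigest())
    result = {"status": "ok", "cohort_size": 4120, "average_spend": 402.7}  # toy value
    return sign({"algorithm_id": contract["body"]["algorithm_id"],
                 "result": result}, PROVIDER_KEY)

print(execute(contract)["body"]["result"]["status"])  # ok
```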
3.3 OPAL-based Data Sharing Federation
In a data sharing federation configuration (e.g. consortium of data providers), the federation may employ a gateway entity that coordinates queries/responses they receive from each other (or from outside if the federation permits). This is shown in Fig. 4. Here the gateway entity mediates requests coming from the querier entity and directs them to the appropriate member data provider. The gateway also collates responses and packages the responses prior to transmitting to the querier. Note that a member data provider may always decline to respond (e.g. data unavailable, it detects attempts to send multiple related queries, etc.). Currently, small test-bed deployments of the basic open algorithms concept are underway for specific and narrow data-domains [32,33].
Fig. 4. Data sharing federation using OPAL.
4 OPAL for Identity Federation
Instead of the exchange of static attributes regarding a user or subject, identity providers and data providers should collectively share vetted algorithms following the open algorithms paradigm. This is summarized in Fig. 5. • Algorithms instead of attributes: Rather than delivering static attributes (e.g. “Joe is a resident of NY City”) to the relying party, allow instead the relying party to choose one or more vetted algorithms from a given data domain (Step (a)). The result from executing a chosen algorithm is then conveyed by the IdP to the relying party in a signed response (Step (c)). The response can also include various metadata embellishments, such as the duration of the validity of the response (e.g. for dynamically changing data sets), identification of the datasets used, consent-receipts, timestamps, and so on. • Convergence of federations: Identity federation networks should engage data provider networks (data owners) in a collaborative manner, with the goal of converging the two types of networks. The goal should be to bring together data providers from differing domains or verticals (e.g. telco data, health data, finance data, etc.) in such a manner that new insights can be derived about a subject with their consent based on the OPAL paradigm. Research has shown, for example, that combining location data with credit card data offers new insights into the financial well-being of a user [34]. • Apply correct pricing models for algorithms and data: For each algorithm and the data to which the algorithm applies, a correct pricing structure needs to be developed by the members of the federation. This is not only to remunerate the data repositories for managing the data-sets and for executing the algorithm (i.e. CPU usage), but also to encourage data owners to develop new business models based on the OPAL paradigm.
Pricing information could be published as part of the vetted-algorithms declaration (e.g. as metadata), offering different tiers of pricing for different sizes of data sets. For example, the price for obtaining insights into the creditworthiness of a subject based solely on their credit card data should be different from the price for obtaining insights based on combined data-sets across domains (e.g. an appropriate combination of a GPS data-set and a credit card data-set) [2]. • Remunerate the subjects: A correct alignment of incentives must be found for all stakeholders in the identity federation ecosystem. Subjects should see some meaningful and measurable return on the use of their data, even if it is in tiny amounts (e.g. pennies or sub-pennies). Returns should be measurable against some measure of the data the subject puts forward (e.g. variety of data, duration of collection, etc.). The point here is that people will contribute more personal data if they are active participants in the ecosystem and understand the legal protections offered by the trust frameworks that govern the data federation and the treatment of their personal data by member data providers. • Logs for transparency and regulatory compliance: All requests and responses must be logged, together with strong audit capabilities. Emerging technologies such as blockchains and distributed ledgers could be employed to provide an efficient but immutable log of events. Such a log will be relevant for post-event auditing and for proving regulatory compliance. Logging and audit are also crucial in order to obtain social acceptance and consent from individuals whose data is present within a data repository. Users as stakeholders in the ecosystem must be able to see what data is present within these repositories and to see who is accessing their data through the execution of vetted algorithms.
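The tamper-evident request/response log mentioned in the last bullet can be approximated with a simple hash chain, as in the Python sketch below. This is our own lightweight stand-in for a blockchain or distributed ledger; the entry fields and function names are illustrative assumptions.

```python
# Sketch of a tamper-evident audit log for OPAL requests and responses.
import hashlib, json, time

audit_log = []  # in practice this would be persisted and replicated

def append_entry(event: dict) -> dict:
    prev_hash = audit_log[-1]["entry_hash"] if audit_log else "0" * 64
    record = {"ts": int(time.time()), "event": event, "prev_hash": prev_hash}
    record["entry_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    audit_log.append(record)
    return record

def chain_is_intact() -> bool:
    prev = "0" * 64
    for rec in audit_log:
        body = {k: rec[k] for k in ("ts", "event", "prev_hash")}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != prev or rec["entry_hash"] != recomputed:
            return False
        prev = rec["entry_hash"]
    return True

append_entry({"querier": "relying-party.example", "algorithm_id": "avg-spend-by-city-v1"})
append_entry({"provider": "telco-dp-01", "status": "ok"})
print(chain_is_intact())  # True; editing any past entry makes this False
```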
Fig. 5. OPAL-based identity and data federation.
The use of the OPAL paradigm for information sharing within an identity federation is summarized in Fig. 5 using the traditional Web-SSO flows. Figure 5 shows an alternate flow pattern, which essentially replaces the attribute providers in Fig. 2 with the OPAL-based federation. In Step 1 of Fig. 5, when the user seeks the services of the relying party, the relying party has the option to request the execution of one or more of the vetted algorithms as part of the redirection of the user to the IdP for authentication (Step 2). Thus, in addition to performing user authentication, the IdP would forward requests for algorithm execution pertaining to the user (as subject) to the data providers as federation members (Step 4). Data providers whose repositories contain data relevant to the relying party could respond to this request from the IdP (Step 5). The IdP then relays these signed OPAL responses to the relying party (Step 6). Note that in Fig. 5 the relying party could in fact bypass the IdP by going straight to the Gateway and the Data Federation. In this topology, the relying party would become the querier, and the Gateway itself could in fact play the dual role of also being an IdP.
5 Trust Framework for OPAL Federation
Today, trust frameworks for identity management and federation in the US are based on three types or “layers” of law [35]. The foundational layer is the general commercial law, consisting of legal rules that are applicable to identity management systems and transactions. This general commercial law was not created specifically for identity management; it is public law written and enacted by governments which applies to all identity systems and their participants – and is thus enforceable in courts. The second “layer” consists of general identity system law. Such law is written to govern all identity systems within its scope, with the intent of addressing the various issues related to the operation of identity systems. The recognition of the need for law at this layer is new, perhaps reflecting the slow pace of development in the legal arena as compared to the technology space. An example of this is the recently enacted Virginia Electronic Identity Management Act [36]. The third “layer” is the set of applicable legal rules and system-specific rules (i.e. specific to the identity system in question). The term “trust framework” is often used to refer to these system rules that have been adopted by the community. A trust framework is needed for a group of entities to govern their collective behavior, regardless of whether the identity system is operated by the government or the private sector. In the case of a private-sector identity system, the governing body consisting of the participants in the system typically drafts rules that take the form of contract-based rules, based on private law. Examples of trust frameworks for identity federation today are FICAM for federal employees [37], the SAFE-BioPharma Association [38] for the biopharmaceutical industry, and the OpenID-Exchange (OIX) [35] for federation based on the OpenID-Connect 1.0 model.
In order for an OPAL-based information sharing federation to be developed, it should use and expand the existing legal trust frameworks for identity systems. This is because the overall goal is for entities to obtain richer information regarding a user (subject), and as such it must be bound to the specific identity system rules. In other words, a new set of third-layer legal rules and system-specific rules must be devised that clearly articulate the required combination of technical standards and systems, business processes and procedures, and legal rules that, taken together, establish a trustworthy system [18] for information sharing based on the OPAL model. It is here that system-specific rules regarding the “amount of private information released” must be created by the federation community, involving all stakeholders including the users (subjects). Taking the parallel of an identity system, an OPAL-based information sharing system must address the following: • Verifying the correct matching between an identity (connected to a human, legal entity, device, or digital object) and the set of data in a repository pertaining to that identity. • Providing the correct result from an OPAL-based computation to the party that requires it to complete a transaction. • Maintaining and protecting the private data within repositories over its life cycle. • Defining the legal framework that defines the rights and responsibilities of the parties, allocates risk, and provides a basis for enforcement. Similar to – and building upon – an identity system’s operating rules, new additional operating rules need to be created for an OPAL-based information sharing system. There are two components to this. The first is the business and technical operational rules and specifications necessary to make the OPAL-based system functional and trustworthy. The second is the contract-based legal rules that (in addition to applicable laws and regulations) define the rights and legal obligations of the parties specific to the OPAL-based system and facilitate enforcement where necessary. As the current work is intended to focus on the concepts and principles of open algorithms and their application to information sharing in the identity context, these two aspects will be the subject of future work.
6 Consent Management in OPAL
The OPAL paradigm introduces an interesting perspective on consent management, because the subject is asked to consent to the execution of an algorithm. This is in contrast to the prevailing mindset today [4] where the subject is asked permission for the data provider to share the subject’s raw data with other entities with whom the subject did not have a transactional relationship. Since the topic of consent is complex and outside the scope of the current paper, we will only touch on it briefly by providing an overview of how a consent management system could augment an OPAL Data Provider deployment.
Fig. 6. Overview of consent management for OPAL.
6.1 Consent to Execute an Algorithm
Asking a subject’s permission to execute an algorithm on their data and obtain “safe answers” is radically different from asking the subject for permission to copy (or “share”) their data to an external entity. This is true regardless if a “subject” is an individual, an organization or a corporation. The OPAL approach provides a greater degree of assurance to the subject that the raw data-set remains where it is (one copy or few local copies) and that the identity of the Querier is known. In OPAL deployments the composition of the consent notice and receipt should include at least the following: • Algorithm and algorithm identification: This indicates which vetted algorithm will be executed against the data-set. This may be a list of multiple algorithms that are designed and vetted to run against the data-set. An implementation of OPAL may include references (e.g. URIs, hashes, etc.) that point back to the signed template which contains the complete algorithm expressed in a given syntax. • Data-set identification: Data-sets and copies of them must be uniquely identifiable within an organization. This could be a fixed identifier with local or global scope, or an identifier (e.g. URI) that resolves to an actual resource (i.e. file) containing the dataset. • Data provider identification: This is the unique identity of the data provider which holds the subject’s data. Examples include an X.509 certificate issued by a valid Certificate Authority. Note that more complex deployment cases may involve a Repository Operator entity who hosts the OPAL data provider system but does not have any legal ownership to the data. • Querier identification: This is the unique identity of the querier (e.g. X.509 certificate). • Terms of use: This is the terms of use (or terms of service) relating to the result of the execution of the algorithm. A simplified easy-to-understand prose
must be included for readability by the subject. The same terms of use (legal prose) must be presented to the Querier (e.g. included in the algorithm “template” description – see Sect. 3.2). A minimal sketch of how these elements could be captured in a machine-readable consent receipt is given below.
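As an illustration only, the consent notice and receipt elements listed above could be captured in a simple machine-readable structure along the following lines. The field names, identifiers and JSON encoding are assumptions made for this sketch and do not follow any particular consent-receipt standard such as [22].

```python
import json
from datetime import datetime, timezone

# Illustrative consent receipt covering the elements listed above.
# Field names, identifiers and hashes are placeholders, not a normative format.
consent_receipt = {
    "algorithm": {
        "id": "opal:algo:avg-daily-distance:v1",            # vetted algorithm identifier
        "template_uri": "https://example.org/templates/avg-daily-distance",
        "template_hash": "sha256:<hash-of-signed-template>",  # points back to the signed template
    },
    "dataset_id": "urn:example:dataset:telco-cdr-2018-q1",
    "data_provider": {
        "name": "Example Telco Ltd.",
        "x509_subject": "CN=opal.example-telco.com,O=Example Telco Ltd.",
    },
    "querier": {
        "name": "Example Bank plc",
        "x509_subject": "CN=analytics.example-bank.com,O=Example Bank plc",
    },
    "terms_of_use": "Result may only be used for credit-risk scoring; no re-identification.",
    "issued_at": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(consent_receipt, indent=2))
```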
6.2 The UMA Model for Consent Management
Given that the majority of social media platform providers today are deploying architectures for identity management and authorization based on the OAuth2.0 framework [16], it makes technological sense to extend the rudimentary OAuth2.0 framework for the purposes of consent management. The User Managed Access (UMA) profile [20,21] of OAuth2.0 provides such an extension. A high-level illustration of the UMA extensions and flows in the context of the OPAL architecture is shown in Fig. 6. In Fig. 6, the Subject as the Resource Owner (RO) predefines his or her consent “rules” at the Consent Management system in Step 1. The Consent Management system is the UMA Authorization Server (AS). The system matches these rules with the resources (e.g. files, data-sets, subsets, etc.) belonging to the Subject as the Resource Owner in Step 2. After the Querier selects the Algorithm (Step 3), the Querier must then obtain a “consent token” (a signed JSON data structure) from the Consent Management system (Step 4). The Querier binds the execution-request to the consent token by digitally signing them together, prior to transmitting to the Data Provider (Step 5). Finally, in Step 6 the repository returns the response to the Querier. A key contribution of the UMA extension of OAuth2.0 is its recognition that in practice the Client is a web-application operated by the Client Operator (CO) as separate legal entity from the user (Querier). Similarly the Resource Server (RS) containing the subject’s resources may be operated by a separate legal entity from the subject. Note that the basic OAuth2.0 framework [16] does not recognize the notion of operators of services (e.g. third parties). As such OAuth2.0 does not distinguish between the Client web-application (to which the user connects via their browser) from the Client Operator entity which legally owns and operates the web-application. This is turn leads to the possibility of the operator providing the remote web-application service without any legal obligations to the querier or the resource owner. More specifically, the operator of the web-application can “listen in” (eavesdrop) to any query/response traffic between the querier and the data provider. This allows the operator to obtain data and information through backdoor access via the web-application which they legally own. In the context of OPAL in Fig. 6 there is an additional need to provide legal recognition of the different entities in the data sharing ecosystem. This includes the Subjects, the Data Owner, the Client Operator and the Repository Operator. The Data Owner legally owns (or co-owns) the collated data, the algorithms and the information derived from the raw data. Individually, the Subject owns a small portion of the raw dataset in the repository. In the case where the OPAL
infrastructure is hosted, the Repository Operator also has legal obligations (e.g. not copying or leaking raw data).
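To make Steps 4 and 5 of the flow more concrete, the sketch below shows one way a Querier could bind its execution request to a consent token by signing both together before sending them to the Data Provider. It is a simplified illustration: the field names are invented, an HMAC stands in for a proper digital signature with the Querier's X.509 key pair, and no attempt is made to follow the actual UMA or OAuth2.0 message formats.

```python
import hashlib
import hmac
import json

# Hypothetical querier key; a real deployment would use an asymmetric signature
# bound to the querier's X.509 certificate rather than a shared-key HMAC.
QUERIER_KEY = b"querier-demo-key"

consent_token = {          # signed JSON structure issued by the Consent Management system (Step 4)
    "subject": "urn:example:subject:42",
    "algorithm_id": "opal:algo:avg-daily-distance:v1",
    "dataset_id": "urn:example:dataset:telco-cdr-2018-q1",
    "expires": "2018-12-31T23:59:59Z",
}

execution_request = {      # the query the Querier asks the Data Provider to run
    "algorithm_id": "opal:algo:avg-daily-distance:v1",
    "parameters": {"period": "2018-Q1"},
}

# Bind request and token by signing their canonical serialization together (Step 5).
payload = json.dumps({"request": execution_request, "consent": consent_token},
                     sort_keys=True).encode()
signature = hmac.new(QUERIER_KEY, payload, hashlib.sha256).hexdigest()

message_to_data_provider = {"payload": payload.decode(), "signature": signature}
print(message_to_data_provider["signature"])
```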
7 Future Directions
Currently there is great interest in the use of artificial intelligence (AI) and machine learning (ML) techniques to obtain better insight into data for various use-cases. For the OPAL paradigm, there are several possible areas of interest: • Distributed machine learning: The principle of leaving raw data in their repositories points to deployment architectures based on distributed data stores and distributed computation. Corresponding to this architecture is the use of machine learning techniques in a distributed manner to improve performance. In an architecture with distributed instances of OPAL data providers (data servers), one approach could be to train the algorithm separately at each data server instance. Each data server instance could hold slightly different training datasets. The model trained at each data server instance would then be serialized and made available to the remote Querier. The OPAL principles remain enforced, where the Querier does not see the raw data but obtains the benefit of distributed data stores performing independent training. • Fairness and accountability: Fairness has been of concern to researchers in the area of machine learning for some time. A key aspect of interest is in ensuring non-discrimination, transparency and understandability of data driven decision-making (e.g. see [2,39]). For the OPAL paradigm fairness is crucial in the vetting process for new algorithms in the context of a given data and use-case. Transparency is a factor in obtaining consent for including data within an given OPAL-based data federation. • Algorithms expressed as smart contracts: Once an algorithm has been vetted to be safe to run against a given dataset, it can be expressed as a smart contract for a given blockchain system or distributed ledger platform. Here a smart contract is defined to be the combined executable code and legalprose [40], digitally signed and distributed on the P2P nodes of a blockchain system. The legal prose would include statements on the terms of use for the resulting response for privacy and regulatory compliance purposes. Depending on the type of blockchain system (permissioned, permissionless, semi-anonymous) the algorithm itself may be publicly readable. The querier (caller) must invoke the smart-contract algorithm accompanied with payment, which will be escrowed until the completion of the execution of the algorithm upon the intended dataset.
8 Conclusions
The identity problem of today is essentially a problem of data – and more specifically a problem of privacy-preserving data sharing.
This paper has described the concepts and principles underlying the open algorithms (OPAL) paradigm. The OPAL paradigm offers a possible way forward for industry and government to begin addressing the core issues around privacy-preserving data sharing. Some of these challenges include siloed data, the limited type/domain of data, and the prohibitive situation of cross-organization sharing of raw data. Instead of sharing fixed attributes regarding a user or subject, the OPAL paradigm offers a way for Identity Providers, Relying Parties and Data Providers to share vetted algorithms. This in turn provides better insight into the user’s behavior, with their consent. It also allows for the development of a trust network ecosystem consisting of these entities, providing new revenue sources, governed by relevant legal agreements and contracts that form the basis for an information sharing legal trust framework. Finally, a new set of legal rules and system-specific rules must be devised that clearly articulates the required combination of technical standards and systems, business processes and procedures, and legal rules that, taken together, establish a trustworthy system for information sharing in a federation based on the OPAL model.
Acknowledgment. The authors thank the following for inputs and insights (alphabetically): Abdulrahman Alotaibi, Stephen Buckley, Raju Chithambaram, Keeley Erhardt, Indu Kodukula, Emmanuel Letouzé, Eve Maler, Carlos Mazariegos, Yves-Alexandre de Montjoye, Ken Ong, Kumar Ramanathan, Justin Richer, David Shrier, and Charles Walton. We also thank the reviewers for valuable suggestions on improvements to the paper.
References 1. Pentland, A., Shrier, D., Hardjono, T., Wladawsky-Berger, I.: Towards an internet of trusted data: input to the whitehouse commission on enhancing national cybersecurity. In: Hardjono, T., Pentland, A., Shrier, D. (eds.) Trust::Data - A New Framework for Identity and Data Sharing, Visionary Future, pp. 21–49 (2016) 2. Pentland, A.: Social Physics: How Social Networks Can Make Us Smarter. Penguin Books (2015) 3. Pentland, A., Reid, T., Heibeck, T.: Big data and health - revolutionizing medicine and public health: report of the big data and health working group 2013. World Innovation Summit for Health, Qatar Foundation, Technical report, December 2013. http://www.wish-qatar.org/app/media/382 4. World Economic Forum. Personal Data: The Emergence of a New Asset Class (2011). http://www.weforum.org/reports/personal-data-emergence-newasset-class 5. European Commission: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Off. J. Eur. Union L119, 1–88 (2016) 6. Abelson, R., Goldstein, M.: Millions of Anthem customers targeted in cyberattack. New York Times, February 2015. https://www.nytimes.com/2015/02/05/ business/hackers-breached-data-of-millions-insurer-says.html
7. Bernard, T.S., Hsu, T., Perlroth, N., Lieber, R.: Equifax says cyberattack may have affected 143 million in the U.S. New York Times, September 2017. https:// www.nytimes.com/2017/09/07/business/equifax-cyberattack.html 8. Gartner: 2017 Planning Guide for Identity and Access Management, Gartner Inc., Report, October 2016 9. OASIS: Profiles for the OASIS Security Assertion Markup Language (SAML) V2.0, March 2005. https://docs.oasis-open.org/security/saml/v2.0/saml-profiles-2.0-os. pdf 10. Liberty Alliance: https://en.wikipedia.org/wiki/Liberty Alliance. Accessed 29 May 2017 11. OASIS: Assertions and protocols for the OASIS security assertion markup language (SAML) V2.0, March 2005. http://docs.oasisopen.org/security/saml/v2.0/samlcore-2.0-os.pdf 12. Morgan, R.L., Cantor, S., Carmody, S., Hoehn, W., Klingenstein, K.: Federated security: the shibboleth approach. EDUCAUSE Q. 27(4), 1217 (2004) 13. Neuman, C., Yu, T., Hartman, S., Raeburn, K.: The kerberos network authentication service (V5). RFC 4120 (Proposed Standard), Internet Engineering Task Force, July 2005, updated by RFCs 4537, 5021, 5896, 6111, 6112, 6113, 6649, 6806. http://www.ietf.org/rfc/rfc4120.txt 14. Zhu, L., Leach, P., Jaganathan, K., Ingersoll, W.: The simple and protected generic security service application program interface (GSS-API) negotiation mechanism. RFC 4178 (Proposed Standard), Internet Engineering Task Force, October 2005. http://www.ietf.org/rfc/rfc4178.txt 15. Jaganathan, K., Zhu, L., Brezak, J.: SPNEGO-based Kerberos and NTLM HTTP authentication in microsoft windows. RFC 4559 (Informational), Internet Engineering Task Force, June 2006. http://www.ietf.org/rfc/rfc4559.txt 16. Hardt, D.: The OAuth 2.0 authorization framework. RFC 6749 (Proposed Standard), Internet Engineering Task Force, October 2012. http://www.ietf.org/rfc/ rfc6749.txt 17. Sakimura, N., Bradley, J., Jones, M., de Medeiros, B., Mortimore, C.: OpenID connect core 1.0. OpenID Foundation, Technical Specification v1.0 – Errata Set 1, November 2014. http://openid.net/specs/openid-connect-core-1 0.html 18. American Bar Association: An overview of identity management: submission for UNCITRAL commission 45th session. ABA Identity Management Legal Task Force, May 2012. http://meetings.abanet.org/webupload/commupload/ CL320041/relatedresources/ABA-Submission-to-UNCITRAL.pdf 19. OASIS: Glossary for the OASIS Security Assertion Markup Language (SAML) V2.0, March 2005. http://docs.oasis-open.org/security/saml/v2.0/samlglossary-2. 0-os.pdf 20. Hardjono, T., Maler, E., Machulak, M., Catalano, D.: User-Managed Access (UMA) Profile of OAuth2.0 – Specification Version 1.0, April 2015. https://docs. kantarainitiative.org/uma/rec-uma-core.html 21. Maler, E., Machulak, M., Richer, J.: User-Managed Access (UMA) 2.0, January 2017. https://docs.kantarainitiative.org/uma/ed/uma-core-2.0-10.html 22. Lizar, M., Turner, D.: Consent Receipt Specification Version 1.0, March 2017. https://kantarainitiative.org/confluence/display/infosharing/Home 23. Cameron, K.: The Laws of Identity (2004). http://www.identityblog.com/stories/ 2004/12/09/thelaws.html
24. Cavoukian, A.: 7 laws of identity - the case for privacy-embedded laws of identity in the digital age. Office of the Information and Privacy Commissioner of Ontario, Canada, Technical report, October 2006. http://www.ipc.on.ca/index. asp?navid=46&fid1=470 25. de Montjoye, Y.A., Quoidbach, J., Robic, F., Pentland, A.: Predicting personality using novel mobile phone-based metrics. In: Social Computing, Behavioral-Cultural Modeling and Prediction, LCNS, vol. 7812, pp. 48–55. Springer (2013) 26. Pentland, A.: Saving big data from itself. Sci. Am., 65–68 (2014) 27. Hardjono, T., Seberry, J.: Strongboxes for electronic commerce. In: Proceedings of the Second USENIX Workshop on Electronic Commerce. USENIX Association, Berkeley (1996) 28. de Montjoye, Y.A., Shmueli, E., Wang, S., Pentland, A.: openPDS: protecting the privacy of metadata through SafeAnswers. PLoS ONE 9(7), 13–18 (2014). https:// doi.org/10.1371/journal.pone.0098790 29. De Filippi, P., McCarthy, S.: Cloud computing: centralization and data sovereignty. Eur. J. Law Technol. 3(2) (2012). SSRN: https://ssrn.com/abstract=2167372 30. Zyskind, G., Nathan, O., Pentland, A.: Decentralizing privacy: using blockchain to protect personal data. In: Proceedings of the 2015 IEEE Security and Privacy Workshops. IEEE (2015) 31. Hardjono, T.: Decentralized service architecture for OAuth2.0. Internet Engineering Task Force, draft-hardjono-oauth-decentralized-00, February 2017. https:// tools.ietf.org/html/draft-hardjono-oauth-decentralized-00 32. Frey, R., Hardjono, T., Smith, C., Erhardt, K., Pentland, A.: Secure sharing of geospatial wildlife data. In: Proceedings of the Fourth International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich 2017, May 2017 33. DataPop: Data-Pop Alliance (2017). http://datapopalliance.org 34. Singh, V.K., Bozkaya, B., Pentland, A.: Money walks: implicit mobility behavior and financial well-being. PLOS ONE 10(8), 1–17 (2015). https://doi.org/10.1371/ journal.pone.0136628 35. Makaay, E., Smedinghoff, T., Thibeau, D.: OpenID exchange: trust frameworks for identity systems, June 2017. http://www.openidentityexchange.org/wp-content/ uploads/2017/06/OIX-White-Paper Trust-Frameworks-for-Identity-Systems Final.pdf 36. State of Virginia: Virginia Electronic Identity Management Act, VA Code 2.2-436 2.2-437; VA Code 59.1-550 59.1-555 March 2015. https://lis.virginia.gov/cgi-bin/ legp604.exe?151+ful+CHAP0483 37. US General Services Administration: U.S. Federal Identity, Credential and Access Management (FICAM) Program (2013). http://info.idmanagement.gov 38. SAFE-BioPharma Association: SAFE-BioPharma FICAM Trust Framework Provider Approval Process (FICAM-TFPAP) (2016). https://www.safebiopharma.org/SAFE Trust Framework.html 39. Adebayo, J., Kagal, L.: Iterative orthogonal feature projection for diagnosing bias in black-box models. In: Proceedings of 3rd Workshop on Fairness, Accountability, and Transparency in Machine Learning, New York, NY, USA, November 2016 40. Norton Rose Fulbright: Can smart contracts be legally binding contracts. Norton Rose Fulbright, Report, November 2016. http://www.nortonrosefulbright.com/ knowledge/publications/144559/can-smart-contracts-be-legally-binding-contracts
A Hybrid Anomaly Detection System for Electronic Control Units Featuring Replicator Neural Networks Marc Weber1(B) , Felix Pistorius1 , Eric Sax1 , Jonas Maas2 , and Bastian Zimmer2 1 Institute for Information Processing Technologies, Karlsruhe Institute of Technology, Karlsruhe, Germany {marc.weber3,felix.pistorius,eric.sax}@kit.edu 2 Vector Informatik GmbH, Stuttgart, Germany {jonas.maas,bastian.zimmer}@vector.com
Abstract. Due to the steadily increasing connectivity combined with the trend towards autonomous driving, cyber security is essential for future vehicles. The implementation of an intrusion detection system (IDS) can be one building block in a security architecture. Since the electric and electronic (E/E) subsystem of a vehicle is fairly static, the usage of anomaly detection mechanisms within an IDS is promising. This paper introduces a hybrid anomaly detection system for embedded electronic control units (ECU), which combines the advantages of an efficient specification-based system with the advanced detection measures provided by machine learning. The system is presented for - but not limited to - the detection of anomalies in automotive Controller Area Network (CAN) communication. The second part of this paper focuses on the machine learning aspect of the proposed system. The usage of Replicator Neural Networks (RNN) to detect anomalies in the time series of CAN signals is investigated in more detail. After introducing the working principle of RNNs, the application of this algorithm on time series data is presented. Finally, first evaluation results of a prototypical implementation are discussed. Keywords: Intrusion detection system · Anomaly detection Machine learning · Automotive · Controller Area Network · Time series
1 Introduction
Today, connectivity and highly automated driving are the two major topics pushing the evolution of automotive electronics. Both enable a significant improvement for passenger comfort and safety. However, especially in their combination, connectivity and highly automated driving yield new dangerous scenarios. On the one hand, vehicles become increasingly connected with their environment and other vehicles. New wireless technologies like WiFi, Bluetooth and Car2X communication are installed, which enable new cyber-attack vectors [1–3]. On the
other hand, ECUs get more and more control over safety-relevant functions of a vehicle, like braking and steering, in order to realize automated driving. To counter the risk of fatal cyber-attacks, several researchers and leading companies propose a multi-layer security concept [1,4–7]. A so-called defense in depth architecture could e.g. consist of four defense barriers as proposed by Miller and Valasek [1]: (1) Secure off-board communication. (2) Access control for in-vehicle networks. (3) Separation of different domains within the electric and electronic architecture. (4) Mechanisms to detect and prevent cyber-attacks on vehicle networks and within ECUs. This paper focuses on the last defense barrier, for which related research and industry propose the installation of IDS and intrusion prevention systems (IPS), e.g. for in-vehicle CAN networks [1,7–9]. The presented IDS concept for CAN combines the efficiency of a specification-based approach with the advanced detection of irregularities using machine learning algorithms. Although the IDS system is introduced for automotive ECUs, the concept is not limited to that use case. The system is designed in a flexible way, enabling an easy adaption to different network technologies, as well as ECUs in different application areas. Primary goal of the presented system is improving the cyber security of ECUs. As one part of the last defense barrier, it checks for irregularities in vehicle networks and it does not rely on pre-defined attack patterns. With this approach, it is possible to recognize also unknown and newly arising cyber-attacks. However, the detection mechanisms work independent of the root cause of an irregularity, which potentially is a cyber-attack but also could be a malfunction of an ECU. For the automotive domain, the latter is especially relevant for the next generation of vehicles, if we consider the upcoming self-adapting and machine learning-based vehicle functions. The proposed system could help safeguarding these functions during runtime since a complete validation at development time becomes difficult. The remaining part of this paper is structured as follows: Sect. 2 introduces the term anomaly and its different types, followed by the discussion of related work in Sect. 3. The first central aspect is the elaboration of the proposed hybrid anomaly detection system in Sect. 4, starting with a general system overview and continuing with an explanation of the single building blocks. The second part of this paper starts with Sect. 5, which discusses the topic of anomaly detection in time series data with special emphasis on automotive communication signals. With respect to the proposed system and its boundary conditions, RNNs are selected to be evaluated in more detail for this use case. Therefore, Sect. 6 introduces the working principle of a RNN, followed by the description of its application on CAN communication signals in Sect. 7. This second central aspect of the paper at hand is further elaborated in Sect. 8 by presenting detailed evaluation results. Section 9 concludes this paper with a summary and gives an outlook on future work regarding the hybrid anomaly detection system.
2 Different Types of Anomalies
Irregularities, so-called anomalies, are deviations from normal behavior. In the literature, an anomaly is also referred to as an outlier, among other terms. D.M. Hawkins gives a corresponding definition, which reflects the basic assumption of the proposed system [10]: “An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” The following paragraphs introduce different types of anomalies, which are useful for classifying anomaly detection mechanisms. The simplest form of an anomaly is the point anomaly, defined by Chandola et al. as follows [11]: “If an individual data instance can be considered as anomalous with respect to the rest of data, then the instance is termed as a point anomaly.” As an example from the automotive context, the available data could be a CAN bus log. In this case, the mentioned data instance is a single recorded CAN message. If that single message can be classified as an anomaly without considering other data instances, e.g. because it contains the information that the vehicle is driving 200 km/h at 1500 rpm in the 1st gear, it is a point anomaly. In contrast to the point anomaly, a contextual anomaly can only be classified as such if additional contextual information is taken into account [11]. Looking at the example above, if a CAN message contains the data 250 km/h at 4000 rpm in the 6th gear, this would not be a point anomaly since it is realistic in the first place. However, if the message was recorded while driving in the inner city, it can be declared a contextual anomaly. The last type is the collective anomaly, referring to a sequence of data instances, which together form an anomaly [11]. This would e.g. be the case if two or more consecutive CAN messages indicate an unrealistic or physically impossible acceleration of the vehicle.
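To make the distinction concrete, the following sketch checks a decoded CAN message for the point anomaly used as an example above (an implausible speed/rpm/gear combination) and for a simple contextual anomaly (high speed while driving in the inner city). The plausibility limits are invented for illustration and are not taken from a real vehicle.

```python
# Illustrative plausibility limits per gear: maximum speed in km/h at low rpm (<= 1500 rpm).
# The concrete numbers are assumptions, not taken from a real communication matrix.
MAX_SPEED_AT_LOW_RPM = {1: 20, 2: 40, 3: 70, 4: 100, 5: 140, 6: 180}

def is_point_anomaly(speed_kmh: float, rpm: float, gear: int) -> bool:
    """A single message is anomalous on its own if the physical combination is implausible."""
    return rpm <= 1500 and speed_kmh > MAX_SPEED_AT_LOW_RPM.get(gear, 0)

def is_contextual_anomaly(speed_kmh: float, context: str) -> bool:
    """The same value may only be anomalous in a given context, e.g. inner-city driving."""
    return context == "inner_city" and speed_kmh > 70

# 200 km/h at 1500 rpm in 1st gear -> point anomaly
print(is_point_anomaly(200, 1500, 1))             # True
# 250 km/h at 4000 rpm in 6th gear -> plausible in isolation ...
print(is_point_anomaly(250, 4000, 6))             # False
# ... but anomalous while driving in the inner city
print(is_contextual_anomaly(250, "inner_city"))   # True
```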
3 Related Work
Besides defining the different anomaly types, Chandola et al. did a comprehensive state of the art analysis for anomaly detection techniques [11]. Many of them are based on machine learning algorithms, having the advantage that the behavior of the observed system has not to be known in advance but the normal behavior is learned from test data. Machine learning is powerful but it also comes with a cost. Although most algorithms are successfully applied in different application domains, including intrusion detection, for some of them it is hard to understand what happens in detail inside the system, e.g. when using large neural networks. In consequence, it is hard to prove that machine learning algorithms work as intended under all circumstances. Additionally, most of them are resource intensive in terms of required computing power and/or memory consumption. Therefore, their application within embedded software is currently limited and has to be evaluated with care. As mentioned previously, researchers and industry experts recommend the usage of an IDS for automotive CAN communication [1,7–9]. However, there is
one crucial difference compared to the classical application areas of IDS solutions in computers and computer networks: CAN buses are used in a well-defined environment. All CAN messages, sent and received by the connected ECUs, are defined by the vehicle manufacturers in advance to guarantee interoperability. This means that there is a lot of knowledge available, defining the normal behavior of the bus system. Additionally, the automotive industry has established several standards to specify the CAN communication between ECUs in a semi-formal manner, often referred to as communication matrix. This specification includes the exchanged messages and the transported payload. The payload itself consists of signals, each representing a single transported data element, e.g. vehicle speed, rpm and gear. Re-using this available knowledge in an anomaly detection system for CAN communication is vital. Checking for deviations from known system behavior can be implemented efficiently and without using machine learning algorithms. It is not necessary to learn e.g. the period of a CAN message, if it is already statically defined. However, there are properties and scenarios, which cannot be checked purely based on a given communication matrix. One example is the time series of a CAN signal. While minimum and maximum signal values are usually defined, there is no information about temporal behavior given statically. This temporal behavior of a signal can be observed by using machine learning algorithms as shown e.g. by Andreas Theissler [12], although his work focuses on offline analysis instead of real-time intrusion detection on ECUs.
Fig. 1. Block diagram of the hybrid anomaly detection system for CAN.
4 Hybrid Anomaly Detection System
The basic idea of the proposed system is to use specification-based anomaly detection and machine learning algorithms sequentially within the embedded software of an ECU. Figure 1 depicts the principle building blocks of the twostage system. Due to resource constraints on ECUs, machine learning should only be used if necessary. Therefore, specification-based techniques are applied in the first stage as much as possible, re-using the knowledge provided by the vehicle manufactures. This stage is further referred to as static checks. Static checks are very well suited to detect point anomalies like signal values out of range. In addition, simple collective anomalies can be detected efficiently using static checks, as long as they can be derived from the communication matrix, e.g. the evaluation of message periods. When a CAN message is sent or received by an ECU, the static checks evaluate the protocol header and the signals, contained in the payload section. In the second stage of the system, learning checks extend the detection possibilities, especially for advanced contextual and collective anomalies by using machine learning. To apply the corresponding algorithms, relevant information has to be extracted from CAN messages. The static checks only forward selected data elements like signal values to the learning checks, depicted in Fig. 1 as feature base. The feature extraction block, pre-processes the data elements and generates features, which represent the input data for the following algorithms. Pre-processing can contain multiple aspects, like building time series, calculating derivations and normalization. In their state of the art analysis, Chandola et al. [11] present different machine learning algorithms for anomaly detection. Each of them is well suited for different use cases. Therefore, the proposed system allows using a variety of algorithms, working on the same features. Figure 1 shows two examples: Replicator Neural Networks [13] and One-Class Support Vector Machines (OCSVM) [14]. Each algorithm produces as output either a binary value (‘normal’/‘anomaly’) or a so-called anomaly score, which represents the probability of an anomaly. These produced outputs are evaluated within the anomaly analyzer block. Before it finally reports to the anomaly logging, this block e.g. checks an anomaly score for a defined threshold value. The design of the learning checks also allows for ensemble-based methods, as proposed in recent research e.g. by Andreas Theissler [15]. In an ensemble, multiple algorithms run in parallel, checking for the same anomalies and each of them producing its own output. In such a setup, the anomaly analyzer performs a voting between the different outputs and finally decides whether an anomaly is reported or not. This kind of post-evaluation is not necessary for static checks, since they do not work with probabilities. Therefore, they directly report to the anomaly logging, which securely stores the detected anomalies and the corresponding boundary conditions. Up to now, all discussed aspects of the anomaly detection system are designed to run on an ECU. The last part, the anomaly detection configurator, is a tool, running on an office computer. Its purpose is to generate the anomaly detection configuration, specifying the used static and learning checks. The tool parses the
semi-formal communication matrix and extracts all information, required for the implementation. Additionally, the user can configure pre-defined learning checks. Due to the separation of configuration and implementation, the anomaly detection system is also vertically divided into two stages. The configuration is done in a uniform manner, independent from the target environment. In contrast, the implementation is specific for different embedded software frameworks like Linux or Automotive Open System Architecture (AUTOSAR). However, it shall be possible to generate the implementation based on the anomaly detection configuration, so that no or just little manual coding is necessary to apply anomaly detection on a specific ECU.
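A minimal sketch of the specification-based first stage could look as follows: range and cycle-time checks derived from a small, hand-written excerpt of a communication matrix. The message and signal definitions are invented for illustration; in the proposed system such checks would be generated by the anomaly detection configurator rather than written by hand.

```python
from dataclasses import dataclass

@dataclass
class SignalSpec:
    name: str
    minimum: float
    maximum: float

@dataclass
class MessageSpec:
    can_id: int
    period_ms: float          # nominal cycle time from the communication matrix
    tolerance_ms: float       # accepted jitter
    signals: list

# Hand-written excerpt of a communication matrix (illustrative values only).
SPEC = {
    0x101: MessageSpec(0x101, period_ms=20.0, tolerance_ms=5.0,
                       signals=[SignalSpec("VehicleSpeed", 0.0, 250.0),
                                SignalSpec("EngineRPM", 0.0, 8000.0)]),
}

_last_seen = {}

def static_checks(can_id, timestamp_ms, decoded_signals):
    """Return a list of detected anomalies for one received CAN message."""
    anomalies = []
    spec = SPEC.get(can_id)
    if spec is None:
        return [f"unknown CAN id 0x{can_id:X}"]
    # Range check per signal (point anomaly).
    for sig in spec.signals:
        value = decoded_signals.get(sig.name)
        if value is None or not (sig.minimum <= value <= sig.maximum):
            anomalies.append(f"{sig.name}={value} out of range")
    # Cycle-time check (simple collective anomaly).
    last = _last_seen.get(can_id)
    if last is not None and abs((timestamp_ms - last) - spec.period_ms) > spec.tolerance_ms:
        anomalies.append(f"period violation on 0x{can_id:X}")
    _last_seen[can_id] = timestamp_ms
    return anomalies

print(static_checks(0x101, 0.0,  {"VehicleSpeed": 80.0,  "EngineRPM": 2500.0}))  # []
print(static_checks(0x101, 60.0, {"VehicleSpeed": 300.0, "EngineRPM": 2500.0}))  # range + period
```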
5 Anomaly Detection in Time Series Data
Currently, CAN is the most important and most frequently used bus system in vehicle networks. Data exchange between ECUs via CAN signals is vital for the majority of the implemented vehicle functions like engine control and electronic stability control. Besides the pre-defined signal properties, e.g. absolute minimum and maximum value, the signal sequence is one major characteristic. Especially signals, representing continuous physical values, are well suited for anomaly detection in their sequence. As an example, the speed of a vehicle is bound to physical constraints. It cannot change too rapidly and some signal sequences are very unlikely to occur, e.g. oscillation. These constraints are not pre-defined within the communication matrix and therefore cannot be observed by static checks. However, wrong signal sequences are collective anomalies, which are detectable by learning checks. In a first step, a machine learning algorithm has to be selected, which is suited for anomaly detection in signal sequences, also referred to as time series data. For this use case, potentially a lot of different classification and clustering algorithms can be used. Classification-based techniques use supervised learning and require the availability of so-called labeled data. In this case, data instances are marked to belong to one defined class. Considering anomaly detection, there are the two classes normal and anomaly. With this pre-knowledge a classifier is learned, which is able to distinguish normal from anomalous data instances. In contrast, clustering-based techniques use unsupervised learning and do not require labeled data. They group data instances and thereby try to find the classes normal and anomaly automatically. However, most of these algorithms require many data instances to be available during runtime in order to perform the grouping. Since the proposed anomaly detection system shall be implemented on an embedded ECU, considering the required resources is important. Computing power on an embedded device is limited, compared to office computers or server systems. Following this constraint, the computational complexity of an algorithm must be low when executed on an ECU to detect anomalies. On the other hand, the training of the algorithm may be computational complex as long as it can be performed offline, like on an office workstation. Afterwards, the trained algorithm has to be transferred to the ECU, e.g. in terms of updating a parameter
set. Not only the computing power is limited on an ECU but also the available memory. This means that applied machine learning algorithms must not require much read-only memory (ROM) and, even more important, not much random access memory (RAM). Due to these constraints, clustering-based techniques are excluded from the decision process of finding an appropriate algorithm for anomaly detection in time series data. As mentioned above, the remaining classification-based techniques require labeled data. However, getting labeled data for anomalous signal sequences in a large scale is difficult or even impossible. Also, training with anomalous data instances would result in a classifier, which is probably only able to detect known anomalies. This contradicts the idea of anomaly detection, which tries to find any kind of deviation from normal behavior. Therefore, algorithms are used, which only require the presence of normal data instances during the training and which perform a so-called one-class classification. Two of these one-class classification algorithms were investigated in more detail. The first one is the One-Class Support Vector Machine (OCSVM), introduced by Sch¨ olkopf et al. [14]. This algorithm is a special derivative of the Support Vector Machine (SVM), which calculates a linear separator (also called classifier) for two data classes during training. Since the data instances of two classes are mostly not linear separable, usually a kernel function is applied. With this kernel function the classification problem is transformed into a higher dimensional space in which the data instances can be separated in a linear way. In testing phase, data instances are classified to belong to class one or two, based on their location with respect to the separator. When using the SVM in its original form, the resulting separator has the maximum margin to the training data instances of the two classes. In case of one-class classification problems only data instances of one class are given, i.e. normal data. To be able to calculate a separator, the second class is now representing the anomalous data instances for which no training data is available. For this scenario, the OCSVM uses again the kernel function to calculate a linear separator in the higher dimensional space. In the original space, this separator surrounds the normal training data instances as close as possible. In testing phase, all data instances, which are located outside the enclosed normal data, are considered anomalous. Details on the OCSVM algorithm can be found in [14]. The second investigated algorithm is the Replicator Neural Network, introduced for anomaly detection by Hawkins et al. [13,16]. RNNs are feed-forward neural networks, which are based on the multilayer perceptron and try to learn the identity function for the given training data. The basic assumption is that after a successful training, the RNN is able to reproduce known inputs, i.e. normal data instances, at its outputs with low error. For unknown inputs, representing anomalies, the reproduction error is considerably higher. Using a threshold mechanism, this error is transformed into the final output normal or anomaly. The working principle of RNNs is explained in more detail in Sect. 6. Both algorithms fulfill the discussed prerequisites for an implementation on an ECU. The training phase is computationally complex for both. For the
OCSVM, additionally all data instances have to be considered at once during the training, since the algorithm calculates a separator enclosing all training data instances. This makes online learning impossible, i.e. the adaption of the separator during runtime, if the algorithms runs on an ECU. For RNNs, there are two training modes possible. A so-called batch mode, which corresponds to the training principle of an OCSVM. In this mode, the parameters of a RNN are adapted after all training data instances have been processed. However, there is also an online mode possible, adapting the parameters after each single data instance. With this mode, online learning can be applied, although most actual ECUs are not powerful enough. However, considering the increasing computing power with every ECU generation, it is assumed that online learning becomes possible in future. In contrast to the complex training, the testing phase is much simpler for both of the discussed algorithms. In case of an OCSVM, it has to be checked, whether a data instance is located inside or outside the separator. For RNNs, one propagation through the network has to be calculated, which is a sequence of multiplications and summations. Although OCSVMs and different derivatives are often discussed and successfully applied, even for anomaly detection in time series data [12], this paper focuses on the application of RNNs. This is mainly due to the possibility of online learning. Although not used in the first step, it offers potential for future improvements. Another related advantage of RNNs compared to OCSVMs is that the training can be performed incrementally. When new training data instances are available, the training of an OCSVM has to be restarted whereas the training of RNNs can be continued using the new data instances.
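For illustration of the one-class idea, the sketch below trains a One-Class SVM offline on sliding windows of a normal signal and then flags windows of a test signal. It uses scikit-learn's OneClassSVM purely as an offline example and is not meant to represent the embedded implementation; the synthetic signals and parameters are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def windows(signal, length=16):
    """Sliding windows over a 1-D signal, one row per window."""
    return np.array([signal[i:i + length] for i in range(len(signal) - length + 1)])

# Synthetic "normal" training signal (smooth rise and fall), values already in [0, 1].
t = np.linspace(0, 2 * np.pi, 400)
train_signal = (np.sin(t) + 1) / 2

# Test signal with an injected jump anomaly.
test_signal = train_signal.copy()
test_signal[200:210] = 1.0

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(windows(train_signal))

pred = clf.predict(windows(test_signal))      # +1 = normal window, -1 = flagged window
print("flagged windows:", np.where(pred == -1)[0])
```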
6 Replicator Neural Networks
This section gives an overview of the working principle of Replicator Neural Networks. As mentioned above, RNNs are based on the multilayer perceptron, which was first discussed by Minsky and Papert [17]. Figure 2 shows a multilayer perceptron with two inputs x - so-called features - on the left hand side, two neurons in the hidden layer and an output layer with one neuron. Each neuron sums up the weighted (w) inputs x and calculates its output o using the activation function ϕ, which in this case is a sigmoid function (see (1)). The superscript indices in Fig. 2 indicate the layer of the neural network.

$$o_j = \varphi\left(\sum_{i=0}^{n} x_i w_{ij}\right) \qquad (1)$$
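In code, the output of a single neuron according to (1) reduces to the sigmoid of a weighted sum; the minimal sketch below treats x[0] = 1 as the bias input, which is an assumption made for illustration.

```python
import math

def neuron_output(x, w):
    """o_j = sigmoid(sum_i x_i * w_ij), with x[0] = 1 acting as the bias input."""
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))

print(neuron_output([1.0, 0.4, 0.7], [0.05, -0.03, 0.02]))
```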
Since a RNN tries to reproduce known or similar inputs at its outputs, the number of input values has to match the number of output values and therefore the number of output neurons [13]. The number of hidden layers as well as their neuron count is flexible. Hidden layers with fewer neurons represent the input information in a compressed form at their outputs [18]. Figure 3 shows a RNN
Fig. 2. Multilayer perceptron.
with four inputs, one hidden layer with two neurons and four outputs. The bias of the neurons (x_0) and the weights (w_ij) are not depicted for simplicity reasons. During the training phase, data instances represent the input and output of the RNN at the same time. By using e.g. the Backpropagation algorithm [19], the weights of the edges between neurons are adapted with the goal of minimizing the error between the actual and the expected output. This way, the RNN learns an approximation of the identity function for the training data set. However, this approximation does not cover un-trained data instances. Therefore, after the training is finished, the reproduction error corresponds to an anomaly score: the higher the error, the more likely it is an anomaly. The error between an n-dimensional input and output can be calculated in different ways. In the proposed setup, the Outlier Factor (OF) according to (2) is used, defined by Hawkins [10], where n corresponds to the number of input values.

$$OF = \frac{1}{n}\sum_{i=0}^{n} (x_i - o_i)^2 \qquad (2)$$
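Putting (1) and (2) together, the following sketch propagates one feature vector through a small 4-2-4 replicator network and turns the reconstruction error into the binary decision of Fig. 4. The weights are random placeholders and the threshold is chosen arbitrarily; a trained network would use the weights obtained via Backpropagation and the threshold selection described further below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyRNN:
    """Minimal 4-2-4 replicator network (random, untrained weights for illustration)."""
    def __init__(self, n_in=4, n_hidden=2):
        self.w1 = rng.uniform(-0.05, 0.05, size=(n_in + 1, n_hidden))   # +1 for the bias input
        self.w2 = rng.uniform(-0.05, 0.05, size=(n_hidden + 1, n_in))

    def forward(self, x):
        h = sigmoid(np.concatenate(([1.0], x)) @ self.w1)    # hidden layer, cf. Eq. (1)
        return sigmoid(np.concatenate(([1.0], h)) @ self.w2)  # output layer reproduces the input

def outlier_factor(x, o):
    """OF = (1/n) * sum_i (x_i - o_i)^2, cf. Eq. (2)."""
    return float(np.mean((np.asarray(x) - np.asarray(o)) ** 2))

rnn = TinyRNN()
x = np.array([0.2, 0.25, 0.3, 0.35])           # normalized signal window
of = outlier_factor(x, rnn.forward(x))
OF_S = 0.1                                     # threshold, chosen arbitrarily here
y = 1 if of > OF_S else 0                      # 1 = anomaly, 0 = normal (cf. Fig. 4)
print(of, y)
```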
By applying a threshold to the OF, a data instance can be classified as normal or anomaly. Figure 4 shows the complete anomaly detection using a RNN in a more abstract way. f_RNN represents the approximation of the identity function and σ the application of the threshold OF_S. Due to σ, y is a binary value (y ∈ {0, 1}), where 1 represents an anomaly and 0 indicates normal data. A RNN has different parameters, influencing the approximation capabilities and therefore the overall performance of the anomaly detection. These are:
• Number of input and output values (n)
• Number of hidden layers
• Number of neurons in each hidden layer
Fig. 3. Replicator neural network.
Fig. 4. Anomaly detection with a replicator neural network.
• Activation function
• Initial weights of the edges between neurons (w)
• Threshold value (OF_S)
There exist guidelines on how to choose the right values for the last three parameters. For the activation function, a bounded output value is desired, mostly [0, 1]. This way, numerically very large values, caused by single neurons in the hidden or output layer, are avoided. Additionally, it has to be differentiable in order to apply the Backpropagation algorithm during the training phase. In many cases the sigmoid function is used, which fulfills both requirements and whose derivative can in addition be expressed in terms of the function itself. This is beneficial for the calculation of the Backpropagation algorithm. Although in some use cases different activation functions are used, especially for the neurons in the output layer, in the case of the proposed anomaly detection system all neurons feature a sigmoid activation function. Fernandez-Redondo et al. discuss multiple methods for the selection of initial weights in neural networks [20]. A simple but effective variant is the selection of the initial weights according to (3). w_ij is selected randomly in the given range, considering that the bias of the neurons w_0j is greater than or equal to w_ij. This method is independent of the training data set and is applied in the proposed system.

$$w_{ij} \in [-0.05, 0.05];\quad w_{0j} \geq w_{ij} \qquad (3)$$
The threshold value OF_S selects the maximum OF for which the corresponding data instance is considered normal. Even for a trained RNN, the OF will
unlikely be zero for data instances of the training data set. Dau et al. propose to set OF_S to the maximum value occurring during the validation of the RNN [18]. Using the complete training data set for validation, this approach ensures that all corresponding data instances are considered normal. In the proposed system, the threshold value is set according to this procedure. The first three RNN parameters are use-case specific. To determine proper values for the desired anomaly detection on time series data of CAN signals, an empirical study on the influence of those parameters on the detection result is performed in Sect. 8.
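The threshold selection described above can be expressed in a few lines: after training, the OF is computed for every (normal) validation instance and OF_S is set to the maximum observed value. The forward function and the validation windows below are assumed to come from a trained network, similar to the earlier sketch.

```python
import numpy as np

def outlier_factor(x, o):
    """OF according to Eq. (2)."""
    return float(np.mean((np.asarray(x) - np.asarray(o)) ** 2))

def select_threshold(forward, validation_windows):
    """Set OF_S to the largest OF observed on normal validation data, so that all
    validation instances are classified as normal (cf. Dau et al. [18])."""
    return max(outlier_factor(x, forward(x)) for x in validation_windows)

# Usage with a hypothetical trained network exposing a forward() method:
# OF_S = select_threshold(trained_rnn.forward, validation_windows)
```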
7 Application of RNNs on Time Series Data
The principle idea is to use one RNN for observing the time series of one communication signal. Regarding the implementation, first the sequences of CAN signals have to be made available as input values within the anomaly detection system. Every time a CAN message is received, the static checks extract the single signals, which are then passed to the feature extraction block, please refer to Fig. 1. This block builds up and manages the time series of those signals with a configurable count of history values. Internally, it implements a sliding window mechanism, which shifts the already available signal values as soon as a new value arrives. The oldest value is disposed. Figure 5 enhances Fig. 3 by the preceding management of the signal sequence. In the example, the configured length of the time series is four, i.e. the actual and three past values. As depicted, each time series value represents a feature and one input for the associated RNN. However, the depicted features xt ..xt−3 are not directly the signal values, present in the original CAN message, but they have been normalized before. The normalization in a value between 0 and 1 is performed with respect to the minimum and maximum value of the signal, defined in the communication matrix. If minimum and maximum signal values are not defined, they are derived from the training data set. The training itself is realized by a special program, which takes recorded CAN logs as input. Additionally, it features parameters, which can be used to specify the count of training-epochs as well as the amount of data instances, which shall be used for training and validation, respectively. Within one epoch, the RNN is trained with the specified amount of data instances of the training data set and validated with the remainder. As mentioned previously, the adaption of the weights between neurons is performed according to the Backpropagation algorithm. To parametrize this algorithm, the initial learning rate and the learning rate adaption are configurable as well. Internally, the program executes the anomaly detection system with every recorded CAN message. Since online mode is used, the weights of the RNNs are adapted after each execution. When the training is finished, i.e. after the given number of epochs, each RNN has learned an approximation of the identity function for the CAN signal sequence it is responsible for.
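A compact sketch of this feature extraction step is given below: new signal values are normalized with the minimum/maximum from the communication matrix and shifted into a fixed-length window that serves as the input of the associated RNN. The class name and the window length of four are illustrative choices.

```python
from collections import deque

class SignalWindow:
    """Sliding window of normalized values for one CAN signal."""
    def __init__(self, minimum, maximum, length=4):
        self.minimum, self.maximum = minimum, maximum
        self.window = deque([0.0] * length, maxlen=length)

    def push(self, raw_value):
        norm = (raw_value - self.minimum) / (self.maximum - self.minimum)
        norm = min(max(norm, 0.0), 1.0)   # clamp in case the value leaves the defined range
        self.window.append(norm)          # the oldest value is dropped automatically
        return list(self.window)          # features x_{t-3} .. x_t for the associated RNN

speed = SignalWindow(minimum=0.0, maximum=250.0, length=4)
for value in (10.0, 12.5, 15.0, 20.0, 26.0):
    features = speed.push(value)
print(features)
```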
Fig. 5. Replicator neural network for anomaly detection in time series data.
8 Evaluation
To evaluate the performance of the proposed system, a synthetic CAN signal is used. The signal sequence as well as the corresponding CAN messages are generated with CANoe from Vector Informatik, which is a simulation and analysis tool for automotive networks. A corresponding communication matrix is used to automatically generate the static checks, including the signal extraction from CAN messages as well as their normalization. The signal shape is defined in a way to emulate a physical value of a vehicle, e.g. speed, and is shown in Fig. 6. It is used as a reference for the following considerations, i.e. it represents the normal behavior.
Fig. 6. CAN signal shape for training (reference) and evaluation.
For training and evaluation, the depicted shape is concatenated into a continuous signal. The training data set consists of 10,000 CAN messages, containing the signal according to the defined reference shape. For the evaluation of the anomaly detection performance, an altered signal shape was defined as well, containing five anomalies of different types. The anomalies are inserted in the middle of the continuous time series between second 110 and 150 (counted from sequence
start). The corresponding shape with the highlighted anomalies is depicted in the lower part of Fig. 6.
(1) 114–116 s - Limitation of the value range: This anomaly limits the minimum and maximum value of the signal to values different from the pre-defined ones.
(2) 119–122 s - Value freeze: In this case, a value is kept for an unnaturally long time before the original signal sequence resumes.
(3) 128–130 s - Alternative signal sequence: In contrast to the two types discussed before, an alternative sequence is not an anomaly. Instead it shall be a first check of the generalization of the approximation. A RNN shall not learn the exact pattern of a signal sequence but implement a certain generalization, i.e. alternative signal sequences, which represent another form of realistic or normal behavior, shall not be detected as anomalies. Otherwise, many false positive alarms would be issued when implemented in a real ECU, since e.g. the actual sequence of speed values will never be completely identical to the sequences contained in the training data set.
(4) 137–139 s - Peak: The peak anomaly refers to a spike in the signal sequence, which should not be present.
(5) 145–147 s - Signal jump: Instead of having a continuous shape, the signal sequence includes an unrealistically high value step.
(A sketch of how such anomalies could be injected into a reference signal is given below.)
Before discussing evaluation results in detail, a nomenclature has to be established, which easily represents the architecture of a RNN. The architecture is given by the number of layers and the number of neurons in each of those layers. Therefore, a nomenclature like 4-2-4 is used in the following to specify the architecture of a RNN. The first 4 in the example indicates four neurons in the input layer, followed by two neurons in the hidden layer (indicated by the 2). The output layer again holds four neurons, represented by the second 4. Please also refer to Fig. 5. Consequently, a RNN with architecture 4-3-2-3-4 would have three hidden layers with three, two and again three neurons. In the following, the anomaly detection performance of RNNs with different numbers of input/output neurons, hidden layers and neurons in hidden layers is evaluated. This shall help define the application-specific parameters of a RNN, as discussed in Sect. 6. However, only one parameter is altered at a time for comparability reasons. Additionally, there are the initial weights of the edges between neurons, which are set randomly. Therefore, a comparison is in general difficult. To counter that issue, each RNN architecture is instantiated ten times for the evaluation. The comparison between the different architectures is always done for the best RNN of each class. Also, for one evaluation, all networks are trained identically, i.e. with the reference sequence and with the same number of training epochs.
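Evaluation signals of this kind could be synthesized along the following lines. The sketch injects a value freeze, a peak and a signal jump into an arbitrary reference sequence; the time indices and magnitudes are placeholders and do not reproduce the exact values used in the evaluation above.

```python
import numpy as np

def inject_anomalies(reference, fs=1.0):
    """Return a copy of the reference signal with a value freeze, a peak and a jump injected."""
    sig = reference.astype(float).copy()
    n = len(sig)
    # (2) value freeze: hold one value for an unnaturally long time
    i = n // 3
    sig[i:i + int(3 * fs)] = sig[i]
    # (4) peak: short spike that should not occur
    j = n // 2
    sig[j] += 0.5 * (sig.max() - sig.min())
    # (5) signal jump: unrealistically high value step
    k = 2 * n // 3
    sig[k:] += 0.4 * (sig.max() - sig.min())
    return sig

t = np.linspace(0, 2 * np.pi, 300)
reference = (np.sin(t) + 1) / 2
anomalous = inject_anomalies(reference, fs=10.0)
print(anomalous.shape)
```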
8.1 Different Number of Input/Output Neurons
The first evaluation considers different numbers of input/output neurons, i.e. how many history values are taken into account when observing a signal sequence.
Fig. 7. Anomaly detection performance for RNNs with a different number of input/output neurons.
All used RNNs have one hidden layer, which has the same number of neurons as the input and output layers. Figure 7 shows the corresponding result. The first row in Fig. 7 represents the reference signal, whereas the second row shows the altered signal containing the discussed anomalies. All other rows below depict the detection performance of the best RNN of each evaluated architecture, in this case 4-4-4 up to 64-64-64, as indicated by the assigned headlines. The grey signal is the calculated OF and the rectangular black overlay shows the detected anomalies after the application of the OF_S threshold. The best detection performance is achieved by the RNNs with 16-16-16 and 32-32-32 architectures. They detect all anomalies and do not issue a false positive for the alternative signal sequence. When considering the difference in the OF between normal and anomalous signal sequences - comparable to a signal-to-noise ratio - the 32-32-32 architecture performs best. Additionally, it seems that an increasing number of input/output neurons leads to an increased low-pass behavior within the OF. However, it has
to be considered that also the computational effort and the memory consumption increase with the number of neurons. Therefore, the 16-16-16 architecture is used as a reference for the following evaluations.
Fig. 8. Anomaly detection performance for RNNs with a different number of neurons in the hidden layer.
8.2 Different Number of Neurons in the Hidden Layer
The next step is the investigation of the detection performance depending on the number of neurons in the hidden layer. Starting from the 16-16-16 RNN, an architecture with one neuron less and another architecture with one additional neuron in the hidden layer are investigated. As depicted in Fig. 8, no important difference can be identified. For the 16-16-16 architecture, the results differ between the discussed evaluations. This is due to the different initial weights at the beginning of the training phase.
8.3 Different Number of Hidden Layers and Neurons
The third evaluation setup considers the number of hidden layers and extends the previous investigation regarding the number of hidden neurons. Figure 9 shows three additional RNN architectures, each having three hidden layers instead of one. These networks seem to be more sensitive, if the results are compared for the first and the second anomaly. This sensitivity has the cost of additional layers and neurons and therefore increased computing effort as well as memory consumption.
Fig. 9. Anomaly detection performance for RNNs with a different number of hidden layers and corresponding neurons.
8.4 Considering 1st and 2nd Derivation of a Signal Sequence
Up to now, only the signal value was considered. Having a look to the results shown in Figs. 7, 8 and 9, the RNNs seem to react sensitive to a large gradient change. Therefore, in this evaluation, the RNNs do not only have the normalized value sequence as an input, but also the sequences of the normalized first and second derivation of the signal. Since the gradients are not defined statically in the communication matrix, the normalization is done by using minimum and maximum values, which are learned from the training data set. The first investigated RNN type has a 21-21-21-21-21 architecture. Eight of the inputs represent the signal value sequence, seven the first and six inputs the second derivation. This RNN does not detect an anomaly, which is comparable to the RNN with eight inputs in the first evaluation setup. For the larger networks, again an increased sensitivity can be observed, as shown in Fig. 10.
Fig. 10. Anomaly detection performance for RNNs, considering signal derivations.
8.5 Correlation of Communication Signals
Considering the sequence of one communication signal and potentially its derivations shows promising results in terms of anomaly detection with RNNs. Because in an automotive network many communication signals correlate to each other, e.g. speed and rpm, the last evaluation targets the anomaly detection in two correlating signals. Also in this case, RNNs are used. However, now each RNN gets the value sequence of two signals, i.e. a 32-32-32 architecture gets 16 values of the first and 16 values of the second signal as input features. During the training phase, the first one is the reference signal discussed above. The second signal is based on the same reference but mirrored and scaled. The scaling is done for a more realistic correlation scenario. However, different value ranges are already addressed by the normalization and therefore do not influence the behavior of the RNNs. For the validation, the modified reference signal with anomalies is used, whereas the second signal remains untouched. Figure 11 depicts the promising results. For simplicity reasons, only the reference signal without and with anomalies is shown. The mirrored and scaled second signal is omitted. As can be seen, all anomalies are detected nearly perfectly. Now also the alternative signal sequence is detected. However, in this case it has to be detected, since it also represents a violation of the correlation between the two input signal sequences. It is also worth to note, that in this evaluation the OF during an anomaly is by far higher compared to the OF during normal data. For the synthesized signals, even RNNs with few input features work very well. The RNN with
The RNN with an 8-8-8 architecture also detected all anomalies while using the fewest resources. Whether this also holds true for real communication signals has to be investigated in future work.
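A sketch of how the input vectors for this two-signal evaluation could be assembled is given below; the window length of 16 values per signal follows the 32-32-32 architecture described above, while the synthetic example signals and the function name are ours.

```python
import numpy as np

def paired_windows(sig_a, sig_b, window=16):
    # sig_a, sig_b: equally long, already normalized value sequences
    vectors = []
    for start in range(len(sig_a) - window + 1):
        vectors.append(np.concatenate([sig_a[start:start + window],
                                       sig_b[start:start + window]]))
    return np.asarray(vectors)

a = np.linspace(0.0, 1.0, 200)   # placeholder reference signal
b = 1.0 - a                      # mirrored second signal (scaling is removed by normalization)
inputs = paired_windows(a, b)    # shape (185, 32): 16 values per signal per window
```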
Fig. 11. RNN anomaly detection performance for two correlating signals.
9 Conclusion and Future Work
This paper motivated the need for intrusion detection systems in automotive ECUs. Today, anomaly detection, which comes along with IDS, is mostly used within information technology (IT) infrastructure such as servers and networking equipment. Since automotive and embedded networks are far more static than IT infrastructure, a different approach is proposed. Instead of using a solution purely based on machine learning or other complex algorithms, we favor a hybrid anomaly detection system. In a first stage, the system works specification-based. Especially in the automotive industry, there exist semi-formal network specifications, which are well suited to implement an efficient and effective IDS.
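Purely as an illustration of such a specification-based first stage, the sketch below checks a decoded signal against a value range and cycle time taken from a hypothetical communication-matrix entry; the dictionary contents, the tolerance and the function are invented for this example and are not part of the proposed system.

```python
# Hypothetical communication-matrix entry: signal name, value range, cycle time.
SPEC = {"vehicle_speed": {"min": 0.0, "max": 250.0, "cycle_ms": 100}}

def check_signal(name, value, dt_ms, tol=0.2):
    spec = SPEC[name]
    in_range = spec["min"] <= value <= spec["max"]
    in_cycle = abs(dt_ms - spec["cycle_ms"]) <= tol * spec["cycle_ms"]
    return in_range and in_cycle  # False would raise a specification-based alarm

ok = check_signal("vehicle_speed", 83.5, 101)  # True: value and timing match the spec
```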
However, the specifications do not contain information about the temporal behavior of communication signals. For the advanced observation of this temporal behavior, machine learning algorithms can be used. In the second part of this paper, we investigated Replicator Neural Networks for anomaly detection in communication signal sequences. First evaluations, based on synthesized signals, show promising results. Some investigated RNN architectures are able to detect all generated anomalies, whereas an alternative signal sequence did not cause a false positive alarm. Considering the length of the observed signal sequence, it is interesting to see that the best performance is achieved by a medium-sized network (and not by the largest one). Altering the number of hidden neurons did not have a noticeable effect. Instead, more hidden layers and using signal derivations as additional input features resulted in a more sensitive anomaly detection by the corresponding RNNs. Very promising results are shown for anomaly detection with respect to two correlating communication signals, also for small RNNs. Since the evaluation is so far based on synthesized communication signals, the next step is to perform similar tests with real data. As a starting point, vehicle signals collected via the standardized On-Board Diagnostics II port can be used. However, in a final step, real bus logs should be used to evaluate the performance under the same conditions as if the anomaly detection system were running in a real ECU. A second field of work is the extension of the proposed system to additional bus and networking systems. As Ethernet is currently becoming an important in-vehicle networking technology, the extension by static checks for Ethernet would be another next step. Since the learning checks work on communication signals, they are independent of the underlying networking or bus system. Therefore, the investigated RNNs are reusable in an extended hybrid anomaly detection system.
References

1. Miller, C., Valasek, C.: A survey of remote automotive attack surfaces. In: Black Hat USA, vol. 2014 (2014)
2. Miller, C., Valasek, C.: Remote exploitation of an unaltered passenger vehicle. In: Black Hat USA, vol. 2015 (2015)
3. Checkoway, S., McCoy, D., Kantor, B., Anderson, D., Shacham, H., Savage, S., Koscher, K., Czeskis, A., Roesner, F., Kohno, T., et al.: Comprehensive experimental analyses of automotive attack surfaces. In: USENIX Security Symposium (2011)
4. Onishi, H.: Paradigm change of vehicle cyber security. In: Czosseck, C. (ed.) 4th International Conference on Cyber Conflict (CYCON). IEEE, Piscataway (2012)
5. AUTO-ISAC: Automotive Cybersecurity Best Practices: Executive Summary. In: AUTO-ISAC, 2016th edn. (2016)
6. Brown, D.A., Cooper, G., Gilvarry, I., Grawrock, D., Rajan, A., Tatourian, A., Venugopalan, R., Vishik, C., Wheeler, D., Zhao, M., Clare, D., Fry, S., Handschuh, H., Patil, H., Poulin, C., Wasicek, A., Wood, R.: Automotive security best practices: recommendations for security and privacy in the era of the next-generation car. In: White Paper, McAfee Inc. (2015)
7. van Roermund, T., Birnie, A.: A multi-layer vehicle security framework. In: Whitepaper, May 2016 ed. NXP B.V. (2016)
8. Hoppe, T., Kiltz, S., Dittmann, J.: Security threats to automotive CAN networks: practical examples and selected short-term countermeasures. Reliab. Eng. Syst. Saf. 96(1), 11–25 (2011)
9. Studnia, I., Nicomette, V., Alata, E., Deswarte, Y., Kaâniche, M., Laarouchi, Y.: Security of embedded automotive networks: state of the art and a research proposal. In: Roy, M. (ed.) SAFECOMP 2013 - Workshop CARS (2nd Workshop on Critical Automotive Applications: Robustness & Safety) of the 32nd International Conference on Computer Safety, Reliability and Security, Toulouse, France (2013)
10. Hawkins, D.M.: Identification of Outliers, Monographs on Applied Probability and Statistics. Springer, Dordrecht (1980)
11. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009)
12. Theissler, A.: Anomaly detection in recordings from in-vehicle networks. In: Proceedings of Big Data Applications and Principles, Madrid (2014)
13. Hawkins, S., He, H., Williams, G., Baxter, R.: Outlier detection using replicator neural networks. In: Kambayashi, Y., Arikawa, M., Winiwarter, W. (eds.) Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science. Springer, Heidelberg (2002)
14. Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., Platt, J.: Support vector method for novelty detection. In: Advances in Neural Information Processing Systems, vol. 12, pp. 582–588. MIT Press, Cambridge (2000)
15. Theissler, A.: Detecting known and unknown faults in automotive systems using ensemble-based anomaly detection. Knowl.-Based Syst. 123, 163–173 (2017)
16. Williams, G., Baxter, R., He, H., Hawkins, S., Gu, L.: A comparative study of RNN for outlier detection in data mining. In: Kumar, V. (ed.) Proceedings/2002 IEEE International Conference on Data Mining, ICDM 2002, pp. 709–712. IEEE Computer Society, Los Alamitos (2002)
17. Minsky, M.L., Papert, S.A.: Perceptrons: An Introduction to Computational Geometry, 2nd edn. The MIT Press, Cambridge (1972)
18. Dau, H.A., Ciesielski, V., Song, A.: Anomaly detection using replicator neural networks trained on examples of one class. In: Dick, G. (ed.) Simulated Evolution and Learning, Lecture Notes in Computer Science, vol. 8886, pp. 311–322. Springer, Cham (2014)
19. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing, A Bradford Book, pp. 318–362. MIT Press, Cambridge (1986)
20. Fernández-Redondo, M., Hernández-Espinosa, C.: Weight initialization methods for multilayer feedforward. In: Verleysen, M. (ed.) Proceedings/9th European Symposium on Artificial Neural Networks, ESANN 2001, pp. 119–124. D-Facto, Brussels (2001)
Optimizing Noise Level for Perturbing Geo-location Data

Abhinav Palia and Rajat Tandon
Department of Computer Science, University of Southern California, Los Angeles, CA, USA
{palia,rajattan}@usc.edu
Abstract. With the tremendous increase in the number of smart phones, App stores have been overwhelmed with applications requiring geo-location access in order to provide their users better services through personalization. Revealing a user's location to these third party Apps, no matter at what frequency, is a severe privacy breach which can have unpleasant social consequences. In order to prevent inference attacks derived from geo-location data, a number of location obfuscation techniques have been proposed in the literature. However, none of them provides any objective measure of privacy guarantee. Some work has been done to define differential privacy for geo-location data in the form of geo-indistinguishability with an l privacy guarantee. These techniques do not utilize any prior background information about the Points of Interest (PoIs) of a user and apply Laplacian noise to perturb all the location coordinates. Intuitively, the utility of such a mechanism can be improved if the noise distribution is derived after considering some prior information about PoIs. In this paper, we apply the standard definition of differential privacy to geo-location data. We use first principles to model various privacy and utility constraints, the prior background information available about the PoIs (the distribution of PoI locations in a 1D plane) and the granularity of the input required by different types of apps, in order to produce a more accurate and utility maximizing differentially private algorithm for geo-location data at the OS level. We investigate this for a particular category of Apps and for some specific scenarios. This will also help us to verify whether Laplacian noise is still the optimal perturbation when we have such prior information.

Keywords: Differential privacy · Utility · Points of interest · Geo-location data · Laplacian noise

1 Introduction
Over the years, a number of mobile phone services are becoming dependent on user's location in order to provide a better experience, be it a dating app, restaurant search, nearby gas stations lookup and what not. All these services require a user to surrender her location (mostly exact coordinates) in order
to derive accurate results. With the increasing popularity of social networks, extracting auxiliary information about an individual has become easier than ever before. Both of these factors have increased the likelihood of inference attacks on users, which can have unpleasant social consequences. Therefore, revealing a user's location, no matter at what frequency, is a severe privacy breach [5]. The criticality of geo-location data can be estimated from news pieces reporting that the Egyptian government used to locate and imprison users of Grindr, a gay dating app [4]. Grindr uses the geo-location of its users in order to provide them a perfect match in their vicinity. Most of the users have submitted their "stats" such as body weight, height, eye color, ethnicity, preferences, on-prep (AIDS status), extra information, etc. while creating a profile. Even half of these values, along with their geo-location, can be used to derive inferences uniquely identifying a particular user. [6] has reported social relationship leakage of a user through applications which use GPS data. A number of inferences can be deduced by observing social relationships of an individual which he might not want to disclose. Tracking location coordinates or identifying PoIs of an individual can characterize his mobility and can reveal sensitive information such as hobbies, political or religious interests, or even potential diseases [7]. All these studies provide enough motivation for the research community to find a solution to protect geo-location privacy. Although geo-indistinguishability presents various appealing aspects, it has the problem of treating space in a uniform way, imposing the addition of Laplace noise everywhere on the map [3]. This assumption is too strict and can affect the utility of the service. A Laplace-based obfuscation mechanism satisfying this privacy notion works well in the case where no prior information is available. However, most of the apps which require geo-location as input are conditioned on a prior of the destination or the PoIs in general. In this paper, we investigate whether the choice of Laplacian noise to perturb geo-data is optimal in scenarios where prior information about a user's PoIs is available. Intuitively, the availability of this information will improve the utility of the differentially private mechanism but has to be conditioned with some more constraints. We use the basic definition of differential privacy as well as first principles to model the utility and privacy constraints in order to deduce a noise distribution for the scenario where we have a prior. In the next section we discuss the related work done in this direction. In Sect. 3, we clearly define the problem statement we counter in this paper. Section 4 discusses our proposal and contribution towards the solution of this problem. In Sects. 5 and 6, we present our results and the future trajectory of our work, respectively. In Sect. 7, we conclude our findings. After listing the references, Appendices A and B provide the mathematical solution of our constraints.
2 Related Work
Most handheld devices provide three options for allowing location access to the installed apps, namely Always, While using and Never. One can easily
predict the harm which can be caused when this permission is granted Always. On the other hand, we still want to use the service from the app, so not providing this permission by selecting Never is not a valid choice. In such a situation, the While using option appears appropriate but can still be used by an attacker to track the trajectory of a user. Intuitively, it is better to trust the OS, which can sanitize the geo-location data before supplying it as an input to the App. The literature proposes different ideas to perturb geo-location data. [10] proposes the idea of spatial and temporal cloaking which uses k-anonymity, l-diversity and p-sensitivity. Other spatial obfuscation mechanisms proposed in [11,12,14] reduce the precision of the location information before supplying it to the service. Most of these techniques are not robust and are also detrimental to utility functions [1], as they are based on very simple heuristics such as i.i.d. location sampling or sampling locations from a random walk on a grid or between points of interest. The location traces generated using these techniques fail to capture essential semantic and even some basic geographic features. Techniques such as spatial cloaking perturb the exact location of the user but do not provide any privacy guarantee. Additionally, they are not resistant to probability based inference attacks [9]. Thus, there exists a knowledge gap between these techniques and the desired characteristics of a location perturbation mechanism. Differential privacy has a good reputation for providing a privacy guarantee by adding carefully calibrated noise while maintaining an acceptable level of utility. Geo-indistinguishability, proposed in [2], defines a formal notion of privacy for location-based systems that protects a user's exact location, while allowing approximate information, typically needed to obtain a certain desired service, to be released. It formalizes the intuitive notion of protecting the user's location within a radius r with a level of privacy l that depends on r, and corresponds to a generalized version of the well-known concept of differential privacy. The authors in [2] claim that adding Laplace noise can perturb data effectively. As pointed out in [8], the utility of a differentially private mechanism can be increased if some prior information is available about the user. Also, in [13] a generic prior distribution π, derived from a large user dataset, is used to construct an efficient remap function for increasing the utility of the obfuscation algorithm. Clearly, if we can gather some information about the PoIs of a user, it can help us to provide a more useful result. However, this information leakage (a prior distribution available publicly) is also useful for the adversary to design his remap function over the output of a differentially private mechanism. Therefore, the privacy bounds would require some alteration and, intuitively, the use of Laplace noise might not be an optimal choice.
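For reference, the Laplace-based perturbation discussed above reduces, in the 1-D setting used later in this paper, to adding noise with scale 1/ε to the true coordinate; the snippet below is a generic sketch of that idea (the 2-D geo-indistinguishability mechanism of [2] instead draws planar Laplace noise in polar form), with the ε value chosen arbitrarily for illustration.

```python
import numpy as np

def perturb_1d(x, epsilon, rng=None):
    # 1-D Laplace perturbation: larger epsilon means less noise, less privacy.
    rng = rng or np.random.default_rng()
    return x + rng.laplace(loc=0.0, scale=1.0 / epsilon)

z = perturb_1d(0.0, epsilon=0.7)  # 0.7 is an arbitrary illustrative value
```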
3 Problem Statement
In this section, we elaborate the problem statement considered in this paper. Since geo-indistinguishability is a flavor of differential privacy for geo-location data, it does not take into account various factors such as (1) π: the a priori probability distribution (prior) which is relative to the user and her knowledge,
i.e. the user's PoI history [8]; (2) ψ: the a priori probability distribution relative to the OS's knowledge about the location of the PoIs, for instance, the locations of restaurants (= PoIs) relative to the current location. Since most LBS require the user to provide the "destination location" (through which the OS can determine ψ), this information can help the OS to perturb the original location in a biased way (towards the PoI) and therefore maximize the utility of the mechanism. Clearly, with the knowledge of ψ, we can formulate a set of linear constraints and use them to determine whether Laplace is still the best choice or whether we need some new noise distribution. In the next section, we begin by stating the basics and then use first principles to model various privacy and utility constraints in order to derive a noise distribution.
4 Proposed Solution
First we define the basic structure of the problem by stating the mathematical constructs for the prior, privacy and utility goals of the mechanism. Then we define the example problem and present the privacy and utility constraints.

PoI prior (ψ): For multiple PoIs located at L1, L2, ... from the actual location i, the prior is defined as the distribution of these PoIs, denoted by ψ = {L1, L2, ...}.

Privacy: We use [2] to define the notion of privacy, i.e., for any user located at point i, she enjoys ρ-privacy within a radius r. More precisely, observing z, the output of the mechanism K when applied to i (as compared to the case when z is not available), does not increase the attacker's ability to differentiate between i and j (|i − j| ≤ δ and j lies inside the circle of radius r centered at i) by more than a factor depending on ρ (ρ = ε·r).

Utility: We propose that a differentially private mechanism for geo-location data is utility maximizing if the output of a Location Based Service (LBS) does not change when the input given to it is the perturbed location (as compared to the output when the input is the original location). The output of an LBS depends on the type of query it answers, and the scope of such queries is vast. However, in this paper we restrict ourselves to the following queries:

Query 1: Get me the nearest PoI, my distance to it and provide the option to navigate to it. The output should be the nearest PoI and the approximate/effective distance to it when a perturbed input location is provided to the LBS.

Query 2: Get me the nearest PoI and my distance to it. The output should be the nearest PoI and the approximate/effective distance to it when a perturbed input location is provided to the LBS.

Query 3: Get me the list of PoIs starting from the nearest to the farthest. The output of this query is the list of the PoIs, from the nearest to the farthest, such that the order of the output list is the same when a perturbed location input is provided as when the exact location is provided.
Fig. 1. 1-Dimensional scenario.
4.1 Example Problem
For the sake of simplicity, we begin with a 1-dimensional example problem (Fig. 1) in which a user Alice is located at a point i and the point of interest is a restaurant located at a distance L from her. Further, we shift the origin to i, so we can denote the coordinates of the destination as (L, 0) and write ψ = {L}. Alice wants to have a ρ level of privacy within a distance r from (0, 0). We define the privacy level ρ within a linear distance r on both sides of the original location, instead of a circle with radius r, just to suit our 1-D model. We also want to ensure a ρ0 level of privacy outside this region. For differential privacy to hold, we consider a mechanism K, conditioned with ψ, which takes location i as input and produces output z from the output space S ⊆ E (E is the 1-D Euclidean plane). S in this case includes all the points lying on the x-axis. Intuitively, the availability of a prior ψ will help us to provide a better output but will not affect the privacy constraints.

Privacy Constraints: The privacy constraints for our mechanism (both queries) are as follows:

(i) P(i, z, ψ) > 0 ∀z; the probability of outputting any point in the output space when the mechanism is applied on the input location i should be non-zero.
(ii) Σ_{z=−∞}^{∞} P(i, z, ψ) = 1; probability values must sum to 1, given that for our case z ∈ S ⊆ E, where S includes all the points lying on the x-axis.
(iii) P(i, K(i) ∈ S, ψ) ≤ e^{ρ}·P(j, K(j) ∈ S, ψ) for two points i and j with |i − j| ≤ δ; the differential privacy constraint derived from the definition [2], where ε is related to the user-defined level of privacy ρ as ρ = ε·r.

Utility Constraints

(1) Query 1 and 2: For a better understanding of the solution, we categorize the apps into two classes.
Class A: apps which output the nearest restaurant and the distance to it but do not provide a navigation facility (and do not show the original location of the user). Class B: apps which output the nearest restaurant, the distance to it and also provide navigation to this PoI. The logic behind having these two classes is that the utility constraints are more relaxed for the Class A apps, since the output space is less restricted and we can have the same distance to the PoI from a number of places. In the case of Class B apps, the app has to be supplied with a point closer to the original location, because it needs a starting point for the navigation. Summarizing: Query 1 – apps which take the perturbed user location as input, output the nearest PoI and the distance to it, and provide the option to start navigation (Class B). Query 2 – apps which take the perturbed user location as input and output the nearest PoI and the distance to it (no navigation) (Class A).

• Query 1:
  (i) P(i, z, ψ) > P(i, −z, ψ), where −z denotes points in the opposite direction of the prior ψ.
  (ii) Minimize the distance between the output point and the original location, for maximum utility.
• Query 2:
  (a) i ≤ z ≤ L:
      (i) P(i, z, ψ) > P(i, z1, ψ) for z < z1
      (ii) Minimize ||L − i| − |L − z||
  (b) L ≤ z ≤ 2L:
      (i) P(i, z, ψ) > P(i, z1, ψ) for z > z1
      (ii) Minimize ||L − i| − |z − L||
  (c) −∞ ≤ z ≤ i:
      (i) P(i, z, ψ) > P(i, z1, ψ) for z > z1
      (ii) Minimize ||L − i| − |L − z||
  (d) 2L ≤ z ≤ ∞:
      (i) P(i, z, ψ) > P(i, z1, ψ) for z < z1
      (ii) Minimize ||L − i| − |z − L||

We solve these constraints in Appendices A and B.

(2) Query 3: In these queries, the app outputs the list of the PoIs, from the nearest to the farthest, such that the order of the output list is the same when a perturbed location input is provided as when the exact location is provided. For the sake of discussion, we describe a more realistic scenario to handle the case with multiple PoIs (query example: multiple Mexican restaurants in Los Angeles near me). For this situation, we need to redefine utility so that the output of the mechanism K aligns itself with the actual output of the LBS (the spatial order of the restaurants should not change). If L1, L2, and L3 are the 3 PoIs with L1 < L2 < L3 relative to the distance from i, and K(i) = z is such that L1 < L2 < L3 is still valid relative to z, then the value P(K(i) = z) should be maximized while also minimizing |z − i|. Here we define a tolerance limit m, which is the distance from the starting point such that if we output a point z within this space, the ordering of the output remains intact. Since the PoIs are at different distances in the two
directions (+x-axis, −x-axis), there will be two different tolerance limits: m+ for the +x direction and m− for the −x direction. The probability of outputting z should be maximal within this region, satisfying the Minimize |z − i| constraint, and after m+ and m− there should be a steep decline (not to zero), because a large deviation destroys the utility (making the probability 0 would decrease privacy). Based on the above, the utility constraints are as follows: (i) Minimize |z − i|; (ii) P(i, m− ≤ z ≤ m+, ψ) > P(i, z > m+ or z < m−, ψ). However, there can be multiple factors which govern the outcome for this case, for instance, that the output location should be in the direction where there are more PoIs, or the closeness of certain PoIs to the original location. Taking these factors into consideration, the complexity of this case increases and therefore we leave the representation and the solution of the constraints for query 3 as future work.
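Although the solution for query 3 is left as future work, the tolerance limits m+ and m− can be illustrated with a small brute-force sketch on a discretized 1-D axis; the PoI coordinates, step size and search bound below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def tolerance_limits(i, pois, step=0.01, search=10.0):
    # Order of PoIs by distance from the true location i.
    true_order = np.argsort([abs(p - i) for p in pois]).tolist()

    def preserved(z):
        return np.argsort([abs(p - z) for p in pois]).tolist() == true_order

    m_plus = m_minus = 0.0
    while m_plus < search and preserved(i + m_plus + step):
        m_plus += step
    while m_minus < search and preserved(i - (m_minus + step)):
        m_minus += step
    return m_minus, m_plus

m_minus, m_plus = tolerance_limits(i=0.0, pois=[1.0, 2.5, -4.0])
# Outputs z in [i - m_minus, i + m_plus] keep the nearest-to-farthest PoI order intact.
```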
5 Results and Discussion
In this section, we draw graphs of the probability distribution of the output points for query 1 and query 2, which can be used to add noise to the original location. We use ρ = ln 2, which implies that a user wants a ρ level of privacy within some distance r, where, as described earlier, ρ = ε·r. Based on the derivation in Appendices A and B, we have the maximum probability value p ≤ 0.48 for this case while using the approximation parameter α = 4. Using these values, we obtain Fig. 2 for query 1 and Fig. 3 for query 2. As predicted, the curve in Fig. 3 is symmetrical about the destination prior at point (L, 0). Analyzing the above results, the differential privacy constraints help to provide a privacy guarantee, while our assumption about the knowledge of the PoI distribution (provided to the LBS) has helped us to derive a noise distribution which is more realistic and utility maximizing. Evidently, if noise is added according to the curve derived above, the perturbed location is output near its original location with high probability and biased towards the destination. This is logically correct and is utility maximizing, as a user is more interested in getting a realistic estimate of the distance/time to the destination. Further, in order to tune the result according to the user preference, we can also introduce a parameter λ, which is the tolerance a user allows in the result (in terms of time/distance compared with the result obtained from the LBS without perturbation). In comparison, when Laplacian noise is used, outputting a location in the direction opposite to the destination is as likely as outputting one near the destination, which can destroy the utility by predicting a point which is twice as far from the destination. However, it is important to note that in this paper we have focused on a 1-D scenario and a single PoI, in order to avoid complexity and to start a discussion about having a utility based mechanism, so as to make it a win-win situation for the service providers as well as for users who care about their privacy. It also creates a single point of accountability, in this case the OS, which is responsible for performing the perturbation before handing the coordinates over to the LBS.
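The quoted bound can be checked numerically from the expression derived in Appendix A:

```python
import math

rho, alpha = math.log(2), 4
p_max = (1 - math.exp(-rho)) / (1 + math.exp(-(alpha + 1) * rho))
print(round(p_max, 3))  # 0.485, i.e. p <= 0.48 as stated above
```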
Fig. 2. Probability distribution of output points for query 1.
Fig. 3. Probability distribution of output points for query 2.
6 Future Scope
Differential privacy offers the mathematical guarantee that a user's data is safe from both classes of privacy breaches: evident as well as inferred. However, its applicability has been restricted by the complexity of deployment and the utility-privacy tradeoff. In this paper, we have proposed that an optimum solution to the data privacy problem is feasible if we design the linear program with utility and privacy constraints based on the kind of application for which the data is potentially utilized. More specifically, as future work, we plan to take into account multiple factors which can be used to define the output solution for multiple PoIs (query 3). Further, we want to extrapolate this work to 2 dimensions and then to 3 dimensions so that it is applicable to real location data.
7 Conclusion
In this paper, we have worked on improving the utility of a differentially private mechanism for geo-location data. We have used the notion of geo-indistinguishability to provide a differential privacy guarantee for geo-location data and, at the same time, we have used the prior information available to the OS about the PoIs in order to improve the utility of the mechanism. Through a mathematical formulation of the problem and by solving the linear system of constraints, we have derived the probability distribution of the output points, which can be used to add noise to the original input location accordingly. Our results make clear that Laplace is not the optimal choice for geo-location queries
conditioned with a prior, and with our mechanism we have strived for maximum utility for two queries. We have further discussed our future work, which lays out the trajectory of what we plan to do next in order to obtain an optimum solution for real-world geo-location data. To the best of our knowledge, this is the first paper which takes prior information about PoIs into consideration and maximizes the utility of the geo-location perturbation mechanism while providing a ρ level of privacy to the user.

Acknowledgement. We would like to thank Dr. Aleksandra Korolova for being the guiding light throughout the course of this paper.
Appendix A

The domain D and range R are the x-axis discretized with step δ. Let p be the maximum value, which should occur at the original location i = (0, 0). The probability values for output points z in (δ, ∞) are smaller than p but greater than those for points in (−δ, −∞). (The i\z table of the original layout, showing p at z = 0 with the values decreasing away from it, is summarized by this statement.)

Now using the privacy constraint

    Σ_{z=−∞}^{∞} P(i, z, ψ) = 1

we can write

    Σ_{z=−∞}^{−δ} P(i, z, ψ) + p + Σ_{z=δ}^{∞} P(i, z, ψ) = 1    (1)

or A + B + C = 1, with

    A = Σ_{z=−∞}^{−δ} P(i, z, ψ),  B = p,  C = Σ_{z=δ}^{∞} P(i, z, ψ).

For C, we can use the differential privacy constraint

    P(i, K(i) = z, ψ) / P(j, K(j) = z, ψ) ≤ e^{ρ},  |i − j| ≤ δ.

With i = (0, 0), P(0, 0, ψ) = p and j = (δ, 0), we can write P(δ, z, ψ) ≤ p·e^{−ρ}. For P(2δ, z, ψ) we have P(2δ, z, ψ) ≤ p·e^{−2ρ} and, in general, P(xδ, z, ψ) ≤ p·e^{−xρ}; therefore we can rewrite C in Eq. (1) as

    Σ_{x=δ}^{∞} p·e^{−xρ}.    (2)

For part A of Eq. (1), we have P(0, −δ, ψ) < P(0, δ, ψ) < p. With the utility constraint of minimizing |z − i|, along with the constraint of having a higher probability of outputting points in the direction of the prior, we can say that after some point αδ it would be better to output points near the original location i in the direction opposite to the prior, i.e., P(0, −δ, ψ) ≥ P(0, αδ, ψ) = p·e^{−αρ}. While maintaining the differential privacy constraint for the points −δ, −2δ, ..., we can write P(0, −δ, ψ) ≥ e^{−ρ}·P(0, −2δ, ψ), or p·e^{−αρ}·e^{ρ} ≥ P(0, −2δ, ψ) and, in general, e^{(x−α)ρ}·p ≥ P(0, −(x − 1)δ, ψ); therefore we can write A in Eq. (1) as

    Σ_{x=−∞}^{−δ} e^{(x−α)ρ}·p.    (3)

Combining (1), (2) and (3),

    Σ_{x=−∞}^{−δ} e^{(x−α)ρ}·p + p + Σ_{x=δ}^{∞} p·e^{−xρ} ≤ 1.

Solving this with δ = 1 we get

    p ≤ (1 − e^{−ρ}) / (1 + e^{−(α+1)ρ}).
Appendix B

For query 2, using the constraints we can write

    Σ_{z=−∞}^{∞} P(i, z, ψ) = 1

    Σ_{z=−∞}^{−δ} P(i, z, ψ) + p + Σ_{z=δ}^{L} P(i, z, ψ) + Σ_{z=L}^{2L} P(i, z, ψ) + p + Σ_{z=2L}^{∞} P(i, z, ψ) = 1    (1)

or A + B + C + D + E + F = 1. Since we are interested in the magnitude of the probability, for the sake of simplicity we can safely apply the same approximation before i and after 2L, and using the symmetry around L we can write

    p ≤ 1 / ( (e^{−αρ} + e^{−2αLρ}) / (1 − e^{−αρ}) + 2 + 2e^{−ρ}(1 − e^{−ρL}) / (1 − e^{−ρ}) )

or, approximately,

    p ≤ (1 − e^{−ρ}) / ( 2(1 + e^{−(α+1)ρ}) ),  when δ = 1.
References

1. Bindschaedler, V., Shokri, R.: Synthesizing plausible privacy-preserving location traces. IEEE, August 2016
2. Andrés, M.E., Bordenabe, N.E., Chatzikokolakis, K., Palamidessi, C.: Geo-indistinguishability: differential privacy for location-based systems. Springer, Switzerland (2015)
3. Andrés, M.E., Bordenabe, N.E., Chatzikokolakis, K., Palamidessi, C.: Optimal geo-indistinguishable mechanisms for location privacy. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security
4. http://www.independent.co.uk/news/world/africa/egyptian-police-grindrdating-app-arrest-lgbt-gay-anti-gay-lesbian-homophobia-a7211881.html
5. Polakis, I., Argyros, G., Petsios, T., Sivakorn, S., Keromytis, A.D.: Where's Wally? Precise user discovery attacks in location proximity services. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (2015)
6. Srivastava, V., Naik, V., Gupta, A.: Privacy breach of social relation from location based mobile applications. In: IEEE CS Home, pp. 324–328 (2014)
7. Liao, L., Fox, D., Kautz, H.: Extracting places and activities from GPS traces using hierarchical conditional random fields. Int. J. Robot. Res. Arch. 26(1), 119–134 (2007)
8. Brenner, H., Nissim, K.: Impossibility of differentially private universally optimal mechanisms. In: 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)
9. Nunez, M., Frignal, J.: Geo-location inference attacks: from modelling to privacy risk assessment. In: EDCC 2014, Proceedings of the 2014 Tenth European Dependable Computing Conference
10. Gruteser, M., Grunwald, D.: Anonymous usage of location-based services through spatial and temporal cloaking. In: MobiSys 2003, Proceedings of the 1st International Conference on Mobile Systems, Applications and Services
11. Kulik, L., Duckham, M.: A Formal Model of Obfuscation and Negotiation for Location Privacy. PERVASIVE. Springer-Verlag, Heidelberg (2005)
12. Ardagna, C.A., Cremonini, M., Damiani, E., Samarati, P.: Location privacy protection through obfuscation-based techniques. In: IFIP Annual Conference on Data and Applications Security and Privacy, DBSec 2007: Data and Applications Security
13. Chatzikokolakis, K., Elsalamouny, E., Palamidessi, C.: Practical Mechanisms for Location Privacy. Inria and LIX, École Polytechnique
14. ElSalamouny, E., Gambs, S.: Differential privacy models for location based services. Trans. Data Priv. 9, 15–48 (2016). INRIA, France
Qualitative Analysis for Platform Independent Forensics Process Model (PIFPM) for Smartphones

F. Chevonne Thomas Dancer
Department of Electrical Engineering, Computer Engineering, and Computer Science, Jackson State University, Jackson, MS 39217, USA
[email protected]
Abstract. This paper details how forensic examiners determine the mobile device process and whether the Platform Independent Forensics Process Model for Smartphones (PIFPM) helps them in achieving the goal of examining a smartphone. The researcher conducted interviews, presented the PIFPM process to the examiners, and supplied surveys to the examiners. Using convenience sampling, the frequency and percent distribution for each examiner is given, as well as the strengths and weaknesses of PIFPM as they relate to the examiner. Based on the hypotheses given by the researcher, the results were either refuted or supported through sampling from the forensic examiners. The goal of this paper is to uncover interesting details that the researcher overlooked when examining a smartphone.

Keywords: Platform Independent Forensics Process Model (PIFPM) · Digital forensics · Interviews · Mobile device forensics
1 Introduction

The Platform Independent Forensics Process Model (PIFPM) for Smartphones introduces a novel approach to mobile device forensics. The author presents a way of examining smartphones regardless of make, model, or device, as seen in Fig. 1. PIFPM is explained in more detail in [1]. Smartphone devices were used to analyze data in the Primary Stage of the Analysis Phase, which averages the percent of change by category in Experiment 1. The first experiment involves securing the files generated by XRY 6.1 and capturing the size of each at the byte level. The files were compared to files in 40 separate tests within their particular smartphone category concerning the size, carrier, OS, and device. Doing so enabled the author to compute the differences in size by test as well as by category. This affords us the knowledge of which categories exhibit the least and the most change in file size. When dealing with changes in file size, the result can take one of three options: the size will either increase, decrease, or have no change. Given these options, the researcher was able to provide projections of how each XRY file would be affected by each test [2–4]. To assist with folders that change in content, the researcher designed a lookup table with unique IDs that gives the status from test state 1 to test state 2, as shown in Table 1 [2–4].
Fig. 1. Platform independent process model for smartphones (PIFPM).
For example, B-OC is the unique ID for the test in the Browser category in which the researcher opens a browser window on the HTC Touch Pro 6850, saves the folder size of the smartphone, then closes the browser window and saves the folder size again. Table 2 shows an example of the file size changes for six devices in Experiment 1. In the second experiment, the XRY files from the first experiment were exported to the hard drive as a hierarchical folder containing all the files and folders extracted from each device. The number of files within the folder structure that differed from one state to the next was compared by inputting the two folders into the SourceForge DiffMerge version 3.3.2 software. DiffMerge returned the number of identical files, "Iden," the number of different files, "Diff," the number of files without peers, "W/P," and the number of folders, "# Folds". The percent difference, "% Δ," in the number of files where the content changed was computed by adding the number of different files and files without peers, dividing by the total number of files within the folder structure, and expressing the result as a percentage. "Num of Diff" is the number of differences from test state 1 to state 2, and "Cat. % Δ" is the corresponding percentage by category [2–4]. An example for the Apple iPhone 3G A1241 is given in Table 3 [2]. From these experiments, each test within each category was ranked from the least to the greatest amount of change with respect to the percentage of change reported. It compared the files within their smartphone category concerning the size, carrier, and platform. Those same smartphones were used to generate the average change in file content by device while applying XRY v6.1 and DiffMerge 3.3.2 in Experiment 2. The manual analyses of smartphones were obtained by comparing Experiment 1 with Experiment 2 [1–6].
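The two measures described above can be sketched as follows; the I/D/NC labels for increase, decrease and no change are our shorthand for the three outcomes named in the introduction, the byte counts in the size example are placeholders, and the percent-difference example reuses the counts of the J-JBDM row of Table 3 and reproduces its reported 27.763%.

```python
def size_change(bytes_state1, bytes_state2):
    # Experiment 1: classify the byte-level size change between two test states.
    if bytes_state2 > bytes_state1:
        return "I"   # increase
    if bytes_state2 < bytes_state1:
        return "D"   # decrease
    return "NC"      # no change

def percent_diff(identical, different, without_peers):
    # Experiment 2: percent of files whose content changed, from DiffMerge-style counts.
    total = identical + different + without_peers
    return 100.0 * (different + without_peers) / total

print(size_change(1_204_224, 1_212_416))              # "I" (placeholder byte counts)
print(round(percent_diff(54784, 6410, 14645), 3))     # 27.763, the J-JBDM value in Table 3
```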
Table 1. Unique ID lookup table

Unique ID  Test state 1 to Test state 2
Browser
  B-IO     Initial to Open Browser Window
  B-OG     Open Browser Window to Google Search
  B-GC     Google Search to Close Browser Window
  B-OC     Open Browser Window to Close Browser Window
  B-GD     Google Search to Delete History and Bookmarks
  B-CD     Close Browser Window to Delete History
Contact
  C-IN     Initial to New Contact
  C-NA     New Contact to Altered Contact
  C-AD     Altered Contact to Deleted Contact
MMS
  M-IR     Initial to Received MMS message
  M-IS     Initial to Sent MMS message
  M-RO     Received MMS message to Opened MMS message
  M-RD     Received MMS message to Deleted MMS message
  M-SD     Sent MMS message to Deleted MMS message
Pic
  P-IN     Initial to New Picture
  P-ND     New Picture to Deleted Picture
SMS
  S-IR     Initial to Received SMS message
  S-IS     Initial to Sent SMS message
  S-RO     Received SMS message to Opened SMS message
  S-OD     Received SMS message to Deleted SMS message
  S-SD     Sent SMS message to Deleted SMS message
Call
  V-IP     Initial to Placed Call
  V-IRA    Initial to Received Answered Call
  V-IRU    Initial to Received Unanswered Call
  V-IDC    Initial to Deleted Call log
  V-PDC    Placed Call to Deleted Call log
  V-RUDM   Received Unanswered Call to Deleted Missed Call
Miscellaneous
  A-ISA    Initial to Stop All Apps (TouchPro 6850 only)
  J-IJB    Initial to Jailbreak (iPhone only)
  J-JBDM   Jailbreak to Delete SMS (iPhone only)
  L-IL     Initial to Passcode Enabled (iPhone only)
  L-LnS    Passcode Enabled to no SIM (iPhone only)
  Vmail-IR Initial to Received Voicemail (iPhone only)
  Vmail-RL Received Voicemail to Listened to Voicemail (iPhone only)
  Vmail-LD Listened to Voicemail to Deleted Voicemail (iPhone only)
This research involves collecting data from two case studies to determine the feasibility and usefulness of PIFPM. After the data is gathered in the qualitative design, the hypotheses (1–5) are either negated or supported depending on the answers from the forensic examiners. Those answers are shown in the frequency/percent distribution by group in Tables 4, 5, 6, 7, 8, 9, 10 and 11.
Table 2. Projected results vs. actual results

Test ID  Projected  Apple      HTC Touch  HTC   RIM      RIM     Nokia
         result     iPhone 3G  Pro 6850   Aria  BB 8530  BB8703  5230
B-IO     I          D          I          NC    I        N/A     N/A
B-OG     I          D          I          I     NC       N/A     N/A
B-GC     D          I          I          D     NC       N/A     N/A
B-OC     U          I          I          NC    NC       N/A     N/A
B-GD     D          I          I          D     NC       N/A     N/A
B-CD     D          I          D          D     NC       N/A     N/A
C-IN     I          I          D          I     I        I       I
C-NA     U          I          NC         D     I        I       D
C-AD     D          I          D          D     D        D       I
M-IR     I          I          NA         NA    N/A      N/A     N/A
M-IS     I          D          I          NA    I        N/A     D
M-RO     U          I          NA         NA    N/A      N/A     N/A
M-RD     D          I          NA         NA    N/A      N/A     N/A
M-SD     D          I          NA         NA    D        N/A     N/A
P-IN     I          I          I          NA    NC       N/A     I
P-ND     D          I          I          NA    NC       N/A     D
S-IR     I          I          NA         NA    D        NC      N/A
S-IS     I          I          I          I     I        I       D
S-RO     U          I          NA         NA    NC       NC      N/A
S-OD     D          I          NA         NA    NC       NC      N/A
S-SD     D          I          I          D     D        D       N/A
V-IP     I          I          I          I     I        I       N/A
V-IRA    I          D          NA         NA    N/A      N/A     N/A
V-IRU    I          I          NA         NA    N/A      N/A     N/A
V-IDC    D          I          D          NC    D        D       N/A
V-PDC    D          I          D          D     D        D       N/A
V-RUDM   D          I          NA         NA    N/A      N/A     N/A
A-ISA    D          NA         I          NA    N/A      N/A     N/A
J-IJB    I          I          NA         NA    N/A      NC      N/A
J-JBDM   D          D          NA         NA    N/A      NC      N/A
L-IL     U          D          NA         NA    NC       N/A     N/A
E-IE     I          N/A        N/A        N/A   N/A      NC      N/A
E-ELAN   U          N/A        N/A        N/A   N/A      NC      N/A
N-IDN    D          N/A        N/A        N/A   NC       N/A     N/A
W-ILAN   I          N/A        N/A        N/A   I        NC      N/A
Table 3. Apple iPhone: % change in folder content by device and category

Test ID   Iden   Diff   W/P    # Folds  % Δ      Cat. % Δ
J-IJB     1      9      71023  4430     99.999%  62.7%
J-JBDM    54784  6410   14645  4355     27.763%
M-IS      55041  16484  19345  4743     39.429%  34.0%
M-SR      64563  7626   17901  4833     28.335%
M-RO      64394  8101   17354  4800     28.331%
M-OD      54869  16644  19404  4774     39.649%
S-IS      66276  7096   15628  4731     25.533%  25.3%
S-SR      65933  7407   15714  4728     25.963%
S-RO      66815  6679   15464  4769     24.892%
S-OD      66750  6802   15431  4743     24.986%
Vmail-IR  55774  7520   14409  2590     28.222%  27.9%
Vmail-RL  56215  7557   13601  2550     27.345%
Vmail-LD  55599  8125   13581  2611     28.078%
V-RUDM    55796  7770   13636  2648     27.727%  28.2%
V-DMR     55745  7630   14054  2648     28.005%
V-IP      56133  7553   13620  2587     27.389%
V-PDC     56496  7168   13596  2637     26.875%
P-IN      56298  7332   13772  2644     27.265%  27.3%
P-ND      56257  7539   13609  2614     27.321%
B-DBO     39193  23142  16021  2686     49.981%  46.6%
B-OG      37100  24862  17000  2656     53.015%
B-GC      53427  9142   16024  2577     32.021%
B-CD      38285  24222  16124  2679     51.311%
C-IN      38529  24032  16044  2679     50.984%  61.1%
C-NA      53619  9008   15976  2679     31.785%
C-AD      1      61678  18030  2680     99.999%
L-IL      1      61942  17607  2637     99.999%  75.0%
L-LnS     39531  23567  15284  2637     49.566%
Table 4. Question 2 frequency/percent distribution by group

Q2. How difficult is PIFPM to understand?
Option                  SE      ME
a. Not difficult        1/100%  1/50%
b. Slightly difficult   0/0%    0/0%
c. Somewhat difficult   0/0%    1/50%
d. Very difficult       0/0%    0/0%
e. Extremely difficult  0/0%    0/0%
Table 5. Question 3 frequency/percent distribution by group

Q3. Rate how feasible PIFPM would be in its application to the forensic processing of smartphones?
Option                  SE      ME
a. Not at all feasible  0/0%    0/0%
b. Slightly feasible    0/0%    0/0%
c. Somewhat feasible    0/0%    1/50%
d. Very feasible        0/0%    0/0%
e. Extremely feasible   1/100%  1/50%
Table 6. Question 4 frequency/percent distribution by group

Q4. How likely would you be to incorporate PIFPM into your forensic examination process?
Option                SE      ME
a. Not likely         0/0%    0/0%
b. Slightly likely    0/0%    0/0%
c. Somewhat likely    0/0%    1/50%
d. Very likely        0/0%    1/50%
e. Extremely likely   1/100%  0/0%
Table 7. Question 5 frequency/percent distribution by group

Q5. Of the phases listed below, which one(s) do not fit the logical progression of a forensic examination?
Option               SE      ME
a. Transportation    0/0%    0/0%
b. Classification    0/0%    0/0%
c. Analysis          0/0%    0/0%
d. Interpretation    0/0%    0/0%
e. All seem logical  1/100%  2/100%
Table 8. Question 6 frequency/percent distribution by group

Q6. How useful is PIFPM in a smartphone examination?
Option                SE      ME
a. Not useful at all  0/0%    0/0%
b. Slightly useful    0/0%    0/0%
c. Somewhat useful    0/0%    1/50%
d. Very useful        0/0%    1/50%
e. Extremely useful   1/100%  0/0%
Table 9. Question 8 frequency/percent distribution by group

Q8. Is it logical for smartphones to use the same forensic process model as computers?
Option                SE      ME
a. Not logical        0/0%    0/0%
b. Slightly logical   0/0%    0/0%
c. Somewhat logical   1/100%  1/50%
d. Very logical       0/0%    1/50%
e. Extremely logical  0/0%    0/0%
Table 10. Question 9 frequency/percent distribution by group

Q9. How often do you manipulate the process you frequently use to examine smartphones, whether intentionally or unintentionally?
Option              SE      ME
a. Not often        0/0%    1/50%
b. Slightly often   0/0%    1/50%
c. Somewhat often   1/100%  0/0%
d. Very often       0/0%    0/0%
e. Extremely often  0/0%    0/0%
Table 11. Question 14 frequency/percent distribution by group

Q14. Do you believe that incorporating PIFPM into phone examinations will change the confidence level of the investigator?
Option                                               SE      ME
a. Yes, it will lower the confidence level greatly   0/0%    0/0%
b. Yes, it will lower the confidence level slightly  0/0%    0/0%
c. No, the confidence level will not change          0/0%    0/0%
d. Yes, it will elevate the confidence level slightly 0/0%   2/100%
e. Yes, it will elevate the confidence level greatly 1/100%  0/0%
2 Qualitative Design Study

The observable population consists of three professional forensic examiners with varying years of experience exploring many different devices, including smartphones. Eight forensic examiners were sought for this study, and four agreed to interviews. Only three forensic examiners were actually interviewed, because one examiner was in court at that time. The researcher traveled to each participant in his/her respective location. The participants were interviewed about their current process when examining mobile devices as well as their usage of any equipment. Then, the participants examined the proposed model while a presentation was given about PIFPM. After the presentation was completed, the participants were allowed to ask any questions they had about the model. A follow-up survey was then given that captured qualitative data regarding the usefulness and feasibility of PIFPM. Each participant was interviewed separately to maintain an unbiased environment. Each person was asked the same four questions for the sake of uniformity, but each examiner was also asked one or more follow-up questions. The answers to the interview questions allowed the researcher to discover a theme that could be verified through interviews with a larger population set. Examiners ME-A and ME-B, from the same organization, follow almost the same process from beginning to end. They also used the same tool, almost never deviating. On the other hand, Examiner SE-A uses a more ad-hoc process in which he adapts to his environment depending on the type of OS being dealt with. ME-A and SE-A were both asked the same follow-up question after the researcher inquired about their specific process, which was, "What happens if [your process] does not work?" ME-A said that they return the phone to its owner without trying any other tool besides Cellebrite. When asked about XRY in particular, he said that anything XRY can read, Cellebrite can read, and if Cellebrite cannot read the device, XRY cannot read the device either. On the other hand, SE-A said that they go on to the other tools in their arsenal to see if any of those can extract the data. If none of the other tools comply, the examiner returns the phone to the user. He also added that if the client still wants the information to be extracted without the use of tools, they usually return the phone and instruct the client to look for the information manually. While mapping the interviewees to their particular responses, it was discovered that each examiner had at some point manually examined a device. In every case, with each examiner, the process used in these instances was the same: they take photographs of every action taken by the examiner on the device. SE-A was then asked another follow-up question concerning whether or not he has ever examined a device manually for a reason other than for it to be used in a court of law. His answer was, "Sure." The next question stemmed purely from curiosity. The researcher asked them whether or not they had ever examined two phones of the same make/model and compared them to see what effect their actions had on the OS. The answer from each examiner was that they had not done so, either because they had not had the opportunity or because they never had a reason to. Next, the examiners were asked whether or not there was a particular model of smartphone that they feel more confident in examining over others. SE-A and ME-A
both said no, but ME-B said that he likes examining anything but a Samsung Galaxy or an iPhone. When asked why, the examiner mentioned that no tool in his organization could break into those phones if they were passcode protected. The only thing they would be able to do is extract the SIM card and get whatever information is available there, or ask a federal agency for the tool that can break into the phones.
3 Qualitative Analysis Results

All participants were males, with two having 3 to 4 years of experience and the other having 2 to 3 years. Given this information, the researcher created two categories based on experience, since some of the research questions deal with that in particular. The categories are More Experience (ME) and Some Experience (SE). Using this information, eight frequency/percent tables were created outlining each question that deals with the hypotheses, as well as a Rankings and Medians table. A discussion of the responses to the questions found on the post-survey follows. The hypotheses are: (1) How useful is PIFPM in a smartphone examination? (2) Is it feasible to include PIFPM in the current process for examining smartphones? (3) Does PIFPM offer anything to a smartphone investigation that other models do not? (4) Is it logical to suggest that every category of a technological device should assume a unique forensic process model? (5) Do examiners, whether intentionally or not, manually manipulate current process models to suit specific model smartphones? In this study, the sampling method used was convenience sampling. In using this method, there is a possibility of bias, but this method was selected due to ease of collection and the nature of the careers of the participants. This resulted in a sample size insufficient to support this work with great confidence. In determining the confidence interval of the survey data given here, it can be stated with 95% confidence that if the same population is sampled on numerous occasions and interval estimates are made on each occasion, the resulting intervals would bracket the true population value in approximately 56.58% of the cases [7]. Tables 4, 5, 6, 7, 8, 9, 10 and 11 are reported based on this data. Given this, the margin of error is well beyond what is acceptable to the researchers. To alleviate this, the study will have to be repeated to obtain a sample size of at least 24. Then the researcher will be able to state that the margin of error is 20% and that the answers will represent those reported 95% of the time. To absolve all doubt, as part of future work, the researchers plan to survey a total of 384 forensic examiners to obtain a confidence interval of 5% [7]. As far as results, some questions are discussed at length and some are not. Question 1 asked the forensic examiners about their years of experience; question 7 is a discussion question and will be discussed later in this paper; question 10 states, "Have you ever manually examined a device with no external equipment?", and the answer from all was yes; question 11 was a follow-up question if they answered yes to question 10; and questions 12 and 13 are discussion questions and will be discussed later.
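The sample-size figures quoted above are consistent with the standard formula n = z²·p(1−p)/e² for estimating a proportion, which appears to be the basis of the figures attributed to [7]; the short check below assumes a 95% confidence level (z = 1.96) and the most conservative proportion p = 0.5.

```python
def required_sample(z=1.96, p=0.5, margin=0.05):
    # Standard sample-size formula for a proportion at a given margin of error.
    return (z ** 2) * p * (1 - p) / margin ** 2

print(round(required_sample(margin=0.20)))  # 24  -> roughly a 20% margin of error
print(round(required_sample(margin=0.05)))  # 384 -> roughly a 5% margin of error
```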
Question 2 asked the participants how difficult PIFPM was to understand. The response frequencies and percentages are broken down by group and mapped to each response given on the survey, as seen in Table 4. The SE Group and 50% of the ME Group feel that PIFPM is not at all difficult to understand, and the other half of the ME Group feel that it is somewhat difficult to understand.

Question 3 asked the participants to rate how feasible PIFPM would be in its application to the forensic processing of smartphones. Table 5 shows that the SE Group and 50% of the ME Group feel that it is extremely feasible. The remaining 50% of the ME Group feel that PIFPM is somewhat feasible.

Question 4 asked each participant how likely he would be to incorporate PIFPM into his forensic examination process, and Table 6 shows the frequency and percentage of the responses from each group. The SE Group reported that it would be extremely likely to incorporate PIFPM into its forensic process. The ME Group is split: half of the group reported that it would be very likely to incorporate the model, whereas the other half reported that it would be somewhat likely to use PIFPM in its examination process.

Question 5 asked the examiners which phases do not fit the logical progression of a forensic examination out of the following: Transportation, Classification, Analysis, and Interpretation. If they felt that all of the phases are logical, they had the opportunity to circle that choice as well. 100% of both groups feel that all of these phases seem logical, as shown in Table 7.

Question 6, as seen in Table 8, asked each participant how useful PIFPM would be in a smartphone examination. The SE Group feels that PIFPM would be extremely useful. The ME Group is split: 50% of the group feels that PIFPM would be very useful, whereas the other half thinks that the model would be somewhat useful.

Table 9 shows the frequency and percentage of the responses given for Question 8. This question asked the participants whether it is logical for smartphones to use the same forensic process model as computers. The SE Group and half of the ME Group feel that it is somewhat logical to use the same forensic process model as computers. The remainder of the group thinks that it is very logical.

Question 9 asked each participant how often he manipulates the process when he examines a smartphone. Table 10 shows that the SE Group changes the process somewhat often. Half of the ME Group reported that its process does not often change when examining smartphones, and the remainder of the group reported that change occurs slightly often.

Table 11 reports the frequency and percentage of the responses for Question 14 on the survey. Each examiner was asked whether he believes that incorporating PIFPM into smartphone examinations would change the confidence level of the investigator. The SE Group feels that using PIFPM would elevate the confidence level of the investigator greatly, and the ME Group feels that using the model would elevate the confidence level of the investigator slightly.

The survey also contained two questions that asked each examiner to list any strengths and weaknesses they could discern from evaluating the model during the presentation. Table 12 reports the number of weaknesses and strengths outlined by the examiners. The SE Group reported one weakness and one strength. The ME Group reported one weakness and three strengths.
Table 12. Number of reported PIFPM weaknesses vs. strengths, group distribution frequency

            SE  ME
Strengths   1   3
Weaknesses  1   1
The first discussion question asked the examiners what strengths PIFPM offers to the examination of a smartphone. One examiner reported that it offers good guidelines on the next step to take in most situations. Another examiner reported that it gives them an orderly process to follow and also ensures that the same process is followed each time. The last examiner reported that the model offers them diversity. The second discussion question asked the examiners what weaknesses PIFPM presents to a forensic examiner in a smartphone investigation. One examiner reported that it would need to adapt as smartphone OSs change. Another examiner reported that, given the amount and frequency of updates on phones, inconsistency would be an issue. The last examiner had no weaknesses to report. Table 13 lists the three discussion questions asked of the examiners as shown on the survey; two of the discussion questions were already discussed above. The final discussion question asked each participant whether PIFPM offered anything to an examination that other models do not. One examiner had no response because he said that he could not answer it. Another examiner reported that he had no model for comparison, and the last examiner reported, "Not that I am aware of."

Table 13. Post survey discussion questions

Q7   Does PIFPM offer anything to an examination that other models do not?
Q12  What strengths does PIFPM offer to a forensic examiner in a smartphone investigation?
Q13  What weaknesses does PIFPM offer to a forensic examiner in a smartphone investigation?
Table 14 contains the responses for each question that relates to our hypotheses and ranks the answers from 1 to 5 using a mapping created from the available responses labeled a to e in Tables 4, 5, 6, 7, 8, 9, 10 and 11. The median values are the values used to either support or refute our hypotheses. Table 15 shows a mapping of the research questions to the premises and the survey questions.

Table 14. Post survey response rankings and medians

          Q2  Q3  Q4  Q5  Q6  Q8  Q9  Q14
A          3   5   4   5   3   4   1    4
B          1   3   3   5   4   3   2    4
C          1   5   5   5   5   3   3    5
Median     1   5   4   5   4   3   2    4
Table 15. Research questions, hypotheses, and survey questions mapping (each hypothesis is followed by its outcome, Y, N, or N/A, and the post survey question it maps to)

R1. How useful is PIFPM in a smartphone examination?
  H1a. Examiners with less experience will find PIFPM to be at least somewhat useful (Y; Q6)
  H1b. Examiners with more experience will find PIFPM to be at least slightly useful (N; Q6)
  H1c. Examiners with less experience will be more likely to incorporate PIFPM into their forensic examination process than examiners with more experience (Y; Q4)
  H1d. Examiners with more experience will be less likely to incorporate PIFPM into their forensic examination process than examiners with less experience (Y; Q4)

R2. Is it feasible to include PIFPM in the current process for examining smartphones?
  H2a. Most examiners will find PIFPM to be at least somewhat feasible (Y; Q3)
  H2b. Most examiners will find that all the proposed phases fit the logical progression of a smartphone forensic examination (Y; Q5)
  H2c. Examiners, regardless of experience, will find that PIFPM is not difficult (Y; Q2)

R3. Does PIFPM offer anything to a smartphone investigation that other models do not?
  H3a. Examiners with less experience will find that PIFPM has more strengths than weaknesses (N/A; Q12, Q13, Q7)
  H3b. Examiners with more experience will find that PIFPM has more weaknesses than strengths (N/A; Q12, Q13, Q7)

R4. Is it logical to suggest that every category of a technological device should assume a unique forensic process model?
  H4. Examiners, regardless of experience, will not find that it is very logical to use the same process model to examine smartphones and computers (Y; Q8)

R5. Do examiners, whether intentional or not, manually manipulate current process models to suit specific smartphones?
  H5a. Examiners with less experience do not manipulate current process models often (N; Q9)
  H5b. Examiners with more experience do manipulate current process models often (N; Q9)
To support or refute Research Questions 1 to 5, the researcher has to refer back to the frequency and percentage tables. Research Question 1 (R1) maps to Hypothesis 1a (H1a), Hypothesis 1b (H1b), Hypothesis 1c (H1c), and Hypothesis 1d (H1d). H1a states that “Examiners with less experience will find PIFPM to be at least somewhat useful.” Table 8 shows that the SE Group reported finding PIFPM very useful. Since the SE Group is the group with less experience than the ME Group, H1a is supported by the qualitative data. H1b states that “Examiners with more experience will find PIFPM to be at least slightly useful.” Table 14 shows that the median response maps between “Somewhat useful” and “Very useful.” The researcher believed that a more experienced examiner might not be as open to change as a less experienced examiner, but this was not the case in this instance. As a result, H1b is not supported by the qualitative data. H1c states that “Examiners with less experience will be more likely to incorporate PIFPM into their forensic examination process.” H1d states that “Examiners with more experience will be less likely to incorporate PIFPM into their forensic examination process.” Table 6 shows that the SE Group reported that it is extremely likely that they would incorporate PIFPM into their examination, whereas the ME Group reported that their likelihood of incorporating PIFPM into their examination would be the median of “Very likely” and “Somewhat likely.” Given our mapping scale, the data show that the group with the least experience would be more likely to incorporate the model into the daily examination than the group with the most experience. As a result, both H1c and H1d are supported by the qualitative data. Given that three of the four hypotheses derived from Research Question 1 are supported by the qualitative data, that Table 14 reports the median response for the usefulness of PIFPM as “very useful”, and that the likelihood of the examiner incorporating the model into the daily routine is “very likely”, it is reasonable to believe that PIFPM would be at least somewhat useful in a smartphone examination.
Research Question 2 (R2) maps to Hypothesis 2a (H2a), Hypothesis 2b (H2b), and Hypothesis 2c (H2c). H2a states that “Most examiners will find PIFPM to be at least somewhat feasible.” Table 14 shows that the median answer for survey Q3 is “Extremely feasible.” As a result, the qualitative data support H2a. H2b states that “Most examiners will find that all the proposed phases fit the logical progression of a smartphone forensic examination.” Table 14 shows that the median answer for survey Q5 is “All seem logical.” As a result, the qualitative data support H2b. H2c states that “Examiners, regardless of experience, will find that PIFPM is not difficult.” Table 14 shows that the median answer for Q2 is “Not difficult.” As a result, the qualitative data support H2c. Given that all the hypotheses derived for Research Question 2 are supported by the qualitative data, it is reasonable to believe that it is feasible to include PIFPM in the current process to examine smartphones.
Research Question 3 (R3) was answered by using the frequencies reported in Table 12. R3 maps to Hypothesis 3a (H3a) and Hypothesis 3b (H3b).
H3a states that “Examiners with less experience will find that PIFPM has more strengths than weaknesses.” H3b states that “Examiners with more experience will find that PIFPM has more weaknesses than strengths.” Table 12 shows that the SE Group reported the same number of weaknesses and strengths, and the ME Group reported more strengths than weaknesses. Given this, the qualitative data refute both H3a and H3b.
This question was also asked of the participants verbatim in Question 7 on the survey. As mentioned previously, the participants had no answer for this question, for various reasons. Therefore, based on the qualitative data in this study, the researcher is not able to answer R3, which asks whether PIFPM offers anything to a smartphone investigation that other models do not.
Research Question 4 (R4) maps to Hypothesis 4 (H4). H4 states that “Examiners, regardless of experience, will not find that it is very logical to use the same process model to examine smartphones and computers.” Table 14 shows that the median answer for survey Q8 is “Somewhat logical.” As a result, the qualitative data support H4. Given that the hypothesis derived for Research Question 4 is supported by the qualitative data, it is reasonable to suggest that every category of technological device should assume a unique forensic process model.
Research Question 5 (R5) maps to Hypothesis 5a (H5a) and Hypothesis 5b (H5b). H5a states that “Examiners with less experience do not manipulate current process models often” and H5b states that “Examiners with more experience do manipulate current process models often.” Table 10 shows that the SE Group reported that it manipulates its process somewhat often, whereas the ME Group reported that its frequency of manipulation would be the median of “Not often” and “Slightly often.” As a result, neither H5a nor H5b is supported by the qualitative data. In deriving these hypotheses, the researcher believed that less experienced examiners would be less likely to change their routine and deviate from the norm. It was also the belief of the researcher that more experienced examiners would be more likely to change their process, mainly due to lessons learned. Even though the hypotheses are not supported by the data, Table 14 shows that the median response for all participants is that they manipulate their process slightly often, which answers R5.
Although the results given in the surveys are not statistically significant, several lessons can be taken away from the qualitative portion of the study, based on whether or not the examiners would actually apply PIFPM, the instances in which they would or would not use the model, what they would change about PIFPM, and their overall opinion of the model. The author asked the participants, after they had experienced the model and its uses, if and how they would incorporate PIFPM into their examinations, and the response was unanimously positive. No participant reported that they would decline to incorporate it into their work. For example, Participant A reported that he would be very open to incorporating it into his normal process because the model is not difficult to understand and it seems logical. He would first test the model by using it after his normal process, comparing the two procedures several times. If he felt comfortable with the process and results, he would then begin to incorporate it into his normal processes. Participant B also feels that the model is not difficult to understand, and he would feel more comfortable incorporating PIFPM if a workshop were conducted to direct examiners on how to approach each phase and sub-phase in the model. When asked whether there was any instance in which they would not feel comfortable incorporating PIFPM, Participant C stated that because he does not feel comfortable examining Android and Apple mobile devices, he would more than likely
not use the model on these devices. Participant B felt that he may not feel comfortable testifying in a court of law based on this model without some experience. The participants were asked what aspects of PIFPM they would change, given that they are practicing examiners. Participant A would change the order of the manual examination: given that the browser of most smartphones reloads all the windows last used, he would make the browser the last category viewed on an Android device. After further thought, he also decided that this should probably be the case for smartphones running any OS. Participant C also mentioned that the browser information for the Android and Apple mobile devices should be listed last. Other than that, he said that he would not change anything from this initial introduction. Participant B felt that he could not decide what he would change in theory, but that after he has been able to apply the practices of the model, he could give a more accurate response to this question. The researcher inquired how the participants felt about the model overall. Participant A felt that the model was “cool” and that it would be great because there would be something out there to follow. Participant B did not have any negative feedback on the model itself. He questioned the use of the word ‘forensics’ when referring to the examination of a smartphone, since smartphone examinations always change the state of the device and forensic examinations are not supposed to make changes. This is true in general, but no method is guaranteed to preserve the state of a smartphone or any cell phone during an examination. This is accepted practice and can be explained in court. Participant C felt that the model seemed to be an overall logical one and that he would have more of an opinion after being able to apply the model.
4 Conclusions and Future Work
The difficulty for the examiner lies in the lack of a methodology for smartphones. Neither ad-hoc methods nor methods for computer examination are well suited for the examination of a smartphone due to their distinct issues [8–10]. These methods do not take into consideration the uniqueness of smartphones and therefore could lead to a loss or non-discovery of information with evidentiary value. In this study, PIFPM was presented to forensic examiners, who then answered survey questions designed to address the hypotheses, with a focus on feasibility. The frequency/percentage distributions (Tables 4, 5, 6, 7, 8, 9, 10 and 11) were split into groups based on experience. Based on the survey response medians, the researcher could tell whether the hypotheses were negated, supported, or not applicable. For future work, with responses from 384 forensic examiners, the researcher would be able to state within a 5% confidence interval whether examiners find that PIFPM is suitable for examining devices in mobile device forensics. The researcher will also continue evaluating PIFPM for smartphones using ad-hoc and non-ad-hoc methods to further support or negate the hypotheses with confidence. With enough participants, the researcher can state whether PIFPM is easy to use yet reliable when it comes to examining smartphones regardless of make, model, or OS.
Acknowledgments. I would like to thank Dr. David A. Dampier, Professor and Interim Chair of Information Systems and Cyber Security at the University of Texas at San Antonio, Texas, the
forensic examiners at the National Center for Forensics at Mississippi State University, Mississippi and the forensic examiners at the Attorney General’s Office in Jackson, Mississippi.
References
1. Chevonne Thomas Dancer, F., Dampier, D.A., Jackson, J.M., Meghanathan, N.V.: A theoretical process model for smartphones. In: Proceedings of 2012 Second International Conference on Artificial Intelligence, Soft Computing and Applications, Chennai, India, 13–15 July 2012
2. Chevonne Thomas Dancer, F.: Analyzing and comparing Android HTC Aria, Apple iPhone 3G, and Windows Mobile HTC TouchPro 6850. In: The 2016 IEEE International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, USA, 15–17 December 2016
3. Chevonne Thomas Dancer, F.: Manual analysis phase for (PIFPM): Platform Independent Forensics Process Model for smartphones. Int. J. Cyber Secur. Digit. Forensics 6(3), 101–108 (2017)
4. Dancer, F.C.T., Skelton, G.W.: To change or not to change: that is the question. In: 2013 IEEE International Conference on Technologies for Homeland Security (HST), Waltham, MA, pp. 212–216 (2013)
5. Chevonne Thomas Dancer, F., Dampier, D.A.: Refining the digital device hierarchy. J. Acad. Sci. 55(4), 8 (2010)
6. Chevonne Thomas Dancer, F., Dampier, D.A.: A platform independent process model for smartphones based on invariants. In: SADFE 2010: IEEE International Workshop on Systematic Approaches to Digital Forensic Engineering, pp. 56–60. https://doi.org/10.1109/SADFE.2010.15
7. Graziano, A.: Research Methods: A Process of Inquiry. Pearson, London (1993)
8. Patil, D.N., Meshram, B.B.: Digital forensic analysis of Ubuntu file system. Int. J. Cyber Secur. Digit. Forensics 4(5), 175–186 (2016)
9. Jansen, W., Delaitre, A., Moenner, L.: Overcoming impediments to cell phone forensics (2008)
10. Punja, S.G., Mislan, R.P.: Mobile device analysis. Small Scale Digit. Forensics J. 2(1), 1–16 (2008)
Privacy Preserving Computation in Home Loans Using the FRESCO Framework
Fook Mun Chan1(B), Quanqing Xu1, Hao Jian Seah2, Sye Loong Keoh2, Zhaohui Tang3, and Khin Mi Mi Aung1
1 Data Storage Institute, A*STAR, Singapore, Singapore
{chanfm,Xu Quanqing,Mi Mi AUNG}@dsi.a-star.edu.sg
2 University of Glasgow, Glasgow, UK
[email protected], [email protected]
3 Singapore Institute of Technology, Singapore, Singapore
[email protected]
Abstract. Secure Multiparty Computation (SMC) is a subfield of cryptography that allows multiple parties to compute jointly on a function without revealing their inputs to others. The technology is able to solve potential privacy issues that arise when a trusted third party, such as a server, is involved. This paper aims to evaluate implementations of Secure Multiparty Computation and its viability for practical use. The paper also seeks to understand and state the challenges and concepts of Secure Multiparty Computation through the construction of a home loan calculation application. Encryption over Multi Party Computation (MPC) is done within 2 to 2.5 s. Up to 10K addition operations, the MPC system performs very well, and 10K additions are sufficient for most applications.
Keywords: Privacy · Secure multiparty computation · FRamework for Efficient Secure COmputation (FRESCO)
1 Introduction
Traditional methods of aggregating data for computing on a function rely on a trusted third party to perform the function. Consider the example of data analytics. Data analytics can only be done when an organization collects personal data about its users. This creates a huge privacy issue, as companies can gain private insights into individuals based on such data, especially if this data is aggregated from multiple sources [1]. A simple example would be the shopping habits of customers; a company can derive a person's health from the products that they buy. If a person constantly buys products that remove acne, data analytics can reveal that this person has acne, which is something that an individual might not want to reveal to a public entity. Secure Multiparty Computation is a field of cryptography that explores joint computation of a function with inputs from different parties while keeping each party's inputs private. Secure Multiparty Computation can resolve these privacy
issues as it generalizes the existence of a trusted third party into the security of cryptographic protocols. Research into specific fields like data mining [2] has been done with the same motivation and shows the wide-ranging use cases of the field. Secure Multiparty Computation is a subset of cryptography that has not been used practically due to efficiency concerns. However, recent developments in Secure Multiparty protocols have made it more efficient and more viable for practical implementation. The aim of this paper is to create an application using a Secure Multi Party Computation (MPC) framework to compute home loan installments. This application will then be used to evaluate the MPC framework and determine the viability of MPC in practical usage. During the process of deciding to buy a home, buyers would calculate the required costs to determine if they are eligible and able to afford the purchase, which requires private inputs from different parties. The intention of this paper is to evaluate the use of the FRESCO (a FRamework for Efficient Secure COmputation) MPC framework in its current state and the implementations of its Secure Multi Party Computation protocols. The focus is on usability and implementation of the framework. Since practical implementations of Secure Multiparty Computation are relatively new and undocumented, this paper seeks to implement a sample scenario to verify the usability of current frameworks that implement Secure Multiparty Computation.
The rest of the paper is organized as follows. Section 2 describes the concepts and developments of Secure Multiparty Computation. Section 3 explores the context of home loans. Section 4 defines the identified requirements that a Home Loan Calculation application should fulfill. Section 5 describes the system design. Section 6 details the phases and description of the implementation of the Home Loan Calculation Application. Section 7 discusses the experiments conducted to evaluate the implementation of FRESCO. Section 8 concludes this paper and discusses potential future work.
2 Related Work
2.1 Secure Multiparty Computation
The concept of Secure Multiparty Computation was introduced by Yao in his paper introducing the classic millionaires' problem [3]. More specifically, Yao presents the problem as a generalized problem involving multiple parties: given a function f(x1, ..., xn) and a number of parties n, can the function f be computed among the n participants themselves such that each person Pi only knows its own input xi and the output of the function f? Yao's proposed solution for this problem in the paper is a secure two-party protocol. Yao's solution is based on party P1 giving P2 a list of possible values, with P2 inputting his values into the list of possible values, which is then returned to P1, who is able to securely evaluate a boolean function f(x1, x2) by selecting the correct entry in the list of values. While his solution is a two-party secure computation, his generalization of the problem opened the idea of secure multiparty computation and contextualized it. There are two main
secure multiparty computation approaches: circuit garbling and secret sharing schemes. Before explaining Secure Multiparty Computation concepts further, some terms must be defined. This section explains the terminology that will be used in describing protocols for the rest of the paper.
(1) Oblivious Transfer: In Oblivious Transfer, the sender sends a list of information to the receiver while remaining unaware of which piece of information has been transferred. The construction described below is an example of Oblivious Transfer. Oblivious transfer is also used as a cryptographic primitive in many secure multiparty protocols. An example of oblivious transfer is 1-out-of-2 oblivious transfer. In this protocol, there is a sender Alice and a receiver Bob. Bob desires a message from Alice, but does not wish Alice to know which message he has requested [4]. Generally, this protocol requires a few prerequisites: Alice, as the sender, has two messages msg0 and msg1 that could potentially be the message that Bob desires; Bob has a bit b that corresponds to the message that he desires from Alice and does not want to let Alice know which message he wants. The protocol can be implemented using any public key encryption scheme. This protocol has been generalized to 1-out-of-n oblivious transfer, where there can be more than two inputs [5].
(2) Circuits: Logic circuits are a model of computation used in cryptography. A logic circuit is characterized by its size and depth, the depth being the length of its longest path. Logic circuits whose operations are Boolean [6] are often referred to as Boolean circuits in the cryptography literature.
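Returning to the oblivious transfer primitive, the sketch below illustrates one way a 1-out-of-2 oblivious transfer can be built from public key encryption, along the lines of the classic Even-Goldreich-Lempel construction. It is an illustrative example only, not part of FRESCO or of the application described later; the key sizes, message encoding, and absence of padding are simplifications.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// 1-out-of-2 oblivious transfer sketch built from textbook RSA, in the
// style of the Even-Goldreich-Lempel construction. For illustration only:
// no padding, no network layer, and simplified parameter choices.
public class ObliviousTransferDemo {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();

        // Alice's RSA key pair.
        BigInteger p = BigInteger.probablePrime(512, rnd);
        BigInteger q = BigInteger.probablePrime(512, rnd);
        BigInteger n = p.multiply(q);
        BigInteger e = BigInteger.valueOf(65537);
        BigInteger d = e.modInverse(
                p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE)));

        // Alice's two messages msg0 and msg1, encoded as integers.
        BigInteger[] msg = {BigInteger.valueOf(42), BigInteger.valueOf(99)};

        // Step 1: Alice sends two random values x0 and x1 to Bob.
        BigInteger[] x = {new BigInteger(256, rnd), new BigInteger(256, rnd)};

        // Step 2: Bob picks his choice bit b and a random k, then blinds x[b].
        int b = 1;
        BigInteger k = new BigInteger(256, rnd);
        BigInteger v = x[b].add(k.modPow(e, n)).mod(n);

        // Step 3: Alice cannot tell which x was blinded. She derives both
        // candidate keys and masks each message with one of them.
        BigInteger k0 = v.subtract(x[0]).mod(n).modPow(d, n);
        BigInteger k1 = v.subtract(x[1]).mod(n).modPow(d, n);
        BigInteger m0 = msg[0].add(k0).mod(n);
        BigInteger m1 = msg[1].add(k1).mod(n);

        // Step 4: Bob unmasks only the message he chose; the other stays
        // hidden behind a key he cannot compute without d.
        BigInteger learned = (b == 0 ? m0 : m1).subtract(k).mod(n);
        System.out.println("Bob learned msg" + b + " = " + learned);
    }
}
```

Only one of the two derived keys equals Bob's k, so Bob can remove the mask from exactly one message, while Alice never learns which one he took.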
2.2 Homomorphism
Gentry proposed a fully homomorphic scheme in his paper using lattices [7]. He defined a fully homomorphic public encryption scheme as a scheme that contains the following functions: (1) fkeygen, which generates the key; (2) fencrypt, which encrypts a plaintext; (3) fdecrypt, which decrypts a ciphertext; and (4) fcompute, which computes a circuit on input ciphertexts generated by fencrypt and outputs a ciphertext c that is the result of the circuit. In addition to these functions, such a scheme should support any circuit. Gentry also describes different kinds of homomorphisms based on the lattice structure. These homomorphisms are additive homomorphism and multiplicative homomorphism [7].
(1) Additive Homomorphism: Generally, a scheme is additively homomorphic when plaintext values x and y satisfy the following condition:
fencrypt(x) + fencrypt(y) = fencrypt(x + y)
The property implies that any addition of ciphertexts cipher1, ..., ciphern under the same encryption scheme gives the same result when recovering the plaintext.
(2) Multiplicative Homomorphism: A scheme is multiplicatively homomorphic when plaintext values x and y satisfy the following condition:
fencrypt(x) × fencrypt(y) = fencrypt(x × y)
The property implies that any multiplication of ciphertexts cipher1, ..., ciphern under the same encryption scheme gives the same result when recovering the plaintext.
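As a concrete illustration of the additive property, the sketch below uses the Paillier cryptosystem, a well-known additively homomorphic scheme. It is given purely as an example and is not the scheme used by FRESCO or by the application in this paper; in Paillier, addition of plaintexts corresponds to multiplication of ciphertexts modulo n^2.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Minimal Paillier cryptosystem, shown only to illustrate the additive
// property: decrypt(E(x) * E(y) mod n^2) == x + y (mod n).
public class PaillierDemo {
    static final SecureRandom RND = new SecureRandom();
    static final BigInteger ONE = BigInteger.ONE;
    static BigInteger n, n2, lambda, mu;   // key material

    static void keygen(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, RND);
        BigInteger q = BigInteger.probablePrime(bits / 2, RND);
        n = p.multiply(q);
        n2 = n.multiply(n);
        BigInteger p1 = p.subtract(ONE), q1 = q.subtract(ONE);
        lambda = p1.multiply(q1).divide(p1.gcd(q1));   // lcm(p - 1, q - 1)
        mu = lambda.modInverse(n);                     // valid because g = n + 1
    }

    static BigInteger encrypt(BigInteger m) {
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), RND); }
        while (r.signum() == 0 || r.compareTo(n) >= 0);
        BigInteger gm = ONE.add(m.multiply(n)).mod(n2); // (n + 1)^m mod n^2
        return gm.multiply(r.modPow(n, n2)).mod(n2);
    }

    static BigInteger decrypt(BigInteger c) {
        BigInteger u = c.modPow(lambda, n2);
        BigInteger l = u.subtract(ONE).divide(n);       // L(u) = (u - 1) / n
        return l.multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        keygen(1024);
        BigInteger cx = encrypt(BigInteger.valueOf(1200));
        BigInteger cy = encrypt(BigInteger.valueOf(345));
        // Multiplying ciphertexts adds the underlying plaintexts.
        System.out.println(decrypt(cx.multiply(cy).mod(n2)));   // prints 1545
    }
}
```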
2.3 Yao's Garbling Circuit
Yao's influence on secure multiparty computation was extended further with his proposals to solve his original millionaires' problem. Known as Yao's garbled circuit, the approach relies on circuits as a model of computation for computing a function. The idea is to encrypt the circuit to be computed, creating a “garbled” version of the circuit [8]. Yao's protocol starts with the garbling/encryption of the circuit. In this case we assume two parties, Alice and Bob, with Alice being the “garbler” and Bob being the “evaluator”. Alice provides the circuit on which to compute, which she garbles. Alice then sends the garbled circuit and uses the oblivious transfer primitive to send her garbled inputs to Bob. Bob evaluates (decrypts) the garbled circuit to obtain the encrypted outputs. Alice and Bob then communicate to reveal the final value of the output. The idea of the protocol is to provide a way to compute a function in which the values obtained on the circuit wires are not revealed, with the exception of the output wire's value [9].
2.4 Shamir's Secret Sharing Scheme
Shamir [10] introduced a problem first formulated by Liu [11] as background to his paper. Secret sharing is a cryptographic primitive dealing with the problem of sharing a secret among n parties such that the secret can only be revealed upon combining t of the parties' shares. Shamir's scheme is a threshold scheme, or a (k, n) threshold scheme [10]. Given the secret S divided into n parts, the following properties apply: (1) Reconstructibility: knowledge of k parts of the secret can easily reconstruct S; (2) Secrecy: knowledge of k − 1 parts of the secret does not allow reconstruction of S; furthermore, all possible values of S remain possible with k − 1 parts. Since Shamir's scheme is based on interpolation of polynomials, every share [u]_i of a secret value u is a point of a polynomial f(x). Given also a sharing [v]_i of another secret v and n parties, we observe:
[u + v]_i = [u]_i + [v]_i
u + v = Func_dec([u + v]_1, [u + v]_2, ..., [u + v]_n)
Adding shares of different secrets creates a new share of the sum of the secrets. When these shares are combined from at least k parties, we can compute the real value of the sum of the two secret values [12].
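To make the additive property concrete, the sketch below implements a minimal (k, n) Shamir scheme over a prime field and checks that adding the shares of two secrets reconstructs to their sum. It is a stand-alone illustrative example, not FRESCO's implementation; the field size, secrets, and parameters are chosen arbitrarily.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Minimal (k, n) Shamir secret sharing over a prime field, used to
// illustrate that shares of u plus shares of v reconstruct to u + v.
public class ShamirAdditionDemo {
    static final BigInteger P = BigInteger.probablePrime(128, new SecureRandom());
    static final SecureRandom RND = new SecureRandom();

    // Share a secret with threshold k among n parties; party i holds f(i).
    static BigInteger[] share(BigInteger secret, int k, int n) {
        BigInteger[] coeff = new BigInteger[k];
        coeff[0] = secret.mod(P);
        for (int j = 1; j < k; j++) coeff[j] = new BigInteger(127, RND);
        BigInteger[] shares = new BigInteger[n];
        for (int i = 1; i <= n; i++) {
            BigInteger x = BigInteger.valueOf(i), y = BigInteger.ZERO;
            for (int j = k - 1; j >= 0; j--) y = y.multiply(x).add(coeff[j]).mod(P);
            shares[i - 1] = y;
        }
        return shares;
    }

    // Lagrange interpolation at x = 0 using the first k shares (x = 1..k).
    static BigInteger reconstruct(BigInteger[] shares, int k) {
        BigInteger secret = BigInteger.ZERO;
        for (int i = 1; i <= k; i++) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (int j = 1; j <= k; j++) {
                if (i == j) continue;
                num = num.multiply(BigInteger.valueOf(-j)).mod(P);
                den = den.multiply(BigInteger.valueOf(i - j)).mod(P);
            }
            BigInteger term = shares[i - 1].multiply(num).multiply(den.modInverse(P));
            secret = secret.add(term).mod(P);
        }
        return secret;
    }

    public static void main(String[] args) {
        int k = 3, n = 5;
        BigInteger[] u = share(BigInteger.valueOf(250_000), k, n); // e.g. savings
        BigInteger[] v = share(BigInteger.valueOf(80_000), k, n);  // e.g. CPF funds
        BigInteger[] sum = new BigInteger[n];
        for (int i = 0; i < n; i++) sum[i] = u[i].add(v[i]).mod(P); // local addition
        System.out.println("Reconstructed u + v = " + reconstruct(sum, k)); // 330000
    }
}
```

This local addition of shares is the property that the BGW-based application later relies on when it combines secret values such as savings and CPF amounts without revealing them.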
3 Background of Home Loans
This section details the background of home loans in Singapore, including the context and scenario needed to make the application work.
3.1 Overview of Home Loan Privacy
In Singapore, property agents are service people who help home buyers with the financial paperwork when purchasing a home. These property agents also consult potential buyers on the home that they wish to buy, which includes the financial aspect of affording the home. Typically, this presents privacy problems, as the calculation of the amount requires information that intrudes on the privacy of the home buyer (i.e., savings). In addition, when a property agent consults a buyer on the financial aspects of the potential home purchase, the lender (i.e., the bank) of any potential loan taken is not involved in the consultation. Finally, in Singapore, the social security system CPF can help pay for part of the home cost. These entities (buyer, bank, CPF) are required for accurate computation of a home purchase, but they are not connected together; to do so would incur privacy loss on the part of the buyer, as the full calculation would reveal private information the user has with the three entities. The rest of this section elaborates on the details of the overview presented here. Section 3.2 explains the overview in Singapore's context, and Sect. 4 will show the high level overview of the application.
3.2 Context
In Singapore, up to 80% of the population stay in public housing built by the government, also known as HDB flats1. There are also different kinds of HDB flats, with different prices for each. While a sizable portion of the population possesses private housing, the application explores the purchase of public housing flats, as this is the more general case for a higher percentage of the population in Singapore. In addition, we can also generalize the scenario into a more global context. The model for this application uses the Singapore housing context: the application uses Singapore's public housing payment model and the conditions for buying a HDB flat to compute home loan installments. The rest of the section explains the concepts of the scenario in more detail.
(1) Central Provident Fund: In Singapore, the Central Provident Fund (CPF) is a social security system that helps working Singapore Citizens and Permanent Residents (PR) save enough for their retirement. The scheme also provides for the use of a citizen's/PR's funds for certain purposes like housing and health care.
1 http://www10.hdb.gov.sg/eBook/AR2016/key-statistics.html
CPF also allows the use of funds for the purchase of a house. In particular, a buyer of a HDB home can pay part of the cost of the house using their funds held in CPF. The amount that can be paid depends on various factors, most notably that it cannot exceed the amount that the user has in his account with CPF.
(2) Total Debt Servicing Ratio and Monthly Debt Servicing Ratio: In Singapore, a condition for being able to take a loan from a bank is the Total Debt Servicing Ratio (TDSR)2. TDSR is a loan limit based on a person's monthly income. For Singapore, the TDSR cap is 60% of a person's monthly income that can be used to service his monthly debt repayments3. In HDB loans, the Monthly Debt Servicing Ratio (MSR) applies instead. The two are similar; the difference is that the MSR cap is 30% and it only applies to HDB flats. We choose to generalize all debt upper bound calculations as MSR in the application.
(3) Downpayment: The purchase of a HDB flat can be separated into two portions: the downpayment and the loan. This section explains the downpayment portion of the scheme. When purchasing a HDB flat, a buyer can choose either a HDB housing loan or a bank loan. Since both are loans operating on similar principles, this section uses the HDB loan as the example. HDB requires that if a bank loan is taken to pay for the purchase of a HDB flat, the buyer has to pay 20% of the purchase price as downpayment. Of this 20%, at least 5% of the purchase price must be paid in cash, with the balance payable using the buyer's CPF funds under the CPF scheme for public housing.
(4) Home Loan: This section explains the loan portion of the scheme that we are using. Home loans are amortizing loans. Amortizing loans calculate interest on an annual basis: the interest is calculated by taking the outstanding amount owed and multiplying it by the interest rate. HDB loans are fixed rate loans pegged to the CPF interest rate. To calculate the monthly installment, we use the Equated Monthly Installment formula, which reads as follows:
A = P · r(1 + r)^n / ((1 + r)^n − 1)
where A is the monthly installment, P the principal/outstanding amount, r the interest rate per month, and n the repayment period in months.
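As a quick illustration of the Equated Monthly Installment formula, the following sketch computes the installment for an example loan. The figures are arbitrary and are not taken from the paper.

```java
// Equated Monthly Installment: A = P * r(1+r)^n / ((1+r)^n - 1),
// with r the monthly interest rate and n the repayment period in months.
public class EmiExample {
    static double emi(double principal, double annualRatePercent, int months) {
        double r = annualRatePercent / 12.0 / 100.0;      // monthly rate
        double factor = Math.pow(1.0 + r, months);
        return principal * r * factor / (factor - 1.0);
    }

    public static void main(String[] args) {
        // Example: S$360,000 loan at 2.6% per annum over 25 years (300 months).
        System.out.printf("Monthly installment: %.2f%n", emi(360_000, 2.6, 300));
    }
}
```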
4 Requirements
This section will state the structure and purpose of the application created in this paper.
2 http://www.mas.gov.sg/news-and-publications/media-releases/2013/mas-introduces-debt-servicing-framework-for-property-loans.aspx
3 http://housingloansg.com/hl/resources/housing-loan-guide/tdsr-and-msr
4.1 Problem Statement
Calculating the financial details of buying a new HDB flat is often a complicated process that requires private data of the buyer (i.e., savings, debt) from many different sources. These private data should ideally be kept secret from any party other than the buyer himself. However, current methods of calculation still require knowledge of the private values for the actual calculation to happen. These problems are somewhat mitigated because the parties involved are segregated from one another, with only the output of each party (i.e., Yes or No for checking whether savings are enough) used to carry on the calculation. While this ensures the secrecy aspect, it can only be done when the actual purchase of the home happens; a buyer is not able to calculate the estimated costs securely, as he needs to reveal information to a property agent in order to get consultation on his potential purchase.
4.2 The Solution
We aim to preserve individual privacy while aggregating the three entities' (CPF, Bank, Buyer) data to calculate the estimated loan installment amount. This aggregation of private data will be done via Secure Multiparty Computation techniques. The data we wish to protect are the buyer's monthly salary, the amount held in CPF, and the savings and debt that are recorded in banks. Secure Multiparty Computation is a relatively new field of cryptography, and we analyze the potential uses of implementations of SMPC frameworks in their current state using this problem as a model to evaluate.
4.3 Parties
The three entities that were identified for the solution are: (1) CPF; (2) Buyer; and (3) Bank. Each party other than the buyer is required to provide some private details that they cannot share with any other party in order to calculate the monthly installment when purchasing a property selected by the buyer. In Sect. 4.4, a high level description of the calculations to be done is given, from which we can infer which values each party is required to provide for calculating a home loan's monthly installment. In our solution, a trusted third party is not desired in computing the home loan; the only parties are the parties listed in this section, and they jointly compute the calculation together.
(1) CPF: CPF only needs to provide one value: the amount usable in the CPF account. This value must remain secret, as the amount that a buyer can use from his account must not be known to the bank. The value is needed as it is required for calculating the downpayment for a HDB flat.
(2) Buyer: The buyer party is required to provide the following values: (1) 30% of their monthly salary; (2) Repayment period in months; (3) Minimum amount of money for downpayment for the chosen HDB flat; (4) Minimum amount of money required in CPF for the chosen HDB flat; (5) Maximum amount loanable
for the chosen HDB flat. The only value here that needs to be secret is the buyer's 30% of monthly salary. The remaining values can be public, as they are based on the buyer's choice of HDB flat. These values are needed to calculate the monthly installment of a loan and to verify the buyer's eligibility for a loan.
(3) Bank: The bank is required to provide the following values: (1) Buyer's existing debt; (2) Buyer's savings; and (3) Interest rate of the loan. All the values from the bank except the interest rate are required to be secret, as these are private details of the buyer. The interest rate is needed to calculate the loan's monthly installment.
4.4 Calculating a HDB Home Loan's Monthly Installment
Based on the context described in Sect. 3.2, there are several preconditions for getting a HDB home loan. They are as follows: (1) the Total Debt Servicing Ratio/Monthly Debt Servicing Ratio threshold; (2) the amount of cash the buyer has on hand to pay the downpayment; and (3) the CPF funds usable to pay the downpayment. A buyer has to ensure that he does not exceed the TDSR/MSR threshold, has enough money on hand, and also has enough usable money in his CPF account before he can be eligible to buy a HDB flat. Based on the context, we can detail the steps required to calculate a home loan (a plaintext sketch of these steps is given below): (1) Determine the TDSR/MSR limit and see if the loan is allowed to be acquired. (2) Determine if the buyer's CPF funds and cash on hand are enough to pay the 20% downpayment. (3) Calculate the monthly installment using the Equated Monthly Installment formula. (4) Add the calculated installment value to the existing debt and recheck the TDSR.
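The sketch below walks through these four steps on plaintext values, ignoring the secure computation aspect entirely. The thresholds (30% MSR, 20% downpayment, 5% minimum cash) follow the description above, while the concrete numbers are invented for illustration.

```java
// Plaintext walk-through of the four eligibility steps for a HDB loan.
// All monetary values are in dollars; none of the privacy machinery is shown.
public class LoanEligibilitySketch {
    public static void main(String[] args) {
        double price = 450_000, monthlySalary = 6_000, existingDebt = 500;
        double cashOnHand = 30_000, cpfUsable = 80_000;
        double annualRatePercent = 2.6; int months = 300;

        // Step 1: MSR limit - at most 30% of monthly income may service debt.
        double msrLimit = 0.30 * monthlySalary;
        if (existingDebt >= msrLimit) { System.out.println("MSR exceeded"); return; }

        // Step 2: 20% downpayment, of which at least 5% of the price must be cash.
        double downpayment = 0.20 * price, minCash = 0.05 * price;
        if (cashOnHand < minCash || cashOnHand + cpfUsable < downpayment) {
            System.out.println("Downpayment cannot be covered"); return;
        }

        // Step 3: EMI on the remaining 80% of the price.
        double r = annualRatePercent / 12.0 / 100.0;
        double factor = Math.pow(1.0 + r, months);
        double installment = (price - downpayment) * r * factor / (factor - 1.0);

        // Step 4: recheck the limit with the new installment added to the debt.
        boolean eligible = existingDebt + installment <= msrLimit;
        System.out.printf("Installment %.2f, eligible: %b%n", installment, eligible);
    }
}
```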
5 System Design
Two experimental versions and a prototype of the home loan calculation application were created in this paper, in accordance with the requirements discussed in Sect. 4. This section details the different phases of the implementation and the decisions made in evaluating the technology used.
5.1 Overview
The system architecture is shown in Fig. 1. The system implementation was conducted in phases. These phases were: (1) evaluation of the protocols in FRESCO (as discussed in Sect. 2); (2) implementation of a simple interest calculation application prototype; and (3) implementation of an amortizing loan calculation application prototype.
Fig. 1. System architecture.
The implementation was conducted in the order shown above. Firstly, the frameworks Sharemind [13] and FRESCO were shortlisted and evaluated for use in implementing the requirements discussed in Sect. 4. After choosing the framework, the protocols available in the framework were evaluated for use, with attempts to create simple prototypes and functions; this protocol evaluation is discussed in Sect. 6.1. After that, a true implementation of the requirements discussed in Sect. 4 was created using the chosen protocol. Two implementations were created as proof-of-concept implementations, for technical reasons that will be discussed in Sect. 7.
5.2 Investigation of the FRESCO Framework
FRESCO is a framework that is designed for users to easily write prototypes based on secure computation. It allows rapid and simple application and protocol suite development as well as a flexible design pattern with support for large and efficient computations4. FRESCO abstracts the idea of different protocol suites to create a plug-and-play framework. This is achieved by FRESCO's Protocol Producer/Consumer pattern.
(1) Usability: FRESCO is easily extensible and flexible; users can define a protocol that they wish to use to evaluate a certain function. In that sense, protocols are decoupled from application development; developers just need to specify a function, like addition, that they wish to calculate without knowing about its specifics. FRESCO envisions that applications using this pattern can be run on multiple different protocol suites, with common operations acting as an abstraction from the protocols during development. FRESCO is a relatively new framework that has been around since 2015. Currently it is in its first unstable version, version 0.1.0. It uses SCAPI as the underlying networking layer and currently has three protocol suites implemented.
(2) Protocols: These protocol suites are:
• the Dummy protocol suite,
• the BGW protocol suite,
• the SPDZ protocol suite.
4 http://fresco.readthedocs.io/en/latest/intro.html
The Dummy protocol suite has no security and is used as a measure of the base overhead of FRESCO. The BGW and SPDZ protocols are explored in the remainder of this section.
The BGW protocol is one of the protocols used in the FRESCO framework. Proposed by Ben-Or et al. [14], it describes a way to implement secure multiparty computation for several logical operators. In particular, the authors defined circuits for addition and multiplication, and created a secret sharing scheme that is secure in the presence of an adversary [15]. The protocol is based on Shamir's Secret Sharing Scheme; in particular, BGW tweaks certain rules of Shamir's scheme so that operations can be computed using shares generated by the scheme. In general, secure computation in BGW consists of three steps:
(1) Input Sharing Stage: each party Pi creates shares of its input ui using threshold t + 1, where t < n/2, and distributes them among the parties.
(2) Computation Stage: the parties jointly compute a function f on the values they hold. The function f(x1, ..., xn) returns an output share outi to each party; each outi is a share of the true output out. The functions to be computed and their behavior depend on the formula to be calculated.
(3) Output Reconstruction Stage: the parties communicate to reconstruct the output out by combining the shares outi from all parties P1, ..., Pn. If only one party is allowed to know the output, all parties send their shares to that party.
SPDZ is a protocol that was developed in 2012 and is implemented in the FRESCO framework. The protocol differs from the BGW scheme in several ways, notably in the use of Message Authentication Codes (MAC) for authenticating shares and the use of a somewhat homomorphic encryption (SHE) scheme during preparation of the values to be computed in the protocol [16]. However, SPDZ is also a secret sharing scheme, like BGW. Computing operations like addition and multiplication differs from BGW, with some notable novel concepts. In particular, SPDZ consists of a two-phase protocol: (1) a Preprocessing Phase; and (2) an Online Phase.
(3) Contrasting design philosophies: Sharemind is a commercial application designed for commercial usage. In contrast, FRESCO is open source, which gives anyone who wishes to use or contribute to the framework instant access. Since the system is time bounded, FRESCO's instant usability is clearly better suited. In addition, Sharemind is designed as a framework that provides a full suite of functions for secure multiparty computation; this means that any application written with Sharemind must live in Sharemind's context. FRESCO, however, envisions itself as a plug-and-play component in a larger application. In this case, FRESCO is better suited for the system's purpose, as we also seek to evaluate the general use of secure multiparty computation frameworks.
6 System Implementation
6.1 Evaluation of Protocols in FRESCO
FRESCO is a platform for secure multiparty computation protocol implementations. In this section, the protocols implemented in FRESCO 0.1.0 are evaluated and one protocol is chosen for the implementation of the home loan calculation application described in Sect. 4. There are three secure multiparty computation protocols implemented in FRESCO: Dummy, the BGW protocol, and the SPDZ protocol. The Dummy protocol is used for measuring FRESCO's overhead. It provides zero security and is thus not usable for the actual home loan calculation application. A major limitation of FRESCO is that it does not support decimals; this applies to both SPDZ and BGW. The SPDZ and BGW protocols were evaluated against each other to determine which of the two should be used in implementing the requirements described in Sect. 4. The implementations of both were studied by creating a prototype application that does simple addition and multiplication. These efforts are detailed in this section.
(1) SPDZ: A prototype of SPDZ was attempted to evaluate the protocol for use in the system. During the attempt to build the prototype, flaws in the implementation of SPDZ were discovered. These flaws are:
• The preprocessing phase was not implemented fully in SPDZ.
• The utility class cannot parse specified SPDZ options properly.
A working prototype of SPDZ was attempted but not completed, as the actual implementation of SPDZ in FRESCO is incomplete; while FRESCO has a method that lets a trusted party generate the preprocessing requirements, using this third party would violate privacy, as we do not want the values from each party to be known to any party other than the party inputting the values itself. A prototype was created successfully for BGW that does simple addition and multiplication. However, just like SPDZ, flaws in the implementation of the protocol were discovered. These flaws are:
• Negative values are not supported in BGW due to implementation bugs.
• The framework's utility class is unable to parse user-specified BGW options due to bugs.
• If a computation returns a negative value, it returns modulus − (negative value).
These bugs and problems are further explained in Sect. 6.3.
(2) Choice of protocol: BGW was chosen as the protocol to use in FRESCO as it is the only protocol that does not require any party other than the ones identified in Sect. 4. In addition, the SPDZ implementation in FRESCO is still a work in progress; examination of the implementation shows that even though a method for doing the preprocessing phase is present, it is not a proper implementation but a placeholder for a future full implementation of the preprocessing phase.
6.2 Actual Implementation of Application
There are two home loan calculation applications that were produced for this system. One is based on a simple interest scheme, and the other is based on an amortizing interest scheme. This section presents the implementation of the home loan application. The definitions of each component required for the home loan application are described first. This is followed by the explanation of the application workflow in Sect. 6.3, which lists the components of the application in the order in which they occur. To simplify the explanation of the implementation of the application, we define some terms. We define the BGW protocol stages as:
• the Input Sharing Stage as Stage_input
• the Computation Stage as Stage_compute
• the Output Reconstruction Stage as Stage_output
Here, we define the terms relevant to the parties involved as detailed in Sect. 4. We define the inputs from the party buyer as follows:
• 30% of monthly salary as salary
• Repayment period as paymentperiod
• Chosen flat's minimum cash required as required_cash
• Chosen flat's minimum CPF amount needed as CPF_needed
• Chosen flat's maximum loanable amount as MaxLoanable
For the party bank, we define the inputs:
• Buyer's existing debt as debt
• Buyer's savings as savings
• Interest rate of the loan as interest
For the party CPF, we define its input, the amount usable in the CPF account, as CPF_usable.
6.3 Application Workflow
Two versions of the home loan calculation application were created for the system. The only difference between the two is the calculation of the monthly installment, namely simple interest versus amortizing interest. Apart from the formula used to calculate the interest, the two versions are the same. This section details the generic application workflow for both versions and notes the differences where they occur.
(1) Assumptions of Application: This home loan calculation application works under a few assumptions based on the bugs that were identified in FRESCO's implementation of BGW. They are:
• savings ≥ required_cash
• if the result of any computation is more than 60,000,000,000, it is treated as a negative value. This threshold is defined as bound.
We assume that when a buyer wants to calculate the home loan cost using the application, he knows that he has enough money in his savings to cover the required cash. To address the bug of BGW returning the modulus for negative results, we assume any number above a certain threshold is a negative value. We take 60,000,000,000 as our threshold, as it is sufficiently high that it is improbable for the value returned by modulus − 60,000,000,000 to be a realistic number.
(2) Gathering Input: The application first requires users to identify themselves. This is done through a command line interface that requires the user to input a number identifying the party that the user represents. Each party runs on a different address and port based on its party identification ID. The network addresses for the three parties are set in the code; all parties are required to know the addresses of all other parties so that they can share their inputs with each other for the Computation Stage. After specifying the party, the user of the application is prompted to enter values based on the party they have identified as. For example, a user who identifies himself as the CPF party has to input the CPF_usable amount. Although, as discussed in Sect. 4, some parties' values need not be secret, the application still uses the networking implementation in FRESCO for all of them to simplify the implementation. At the end of this part, each party has its values ready for use in the next stage.
(3) Calculating the home loan: After gathering all required inputs, we use FRESCO's networking implementation to invoke Stage_input to secret share the inputs among all parties. Using these inputs, we begin the actual secure computation to determine the home loan, as shown in Fig. 2.
Fig. 2. FRESCO-based home loan calculator application.
The payment for a loan can be divided into the 20% downpayment and the 80% loanable amount. Of the 20% downpayment, at least 5% is required to be paid
by cash, with the rest payable from the buyer's CPF account. This 5% is a lower bound and can be higher if the buyer wishes. However, we wish to check whether the CPF account has enough money to afford the house. We do so with the following formula f_checkCPF:
CPF_usable + (savings − downpayment) − CPF_needed
We also wish to determine the monthly installment for an amortizing home loan. We do so with the following formula f_amortize:
A = P · r(1 + r)^n / ((1 + r)^n − 1)
where r = interest/12/100 and n = paymentperiod. This formula requires division and exponentiation, neither of which is defined in the BGW protocol nor implemented in FRESCO. We solve this by doing the calculation on values that are opened and known to the public, where plain Java has functions for division and exponentiation. We thus have to reveal the values interest, paymentperiod, and the result of f_checkCPF. Alternatively, in another version of the application we use a simple interest loan calculation. This formula f_simple is
(P × r) / n
Finally, we also wish to compute the TDSR of the buyer. We do so with the formula f_tdsr:
salary − debt − monthlyinstallment
Algorithm 1 shows how the functions are securely computed after the parties provide their respective inputs. All parties have shares of every input (line 4). Because of the operation limitations of the protocol, we have to compute f_amortize and f_tdsr using secret values made public; this computation happens after Algorithm 1. Algorithm 2 details how the application decides whether a buyer is eligible to buy a flat and the calculations required. The application runs Algorithm 1 first and then Algorithm 2. These algorithms describe the amortizing interest version of the home loan application. For the simple interest version, the difference is adding a step to Algorithm 1 to calculate loanable × interest and, in Step 2 of Algorithm 2, replacing the formula with f_simple. Figure 3 shows the activity diagram of the application. Computation of the loan is not done in FRESCO's MPC framework, for the reasons explained in Sect. 6.3.
7 Performance Evaluation
This section discusses the evaluation of the feasibility of using the FRESCO framework to build a full application. We conduct an Efficiency Evaluation to decide if the computational overhead of FRESCO is suitable for use in a
Algorithm 1. High level description of Application Secure Computation Implementation
1: All parties input their values and identities into the application.
2: All parties invoke Stage_input to secret share their inputs among all parties.
3: Calculating using BGW - the steps in this section are done in a secure way.
4: We construct a circuit C_computecpf to compute f_checkCPF by setting CPF_excess ← CPF_usable + (savings − downpayment) − CPF_needed.
5: We then construct a circuit C_revealCPF to reveal the secret value CPF_excess.
6: We compute the amount loanable by constructing a circuit C_loanable computing loanable ← MaxLoanable − CPF_excess.
7: We then construct a circuit C_revealloanable to reveal the secret value loanable.
8: Finally, we wish to compute f_tdsr. Since installment is not known until the full computation is complete, first construct circuit C_loanlimit computing loanlimit ← salary − debt.
9: Construct circuit C_reveallimit that reveals the secret value loanlimit.
10: Construct circuit C_reveal that reveals the secret values paymentperiod and interest.
11: Glue the circuits together in the order C_computecpf, C_revealCPF, C_loanable, C_revealloanable, C_loanlimit, C_reveallimit and evaluate them.
Fig. 3. Activity diagram of the home loan computation application.
Algorithm 2. High level description of Application Non Secure Implementation
Require: Algorithm 1 was completed prior to this algorithm.
Ensure: Output that reveals the installment of the loan if the buyer is eligible.
1: After Algorithm 1, we have revealed the values CPF_excess, loanable, loanlimit, paymentperiod and interest.
2: Compute f_amortize using paymentperiod, loanable and interest by installment ← loanable · interest(1 + interest)^paymentperiod / ((1 + interest)^paymentperiod − 1).
3: Compute TDSR: TDSR ← loanlimit − installment.
4: Check if downpayment > bound. If so, this means that the downpayment is insufficient. Return an error message informing the user that CPF is insufficient and exit.
5: Check if TDSR > bound. If this returns true, then TDSR is a negative value, which means the user cannot get another loan. Return an error message that TDSR has been exceeded.
6: If none of the conditions evaluate to true, return the installment value installment to the user.
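Because BGW in FRESCO 0.1.0 returns modulus − value for negative results, the revealed values in Algorithm 2 have to be interpreted against the bound threshold before use. The sketch below shows this post-reveal logic on plain Java values; the variable names mirror the algorithm, but the numbers and the helper itself are illustrative and are not part of FRESCO.

```java
// Post-reveal eligibility logic in the spirit of Algorithm 2, with the
// "wrapped negative" interpretation: any revealed value above BOUND is
// treated as a negative result of the secure computation.
public class PostRevealCheck {
    static final long BOUND = 60_000_000_000L;

    static boolean wrappedNegative(long revealed) {
        return revealed > BOUND;
    }

    public static void main(String[] args) {
        // Values as they would arrive after the reveal circuits (illustrative).
        long cpfExcess = 25_000, loanable = 300_000, loanLimit = 1_600;
        long paymentPeriod = 300;          // months
        double interest = 0.026 / 12.0;    // monthly rate

        if (wrappedNegative(cpfExcess)) {
            System.out.println("CPF is insufficient for the downpayment");
            return;
        }
        double factor = Math.pow(1.0 + interest, paymentPeriod);
        double installment = loanable * interest * factor / (factor - 1.0);
        double tdsr = loanLimit - installment;
        // Algorithm 2 phrases this negativity test as "TDSR > bound", since a
        // wrapped loanlimit pushes the result above the threshold; with plain
        // doubles a direct sign check is enough for this sketch.
        if (tdsr < 0) {
            System.out.println("TDSR exceeded; the buyer cannot take this loan");
            return;
        }
        System.out.printf("Monthly installment: %.2f%n", installment);
    }
}
```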
full-fledged application. The experiments were conducted on a machine with the following specifications: Windows 10 Home with an Intel Core i7-4720HQ processor, 16 GB RAM, and a JVM heap size of 2 GB. An evaluation was done on the efficiency of the framework. This was done with two experiments: measuring the time taken to encrypt values for secret sharing and measuring the time taken to compute an operation.
7.1 Measuring Time Taken for Encrypting Values
In this experiment, FRESCO's secret sharing implementation is evaluated. In particular, the time taken to encrypt a value for secret sharing is measured. In addition, we wish to evaluate whether high values can be efficiently transformed into their encrypted, secret-shared form. The experiment was conducted by coding a custom test application that only encrypts the values retrieved from the various parties, with time measured using Java's System.currentTimeMillis() function. A scale variable was used to scale up the values, in order to determine whether higher values would yield different timings.
Fig. 4. Results from measuring time taken for encrypting the values.
The results are plotted in Fig. 4. Figure 4 shows that the time taken to encrypt the data remains constant regardless of how the values are scaled. This implies that the secret sharing implementation of BGW in FRESCO is a constant-time implementation, which is quite efficient, as the time taken is also quite low.
7.2 Measuring Time Taken for Computing Addition
In this experiment, the potential overhead of operations in FRESCO's BGW implementation was evaluated. For this experiment, the addition operation was selected for evaluation. The experiment also evaluates and verifies whether more addition operations would cause an exponential overhead during computation.
The experiment was conducted with a custom application that performs addition according to Algorithm 3. In addition, another application was created to compute simple addition using the same idea as Step 3 of Algorithm 3, but in a non-secure, traditional way in Java, for comparison and to give context to the application's timings. A scaling factor of 10 was used to increase the number of additions for the experiment.
Algorithm 3. Addition Method for Experimentation
Require: Number of times addition is to be done, defined as rounds.
1: Invoke Stage_input to get a value val to compute addition on.
2: Start timer.
3: Create a circuit C based on the number of rounds specified.
4: This is done using a for loop, assigning val = val + val for each round.
5: Execute circuit C.
6: Stop timer and print out the time taken.
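For reference, a plain (non-secure) Java counterpart to the loop in Step 3 of Algorithm 3 might look like the sketch below. It is only the baseline comparison measurement mentioned above, not the FRESCO/BGW version, and the round counts simply mirror the scaling factor of 10.

```java
// Non-secure baseline: repeated additions timed the same way as the BGW
// experiment (val = val + val, "rounds" times), for comparison only.
// Overflow of the long value is irrelevant here; only the timing matters.
public class PlainAdditionBaseline {
    static long sink;   // prevents the JIT from discarding the loop entirely

    static long timeAdditions(int rounds) {
        long val = 12345L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < rounds; i++) {
            val += val;   // same shape as Step 4 of Algorithm 3
        }
        sink = val;
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        for (int rounds = 10; rounds <= 1_000_000; rounds *= 10) {
            System.out.println(rounds + " additions: " + timeAdditions(rounds) + " ms");
        }
    }
}
```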
The results of the experiment are plotted in Fig. 5. The results show that the time taken appears to grow exponentially with the number of additions. Compared to traditional addition in Java, the time overhead of addition in BGW is noticeably higher, but still efficient. Efficiency only drops off at 1 million addition operations, an upper bound that is hard to reach for a conventional application.
Fig. 5. Results from measuring time taken for computing addition.
8
Conclusion and Future Work
The aim of this paper was to implement an MPC scenario and use it to evaluate the practicality of existing MPC frameworks. Encryption over MPC is done within 2 to 2.5 s. Up to 10,000 addition operations, the MPC system performs very well, and 10,000 additions will be sufficient for most applications. We believe that this aim was sufficiently achieved, but more work can be done to evaluate more
frameworks and their implementations. Secure Multiparty Computation is a field that has indirect links to other areas of cryptography. We believe that elements of MPC, like secret sharing, can be used in tandem with other cryptographic techniques, such as key management schemes, to enhance them. In addition, MPC is based on homomorphic properties. The recent advances in Fully Homomorphic Encryption (FHE) are also a potential avenue of further study.
Practically Realisable Anonymisation of Bitcoin Transactions with Improved Efficiency of the Zerocoin Protocol Jestine Paul1(B) , Quanqing Xu1 , Shao Fei2 , Bharadwaj Veeravalli2 , and Khin Mi Mi Aung1 1 Data Storage Institute, A*STAR, Singapore, Singapore {jestine-paul,Xu Quanqing,Mi Mi AUNG}@dsi.a-star.edu.sg 2 National University of Singapore, Singapore, Singapore
[email protected],
[email protected]
Abstract. As transaction records are public and unencrypted, Bitcoin transactions are not anonymous and the privacy of users can be compromised. This paper has explored several methods of making Bitcoin transactions anonymous, and Zerocoin, a protocol that anonymises transactions based on Non-Interactive Zero-Knowledge proofs, is identified as a promising option. Although theoretically sound, the Zerocoin research has two shortcomings: (1) Zerocoin transactions are vastly inefficient compared to Bitcoin transactions in terms of verification time and size; and (2) despite this inefficiency, the protocol has not been tested in an actual Bitcoin network to validate its practicality. This paper addresses these two problems by first making performance improvements to the Accumulator Proof of Knowledge (AccPoK) and Serial Number Signature of Knowledge (SNSoK) in the Zerocoin protocol, and then integrating both the original and improved protocol into the Bitcoin client software to evaluate their performances in a Bitcoin network. Results show that the improved Zerocoin protocol reduces the verification time and size of the SNSoK by 80 and 60 times, respectively, and reduces the size of the AccPoK by 25%. These translate to a 3.41 to 6.45 times reduction in transaction latency and a 2.5 times reduction in block latency in the Bitcoin network. Thus, with the improved Zerocoin protocol, anonymising Bitcoin transactions has become more practical. Keywords: Anonymisation
· Bitcoin · Zerocoin · AccPoK · SNSoK
1
Introduction
Privacy is an important requirement in any banking system. Users typically do not want their transaction histories to be exposed for a variety of legitimate reasons. While traditional banking services offer privacy by keeping transaction records confidential, the centralised records can be exposed maliciously or under the coercion of authorities. Since real-world identities are tied to transaction
records for accountability, privacy is completely destroyed when the records are exposed. Bitcoin is a cryptocurrency that addresses some of the privacy problems of traditional banking. It provides pseudonymity through the use of arbitrary public key addresses as identities during payments. This has partly driven the immense popularity of Bitcoin, which has a market capitalisation of USD$10 billion and a daily transaction volume of USD$50 million as of October 20161 . However, due to the decentralised nature of Bitcoin, transaction records are unencrypted and publicly available. This exposes Bitcoin transactions to scrutiny from the public and allows the arbitrary public key addresses used in transactions to be de-anonymised. Pseudonymity refers to the case of a user using an identity that is not his real identity [1]. Bitcoin achieves pseudonymity as payments are made with public key addresses instead of real-world identities. These addresses are generated randomly, and a user can generate as many public key addresses as he wants to further hide his identity. Thus, it is impossible to tell who a public key address belongs to given only the information of the public key address. However, pseudonymity alone in Bitcoin does not achieve privacy as public key addresses can still be linked to real-world identities given other information available on the blockchain. Due to the public nature of the blockchain, once a real-world identity is linked to a public key address, all the transactions in the past, present and future using that address can be linked to that real-world identity. To achieve privacy in Bitcoin, the anonymity property needs to be satisfied. Anonymity in the context of Bitcoin requires pseudonymity and unlinkability [1]. Unlinkability refers to the case that it is hard to: (1) link different public key addresses of the same user (2) link different transactions made by the same user (3) link a payer to the payee If the above properties are satisfied, Bitcoin can be truly anonymous and transactions can be fully private. Another term that is relevant to anonymity is the anonymity set [1]. An anonymity set in Bitcoin is the set of transactions in which an adversary cannot distinguish a particular transaction from. In other words, even if an adversary knows that a transaction is linked to a particular user, he cannot identify which transaction it is if that transaction is in its anonymity set. An anonymity set containing one transaction essentially implies no anonymity, while an anonymity set containing all the transactions on the blockchain implies full anonymity. Thus, the level of anonymity of a transaction increases with the size of the anonymity set that it is in. This paper has explored Zerocoin [2], a protocol that anonymises transactions based on Non-Interactive Zero-Knowledge proofs. Although theoretically sound, the Zerocoin protocol has two disadvantages. This paper has extended the work by the original Zerocoin authors by making two key contributions: (1) 1
https://coinmarketcap.com/.
an improved Zerocoin protocol that reduces the verification time and size of Spend transactions; and (2) an evaluation of the performance of both the original and the improved Zerocoin protocol in an actual Bitcoin network. The first contribution is achieved by modifying the NIZK proofs (SNSoK and AccPoK) required to verify the Spend transaction. For the SNSoK, both verification time and proof size are reduced by compressing the 80 rounds of verification, which use 80 sets of proof variables, into a single round of verification using one set of proof variables. For the AccPoK, proof size is reduced by replacing a large proof variable with its hash and using the hash for verification. The second contribution is achieved by integrating the key Zerocoin operations of transaction creation and verification into the Bitcoind client and setting up a Bitcoin network with nodes running the modified Bitcoind clients. Transaction and block latencies under the improved and original Zerocoin protocols are then compared to assess the performance improvements brought by the improved protocol. The rest of the paper is organized as follows. Section 2 describes related work. Section 3 presents mixing through Altcoins: Zerocoin and Zerocash. Section 4 proposes the improvements for the Zerocoin protocol, explains how they can result in better performance, and justifies their theoretical soundness. Section 5 discusses the experiments conducted to evaluate the implementation. Section 6 concludes this paper and outlines potential future work.
2
Related Work
In this section, we examine the methods that have been developed to increase the anonymity of Bitcoin transactions. The focus of analysis is on evaluating the practicality of the various methods. 2.1
Mixing
In general, the key to achieving anonymity in Bitcoin is to remove the ability to link a payer to a payee. This immediately satisfies the third property of anonymity that it should be hard to link a payer to a payee. A user can also take advantage of this and make an anonymous payment to himself. Since the public key address that he uses to make the payment cannot be linked to the public key address that he uses to receive the payment, the user can transact with the latter public key address on a clean slate. Hence, the first and second properties of anonymity that it should be hard to link public key addresses and transactions to the same user are also achieved. Mixing is a concept that removes the ability to link a payer to a payee. All methods to anonymise transactions adopt mixing in one way or another. To perform mixing, payers make payments via an intermediary called the mixer. The mixer collects payments from payers and pays each intended payee with a transaction whose inputs do not refer to outputs that contain the public key addresses of the actual payer. In this way, observers of the blockchain are only able to see transactions from the payers to the mixer and from the mixer to
the payees, but cannot tell which payer paid to which payee. The size of the anonymity set is the number of payers participating in the mix, since an observer only knows that a payer participated in mixing, but is unable to tell who the payer paid to among those who received payments from the mixer. In order for mixing to work, the amounts that are being transacted via a mixer should be constant for all participants [1]. If different payers pay different amounts to the mixer, and subsequently the mixer pays the corresponding amounts to the payees, one can link payers and payees by matching the payment amounts between the payers and the mixer and those between the mixer and the payees. 2.2
Traditional Mixing
(1) Dedicated Mixing Services: Dedicated mixing services have been commercially implemented in Bitcoin. Services such as Bitcoin Fog and BitLaundry routinely handle 6-digit dollar amounts everyday [3]. These services act as centralised mixers, where participants make payments to the services and instruct them to transfer the funds to specific addresses. However, dedicated mixing services face some major drawbacks. First, dedicated mixing services need to keep the payment records of their users in order to transfer funds to the correct payees. Thus, privacy is built on the trust that mixing services will not disclose these records. This runs against principle of decentralisation of Bitcoin and cryptocurrencies in general [1]. Second, since mixers are used for payments, users typically require funds to be transferred immediately to the intended payees. Thus, mixers can only work with a limited pool of participants who happen to make payments at the same time. This limits the size of the anonymity set in the mixing protocol, and lowers the level of anonymity [1]. By exploiting this weakness and analysing the mixing patterns adopted by the dedicated mixers, transactions in dedicated mixes can be de-anonymised in various ways [3]. (2) Decentralised Mixing: Decentralised mixing adopts a peer-to-peer mixing protocol. One such protocol proposed is Coinjoin2 , which gathers payers who wish to participate in mixing to collectively construct a single transaction. Each input in the Coinjoin transaction refers to the funds from each payer, while each output in the transaction pays to the public key address of each intended payee. The inputs and outputs are ordered randomly in the transaction, and an observer cannot tell which input (payer) corresponds to which output (payee). Here, the mixer is the transaction itself, and the size of the anonymity set is the number of participants in the transaction. Although Coinjoin achieves decentralisation, it still faces the same problem as dedicated mixing services of having a limited anonymity set, as the protocol is carried out with the few payers who want to make payments at the same time. In addition, since a Coinjoin transaction is collectively constructed, the protocol faces denial-of-service (DoS) attacks where an adversary who initially agrees to 2
https://bitcointalk.org/index.php?topic=279249.0.
participate in a Coinjoin transaction refuses to complete his part in the creation of the transaction [1].
3
Mixing Through Altcoins: Zerocoin and Zerocash
In this section, we argue the case for Zerocoin as having the most practical potential to be an anonymous transaction protocol and justify the efforts this research makes to further improve on the protocol. As seen in Sects. 2.1 and 2.2, traditional mixing techniques in Bitcoin that involve the shuffling of payers and payees do not work well in either centralised or decentralised form. There is also little room for improvement for these techniques, as they are limited to working with Bitcoin transactions. On the contrary, Altcoins are in a more favourable position to provide better privacy than the traditional mixing techniques. Altcoins can be seen as upgrades to the current Bitcoin protocol that cater for anonymous transactions. The two popular Altcoins that are designed to provide anonymous transactions are Zerocoin [2] and Zerocash [4]. As opposed to the traditional mixers, Zerocoin and Zerocash provide anonymity based on cryptographic guarantees. 3.1
Zerocoin
(1) Protocol: Zerocoin extends Bitcoin by incorporating another type of anonymous currency called the “Zerocoin” into the Bitcoin protocol. Payments in Bitcoin involve transferring of Bitcoins via transactions recorded on the blockchain. For each transaction in a block, payments are made using Bitcoins that have been paid to the payer by transactions in previous blocks. The high level flow of Zerocoin is also similar. In addition to making payments in Bitcoins via normal Bitcoin transactions, users can use a Bitcoin to mint an anonymous “Zerocoin” (referred to as coin where unambiguous) via a Mint transaction. A coin is simply a value c, which is a cryptographic commitment of a serial number S and a secret trapdoor r. S and r are only known to the user who minted the coin. The property of c is that both S and r are needed to obtain the value of c, and it is hard to compute the S or r from c. Like a Bitcoin transaction, a Mint transaction contains inputs that specify the Bitcoins (outputs in previous transactions) that are used to mint the coin. However, its output does not contain any public key address, but simply the value c of the coin that has been minted. Once the input in a Mint transaction is verified to be valid by network, it is published on the blockchain and the value c of the minted coin is added to a public data structure called the accumulator. A minted coin is spent by redeeming the coin back to Bitcoins using a Spend transaction. Similar to a Bitcoin transaction, the Spend transaction contains an input that proves the existence and ownership of the spent coin, and an output that specifies the public key address to pay the redeemed Bitcoins to. To prove the existence and ownership of the spent coin in the Spend transaction, the spender must show:
(a) The value c of the coin has been added to the accumulator to show that the coin has been minted. (b) The knowledge of the serial number S and random trapdoor r that produced c to show that the spender minted the coin. Proving the above involves revealing only the serial number S that is committed in c and not c itself in the Spend transaction. Such proofs are termed Non-Interactive Zero-Knowledge (NIZK) proof. Once the proofs are verified, the Spend transaction gets published on the blockchain and the output in the transaction can be subsequently spent by its owner. To prevent the spent coin from being spent again, the S disclosed in Spend transactions are recorded on a separate public table. Any Spend transaction that spends a coin with a serial number listed on this public table will be rejected. A simplified example illustrating the relationships between normal Bitcoin transactions, Mint transactions and Spend transactions is shown in Fig. 1.
Fig. 1. Relationship between Bitcoin and Zerocoin transactions
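The Mint and Spend transactions in Fig. 1 revolve around the coin commitment c described above. The toy sketch below, with deliberately tiny and purely illustrative parameters (it is not libzerocoin code), shows the idea: Mint publishes only c = g^S · h^r mod p, while Spend reveals the serial number S, which is checked against the public table of spent serials; the real protocol additionally proves knowledge of r and membership of c in the accumulator in zero knowledge.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

// Toy illustration of the Mint/Spend commitments in Fig. 1.
// The group parameters below are tiny and for illustration only;
// Zerocoin uses much larger, carefully generated parameters.
public class ToyCoin {
    static final BigInteger P = BigInteger.valueOf(0xFFFFFFFBL); // illustrative prime modulus
    static final BigInteger G = BigInteger.valueOf(5);
    static final BigInteger H = BigInteger.valueOf(7);

    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        Set<BigInteger> spentSerials = new HashSet<>();          // public double-spend table

        // Mint: commit to a random serial S with trapdoor r; only c is published.
        BigInteger S = new BigInteger(30, rnd);
        BigInteger r = new BigInteger(30, rnd);
        BigInteger c = G.modPow(S, P).multiply(H.modPow(r, P)).mod(P);
        System.out.println("published coin commitment c = " + c);

        // Spend: reveal S only; reject if this serial has already been spent.
        if (spentSerials.contains(S)) {
            System.out.println("rejected: serial number already spent");
        } else {
            spentSerials.add(S);
            System.out.println("accepted spend of serial S = " + S);
        }
    }
}
```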
As seen in Fig. 1, Zerocoin is in fact a mixing protocol, where the accumulator acts as the intermediary. When a new Mint transaction is published on the blockchain, the value c of the minted coin specified in the Mint transaction is mixed with the rest of the previously minted coins in the accumulator. When a Spend transaction is published on the blockchain, the only information available to the public is that one of the previously minted coins has been spent, as the value c of the spent coin is not revealed. Even though the serial number S of the spent coin is revealed in a Spend transaction to prove the legitimacy of the spent coin, the secret trapdoor r is private to the coin's minter at all times. An observer who only knows S cannot reconstruct c and determine which coin is being spent, and thus cannot link the spender (payee) to the minter (payer). Thus, the anonymity set in Zerocoin is all of the minted and unspent coins in history, which is much larger than the anonymity sets in the traditional
mixing protocols (Sect. 2.2). Therefore, Zerocoin achieves a much higher level of anonymity compared to these protocols. (2) Drawbacks: However, Zerocoin is far from perfect. Some of the drawbacks of Zerocoin are listed below: (a) Coins can only have fixed denominations, making payments less flexible. This is to prevent observers of the blockchain from linking Mint transactions and Spend transactions by matching the payment amounts in the transactions. This is largely similar to why transaction amounts in traditional mixing protocols must be uniform for all participants (Sect. 2.2). (b) There is no provision to transfer a coin to a payee directly, and actual payments still need to be done in Bitcoins. Every time a payer wants to use a coin to make an anonymous payment, he needs to redeem the coin to Bitcoins via a Spend transaction, and pay the redeemed Bitcoins to the payee. If the payee wants to make an anonymous payment again using the received Bitcoins, he must first mint some coins via a Mint transaction and then spend them via a Spend transaction. Thus, coins need to be constantly minted and spent when making multiple anonymous transactions. This makes anonymising transactions cumbersome. (c) The payment amounts, payment addresses and the timings of Zerocoin transactions are still publicly available on the blockchain. This means that Zerocoin transactions are still exposed to de-anonymisation through side channels. (d) Spend transactions are much larger in size (about 26 KB) [2] as compared to Bitcoin transactions (less than 1 KB) due to the large size of the NIZK proofs that are used to prove the existence and ownership of coins. This is undesirable in a cryptocurrency network. First, since all nodes in the network maintain a copy of the blockchain, large transactions impose higher storage requirements for nodes. Second, and more importantly, large transactions and blocks containing large transactions also take a longer time to propagate through the network, which slows down the confirmation of transactions and blocks and compromises the overall performance of the network. (e) Spend transactions take much longer to verify (about 300 ms) [2] as compared to Bitcoin transactions (a few milliseconds). Since each node needs to verify a transaction or block before relaying it to its peers, a long transaction verification time also slows down the propagation of transactions and blocks in the network and compromises the overall performance of the network. 3.2
Zerocash
(1) Features and Capabilities: Zerocash is purported to be an improvement of Zerocoin. It is built on the same principles as Zerocoin, where a Mint transaction mints some coins and a Pour transaction (similar to Spend transaction in Zerocoin) spends a minted coin without disclosing which coin is spent. However, instead of using NIZK proofs to prove the existence and ownership of coins, Zerocash uses the zero-knowledge Succinct Non-interactive Arguments of Knowledge
(zk-SNARKs) [5]. As zk-SNARKs is extremely complex and a relatively new field, this paper is not able to describe how the proof works. However, knowing the features and capabilities of Zerocash is sufficient to evaluate its strengths and weaknesses. Indeed, Zerocash boasts several improvements from Zerocoin in terms of performance and features. (2) Drawbacks: Though seemingly superior, Zerocash also has the following weaknesses when compared to Zerocoin: (a) The theoretical foundation of Zerocash has not been proven to be sound. zk-SNARKs is a relatively new area of research and its theories have not been used in practice as of 2015 [1]. In contrast, the NIZK proofs used by Zerocoin, which are based on RSA cryptography, have been widely tested and proven for many years. (b) Zerocash requires a huge set of parameters (over 1 GB) for the zk-SNARKs proofs. This places a huge storage constraint for anyone who wishes to use Zerocash, and makes deployment of Zerocash on mobile devices challenging. In contrast, Zerocoin only requires parameters of about 2.5 KB for its NIZK proofs. (c) The construction of the zk-SNARKs proofs in Zerocash is computationally expensive. Specifically, it takes 2 min to construct the proofs in a Pour transaction on a high-end machine [4]. This means that a user of Zerocash needs to wait for 2 min before he can create a transaction on a high-end PC. This makes Zerocash impractical for devices with lower processing power. In contrast, only 0.7 s is required to construct the NIZK proofs in a Spend transaction on a machine that has similar processing power [2]. 3.3
Justifications for Choosing to Improve on the Zerocoin Protocol
Although Zerocash definitely has much more to offer in terms of privacy and features, Zerocoin is still deemed to be a more practical protocol that warrants the efforts for enhancement. Zerocash requires large storage for its parameters and an extremely long time to construct transactions even on a high-end machine. Thus, it may face challenges when running on machines that increasingly consist of mobile devices. The advantages of Zerocoin in these two areas make up for its shortcomings in transaction size and verification time. In terms of size, the parameters of Zerocoin are 400,000 times smaller than that of Zerocash, while Spend transactions are only about 90 times larger than Pour transactions. In terms of computational time, Spend transactions take about 170 times faster to construct and only about 50 times slower to verify than Pour transactions. Thus, with significant advantages in the size of parameters and the construction time of Spend transactions, Zerocoin can be a more practical protocol than Zerocash if moderate improvements are made to the size and verification time of Spend transactions. In addition, the technology behind Zerocash is also not mature. Literature on zk-SNARKs is both sparse and complex, while literature on NIZK proofs
are readily available and well documented3 . This makes research on Zerocoin more productive as compared to Zerocash. Moreover, the theoretical basis for Zerocash is uncertain due to the lack of implementation of zk-SNARKs in reallife applications, while the NIZK proofs used in Zerocoin have been extensively tested and proven. Most importantly, the full anonymity provided by Zerocash may be undesirable. This is because full anonymity removes a key feature of the blockchain, which is that the network can collectively audit the public transactions. If payment amounts in transactions are hidden, any bugs or attacks that cause coins to be wrongly minted will go unnoticed in the blockchain. This can cause hyperinflation of Zerocash and render the system dysfunctional. Thus, as promising as it seems, Zerocash has its caveats and its implications as a real-life cryptocurrency are unknown. The shortcomings of Zerocoin lie mainly in the large size and long verification time of Spend transactions. If these weaknesses are addressed, Zerocoin can deliver a level of performance that is comparable to Zerocash. Hence, the rest of this paper is devoted to analysing Zerocoin in detail and finding ways to improve its performance, so that it can be a more practical anonymous transaction protocol.
4
Improvements to the Zerocoin Protocol
In this section, we present the design of the improvements and explains how they can lead to a more efficient Zerocoin protocol. The theoretical soundness of the improvements is also justified. Finally, this section describes how the improvements have been integrated into the current implementation of the Zerocoin protocol. 4.1
A More Efficient Serial Number Signature of Knowledge
(1) Problem with Original SNSoK: The most problematic NIZK proof out of the three proofs in Zerocoin is the SNSoK as it has the biggest size and takes the longest time to verify. The inefficiency in the SNSoK is mainly due to the proof requiring 80 iterations of the same proof process, which means that 80 sets of proof variables are generated and verified in one SNSoK. This differs from the standard scheme of the Fiat-Shamir heuristic, where verification of an NIZK proof only requires a single iteration. (2) Design of improved SNSoK: This paper proposes a more efficient SNSoK that reduces both the verification time and the size of the proof. The improved SNSoK is adapted from the ZK proof for a committed value in a Pedersen commitment outlined by Hohenberger [6], and made non-interactive by applying the standard Fiat-Shamir Heuristic. The improved proof is verified in a single iteration as opposed to the 80 iterations required by the original proof. 3
https://zcoin.io/zh/zcoin-and-zcash-similarities-and-differences/.
The improved SNSoK starts off in the same way as the original SNSoK. In order to hide the value c of the coin that is being proven, the prover creates a Pedersen commitment y of c using a random trapdoor $z \in \mathbb{Z}_{q_{SoK}}$, such that

$$ y = g_{SoK}^{c} h_{SoK}^{z} = g_{SoK}^{g_{comm}^{S} h_{comm}^{r}} h_{SoK}^{z}. $$

Also like in the original proof, the improved SNSoK proves that y contains a c that is a commitment of the serial number S and trapdoor r, without revealing r, in the following form:

$$ ZKSoK[m]\{(c, r, z) : y = g_{SoK}^{g_{comm}^{S} h_{comm}^{r}} h_{SoK}^{z}\} $$

However, the improved SNSoK does not require 80 sets of $(t, s, s')$ values to be included as proof variables in the proof. Instead, the prover first computes a single $t = g_{SoK}^{g_{comm}^{S} h_{comm}^{v_1}} h_{SoK}^{v_2}$, where $v_1 \in \mathbb{Z}_{q_{comm}}$ and $v_2 \in \mathbb{Z}_{q_{SoK}}$. Following the principle of the Fiat-Shamir heuristic, the prover then computes the challenge value $C = H(y \| S \| g_{SoK} \| h_{SoK} \| t \| m)$, which also doubles up as the signature for the transaction contents m like in the original SNSoK. Next, the prover computes a single $s = h_{comm}^{v_1} - C h_{comm}^{r}$ and $s' = v_2 - Cz$. The single set of $y, t, s, s'$, the serial number S of the spent coin and the transaction content m are written into the Spend transaction as proof variables to be verified. Upon receiving the Spend transaction, the verifying node recomputes $C = H(y \| S \| g_{SoK} \| h_{SoK} \| t \| m)$ using the received $y, t, m, S$ and the public parameters $g_{SoK}, h_{SoK}$. The proof is verified if and only if $t = y^{C} g_{SoK}^{g_{comm}^{S} s} h_{SoK}^{s'}$. Since only a single set of $(t, s, s')$ is included in the improved SNSoK, as opposed to the 80 sets of $(s, s')$ included in the original SNSoK, the size of the improved SNSoK should be significantly smaller. Similarly, the time needed to verify the improved SNSoK should be shorter by about 80 times, since verification is done in one iteration in the improved SNSoK as opposed to 80 iterations in the original SNSoK.
(3) Size Optimisation: Verifiers of the original SNSoK check $t_i = t'_i$ indirectly by checking $C = C'$. This reduces the size of the proof, as it includes only one 256-bit C instead of 80 1,024-bit $t_i$. In contrast, the improved SNSoK outlined earlier includes t in the proof and requires verifiers to check for $t = t'$ directly, where $t' = y^{C} g_{SoK}^{g_{comm}^{S} s} h_{SoK}^{s'}$. The size of the improved SNSoK can be further reduced by adopting the indirect checking scheme of the original SNSoK. As such, the improved SNSoK is modified slightly to optimise for size. Instead of $(t, s, s')$, the prover includes $(C, s, s')$ as the proof variables in the Spend transaction. Upon receiving the Spend transaction, the verifying node computes $t' = y^{C} g_{SoK}^{g_{comm}^{S} s} h_{SoK}^{s'}$ using the received $(C, s, s')$. The verifying node further computes $C' = H(y \| S \| g_{SoK} \| h_{SoK} \| t' \| m)$. The proof is verified if and only if $C' = C$. Since $C = H(y \| S \| g_{SoK} \| h_{SoK} \| t \| m)$, checking for $C' = C$ achieves the same effect as checking for $t' = t = y^{C} g_{SoK}^{g_{comm}^{S} s} h_{SoK}^{s'}$. This indirect checking scheme further reduces the size of the SNSoK by including the 256-bit C instead of the 1,024-bit t in the proof.
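To make the single-round, size-optimised check concrete, the following toy sketch walks through the prover and verifier steps with deliberately tiny parameters. It is not libzerocoin code: the challenge is truncated, the transaction contents m are a fixed string, and the subgroup and range checks of the real proof are omitted.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy, self-contained illustration of the single-round, size-optimised SNSoK check.
// Parameters are tiny and chosen only so the arithmetic goes through.
public class ToySnsok {
    // "SoK" group: p = 2q + 1, with generators of the order-q subgroup
    static final BigInteger P = BigInteger.valueOf(2039);
    static final BigInteger Q = BigInteger.valueOf(1019);
    static final BigInteger G_SOK = BigInteger.valueOf(4);
    static final BigInteger H_SOK = BigInteger.valueOf(9);
    // commitment group lives modulo q
    static final BigInteger G_COMM = BigInteger.valueOf(3);
    static final BigInteger H_COMM = BigInteger.valueOf(5);

    static BigInteger hash(BigInteger... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (BigInteger part : parts) md.update(part.toString().getBytes(StandardCharsets.UTF_8));
        md.update("spend-tx-contents".getBytes(StandardCharsets.UTF_8)); // stand-in for m
        return new BigInteger(1, md.digest()).mod(Q);  // the real protocol keeps the full 256-bit C
    }

    public static void main(String[] args) throws Exception {
        BigInteger S = BigInteger.valueOf(123);  // serial number
        BigInteger r = BigInteger.valueOf(45);   // coin trapdoor
        BigInteger z = BigInteger.valueOf(678);  // trapdoor for y

        // coin c = g_comm^S * h_comm^r (mod q) and y = g_SoK^c * h_SoK^z (mod p)
        BigInteger c = G_COMM.modPow(S, Q).multiply(H_COMM.modPow(r, Q)).mod(Q);
        BigInteger y = G_SOK.modPow(c, P).multiply(H_SOK.modPow(z, P)).mod(P);

        // Prover: one commitment t, one challenge C, responses s and sPrime
        BigInteger v1 = BigInteger.valueOf(77), v2 = BigInteger.valueOf(500);
        BigInteger tExp = G_COMM.modPow(S, Q).multiply(H_COMM.modPow(v1, Q)).mod(Q);
        BigInteger t = G_SOK.modPow(tExp, P).multiply(H_SOK.modPow(v2, P)).mod(P);
        BigInteger C = hash(y, S, G_SOK, H_SOK, t);
        BigInteger s = H_COMM.modPow(v1, Q).subtract(C.multiply(H_COMM.modPow(r, Q))).mod(Q);
        BigInteger sPrime = v2.subtract(C.multiply(z)).mod(Q);
        // the Spend transaction carries (C, s, sPrime) together with S

        // Verifier: recompute t' and C' and compare hashes (indirect check)
        BigInteger exp = G_COMM.modPow(S, Q).multiply(s).mod(Q);
        BigInteger tPrime = y.modPow(C, P)
                .multiply(G_SOK.modPow(exp, P))
                .multiply(H_SOK.modPow(sPrime, P)).mod(P);
        BigInteger cPrime = hash(y, S, G_SOK, H_SOK, tPrime);
        System.out.println("proof verifies: " + cPrime.equals(C));
    }
}
```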
(4) Zero-knowledge Properties of Improved SNSoK: In order for the improved SNSoK to be valid, it must satisfy the three zero-knowledge proof properties.
(a) Completeness. The completeness property of the improved SNSoK can be shown by demonstrating that the proof produced by an honest prover who knows S, r and z can always be verified. Hence it suffices to show that the verification formula is correct. To prove that $t = y^{C} g_{SoK}^{g_{comm}^{S} s} h_{SoK}^{s'}$, the right-hand side can be expanded as such:

$$
\begin{aligned}
y^{C} g_{SoK}^{g_{comm}^{S} s} h_{SoK}^{s'}
&= \left(g_{SoK}^{g_{comm}^{S} h_{comm}^{r}} h_{SoK}^{z}\right)^{C}
   g_{SoK}^{g_{comm}^{S}\left(h_{comm}^{v_1} - C h_{comm}^{r}\right)} h_{SoK}^{v_2 - Cz} \\
&= g_{SoK}^{C g_{comm}^{S} h_{comm}^{r}} h_{SoK}^{Cz}\,
   g_{SoK}^{g_{comm}^{S} h_{comm}^{v_1} - C g_{comm}^{S} h_{comm}^{r}} h_{SoK}^{v_2 - Cz} \\
&= g_{SoK}^{g_{comm}^{S} h_{comm}^{v_1}} h_{SoK}^{v_2} \\
&= t
\end{aligned}
$$

As seen, the verification equation is indeed correct and the improved SNSoK is complete.
(b) Soundness. The soundness property of the improved SNSoK can be shown by demonstrating that a hypothetical knowledge extractor can obtain c from two accepting proofs that use the same t but two different sets of $(C, s, s')$. To show that this is true, let the two different sets of $(C, s, s')$ be $(C_1, s_1, s'_1)$ and $(C_2, s_2, s'_2)$. Since the t in the two accepting proofs is the same, the following equality is satisfied:

$$ t = y^{C_1} g_{SoK}^{g_{comm}^{S} s_1} h_{SoK}^{s'_1} = y^{C_2} g_{SoK}^{g_{comm}^{S} s_2} h_{SoK}^{s'_2} $$

By rearranging the equation, y can be expressed in the following form:

$$ y^{C_1 - C_2} = g_{SoK}^{g_{comm}^{S}(s_2 - s_1)} h_{SoK}^{s'_2 - s'_1}
\quad\Longrightarrow\quad
y = g_{SoK}^{\frac{g_{comm}^{S}(s_2 - s_1)}{C_1 - C_2}} h_{SoK}^{\frac{s'_2 - s'_1}{C_1 - C_2}} $$

Since $y = g_{SoK}^{c} h_{SoK}^{z}$, the knowledge extractor can successfully determine that $c = \frac{g_{comm}^{S}(s_2 - s_1)}{C_1 - C_2}$. Thus, the success of the knowledge extractor in extracting the secret coin value c shows that the improved SNSoK is sound.
(c) Statistical Zero-Knowledge. The statistical zero-knowledge property of the improved SNSoK can be shown by demonstrating that, given a challenge C, a hypothetical simulator can generate s and s' that are statistically indistinguishable from the s and s' computed by the prover. First, it can be shown that the s and s' generated by the prover are random values. Since $h_{comm}$ is the generator of the subgroup of order $q_{comm}$, $h_{comm}^{x}$ produces every element of the subgroup exactly once for x from 1 to $q_{comm}$. Since $v_1 \in \mathbb{Z}_{q_{comm}}$, there is a one-to-one mapping between $v_1$ and $h_{comm}^{v_1}$, and $h_{comm}^{v_1}$ is also random. As a random value added to a constant still produces a random value, $s = h_{comm}^{v_1} - C h_{comm}^{r}$ is also random. By the same argument, $s' = v_2 - Cz$ is also random, as $v_2$ is random. As such, the simulator can simply be a random value generator that produces random values s and s'. In order for the ranges of the s and s' generated by the simulator and the s and s' computed by the prover to match, the simulator should generate s and s' such that $s \in \mathbb{Z}_{q_{comm}}$ and $s' \in \mathbb{Z}_{q_{SoK}}$. Using this method, the simulator can generate random s and s' that are statistically indistinguishable from those computed by the prover. Thus, the improved SNSoK achieves statistical zero-knowledge. 4.2
A Smaller Accumulator Proof of Knowledge
(1) Problem with Original AccPoK: The AccPoK is the other NIZK proof in Zerocoin that is relatively inefficient. This is mainly because the AccPoK contains 7 sub-proofs that must all be verified. While Camenisch and Lysyanskaya [7] described the steps of the proof in their work, the reasons behind the sub-proofs are not explained in detail. To revisit the AccPoK, the proof statement is shown:

$$
NIZKPoK\{(c, \beta, \gamma, \delta, \varepsilon, \zeta, \varphi, \psi, \eta, \sigma, \xi) :\;
C_c = g^{c} h^{\varphi} \,\wedge\, g = (C_c/g)^{\gamma} h^{\psi} \,\wedge\,
C_r = g_{QRN}^{\varepsilon} h_{QRN}^{\zeta} \,\wedge\,
C_c = g_{QRN}^{c} h_{QRN}^{\eta} \,\wedge\,
A = C_w^{c} (1/h_{QRN})^{\beta} \,\wedge\,
1 = C_r^{c} (1/h_{QRN})^{\delta} (1/g_{QRN})^{\beta} \,\wedge\, c \in [A, B]\}
$$

$C_c = g^{c} h^{\varphi}$ is a sub-proof that proves $C_c$ contains the coin value c, $A = C_w^{c}(1/h_{QRN})^{\beta}$ is a sub-proof that proves the membership of c in the accumulator A, and $c \in [A, B]$ is a sub-proof that proves c is within the allowed range. However, this paper is unable to understand the other sub-proofs and cannot make any major modification to the AccPoK to improve its performance. Nevertheless, this paper is able to reduce the size of the AccPoK without altering the proof scheme. The problem addressed by the improved AccPoK is the large number of proof variables that are included in the proof. The following proof variables are included in the original AccPoK: $s_\alpha, s_\beta, s_\zeta, s_\sigma, s_\eta, s_\varepsilon, s_\delta, s_\xi, s_\varphi, s_\gamma, s_\psi, t_1, t_2, t_3, t_4, t_5, t_6, t_7$. When verifying the original AccPoK, the verifier essentially computes a set of $t'_i$ using the $s_i$ received from the prover and checks if each $t'_i$ equals the corresponding $t_i$ received from the prover. For example, one equality that the verifier checks is whether $t'_1 = C_c^{C} g^{s_\alpha} h^{s_\zeta} = t_1$. The average size of each $s_i$ is 288 bits, while the size of each $t_i$ ranges from 550 to 3,070 bits4. The large amount of proof
Values are obtained by running libzerocoin.
variables and the relatively large size of each variable are the main reasons for the large size of the AccPoK. This problem can be tackled by applying the same size optimisation technique used in the SNSoK. The next section explains the design of the improved AccPoK.
(2) Design of improved AccPoK: Instead of checking the equalities $t_i = t'_i$ directly, the indirect checking technique used in both the original SNSoK and the improved SNSoK (Sect. 4.1) is used to reduce the size of the AccPoK. In the improved AccPoK, the prover constructs the proof in the same way as the original AccPoK, but includes C instead of the 7 $t_i$ as a proof variable in the Spend transaction. Upon receiving the proof, the verifier computes the $t'_i$ like in the original AccPoK, and further computes $C' = H(C_c \| C_c \| C_w \| C_r \| g \| h \| g_{QRN} \| h_{QRN} \| t'_1 \| \cdots \| t'_7)$. The proof is verified if and only if $C'$ equals the C received from the prover. Since $C = H(C_c \| C_c \| C_w \| C_r \| g \| h \| g_{QRN} \| h_{QRN} \| t_1 \| \cdots \| t_7)$, $C' = C$ can only be satisfied if each $t'_i$ equals its corresponding $t_i$. Thus, the improved AccPoK indirectly checks for $t_i = t'_i$, which essentially achieves the same proof scheme as the original AccPoK. As the 7 $t_i$ with sizes ranging from 550 to 3,070 bits in the original AccPoK are replaced by a single 256-bit C in the improved AccPoK, the improved AccPoK should be able to reduce the size of the AccPoK by a few kilobytes. 4.3
Implementation of Improvements
The improved SNSoK and AccPoK are implemented by modifying the code of libzerocoin. Most of the modifications are done in the SerialNumberSignatureOfKnowledge and AccumulatorProofOfKnowledge classes that implement the SNSoK and AccPoK, respectively. These classes contain four main functions: (1) constructor functions that construct the proof variables; (2) serialisation functions that send the proof variables over the network; (3) deserialisation functions that convert the received data back into the proof classes containing the proof variables; and (4) verification functions that verify the proof using the received proof variables. SerialNumberSignatureOfKnowledge and AccumulatorProofOfKnowledge are member variables of the CoinSpend class, which represents the Spend transaction. CoinSpend provides its own constructor, serialisation, deserialisation and verification functions that invoke the respective functions of the SerialNumberSignatureOfKnowledge and AccumulatorProofOfKnowledge classes. Two new classes, SerialNumberSignatureOfKnowledge2 and AccumulatorProofOfKnowledge2, with constructor, serialisation, deserialisation and verification function signatures identical to those of their original proof classes, are created to implement the improved SNSoK and AccPoK, respectively. As the improved proofs have different proof variables and ways of verification, the member variables, constructor and verification functions of SerialNumberSignatureOfKnowledge2 and AccumulatorProofOfKnowledge2 are modified to implement the improved proof schemes. This is done in the same style as in the original proof classes, where modular arithmetic is done using the Bignum class provided by libzerocoin.
To control the version of SNSoK and AccPoK used by CoinSpend, a macro ZKPROOF_VERSION is defined in the header file of libzerocoin to instruct the compiler to compile the desired version of the proofs. In the constructor of CoinSpend, the preprocessor directives #if and #elif check the value of ZKPROOF_VERSION and direct the compiler to compile the code that instantiates the desired SNSoK and AccPoK classes. When ZKPROOF_VERSION is set to 1, the code that instantiates the original SNSoK and AccPoK classes (SerialNumberSignatureOfKnowledge and AccumulatorProofOfKnowledge) is compiled. When ZKPROOF_VERSION is set to 2, the code that instantiates the improved SNSoK and AccPoK classes (SerialNumberSignatureOfKnowledge2 and AccumulatorProofOfKnowledge2) is compiled. As the function signatures provided by both versions of the SNSoK and AccPoK classes are identical, there is no need to change how CoinSpend invokes the functions of the proof classes when switching between the two versions of the proofs. Thus, it is sufficient to control which classes are instantiated in CoinSpend using ZKPROOF_VERSION to control which version of the Zerocoin protocol is compiled into libzerocoin.
5
Performance Evaluation
This section presents and analyses the results for the experiments described in the previous section. The objective of the analysis is to quantify the performance improvements brought by the improved Zerocoin protocol and discuss their implications on the practicality of Zerocoin as an anonymous transaction protocol. 5.1
Experimental Setup
(1) Network Setup: The test network used in the experiments consists of 20 nodes running the modified Bitcoind client. The Bitcoind clients are run in the regtest mode that avoids connection with the actual Bitcoin network. This creates a private network where connections between nodes are fully customizable. The regtest also allows nodes to mine blocks instantly as opposed to requiring them to solve the proof-of-work puzzle5 that is extremely time-consuming without specialised hardware. This allows the experiments to have full control over block creation, which ensures that enough measurements can be collected for block latencies. Due to limitation of resources, only two machines are used for the experiments with each machine running 10 instances of Bitcoind nodes. In order to avoid overloading the machines, the number of Bitcoind instances running on each machine is kept well below the number of cores available. Table 1 shows the specifications of the machines used for the network experiments. 5
A proof-of-work puzzle involves adjusting the nonce field in a block until the hash of the block is under a certain value. Currently, this value is set to a level such that only specialised hardware can solve the puzzle in a cost-effective manner.
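The footnote's description corresponds to the usual hash-below-target search; a toy version (which does not reproduce Bitcoin's actual block header format, double SHA-256 or real difficulty) is sketched below.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy proof-of-work loop: increment a nonce until SHA-256(blockData || nonce)
// falls below a target. Purely illustrative; real Bitcoin uses a far smaller target.
public class ToyProofOfWork {
    public static void main(String[] args) throws Exception {
        String blockData = "previous-hash|merkle-root|timestamp";          // illustrative block contents
        BigInteger target = BigInteger.ONE.shiftLeft(256 - 20);            // require ~20 leading zero bits
        MessageDigest sha = MessageDigest.getInstance("SHA-256");

        long nonce = 0;
        while (true) {
            sha.reset();
            sha.update((blockData + nonce).getBytes(StandardCharsets.UTF_8));
            BigInteger h = new BigInteger(1, sha.digest());
            if (h.compareTo(target) < 0) {      // "hash of the block is under a certain value"
                System.out.println("found nonce " + nonce + " with hash " + h.toString(16));
                break;
            }
            nonce++;
        }
    }
}
```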
Table 1. Machines used for network experiments

Processor            Speed     No. of Cores   RAM      No. of Nodes
Intel Xeon E5-2620   2.40 GHz  24             64 GB    10
Intel Xeon E5-2660   2.20 GHz  32             128 GB   10
(2) Running of Network: Nodes in the test network perform both mining and wallet functions just like nodes in the actual Bitcoin network. Mining functions refer to creating blocks and publishing them onto the blockchain, while wallet functions refer to maintaining an “account” that can make payments by creating transactions. These functions are already built into the Bitcoind client and can be executed via the JSON RPC API using the Bitcoin-cli program. When the test network is first run, nodes cannot make payments as they do not own any funds. To obtain funds, nodes take turns to mine blocks to receive the block rewards and mining fees through the coinbase transactions6 in the blocks. Under the Bitcoin protocol, 100 blocks must be built on the block that contains the coinbase transaction before the output in that coinbase transaction can be spent. After all of the nodes are able to spend the block rewards and mining fees that they have received from mining, the actual experiment starts. Five values of each independent variable are used in the experiments with pzerocoin taking the values of 0%, 25%, 50%, 75%, 100% and λtxrate taking the values of 2 s−1 , 4 s−1 , 8 s−1 , 16 s−1 , 32 s−1 . A pair of values for pzerocoin and λtxrate is used for each experiment. For example, an experiment with pzerocoin = 50% and λtxrate = 4s−1 means that 50% of all transactions created are Zerocoin transactions and 4 transactions are created per second. All combinations of values of pzerocoin and λtxrate are used in the experiments in order to cover all scenarios in the network. Thus, a total of 25 experiments are conducted. Each experiment consists of 10 rounds of transactions creation and block mining. During one round, randomly selected nodes broadcast either a Bitcoin or Zerocoin transaction every certain time interval based on the values of the independent variables set for that experiment. The type of transaction being sent is randomly generated based on the probability pzerocoin and the intervals between transactions is the reciprocal of the Poisson variable with mean λtxrate . After 15 transactions are created and sent in the network, there is no network activity for 25 s before a randomly selected node mines a block and broadcasts it to the network. Another 35 s of no network activity is imposed before the next round of transactions creation and block mining starts. A total of 150 transactions and 10 blocks containing 15 transactions each are processed by the 6
A coinbase transaction is a special transaction in a block that pays the block reward and mining fees to the miner. The block reward is fixed for all blocks at a given point of time while the mining fees are collectively paid by all the other transactions in the block. Block reward and mining fees serve as incentives for miners to invest resources to mine blocks.
network at the end of each experiment. The numbers of blocks and transactions are kept low to reduce the duration of each experiment as 25 experiments need to be conducted. 5.2
Data Collection and Results Processing
An event logging system is incorporated into the Bitcoind client to measure the transaction and block latencies in the test network. Under this system, each node creates a log entry whenever it has sent or verified a transaction or block. At the end of the experiment, the logs contain the transaction and block events that have occurred in every node in the network. As transactions and blocks go through different nodes as they are relayed through the network, there will be multiple log entries for each transaction or block. The first transaction sent or block sent entry occurred when the transaction or block was first created, while the last transaction verified or block verified entry occurred when the transaction or block was verified by the last node in the network. Thus, latencies are computed by first grouping the log entries by their hash field to separate the event time stamps for different transactions and blocks, and then taking the difference between the latest received time stamp and the first sent time stamp in each group of log entries to obtain the latency of each transaction or block. 5.3
Microbenchmarks
The results for the microbenchmarks are shown in Table 2. As seen in Table 2, the improved Zerocoin protocol has shrunk the size of the Spend transaction by about 4 times from 26 KB to 6.5 KB. The main contribution to the reduction in size comes from the improved SNSoK, which is smaller than the original SNSoK by about 60 times. This is within expectations, as the improved SNSoK replaces the 80 sets of proof variables $(s_i, s'_i)$ included in the original proof with just a single set of $(s, s')$ (Sect. 4.1). The improved AccPoK is also 25% smaller than the original AccPoK due to the size optimisation technique of including C instead of the 7 $t_i$ in the proof (Sect. 4.2).

Table 2. Microbenchmarks results

                                        Improved Zerocoin protocol   Original Zerocoin protocol
Size                SNSoK               293 B                        18,293 B
                    AccPoK              5,243 B                      6,972 B
                    Spend transaction   6,490 B                      26,081 B
Verification time   SNSoK               2.4 ms                       193.5 ms
                    AccPoK              122.2 ms                     123.3 ms
                    Spend transaction   128.4 ms                     320.8 ms
The improved Zerocoin protocol has also reduced the verification time of the Spend transaction by 2.5 times from 320.8 ms to 128.4 ms. The reduction in verification time comes solely from the improved SNSoK, which is faster than the original SNSoK by about 80 times. This is again within expectations as the improved SNSoK reduces the 80 iterations of verification required by the original proof to just a single iteration (Sect. 4.1). The verification time for the improved AccPoK is almost identical to that of the original AccPoK. This is because the verification schemes of the original and improved AccPoK are largely the same. Instead of computing C followed by t in the original AccPoK, the verifier computes t followed by C in the improved AccPoK (Sect. 4.2). As C and C are computed using the same hash operation, the original and improved AccPoK incur the same computational time during verification. 5.4
Network Performance
The transaction latencies and block latencies measured from the experiments are shown in the form of box plots in Figs. 2 and 3 respectively. The latencies are plotted over increasing percentages of Zerocoin transactions in the network for each transaction rate. The latencies under the original and improved Zerocoin protocol are put side by side to compare the differences in performance between the two protocols. Anomaly points are discarded to show the main box plots features more clearly. (1) Overall effects of Zerocoin on Latencies: We present the following experimental results: (a) Drastic Increase in Latency due to Zerocoin Figures 2 and 3 show that for both the original and improved Zerocoin protocol, transaction and block latencies increase quickly as percentage of Zerocoin transactions in the network increases. This is within expectations due to the large difference in overheads between Zerocoin transactions and normal Bitcoin transactions. As seen from the results of the microbenchmarks, Zerocoin Spend transactions take hundreds of milliseconds to verify and are at least a few kilobytes in size for both versions of the Zerocoin protocol. On the other hand, the normal Bitcoin transactions created in the experiments take less than 1 ms to verify and are less than 1 KB in size. Thus, both the verification delay at nodes due to verification time and propagation delay across nodes due to transaction size are much higher for Zerocoin transactions, which cause Zerocoin transactions and blocks containing Zerocoin transactions to have much higher latencies in the network. As a result, overall latency increases drastically with the addition of Zerocoin transactions in the network. (b) Variation in Latency due to Zerocoin Figures 2 and 3 also show that except for transaction latencies under the original Zerocoin protocol, variation in latencies increases then decreases as the amount of Zerocoin transactions in the network increases from 0% to 100%. This can be again explained by the large difference in latencies between Zerocoin and
Fig. 2. Transaction latency against percentage of Zerocoin transactions in network for different transaction rates.
Bitcoin transactions. As the percentage of Zerocoin transactions in the network increases, transactions in the network become more evenly divided between Zerocoin transactions and Bitcoin transactions. Thus, differences in latencies between the two types of transactions become more distinct, which results in a higher spread in latencies. Since the difference in latency between Bitcoin transactions
Fig. 3. Block latency against percentage of Zerocoin transactions in network for different transaction rates.
and Zerocoin transactions is very large, the spread is also large. Theoretically, latency variation should be the highest when transactions in the network are divided equally between the two types of transactions. This is largely observed in the results as the inter-quartile ranges for both transaction and block latencies are often the highest when 50% of the transactions in the network are Zerocoin
transactions. The above pattern in latency variation does not apply to transaction latencies under the original Zerocoin protocol, as latency variation increases consistently with percentage of Zerocoin transactions in the network. (2) Effects of Improved Zerocoin Protocol on Block Latency: They read as follows: (a) Lack of Contagion Effect in Block Latency It can be seen from Fig. 3 that for both the improved and original Zerocoin protocol, block latency increases fairly linearly as percentage of Zerocoin transactions increases in the network. This indicates that contagion effect is absent at the block level, which can be explained by the experimental setup. In the experiments, a block is mined 25 s after one round of transaction creation completes and another 35 s of delay is introduced before the next round of transaction creation starts (Sect. 5.1). Including the time taken for each round of transaction creation, new blocks are mined and sent out in the network at least every minute. This avoids contagion effect in two ways. First, the 25 s delay before the mining of each block allows most of the transactions that are created in the previous round of transaction creation to be received and verified by the network. Thus, when nodes are verifying the newly mined block, there are little or no transactions left in the network to be verified. This means that there is no spillover delay from transaction verification when nodes are verifying blocks. Second, as the maximum time needed to verify each block containing 15 transactions is less than 5 s7 , a block will always be verified before the next block is received. Thus, the network never reaches its full capacity in terms of block verification and there is no contagion effect at the block level. (b) Consistent Improvements in Block Latency Due to the linear relationship between block latency and percentage of Zerocoin transactions in the network for both versions of the Zerocoin protocols, the ratio of the block latencies between the two versions of Zerocoin protocol remain consistent under different scenarios. To illustrate the consistent reduction in block latencies due to the improved Zerocoin protocol, Fig. 4 shows the median, 75th percentile and 90th percentile relative block latencies between the original and improved Zerocoin protocol for all combinations of independent variable values in the experiments. The relative block latency is computed as the block latency under the original Zerocoin protocol divided by the block latency under the improved Zerocoin protocol to clearly show the scale of improvements brought by the improved Zerocoin protocol. To assess how the shorter verification time of Spend transactions in the improved Zerocoin protocol translates to improvements in block latency, relative block latencies at 100% Zerocoin transactions in the network in Fig. 4 are 7
Estimated for a block containing all Zerocoin Spend transactions under the original Zerocoin protocol. From the microbenchmarks results, each such transaction takes about 300 ms to verify. Since block verification mainly involves verifying the transactions in the block, the maximum time taken to verify a block containing 15 transactions is estimated to be less than 5 s.
Fig. 4. Relative block latency against percentage of Zerocoin transactions in network for different transaction rates (2, 4, 8, 16 and 32 transactions per second; each panel reports the 50th, 75th and 90th latency percentiles).
For these cases, the median, 75th percentile and 90th percentile block latencies for the improved Zerocoin protocol are all consistently about 2.5 times smaller than those for the original Zerocoin protocol. This is consistent with the microbenchmark results, which show that the improved Spend transaction is verified about 2.5 times faster. The lower block latency due to the improved Zerocoin protocol improves the general performance of the network.
Another observation from Fig. 4 is that, except for the cases where there are no Zerocoin transactions in the network, the relative block latencies remain at similar levels even when the percentage of Zerocoin transactions in the network is less than 100%. This shows that the improved Zerocoin protocol is able to bring consistent improvements to block latency even for blocks containing a small number of Zerocoin transactions. This observation can be attributed to the much higher overhead of processing a Zerocoin transaction compared to a Bitcoin transaction in the experiments (Sect. 5.4). Since there are only 15 transactions per block in the experiments, the overhead of processing a block is dominated by the overhead of processing the Zerocoin transactions in the block, even if the block contains only a small number of them. Thus, as long as a block contains some Zerocoin transactions, the contribution to latency from the Bitcoin transactions in the block becomes negligible and the latency for the block is essentially the latency of its Zerocoin transactions. As a result, comparing the latencies of blocks containing some Zerocoin transactions is equivalent to comparing the latencies of blocks containing only Zerocoin transactions.
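The dominance of the Zerocoin verification cost can be illustrated with a back-of-the-envelope calculation. The sketch below is not part of the authors' evaluation; it simply combines numbers quoted in this paper (roughly 300 ms to verify an original Spend, a 2.5-times-faster improved Spend, 15 transactions per block) with an assumed, essentially negligible verification time for a plain Bitcoin transaction.

    # Illustrative only: per-transaction timings are taken from the text,
    # except BITCOIN_TX, which is an assumed negligible value.
    SPEND_ORIGINAL = 0.300                 # seconds per original Zerocoin Spend
    SPEND_IMPROVED = SPEND_ORIGINAL / 2.5  # improved Spend verifies ~2.5x faster
    BITCOIN_TX = 0.001                     # assumed cost of a plain Bitcoin transaction

    for zerocoin_txs in (1, 4, 8, 15):
        bitcoin_txs = 15 - zerocoin_txs
        t_orig = zerocoin_txs * SPEND_ORIGINAL + bitcoin_txs * BITCOIN_TX
        t_impr = zerocoin_txs * SPEND_IMPROVED + bitcoin_txs * BITCOIN_TX
        print(zerocoin_txs, round(t_orig, 3), round(t_impr, 3), round(t_orig / t_impr, 2))

Even a single Spend dominates the block verification cost under these assumptions, so the original-to-improved ratio stays in the vicinity of 2.5 for any block that contains some Zerocoin transactions, which matches the behaviour observed in Fig. 4.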
6 Conclusion
This paper starts off by examining the privacy problems in Bitcoin and exploring the various methods to increase the anonymity of Bitcoin transactions. The Zerocoin protocol is chosen to be analysed in detail due to its theoretical soundness and practical potential. The main weaknesses of Zerocoin are the long verification time and the large size of its Spend transactions, which can adversely affect its performance in the actual Bitcoin network. This paper addresses these two problems by first making performance improvements to the AccPoK and SNSoK in the Zerocoin protocol, and then integrating both the original and the improved protocol into the Bitcoin client software to evaluate their performance in a Bitcoin network. For the AccPoK, proof size is reduced by replacing a large proof variable with its hash and using the hash for verification. These modifications have reduced the verification time and size of the SNSoK by 80 and 60 times respectively, and reduced the size of the AccPoK by 25%. Consequently, the verification time and size of a Spend transaction are reduced by about 2.5 and 4 times respectively. We integrate the key Zerocoin operations of transaction creation and verification into the Bitcoind client and set up a Bitcoin network with nodes running the modified Bitcoind clients. Transaction and block latencies under the improved and the original Zerocoin protocol are compared to assess the performance improvements due to the improved protocol.
References

1. Narayanan, A., Bonneau, J., Felten, E., Miller, A., Goldfeder, S.: Bitcoin and Cryptocurrency Technologies: A Comprehensive Introduction. Princeton University Press, Princeton (2016)
2. Miers, I., Garman, C., Green, M., Rubin, A.D.: Zerocoin: anonymous distributed e-cash from bitcoin. In: Proceedings - IEEE Symposium on Security and Privacy, pp. 397–411 (2013)
3. Moser, M., Bohme, R., Breuker, D.: An inquiry into money laundering tools in the Bitcoin ecosystem. In: eCrime Researchers Summit, eCrime (2013)
4. Ben-Sasson, E., Chiesa, A., Garman, C., Green, M., Miers, I., Tromer, E., Virza, M.: Zerocash: decentralized anonymous payments from bitcoin. In: Proceedings - IEEE Symposium on Security and Privacy, pp. 459–474 (2014)
5. Ben-Sasson, E., Chiesa, A., Tromer, E.: Succinct non-interactive arguments for a von Neumann architecture. In: USENIX Security, pp. 1–35 (2013)
6. Hohenberger, I.S.: Lecture 10: more on proofs of knowledge, examples of proofs of knowledge. Compute 1, 1–8 (2002)
7. Camenisch, J., Lysyanskaya, A.: Dynamic accumulators and application to efficient revocation of anonymous credentials. In: Crypto, p. 16 (2002)
Walsh Sampling with Incomplete Noisy Signals

Yi Janet Lu

National Research Center of Fundamental Software, Beijing, People's Republic of China
Department of Informatics, University of Bergen, Bergen, Norway
[email protected]
Abstract. With the advent of massive data outputs at a regular rate, admittedly, signal processing technology plays an increasingly key role. Nowadays, signals are not merely restricted to physical sources, they have been extended to digital sources as well. Under the general assumption of discrete statistical signal sources, we propose a practical problem of sampling incomplete noisy signals for which we do not know a priori and the sampling size is bounded. We approach this sampling problem by Shannon’s channel coding theorem. Our main results demonstrate that it is the large Walsh coefficient(s) that characterize(s) discrete statistical signals, regardless of the signal sources. By the connection of Shannon’s theorem, we establish the necessary and sufficient condition for our generic sampling problem for the first time. Our generic sampling results find practical and powerful applications in not only statistical cryptanalysis, but software system performance optimization. Keywords: Walsh transform · Shannon’s channel coding theorem Channel capacity · Classical distinguisher · Statistical cryptanalysis Generic sampling · Digital signal processing
1 Introduction
With the advent of massive data outputs produced at a regular rate, we are confronted with the challenge of big data processing and analysis. Admittedly, signal processing has become an increasingly key technology. An open question is the sampling problem for signals about which we assume nothing is known a priori. For practical reasons, sampling is affected by possibly strong noise and/or limited measurement precision. Assuming that the signal source is not restricted to a particular application domain, we are concerned with a practical and generic problem of sampling these noisy signals. Our motivation arises from the following problem in modern applied statistics. Assume discrete statistical signals in a general setting as follows. The samples, generated by an arbitrary (possibly noise-corrupted) source F, are 2^n-valued for a fixed n. It is known to be a hypothesis testing problem to test
presence of any signals. Traditionally, F is a deterministic function with small or medium input size, and it is computationally easy to collect the complete and precise distribution f of F. Based on the notion of Kullback-Leibler distance, the conventional approach (aka the classic distinguisher) solves the sampling problem given the distribution f a priori (see [20]). Nevertheless, in reality, F might be a function for which we do not have a complete description, it might have a large input size, or it may be a non-deterministic function. Thus, it is infeasible to collect the complete and precise distribution f. This gives rise to the new generic statistical sampling problem with discrete incomplete noisy signals, using a bounded number of samples. In this work, we show that we can solve the generic sampling problem as reliably as possible without knowing the signals a priori. By novel translations, Shannon's channel coding theorem can solve the generic sampling problem under the general assumption of statistical signal sources. Specifically, the necessary and sufficient condition is given for the first time to sample incomplete noisy signals with bounded sampling size for signal detection. It is interesting to observe that the classical signal processing tool of the Walsh transform [2,9] is essential: regardless of the signal source, it is the large Walsh coefficient(s) that characterize(s) discrete statistical signals. Put another way, when sampling incomplete noisy signals of the same source multiple times, one can expect to repeatedly see those large Walsh coefficient(s) of the same magnitude(s) at the fixed frequency position(s). Note that this is known in application domains such as images and voice. Our results show a strong connection between Shannon's theorem and the Walsh transform, both of which are key innovative technologies in digital signal processing. Our generic sampling results find practical and useful applications not only in statistical cryptanalysis, where this is expected to become a powerful universal analytical tool for the core building blocks of symmetric cryptography [12,21], but also in performance analysis and heterogeneous acceleration; the latter seems to be one of the main bottlenecks for large-scale IT systems in the era of the revolutionary development of memory technologies. The rest of the paper is organized as follows. In Sect. 2, we review the basics of Walsh transforms and their application to multi-variable tests in statistics. In Sect. 3, Shannon's famous channel coding theorem, also known as Shannon's Second Theorem, is reviewed. In Sect. 4, we present our main sampling results: we put forward two sampling problems, namely the classical and the generic version, and we also conjecture a quantitative relation between Renyi's divergence of degree 1/2 and Shannon's channel capacity. In Sect. 5, we give illustrative applications and experimental results. Finally, we give conclusions and future work in Sect. 6.
2 Walsh Transforms in Statistics
Given a real-valued function f : GF(2)^n \to \mathbb{R}, defined on n-tuple binary input vectors, the Walsh transform of f, denoted by \hat{f}, is another real-valued function defined as

\hat{f}(i) \stackrel{\mathrm{def}}{=} \sum_{j \in GF(2)^n} (-1)^{\langle i, j \rangle} f(j),   (1)

for all i \in GF(2)^n, where \langle i, j \rangle denotes the inner product between the two n-tuple binary vectors i, j. For later convenience, we give an alternative definition below. Given an input array x = (x_0, x_1, \ldots, x_{2^n-1}) of 2^n reals in the time domain, the Walsh transform y = \hat{x} = (y_0, y_1, \ldots, y_{2^n-1}) of x is defined by

y_i \stackrel{\mathrm{def}}{=} \sum_{j \in GF(2)^n} (-1)^{\langle i, j \rangle} x_j,   (2)
for any n-tuple binary vector i. We call x_i (resp. y_i) the time-domain component (resp. transform-domain coefficient) of the signal with dimension 2^n. For properties and references on Walsh transforms, we refer to [9,12,17]. Let f be a probability distribution of an n-bit random variable X = (X_n, X_{n-1}, \ldots, X_1), where each X_i \in \{0, 1\}. Then, \hat{f}(m) is the bias of the Boolean variable \langle m, X \rangle for any fixed n-bit vector m, which is often called the output pattern or mask (note that the pattern m should be nonzero). Here, recall that a Boolean random variable A has bias \epsilon, which is defined by

\epsilon \stackrel{\mathrm{def}}{=} E[(-1)^A] = \Pr(A = 0) - \Pr(A = 1).   (3)

Hence, we always have -1 \le \epsilon \le 1, and if A is uniformly distributed, A has bias 0.

Walsh transforms were used in statistics to find dependencies within a multi-variable data set. In the multi-variable tests, each X_i indicates the presence or absence (represented by '1' or '0') of a particular feature in a pattern recognition experiment. The Fast Walsh Transform (FWT) is used to obtain all coefficients \hat{f}(m) in one shot. By checking the Walsh coefficients one by one and identifying the large ones (throughout the paper, a large transform-domain coefficient d is one with a large absolute value), we are able to tell the dependencies among the X_i's. For instance, we have a histogram of the pdf (the probability density function) of the triples (X_0, X_1, X_2) as depicted in Fig. 1. It is obtained by an experiment with a total of 160 trials. The Walsh spectrum of the histogram, i.e., of the array (24, 18, 16, 26, 22, 14, 16, 24), as calculated by (2), is shown in Fig. 2. The largest nontrivial Walsh coefficient is found to be 32, located at the index position (0, 1, 1). This implies that the correlation is observed to be strongest for the variables X_1, X_2.
Fig. 1. The PDF of the triple (X0 , X1 , X2 ).
Fig. 2. Walsh spectrum of the PDF of the triple (X0 , X1 , X2 ).
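The Walsh spectrum of this example can be reproduced directly from definition (2). The short sketch below uses the histogram values quoted in the text for Fig. 1 and recovers the spectrum of Fig. 2; the encoding of the triple (X0, X1, X2) as a 3-bit index with X0 as the most significant bit is an assumption of this illustration.

    # Histogram of the triples (X0, X1, X2) from the 160-trial experiment (Fig. 1)
    x = [24, 18, 16, 26, 22, 14, 16, 24]

    # Walsh transform, definition (2): y_m = sum_j (-1)^{<m, j>} x_j
    y = [sum((-1) ** bin(m & j).count("1") * x[j] for j in range(8)) for m in range(8)]

    print(y)   # [160, -4, -4, 32, 8, -4, 4, 0]
    # The largest nontrivial coefficient is 32 at m = 3 = 0b011, i.e. the mask
    # (0, 1, 1): the dependency is strongest between X1 and X2, as stated above.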
3 Review on Shannon's Channel Coding Theorem
We briefly review Shannon's famous channel coding theorem (sometimes called Shannon's Second Theorem) [5]. First, we recall basic definitions of Shannon entropy. The entropy H(X) of a discrete random variable X with alphabet \mathcal{X} and probability mass function p(x) is defined by

H(X) \stackrel{\mathrm{def}}{=} -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x).
The joint entropy H(X_1, \ldots, X_n) of a collection of discrete random variables (X_1, \ldots, X_n) with a joint distribution p(x_1, x_2, \ldots, x_n) is defined by

H(X_1, \ldots, X_n) \stackrel{\mathrm{def}}{=} -\sum_{x_1} \cdots \sum_{x_n} p(x_1, \ldots, x_n) \log_2 p(x_1, \ldots, x_n).

Define the conditional entropy H(Y|X) of a random variable Y given X by

H(Y|X) \stackrel{\mathrm{def}}{=} \sum_{x} p(x) H(Y|X = x).
The mutual information I(X; Y) between two random variables X, Y is equal to H(Y) - H(Y|X), which always equals H(X) - H(X|Y). A communication channel is a system in which the output Y depends probabilistically on its input X. It is characterized by a probability transition matrix that determines the conditional distribution of the output given the input.

Theorem 1 (Shannon's Channel Coding Theorem): Given a channel, denote the input and output by X, Y respectively. We can send information at the maximum rate C bits per transmission with an arbitrarily low probability of error, where C is the channel capacity defined by

C = \max_{p(x)} I(X; Y),   (4)

and the maximum is taken over all possible input distributions p(x).

For the binary symmetric channel (BSC) with crossover probability p, that is, the input symbols are complemented with probability p, the transition matrix is of the form

\begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}.   (5)

We can express C by [5]:

C = 1 - H(p) \text{ bits/transmission}.   (6)
We refer to the BSC with crossover probability p = (1 + d)/2, where d is small (i.e., |d| \ll 1), as an extremal BSC. Note that such a BSC becomes useful in studying correlation attacks (see [15]) on stream ciphers. Put simply, the observed keystream bit is found to be correlated with one bit which is a linear relation of the internal state. In this case, the probability that the equality holds is denoted by 1 - p (and so the crossover probability of this BSC is p). Using

H\!\left(\frac{1+d}{2}\right) = 1 - \left(\frac{d^2}{2} + \frac{d^4}{12} + \frac{d^6}{30} + \frac{d^8}{56} + O(d^{10})\right) \times \frac{1}{\log 2},   (7)

we can show

H\!\left(\frac{1+d}{2}\right) = 1 - d^2/(2 \log 2) + O(d^4).   (8)
So, we can show the following result on the channel capacity of an extremal BSC.

Corollary 1 (extremal BSC): Given a BSC with crossover probability p = (1 + d)/2, if d is small (i.e., |d| \ll 1), then C \approx c_0 \cdot d^2, where the constant c_0 = 1/(2 \log 2). Therefore, we can send one bit with an arbitrarily low probability of error with a minimum number of transmissions 1/C = (2 \log 2)/d^2, i.e., O(1/d^2).

Interestingly, in communication theory this extremal BSC is rare, as one typically deals with |d| \gg 0 (see [18]).
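Corollary 1 is easy to check numerically. The sketch below compares the exact capacity (6) of a BSC with crossover probability p = (1 + d)/2 against the approximation c_0 d^2 for a few small values of d; the values of d are chosen arbitrarily for illustration.

    import math

    def H2(p):                                  # binary entropy in bits
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    for d in (0.2, 0.1, 0.05, 0.01):
        C_exact = 1 - H2((1 + d) / 2)           # Eq. (6) with p = (1 + d)/2
        C_approx = d * d / (2 * math.log(2))    # Corollary 1: C is roughly d^2 / (2 log 2)
        print(d, C_exact, C_approx, 1 / C_exact)   # 1/C = minimum number of transmissions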
4 Sampling Theorems with Incomplete Signals
In this section, we put forward two sampling problems, namely the classical and the generic version. Without loss of generality, we assume that the discrete statistical signals are not restricted to a particular application domain and that the signals are 2^n-valued for a fixed n. Specifically, we give the mathematical model of the signal represented by an arbitrary (and not necessarily deterministic) function F as follows. Let X be the n-bit output sample of F, assuming that the input is random and uniformly distributed. Denote the output distribution of X by f. Note that our assumption of a general setting of discrete statistical signals is captured by the assumption that F is an arbitrary yet fixed function with n-bit output. The classical sampling problem can be formally stated as follows:

Theorem 2 (Classical Sampling Problem): Assume that the largest Walsh coefficient of f is d = \hat{f}(m_0) for a nonzero n-bit vector m_0. We can detect signals represented by F with an arbitrarily low probability of error, using a minimum number N = (8 \log 2)/d^2 of samples of F, i.e., O(1/d^2).

Note that this can be interpreted as the classical distinguisher that is often used in statistical cryptanalysis, though the problem statement of the classical distinguisher is slightly different and it uses a slightly different N [20]. The classical sampling problem assumes that F, together with its characteristics (i.e., the largest Walsh coefficient d), is known a priori.

Next, we present our main sampling results for practical (and widely applicable) sampling. Assuming that it is infeasible to know the signal represented by F a priori, we want to use a bounded number of samples to detect signals with an arbitrarily low probability of error. Note that the sampled signal is often incomplete (that is, it is possible that not all outputs are generated by sampling) and so the associated distribution is not precise. We call this problem generic sampling with incomplete noisy signals. In analogy to the classical distinguisher, this result can be interpreted as a generalized distinguisher in the context of statistical cryptanalysis (with n = 1, it appears as an informal result in symmetric cryptanalysis, where it is used as a black-box analysis tool for several crypto-systems). We give our main result with n = 1 below.
Theorem 3 (Generic Sampling Problem with n = 1): Assume that the sampling size of F is upper-bounded by N. Regardless of the input size of F, in order to detect the signal F with an arbitrarily low probability of error, it is necessary and sufficient that the following condition is satisfied: \hat{f} has a nontrivial Walsh coefficient d with |d| \ge c/\sqrt{N}, where the constant c = \sqrt{8 \log 2}.

Assume that f satisfies the following conditions: 1) the cardinality of the support of f is a power of two (i.e., 2^n); 2) 2^n is small; and 3) f(i) \in (0, 3/2^n) for all i. We now present a generalized result for n \ge 1, which incorporates Theorem 3 as a special case:

Proposition 1 (Generic Sampling Problem with n \ge 1): Assume that the sampling size of F is upper-bounded by N. Regardless of the input size of F, in order to detect the signal F with an arbitrarily low probability of error, it is necessary and sufficient that the following condition is satisfied:

\sum_{i \ne 0} (\hat{f}(i))^2 \ge (8 \log 2)/N.   (9)
We note that the sufficient condition can also be proved based on results for the classic distinguisher (i.e., the Squared Euclidean Imbalance), which uses the notion of Kullback-Leibler distance and states that \sum_{i \ne 0} (\hat{f}(i))^2 \ge (4 \log 2)/N is required for a high probability of success [20]. Secondly, from (9), discrete statistical signals can be characterized by the large Walsh coefficients of the associated distribution. Thus, the most significant transform-domain signals are the largest coefficients in our generalized model.
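A small Monte Carlo sketch, under assumptions chosen purely for illustration (a 1-bit source with true bias d = 0.02 and a simple threshold test at d/2), shows the sampling size of Theorem 3 at work: with N = (8 log 2)/d^2 samples, the biased source and a uniform source are already separated well above chance, and the error keeps shrinking as N grows.

    import math, random

    random.seed(0)
    d = 0.02                                  # assumed true bias of the unknown 1-bit source
    N = int((8 * math.log(2)) / d ** 2)       # minimum sampling size, N = 8 log 2 / d^2
    print("N =", N)                           # about 13,863 samples

    def empirical_bias(samples):              # estimate of E[(-1)^A], cf. Eq. (3)
        return sum((-1) ** s for s in samples) / len(samples)

    trials, correct = 100, 0
    for _ in range(trials):
        biased = [0 if random.random() < (1 + d) / 2 else 1 for _ in range(N)]
        uniform = [random.getrandbits(1) for _ in range(N)]
        correct += abs(empirical_bias(biased)) > d / 2     # decide "signal present"
        correct += abs(empirical_bias(uniform)) <= d / 2   # decide "no signal"
    print("success rate:", correct / (2 * trials))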
4.1 Proof of Theorem 3
We first prove the following hypothesis testing result using Shannon's Channel Coding Theorem.

Theorem 4: Assume that the Boolean random variable A has bias d and d is small. We are given a sequence of random samples, which are i.i.d. following either the distribution of A or a uniform distribution. We can tell the sample source with an arbitrarily low probability of error, using a minimum number of samples N = (8 \log 2)/d^2, i.e., O(1/d^2).

Proof: We propose a novel non-symmetric binary channel. Assume the channel has the following transition matrix:

p(y|x) = \begin{pmatrix} 1 - p_e & p_e \\ 1/2 & 1/2 \end{pmatrix},   (10)

where p_e = (1 - d)/2 and d is small. The matrix entry in the x-th row and the y-th column denotes the conditional probability that y is received when x is sent. So, the input bit 0 is transmitted by this channel with error probability p_e (i.e., the received sequence has bias d if the input symbols are 0) and the input bit 1 is
transmitted with error probability 1/2 (i.e., the received sequence has bias 0 if the input symbols are 1). By Shannon's channel coding theorem, with a minimum number of N = 1/C transmissions, we can reliably (i.e., with an arbitrarily low probability of error) detect the signal source (i.e., determine whether the input is '0' or '1').

To compute the channel capacity C, i.e., to find the maximum defined in (4), no closed-form solution exists in general. Nonlinear optimization algorithms (see [1,3]) are known to find a numerical solution. Below, we propose a simple method to give a closed-form estimate of C for our extremal binary channel. As I(X;Y) = H(Y) - H(Y|X), we first compute H(Y) by

H(Y) = H\!\left(p_0 (1 - p_e) + (1 - p_0) \times \frac{1}{2}\right),   (11)

where p_0 denotes p(x = 0) for short. Next, we compute

H(Y|X) = \sum_{x} p(x) H(Y|X = x) = p_0 H(p_e) + (1 - p_0) \cdot 1 = p_0 (H(p_e) - 1) + 1.   (12)

Combining (11) and (12), we have

I(X;Y) = H\!\left(p_0 \times \frac{1}{2} - p_0 p_e + \frac{1}{2}\right) - p_0 H(p_e) + p_0 - 1.   (13)

As p_e = (1 - d)/2, we have

I(X;Y) = H\!\left(\frac{1 + p_0 d}{2}\right) - p_0 \left(H\!\left(\frac{1-d}{2}\right) - 1\right) - 1.   (14)

For small d, we apply (8) and obtain

I(X;Y) = -\frac{p_0^2 d^2}{2 \log 2} - p_0 \left(H\!\left(\frac{1-d}{2}\right) - 1\right) + O(p_0^4 d^4).   (15)

Note that the last term O(p_0^4 d^4) on the right side of (15) is negligible. Thus, I(X;Y) is estimated to approach its maximum when

p_0 = -\frac{H\!\left(\frac{1-d}{2}\right) - 1}{d^2/(\log 2)} \approx \frac{d^2/(2 \log 2)}{d^2/(\log 2)} = \frac{1}{2}.

Consequently, we estimate the channel capacity (15) by

C \approx -\frac{1}{4} \cdot \frac{d^2}{2 \log 2} + \frac{1}{2}\left(1 - H\!\left(\frac{1-d}{2}\right)\right) \approx -d^2/(8 \log 2) + d^2/(4 \log 2),   (16)

which is d^2/(8 \log 2).
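The closed-form estimate (16) can be cross-checked numerically. The sketch below evaluates I(X;Y) for the non-symmetric channel (10) over a grid of input distributions p_0 and compares the maximum with d^2/(8 log 2); the value d = 0.05 is an arbitrary illustrative choice.

    import math

    def H2(p):                                  # binary entropy in bits
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    d = 0.05
    pe = (1 - d) / 2        # channel (10): row 0 = (1 - pe, pe), row 1 = (1/2, 1/2)

    def I(p0):                                  # mutual information for input prior (p0, 1 - p0)
        py0 = p0 * (1 - pe) + (1 - p0) * 0.5    # Pr(Y = 0), cf. Eq. (11)
        return H2(py0) - (p0 * H2(pe) + (1 - p0) * 1.0)   # H(Y) - H(Y|X), cf. Eq. (12)

    best_p0, C = max(((p0, I(p0)) for p0 in (k / 1000 for k in range(1, 1000))),
                     key=lambda t: t[1])
    print(best_p0, C, d * d / (8 * math.log(2)))   # maximiser near 1/2, C near d^2/(8 log 2)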
We now proceed to prove Theorem 3. The only nontrivial Walsh coefficient d for n = 1 is \hat{f}(1), which is the bias of F. First, we show by contradiction that the condition is necessary. That is, if we can identify F with an arbitrarily low probability of error, then we must have |d| \ge c/\sqrt{N}. Suppose |d| < c/\sqrt{N} otherwise. Following the proof of Theorem 4, we know that the error probability is bounded away from zero as a consequence of Shannon's Channel Coding Theorem. This is a contradiction. Thus, we have shown that the condition on d is necessary. Next, we show that it is also sufficient. That is, if |d| \ge c/\sqrt{N}, then we can identify F with an arbitrarily low probability of error. This follows directly from Theorem 2 with n = 1. This completes our proof.
4.2 Proof of Proposition 1
Assume that the channel has transition matrix p(y|x). Let p(y|x = 0) denote the distribution f, and let p(y|x = 1) be a uniform distribution. Denote the channel capacity by C. For convenience, we let f(i) = u_i + 1/2^n for i = 0, \ldots, 2^n - 1. Note that we have \sum_i u_i = 0 and -1/2^n < u_i < (2^n - 1)/2^n for all i. By Taylor series, for all i we have

\log\!\left(u_i + \frac{1}{2^n}\right) = \log\frac{1}{2^n} + 2\left(\frac{u_i}{2/2^n + u_i} + \frac{1}{3}\left(\frac{u_i}{2/2^n + u_i}\right)^3 + \cdots\right) \approx \log\frac{1}{2^n} + \frac{2u_i}{2/2^n + u_i},   (17)

as we know u_i/(2/2^n + u_i) \in (-1, 1). We deduce that, with 2^n small, we can calculate -\log 2 \cdot H(f) by

\sum_i \left(u_i + \frac{1}{2^n}\right) \log\!\left(u_i + \frac{1}{2^n}\right) \approx \log\frac{1}{2^n} + 2\sum_i u_i - \sum_i \frac{2u_i}{2 + 2^n u_i} \approx \log\frac{1}{2^n} - \frac{1}{2^{n-1}} \sum_i \left(1 - \frac{1}{1 + 2^{n-1} u_i}\right).   (18)

Assuming that |2^{n-1} u_i| < 1 for all i (and 2^n is small), we have

\sum_i \frac{1}{1 + 2^{n-1} u_i} \approx \sum_i \left(1 - 2^{n-1} u_i + (2^{n-1} u_i)^2\right).   (19)

We continue (18) by -\log 2 \cdot H(f) \approx \log\frac{1}{2^n} + 2^{n-1} \sum_i u_i^2. Meanwhile, by a property of the Walsh transform ([12, Sect. 2]), we know

\sum_i (\hat{f}(i))^2 = 2^n \sum_i (f(i))^2 = 2^n \sum_i u_i^2 + 1.   (20)
So, we have shown the following important result:

H(f) \approx n - \frac{2^n}{2 \log 2} \sum_i \left(f(i) - \frac{1}{2^n}\right)^2 = n - \frac{\sum_{i \ne 0} (\hat{f}(i))^2}{2 \log 2},   (21)

assuming that (1) the cardinality of the support of f is a power of two (i.e., 2^n), (2) 2^n is small, and (3) f(i) \in (0, 3/2^n) for all i. Next, in order to calculate C, by (21) we first compute

H(Y|X) = p_0 H(f) + (1 - p_0) n \approx n - \frac{p_0 \sum_{i \ne 0} (\hat{f}(i))^2}{2 \log 2},   (22)
where p_0 denotes p(x = 0) for short. Denote the distribution of Y by D_Y. We have D_Y(i) = p_0 f(i) + (1 - p_0)/2^n = p_0 u_i + 1/2^n for all i. Again we can apply (21) and get

H(Y) \approx n - \frac{\sum_{i \ne 0} (\hat{D}_Y(i))^2}{2 \log 2} = n - \frac{p_0^2 \sum_{i \ne 0} (\hat{f}(i))^2}{2 \log 2}.   (23)

So, we have

I(X;Y) = H(Y) - H(Y|X) \approx \frac{(p_0 - p_0^2) \sum_{i \ne 0} (\hat{f}(i))^2}{2 \log 2}.   (24)
When p_0 = 1/2, we have the maximum of I(X;Y), which equals

C \approx \frac{\sum_{i \ne 0} (\hat{f}(i))^2}{8 \log 2} = \frac{2^n \sum_i \left(f(i) - \frac{1}{2^n}\right)^2}{8 \log 2}.   (25)
Consequently, we have N \ge 1/C, i.e., \sum_{i \ne 0} (\hat{f}(i))^2 \ge (8 \log 2)/N. This is a necessary and sufficient condition, following Shannon's theorem.
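Both the entropy approximation (21) and the capacity estimate (25) can be checked numerically for a randomly chosen near-uniform distribution. The sketch below uses an arbitrary small perturbation of the uniform distribution on 2^4 points; the perturbation size is an assumption of the illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    size = 2 ** n
    u = rng.normal(scale=0.005, size=size)
    u -= u.mean()                        # enforce sum_i u_i = 0
    f = 1.0 / size + u                   # near-uniform distribution, f(i) = 1/2^n + u_i

    # Walsh transform of f, definition (2)
    fhat = np.array([sum((-1) ** bin(i & j).count("1") * f[j] for j in range(size))
                     for i in range(size)])

    H_exact = -np.sum(f * np.log2(f))                        # entropy in bits
    H_approx = n - np.sum(fhat[1:] ** 2) / (2 * np.log(2))   # approximation (21)
    C_approx = np.sum(fhat[1:] ** 2) / (8 * np.log(2))       # capacity estimate (25)
    print(H_exact, H_approx, C_approx, 1.0 / C_approx)       # 1/C = minimum sampling size N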
4.3 More Generalized Results
Above we consider the case that f(i) is not so far from the uniform distribution. Based on the weaker assumption that 2^n is small and f(i) = 1/2^n + u_i > 0 (i.e., u_i > -1/2^n) for all i, we now show a more general result. By (18), we have

H(f) \approx n + \frac{1}{\log 2} \sum_i \frac{u_i}{1 + 2^{n-1} u_i}.   (26)
Following similar computations, we have

H(Y|X) \approx n + \frac{p_0}{\log 2} \sum_i \frac{u_i}{1 + 2^{n-1} u_i},   (27)

H(Y) \approx n + \frac{1}{\log 2} \sum_i \frac{p_0 u_i}{1 + 2^{n-1} p_0 u_i}.   (28)

So, we obtain the following general result:

C \approx \max_{p_0} \frac{1}{\log 2} \sum_i \left(\frac{p_0 u_i}{1 + 2^{n-1} p_0 u_i} - \frac{p_0 u_i}{1 + 2^{n-1} u_i}\right).   (29)
Recall that if |u_i| < 2/2^n for all i, (29) can be approximated by (25), which is achieved with p_0 = 1/2. Specifically, if |u_i| < 2/2^n, the approximation for the addend in (29) can be expressed as follows:

\left(\frac{p_0 u_i}{1 + 2^{n-1} p_0 u_i} - \frac{p_0 u_i}{1 + 2^{n-1} u_i}\right) \approx 2^{n-1} u_i^2 (p_0 - p_0^2),   (30)

where we use \frac{v}{1+v} = 1 - \frac{1}{1+v} \approx 1 - (1 - v + v^2) = v - v^2 (for |v| < 1). Note that for |v| > 1, we have \frac{v}{1+v} \approx 1 - 1/v + 1/v^2. We can show that with 2^{n-1} u_i = k > 1 for some i, the addend in (29) achieves its maximum when p_0 = 1/k, that is,

\max_{p_0} \left(\frac{p_0 u_i}{1 + 2^{n-1} p_0 u_i} - \frac{p_0 u_i}{1 + 2^{n-1} u_i}\right) \approx \frac{1}{2^{n-1}} \left(1 - \frac{1}{k} + \frac{1}{k^2} - \frac{1}{k^3}\right).   (31)

On the other hand, the right-hand side of (30) equals (p_0 - p_0^2) k^2 / 2^{n-1}, which is much larger than (31). Meanwhile, with 2^{n-1} u_i = k = 1 for some i, we have

\max_{p_0} \left(\frac{p_0 u_i}{1 + 2^{n-1} p_0 u_i} - \frac{p_0 u_i}{1 + 2^{n-1} u_i}\right) \approx \max_{p_0} \frac{1}{2^{n-1}} \left(\frac{p_0}{2} - p_0^2\right) = \frac{1}{16} \cdot \frac{1}{2^{n-1}},   (32)

when p_0 = 1/4.
4.4 Further Discussions
Based on Rényi's information measures [6], we make a conjecture below (see [14] for the most recent results on the conjecture) for an even more general form of our channel capacity C. Recall that Rényi's information divergence (see [6]) of order \alpha of a distribution P from another distribution Q on a finite set \mathcal{X} is defined as

D_\alpha(P \| Q) \stackrel{\mathrm{def}}{=} \frac{1}{\alpha - 1} \log \sum_{x \in \mathcal{X}} P^\alpha(x) Q^{1-\alpha}(x).   (33)
So, with \alpha = 1/2,

D_{1/2}(P \| Q) \stackrel{\mathrm{def}}{=} (-2) \log \sum_{x \in \mathcal{X}} \sqrt{P(x) Q(x)}.   (34)
Conjecture 1: Let Q, U be a non-uniform distribution and a uniform distribution, respectively, over a support of cardinality 2^n. Let the matrix T consist of the two rows Q, U and 2^n columns. We have the following relation between Rényi's divergence of degree 1/2 and the generalized channel capacity of degree 1/2 (i.e., the standard Shannon channel capacity):

D_{1/2}(Q \| U) = 2 \cdot C_{1/2}(T).   (35)
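The two sides of (35) can at least be computed and compared numerically. The sketch below, which is only an illustration and not evidence for or against Conjecture 1, evaluates D_{1/2}(Q||U) via (34) and the Shannon capacity (in the same logarithm base) of the two-row channel T, for a randomly chosen near-uniform Q; for such Q the two quantities come out close, in line with the approximations of Sect. 4.2.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    size = 2 ** n
    u = rng.normal(scale=0.004, size=size)
    u -= u.mean()
    Q = 1.0 / size + u                   # near-uniform, non-uniform distribution
    U = np.full(size, 1.0 / size)        # uniform distribution

    D_half = -2.0 * np.log(np.sum(np.sqrt(Q * U)))   # Eq. (34), natural logarithm

    def mutual_information(p0):          # channel T: p(y|x=0) = Q, p(y|x=1) = U (in nats)
        Y = p0 * Q + (1 - p0) * U
        return (-np.sum(Y * np.log(Y))
                + p0 * np.sum(Q * np.log(Q)) + (1 - p0) * np.sum(U * np.log(U)))

    C = max(mutual_information(p0) for p0 in np.linspace(0.001, 0.999, 999))
    print(D_half, 2 * C)                 # close to each other for near-uniform Q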
Finally, assume that the 2^n-valued F has a potentially large input space 2^L. The collected normalized distribution of N \gg 2^n output samples in the time domain approximately fits the noisy model of [4,11]. That is, the additive Gaussian noise is i.i.d. with zero mean and variance \sigma^2 = 1/(N 2^n). In this case, noisy sparse WHT [4] can be used for our generic sampling problem to recover the nonzero Walsh coefficients \hat{f}(i) and the index positions i. We obtain an evaluation time of N queries of F and processing time roughly on the order of n, provided that

\sum_{i \ne 0} (\hat{f}(i))^2 \ge (8 \log 2)/N,   (36)

that is,

SNR = \sum_{i \ne 0} (\hat{f}(i))^2 / (2^n \cdot \sigma^2) = \sum_{i \ne 0} (\hat{f}(i))^2 / \left(\frac{1}{N}\right) \ge 8 \log 2.   (37)

5 Applications and Experimental Results
We first demonstrate a cryptographic sampling application. The famous block cipher GOST 28147-89 is a balanced Feistel network of 32 rounds. It has a 64-bit block size and a key length of 256 bits. Let the 32-bit L_i and R_i denote the left and right halves at round i \ge 1, and let L_0, R_0 denote the plaintext. The subkey for round i is denoted by k_i. For the purpose of multi-round analysis, our target function is F(R_{i-1}, k_i) = R_{i-1} \oplus k_i \oplus f(R_{i-1}, k_i), where f(R_{i-1}, k_i) is the round function. We choose N = 2^{40}. It turns out that the largest three Walsh coefficients are 2^{-6}, 2^{-6.2} and 2^{-6.3}, respectively. This new weakness leads to various severe attacks [13] on GOST. Similarly, our cryptographic sampling technique is applicable to, and further threatens the security of, the SNOW 2.0 cipher [22]. For general SNR \ge 8 \log 2, regardless of 2^L, when N \sim 2^n, we choose appropriate b such that N = b · 2 . So, roughly b · processing time is needed. With L = 64, this would become a powerful universal analytical tool for security evaluation of the core building blocks of symmetric cryptography [12,21].
Note that current computing technology [16] can afford an exascale WHT (i.e., on the order of 2^{60}) within 2 years, using 2^{15} modern PCs.
Another notable practical application is software performance optimization. Modern large-scale IT systems are of a hybrid nature. Usually, only partial information about the architecture as well as some of its component units is known. Further, the revolutionary change in physical components (e.g., main memory, storage) inevitably demands that the system take full advantage of the new hardware characteristics. We expect that our sampling techniques (and the transform-domain analysis of the running time) would help with performance analysis and optimization for the whole heterogeneous system.
6 Conclusion
In this paper, we model general discrete statistical signals as the output samples of an unknown, arbitrary yet fixed function (which is the signal source). We translate Shannon's channel coding theorem to solve a hypothesis testing problem. The translated result allows us to solve a generic sampling problem, for which we know nothing about the signal source a priori and can only afford bounded sampling measurements. Our main results demonstrate that the classical signal processing tool of the Walsh transform is essential: it is the large Walsh coefficient(s) that characterize(s) discrete statistical signals, regardless of the signal sources. By Shannon's theorem, we establish the necessary and sufficient condition for the generic sampling problem under the general assumption of statistical signal sources. As a future direction of this work, it is interesting to investigate noisy sparse WHT for our generic sampling problem in the more general setting where N \ll 2^n and 2^n is large, in which case the noise cannot be modelled as Gaussian.
References

1. Arimoto, S.: An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inform. Theory IT-18, 14–20 (1972)
2. Blackledge, J.M.: Digital Signal Processing - Mathematical and Computational Methods, Software Development and Applications, 2nd edn. Horwood Publishing, England (2006)
3. Blahut, R.: Computation of channel capacity and rate distortion functions. IEEE Trans. Inform. Theory IT-18, 460–473 (1972)
4. Chen, X., Guo, D.: Robust sublinear complexity Walsh-Hadamard transform with arbitrary sparse support. In: IEEE International Symposium on Information Theory, pp. 2573–2577 (2015)
5. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. John Wiley & Sons, Hoboken (2006)
6. Csiszár, I.: Generalized cutoff rates and Rényi's information measures. IEEE Trans. Inform. Theory 41(1), January 1995
7. Dinur, I., Dunkelman, O., Keller, N., Shamir, A.: Memory-efficient algorithms for finding needles in haystacks. In: CRYPTO 2016, Part II, LNCS, vol. 9815, pp. 185–206 (2016)
8. Gray, R.M., Davisson, L.D.: An Introduction to Statistical Signal Processing. Cambridge University Press (2004). http://www-ee.stanford.edu/~gray/sp.pdf
9. Horadam, K.J.: Hadamard Matrices and Their Applications. Princeton University Press, Princeton (2007)
10. Joux, A.: Algorithmic Cryptanalysis. Cryptography and Network Security. Chapman & Hall/CRC, Boca Raton (2009)
11. Li, X., Bradley, J.K., Pawar, S., Ramchandran, K.: The SPRIGHT algorithm for robust sparse Hadamard transforms. In: IEEE International Symposium on Information Theory, pp. 1857–1861 (2014)
12. Lu, Y., Desmedt, Y.: Walsh transforms and cryptographic applications in bias computing. Cryptogr. Commun. 8(3), 435–453 (2016). Springer
13. Lu, Y.: New Linear Attacks on Block Cipher GOST, IACR eprint (2017). http://eprint.iacr.org/2017/487
14. Lu, Y.: New Results on the DMC Capacity and Renyi's Divergence (2017). arXiv:1708.00979
15. Meier, W., Staffelbach, O.: Fast correlation attacks on certain stream ciphers. J. Cryptol. 1(3), 159–176 (1989). Springer
16. Reed, D.A., Dongarra, J.: Exascale computing and big data: the next frontier. Commun. ACM 58(7), 56–68 (2015)
17. Scheibler, R., Haghighatshoar, S., Vetterli, M.: A fast Hadamard transform for signals with sublinear sparsity in the transform domain. IEEE Trans. Inf. Theory 61(4), 2115–2132 (2015)
18. Shokrollahi, M.A.: Personal Communication (2006)
19. Vaudenay, S.: An experiment on DES - statistical cryptanalysis. In: Third ACM Conference on Computer Security, pp. 139–147 (1996)
20. Vaudenay, S.: A Classical Introduction to Modern Cryptography. Applications for Communications Security. Springer, New York (2006)
21. Vaudenay, S.: A Direct Product Theorem (submitted)
22. Zhang, B., Xu, C., Meier, W.: Fast correlation attacks over extension fields, large-unit linear approximation and cryptanalysis of SNOW 2.0. In: CRYPTO 2015, LNCS, vol. 9215, pp. 643–662. Springer (2015)
23. Zhang, B., Xu, C., Feng, D.: Practical cryptanalysis of Bluetooth encryption with condition masking. J. Cryptol. (2017). Springer. https://doi.org/10.1007/s00145-017-9260-1
Investigating the Effective Use of Machine Learning Algorithms in Network Intruder Detection Systems Intisar S. Al-Mandhari(&), L. Guan, and E. A. Edirisinghe Department of Computer Science, Loughborough University, Loughborough, UK
[email protected] [email protected]
Abstract. Research into the use of machine learning techniques for network intrusion detection, especially carried out with respect to the popular public dataset, KDD cup 99, have become commonplace during the past decade. The recent popularity of cloud-based computing and the realization of the associated risks are the main reasons for this research thrust. The proposed research demonstrates that machine learning algorithms can be effectively used to enhance the performance of existing intrusion detection systems despite the high misclassification rates reported in the literature. This paper reports on an empirical investigation to determine the underlying causes of the poor performance of some of the well-known machine learning classifiers. Especially when learning from minor classes/attacks. The main factor is that the KDD cup 99 dataset, which is popularly used in most of the existing research, is an imbalanced dataset due to the nature of the specific intrusion detection domain, i.e. some attacks being rare and some being very frequent. Therefore, there is a significant imbalance amongst the classes in the dataset. Based on the number of the classes in the dataset, the imbalance dataset issue can be considered a binary problem or a multi-class problem. Most of the researchers focus on conducting a binary class classification as conducting a multi-class classification is complex. In the research proposed in this paper, we consider the problem as a multi-class classification task. The paper investigates the use of different machine learning algorithms in order to overcome the common misclassification problems that have been faced by researchers who used the imbalance KDD cup 99 dataset for their investigations. Recommendations are made as for which classifier is best for the classification of imbalanced data. Keywords: Cloud computing Imbalanced dataset
Data mining Intrusion detection
1 Introduction

Due to cloud services and operational models, and the technologies that are used to enable these services, organisations will face new risks and threats. Moreover, a cloud's distributed nature increases the possibilities of being exploited by intruders. Since cloud services are available through the internet, the key issue to be looked upon is
cloud security [1]. The confidentiality, integrity, and availability of cloud services and resources are affected by attacks. The authors of [2] argued that the computing costs of cryptographic techniques cannot justify their use in preventing attacks on the cloud. A firewall is another solution that can be used to prevent attacks; however, while firewalls work well against outsider attacks, they cannot prevent insider attacks. Because of this, it is recommended to incorporate Intrusion Detection Systems (IDSs), with which both internal and external attacks can be detected. Intrusion detection is the process of detecting the signs of intrusion by monitoring events and analysing what occurs within ICT systems during an intrusion. Based on the protected objectives and information sources, IDSs can be classified into host-based IDS and network-based IDS: host-based intrusion detection systems collect and monitor suspicious events occurring within the host, while network-based intrusion detection systems capture and analyse network packets. Based on detection techniques, there are two main distinct approaches to IDS: knowledge-based intrusion detection (KID) and behaviour-based intrusion detection (BID). The former is also known as misuse or signature-based intrusion detection, where evidence of attacks is searched for based on accumulated knowledge that corresponds to known attacks. The latter is also known as anomaly-based intrusion detection, where attacks are detected by observing what is defined as normal activity in order to identify significant deviations. A wide range of machine learning techniques has been adopted in IDSs and has been the focus of much research in the past decade. These techniques are employed to learn from a training set that includes samples of a network's behaviour under known attacks, to help build IDS systems that perform intrusion detection automatically. Machine learning classifiers have the ability to learn the nature of attacks and their variations from the known attacks in the dataset, so there is no need for a human expert to extract the knowledge and formulate it into rules that represent attacks. The rest of this paper is organized as follows. Section 2 is about machine learning in intrusion detection and discusses the motivation for using machine learning in intrusion detection, data mining benchmarks, machine learning detection methodology and classification approaches for imbalanced datasets. Section 3 presents the KDD Cup 99 imbalanced dataset classification issue. Section 4 describes the experimental setup in terms of dataset pre-processing and validation methods. Experimental results and comparative analysis are discussed in Sect. 5. Conclusions and a summary are given in Sect. 6.
2 Machine Learning in Intrusion Detection

2.1 Application and Motivation of Machine Learning to ID
In the case of Intrusion Detection, learning involves analysing sample data of such activities, called training set, to discover patterns of normal behaviour or intrusive behaviour. However, it is argued that the training set should contain the most samples of the studied environment to be able to discover the whole patterns. Therefore, new
data instances will be classified by the learned model based on their similarity to normal behaviour (anomaly detection) or to known attack signatures (misuse detection) [3]. Machine Learning (ML) offers different types of techniques that enhance detection capability as well as save time and cost. Instead of constructing attack signatures or specifying the normal behaviour of a sensor node manually, ML does this automatically using a proper methodology and classifier, so time is saved and less human labour is needed. In the literature, researchers encourage the deployment of machine learning and data mining to enhance IDS performance. In [4] it is reported that ML plays a great role in enhancing the ability of an IDS to focus on malicious activities by extracting the normal activities from the alarm data; moreover, the authors argue that irrelevant alerts are reduced by 99.9% after the deployment of an ML classifier. Also, Al-Memory in [5] concludes that using ML to enhance IDS performance reduces the alarm load by 28%.

2.2 Knowledge Discovery and Data Mining Benchmark
An adequate dataset is one of the major issues that should be considered for an intrusion detection system. In fact, only a few public datasets are used to serve machine-learning-based IDS. From the literature, two main datasets are widely used for network intrusion detection systems: the DARPA (Defense Advanced Research Projects Agency) dataset and the KDD Cup '99 (Knowledge Discovery and Data) dataset. DARPA is the first standard corpus for the evaluation of computer network intrusion detection systems; it was collected and distributed by the MIT (Massachusetts Institute of Technology) Lincoln Laboratory under DARPA and AFRL sponsorship. This dataset has been widely used by researchers for training and testing intrusion detectors, where a proper technical state of the art can be achieved. The DARPA dataset has been used less widely since the appearance of the KDD Cup '99 dataset, which overcomes some of its limitations. The main criticism against DARPA is that it is not possible to determine how accurate the background traffic inserted into the evaluation is, as the test-bed traffic generation software is not publicly available. The procedures used in building the dataset and in performing the evaluation have also been criticized by McHugh in [6]: the background traffic is generated using simple models, and if live traffic were used the false positive rate would be much higher; moreover, background noise such as packet storms and strange packets is not included in the background data. Other critiques concern irregularities in the data, as noted by the authors of [7], where an appreciable detection rate is shown by a trivial detector because the TTL values of attack packets are obviously different from those of normal packets. Nevertheless, despite all the criticisms, the DARPA dataset is still occasionally used by researchers for IDS evaluation [7–9]. It is argued that since 1999 the most widely used dataset for the evaluation of detection methods has been the KDD Cup '99 dataset, which is built from the data captured in the DARPA 1998 TCP/IP data. For the KDD subset, the training data consist of five million records while the test set consists of about 4 million records, each of which contains 41 features. The training dataset contains 24 types of attacks and
14 further types appear only in the test data. Each record of the training data is labelled as either normal or a specific kind of attack. The attacks fall into one of four categories: Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L) and Probe. A detailed description of the training attack types is given in [15]. The features of the KDD Cup '99 dataset can be classified into three groups [8]:

• Basic features: all the attributes that can be extracted from a TCP/IP connection are encapsulated in this group. Using these features alone introduces a delay in detection.
• Traffic features: features that are computed over a window interval are included in this category, and it is divided into two groups, same-host features and same-service features. Same-host features examine the connections in the past 2 s that have the same destination host as the current connection, while same-service features examine the connections in the past 2 s that have the same service as the current connection.
• Content features: the features that are used to detect suspicious behaviour in the data portion of packets. In other words, these features can be used to detect R2L and U2R attacks, as such attacks are embedded in the data portions of the packets and normally involve only a single connection, unlike DoS and Probing attacks, which involve many connections to some host(s) in a very short period.

2.3 Machine Learning Detection Methodology
Based on the availability of labels in the dataset, two operating modes can be defined for detection techniques [8]:

• Supervised Detection: also known as classification methods, where the prediction model is built from a labelled training set containing both normal and anomalous samples. It is argued that with this method the detection rate is better, due to the ability to access a lot of information. However, these methods suffer from some technical issues that affect their accuracy. One of them is the lack of a training set that covers all legitimate behaviour and samples. Furthermore, noise included in the training set increases the false alarm rate.
• Unsupervised Detection: no training data is required with unsupervised techniques, as this approach is based on two basic assumptions. First, it assumes that normal traffic is represented by the majority of the network connections, while malicious traffic is represented by a small percentage of the connections. Second, normal and malicious traffic are statistically different. Therefore, instances that appear infrequently and are significantly different from most instances are considered attacks, while normal traffic is reflected by data instances that form groups of similar instances and appear very frequently.

Based on the way anomalies are reported, there are typically three types of output [8]:

• Scores: a numeric score is assigned to each tested instance to indicate how likely it is to be an attack. The analyst can then use the ranking to select the most significant samples by setting a threshold value. A good example of this kind of method is Naive Bayes, which is a kind of Bayesian network.
• Binary Labels: some detection techniques cannot provide scores for the instances; instead, a labelling technique is used in which each tested instance is labelled as either anomalous or normal.
• Multi-Labels: with this technique, a specific label is assigned to each tested instance. One label is for normal traffic, and for each kind of attack there is a label containing its name or category. For example, some methods apply the labels normal, DoS, Probe, U2R and R2L to show the general category of the detected attacks. This type is used by detection techniques that cannot score an instance, such as Decision Trees.

As this paper focuses on the multi-class classification problem that will be discussed later, we briefly describe here the machine learning classification algorithms. Classification algorithms in intrusion detection are used to classify network traffic as normal or attack. From the time that Denning [10] formalized anomaly detection, different methods have been proposed; in the following, we briefly explain some of the methods applied to the multi-class problem [8, 11–14] (an illustrative training sketch follows the list).

• Bayesian Networks: a Bayesian network is a probabilistic graphical model based on the representation of a set of variables and their probabilistic dependencies. This approach uses acyclic graphs consisting of nodes and edges; nodes represent variables, and the edges encode conditional dependencies between the variables. They have been applied to detection and classification in different ways. The two types of this technique most deployed are Naïve Bayes and Bayes Network.
• Neural Networks: a neural network is a network of computational units implementing a complex mapping function between these units. Initially, a labelled dataset is used to train the network; tested instances are then classified as either normal or attacks after they are fed into the network. Support Vector Machines (SVM) and the Multi-Layer Perceptron (MLP) are examples of techniques in this family that are widely used in anomaly detection.
• Trees (inductive rule generation algorithms): a flow-chart-like tree structure is produced in which a feature is represented by a node, a test result is denoted by a branch and the predicted classes are represented by the leaves. Several techniques fall under this category; the most deployed for classification problems are J48 and Random Forest.
• Clustering: the idea of these techniques is usually based on the two assumptions of unsupervised detection discussed above. If these two assumptions hold, anomalies can be detected based on cluster size, i.e., large clusters correspond to normal data and the rest correspond to attacks.
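As a purely illustrative complement to the descriptions above (the experiments in this paper use Weka, not the code below), the following sketch trains several of these classifier families with scikit-learn and prints per-class precision and recall, which is where class-imbalance problems become visible; the feature matrix X and the 5-class label vector y are assumed to be already prepared.

    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # X: numeric feature matrix, y: labels in {Normal, DoS, Probe, R2L, U2R} (assumed prepared)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    models = {
        "NaiveBayes": GaussianNB(),                                # Bayesian family
        "DecisionTree": DecisionTreeClassifier(random_state=42),   # J48 analogue
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
        "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name)
        print(classification_report(y_test, model.predict(X_test), zero_division=0))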
3 An Insight into Classification Imbalanced Dataset

3.1 KDD Cup '99 Dataset Classification Issues
Duplicates and class imbalance are the critical issues to be considered in the KDD Cup '99 dataset. There are many duplicate instances in the KDD Cup '99 dataset due to the lack of temporal information in it. The quality of the data is therefore affected by the duplicate records, which negatively affects the training process of machine learning techniques. The importance of high-quality training data is discussed in [15, 16], where it is concluded that the training samples should be diverse. Moreover, [16] investigates empirically the effects of duplicates on naïve Bayes and a Perceptron with Margins in an application to spam detection; the authors recommend removing the duplicates, as they observe that the amount of duplication has a negative effect on the accuracy of the classifiers. Table 1 provides an overview of the number of instances of each class before and after removing duplicates, which illustrates that the DoS and Probing classes exhibit the most duplicates, due to the nature of these intrusions. Moreover, it is clearly seen that the KDD Cup '99 dataset suffers from class imbalance: as can be seen in the table below, the DoS and Probe attacks account for by far the largest numbers of attack samples [17, 18] (a small script reproducing these counts is sketched after the table).

Table 1. Description of class distribution in the KDD Cup 99 dataset

Dataset / Number of instances              DoS      R2L   U2R   Probe    Normal     Total
KDD Cup 99 dataset (with duplicates)  3,883,370   1,126    52  41,102   972,780  4,898,430
KDD Cup 99 dataset (no duplicates)      247,267     999    52  13,860   812,814  1,074,992
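The class distribution in Table 1 can be reproduced approximately with publicly available tooling. The sketch below is not part of this paper's Weka pipeline; it uses scikit-learn's KDD Cup 99 loader and a standard mapping of the individual attack names onto the four attack categories, and its counts should be close to the 'with duplicates' row of Table 1 (keeping the default percent10=True gives a quicker, smaller subset).

    from collections import Counter
    from sklearn.datasets import fetch_kddcup99

    data = fetch_kddcup99(percent10=False)           # full KDD Cup 99 data (large download)
    labels = [lbl.decode() for lbl in data.target]   # labels are byte strings, e.g. b'smurf.'

    # Standard mapping of attack names to the four categories used in this paper
    dos = {"back.", "land.", "neptune.", "pod.", "smurf.", "teardrop."}
    probe = {"ipsweep.", "nmap.", "portsweep.", "satan."}
    r2l = {"ftp_write.", "guess_passwd.", "imap.", "multihop.", "phf.", "spy.",
           "warezclient.", "warezmaster."}
    u2r = {"buffer_overflow.", "loadmodule.", "perl.", "rootkit."}

    def category(lbl):
        if lbl == "normal.":
            return "Normal"
        if lbl in dos:
            return "DoS"
        if lbl in probe:
            return "Probe"
        if lbl in r2l:
            return "R2L"
        return "U2R" if lbl in u2r else "Other"

    print(Counter(category(lbl) for lbl in labels))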
3.2 Problem of Imbalanced Dataset
The scenario of imbalanced datasets appears frequently in classification problems. This problem arises when one class significantly outnumbers the examples of the other, leading to a higher misclassification rate for the minority class instances, as standard classification learning algorithms are often biased towards the majority class. It is important to focus on this classification problem because it is argued that the minority class usually represents the most important concept to be learned. However, since data acquisition for the minor classes is costly or is associated with exceptional and significant cases [19], it is difficult to obtain representative samples. In most cases imbalanced classification refers to binary classification, but other problems occurring with imbalanced datasets are multi-class classification problems, which, as argued by Engen in [17], are more difficult to solve since several minor classes need to be balanced [20]. The hitch with imbalanced datasets is that some machine learning algorithms (such as ANNs and DTs) are biased towards the major class(es) [19], and, therefore, the minor
class(es) score a poor classification rate. In some cases the minor class is ignored entirely by some classifiers, as discussed in [21, 22], whilst a high overall accuracy is still achieved [17, 18].

3.3 Addressing Classification with Imbalanced Data
Several approaches have been proposed to deal with imbalanced classification; these approaches can be categorized into two groups, the data level and the algorithm level. Data-level approaches are also called external approaches: the data is pre-processed to diminish the effect of the class imbalance, typically by applying some form of sampling to the training data. Algorithm-level approaches are also called internal approaches, as a new algorithm is created or an existing one is modified to deal with the imbalanced dataset issue [23–29]. Some class imbalance approaches incorporate both the external (data level) and internal (algorithm level) approaches, seeking to minimize the higher cost of misclassifying minor class samples [30–33]. Other approaches are based on ensemble learning algorithms, employed either by embedding a cost-sensitive framework in the ensemble learning process [33] or by pre-processing the data before the learning stage of each classifier [34–36]. In this section, we first introduce the sampling techniques, next we describe the cost-sensitive learning approach, and finally we present some relevant ensemble techniques in the framework of imbalanced datasets [17, 18].

3.3.1 Data Level: Resampling Techniques

Applying some form of sampling to the training data is one of the most common methods of dealing with class imbalance. The specialized literature [37–39] shows that applying a pre-processing step to balance the class distribution is usually a useful solution. These resampling techniques fall into three groups. The first form aims at removing instances from the major class(es) to obtain a balanced dataset and is called under-sampling. The second form aims at producing more instances of the minor class(es) to balance the dataset (over-sampling). The third form comprises hybrid methods, which combine both oversampling and undersampling; Weiss in [19] recommends using both resamplings (hybrid) for imbalanced datasets [40]. The resampling can be conducted randomly, directed, or based on specific methods. Sometimes random sampling is recommended, as it produces satisfactory results, but some authors argue that random sampling may generate problems because the data distribution is not considered. In the case of random under-sampling, significant data may potentially be discarded, which may affect the learning process. In random oversampling, the bias problem may increase, since exact copies of existing instances are made and the over-fitting issue therefore arises. Nevertheless, both sampling approaches have been shown to help [41, 42]. The class distribution in resampling is considered by some methods, such as the generation of synthetic training samples [43]. The most sophisticated method is the "Synthetic Minority Oversampling Technique" (SMOTE) [43], which is based on the idea of interpolating several minority class instances that lie close together to create new minority instances for oversampling the training set. However, the true distribution of a 'real' problem is generally unknown [19]; the authors in [22, 44] found that this depends on the method and the problem [17, 18, 45].
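A minimal resampling sketch is given below. It is not the approach evaluated in this paper (the experiments use Weka); it assumes the external imbalanced-learn package, an already numerically encoded feature matrix X_train, labels y_train using the five class names, and illustrative target counts, and it combines SMOTE oversampling of the rare classes with random undersampling of the dominant ones, i.e. the hybrid strategy recommended above.

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Oversample the rare classes first (target counts are illustrative assumptions)...
    smote = SMOTE(sampling_strategy={"U2R": 5000, "R2L": 5000}, random_state=42)
    X_over, y_over = smote.fit_resample(X_train, y_train)

    # ...then undersample the dominant classes (hybrid resampling).
    under = RandomUnderSampler(sampling_strategy={"DoS": 50000, "Normal": 50000},
                               random_state=42)
    X_res, y_res = under.fit_resample(X_over, y_over)

    print(Counter(y_train), Counter(y_res))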
problem is generally unknown [19], and the authors of [22, 44] found that the appropriate choice depends on the method and on the problem at hand [17, 18, 45].

3.3.2 Algorithm Level: Cost-Sensitive Learning
Researchers have investigated the reasons behind the poor learning of many machine learning techniques on imbalanced datasets. Some argue that the adopted evaluation criteria are partly responsible: overall accuracy is the most widely used metric to evaluate classifier performance, yet it largely hides the bias towards the majority class, so the minority class(es) may be ignored by the classifier, especially when the dataset is extremely imbalanced [19, 22]. Studies of this bias conclude that ANNs and DTs are typical examples of affected classifiers [34]. Several alternative evaluation metrics, including AUC, F-measure, and weighted metrics (cost-sensitive learning), are discussed in [19, 21]. Cost-sensitive learning is the most popular of these approaches: the error or classification rate for each class is multiplied by a cost/weight. In other words, it is based on a weight matrix that mirrors the confusion matrix obtained from the classifier's performance, where correct classifications carry no penalty [21]. It is recommended to incorporate domain knowledge into such a weight matrix. The weight matrix used to evaluate the KDD Cup '99 competition, for example, is presented in Table 2; it does not address class imbalance directly, as it is used only to evaluate the classification result, and a particular emphasis on penalising misclassifications of U2R and R2L instances is recommended [17, 40, 45, 46].

Table 2. Weight matrix for evaluating the result of the KDD Cup '99 competition [40]
Actual/Predicted  Normal  Probing  DoS  U2R  R2L
Normal 0 1 2 3 4
Probing 1 0 1 2 2
DoS 2 2 0 2 2
U2R 2 2 2 0 2
R2L 2 2 2 2 0
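To make the cost-sensitive idea concrete, the following minimal Python sketch (not part of the original study) applies the standard minimum-expected-cost decision rule with a weight matrix of the form shown in Table 2; the class ordering, the example probabilities, and the assumption that rows denote the actual class are illustrative.

```python
import numpy as np

# Illustrative only: cost values follow Table 2, under the assumption that
# rows correspond to the actual class and columns to the predicted class.
CLASSES = ["Normal", "Probing", "DoS", "U2R", "R2L"]
COST = np.array([
    [0, 1, 2, 3, 4],   # actual Normal
    [1, 0, 1, 2, 2],   # actual Probing
    [2, 2, 0, 2, 2],   # actual DoS
    [2, 2, 2, 0, 2],   # actual U2R
    [2, 2, 2, 2, 0],   # actual R2L
])

def min_expected_cost_prediction(class_probs):
    """Pick the class whose prediction minimizes the expected misclassification
    cost, given a classifier's posterior probabilities for one instance."""
    expected_cost = class_probs @ COST   # expected_cost[j] = sum_i P(i) * COST[i, j]
    return int(np.argmin(expected_cost))

# Hypothetical posterior probabilities for one instance.
probs = np.array([0.55, 0.05, 0.05, 0.05, 0.30])
for name, c in zip(CLASSES, probs @ COST):
    print(f"predict {name}: expected cost {c:.2f}")
print("decision:", CLASSES[min_expected_cost_prediction(probs)])
```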
3.3.3 Classifier Combination: Ensemble Learning
Ensemble-based classifiers are built on the idea of constructing two or more classifiers from the original data and aggregating their predictions when unknown instances are presented. Several classifiers are combined to obtain a new classifier that outperforms each of them individually in classifying new instances. For imbalanced data, ensemble learning is usually combined with data-level or algorithm-level techniques: when combined with the data level, the new approach pre-processes the data before training each classifier, while with cost-sensitive ensembles the ensemble learning algorithm is used to guide the cost-minimization procedure [20, 47]. A sketch of the first variant is given below.
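The following sketch is our own illustration, not the authors' exact configuration: each bootstrap bag is re-balanced by random over-sampling before a decision tree is trained on it, and the trees then vote on new instances. Integer-encoded class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ResampledBagging:
    """Bagging in which each bootstrap bag is re-balanced by randomly
    over-sampling the minority classes before training the base tree."""

    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def _balance(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        target = counts.max()
        idx = np.concatenate([
            self.rng.choice(np.flatnonzero(y == c), size=target, replace=True)
            for c in classes
        ])
        return X[idx], y[idx]

    def fit(self, X, y):
        n = len(y)
        self.trees = []
        for _ in range(self.n_estimators):
            bag = self.rng.integers(0, n, size=n)      # bootstrap sample
            Xb, yb = self._balance(X[bag], y[bag])     # data-level re-balancing
            self.trees.append(DecisionTreeClassifier().fit(Xb, yb))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees])
        # Majority vote per instance (assumes integer-encoded labels).
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```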
4 Experimental Setup

4.1 Data Mining Tool
Weka (Waikato Environment for Knowledge Analysis), originally developed at the University of Waikato in New Zealand, is a collection of machine learning algorithms for data mining tasks. It is Java software with a GUI and contains tools for data pre-processing, classification, regression, clustering and visualization. The software consists of the Explorer, Experimenter, Knowledge Flow, Simple Command Line Interface, and a Java interface. Several advantages encourage the use of Weka: it is freely available under the GNU public license; it is fully implemented in the Java programming language and runs on most architectures, making it portable; and its data processing and modeling techniques serve most data mining needs [48].
4.2 Dataset and Pre-processing
The 10% version of the KDD training set is used here, which avoids some of the methodological complexities, in terms of time and memory usage, related to using the full dataset. Table 3 presents the number of records in the full dataset compared to the 10% training set.

Table 3. Number of records in the full KDD dataset vs. the 10% training set
Dataset description        Number of instances
Full KDD Cup 99            4,898,431
10% training KDD Cup 99    494,020
The duplicate records in the dataset were removed to reduce the memory requirements and processing time, as well as the bias towards specific classes. Table 4 presents the 10% training set with and without duplicates; 348,436 duplicated instances are removed.

Table 4. Number of records in the 10% training dataset with and without duplicates
Dataset description              Number of instances
10% training (with duplicates)   494,020
10% training (no duplicates)     145,584
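A minimal sketch of this pre-processing step is shown below; the file name and the use of pandas are assumptions, since the exact tooling used for the de-duplication is not described here.

```python
import pandas as pd

# Assumed file name: the 10% KDD Cup '99 training set, a comma-separated file
# with 41 feature columns plus a label column and no header row.
df = pd.read_csv("kddcup.data_10_percent", header=None)
print("records with duplicates   :", len(df))   # should be close to Table 3

# Dropping exact duplicate rows reduces memory use, processing time, and the
# bias introduced by heavily repeated attack records.
deduplicated = df.drop_duplicates()
print("records without duplicates:", len(deduplicated))   # compare with Table 4

deduplicated.to_csv("kddcup_10_percent_dedup.csv", index=False, header=False)
```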
4.3 Validation Methods
When a study does not adopt the original training and test sets, there are two approaches to validate the models: cross-validation and percentage split (holdout validation). The results may differ depending on the validation method, since they are sensitive to the selection
of training and test data. An experiment conducted in this study shows that holdout validation results differ depending on the percentage split that is deployed. If the percentage split is 80%, then 80% of the dataset is used for training and the rest for testing; the tested instances therefore differ depending on the chosen value, which leads to different detection rates. Figure 1 presents the results when the holdout value is 90%, 80% and 60% for different classifiers: J48, random forest (RF), Bayes net (BN) and naïve Bayes (NB).
Fig. 1. Classifiers’ performance with different holdout validation value.
The variation in the detection rate of each attack is clearly visible and is affected by the percentage used, since new attacks may appear in the test set that were not seen in the training set. For example, J48 detects 96.4% of U2R attacks when the holdout value is 90%, but this drops to 50% and 56.6% when the percentage split is 80% and 60%, respectively. Another noticeable result is that the detection ability of RF for U2R increases when the holdout value is decreased: it detects 45.5% of U2R attacks with a 90% holdout, compared to 71.4% with an 80% split. Although some classifiers change only slightly, the overall accuracy is still affected; with NB, the overall accuracy decreases from 98% to 89.9% when the training portion changes from 90% to 80%. More detailed results are provided in Fig. 1. It can therefore be concluded that cross-validation is more reliable, since the whole dataset is passed through an N-fold cross-validation. Ten folds is considered the common value by Depren et al. [49] and is chosen here, since it is supported by an empirical investigation with DTs and NB by Kohavi [50].
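The comparison can be reproduced in outline as follows; this is an illustrative sketch (assuming numerically encoded KDD features X and labels y), not the authors' Weka experiment.

```python
from sklearn.model_selection import cross_val_score, train_test_split, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

def compare_validation(X, y):
    """Contrast a single 80/20 holdout split with stratified 10-fold CV."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=0)
    holdout_acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    cv_scores = cross_val_score(GaussianNB(), X, y, cv=cv)

    print(f"80/20 holdout accuracy: {holdout_acc:.3f}")
    print(f"10-fold CV accuracy   : {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```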
5 Results Based on Learning from an Imbalanced Dataset

5.1 Classification with the Resample Method
It is argued that resampling is the most commonly used method to improve the performance of classifiers that suffer from misclassification of, and bias against, the minority classes. The sampling value used in this study is 0.1, since larger values noticeably alter the majority class instances, whereas with 0.1 the minority class counts increase slightly and the majority class counts decrease
slightly, which does not affect the original distribution of the dataset (Table 5). Figure 2 presents the sampling value along with the change in the class distribution.

Table 5. Class distribution with different sampling values
Class    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Normal   87831  81959  76088  70216  64354  58473  52602  46731  40859  34988  29116
U2R      52     2958   5864   8771   11677  14584  17490  20397  23303  26210  29116
DoS      54572  52026  49480  46935  44389  41844  39298  36753  34207  31662  29116
R2L      999    3810   6622   9434   12246  15057  17869  20681  32493  26305  29116
Probe    2130   4828   7527   10226  12924  15623  18322  21020  23719  26418  29116
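The class counts in Table 5 are consistent with a bias-to-uniform resampling rule in which each class size is interpolated between its original value (sampling value 0) and a uniform share of the dataset (sampling value 1). The sketch below reproduces the table's counts under that interpretation; the rule itself is an inference from the numbers, not a description taken from the text.

```python
# Class counts of the deduplicated 10% KDD training set (column 0.0 of Table 5).
ORIGINAL = {"Normal": 87831, "U2R": 52, "DoS": 54572, "R2L": 999, "Probe": 2130}

def biased_resample_counts(counts, bias):
    """Target class sizes when interpolating between the original distribution
    (bias = 0) and a perfectly uniform one (bias = 1). Truncating to integers
    reproduces the values reported in Table 5."""
    total = sum(counts.values())
    uniform = total / len(counts)
    return {c: int((1 - bias) * n + bias * uniform) for c, n in counts.items()}

for b in (0.1, 0.5, 1.0):
    print(b, biased_resample_counts(ORIGINAL, b))
```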
From Table 6 it is clearly seen that the overall accuracy is high for all of the studied classifiers. In terms of attack classification, the detection rate of the majority classes (Normal, DoS and Probe) is high, but detection of the minority classes (U2R and R2L) is low, owing to the imbalanced distribution of instances. In other words, the detection rate of the studied classifiers is affected by the imbalanced dataset, with some bias towards the majority classes. For example, the RF classifier performs well with 99.9% accuracy but detects only 67.3% of U2R and 67.9% of R2L instances. Similarly, the BN accuracy is 97.2% and its detection rate is high for all classes except U2R, due to the low number of U2R samples to learn from.
Table 6. Classifiers' performance before resampling
Classifier  Accuracy  TPR     FPR   Precision  Normal  U2R    DoS     R2L    Probe
NB          89.8      89.8%   1.8%  96.9%      86.6%   84.6%  96.1%   38.0%  81.3%
BN          97.2      97.2%   0.7%  98.6%      98.1%   82.7%  95.8%   96.8%  98.1%
J48         99.8      99.0%   0.1%  99.9%      99.9%   59.6%  100.0%  96.0%  98.2%
RF          99.94     99.9%   0.1%  99.9%      99.9%   67.3%  100.0%  67.9%  98.8%
MLP         99.9      99.7%   0.3%  99.7%      99.9%   50.0%  99.7%   88.9%  98.4%
As resampling is one method for handling the multi-class imbalanced dataset, Table 7 presents the behaviour of the studied approaches after resampling is applied. The detection rate of the minority classes increases and there is no longer a bias towards the majority classes. For example, the detection rate for U2R increases from 59.9% to 100% with the J48 classifier. However, the NB detection rate for R2L remains low (about 38%); this is addressed and discussed later. Figure 3 compares the performance of the classifiers on the imbalanced dataset before and after the resampling.
Table 7. Classifiers' performance after resampling
Classifier  Accuracy  TPR     FPR   Precision  Normal  U2R     DoS     R2L     Probe
NB          87.9      87.9%   1.9%  94.1%      85.0%   97.3%   96.3%   38.2%   81.3%
BN          97.4      97.4%   1.2%  97.8%      98.2%   98.0%   96.0%   99.4%   98.4%
J48         99.9      100.0%  0.1%  99.9%      99.9%   100.0%  100.0%  99.7%   99.7%
RF          99.98     100.0%  0.0%  100.0%     100.0%  100.0%  100.0%  100.0%  99.0%
MLP         99.34     99.3%   0.4%  99.3%      99.7%   87.1%   99.9%   93.4%   98.4%
Fig. 2. Classes distribution with different sampling value.
Fig. 3. Classifiers performance with and without resample method.
5.2 Random Forest Behaviour with Imbalance Learning Methods
This experiment studies the behaviour of the RF classifier under different imbalance learning methods, with the aim of identifying which method best enhances RF classification of the minority classes. The learning methods deployed here are resampling, cost-sensitive learning and ensembles (bagging and AdaBoostM). Figure 4 illustrates the results of the imbalance learning methods with RF. At the data level, RF classifies the minority attacks (R2L and U2R) correctly without any misclassification, with 99.98% overall accuracy and 100% classification of each attack. At the algorithm level, the cost matrix used is the standard one recommended in [40] and provided in Table 2. It is observed that the cost-
sensitive classifier greatly influences the behaviour of RF in terms of the detection rate of R2L, which increases by about 30% (from 67% to 97.8%), while the detection rate for U2R is only slightly affected, with 69% of its instances correctly classified. It is argued that the RF algorithm works in the same way as bagging and is itself a type of ensemble learning, but the results below show that RF wrapped in a bagging classifier works better than RF as a single tree-based classifier, since the detection rate of R2L increases from 67.9% to 97%.
Fig. 4. Random forest performance with imbalance learning method.
In this experiment we also study the performance of RF with another ensemble learner, AdaBoostM, as it is one of the recommended classifiers. The performance of RF with AdaBoostM is similar to that with bagging: the detection rate of R2L clearly increases, while the U2R classification decreases slightly. Furthermore, this experiment uses hybrid imbalance learning methods, namely bagging with resampling and AdaBoostM with resampling. The results show that resampling positively affects the ensemble methods, as all of the minority classes, along with the majority classes, are correctly classified.
5.3 Naive Bayes Classifier and the Minor Attacks
From the resampling experiments (Fig. 3), the resampling method clearly affects the behaviour of NB: its detection ability increases for one of the minority classes, U2R, while its classification ability for R2L remains approximately the same. Therefore, some experiments were conducted to determine which methods can enhance the NB detection ability for R2L. These experiments are based on different imbalance learning approaches: cost-sensitive learning, bagging and stacking. In terms of cost-sensitive learning (Fig. 5), the detection ability of NB for R2L attacks is not affected by the cost-sensitive method, even if the weight of R2L misclassification is increased beyond 2. Bagging ensemble learning, as one of the recommended solutions for the multi-class problem, is also used here; the minority-class classification ability of NB increases slightly for the U2R class (from 84.6% to 86.6%) and remains the same for R2L. To avoid the bias towards the majority classes, bagging is then used together with resampling, where an increase is observed for U2R detection while R2L is essentially unaffected (approximately unchanged at 38.5%).
Fig. 5. NB detection for minor attacks.
Another ensemble method, stacking, which combines different classifiers, is also used, with NB chosen as the base classifier. In the first experiment, NB is stacked with bagging, where RF is chosen as the base of the bagging classifier. The first run, without resampling, shows a noticeable decrease in U2R detection (by 50%) and a large increase in R2L detection (by 52%). When the second experiment is conducted with resampling, the results change positively: the correct classification of the minority classes increases to a high percentage, 100% for U2R and 98.5% for R2L. Two other configurations are also studied. In the first, MLP is the base classifier for bagging and NB is the base for stacking, with resampling; NB performance for R2L detection does not change and U2R is ignored entirely, scoring 0%. In the second, NB is the base for stacking and J48 is the base for bagging, and the NB performance with this approach decreases for both U2R and R2L classification. It can be concluded that NB classification of U2R can be enhanced through a hybrid approach that combines a data-level method with an ensemble method; in this study, the combination of NB with RF bagging is recommended.
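One plausible arrangement of the recommended NB-with-RF-bagging combination is sketched below; the assignment of level-0 and meta-learner roles and the logistic-regression combiner are assumptions, since the exact stacking configuration is not fully specified in the text.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def build_nb_rf_stack():
    """Stack NB and a bagged random forest as level-0 learners; their
    predictions are combined by a (assumed) logistic-regression meta-learner."""
    bagged_rf = BaggingClassifier(RandomForestClassifier(n_estimators=50),
                                  n_estimators=10, random_state=0)
    return StackingClassifier(
        estimators=[("nb", GaussianNB()), ("bagged_rf", bagged_rf)],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5)

# Usage (X_resampled, y_resampled being the re-balanced training data):
# model = build_nb_rf_stack().fit(X_resampled, y_resampled)
```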
6 Conclusion
This study has focused on the multi-class classification problem of network intrusions using the imbalanced KDD Cup '99 dataset. An investigation of this imbalanced dataset was conducted, along with a number of refinements that can be used to enhance the performance of machine learning classifiers. After a detailed investigation, it was found that some classifiers are biased towards the majority classes, which leads to misclassification of the minority classes. The conclusions of the research can be summarized as follows:
• Cross-validation is recommended to avoid the variability of results observed with holdout validation.
• The bias towards the majority classes and the poor performance of some classifiers are due to the imbalanced dataset, and this can be addressed through resampling techniques.
• The performance of the Random Forest classifier as part of an ensemble is better than as a single tree-based classifier. This is because RF uses a subset of features to split each node in a tree, while bagging (the ensemble) considers all features when splitting a node; it can therefore be concluded that the features in the dataset affect the performance of RF. Resampling combined with ensemble methods (bagging or AdaBoostM) is the recommended approach for the RF classifier.
• NB classification of U2R can be enhanced through a hybrid approach that combines a data-level method with an ensemble method; in this study, the combination of NB with RF bagging is recommended.
References 1. Modi, C., Patel, D., Borisaniya, B., Patel, A., et al.: A survey on security issues and solutions at different layers of Cloud computing. J. Supercomput. 63(2), 561–592 (2013) 2. Chen, Y., Sion, R.: On securing untrusted clouds with cryptography. Science 109–114 (2010) 3. Sommer, R., Paxson, V.: Outside the closed world: on using machine learning for network intrusion detection, pp. 305–316 (2010). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp= &arnumber=5504793&contentType=Conference+Publications&queryText=R.+Sommer +and+V.+Paxson,+Outside+the+Closed+World:+On+Using+Machine++Learning+For +Network+Intrusion+Detection 4. Naiping, S.N.S., Genyuan, Z.G.Z.: A study on intrusion detection based on data mining. In: International Conference of Information Science and Management Engineering, ISME, vol. 1, pp. 8–15 (2010) 5. Almutairi, A.: Intrusion detection using data mining techniques 6. McHugh, J.: Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Inf. Syst. Secur. 3(4), 262–294 (2000) 7. Tavallaee, M., et al.: A detailed analysis of the KDD CUP 99 data set. In: IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, (Cisda), pp. 1–6 (2009) 8. Tavallaee, M.: An Adaptive Intrusion Detection System. Sdstate.Edu. (2011) 9. Thomas, C., Balakrishnan, N.: Performance enhancement of intrusion detection systems using advances in sensor fusion. In: 11th International Conference on Information Fusion, pp. 1–7 (2008) 10. Tran, T., et al.: Network intrusion detection using machine learning and voting techniques. In: Machine Learning, pp. 7–10 (2011). http://cdn.intechweb.org/pdfs/10441.pdf 11. Tsai, C.H., Chang, L.C., Chiang, H.C.: Forecasting of ozone episode days by cost-sensitive neural network methods. Sci. Total Environ. 407(6), 2124–2135 (2009). https://doi.org/10. 1016/j.scitotenv.2008.12.007 12. Troesch, M., Walsh, I.: Machine learning for network intrusion detection, pp. 1–5 (2014) 13. Juma, S., et al.: Machine learning techniques for intrusion detection system: a review. J. Theor. Appl. Inf. Theor. 72(3), 422–429 (2015). http://research.ijcaonline.org/volume119/ number3/pxc3903678.pdf 14. Panda, M., et al.: Network intrusion detection system: a machine learning approach. Intell. Decis. Technol. 5(4), 347–356 (2011). http://dx.doi.org/10.3233/IDT-20110117%5Cnhttp:// iospress.metapress.com/content/911371h6266k5h4p/
15. Kubat, M.:. Neural networks: a comprehensive foundation by Simon Haykin, Macmillan, 1994. The Knowledge Engineering Review 13(4), pp. 409–412 (1999). ISBN 0-02-352781-7 16. LeCun, Y.A., et al.: Efficient backprop. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7700 (2012) 17. Engen, V.: Machine learning for network based intrusion detection. Int. J. (2010) 18. López, V., et al.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013). https:// doi.org/10.1016/j.ins.2013.07.007 19. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. 6(1), 7–19 (2004) 20. Chawla, N.V.: Data mining for imbalanced datasets: an overview. In: Data Mining and Knowledge Discovery Handbook, pp. 853–867 (2005). http://link.springer.com/10.1007/0387-25465-X_40 21. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Handling imbalanced datasets: a review. Science 30(1), 25–36 (2006). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96. 9248&rep=rep1&type=pdf 22. Weiss, G.M., Provost, F.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res. 19, 315–354 (2003) 23. Barandela, R., et al.: Strategies for learning in class imbalance problems.pdf. Pattern Recog. 36, 849–851 (2003) 24. Barandela, R., Sánchez, J.S., Valdovinos, R.M.: New applications of ensembles of classifiers. Pattern Anal. Appl. 6(3), 245–256 (2003) 25. Ducange, P., Lazzerini, B., Marcelloni, F.: Multi-objective genetic fuzzy classifiers for imbalanced and cost-sensitive datasets. Soft. Comput. 14(7), 713–728 (2010) 26. Lin, W.J., Chen, J.J.: Class-imbalanced classifiers for high-dimensional data. Brief. Bioinform. 14(1), 13–26 (2013) 27. Wang, J.: Advanced attack tree based intrusion detection (2012) 28. Wang, J., et al.: Extract minimum positive and maximum negative features for imbalanced binary classification. Pattern Recogn. 45(3), 1136–1145 (2012). https://doi.org/10.1016/j. patcog.2011.09.004 29. Batuwita, R., Palade, V.: Class imbalance learning methods for support vector. imbalanced learning: foundations, algorithms, applications, pp. 83–100 (2013) 30. García-Pedrajas, N., et al.: Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl. Based Syst. 25(1), 22–34 (2012) 31. Domingos, P.: MetaCost: a general method for making classifiers cost-sensitive. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 55, pp. 155–164 (1999). http://portal.acm.org/citation.cfm?id= 312129.312220&type=series 32. Zhou, Z., Member, S., Liu, X.: Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 18(1), 63–77 (2006) 33. Błaszczyński, J., et al.: Integrating selective pre-processing of imbalanced data with Ivotes ensemble. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNAI, vol. 6086, pp. 148–157 (2010) 34. Chawla, N.V., et al.: SMOTEBoost: improving prediction. In: Lecture Notes in Computer Science, vol. 2838, pp.107–119 (2003) 35. Chawla, N.V.: C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. 
In: Proceedings of the International Conference on Machine Learning, Workshop Learning from Imbalanced Data Set II (2003). http://www.site.uottawa.ca:4321/*nat/Workshop2003/chawla.pdf
36. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., et al.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 40(1), 185– 197 (2010) 37. Batuwita, R., Palade, V.: Efficient resampling methods for training support vector machines with imbalanced datasets. In: Proceedings of the International Joint Conference on Neural Networks (2010) 38. Fernandez, A., et al.: A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets Syst. 159(18), 2378–2398 (2008) 39. Fernández, A., del Jesus, M.J., Herrera, F.: On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets. Inf. Sci. 180(8), 1268– 1291 (2010). https://doi.org/10.1016/j.ins.2009.12.014 40. Weiss, G., McCarthy, K., Zabar, B.: Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? Dmin, pp. 1–7 (2007). http://storm. cis.fordham.edu/*gweiss/papers/dmin07-weiss.pdf 41. Japkowicz, N.: The class imbalance problem: significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence, pp. 111–117 (2000) 42. Van Hulse, J.: An empirical comparison of repetitive undersampling techniques, pp. 29–34 (2009) 43. Chawla, N.V., et al.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002) 44. Burez, J., Van den Poel, D.: Handling class imbalance in customer churn prediction. Expert Syst. Appl. 36(3), 4626–4636 (2009). https://doi.org/10.1016/j.eswa.2008.05.027 45. Adamu Teshome, D., Rao, V.S.: A cost sensitive machine learning approach for intrusion detection. Glob. J. Comput. Sci. Technol. 14(6) (2014) 46. Choudhury, S., Bhowal, A.: Comparative analysis of machine learning algorithms along with classifiers for network intrusion detection. In: International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), (May), pp. 89–95 (2015). http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=7225395 47. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009) 48. Mohammad, M.N., Sulaiman, N., Muhsin, O.A.: A novel Intrusion Detection System by using intelligent data mining in WEKA environment. Procedia Comput. Sci. 3, 1237–1242 (2011). https://doi.org/10.1016/j.procs.2010.12.198 49. Depren, O., Topallar, M., Anarim E., Ciliz, M.K.: An intelligent intrusion detection system foranomaly and misuse detection in computer networks. Expert Syst. Appl., 29, 713–722 (2005) 50. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International Joint Conference on Artificial Intelligence (IJCAI), (1995)
Anonymization of System Logs for Preserving Privacy and Reducing Storage

Siavash Ghiasvand (Technical University of Dresden, Dresden, Germany)
Florina M. Ciorba (University of Basel, Basel, Switzerland)
Abstract. System logs constitute valuable information for analysis and diagnosis of systems behavior. The analysis is highly time-consuming for large log volumes. For many parallel computing centers, outsourcing the analysis of system logs (syslogs) to third parties is the only option. Therefore, a general analysis and diagnosis solution is needed. Such a solution is possible only through the syslog analysis from multiple computing systems. The data within syslogs can be sensitive, thus obstructing the sharing of syslogs across institutions, third party entities, or in the public domain. This work proposes a new method for the anonymization of syslogs that employs de-identification and encoding to provide fully shareable system logs. In addition to eliminating the sensitive data within the test logs, the proposed anonymization method provides 25% performance improvement in post-processing of the anonymized syslogs, and more than 80% reduction in their required storage space. Keywords: Privacy · Anonymization · Encoding · System logs Data quality · Size reduction · Performance improvement
1 Introduction
System logs are valuable sources of information for the analysis and diagnosis of system behavior. The size of computing systems and the number of their components continually increase, and the volume of generated system logs (hereafter, syslogs) grows in proportion. Storing the syslogs produced by large parallel computing systems in view of their analysis requires high storage capacity. Moreover, the existence of sensitive data within the syslogs raises serious concerns about their storage, analysis, dissemination, and publication. The anonymization of syslogs is a means to address the second challenge: during anonymization, the sensitive information is eliminated, and the remaining data is considered cleansed data. To the best of our knowledge, no existing automatic anonymization method guarantees full user privacy. This is
due to the fact that there is always a small probability that sensitive data leaks into the cleansed data. Applying anonymization methods to syslogs to cleanse the sensitive data before storage, analysis, sharing, or publication, reduces the usability of the anonymized syslogs for further analysis. After a certain degree of anonymization, the cleansed syslog entries lose their significance and only remain useful for statistical analysis, such as time series and distributions. At this stage, it is possible to encode long syslog entries into shorter strings. Encoding significantly reduces the required storage capacity of syslogs and addresses the storage challenge mentioned earlier. Shortening the log entries’ length reduces their processing complexity and, therefore, improves the performance of further analysis on syslogs. Encoding also guarantees full user privacy via masking any potential sensitive data leakage. In this work, we also address the trade-off between the sensitivity and the usefulness of the information in anonymized syslogs [1]. It is important to note that the sensitivity and the significance of syslog entries are relative terms. Each data item (or term) in a syslog entry, depending on policies of the computing system it originates from, may or may not be considered sensitive data. The same degree of relativity applies to the significance of syslog entry data items. Depending on the chosen data analysis method, the significance of syslogs can be assessed as rich or poor. Even though the classification of each term as sensitive or as significant is related to the policies of computing centers, the final assessment of sensitivity and significance has a binary value of true (1) or false (0). Therefore, every single term in a syslog entry can only be sensitive or nonsensitive, e.g., a username. Figure 1 illustrates the relation between the sensitivity, the significance and the length of syslog terms.
Fig. 1. (a) The sensitivity, significance, and length of terms in syslog entries and their relation. (b) Trade-off scenarios between the significance, sensitivity, and length of a system log entry. Each of the i, ii, and iii illustrations depicts the four possible states of a syslog entry based on its sensitivity, significance, and length. The trade-off triangle in illustration iv shows the trade-off between the three parameters (sensitivity, significance, length) in a single unified view.
A triple trade-off exists between sensitivity, significance, and length of a syslog entry. Figure 1 schematically illustrates this trade-off, regardless of the system policies and syslog analysis methods in use. The illustration shows that a syslog entry can be in four distinct states. Green color states denote best conditions while red color states denote undesirable conditions. White color states represent neutral conditions. Under undesirable conditions, the approach taken in this work is to transition from the red state to one of the white states. The yellow arrows indicate this in Fig. 1. Increasing the significance of a syslog entry is not possible. Therefore, the remaining possibilities are either decreasing the sensitivity or reducing the length of the syslog entry. Data, in general, has a high quality when it is “fit for [its] intended uses in operations, decision making, and planning” [2]. The syslog entries represent the data in this work and several parameters affect their quality. To measure and maximize this quality, a utility function called quality (QE ) is defined as the relation between sensitivity, significance, length, and usefulness of syslog entry E. The goal of this work is to maintain the quality (QE ) of all syslog entries, by pushing the parameters mentioned above toward their best possible values, when the computing system policies degrade this quality (as exemplified later in Table 5). The main contribution of this work is introducing a new approach for anonymization that employs de-identification and encoding to provide shareable system logs with guaranteed user privacy preservation, the highest possible data quality and of reduced size. An example of the proposed approach is shown in Table 4. The remainder of this work is organized as follows. In Sect. 2 the background and current state of the art are discussed. The proposed approach is described in Sect. 3, and the methodology and technical details are provided in Sect. 4. After explaining the results of the current work in Sect. 5, the conclusion and future work directions are discussed in Sect. 7.
2 Related Work
In July 2000, the European Commission adopted a decision recognizing the “Safe Harbor Privacy Principles” [3]. Based on the “Safe Harbor” agreement, 18 personal identifiers should be eliminated from the data before its transmission and sharing.“Safe Harbor” was originally designed to address the privacy of healthcare-related information. However, its principles are also taken into account for other types of information. Later, in March 2014, European Parliament approved the new privacy legislation. According to this regulations, personal data is defined as “any information relating to an identified or identifiable natural person (‘data subject’)” [4]. This information must remain private to ensure a person’s privacy. Based on this definition, syslog entries contain numerous terms which represent personal data and must, therefore, be protected. Protection of personal data in syslog entries can be attained via various approaches; the most common ones are encryption and de-identification. Encryption reduces the risk of unauthorized access to personal data. However, the
encrypted syslog entries cannot be freely used or shared in the public domain. The risk of disclosure of the encryption-key also remains an important concern. In contrast, de-identification eliminates the sensitive data and only preserves the nonsensitive (cleansed) data. As such, de-identification provides the possibility of sharing de-identified information in the public domain. The de-identified data may turn out to no longer be of real use. Pseudonymization and anonymization are two different forms of deidentification. In pseudonymization, the sensitive terms are replaced by dummy values to minimize the risk of disclosure of the data subject identity. Nevertheless, with pseudonymization the data subject can potentially be re-identified by some additional information [5]. Anonymization, in contrast, refers to protecting the user privacy via irreversible de-identification of personal data. Several tools have been developed to address the privacy concerns of using syslog information. Most of these tools provide log encryption as the main feature, while certain such tools also provide de-identification as an additional feature. Syslog-ng and Rsyslog are two open-source centralized logging infrastructures that provide out of the box encryption and message secrecy for syslogs, as well as de-identification of syslog entries [6,7]. Both tools provide a pattern database feature, which can identify and rewrite personal data based on predefined text patterns. Logstash [8] is another open-source and reliable tool to parse, unify, and interpret syslog entries. Logstash provides a text filtering engine which can search for the text patterns in live streams of syslog entries and replace them with predefined strings [9]. In addition to the off-line tools, such as Syslogng and Logstash, there is a growing number of on-line tools, e.g., Loggy [10], Logsign [11], and Scalyr [12], that offer a comprehensive package of syslog analysis services. The existence of sensitive data in the syslogs, barricades the usage of such services. Alongside these industrial-oriented tools, several research groups have developed scientific-oriented toolkits to address the syslog anonymization challenge. eCPC toolkit [13], sdcMicro [14], TIAMAT [15], ANON [16], UTD Anonymization Toolbox [17], and Cornell Anonymization Toolkit [18] are selected examples of such toolkits. These tools apply various forms of k-anonymity [5] and l-diversity [19] to ensure data anonymization. Achieving an optimal k-anonymity is an NP-hard problem [20]. Heuristic methods, such as k-Optimize, can provide effective results [21]. The main challenges of using existing anonymization approaches, in general, are: (1) The quality of the anonymized data dramatically degrades, and (2) The size of the anonymized syslogs remains almost unchanged. The industrialoriented approaches are unable to attain full anonymization at micro-data [5] level. Even though scientific-oriented approaches can guarantee a high level of anonymization, they are mainly not capable of applying effective anonymization in an online manner. Certain scientific-oriented methods, such as [22], which can effectively anonymize online streams of syslogs, need to manipulate log entries at their origin [23].
The anonymization approach proposed in this work is distinguished from existing work through the following features: (1) Ability to work with streams of syslogs without modification of the syslog origin; (2) Preservation of the highest possible quality of log entries; and (3) Reduction of the syslogs storage requirements, whenever possible.
3 Proposed Approach
Computing systems can generate system logs in various formats. RFC5424 proposes a standard for the syslog protocol which is widely accepted and used on computing systems [24]. According to this protocol, all syslog entries consist of two main parts: a timestamp and a message. In addition to these two main parts, there are optional parts, such as system tags. Let us consider the following sample syslog entry E1 : “1462053899 Accepted publickey for Siavash from 4.3.2.1”. In this entry, “1462053899” is the timestamp and the rest of the line“Accepted publickey for Siavash from 4.3.2.1” is the message. In the message part, the terms Accepted, publickey, for, and from are constant terms, while Siavash, and 4.3.2.1 are variable terms, in the sense that for the above variable terms, the user name and IP can vary among users and machines. The goal of this work, described earlier in Sect. 1, is to preserve the quality of syslog entries throughout the anonymization process and preserve the user privacy. To achieve this goal, (1) The variable terms in the syslog entries are divided into three groups: sensitive, significant, and nonsignificant terms. (2) The sensitive terms are eliminated to comply with the privacy policies. (3) The nonsignificant terms are replaced with predefined constants to reduce the required storage. (4) Following the anonymization Steps (2) and (3) above, every syslog entry that does not have any remaining variable terms, is mapped to a hash-key, via a collision-resistant hash function. This step is called encoding. (5) The quality of the remaining syslog entries is measured with a utility function. (6) When it is revealed that removing a significant term from the syslog entry improves the quality of syslog, that particular term is replaced with a predefined constant. (7) The remaining processed syslog entries that do not contain additional variable terms, are mapped into hash-keys (similar to Step (4) above). (8) Upon completion of Steps (4) and (7), the hash-key codes can be optimized based on their frequency of appearance. The preliminary results of analyzing the syslogs of a production HPC system called Taurus1 using the proposed approach shows up to 95% reduction in storage capacity [25]. An interactive demonstration of the use of this anonymization approach on a sample syslog is provided online [26]. In the proposed approach, regular expressions are used for the automatic detection of variable terms within syslog entries. Categorization of automatically detected terms into sensitive and/or significant is performed based on the information in Table 2. This information is manually inferred from the policies and conditions of the host high-performance computing system. Automatically 1
https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/SystemTaurus.
detected variable terms that do not belong to either the sensitive or the significant category are considered nonsignificant. A variable-length hash algorithm is used to encode the syslog entries; the encoding step is described in greater detail at the end of this section. Table 1 contains the 15 main regular expressions (out of 38) used to detect variable terms in Taurus syslog entries. The order of their application is important, since certain patterns are subsets of other patterns. Although most variables can be detected with these basic regular expressions, in the unlikely case of similarity between variables and constants a regular expression may fail to differentiate between them correctly; for example, the username panic could be misinterpreted as part of a constant value such as kernel panic. Various such misinterpretations are imaginable, yet unlikely, so the overhead of employing sophisticated methods to detect them is not justifiable. In contrast, encoding is a robust and lightweight way to address all forms of misinterpretation: undetected variables are treated as constants and simply appear as a new event pattern, and encoding the event patterns in the final step of the proposed approach masks any potential data leakage and guarantees the highest attainable level of anonymization.

Table 1. Regular expressions used to detect certain terms within Taurus syslogs
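Since the 38 regular expressions themselves are not reproduced here, the following sketch uses illustrative stand-in patterns to show how variable terms can be detected and 'constantified' in a message, applying more specific patterns before more general ones.

```python
import re

# Illustrative patterns only (the paper's actual 38 expressions are not shown here).
# Order matters: more specific patterns are applied before more general ones.
PATTERNS = [
    ("#IP4#",  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("#PORT#", re.compile(r"\bport \d+\b")),
    ("#USR#",  re.compile(r"(?<=for )\S+")),   # user name following "for"
]

def constantify(message: str) -> str:
    """Replace detected variable terms with predefined constants."""
    for placeholder, pattern in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

print(constantify("Accepted publickey for Siavash from 4.3.2.1"))
# -> "Accepted publickey for #USR# from #IP4#"
```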
As the first step, the quality of syslog entries needs to be quantified. The product of four characteristics of syslog entries defines the syslog entry quality: (1) sensitivity, (2) significance, (3) length, and (4) usefulness. To render uniform the impact of all characteristics, their importance is normalized in the range of 0 to 1, and the negative parameters are replaced with their reverse positive counterparts. Therefore, the effective parameters are nonsensitivity, significance, reduction, and usefulness. The nonsensitivity parameter of a syslog entry can take any
value in the range of 0 (most sensitive) to 1 (most nonsensitive), with 1 denoting the best value. The significance parameter can likewise take any value in the range of 0 (nonsignificant) to 1 (highest significance), with 1 representing a highly relevant syslog entry. The length of a syslog entry can be interpreted as its size, and the reduction of syslog entry size may take any value in the range of 0 (no reduction) to 1 (maximum reduction). Size reduction can be achieved via any general lossy or lossless compression algorithm. When the applied compression method does not change the significance and sensitivity of syslog entries, it is considered lossless from the perspective of this work; if it modifies the significance or sensitivity, it is taken as an additional level of anonymization rather than compression. A careful consideration of various effective compression algorithms, including Brotli, Deflate, Zopfli, LZMA, LZHAM, and Bzip2, revealed that in affordable time compression can reduce the data to about 25% of its original size. Therefore the reduction of syslogs ranges between 0.75 and 1, where 1 indicates 100% compression and is practically impossible to reach. Every time a compressed syslog entry is processed, the decompression imposes an additional performance penalty on the host system. Therefore, the approach proposed in this work uses an encoding algorithm instead of a compression algorithm that demands decompression before the compressed data can be accessed; the encoded data can be used without pre-processing (decompression). Unlike the previous three parameters, the fourth parameter, usefulness, is boolean and takes the value 0 or 1, denoting that a syslog entry in its current form cannot or can, respectively, be used for a specific type of analysis.

Q_E = U_E ∗ (n ∗ N_E) ∗ (s ∗ S_E) ∗ (r ∗ R_E)                (1)

Equation (1) quantifies the quality (Q_E) of a syslog entry E as the product of its nonsensitivity (N_E), significance (S_E), reduction (R_E), and usefulness (U_E). The coefficients n, s, and r indicate the importance of nonsensitivity, significance, and reduction for a specific computing system; their default value is 1. A value of 0 for usefulness results in a 0-quality syslog entry and disqualifies it from further analysis. As explained earlier, regardless of system conditions and policies, a reduction rate of 75% is always achievable [27–30]. Therefore, the quality of a raw syslog entry is calculated using (2).

Q_E = 1 ∗ (1 ∗ N_E) ∗ (1 ∗ S_E) ∗ (1 ∗ 0.75)                (2)
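A direct transcription of Eq. (1) into code, evaluated with the values derived below for the sample entry E1 (4 of 6 terms nonsensitive, 4 of 6 significant, and a 75% size reduction assumed to be always achievable):

```python
def syslog_entry_quality(nonsensitivity, significance, reduction, useful,
                         n=1.0, s=1.0, r=1.0):
    """Quality Q_E of a syslog entry as defined in Eq. (1): the product of its
    usefulness, weighted nonsensitivity, significance, and size reduction."""
    return float(useful) * (n * nonsensitivity) * (s * significance) * (r * reduction)

# Sample entry E1 ("Accepted publickey for Siavash from 4.3.2.1"):
print(round(syslog_entry_quality(4/6, 4/6, 0.75, True), 2))   # ~0.33
```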
The sensitivity of each syslog entry term is defined based on the policies set up by the computing system administrators. Table 2 indicates the sensitivity and the significance of syslog entry terms of a computing system. The severity degree of each term’s sensitivity varies from 0 to 10. This degree is only used to give priority to the individual anonymization steps. Each syslog entry term can only be either sensitive (Y) or nonsensitive (N). Therefore, in this section, only the boolean sensitivity indicator (Y/N) is considered to denote the sensitivity of each syslog entry term. The same assumptions hold for the significance of each
syslog entry term. Accordingly, each term can be either significant or nonsignificant. The significance of each term can be judged from three sources: (1) every sensitive term is also significant; (2) every significant term is marked with "Y" in the significance table (Table 2); (3) all terms that are neither included in the significance table nor marked with "Y" therein simply have length, and are nonsensitive and nonsignificant. Table 3 indicates the sensitivity and significance of each term of the message part of the sample syslog entry, based on the information in Table 2. The nonsensitivity (N_E) and the significance (S_E) of a syslog entry E are obtained using (3):

N_E = (number of nonsensitive terms in entry E) / (total number of terms)
S_E = (number of significant terms in entry E) / (total number of terms)        (3)

Calculating these properties for the sample syslog entry E1 from Table 3, with the information from Table 2, results in N_E1 = 4/6 and S_E1 = 4/6, respectively. The quality of the sample syslog entry (Q_E1) is then obtained with (2) as Q_E1 = 1 ∗ 4/6 ∗ 4/6 ∗ 0.75 ≈ 0.33. The steps for performing a full anonymization of the sample syslog entry E1 from Table 3 with the proposed approach are shown in Table 4.
Table 2. Classification of syslog entry terms into sensitive and/or significant. Severity denotes the importance of the characteristics for the respective terms.

Term          Sensitivity  Severity    Term        Sensitivity  Severity
User Name     Y            10          accept*     Y            07
IP Address    Y            08          reject*     Y            10
Port Number   Y            01          close*      Y            08
Node Name     Y            03          *connect*   Y            09
Node ID       Y            03          start*      Y            02
Public Key    Y            10          *key*       Y            01
App Name      N            00          session     Y            07
Path / URL    N            00          user*       Y            05

Table 3. A sample syslog entry. Sensitive and significant terms are marked with "Y" in the respective rows.

Message      Accepted  Publickey  for  Siavash  from  4.3.2.1
Sensitive    -         -          -    Y        -     Y
Significant  Y         Y          -    Y        -     Y

Table 4. Anonymization and encoding of the sample syslog entry from Table 3.

The encoding algorithm used in this work is the variable-length hash algorithm SHAKE-128 [31, 32], with a 32-bit output length that is adjustable based on the system requirements. All syslog entries that follow the pattern of the sample syslog entry E1, "Accepted publickey for Siavash from 4.3.2.1",
regardless of the values they carry, after 'constantification' are identical to "Accepted publickey for #USR# from #IP4#". This string is an event pattern. Event patterns are constant strings with a certain significance; replacing them with a shorter identifier does not change their meaning, as long as the identifier that replaces a particular event pattern is known. Therefore, in this work a hashing function is used to transform event patterns from syslogs into shorter single-term identifiers. Using a hashing function guarantees that an event pattern is always converted to the same identifier (hash-key). The identifier carries the same significance as the event pattern, in an 8-character string. The identifier "caa5002d", compared with the original string "Accepted publickey for Siavash from 4.3.2.1" of 43 characters, represents an 81% decrease in string length. Apart from shortening the syslog entries, using identifiers also reduces the number of terms in each entry and makes entry lengths uniform, which in turn results in a significant performance improvement in further processing of
syslog entries. Furthermore, the final hashing step, which maps entries to identifiers, provides a fast and effective way to identify natural groupings of syslog entries; this grouping capability is discussed in more detail in Sect. 6.
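A minimal sketch of the encoding step, using the SHAKE-128 implementation available in Python's standard library with a 4-byte (8 hexadecimal character) output; the resulting identifier will in general differ from the value "caa5002d" quoted above, since the exact input string and parameters used by the authors are not specified here.

```python
import hashlib

def encode_event_pattern(pattern: str, n_bytes: int = 4) -> str:
    """Map an event pattern to a fixed-length identifier using SHAKE-128.
    A 4-byte digest yields an 8-character hexadecimal identifier."""
    return hashlib.shake_128(pattern.encode("utf-8")).hexdigest(n_bytes)

pattern = "Accepted publickey for #USR# from #IP4#"
print(encode_event_pattern(pattern))   # the same pattern always yields the same key
```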
4 Methodology
We selected a 13-month collection of syslogs between February 01, 2016 and February 28, 2017, from the Taurus production HPC cluster as the source of information in the present study. Taurus is a Linux-based parallel cluster with 2014 computing nodes. It employs Slurm [33] as its batch system. Taurus’ 2014 computing nodes are divided into six islands, mainly based on their processing units type: CPUs (Intel’s Haswell, Sandy Bridge, Triton, Westmere), and GPUs. The 13-month collection includes syslog entries from all 2014 Taurus computing nodes. The syslog daemons on the computing nodes are configured to submit syslog entries to a central node. The central node, in turn, sends the entries to a syslog storage node. On this storage node, daily syslog entries are accumulated according to their origin into different log files. Therefore, the 13-month collection includes 2014 system log files per day (one log file for each computing node). The number of syslog entries generated by a computing node per day depends on various factors, including system updates and node failures. For the 13-month period of this study, approximately 984.26GiB of syslogs were collected, which comprise 8.6 billion syslog entries. Various causes, such as scheduled maintenance or node failures, are responsible for a certain percentage of errors during the collection of syslog entries. The completeness of the syslog collection process can be measured by considering the
Fig. 2. Illustration of the syslog entries collection gaps. Approximately 3% of the syslog entries collected over 13 months were not correctly recorded. The data loss occurred at three distinct points in time, identified as three vertical lines. The causes of this data loss are (a) scheduled maintenance, (b) reaction of automatic overheating protection mechanism, and (c) failure of the central syslog collection node.
presence of a log file as the indicator that syslog entries were gathered from a particular node on a given day. Based on this definition, the syslog collection completeness for the time interval considered in this work is 97%. The red lines in Fig. 2 indicate the 3% of missed (uncollected) syslogs. Most of the missing 3% of syslog entries were lost over the course of three days, marked at the top of Fig. 2 with the letters a, b, and c; the reasons were (a) scheduled maintenance, (b) the reaction of the automatic overheating protection mechanism, and (c) a failure of the central syslog collection node.
5 Results
The proposed anonymization approach has been applied to the 13-month collection of Taurus syslog entries. During this process, the sensitivity and significance of each term had to be identified based upon the policies of the Taurus HPC cluster. According to the user privacy and data protection act of the Center for Information Services and High Performance Computing at the Technical University of Dresden (TUD), Germany, HPC system usage information may be anonymously collected from the users and shared with research partners. This information includes, but is not limited to, various metrics about processors, networks, storage systems, and power supplies [34]. Other types of information are processed according to the IT [35] and identity management [36] regulations of TUD. Based on these regulations, certain data are considered sensitive and must, therefore, remain confidential [37]. To the best of our knowledge, the information in Table 5 captures the data sensitivity according to the TUD privacy regulation in force. From Table 5, one can note that certain syslog entry terms, such as node names or port numbers, may remain unchanged. Applying the proposed approach, under the TUD privacy regulations, to the 8.6 billion syslog entries from Taurus (985 GiB uncompressed) revealed seven facts. (1) Only approximately 35% of the syslog terms are sensitive and need to be anonymized; therefore, approximately 65% of the terms remain untouched (Fig. 3). (2) The anonymization of sensitive terms has less than 0.5% impact on syslog size reduction. (3) The quality of most entries degraded post-anonymization. (4) Approximately 3,000 unique event patterns were discovered. (5) More than 90% of syslog entries are based on 40 event patterns (hereafter, frequent patterns); all other, non-frequent event patterns together are responsible for less than 10% of syslog entries. For instance, more than 15% of the syslog entries follow the (#USER#) cmd (#PATH#) pattern (Fig. 3). (6) A small percentage of syslog entries (approximately 5%) among the non-frequent event patterns do not contain any variable terms in their original form (e.g., disabling lock debugging due to kernel taint). (7) According to the current TUD privacy regulations, almost all syslog entries lose their significance after anonymization (e.g., "failed password for #USER# from #IPv4# port 32134 ssh2").
Fig. 3. (a) Syslog term sensitivity according to the TU Dresden privacy regulations. (b) Percentage of sensitive and nonsensitive terms within the 13-month-long collection of syslogs. The red bars indicate the percentage of sensitive terms while the green bars indicate the percentage of nonsensitive terms. The sensitive terms sum up to approximately 35% of all terms in the collection.
The only remaining useful information in these cases is the meaning (significance) of the event pattern itself. For the above example, authentication via ssh failed is the meaning. Based on the above seven observations about syslog entries on Taurus, we can state that: (1) The anonymized syslogs consist of approximately 90% nonsignificant entries (after mandatory anonymization), (2) Approximately 5% of the entries are constant (without any variable terms), (3) approximately 5% are entries with significance (retained their useful properties even after anonymization). Following the necessary anonymization, (90 + 5)% of syslog entries no longer have significance and can be converted to hash-keys. The 5% of syslog entries which carry a certain degree of significance even after anonymization,
may remain untouched. The information within these 5% syslog entries may become useful in revealing the root-cause of occurring failures. Statistical analysis that focus on anomaly detection does not require such information. Therefore, according to the chosen analysis method the remaining 5% of the syslog entries can also be encoded to hash-keys. Table 5 illustrates the application of proposed approach on sample syslog entries in three steps. Table 5 is a reference to the meaning of each of the hash keys. Together with the anonymized syslogs and according to the privacy regulations, the information in Table 5 may also be fully/partially published. Table 5 compares certain characteristics of a sample syslog, before and after application of the proposed approach. The data in part (ii) of Table 5 follow the main anonymization guidelines. This fact enables their inclusion in the present work. However, since syslog entries lengths have been reduced, the data in part (iii) of Table 5 delivers the very same significance as part (ii), at a much smaller length. The usefulness of the anonymized and hashed information from the 13-month syslog collection remains identical. Reprocessing the results from an earlier Table 5. Demonstration of the Proposed Approach
work [38], in which the correlation of failures in Taurus was analyzed, via the new anonymization and encoding approach led to identical outcomes. Moreover, due to the single-term syslog entries, the processing time was approximately 25% shorter than before.
6 Discussion
The results reported in Sect. 5 describe a real-world use case, in which we applied statistical behavior analysis methods to system logs to detect symptoms of potential upcoming failures. The quality of data is defined based on its usefulness. Quality in the provided use case is defined as the compatibility of the output data with the intended statistical behavior analysis methods [25]. The conditions and required anonymization were defined by university policies, and therefore our flexibility was limited.

Table 6. Demonstration of grouping via encoding
Even though the anonymization of syslogs is the main focus of this paper, the system logs are used only to exemplify the anonymization and data compression solutions. The proposed approach is generic and independent of the type and structure of the input data. In general, any timestamped data which can be divided into entries and terms is suitable to be analyzed via the proposed approach. Each anonymization step (Table 4) masks a certain term in the data entries. Applying different anonymization steps to a single entry, followed by the encoding (hashing) step, results in different identifiers for that unique entry. The anonymization steps should be adjusted according to the analysis methods. Based on the applied anonymization steps, the resulting identifiers form various groups. Data mining algorithms can effectively use this grouping functionality [39]. An example of data grouping under different anonymization steps is shown in Table 6. Although in steps (i), (ii), (iii), and (iv) of Table 6 some sensitive terms in each entry remained untouched, the final encoding step ensures the full anonymization of entries by hashing them into irreversible identifiers. The encoding step, because of its irreversibility, eliminates any potential leakage of sensitive information into the cleansed data, as well as quasi-identifiers and micro-data. As shown in Table 6, the identifiers preserve their utility for various statistical analysis methods, e.g., PrefixSpan, SPADE, SPAM, GSP, CM-SPADE, CM-SPAM, FCloSM, FGenSM, PFP-Tree, MKTPP, ITL-Tree, PF-tree, and MaxCPF. Therefore, we consider that such data (identifiers) have high quality for the intended purpose. Since our analysis methods required a small number of more generic classes of syslog entries, a complete set of anonymization steps has been applied in the example provided in Sect. 5.
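To make the masking-and-encoding idea concrete, the following minimal Python sketch applies a few illustrative masking steps to a syslog entry and then encodes the result into an irreversible identifier using SHA-3, the hash family cited in [31]. The regular expressions, the step order, and the function names are our own illustration, not the implementation used for Taurus.

```python
# Minimal sketch: mask sensitive terms step by step, then encode the entry
# into an irreversible identifier. Patterns and step order are illustrative.
import hashlib
import re

# Each anonymization step masks one class of sensitive terms (cf. Table 4).
ANONYMIZATION_STEPS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "#IPv4#"),
    (re.compile(r"(?<=for )\w+"), "#USER#"),   # crude user capture after 'for'
    (re.compile(r"(/[\w.\-]+)+"), "#PATH#"),
]

def anonymize(entry: str, steps=ANONYMIZATION_STEPS) -> str:
    """Apply the selected masking steps to a single syslog entry."""
    for pattern, placeholder in steps:
        entry = pattern.sub(placeholder, entry)
    return entry

def encode(entry: str) -> str:
    """Irreversibly encode an (anonymized) entry into a fixed-size identifier."""
    return hashlib.sha3_224(entry.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    raw = "failed password for alice from 10.1.2.3 port 32134 ssh2"
    masked = anonymize(raw)   # -> 'failed password for #USER# from #IPv4# port 32134 ssh2'
    print(masked, encode(masked))
```

Applying only a subset of the masking steps before encoding yields a different identifier for the same raw entry, which is exactly the grouping behavior illustrated in Table 6.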
7 Conclusion
System logs have been widely used in various domains, from system monitoring and performance analysis to failure prediction for different system components. Even though system logs are mainly system dependent, having knowledge about various computing systems improves the general understanding of computing systems' behavior. However, due to the vast amount of personal data among the system log entries, users' privacy concerns impede the free circulation and publication of system logs. In this work, we examined the trade-off between sensitivity and significance of the information within system logs. Keeping the nonsignificant data is not the best practice, knowing that beyond a certain level of anonymization the significance of the given system logs may be lost. The risk of sensitive data leakage into the cleansed data also impedes the required dissemination and circulation of data. This work introduced quality, the system log utility function, as a measurable parameter calculated based on the nonsensitivity, significance, size reduction, and usefulness of system logs. The goal is to maintain the quality of system logs by pushing all effective parameters to their possible limits, while preserving user privacy protection. The proposed approach has
been applied to a thirteen-month collection of Taurus HPC cluster system logs, recorded between February 01, 2016 and February 28, 2017. The proposed anonymization approach can guarantee full anonymization of syslog entries via the final encoding step. Apart from the highest degree of anonymization, a total reduction of more than 80% in system log size as well as a 25% performance improvement in system log analysis was achievable. Adjusting the anonymization steps according to the intended analysis methods provides a natural grouping functionality for the output data (identifiers), which can be effectively used by various data mining algorithms. The current hashing function produces larger hash-keys than required in order to avoid hash-key collisions. Fine-tuning of the hashing function according to the computing system requirements, together with improving the variable term detection, is planned as future work. Using a fine-tuned hash function, further size reduction is achievable. The motivation behind this work stems from the usefulness of analyzing the system behavior. System logs are used in this work only to exemplify the anonymization and data compression solutions. The proposed approach benefits privacy protection in any type of repetitive data stream, regardless of the format and type of data.

Acknowledgement. This work is in part supported by the German Research Foundation (DFG) in the Cluster of Excellence “Center for Advancing Electronics Dresden” (cfaed). The authors also thank Holger Mickler and the administration team of the Technical University of Dresden, Germany, for their support in collecting the monitoring information on the Taurus high performance computing cluster.

Disclaimer. References to legal excerpts and regulations in this work are provided only to clarify the proposed approach and to enhance the explanation. In no event will the authors of this work be liable for any incidental, indirect, consequential, or special damages of any kind, based on the information in these references.
References
1. Cranor, L., Rabin, T., Shmatikov, V., Vadhan, S., Weitzner, D.: Towards a privacy research roadmap for the computing community. ArXiv e-prints (2016)
2. Redman, T.C.: Data Driven: Profiting from Your Most Important Business Asset. Harvard Business Press (2008)
3. European Commission Decision. http://eur-lex.europa.eu/legal-content/en/ALL/?uri=CELEX:32000D0520. Accessed 06 June 2017
4. General Data Protection Regulation. http://gdpr-info.eu/art-4-gdpr/. Accessed 06 June 2017
5. Sweeney, L.: Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy (2000, working paper)
6. Dahlberg, R., Pulls, T.: Standardized Syslog Processing: Revisiting Secure Reliable Data Transfer and Message Compression. Karlstad, Sweden (2016)
7. New rsyslog 7.4.0. http://www.rsyslog.com/7-4-0-the-new-stable/. Accessed 06 June 2017
8. Logstash: centralize, transform and stash your data. http://www.elastic.co/products/logstash. Accessed 06 June 2017
9. Sanjappa, S., Ahmed, M.: Analysis of logs by using logstash. In: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, pp. 579–585. Springer, Singapore (2017)
10. Loggly, log management. http://www.loggly.com/. Accessed 06 June 2017
11. SIEM, log management, compliance. http://www.logsign.com/. Accessed 06 June 2017
12. Blazing-fast log management and server monitoring. http://www.scalyr.com. Accessed 06 June 2017
13. Gholami, A., Laure, E., Somogyi, P., Spjuth, O., Niazi, S., Dowling, J.: Privacy-preservation for publishing sample availability data with personal identifiers. J. Med. Bioeng. 4(2) (2015)
14. Templ, M., Kowarik, A., Meindl, B.: Statistical disclosure control methods for anonymization of microdata and risk estimation. http://cran.r-project.org/web/packages/sdcMicro/index.html. Accessed 06 June 2017
15. Dai, C., Ghinita, G., Bertino, E., Byun, J.-W., Li, N.: TIAMAT: a tool for interactive analysis of microdata anonymization techniques. Proc. VLDB Endow. 2(2), 1618–1621 (2009)
16. Ciglic, M., Eder, J., Koncilia, C.: Anonymization of data sets with null values. Trans. Large-Scale Data- and Knowledge-Centered Systems XXIV: Special Issue on Database- and Expert-Systems Applications, 193–220 (2016)
17. UTD anonymization toolbox. http://cs.utdallas.edu/dspl/cgi-bin/toolbox. Accessed 06 June 2017
18. Xiao, X., Wang, G., Gehrke, J.: Interactive anonymization of sensitive data. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, pp. 1051–1054. ACM, New York (2009)
19. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: L-diversity: privacy beyond k-anonymity. In: 22nd International Conference on Data Engineering (ICDE 2006), pp. 24–24, April 2006
20. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2004, pp. 223–228. ACM, New York (2004)
21. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: 21st International Conference on Data Engineering (ICDE 2005), pp. 217–228, April 2005
22. Rath, C.: Usable privacy-aware logging for unstructured log entries. In: 11th International Conference on Availability, Reliability and Security (ARES), pp. 272–277, August 2016
23. Privacy-aware logging made easy. http://github.com/nobecutan/privacy-awarelogging. Accessed 06 June 2017
24. The syslog protocol. http://tools.ietf.org/html/rfc5424. Accessed 06 June 2017
25. Ghiasvand, S., Ciorba, F.M.: Toward resilience in HPC: a prototype to analyze and predict system behavior. In: Poster at International Supercomputing Conference (ISC), June 2017
26. Demonstration of anonymization and event pattern detection. https://www.ghiasvand.net/u/paloodeh. Accessed 06 June 2017
27. Alakuijala, J., Kliuchnikov, E., Szabadka, Z., Vandevenne, L.: Comparison of brotli, deflate, zopfli, lzma, lzham and bzip2 compression algorithms. http://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf. Accessed 06 June 2017
28. Collin, L.: A quick benchmark: Gzip vs. Bzip2 vs. LZMA. http://tukaani.org/lzma/benchmarks.html. Accessed 06 June 2017
29. Quick benchmark: Gzip vs Bzip2 vs LZMA vs XZ vs LZ4 vs LZO. http://www.ghiasvand.net/u/compression. Accessed 06 June 2017
30. Mahoney, M.: 10 GB compression benchmark. http://mattmahoney.net/dc/10gb.html. Accessed 06 June 2017
31. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: The KECCAK SHA-3 submission. http://keccak.noekeon.org/Keccak-submission-3.pdf. Accessed 06 June 2017
32. Fluhrer, S.: Comments on FIPS-202. http://csrc.nist.gov/groups/ST/hash/sha-3/documents/fips202_comments/Fluhrer_Comments_Draft_FIPS_202.pdf. Accessed 06 June 2017
33. Yoo, A.B., Jette, M.A., Grondona, M.: SLURM: simple Linux utility for resource management. In: Proceedings of the 9th International Workshop on Job Scheduling Strategies for Parallel Processing, pp. 44–60. Springer, Heidelberg (2003)
34. Terms of use of the HPC systems at the ZIH, Technical University Dresden, Germany. http://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TermsOfUse/HPC-Nutzungsbedingungen_20160901.pdf. Accessed 06 June 2017
35. Order for the Information Technology Facilities and Services and for the Information Security of the Technical University of Dresden (IT-Regulations), Germany. http://www.verw.tu-dresden.de/amtbek/PDF-Dateien/2016-12/sonstO05.01.2016.pdf. Accessed 06 June 2017
36. Order for the Establishment and Operation of an Identity Management System at the Technical University of Dresden, Germany. http://www.verw.tu-dresden.de/AmtBek/PDF-Dateien/2011-05/sonstO26.07.2011.pdf. Accessed 06 June 2017
37. Information leaflet on IT resources, Technical University Dresden, Germany. http://tu-dresden.de/zih/dienste/service-katalog/zugangsvoraussetzung/merkblatt?set_language=en. Accessed 06 June 2017
38. Ghiasvand, S., Ciorba, F.M., Tschüter, R., Nagel, W.E.: Analysis of node failures in high performance computers based on system logs. In: Poster at International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) (2015)
39. Fournier-Viger, P., Lin, J.C., Vo, B., Truong, T.C., Zhang, J., Le, H.B.: A survey of itemset mining. Wiley Interdiscip. Rev.: Data Mining and Knowledge Discovery 7(4) (2017). https://doi.org/10.1002/widm.1207
Detecting Target-Area Link-Flooding DDoS Attacks Using Traffic Analysis and Supervised Learning

Mostafa Rezazad1(B), Matthias R. Brust2, Mohammad Akbari3, Pascal Bouvry2, and Ngai-Man Cheung4

1 Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
[email protected]
2 SnT, University of Luxembourg, Luxembourg City, Luxembourg
{matthias.brust,pascal.bouvry}@uni.lu
3 SAP Innovation Center Singapore, Singapore, Singapore
[email protected]
4 Singapore University of Technology and Design, Singapore, Singapore
[email protected]
Abstract. A novel class of extreme link-flooding DDoS (Distributed Denial of Service) attacks is designed to cut off entire geographical areas such as cities and even countries from the Internet by simultaneously targeting a selected set of network links. The Crossfire attack is a target-area link-flooding attack, which is orchestrated in three complex phases. The attack uses a massively distributed large-scale botnet to generate low-rate benign traffic aiming to congest selected network links, so-called target links. The adoption of benign traffic, while simultaneously targeting multiple network links, makes detecting the Crossfire attack a serious challenge. In this paper, we present analytical and emulated results showing hitherto unidentified vulnerabilities in the execution of the attack, such as a correlation between coordination of the botnet traffic and the quality of the attack, and a correlation between the attack distribution and detectability of the attack. Additionally, we identified a warm-up period due to the bot synchronization. For attack detection, we report results of using two supervised machine learning approaches: Support Vector Machine (SVM) and Random Forest (RF) for classification of network traffic into normal and abnormal traffic, i.e., attack traffic. These machine learning models have been trained in various scenarios using the link volume as the main feature set.

Keywords: Distributed Denial of Service (DDoS) · Link-flooding attacks · Traffic analysis · Supervised learning · Detection mechanisms
1 Introduction: The Crossfire Attack
A novel class of extreme link-flooding DDoS (Distributed Denial of Service) attacks [1] is the Crossfire attack, which is designed to cut off entire geographical
areas such as cities and even countries from the Internet by simultaneously targeting a selected set of network links [2,3]. The most intriguing property of this target-area link-flooding attack is the usage of legitimate traffic flows to achieve its devastating impact, making the attack particularly difficult to detect and, consequently, to mitigate [4]. The Crossfire attack uses a complex and massively large-scale botnet for attack execution [4]. A botnet is a network of computers infected with malware (bots) that can be controlled remotely. A command-and-control unit updates the bots by sending them the commands of the botmaster, which orchestrates the attack by executing the attack procedure. The bots direct their low-intensity flows to a large number of servers in such a devastating manner that the targeted geographical region is essentially cut off from the Internet. The success of the attack depends highly on the network structure and on how the attacker plans and initiates the attack sequence [5,6]. The attacker aims to find a set of target links which connect to the decoy servers such that, if the target links are flooded, traffic destined to the target area is prevented from reaching its destination. Reciprocally, access from the target area to Internet services outside the target area will be cut off. For the adversary to achieve its goal, it chooses public servers either inside the target area or nearby, which can be easily found due to their availability. The quality of the attack depends on the specific selection of servers and the resulting links to be targeted, but also on the overall network topology [7]. The Crossfire attack: The Crossfire attack consists of three phases: (a) the construction of the link map, (b) the selection of target links, and (c) the coordination of the botnet. While phases (a) and (b) are sequentially executed only initially, once triggered, phase (c) is executed periodically. Figure 1 illustrates the dynamics of the Crossfire attack. A simplified sketch of the first two phases is given at the end of this section. Link Map Construction: The initial step of the Crossfire attack is the construction of the link map. The attacker creates a map of the network along the ways from the attacker's bots to the servers using traceroute. The result of traceroute inevitably consists of a record of different routes between the same pairs of nodes, because of network-inherent elements influencing the effective route chosen (e.g., ISP and load balancing). Subsequently, a link map is gradually constructed, which exposes the network structure and the traffic flow behavior around the target area. Target Link Selection: After the construction of the link map, the adversary evaluates the data for more stable and reliable routes to decide on its selection of the target links. The adversary prefers disjoint routes with mostly independent target links for the attack to create the biggest impact. Bot coordination: In the final phase of the attack, the adversary coordinates the bots to generate low-intensity traffic and to send it to the corresponding decoy server.
Fig. 1. The Crossfire attack traffic flows congest a small set of selected network links using benign low-rate flows from bots to publicly accessible servers, while degrading connectivity to the target area.
The targeted aggregation of multiple low-intensity traffic flows on the target link ideally exhausts its capacity, hence congesting the link. Because the Crossfire attack aims to congest the target links with low-rate benign traffic, neither signature-based Intrusion Detection Systems (IDS) nor alternative traffic anomaly detection schemes are capable of detecting malicious behavior on individual flows. The Crossfire attack's detectability can even be further reduced by integrating any of the following features into the attack: the attacker (a) gradually increases bot traffic intensity, (b) estimates the decoy servers' bandwidth to avoid exceeding their bandwidth, (c) evenly distributes the traffic over the decoy servers, (d) alternates the set of bots flooding a target link, and (e) alternates the set of target links [4]. Although these techniques make the attack more sophisticated, the research described in this paper focuses on the effort the adversary has to invest for successful attack preparation and execution. As it turns out, the inherent complexities of the attack also create substantial execution obstacles, which expose the attack to detection vulnerabilities. In this paper, we describe how the Crossfire attack has been replicated in a realistic test bed emulation. The traffic has been measured during the topology construction phase and attack phase and analyzed for patterns and vulnerabilities of the Crossfire attack. The results indicate that characteristic traffic anomalies emerge in the attack region. Particularly, we found a correlation between coordination of the botnet traffic and the quality of the attack and a correlation between the attack distribution and detectability of the attack. Additionally, we show that due to the bot synchronization there is a warm-up period after the
attack is launched and before the target links are overwhelmed. Because of this warm-up period and the distinguishing patterns in the topology construction phase, the obtained results pave the way for novel detection methods in the early stage of the attack, when the attack traffic is formed. As a consequence, based on the intrinsic property of attack traffic distribution, we propose a new approach to monitor the traffic volume (or intensity) on specific network regions for any sudden subtle changes on some of the links. Depending on the resolution of the monitoring scheme, this leads to an early detection of the attack, which we illustrate in this paper. This paper also provides a functional analysis on how to assess the impact of the Crossfire attack on the affected area more realistically, instead of overestimating the resources needed for attack detection and mitigation. We analyze these challenges in attack preparation and execution of the Crossfire attack and exploit them for attack detection. Hence, we describe a prototypical Crossfire attack detector, which exploits these vulnerabilities. For this, we utilize two supervised machine learning approaches: Support Vector Machine (SVM) and Random Forest (RF) for classification of network traffic into normal and abnormal traffic, i.e., attack traffic. To show the feasibility of detection, we report on the trained scenarios using the link volume as the main feature set. Finally, results of the attack detector are reported along with some future directions to improve the detector.
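As announced above, the following simplified sketch illustrates the first two attack phases only conceptually: it aggregates traceroute-style hop lists into a link map and ranks links by the number of bot-to-decoy routes that traverse them. The route data, names, and selection criterion are illustrative assumptions, not the attacker tooling analyzed in this paper.

```python
# Simplified sketch: build a link map from traceroute-style hop lists and
# rank links by how many bot-to-decoy routes traverse them ("flow density").
# The route data and the top-k criterion are illustrative, not measured values.
from collections import Counter
from itertools import islice

def links_of(route):
    """Turn a hop list [h1, h2, ...] into directed links [(h1, h2), (h2, h3), ...]."""
    return list(zip(route, islice(route, 1, None)))

def build_link_map(routes):
    """Count how many routes traverse each link."""
    density = Counter()
    for route in routes:
        density.update(links_of(route))
    return density

def select_target_links(density, k=2):
    """Pick the k links shared by the most routes (candidate target links)."""
    return [link for link, _ in density.most_common(k)]

if __name__ == "__main__":
    # Hypothetical routes from bots to decoy servers around a target area.
    routes = [
        ["bot1", "r1", "r3", "decoy1"],
        ["bot2", "r2", "r3", "decoy1"],
        ["bot3", "r2", "r3", "decoy2"],
    ]
    density = build_link_map(routes)
    print(select_target_links(density))  # e.g. [('r3', 'decoy1'), ('r2', 'r3')]
```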
2 Monitoring and Detection Approach

2.1 Monitoring Points
Considering the described Crossfire attack execution sequence, it turns out that there are potentially four ways to detect the attack: (a) detection at the traffic flows' origin, i.e., the bot side, (b) detection at the target area, (c) detection at the target link, and (d) detection at the decoy servers. In the following, we address the advantages and disadvantages of each of the four ways to justify our choice for traffic monitoring.
• Detecting at the origin can be the fastest way to stop an attack before it is even initiated. However, the versatility and spatial distribution of the bots (the source of the attack traffic) make it the most challenging option.
• Detection at the target area is the most reasonable approach, as any target area should be equipped for self defense. However, assuming not all decoy servers are inside the target area, early detection is impossible [8].
• Detection at the target link might be the simplest form of detection, as a simple threshold-based detection system could detect the trend of the incoming traffic.
• Detection at the decoy servers can be the best approach to detect the Crossfire attack. Assuming the target area is not far from the decoy servers (3 to 4 hops [4]), detecting at the decoy servers might reduce the impact of the attack.
Our approach is based on detection at the decoy servers, because it is the only area in which the defender can detect the attack while actively responding to it. To emphasize the effectiveness of our detection approach at the decoy servers, we address the question of where the best location to probe the network is. At a high resolution, this probing can be placed either at the target link, before the target link, or after the target link. Monitoring a single link as a target link is not considered a solution for two reasons:
• Any link can be targeted for an attack. Therefore, there would have to be a one-to-one detector for every link in the network, whereas in our proposal there is only one detector but many probing points.
• Monitoring and detecting based on a single link will fail to distinguish between a link attack and a flash crowd.
The main goal is to detect the Crossfire attack without the necessity of having the target link information. To find out the best monitoring domain, we assume for now that we know the location of the target link. The question is which side of the target link provides more information for detection. Assuming the number of ports of a switch/router is limited, considering only the immediate links before or after the target link might not help to choose a side. However, farther away from the target link, the distribution of the intensity of the traffic on the links might be a function of the distribution of the end points. We will show in Sect. 4.2 that more distributed attack traffic is more difficult to detect. Depending on the budget of the adversary, the number of bots purchased for an attack can be in the range of thousands to millions. If the source of the attack traffic, i.e., the bots, is geographically spread out, the variation of the traffic volume on most of the links is very small (for many routes there might be only one or a few attack flows before they are aggregated at the target link). That leaves only a few links closer to the target link worth examining. However, the chosen decoy servers should not be very far away from the target area (if they are not inside the target area). Since there is a smaller number of destinations for the attack traffic than sources generating it, it can be assumed that the variation of the traffic volume caused by the attack traffic on the links after the target link is higher than on the links before the target link. Therefore, we suggest that monitoring links around servers or data centers results in better detection than monitoring around clients. The approach of evenly distributing the traffic over the decoy servers [4] might even support the above reasoning and make it simpler to detect some variation in the traffic volume on several links. The important element in this method is to be able to monitor the traffic at several links and send the information to a detector for decision making.
3 Emulation Setup
In order to substantiate our discussion from Sect. 4, we emulate the Crossfire attack in a realistic test bed environment. The test bed is implemented in Mininet and the following setup has been chosen for the emulation environment:
Fig. 2. A four sub-tree topology with 10 bots, 10 normal clients, and 40 decoy servers. Each group of 10 decoy servers connected to a switch is called a sub-tree.
• SDN network created in Mininet.
• SDN switches with POX controller.
• D-ITG traffic generator [9].
• Tree topology (cf. Fig. 2).
• Link bandwidth is set to 2 Mbps with 10 ms delay.
• POX controller gets link status every 5 s from switches.
• Bots generate both normal and bot traffic.
• Some background traffic from clients.
• Some background traffic at the leaf switches (to decoy servers) to level up the traffic at the edge links.
One focus in this paper is the correlation between the traffic distribution and the detectability of the Crossfire attack. We hence used a tree structure as the topology of the network. This permits us to intuitively expand the topology of the network, i.e., the tree structure, in order to widen the traffic distribution. Figure 2 illustrates the base network for our emulation, in which there exist several sub-trees, each of which includes 10 decoy servers. To investigate different traffic distributions on the network, we design three variations of this topology, as shown in Table 1. Figure 2 depicts the network topology of 4ST, which includes 4 sub-trees, 11 switches, and 40 decoy servers.

Table 1. Different variations of the network topology

     | # of sub-trees | # of switches | # of decoy servers
2ST  | 2              | 9             | 20
4ST  | 4              | 11            | 40
8ST  | 8              | 15            | 80
From a practical standpoint, Mininet with the D-ITG traffic generator imposes a limitation on the size of the network in the emulation. This is attributed to the fact that we need to reduce CPU utilization in SDN networks. Hence, the bandwidth of all links in our emulations is set to 2 Mbps to be able to saturate the target link with fewer bots and a smaller number of traffic generators. Moreover, the number of clients and bots is set to a small number of 10 each, to compensate for a larger number of decoy servers. Nevertheless, bots and clients can generate traffic at a higher rate to rectify the problem. All switches are SDN switches connected to a POX controller. We modified the POX module flow_stats.py provided on GitHub [10], which gives the controller the ability to collect port- and flow-based statistics from the switches. Using this code, the controller sends a stats request to all of the switches connected to the controller every five seconds. The response from the switches contains the number of packets in the buffer of each port and the number of flows at each link. There are 20 clients in this network, including 10 bots (connected to switches 1 and 2) and 10 normal clients (connected to switches 3 and 4). Clients can be considered as super clients which can generate traffic at a higher rate than a normal client (or bot) can. Bots generate two types of traffic: normal traffic from the beginning to the end of the experiments, and bot traffic which starts after d seconds and lasts for a duration of another d seconds. For the experiments in Sect. 7, d is set to 5 and to 30 min to have enough samples for the detector. There is a limited number of traffic types for both normal and bot traffic. Table 2 presents all traffic types used in the experiments. Background traffic (normal traffic) consists of five application traffic types, including Telnet, DNS, CSa (Counter Strike active player), VoIP, and Quake3, which D-ITG allows us to generate [9]. Both normal clients and bots use these to generate background traffic. However, since we could not specify any inter-departure time or packet size using these applications, we use simple TCP requests to generate the attack traffic. To make the two types of traffic indistinguishable, we add the same TCP traffic to the set of background traffic as well. This is a requirement of the Crossfire attack, in which the background traffic is indistinguishable from the attack traffic. Although the type of the normal and abnormal traffic should be the same, the rate of the two types of traffic can be different. In reality, the rate of the bots' traffic must be engineered by the attacker. Here we set the rate based on the remaining bandwidth of the targeted link after receiving the normal traffic. The details of the traffic types and their parameters for the setup in Fig. 2 are given in Table 2. In addition to the traffic generated by the clients and bots, there are extra traffic generators attached to some of the switches (mostly leaf switches) to increase the level of the background traffic at the links. This can be considered as traffic coming from another part of the network which is not shown in Fig. 2. Since the number of clients is much smaller than the number of servers, the extra traffic generators help to boost the level of traffic at the edge links connected to the servers.
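The statistics collection described above can be outlined as a small POX component in the spirit of the public flow_stats.py example [10]: a recurring timer requests port statistics from every connected switch, and a handler records the byte counters per port. The sketch below is our condensed illustration; exact attribute names may vary between POX versions, and it is not the exact module used in our emulation.

```python
# Condensed sketch of a POX component in the spirit of flow_stats.py [10]:
# poll every switch for port statistics every 5 seconds and hand the byte
# counters to a detector. Treat this as an outline, not the exact module
# used in the emulation.
from pox.core import core
from pox.lib.recoco import Timer
import pox.openflow.libopenflow_01 as of

log = core.getLogger()
POLL_INTERVAL = 5  # seconds, as in the emulation setup

def _request_stats():
    # Ask every connected switch for per-port counters.
    for connection in core.openflow._connections.values():
        connection.send(of.ofp_stats_request(body=of.ofp_port_stats_request()))

def _handle_portstats_received(event):
    # event.stats is a list of per-port counters for one switch (dpid).
    dpid = event.connection.dpid
    for port in event.stats:
        log.info("dpid=%s port=%s rx_bytes=%s tx_bytes=%s",
                 dpid, port.port_no, port.rx_bytes, port.tx_bytes)
        # A detector would keep these samples per link and inspect their deltas.

def launch():
    core.openflow.addListenerByName("PortStatsReceived", _handle_portstats_received)
    Timer(POLL_INTERVAL, _request_stats, recurring=True)
```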
Table 2. Parameters of the traffic

Type        | pkt size (Bytes) Min-Max | Rate (pkt/sec) Min-Max | Protocol | Duration
Telnet      | -                        | -                      | TCP      | 60 min
DNS         | -                        | -                      | TCP/UDP  | 60 min
CSa         | -                        | -                      | UDP      | 60 min
VoIP        | -                        | -                      | UDP      | 60 min
Quake3      | -                        | -                      | UDP      | 60 min
Bot         | 100-2000                 | 15-200                 | TCP      | 5 min
Background1 | 50-1000                  | 1-80                   | TCP      | 60 min
Background2 | 10-2500                  | 1-80                   | TCP      | 60 min

4 Emulation Results
As noted before, the contribution of this paper is to expose hardships of the Crossfire attack and use them for an early detection method. We specifically focus on the effect of bot traffic synchronization on the quality of the Crossfire attack, and on the effect of the distribution of the attack on its detectability, which we describe in the following two subsections. Since our focus is on the detection of the attack, we ignore the first few steps of the Crossfire attack such as link map construction, finding link persistence, or target link selection. We assume that all attack preparations have been made and the attacker is ready to attack.

4.1 Bot Traffic Synchronization
The topology we use is presented in Fig. 2. At this stage, to bring down the target link, the adversary only needs to start the bot traffic and direct it to the decoy servers. Thus, the botmaster initiates the attack by sending the attack order to the Command and Control (C&C) server or to some selected peers, depending on the structure of the botnet. Bots usually update each other via a polling or pushing mechanism. However, the question of interest is what happens if the bots receive the attack order at different times. When designing Crossfire detection mechanisms, an often ignored part of the Crossfire attack is the phase between the attack initiation and the successful impact of the attack [4]. This often ignored part of the Crossfire attack, which we call the warm-up period, is the time difference between the moment the first bot flow of the attack reaches the target link and the moment the target link is down. By definition, the attack actually happens at the end of the warm-up period, when the target links are down. Since reaching a zero-length warm-up period is hard, this period can be used for early detection, before the attack successfully takes place.
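The warm-up effect can be made concrete with a toy calculation (our illustration, with assumed numbers rather than emulation data): bots start at staggered times within a synchronization window and each contributes a low, fixed rate, so the aggregate at the target link only reaches the link capacity once enough bots have joined.

```python
# Toy illustration of the warm-up period: bots start at staggered times and
# each contributes a low, fixed rate; the target link is only saturated once
# enough bots have joined. The numbers are illustrative, not from the emulation.
import random

LINK_CAPACITY_BPS = 2_000_000      # 2 Mbps, as in the test bed links
BOT_RATE_BPS = 50_000              # low-rate benign flow per bot (assumed)
NUM_BOTS = 50
BS_SPREAD_S = 180                  # bots start within a 3-minute window

random.seed(0)
start_times = sorted(random.uniform(0, BS_SPREAD_S) for _ in range(NUM_BOTS))

def aggregate_rate(t):
    """Total bot traffic arriving at the target link at time t."""
    return sum(BOT_RATE_BPS for s in start_times if s <= t)

for t in range(0, BS_SPREAD_S + 1, 30):
    rate = aggregate_rate(t)
    status = "saturated" if rate >= LINK_CAPACITY_BPS else "warm-up"
    print(f"t={t:3d}s  rate={rate/1e6:.2f} Mbps  ({status})")
```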
In fact, for several reasons reaching a zero warm-up time is hard. One reason could be the dynamic delay of packet arrival at the target link, caused by variations in hop distances from the bots to the target link or by the delay in receiving the attack order from the adversary. Any sudden significant change in traffic volume can be detected by firewalls and IDSs. Therefore, adversaries gradually increase the attack traffic volume to avoid being detected. To achieve a perfect link failure, the volume of the traffic arriving at the target link should be slightly higher than the bandwidth of the target link itself. However, this might not happen immediately. There are three main reasons for gradual traffic growth: (1) Bot traffic can originate from any geographical location in the world, and it might arrive at the target link with different delays (dynamic delay). (2) Since the source of the attack is a botnet, it is reasonable to assume that there is some time slack between the bots in starting to send the bot traffic. This time slack can be caused by how bots receive updates from their C&C center or from other peers in an advanced P2P botnet, but also by the malware itself [11,12]. (3) Bots might gradually increase their traffic intensity to prevent detection [4]. This can be considered the main reason for the gradual increase of the attack traffic volume. To illustrate the effect of bot synchronization on the traffic volume of the target link, the result of an emulated attack is presented in Fig. 4. A two sub-tree version of Fig. 2 is used to generate the above results. At this stage, to bring down the target link, the adversary only needs to start the bot traffic to the decoy servers. That means, at this stage, we are only running the last part of the Crossfire attack. Figure 4 illustrates the utilization of the target link before and after the attack. Different curves in different colors represent different BS times for the bots to generate the attack flow. In Fig. 4 the red curve is the baseline, showing that the perfect attack happens when all the bot traffic arrives at the target link simultaneously with its maximum intensity. The time interval used for BS in the above experiments is in the range of 1 to 5 min. The reason is that in a P2P platform (the most recent platform to synchronize botnets) peers usually contact each other within a range of a few minutes [11,12]. For instance, Skype peers update only closer peers every 60 s [11]. In other studies, like [13], the time synchronization between bots is reported in the range of a few milliseconds. However, there are a few steps (a three-state machine) before they can reach that accuracy, and those states take sufficiently long (i.e., a few minutes). Therefore, we can still assume that there is enough time, in the range of a few minutes, before the real attack takes place. We are looking here at the time difference between the arrival of the first packet of each bot at the target link. The time difference between the arrivals of individual packets of any bot traffic could be in the range of milliseconds, which is not our concern here. Since we are now aware of this possible early detection, we discuss how to detect an attack which is formed by low-intensity non-malicious traffic.
The main idea is analogous to detecting the variation of the volume of the traffic at several links. Since one cannot gain any information from per-flow traffic monitoring (the attack traffic is benign traffic), and the attack type is a flooding attack, probing the volume of the traffic at several links might be effective. Although a single bot flow is very small and can be detected neither at the IDS nor at the server, the aggregation of the flows is not small anymore. All these small traffic flows must be aggregated at a certain time and place to be able to overwhelm the target link(s).
Fig. 3. Bot traffic with various starting points and traffic flow duration.
Another important parameter in generating the bot traffic is the duration of the attack. Usually, botmasters (adversaries) tend to reduce the duration of the attack to avoid being detected. The combination of the dynamic delay and the attack duration is difficult to figure out. The attack duration parameter, which is the time difference between the end of the warm-up period and the end of the attack, is named Dur in our experiments. In the case of the Crossfire attack, a rolling mechanism is introduced to keep the attack at the data plane (and evade activating the control plane, which would redirect the traffic) [4]. In the rolling scheme, a set of target links is only used for a specific period of time before the attack switches to another set of target links. The duration used in the rolling scheme is 3 min, which is the keep-alive message time interval of the BGP algorithm. This duration might be insufficient when the attack traffic builds up gradually because of any of the above-mentioned reasons. This limitation of a short attack duration forces the adversary to introduce another delay in forming the attack, which causes a larger warm-up period. The effect of the length of the attack duration (including the warm-up period) with various BS and Dur parameters is depicted in Fig. 3. For comparison purposes, the baseline is the case where the bots simultaneously generate traffic (the warm-up period is zero) for an unlimited duration of time. For the other curves, there are warm-up periods of length BS and durations of length Dur.
Figure 3 shows that, with less synchronized attack traffic, to have a successful attack either the duration of the attack should be prolonged enough to pass the warm-up period, or the adversary should delay the attack and let the warm-up period pass before initiating the attack. Although the parameters in our experiments are set to small values (a few minutes of warm-up periods)1, the results can generally be extended to longer periods. The main reason for keeping the simulation time short is the limitation of resources in our setup. Increasing the size and the time of the experiments reduces the accuracy of the traffic generator [9].
Fig. 4. The effect of bot-traffic synchronization on the warm-up period. Two Sub Trees with different warm-up periods.
4.2 Distribution
We discussed the traffic synchronization problem in forming the attack. The introduction of the warm-up period can be used for early detection of the Crossfire attack, which is the topic of the next section. In this section, we introduce a hypothesis about the link traffic intensity variation caused by the Crossfire attack and suggest using it for detection. We hypothesize that even if the Crossfire attack is successfully formed by generating very low intensity attack traffic, unavoidably there will be a sudden jump in the traffic on (backbone) links, whereby this jump will be characteristic of a Crossfire attack. The main objective of the Crossfire attack is to bring down a set of target links to affect the connectivity of a target area. Depending on the power of the attacker, the target area could be cut off completely from the Internet, or the quality of the connection to the Internet could be degraded.

1 There are some technical settings that can be used to support the selection of the small parameters: in a P2P platform (the most recent platform to synchronize botnets) peers usually contact each other within a range of a few minutes [11,12], and Skype peers update only closer peers every 60 s [11].
Fig. 5. Two Sub-Trees where jump can be seen at all levels.
Fig. 6. Four Sub-Trees and jump still visible.
Either way, to bring down the target link, its utilization must be increased to its maximum capacity. The extra unwanted traffic at the target link must go through the downstream links, affecting their utilization. For instance, the attack traffic at the target link between switch 5 and switch 7 in Fig. 2 must pass through the four subsequent links between switch 7 and the downstream switches 8, 9, 10, and 11. This sudden extra change in traffic has a huge impact on these downstream links, and the effect perhaps propagates further down to other links as well. We examine this impact through emulation and report the results for most of the links below the target link. Then, by expanding the size of the network, we try to hide this impact by distributing the jump at the target link over more links. In these experiments, we do not consider the gradual traffic intensity increase at the bots. All bots send traffic at the maximum predefined level. The warm-up period is 3 min.
The first scenario is a two sub-tree network of Fig. 2. In this scenario, the traffic at the target link can only go through two other paths. Figure 5 shows that during the attack time the target link is completely utilized, and the other two links underneath the target link are under the influence of the sudden traffic change. The green line is the traffic at the edge of the network, where switch 8 is connected to decoy server 1. The results of running similar experiments with larger networks are illustrated in Figs. 6 and 7. Comparing the link utilization in Figs. 5, 6 and 7 shows that increasing the number of branches in the network reduces the obvious jump on the downstream links. In particular, in Fig. 7 it is very difficult to distinguish the attack period only by looking at the edge link (the green line) utilization.
Fig. 7. Eight Sub-trees, where the jump at the edge links is not visible.
The main point of these experiments is to show that sensing a similar variation in traffic intensity on multiple links could be a good indication of a Crossfire attack for detection purposes. Since expanding the network reduces the jump in the link utilization, the detector must be accurate enough to detect very small variations of the traffic intensity that cannot be detected by the unaided human eye.
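As a rough illustration of such a detector (our sketch, not the classifier of Sect. 7), one can standardize each link's recent utilization samples and raise an alarm when a large fraction of links deviates upward at the same time; the window length and thresholds below are assumptions.

```python
# Rough sketch of a multi-link jump detector: standardize each link's
# utilization history and flag samples where many links deviate upward
# simultaneously. Window length and thresholds are illustrative.
import numpy as np

def simultaneous_jump_score(util, window=30):
    """util: array of shape (num_samples, num_links) with link utilization.
    Returns, per sample, the fraction of links whose current value exceeds
    the mean of the preceding window by more than 2 standard deviations."""
    samples, links = util.shape
    score = np.zeros(samples)
    for t in range(window, samples):
        hist = util[t - window:t]                 # per-link history
        mu, sigma = hist.mean(axis=0), hist.std(axis=0) + 1e-9
        z = (util[t] - mu) / sigma                # per-link z-score
        score[t] = np.mean(z > 2.0)               # fraction of links jumping
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    util = rng.normal(0.3, 0.02, size=(200, 80))  # 80 edge links, noise only
    util[120:, :] += 0.05                         # small, simultaneous increase
    score = simultaneous_jump_score(util)
    print("alarm samples:", np.where(score > 0.5)[0][:5])
```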
5 Early Detection Method

5.1 Warm-Up Phase
As mentioned, the warm-up period is the phase between the attack initiation and the successful impact of the attack [4]. Early detection means detecting the attack during this phase, when the attack traffic has reached the decoy servers but the network is still operational. We show that once the attack starts, the correlation among the links to the decoy servers gradually increases during the warm-up period, potentially providing sufficient time and data to detect the attack.
For an effective and early detection, we propose to monitor the traffic volume and intensity on several links of the network for simultaneously occurring sudden characteristic changes on some of these links. Based on the awareness of this possible early detection, we discuss how to detect an attack which is formed by low-intensity non-malicious traffic. The main idea is analogous to detecting the variation of the volume of the traffic at several links. Since one cannot gain any information from per-flow traffic monitoring (the attack traffic is benign traffic), and the attack type is a flooding attack, probing the volume of the traffic at several links might turn out to be effective. Although a single bot flow is very small and can be detected neither at the IDS (Intrusion Detection System) nor at the server, the aggregation of the flows is not small anymore. All of these small traffic flows must be aggregated at a certain time and place to be able to overwhelm the target link(s). This variation in the traffic volume at several links increases their correlation, and this is where the attack can be detected.

5.2 Experimental Results
In this section, we present experimental results to support the hypothesis of early detection of the Crossfire attack based on the correlation among the links to the decoy servers. The experiments in this section differ in that they are designed to study the correlation among the links with and without the attack. Thus, the attack traffic is not designed to overwhelm any target links. The objective of the attack traffic is to add extra scheduled traffic at all decoy servers.
6 Topology
The same tree with 8 sub-trees (80 decoy servers) and the same traffic types are used. In both experiments, there is a warm-up period of length 30 samples. The attack intensity during this period gradually increases at every time sample. This extra attack traffic increases from 300 bps to 600 bps for the first experiment (experiment-1) and from 60 bps to 150 bps for the second experiment. The normalized (l1-norm)2 link utilization of one link for both experiments is illustrated in Fig. 8. The figure shows that the attack intensity for the first experiment (green curve) is higher than for the second experiment (blue curve). The higher link utilization of the second experiment might hide the small variation of the attack traffic. The warm-up period for each experiment is highlighted with two parallel lines. Pearson-R is used to measure the correlation among the links. The correlation is computed for every possible combination of two links. Since there are 80 decoy servers in our experiments, there are $\sum_{n=1}^{79} n = 3160$ combinations of two links.

2 The l1-norm is used only for illustration purposes, to preserve the level of the link utilization in each experiment.
Fig. 8. Link utilization of one link with different attack intensity.
Fig. 9. Correlation among links when there are attacks.
Pearson-R returns a single value for two sets of data, representing how tightly (or loosely) the two sets are correlated. However, we are interested in observing how the correlation of two links evolves over the duration of the warm-up period. Therefore, Pearson-R is calculated over a window of 30 sample points (the same size as the warm-up period). To calculate the first value of Pearson-R, there are 29 sample points before the attack and one sample of the attack in the set. Then the window is moved by one sample to calculate the second value, with 2 attack samples and 28 samples before the attack. Finally, when the window reaches the end of the warm-up period, all 30 samples used in calculating Pearson-R include the attack traffic. The result of experiment-1 is reported in Fig. 9. This figure shows that the correlation constantly increases, even for links that are not correlated before the attack (sample-2, the green curve). The average correlation among all links (the average over all 3160 pairs of links) is also presented and proves the positive effect of the attack traffic in increasing the correlation among all the links. Figure 10 illustrates the result of experiment-2, when the attack intensity is reduced.
Fig. 10. Correlation among links when the attack intensity is reduced.
Fig. 11. Correlation among links when there is no attack.
reduced. This attack traffic is not strong enough to affect correlating on all links. For instance, the Pearson-R value of the two links in sample-3 curve of the Fig. 10 are changing based on the background traffic on the links (they are not influenced by the attack traffic). However, there are some combination of links that are under influence of the attack traffic. Sample-1 curve in the same figure is one such example. We observed that the effectiveness of the attack traffic on the correlation of links is a function of the intensity of the background traffic. Smaller volume of attack traffic does not effect the correlation when there is a large amount of background traffic passing the link. The results of the link correlation when there is no attack traffic involved, is reported in Fig. 11. The figure shows that in average the correlation among the links are zero. Although, there might be some positive correlation among some links (like sample-2), this is not a general trend in the network.
7 Crossfire Detection with Machine Learning
The Crossfire attack poses great challenges for security researchers and analysts, both in detection and in mitigation, as the packets streaming from the bots in the network are seemingly legitimate. While the objective of the Crossfire attack is to deplete the bandwidth of specific network links, a distinct traffic flow between a bot and a server, i.e., a "bot-to-server" flow, is usually of very low intensity and consumes limited bandwidth at each link. Thus, a single flow (or a very small number of them) at a link is hard to detect and filter. On the defender's side, Traffic Engineering (TE) is the network process that reacts to link-flooding events, regardless of their cause [14]. As an attacker, we would like to hide the variation in traffic bandwidth as much as possible from the TE module.
Fig. 12. Classification result for SVM with different distribution.
The study in this section shows that if the attacker distributes the benign traffic effectively enough, the defender faces much trouble distinguishing the attack traffic from normal traffic when detecting the attack far away from the target link. Indeed, the analysis in the previous sections has already attested to the importance of traffic distribution. Following this direction, we leverage state-of-the-art approaches in machine learning to investigate the effect of traffic distribution in concealing and detecting the Crossfire attack from available traffic data. To do so, we utilize supervised learning for the classification of network traffic into normal and abnormal traffic, i.e., attack traffic.

7.1 Learning Models
In this paper we attempt to construct a model from the large volume of data collected from the network. We utilize two supervised learning approaches: Support Vector Machine (SVM) and Random Forest (RF), as they are commonly used machine learning
Fig. 13. Classification result for RF with different distribution.
approaches which have demonstrated effective performance on different datasets and problems.
SVM: This is one of the most powerful non-probabilistic binary classifiers; it attempts to separate the two classes of data with a hyperplane in a multidimensional feature space. We utilize linear SVM due to its scalability.
RF: Inspired by ensemble learning and bootstrapping, RF leverages multiple instances of decision trees, where each tree is built based on a randomly selected portion of the training set. After computing the output of the distinct trees, the final decision is made by aggregating the outputs via a majority voting scheme. The Random Forest algorithm was chosen because the problem of Crossfire detection requires high prediction accuracy, the ability to handle diverse bots, and the ability to handle data characterized by a very large number and diverse types of descriptors.

7.2 Dataset and Feature Extraction
We utilized an emulated dataset collected based on the experiments discussed in Sect. 3. To generate the attack, we used the designed topology and collected the data from distinct switches. As aforementioned, the objective of this section is to study the subtle variations in the traffic data of the network in order to design an effective detection approach for the Crossfire traffic. Therefore, we employ the volume of traffic on different links of the network to construct the feature vectors. We evaluate the performance of the learning approaches via the area under the receiver operating characteristic curve (AUC) [15], which illustrates the true positive rate, i.e., sensitivity, as a function of the false positive rate, i.e., fall-out.
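A compact scikit-learn sketch of this classification setup is given below. Synthetic link-volume feature vectors stand in for the collected data, and the classifier parameters are illustrative defaults rather than the tuned models evaluated in the following subsection.

```python
# Compact sketch of the classification setup: link-volume feature vectors,
# linear SVM and Random Forest classifiers, evaluated with ROC AUC.
# Synthetic data stands in for the emulated link volumes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_links = 40                                              # e.g. the 4ST setting
X_normal = rng.normal(0.30, 0.05, size=(500, n_links))    # normal link volumes
X_attack = rng.normal(0.33, 0.05, size=(500, n_links))    # slightly higher volumes
X = np.vstack([X_normal, X_attack])
y = np.concatenate([np.zeros(500), np.ones(500)])         # 0 = normal, 1 = attack
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

svm = LinearSVC(C=1.0, max_iter=5000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("SVM AUC:", roc_auc_score(y_te, svm.decision_function(X_te)))
print("RF  AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```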
7.3 Experimental Results
In this section, we design and analyze experiments to answer the following questions: (1) What is the impact of the distribution of bot-to-server traffic on the performance of the classification algorithms? (2) What is the impact of the extracted features on the performance of the classification algorithm? (3) What is the impact of the levels of the links (in a tree structure) used for feature extraction?
Fig. 14. The effect of number of features on attack detection for 4-sub-tree.
(1) The Effect of Traffic Distribution
To examine the impact of traffic distribution on the detection of the attack, we conducted experiments in the three different topologies designed in Sect. 3: 2ST, 4ST, and 8ST, where the distribution of traffic increases as the number of sub-trees in the topology of the network increases. Figure 12 shows the classification results of SVM in the different settings. As can be seen, the effectiveness of classification in 8ST is significantly lower than in 2ST and 4ST, which is attributed to the fact that the former setting utilizes lower flow volumes than the alternative settings during the attack scenarios. This low amount of traffic, compared with the normal states of the network, conceals the attack from the eyes of the detection approach. Further, the AUC for 2ST and 4ST is neck and neck, with a small improvement in 4ST. This is attributed to the fact that it benefits from more features as compared to 2ST, i.e., 40 features against 20 features. Figure 13 depicts the classification results of the RF model for the different traffic distributions. As can be seen from the figure, RF demonstrated behavior similar to that of SVM.
(2) The Effect of Features
Prior studies in data mining have demonstrated that the performance of classification models highly depends on the selected features with regard to the classes. Further, a huge amount of data needs to be continuously processed, as the network is a streaming and dynamic environment per se, which signifies the importance of feature selection to reduce computational complexity. We hence vary the number of extracted features and evaluate the performance of the classification algorithm in terms of AUC. Figures 14 and 15 demonstrate the performance of classification for 4ST and 8ST, respectively. From Fig. 14, we can see that the classification performance first shows a positive correlation with the number of features and then saturates after an optimal value, i.e., 30 features. This is an interesting result, verifying that with too small a feature dimension we would fail to achieve the optimal performance. However, with only a limited number of features, we can achieve reasonable performance. This is important, as in a network environment we may have access to only a limited number of links for feature extraction. Alternatively, Fig. 15 depicts the classification performance for the 8ST setting. In contrast to 4ST, the classification performance in the 8ST setting is much lower. This indicates the importance of the distribution of the Crossfire attack, where, with enough distribution of the attack, standard machine learning approaches would fail to distinguish Crossfire attack traffic from background traffic.
Fig. 15. The effect of number of features on attack detection for 8-sub-tree.
(3) The Effect of Network Visibility
Looking at it from the network aspect, an important factor for attack detection is the level of information we can gather about the traffic data of the network. To examine how features from different levels of the network affect the performance of traffic classification, we added the volume of one link from the upper level to the feature vector. More specifically, in the 8ST setting, we have the volumes of the 80 decoy-server links as features for the baseline. We also add the volume of a random link from one level up to construct 81-dimension and 21-dimension feature vectors. Figures 16 and 17 demonstrate
the performance of classification of traffic data in the 8ST setting for SVM and RF. Adding only one feature from the upper level, even if there are fewer features from the lower level, improves the performance significantly, which highlights the importance of extracting features from different parts of the network.
Fig. 16. The effect of higher level features for SVM.
Fig. 17. The effect of higher level features for RF.
8 Conclusion
The Crossfire attack is considered to be one of the most difficult target-area link-flooding attacks to detect. The attack uses a massively distributed large-scale botnet to generate multiple low-rate benign traffic flows aiming to congest selected network links, with the ultimate goal of disconnecting the target area from the Internet. Although the Crossfire attack is a tremendous threat to
any network, by analyzing the obtained data we show that the adversary also faces substantial obstacles in successfully executing the attack. As a result, this paper exposes detection vulnerabilities of the Crossfire attack by showing a correlation between the coordination of the botnet traffic and the quality of the attack, and a correlation between the attack distribution and the detectability of the attack. We also show that due to bot synchronization there is a warm-up period after the attack is launched and before the target links are overwhelmed. Our results show that this period can be used for early attack detection. In this paper a prototypical Crossfire attack detector is described, which exploits these vulnerabilities. For this, we utilize two supervised machine learning approaches, Support Vector Machine (SVM) and Random Forest (RF), to classify network traffic into normal and abnormal (i.e., attack) traffic. In particular, to show the feasibility of detection, we report on the trained scenarios using the link volume as the main feature set. Finally, results of the attack detector are reported along with some future directions to improve the detector.
Acknowledgment. This work is partially funded by the joint research programme UL/SnT-ILNAS on Digital Trust for Smart-ICT.
References
1. Xue, L., Luo, X., Chan, E.W., Zhan, X.: Towards detecting target link flooding attack. In: LISA, pp. 81–96 (2014)
2. Gkounis, D., Kotronis, V., Liaskos, C., Dimitropoulos, X.A.: On the interplay of link-flooding attacks and traffic engineering. Comput. Commun. Rev. 46, 5–11 (2016)
3. Gkounis, D., Kotronis, V., Dimitropoulos, X.: Towards defeating the crossfire attack using SDN. arXiv preprint arXiv:1412.2013 (2014)
4. Kang, M.S., Lee, S.B., Gligor, V.D.: The crossfire attack. In: 2013 IEEE Symposium on Security and Privacy (SP), pp. 127–141, May 2013
5. Zargar, S.T., Joshi, J., Tipper, D.: A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks. IEEE Commun. Surv. Tutor. 15(4), 2046–2069 (2013)
6. Ramazani, S., Kanno, J., Selmic, R.R., Brust, M.R.: Topological and combinatorial coverage hole detection in coordinate-free wireless sensor networks. Int. J. Sens. Netw. 21(1) (2016)
7. Brust, M.R., Turgut, D., Ribeiro, C.H., Kaiser, M.: Is the clustering coefficient a measure for fault tolerance in wireless sensor networks? In: IEEE International Conference on Communications (ICC) (2012)
8. Xue, L., Luo, X., Chan, E.W.W., Zhan, X.: Towards detecting target link flooding attack. In: 28th Large Installation System Administration Conference (LISA14), Seattle, WA, pp. 90–105 (2014)
9. Botta, A., Dainotti, A., Pescapè, A.: A tool for the generation of realistic network workload for emerging networking scenarios. Comput. Netw. 56(15), 3531–3547 (2012)
10. Yu, W.: Pox flow statistics (2012). https://github.com/hip2b2/poxstuff
11. Wu, C.-C., Chen, K.-T., Chang, Y.-C., Lei, C.-L.: Peer-to-peer application recognition based on signaling activity. In: Proceedings of the 2009 IEEE International Conference on Communications, ICC 2009, pp. 2174–2178. IEEE Press, Piscataway (2009). http://dl.acm.org/citation.cfm?id=1817271.1817676
12. Wu, C.-C., Chen, K.-T., Chang, Y.-C., Lei, C.-L.: Detecting peer-to-peer activity by signaling packet counting (2008)
13. Ke, Y.-M., Chen, C.-W., Hsiao, H.-C., Perrig, A., Sekar, V.: CICADAS: congesting the internet with coordinated and decentralized pulsating attacks. In: Proceedings of the ACM Asia Conference on Computer and Communications Security, pp. 699–710. ACM, New York (2016)
14. Liaskos, C., Kotronis, V., Dimitropoulos, X.: A novel framework for modeling and mitigating distributed link flooding attacks. In: IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pp. 1–9. IEEE (2016)
15. Powers, D.M.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation (2011)
Intrusion Detection System Based on a Deterministic Finite Automaton for Smart Grid Systems
Nadia Boumkheld and Mohammed El Koutbi
University Mohamed V Rabat, ENSIAS, Rabat, Morocco
[email protected],
[email protected]
Abstract. The smart grid system is a target of many types of network attacks, such as Denial Of Service (DOS), Man In The Middle (MITM), ... that could compromise users' privacy and the network's integrity and availability. For this reason we developed a network-based intrusion detection system (IDS) that relies on a deterministic finite automaton (DFA) to recognize an attacks' language that represents some of the attacks that can target the smart grid network. The attacks' language is defined using elements or symbols that help identify each type of attack. Results of our simulations show the efficiency of our IDS and its ability to detect dangerous cyber attacks.
Keywords: Intrusion detection system · Smart grid security · Deterministic finite automaton
1 Introduction and Related Work
Information and communication technologies (ICT) are a main component of the smart grid system. They make it possible to build an intelligent, efficient and reliable electrical grid through information exchange, distributed generation, active participation of customers, and so on. However, these technologies introduce vulnerabilities inside the grid, such as blackouts, disruption of power supplies, and malicious attacks that can compromise users' privacy and steal information about energy consumption. Moreover, the smart grid is a system of heterogeneous interconnected systems (transmission, distribution, etc.), each one having its own set of communication equipment, intelligent devices and processing applications [1], which makes it a target of many cyber attacks and makes its protection and security a hard task to achieve. Intrusion detection systems are applications that monitor a network or a system for suspicious activities and send alerts or reports to the administrator in case a malicious activity is detected [2]. There are generally two approaches to intrusion detection: anomaly-based intrusion detection and misuse/signature-based intrusion detection. The first one (anomaly detection) defines the normal behavior inside the system, and any deviation from this normal behavior is
considered an anomaly and indicates in most cases an intrusion. The second one, misuse detection, models the abnormal behavior, and any occurrence of this behavior is identified as an intrusion. Many methods have been used for intrusion detection; for example, for anomaly detection we have [3]:
• Statistical models, such as: (1) threshold measures, which apply set or heuristic limits to event occurrences over an interval, for example disabling user accounts after a set number of failed login attempts; (2) the multivariate model, which calculates the correlation between multiple event measures relative to profile expectations; (3) clustering analysis, which represents event streams in a vector representation, examples of which are then grouped into classes of behaviors (normal or anomalous) using clustering algorithms.
• Protocol verification, which relies on the use of unusual or malformed protocol fields to detect attacks.
And for misuse detection there are many techniques as well; here are some of them:
• Expression matching: searches for occurrences of specific patterns in an event stream (log entries, network traffic, etc.) [3].
• State transition analysis: models attacks as a network of states and transitions (matching events). Every observed event is applied to finite state machine instances (each representing an attack scenario), possibly causing transitions. Any machine that reaches its final (acceptance) state indicates an attack [4].
Finite state machines or automata have generally been used for intrusion detection, such as in [5], where an approach is proposed for the real-time detection of denial of service computer attacks via the use of time-dependent DFA. Time-dependent means that the automaton considers the time intervals between inputs in recognizing members of a language, which improves the accuracy of detecting specific denial of service attacks. Another intrusion detection system, in [6], is based on constructing finite automata from sequences of system calls using genetic algorithms. System calls are generated during the execution of each privileged program, and the finite state automaton accepts the sequences generated by the normal execution of the program and rejects the sequences generated by the execution of the program while the system is being intruded. This approach is shown to be very effective in detecting intrusions. To our knowledge, DFA has not been used for the detection of cyber attacks in a smart grid network, so in this paper we use DFA to construct a network intrusion detection system dedicated to the detection of some known cyber attacks inside a smart grid system. The remainder of this paper is organized as follows. In Sect. 2 we give a definition of deterministic finite automata, in Sect. 3 we explain our intrusion detection system mechanism, Sect. 4 presents the results of the intrusion detection, and finally we end the paper with a conclusion.
2 Deterministic Finite Automata
A deterministic finite automaton represents a finite state machine that recognizes a regular expression or a regular language. A DFA consists of the following elements:
• A finite set of states Q
• A finite set of symbols Σ
• A transition function δ that takes a state and a symbol as arguments and returns a state (Q × Σ to Q)
• An initial state q0
• The set of accepting states F (F ⊆ Q), which is used to distinguish sequences of inputs given to the automaton. If the finite automaton is in an accepting state when the input ends, the sequence of input symbols is accepted; otherwise it is not accepted.
A DFA is a five-tuple (Q, δ, Σ, q0, F). For example, the following DFA in Fig. 1 recognizes only the symbol a over the alphabet {a, b}:
Fig. 1. Example of a DFA that recognizes symbol a.
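The five-tuple above can be written down directly. The following is a minimal sketch in Python (for illustration only, not part of the paper's implementation): it encodes the DFA of Fig. 1, which accepts exactly the single-symbol string "a" over {a, b}; the state names are assumptions.

Q = {"q0", "q1", "dead"}
Sigma = {"a", "b"}
delta = {("q0", "a"): "q1", ("q0", "b"): "dead",
         ("q1", "a"): "dead", ("q1", "b"): "dead",
         ("dead", "a"): "dead", ("dead", "b"): "dead"}
q0 = "q0"
F = {"q1"}

def accepts(word):
    state = q0
    for symbol in word:          # follow the transition function for each input symbol
        state = delta[(state, symbol)]
    return state in F            # accepted only if the run ends in an accepting state

print(accepts("a"))    # True
print(accepts("ab"))   # False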
3 Intrusion Detection Process
3.1 Network Architecture
Our smart grid network consists of smart meters connected to the utility head end through an IP based network. Security inside the grid is of paramount importance: attackers could intercept the communications and forge incorrect information about customers' energy consumption, or they could sniff meter readings, follow users' consumption patterns and determine whether they are in their houses or not, which threatens consumers' privacy inside the grid. A hacker can also disable the server and block or delay the communications between smart grid entities, which makes the smart grid unreliable and unavailable. To address these problems we propose an IDS that we place at the utility level in order to analyze the traffic inside the network and detect malicious activities and the parties responsible for them. Figure 2 represents our network architecture.
Fig. 2. The smart grid network architecture.
3.2 Network Attacks
Our IDS's function is to detect different attacks that can target the smart grid's network devices. These attacks are:
(1) The Address Resolution Protocol (ARP) spoofing attack: a type of network attack where an attacker sends falsified ARP messages with the aim of diverting and intercepting network traffic. In normal ARP operation, when a network device broadcasts an ARP request to find the MAC address corresponding to an IP address, the legitimate device with that IP address sends an ARP reply, which is then cached by the requesting device in its ARP table [7]. An attacker can reply to the ARP request by linking his MAC address to the IP address of the legitimate device; he then takes the role of a man in the middle and starts receiving any data intended for that legitimate IP address. ARP spoofing allows malicious attackers to intercept, modify or even stop data which is in transit. Figure 3 illustrates the ARP spoofing attack.
(2) The ping flood attack: a denial of service (DOS) attack in which the attacker attempts to overwhelm a targeted device by sending large and continuous Internet Control Message Protocol (ICMP) request packets, causing the target to become inaccessible to normal traffic, because an ICMP request requires some server resources to process each request and send a response. If many devices target the same device with ICMP requests, the attack traffic is increased and can potentially result in a disruption of normal network activity [8].
(3) The TCP SYN flooding attack: a type of distributed denial of service (DDOS) attack that exploits part of the normal TCP three-way handshake to consume resources on the targeted server. In a normal TCP three-way handshake between a client and a server, we have [9]:
– The client requests a connection by sending a Synchronize (SYN) message to the server.
– The server acknowledges by sending a synchronize-acknowledge (SYNACK) message back to the client.
– The client responds with an Acknowledgment (ACK) message and the connection is established.
In a SYN flooding attack the attacker floods multiple TCP ports on the targeted server with SYN messages, often using a fake IP address. The server responds with a SYN-ACK to every SYN message and temporarily opens a communications port for each attempted connection while it waits for a final ACK message from the source. The attacker does not send the final ACK, so the connection remains incomplete, and before it times out other SYN packets arrive, which leaves a high number of connections half open; eventually, as a result, the service to legitimate clients is denied, and the server may even malfunction or crash. This attack is depicted in Fig. 4.
Fig. 3. The ARP spoofing attack.
Fig. 4. The TCP SYN flooding attack.
3.3 IDS System
Our IDS is placed at the utility level and is capable of sniffing the traffic that takes place between the smart meters and the utility data center. Each meter uses a gateway (to send and receive data) that is identified with a MAC and IP address, as we rely on IP-based communications. So the IDS visualizes the exchange
of packets between network entities and can filter the packets based on the different TCP/IP protocols such as ARP, ICMP and TCP. Based on those filters or protocol communications, the IDS is capable of extracting signs or symbols that help identify each of the aforementioned attacks; an example of these symbols is presented in Table 1. The IDS then defines tokens associated with those symbols, and using these tokens the IDS forms a language that we call the attacks' language, which is recognized by the deterministic finite automaton. In other words, if an expression corresponding to an attack belongs to the attacks' language, the expression is accepted by the automaton, indicating the occurrence of a specific attack in the network. The IDS system is illustrated in Fig. 5.

Table 1. Example of symbols used by the IDS to identify the attacks
Attacks               Symbols
SYN flooding attack   [SYN], [SYN,ACK]...
ARP spoof attack      arp reply, MAC address...
Ping flood attack     Data length...
Fig. 5. The intrusion detection system.
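To make the symbol-extraction idea of Table 1 concrete, the following is a hypothetical Python illustration (not the paper's Flex-based implementation): sniffed packets are reduced to TCP-flag symbols, and a simple finite-state check over the symbol stream flags a SYN flood when too many half-open handshakes accumulate. The threshold, field names and packet representation are assumptions.

def to_symbol(pkt):
    flags = pkt.get("tcp_flags", set())
    if flags == {"SYN"}:
        return "SYN"
    if flags == {"SYN", "ACK"}:
        return "SYNACK"
    if flags == {"ACK"}:
        return "ACK"
    return None

def syn_flood_detected(packets, threshold=100):
    half_open = 0
    for pkt in packets:
        sym = to_symbol(pkt)
        if sym == "SYN":
            half_open += 1
        elif sym == "ACK" and half_open > 0:
            half_open -= 1          # handshake completed
        if half_open >= threshold:
            return True             # "accepting state": attack expression recognized
    return False

stream = [{"tcp_flags": {"SYN"}}] * 150
print(syn_flood_detected(stream))   # True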
4 Intrusion Detection Results and Interpretation
In order to develop our IDS we worked with several tools. First of all, to visualize the network traffic we used Wireshark, an open source packet analyzer used for network troubleshooting, analysis, software and communications protocol development, and education [10]. Wireshark, formerly known as Ethereal, is multi-platform: it runs on Windows, Linux, OS X, Solaris, FreeBSD, NetBSD, and many others. It can be used to examine the details of traffic at a variety of levels, ranging from connection-level information to the bits that make up a single packet. Packet capture with Wireshark can provide a network administrator with information about individual packets such as transmission time, source,
destination, protocol type and header data, and this information can be useful for evaluating security events and troubleshooting network security device issues [11]. Second, to build our DFA and recognize the different attacks targeting the smart grid network, we use the Fast LEXical analyzer generator (Flex), a tool for generating lexical analyzers (scanners) that perform character parsing and tokenizing via the use of a deterministic finite automaton. To test and evaluate the IDS we used the DARPA intrusion detection evaluation data sets that we downloaded from the Massachusetts Institute of Technology (MIT) Lincoln Laboratory. The testing data contains network based attacks in the midst of normal background data. We should note that positive (P) refers to an attacker, and negative (N) is a legitimate device; true positive (TP) is the number of attackers that were correctly identified; false negative (FN) is the number of attackers that were identified as legitimate; false positive (FP) is the number of legitimate users detected as attackers; and true negative (TN) is the number of legitimate users correctly identified as legitimate. We have:

Detection rate = TP / (TP + FN)   (1)
False alarm = FP / (FP + TN)   (2)
Accuracy = (TP + TN) / (P + N)   (3)
Precision = TP / (TP + FP)   (4)
Specificity = TN / (FP + TN)   (5)
False negative rate = FN / (FN + TP)   (6)
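A direct transcription of Eqs. (1)-(6) into Python can be used to check the values reported in Tables 2-4. The confusion-matrix counts in the example call below are hypothetical: they were chosen only so the output matches the ARP-spoofing row of Table 2 (TPR = 1, false alarm ≈ 0.285, precision ≈ 0.11); they are not taken from the paper.

def metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn
    return {
        "detection_rate": tp / (tp + fn),          # (1)
        "false_alarm": fp / (fp + tn),             # (2)
        "accuracy": (tp + tn) / (p + n),           # (3)
        "precision": tp / (tp + fp),               # (4)
        "specificity": tn / (fp + tn),             # (5)
        "false_negative_rate": fn / (fn + tp),     # (6)
    }

print(metrics(tp=2, fn=0, fp=16, tn=40))   # hypothetical counts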
(1) ARP spoofing attack: The results of this attack detection are presented in Table 2.

Table 2. The IDS performance for the detection of ARP spoofing attack
Performance indicator                        Corresponding value
Detection rate or true positive rate (TPR)   1
False alarm                                  0.285
Accuracy                                     0.724
Precision                                    0.11
Specificity or true negative rate (TNR)      0.714
False negative rate                          0
(2) SYN flooding attack: Table 3 presents the detection results for this attack.

Table 3. The IDS performance for the detection of SYN flooding attack
Performance indicator                        Corresponding value
Detection rate or true positive rate (TPR)   1
False alarm                                  0
Accuracy                                     1
Precision                                    1
Specificity or true negative rate (TNR)      1
False negative rate                          0
(3) Ping flooding attack: For this attack we relied only on local machines to do the test because it was not included in the DARPA test sets or any other available data set. Detection results are given in Table 4.

Table 4. The IDS performance for the detection of ping flooding attack
Performance indicator                        Corresponding value
Detection rate or true positive rate (TPR)   1
False alarm                                  0
Accuracy                                     1
Precision                                    1
Specificity or true negative rate (TNR)      1
False negative rate                          0
The results indicate good performance of our IDS: in fact, there are no false positives or false negatives for the SYN flooding attack and the ping flooding attack, and only a small number of false positives, and therefore few false alarms, for the ARP spoofing attack.
5 Conclusion
Our IDS system uses a DFA for the detection of cyber attacks inside the smart grid network. Our tests prove the efficiency of our IDS and its ability to detect all the attacks. Our work could be extended by generating a large number of attacks in the network and constructing an automaton that is capable of detecting all of them.
References
1. European Network and Information Security Agency (ENISA): Smart grid security (2012)
2. https://en.wikipedia.org/wiki/Intrusion_detection_system
3. Verwoerd, T., Hunt, R.: Intrusion detection techniques and approaches. Comput. Commun. J. 25(15), 1356–1365 (2002)
4. Karthikeyan, K.R., Indra, A.: Intrusion detection tools and techniques - a survey. Int. J. Comput. Theory Eng. 2(6), 901 (2010)
5. Branch, J., Bivens, A., Chan, C.Y., Lee, T.K., Szymanski, B.K.: Denial of service intrusion detection using time dependent deterministic finite automata. In: Proceedings of the Research Conference, Troy, NY, October 2002 (2002)
6. Wee, K., Kim, S.: Construction of finite automata for intrusion detection from system call sequences by genetic algorithms. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore (2006)
7. http://www.omnisecu.com/ccna-security/arp-spoofing-attack.php
8. https://www.cloudflare.com/learning/ddos/ping-icmp-flood-ddos-attack/
9. https://www.incapsula.com/ddos/attack-glossary/syn-flood.html
10. https://en.wikipedia.org/wiki/Wireshark
11. http://whatis.techtarget.com/definition/Wireshark
The Detection of Fraud Activities on the Stock Market Through Forward Analysis Methodology of Financial Discussion Boards
Pei Shyuan Lee, Majdi Owda, and Keeley Crockett
School of Computing, Mathematics and Digital Technology, The Manchester Metropolitan University, Chester Street, Manchester M1 5GD, UK
{Pei-Shyuan.Lee,M.Owda,K.Crockett}@mmu.ac.uk
Abstract. Financial discussion boards (FDBs) or financial forums on the Internet allow investors and traders to interact with each other in the form of posted comments. Such FDBs allow investors and traders to exchange financial knowledge. Unfortunately, not all posted content on FDBs is truthful. While there are genuine investors and traders on FDBs, deceivers make use of such publicly accessible share price based FDBs to carry out financial crimes by tricking novice investors into buying the fraudulently promoted stocks. Generally, Internet forums rely on default spam filtering tools like Akismet. However, Akismet does not moderate the meaning of posted content. Such moderation relies on continuous manual tasks performed by human moderators, but it is expensive and time consuming to perform. Furthermore, no relevant authorities are actively monitoring and handling potential financial crimes on FDBs due to the lack of moderation tools. This paper introduces a novel methodology, namely forward analysis, employed in an Information Extraction (IE) system, namely FDBs Miner (FDBM). This methodology aims to highlight potentially irregular activities on FDBs by taking both comments and share prices into account. The IE prototype system will first extract the public comments and per-minute share prices from FDBs for the selected listed companies on the London Stock Exchange (LSE). Then, in the forward analysis process, the comments are flagged using a predefined Pump and Dump financial crime related keyword template. By only flagging the comments against the keyword template, results indicate that 9.82% of the comments are highlighted as potentially irregular. Realistically, it is difficult to continuously read and moderate the massive amount of daily posted comments on FDBs. The incorporation of the share price movements can help to categorise the flagged comments into different price hike thresholds. This allows related investigators to investigate the flagged comments based on priorities depending on the risk levels, as it can possibly reveal real Pump and Dump crimes on FDBs.
Keywords: Financial Discussion Boards · Fraud detection · Crime prevention · Information Extraction · Financial crimes · Pump and Dump
1 Introduction
The Internet has become a standard place for information and communication over the years. Topical online forums are part of the Internet, and they allow like-minded people to communicate with each other through posted content. Share price based Financial Discussion Boards (FDBs) give investors and traders the opportunity to exchange financial knowledge and opinions. In the UK, such share price based FDBs include the London South East [1], Interactive Investor [2] and ADVFN [3]. Although topical online forums are meant to be used for information and knowledge exchange, not all the information is truthful or accurate. Spam filtering tools such as Akismet [4] are integrated into many online forums. However, such tools filter only spam registrations and spam messages; they do not moderate the meaning of posted content. Likewise, human moderators usually moderate only content such as spam, advertisements, foul language and so on. Little to no action is taken by human moderators and external authorities to keep an eye on FDBs for activities revealing Pump and Dump (P&D) crime. Such manual moderation of FDB content requires massive effort and time, which turns out to be impracticable in the long run. P&D happens when deceivers on share price based FDBs fraudulently spread false information about a stock in a positive manner after accumulating the stock at a much lower price. This usually tricks novice investors and traders into buying the fraudulently promoted stocks. The deceivers then sell their stocks at the highest point of the price, leaving the P&D victims losing money. Textual comments such as "Told you, this is a hot stock, so buy now!" can expose a hidden potential of irregular P&D activity on share price based FDBs. Existing research found that FDB comments were manipulative and positively related to stocks' trading volumes, market returns and volatility [5–9]. However, there have been very few attempts [10, 11] to create moderation tools to monitor posted content and detect potentially irregular activities on FDBs. Share price based FDBs contain semantically understandable artefacts (i.e. FDB artefacts that can be processed by computers) such as stock ticker names, date and time, price figures, comments and usernames of comment authors. In this research, Information Extraction (IE) techniques are used for the extraction of these artefact data. IE is the process of extracting information automatically from an unstructured or semi-structured data source into a structured data format [12]. IE has also been used in other areas such as search engines [13] and accounting [14]. However, IE techniques in relation to FDB financial crimes are scarcely researched except for the initial work presented in [11, 15]. In [15], share prices were not taken into account during the detection of potentially irregular activities on share price based FDBs. The novel methodology, namely forward analysis, presented in this paper detects and flags potentially illegal posted content with the incorporation of share prices during the detection process. This forward analysis methodology is applied in an IE prototype system, namely FDBs Miner (FDBM). FDBM analyses all the comments against a predefined P&D IE keyword template. Once the potentially illegal comments are flagged, price figures which share the same or closest date and time based on the same ticker symbols are then matched and
appended to the flagged comments. Next, the forward analysis methodology takes each flagged comment's price as a "primary price" and calculates ±2 days' worth of prices to determine whether there is any price hike of 5%, 10% or 15% when compared to the "primary price". Lastly, it appends the price hike threshold labels to these flagged comments. The main contribution of this paper is to introduce a novel methodology that flags potentially illegal comments and categorises these comments based on the level of risk. This can greatly benefit the related investigators when investigating the potentially illegal comments according to risk levels. Section 2 reviews examples of past financial crimes on share price based FDBs. Section 3 introduces Information Extraction (IE) and its usage in FDBM. Section 4 presents an architecture overview of FDBM. This is followed by Sect. 5, which describes the novel forward analysis methodology, and the experimental results in Sect. 6. Lastly, Sect. 7 concludes the research findings.
2 Pump and Dump (P&D) Crimes on FDBs
P&D crimes are normally committed through different channels such as online discussion boards, word of mouth, social media, emails and so on. The following are a few examples of popular share price based P&D financial crimes:
• In 2000, Jonathan Lebed, who was only 15 years old at the time, became the very first minor to perform a P&D scam [16]. Lebed made a profit of US$800,000 over six months by ramping up share prices using the Yahoo! Finance Message Board. Lebed was charged by the US Securities & Exchange Commission (SEC) after that [16–18].
• In 2000, two fraudsters ramped up a share price by as much as 10,000% through posted comments on the Raging Bull FDB. They made at least US$5 million after dumping millions of shares [17].
• Eight fraudsters carried out P&D crimes throughout 2006 and 2007 through a popular penny stock FDB, namely InvestorsHub (now owned by ADVFN [3]). They were then charged by the SEC in 2009 for being involved in penny stock manipulation [19].
The FDB P&D crime examples above demonstrate that there is a necessity to create FDB moderation methods and tools to detect potentially irregular content on share price based FDBs in real time.
3 Information Extraction (IE)
IE is the process of extracting information such as text from unstructured or semi-structured data sources into a structured data format [12]. Soderland [20] suggested that there is a need for systems that extract information automatically from text data. IE systems are knowledge-intensive [20], as these systems extract only snippets of information that fit predefined templates (fixed formats) representing useful and relevant information about the domain, which are then displayed to the end users of a system [21].
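As a rough illustration of the IE idea, the following Python sketch pulls FDB artefacts (ticker, timestamp, price, author, comment) out of a semi-structured text line into a structured record. The line format, delimiter and field names are assumptions made for this example; they are not the FDBM implementation.

import re

PATTERN = re.compile(
    r"(?P<ticker>[A-Z]{2,5})\|(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2})\|"
    r"(?P<price>\d+\.\d+)\|(?P<author>\w+)\|(?P<comment>.+)"
)

def extract(line):
    m = PATTERN.match(line)
    return m.groupdict() if m else None   # structured record or None if it does not fit

print(extract("ABC|2014-09-23 09:31|12.50|trader99|Told you, this is a hot stock, so buy now!"))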
IE is used in this research to automatically extract information from an unstructured or semi-structured data source (such as FDB comments and share prices) into a structured data format (i.e. FDBs dataset). The IE prototype system in this research can display a summary of information from several interlinked sources (i.e. FDB comments and share prices) allowing filtering of potentially illegal comments to take place.
4 Architecture Overview of FDBs Miner (FDBM)
This section introduces the architecture of the FDBM prototype system. The architecture consists of five key modules: the data crawler, the data transformer, the FDB dataset (FDB-DS), the IE templates and the forward analyser. In general, FDBM first collects data, then transforms unstructured and semi-structured data into fully structured data, which is kept in the FDB-DS. The IE templates are used by the forward analyser. The novel methodology introduced in this paper is made functional in the forward analyser module. Figure 1 shows a diagram of the architecture overview for the FDBM system. Each module in the architecture is explained in the following sections.
Fig. 1. Architecture overview diagram
4.1 Data Crawler
The data crawler module crawls the unstructured data from all three FDBs (i.e. LSE [1], III [2] and ADVFN [3]) during different times for 12 weeks (23rd September 2014 to 22nd December 2014). A total of 941 ticker symbols (i.e. unique abbreviations of
companies listed on the stock market), 507,970 FDB comments and 28,980,465 price figures were collected.
4.2 Data Transformer
The data transformer module extracts and transforms the unstructured and semi-structured data collected by the data crawler module, in numerous formats such as HTML, HTM, JSP, ASP, CSV and XML, into structured form.
4.3 FDB Dataset (FDB-DS)
Once the unstructured and semi-structured data are extracted, the structured data are stored in the FDB dataset (FDB-DS) accordingly. FDB-DS is used to store extra data produced throughout the research experiments.
4.4 IE Keyword Template
The Pump and Dump (P&D) IE template is formed and saved locally in the FDBM system. The keyword template can be modified whenever required. The IE templates contain a group of keywords and short sentences that were built based on thorough research [22–25] and have been validated by experts. The P&D IE keyword template is used to match against the FDB comments in the forward analysis process.
4.5 Forward Analyser
The forward analyser matches the P&D IE templates against the FDB comments to highlight potentially illegal FDB comments. Then the share prices are integrated into the forward analysis by matching prices to the flagged comments, conducting a series of calculations, and labelling price hike thresholds. This novel forward analysis methodology is further described in Sect. 5.
5 Forward Analysis Methodology
This section presents a novel forward analysis approach that flags and filters potentially illegal P&D FDB comments. Share prices are considered in the methodology in order to group the flagged comments into different risk levels. This permits interested parties to look into the flagged comments in a timely manner with less wasted effort. As depicted in Fig. 1 (architecture overview diagram) above, the forward analyser module comprises several functions (i.e. comments flagging, price matching and price hike threshold labelling). These functions run as part of the forward analysis methodology and are further described in the following sections.
5.1 Comments Flagging
In the initial step, a series of keywords and short sentences located in the Pump and Dump (P&D) IE keyword template are matched against all 507,970 comments stored in the FDB dataset (FDB-DS). The list of potentially illegal FDB comments flagged through the forward analysis process is imported into FDB-DS as a new database table named 'flaggedcomment'.
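A minimal sketch of this flagging step follows, written in Python for illustration only. The template entries shown are invented examples, not the validated P&D keyword template used by FDBM.

template = ["hot stock", "buy now", "going to rocket", "get in before"]

def flag_comments(comments, template):
    flagged = []
    for c in comments:
        text = c["text"].lower()
        if any(keyword in text for keyword in template):   # any template match flags the comment
            flagged.append(c)
    return flagged

comments = [{"id": 1, "ticker": "ABC", "text": "Told you, this is a hot stock, so buy now!"},
            {"id": 2, "ticker": "ABC", "text": "Results due next month."}]
print([c["id"] for c in flag_comments(comments, template)])   # [1]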
5.2 Prices and Comments Flagging
After the database table 'flaggedcomment' is populated in the initial step, the forward analyser locates and attaches a price to each flagged comment by associating the same ticker symbol and the exact or nearest date and time. This step is performed to ensure each flagged FDB comment has a "primary price". This "primary price" also represents the price at the time the comment was posted. The "primary price" is used for threshold labelling in the following step.
5.3 Price Hike Thresholds Labelling
Once all the "primary prices" are set for each flagged comment in the previous step, the forward analyser labels each flagged comment with a price hike threshold. The forward analyser checks all the per-minute prices within ± two days against the "primary price" of each flagged comment to determine whether any price hike threshold is exceeded. The price hike threshold labelling rules are as follows (a short sketch of this logic is given after the list):
• Label a flagged comment as "R" (Red) if any of the ± two days price figures calculated against the "primary price" shows a price hike of 15%.
• Label a flagged comment as "A" (Amber) if any of the ± two days price figures calculated against the "primary price" shows a price hike of 10%.
• Label a flagged comment as "Y" (Yellow) if any of the ± two days price figures calculated against the "primary price" shows a price hike of 5%.
• Label a flagged comment as "C" if it does not trigger any price increase threshold.
• Label a flagged comment as "N" (Null) if it has no price figure (due to missing price figures from ADVFN).
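The following Python sketch transcribes the labelling rules above; the function signature and data shapes are assumptions, not the FDBM code.

def label_comment(primary_price, window_prices):
    if primary_price is None or not window_prices:
        return "N"                                    # missing price figures
    max_hike = max((p - primary_price) / primary_price for p in window_prices)
    if max_hike >= 0.15:
        return "R"
    if max_hike >= 0.10:
        return "A"
    if max_hike >= 0.05:
        return "Y"
    return "C"

print(label_comment(10.0, [10.1, 11.2, 10.4]))   # "A" (12% hike within the window)
print(label_comment(10.0, [10.1, 10.2]))         # "C"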
6 Forward Analysis Results
The results show that 49,858 comments were flagged as potentially illegal comments (highlighted in Table 1) when performing the initial step (i.e. the comments flagging step) of the forward analysis. These comments represent 9.82% of the total comments. When performing the price hike threshold labelling step, 3,613 (7.25%) of the flagged comments triggered the "R" price hike threshold of 15%, 2,555 (5.12%) of the flagged comments triggered the "A" price hike threshold of 10%, and 5,197 (10.42%) of the flagged comments triggered the "Y" price hike threshold of 5%. 37,895 (76.01%) flagged comments are labelled as "C" as they did not trigger any price hike threshold but are still worth investigating, at a lower priority than those flagged with price hike thresholds of 5%, 10% and 15%. The overall numbers of flagged comments that triggered the price hike thresholds are summarised in Table 2 and shown in Fig. 2.

Table 1. The number of flagged comments
Comments         Total     Percentage
Non-flagged      458,112   90.18%
Flagged          49,858    9.82%
Total comments   507,970   100%

Table 2. The number of flagged comments for each price hike threshold
Price hike threshold   Total comments   Percentage
R (15%)                3,613            7.25%
A (10%)                2,555            5.12%
Y (5%)                 5,197            10.42%
C (less than 5%)       37,895           76.01%
Null                   598              1.2%
Grand total            49,858           100%
Fig. 2. Flagged comments for the price hike thresholds.
C ( instead of >=). Additionally, existing tools such as GenProg do not guarantee to find a fix for simple operator faults, and when it generates a repair, it frequently breaks main functionalities and adds extra complexity to the code. The PMOs in this study change each operator into sets of alternatives and fix faulty binary operators including relational operators, arithmetic operators, bitwise operators, and shift operators in different program constructs (return statements, assignments, if bodies, and loop bodies). Evaluating MUT-APR is not within this paper’s focus; however, prior work evaluated the quality of generated repair when different coverage criteria were used [2] and the impact of different fault localization techniques on the effectiveness and performance of MUT-APR [25]. Combination of simple mutation operators, Jaccard fault localization heuristic, and a random search algorithm to fix binary operator faults improved MUT-APR performance without negatively impacting the effectiveness or quality of the generated repairs. Automated program repair is described in Sect. 2, and the MUT-APR approach is described in Sect. 2.1. Section 3 described the MUT-APR framework, and Sect. 4 explained how MUT-APR is used to fix faults. The limitations of existing tool is described in Sect. 5.
2 Automated Program Repair (APR)
Figure 1 shows the overall organization of APR techniques. First, APR applies fault localization (FL) techniques to locate potentially faulty statements, which is Step 1 in Fig. 1. After applying FL, a list of potentially faulty statements (LPFS) that orders statements based on their likelihood of containing faults is created. APR selects statements sequentially from the list and modifies them by using a set of PMOs, generating a modified copy of the faulty program called a variant (Step 2 in Fig. 1). To select a PMO, APR applies a search algorithm that determines the method of selecting PMOs from the pool of possible operators. Each generated variant
Fig. 1. Overall Automated Program Repair (APR) process.
is executed against the repair tests (Step 3 in Fig. 1) to decide whether the generated variant is a repair or not. If the generated variant passes all repair tests, it is considered a potential repair and the process stops. Otherwise, the algorithm is repeated for many generations until the number of generations reaches its limit.
2.1 MUT-APR Algorithm
MUT-APR fixes faulty binary operators by constructing new operators and applying a genetic algorithm. The MUT-APR framework allows different search algorithms to be applied; however, the use of the genetic algorithm with MUT-APR is described in this paper. First, potentially faulty statements are identified; then the algorithm is executed to generate new copies of the faulty program (variants), each consisting of one change in a binary operator. PMOs are picked randomly from the pool of operators; however, potentially faulty statements are selected sequentially based on their given order in the LPFS. Each generated variant is compiled and executed against the repair tests. A variant that passes all repair tests is considered a possible repair. If no possible repair is found, the algorithm runs for more iterations until the number of iterations reaches its limit with no repairs.

Table 1. Mutation operators supported by our approach
Mutation operator   Description
ROR                 Relational Operator Replacement
AOR                 Arithmetic Operator Replacement
BWOR                BitWise Operator Replacement
SOR                 Shift Operator Replacement
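The generate-and-validate loop described above can be illustrated with a self-contained Python toy (this is not the real tool, which mutates C programs via CIL): the "faulty program" is a one-line predicate with a wrong relational operator, the PMO pool is the set of alternative operators, and the repair tests are input/expected-output pairs. A variant that passes every repair test is reported as a possible repair.

import random

faulty_expr = "x > 0"                       # intended behaviour: x >= 0
pmo_pool = [">=", "<", "<=", "==", "!="]    # alternatives for the ">" operator
repair_tests = [({"x": 0}, True), ({"x": 5}, True), ({"x": -1}, False)]

def passes_all(expr):
    return all(eval(expr, {}, env) == expected for env, expected in repair_tests)

def repair(max_iterations=100):
    random.seed(0)
    for _ in range(max_iterations):
        op = random.choice(pmo_pool)                 # pick a PMO at random
        variant = faulty_expr.replace(">", op, 1)    # one binary-operator change
        if passes_all(variant):
            return variant                           # possible repair found
    return None                                      # iteration limit reached

print(repair())   # "x >= 0"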
MUT-APR mutation operators change each faulty operator into its possible alternatives. MUT-APR targets relational operators, arithmetic operators, bitwise operators, and shift operators (Table 1). The mutation operators are designed to (1) change relational operators in if statements, return statements, assignments, and loops; and (2) change arithmetic operators, bitwise operators and shift operators in return statements, assignments, if bodies, and loop bodies. Equal probability was assigned to all mutation operators. Algorithm 1 explains the mutation operator implementation. MUT-APR selects a potentially faulty statement from the LPFS; statements are selected based on their order. Then, a mutation operator is selected randomly. The statement stmti is checked (line 5). If it includes an operator, the operator is checked against the selected mutation operator (line 6). If the selected mutation operator is one of the operator's alternatives (line 7), the statement type is checked (line 8), and a new statement stmtj is created (line 9). Then, stmti is substituted by stmtj, creating a new variant (line 13).
Algorithm 1. Mutation Operator Pseudocode
1: Inputs: Program P and LPFS
2: Output: mutated program
3: for all statements stmti in the weighted path do
4:   if stmti contains an operator then
5:     let stmtOp = checkOperator(stmti)
6:     if stmtOp = Op1 then
7:       let mOp = choosePMO(ChangeOp1ToOp2),
8:       let stmtType = checkStmtType(stmti)
9:       let stmtj = apply(stmti, mOp)
10:    end if
11:  end if
12: end for
13: return P with stmti substituted by stmtj
3 MUT-APR Framework
MUT-APR is an evaluation framework that creates a configurable mutation-based tool allowing the user to vary APR mechanisms and components. The framework was built by adapting the GenProg Version 1 framework. GenProg was used due to its availability and because it is a state-of-the-art automated program repair tool that can fix faults in large C programs; it can easily be extended to support different program modification operators, fault localization techniques, and search algorithms. Table 2 summarizes the methods and classes that are modified in the GenProg code to create the MUT-APR tool. In order to support different FL techniques, MUT-APR implemented the changes introduced by Qi et al. [14]. A separate function that takes the faulty program and the set of repair tests
Table 2. The GenProg methods and classes that are modified in MUT-APR

Method/Class: Fault localization
The changes:
− MUT-APR applies the changes by Qi et al. [14] to support the use of different FL techniques.
− MUT-APR implements a separate function to generate the LPFS using different FL heuristics.

Method/Class: PMOs
The changes:
− MUT-APR turns off GenProg's PMO classes that change code at the statement level and implements new PMO classes that construct new operators.
− MUT-APR sends the potentially faulty statement ID as a parameter to the PMO classes to guarantee changing the operator of the selected statement.

Method/Class: Search algorithm
The changes: Different versions of MUT-APR were implemented to apply different search algorithms. GenProg's genetic algorithm code was changed to produce new algorithms:
− To implement the genetic algorithm without a crossover operator, the crossover operator method was turned off.
− To implement the random search, the selection algorithm and the crossover operator methods were turned off.
− To implement the guided search algorithms and exhaustive search algorithms, code was inserted to check the operator of the selected potentially faulty statement and send the operator as a parameter to the mutation method.
was implemented to generate the LPFS using different FL heuristics such as Jaccard, Ochiai, and Tarantula [25]. In addition, the crossover operator was turned off to evaluate its impact on MUT-APR, and a set of guided algorithms that check the faulty operator and apply only a small set of alternative operators was implemented. As described in Fig. 1, APR consists of fault localization, variant creation, and variant validation. In this section, the implementation of the MUT-APR components is described. Figure 2 shows the implemented components for each step in the APR technique. Variant creation differs between search algorithms: Fig. 2 includes the selection algorithm and the crossover operator, which are part of the genetic algorithm (GA) but not of the other algorithms. Therefore, to implement the genetic algorithm without a crossover operator (GAWoCross), the crossover operator is removed, and to implement a random search algorithm, the selection algorithm and crossover operator are removed.
Fig. 2. MUT-APR implemented components.
3.1 Fault Localization
To locate faults, MUT-APR creates an instrumented version of the faulty program and then applies a fault localization technique to create the LPFS (Step 1 in Fig. 2).
Code Coverage: MUT-APR first collects code coverage information based on the test paths executed by the repair tests. MUT-APR adapted the coverage code from the GenProg tool. The coverage code uses CIL [26], an intermediate language for C programs, which generates simplified versions of source code. Then, an instrumented version of the simplified faulty version is created by assigning a unique ID to each statement. To collect code coverage, an executable version of the instrumented faulty code is created. Then the code is executed against the repair tests one by one. For each test case, MUT-APR generates a text file that contains a unique list of the statement IDs that were executed. Coverage information is collected for each test case, creating many statement ID list files (the number of statement ID list files is equal to the number of repair tests) to be used by the fault localization technique when identifying potentially faulty statements.
Fault Localization Heuristic: To identify potentially faulty statements, MUT-APR employs many fault localization (FL) techniques. All of the employed FL techniques analyze the statement ID lists generated in the previous step. Fault localization code reads the statement ID lists and counts the number of times
each statement is executed by passing and failing tests. Then, FL computes a suspiciousness score for each statement ID using the number of statement executions and creates the list of potentially faulty statements (LPFS), which is an ordered unique list of statement IDs and their scores, as shown in Table 3.

Table 3. List of potentially faulty statements (LPFS) in the format used by the APR tool
Statement ID   Suspiciousness score
1              1.00
5              0.57
7              0.57
8              0.57
4              0.50
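The suspiciousness scores in Table 3 can be computed from the per-statement execution counts. The sketch below gives the standard Jaccard, Tarantula and Ochiai formulas in Python for illustration; it is not the tool's OCaml code, and the example counts are hypothetical.

import math

def jaccard(failed_s, passed_s, total_failed):
    return failed_s / (total_failed + passed_s)

def tarantula(failed_s, passed_s, total_failed, total_passed):
    fail_ratio = failed_s / total_failed
    pass_ratio = passed_s / total_passed
    return fail_ratio / (fail_ratio + pass_ratio) if (fail_ratio + pass_ratio) else 0.0

def ochiai(failed_s, passed_s, total_failed):
    denom = math.sqrt(total_failed * (failed_s + passed_s))
    return failed_s / denom if denom else 0.0

# Hypothetical statement executed by 3 of 4 failing tests and 1 of 6 passing tests:
print(round(jaccard(3, 1, 4), 2), round(tarantula(3, 1, 4, 6), 2), round(ochiai(3, 1, 4), 2))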
MUT-APR used the implementation from the GenProg framework. Initially, it used GenProg's Weighting Scheme to compute suspiciousness scores. In order to use different fault localization techniques, the changes developed by Qi et al. [14] were applied. Then, ten fault localization techniques were implemented using about 649 LOC of OCaml code.
3.2 Variant Creation
MUT-APR creates new variants by applying a set of PMOs to change faulty operators. To select a PMO from the pool of operators, a search algorithm is applied. When MUT-APR implements the GA and GAWoCross algorithms, it applies a selection algorithm to select the best variant for use by the next generation. When GA is implemented, a crossover operator is applied to combine changes from two variants into one variant (Step 2 in Fig. 2).
Search Algorithm: A search algorithm selects a PMO from the set of PMOs. Many versions of MUT-APR were implemented; each applies a search algorithm that determines how the PMO will be selected. A genetic algorithm was adapted from the GenProg framework. To implement a genetic algorithm without a crossover operator, the crossover operator code used by GenProg was disabled. To convert the GenProg genetic algorithm into a random search, the GenProg selection algorithm and crossover operator were both disabled. This produces a random search algorithm that only uses the GenProg code responsible for creating the initial population for the genetic algorithm and the program modification operator code to generate a variant. To implement the guided stochastic and exhaustive search algorithms, we added code to determine the operator in the selected potentially faulty statement. Then, the algorithm passes the operator as a parameter to the program
modification operator code (mutation code). Thus, the program modification code randomly selects one of the faulty operator's alternatives to be applied, as described in the next section.
Program Modification Operators: To implement our program modification operators, the original GenProg operators were substituted with fifty-eight new program modification operators that represent all binary operator alternatives. For each operator (e.g., >), an OCaml visitor class for each alternative was implemented. Each class takes the faulty program CIL file and the ID of the potentially faulty statement as inputs and returns a new statement by changing the operator into one of its alternatives (e.g., >=, <, <=, ==, or != for the > operator).
Selection Algorithm: For the genetic algorithm, a sampling algorithm is used to select the best variants for use by the next generation. First, variants with fitness equal to zero and those that do not compile are discarded, and then a tournament sampling algorithm selects the best variants from the remaining ones [27]. Tournament sampling takes a list of pairs (variant, fitness value) and the number of variants to be returned, and produces a list of variants. To select the best variants, the algorithm selects two variants randomly. The variant with the highest fitness value is included in the next population for use by the crossover operator. The selection process is repeated for all remaining variants.
Crossover Operator: The one-point crossover operator from GenProg was used. This operator takes two parent variants and returns four child variants: the two newly created variants and the parent variants. To create two new child variants, the crossover operator code takes the LPFS for each variant and randomly determines a cut-off point. Then, the statements after the cut-off point are swapped, creating two new LPFS lists, which are sent to a visitor class to create the new variants. Figure 3 shows how a one-point crossover operator creates new variants. It selects a cut-off point randomly, and then swaps the statements after the cut-off point between two parent variants, modifying their Abstract Syntax Trees (ASTs) to create two new child variants.
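The visitor-style operator replacement can be illustrated outside of OCaml/CIL. The following Python ast.NodeTransformer is illustrative only: it shows the same idea of a visitor that rewrites a single relational operator (here ">" into ">=") in a parsed program; it is not the paper's implementation.

import ast

class GtToGe(ast.NodeTransformer):
    def visit_Compare(self, node):
        self.generic_visit(node)
        # replace every ">" in this comparison with ">="
        node.ops = [ast.GtE() if isinstance(op, ast.Gt) else op for op in node.ops]
        return node

source = "def check(x):\n    return x > 0\n"
tree = ast.parse(source)
mutated = ast.fix_missing_locations(GtToGe().visit(tree))
print(ast.unparse(mutated))   # prints the variant containing "x >= 0"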
Fig. 3. One-point crossover operator.
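A sketch of the one-point crossover in Fig. 3 is shown below; it is illustrative only and operates on plain Python lists standing in for each variant's LPFS/edit list rather than on CIL ASTs.

import random

def one_point_crossover(parent_a, parent_b):
    cut = random.randrange(1, min(len(parent_a), len(parent_b)))   # random cut-off point
    child_a = parent_a[:cut] + parent_b[cut:]                      # swap the tails
    child_b = parent_b[:cut] + parent_a[cut:]
    return [child_a, child_b, parent_a, parent_b]                  # two children plus both parents

random.seed(0)
print(one_point_crossover(["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]))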
3.3 Variant Validation
Each generated variant is validated by executing the variant against the repair tests and computing its fitness value using a fitness function.
Fitness Function: A fitness value is computed for each variant to determine whether the generated variant is a potential repair or not. Created variants are executed against all repair tests, and the fitness value is computed and saved. The fitness method takes a variant and returns a fitness value. To compute a fitness value, each variant is compiled. If a variant fails to compile, it is assigned a fitness value equal to zero. If the variant compiles successfully, passing and failing tests are executed, and a fitness value is computed by counting the numbers of passing and failing tests, which are multiplied by a fixed weight. Variants and their fitness values are cached. If a variant is identical to a previously created one, its fitness value is not recomputed, which improves the tool's performance by reducing the number of fitness computations. Each computed fitness value is compared to a maximum value, which is given as an input to the repair algorithm. If a variant has a fitness value equal to the maximum value, it is identified as a potential repair, the fitness class returns the potential repair, and the process stops. If no variant maximizes the fitness value, the process is repeated for each created variant until a potential repair is found or the parameters reach their limits.
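A minimal sketch of such a fitness computation follows. The weights, helper names and cache key are assumptions chosen for illustration, not the tool's actual values.

fitness_cache = {}

def fitness(variant_key, compiles, passed_pos, passed_neg, w_pos=1.0, w_neg=10.0):
    if variant_key in fitness_cache:            # identical variant seen before: reuse value
        return fitness_cache[variant_key]
    # passed_pos: originally-passing tests that still pass; passed_neg: originally-failing tests now passing
    value = 0.0 if not compiles else w_pos * passed_pos + w_neg * passed_neg
    fitness_cache[variant_key] = value
    return value

max_fitness = 1.0 * 5 + 10.0 * 1                            # e.g. 5 passing tests, 1 failing test
print(fitness("variant-42", True, 5, 1) == max_fitness)     # True -> potential repair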
4 Repairing Faults Using MUT-APR
Fixing faults using MUT-APR requires access to the faulty C program source code and a set of repair tests. A user provides the passing and failing tests as well
as a maximum fitness value that determines whether the generated variant is a potential repair or not. First, faults must be identified. To identify potentially faulty locations, the user should run the coverage code on the faulty program (Step 1 in Fig. 4), creating an instrumented version of the faulty program (FaultyProgram-coverage.c). The instrumented faulty version is compiled (Step 2 in Fig. 4), creating executable instrumented code (FaultyProgram-coverage). Then, each test case is executed (Step 3 in Fig. 4), creating the statement ID list files used to create the LPFS. These steps can be executed automatically; bash shell scripts were created to automate them.
Fig. 4. Linux shell script to collect coverage information on the faulty program.
To create an LPFS, fault localization (FL) code is executed. FL code takes as command-line arguments the number of passing tests, the number of failing tests, the FL technique, and the statement ID list files that are generated in the previous step, and it returns the LPFS. Figure 5 is the command line to create an LPFS, where #Pass is the number of passing tests, #Fail is the number of failing tests, fl is the name of the fault localization technique, and TextFile1, TextFile2, ..., TextFilen are the statement ID list files generated by executing each test case on the instrumented code.
Fig. 5. Linux shell script to run fault localization technique.
To repair faults, a user runs the repair tool through a command-line interface as shown in Fig. 6, where #Gen is the number of generations, #Pop is the population size in each generation, Max is the maximum fitness value, and FaultyProgram is the faulty program name.
Fig. 6. Linux shell script to run repair code.
The repair tool produces a text file containing summary information for the repair algorithm execution and the new version of the faulty program if a potential repair is found.
5 MUT-APR Limitations
MUT-APR targets simple binary operator faults; however, there are a few limitations related to the framework implementation. Because GenProg and MUT-APR depend on the CIL framework, MUT-APR changes logical operators into if-then and if-else blocks. Thus, MUT-APR cannot fix some types of faults, such as logical operators and the equality operator. To repair logical operators, the blocks generated by CIL would need to be mutated, or MUT-APR would have to be redesigned so that it does not depend on CIL. In this study, program modification operators change arithmetic operators in return statements, assignments, if bodies, and loop bodies, but they cannot access arithmetic operators in if statements (e.g., if (x + 1 > 0)). This is also a limitation due to the use of CIL. Additionally, MUT-APR targets faults that require a one-line modification; therefore, the current implementation does not fix faults that require multiple-line changes.
6
Related Work
Recent research interest has been directed toward automated program repair techniques and mechanisms [1,7,9,16,18,20,28–31]. This research is based on the GenProg tool developed by Weimer et al. [10–13], which applies genetic programming to repair different types of faults in C programs. In further work, Le Goues et al. [32,33] improved GenProg to scale to larger programs. Repairs were represented as patches. They also introduced fix localization, which provides a list of statements used as a source for the repair. Additionally, different weighting schemes and crossover operators were applied. Faults in Python software were automatically repaired by the pyEDB tool [5]. The tool modifies a program via patches by selecting changes from look-up tables that are created in a preprocessing step using rewrite rules that map each value to all possible modifications. SemFix [8] is a tool that applies semantic analysis to fix faults. Faulty statements are identified and ordered using Tarantula [34]; statements are selected sequentially. For each statement, constraints are derived using symbolic execution. Then a repair is generated through program synthesis. Debroy and Wong [1] fixed faults through a brute-force search method. Tarantula [34] was used to compute the suspiciousness score for each statement. Statements are ranked, and first-order mutation operators are applied one by one, each creating a unique mutant. String matching is used to check each mutant against the original program. If they match, the mutant is considered a “potential fix” and is then executed against all tests.
Qi et al. [35] developed an APR tool called RSRepair, which applied a simple random search algorithm within GenProg to study the impact of the random search algorithm on APR compared to that of genetic programming. To further improve the efficiency of RSRepair, a test prioritization technique was applied to decrease the number of test-case executions required until a fault is fixed [15]. Qi et al. [36] also developed the GenProg-FL tool, a modified version of GenProg, to evaluate the APR effectiveness and performance of different fault localization techniques. They found that Jaccard was better at identifying actual faulty locations than other fault localization techniques. In a prior study by Assiri and Bieman [2], it was found that using simple operators to fix faulty operators was more effective than using existing code as done by GenProg. The impact of different repair test suites was also evaluated, and it was found that using repair tests that satisfy branch coverage reduces the number of newly introduced faults compared to the use of repair tests that satisfy statement coverage and randomly generated tests. In a later study [25], ten fault localization techniques were evaluated using MUT-APR, and it was found that Jaccard improved repair correctness. Also, Jaccard never decreased performance compared to the alternatives. To study the impact of different search algorithms with MUT-APR, three search algorithms were compared [3]: random search improves APR effectiveness and performance, but the genetic algorithm and the genetic algorithm without a crossover operator improve the quality of potential repairs compared to random search. There are other APR techniques that use behavioral models, contracts, and bug reports to repair faults. Dallmeier et al. [28] developed PACHIKA to generate fixes by comparing object behavior models, and Wei et al. [29] developed a tool called AutoFix-E to automate fault fixing in Eiffel programs equipped with contracts. In recent research, Pei et al. [37] integrated AutoFix into the EiffelStudio development environment to ease its use by users who are not experts in APR techniques. Liu et al. [38] developed an APR tool called R2Fix, which fixes software faults using bug reports. Kaleeswaran et al. [39] proposed a semi-automated approach to generate repair hints that are used by developers to fix faults. Demsky et al. [18] developed a tool to repair faults in data structures. Elkarablieh and Khurshid [19] developed a tool, Juzi, which inserts code into a data structure and predicate methods to fix data structure violations in Java programs. Perkins et al. [16] developed ClearView, which automates the fixing of faults in deployed software. Kern and Esparza [4] presented a technique to fix faults in Java programs. Developers must give syntactic constructs of faulty expressions called hotspots, a set of alternative expressions to fix the fault, and a set of tests. A tool scans the code for expressions that match the hotspots. Then a changeset is created to collect information about the hotspots and their alternatives. A template is created from the original program for each hotspot. A new variant is created by replacing hotspots in each template with one of the alternatives in the relevant changeset. Carzaniga et al. [17] automated the repair of faults in Java software at runtime. Their method utilizes redundancy represented by a code segment that offers an alternative implementation of the same functionality. Hofer and Wotawa [40]
used genetic programming to repair spreadsheet faults. Jin et al. [20] developed AFix, which fixes single-variable atomicity violations, and Liu and Zhang [21] developed Axis, which fixes atomicity violations by solving control constraints. Sumi et al. [22] developed a dataset of source code to be used to repair faults automatically. They also proposed using the structures of faulty code to fix faults. Jiang et al. [23] developed MT-GenProg to extend APR techniques to fix application faults without oracles through the use of metamorphic testing. They found that MT-GenProg effectively fixed faults and that its performance (in terms of repair time) is similar to that of GenProg.
7
Conclusion
MUT-APR is a prototype tool that fixes binary operator faults in C source code. It was developed to allow applying different components and mechanisms, such as different program modification operators, fault localization techniques, and search algorithms, in order to find the best combination that optimizes APR effectiveness, performance, and the quality of generated repairs. The prior studies [2,3,25] summarize experimental findings; in this paper, the architecture of MUT-APR and how it is used to fix faults are described. The next step is extending the MUT-APR tool to repair additional faults such as unary operator faults, multiple faults, and Java faults. Acknowledgment. The authors would like to thank Westley Weimer and his research group for sharing GenProg, and Yuhua Qi et al. [14] for sharing the GenProg-FL tool.
References 1. Debroy, V., Wong, W.E.: Using mutation to automatically suggest fixes for faulty programs. In: 2010 Third International Conference on Software Testing, Verification and Validation (ICST), April 2010, pp. 65 –74 (2010) 2. Assiri, F.Y., Bieman, J.M.: An assessment of the quality of automated program operator repair. In: Proceedings of the 2014 ICST Conference, ICST 2014 (2014) 3. Assiri, F.Y., Bieman, J.M.: The impact of search algorithms in automated program repair. Procedia Comput. Sci. 62, 65–72 (2015) 4. Kern, C., Esparza, J.: Automatic error correction of Java programs. In: Proceedings of the 15th International Conference on Formal methods for industrial critical systems, FMICS 2010, pp. 67–81. Springer, Heidelberg (2010) 5. Ackling, T., Alexander, B., Grunert, I.: Evolving patches for software repair. In: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO 2011, pp. 1427–1434. ACM, New York (2011) 6. Arcuri, A.: On the automation of fixing software bugs. In: Companion of the 30th International Conference on Software Engineering, ICSE Companion 2008, pp. 1003–1006. ACM, New York (2008) 7. Arcuri, A., Yao, X.: A novel co-evolutionary approach to automatic software bug fixing. In: IEEE Congress on Evolutionary Computation, CEC 2008. IEEE World Congress on Computational Intelligence, June 2008, pp. 162 –168 (2008) 8. Nguyen, H.D.T., Qi, D., Roychoudhury, A., Chandra, S.: Semfix: program repair via semantic analysis. In: Proceedings of the 2013 International Conference on Software Engineering, pp. 772–781. IEEE Press (2013)
9. Konighofer, R., Bloem, R.: Automated error localization and correction for imperative programs. In: Formal Methods in Computer-Aided Design (FMCAD), 30 November 2011, vol. 2, pp. 91–100 (2011) 10. Forrest, S., Nguyen, T., Weimer, W., Le Goues, C.: A genetic programming approach to automated software repair. In: Proceedings of the 11th Annual Conference on Genetic and evolutionary computation, GECCO 2009, pp. 947–954. ACM, New York (2009) 11. Weimer, W., Nguyen, T., Le Goues, C., Forrest, S.: Automatically finding patches using genetic programming. In: Proceedings of the 31st International Conference on Software Engineering, ICSE 2009, pp. 364–374. IEEE Computer Society, Washington, DC, USA (2009) 12. Weimer, W., Forrest, S., Le Goues, C., Nguyen, T.: Automatic program repair with evolutionary computation. Commun. ACM 53(5), 109–116 (2010) 13. Le Goues, C., Nguyen, T., Forrest, S., Weimer, W.: GenProg: a generic method for automatic software repair. IEEE Trans. Softw. Eng. 38(1), 54–72 (2012) 14. Qi, Y., Mao, X., Lei, Y., Wang, C.: Using automated program repair for evaluating the effectiveness of fault localization techniques. In: Proceedings of the 2013 International Symposium on Software Testing and Analysis, ISSTA 2013, pp. 191–201. ACM, New York (2013) 15. Qi, Y., Mao, X., Lei, Y.: Efficient automated program repair through fault-recorded testing prioritization. In: 2013 29th IEEE International Conference on Software Maintenance (ICSM), September 2013, pp. 180–189 (2013) 16. Perkins, J.H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J., Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G., Wong, W.-F., Zibin, Y., Ernst, M.D., Rinard, M.: Automatically patching errors in deployed software. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP 2009, pp. 87–102. ACM, New York (2009) 17. Carzaniga, A., Gorla, A., Mattavelli, A., Perino, N., Pezze, M.: Automatic recovery from runtime failures. In: Proceedings of the 2013 International Conference on Software Engineering, pp. 782–791. IEEE Press (2013) 18. Demsky, B., Ernst, M.D., Guo, P.J., McCamant, S., Perkins, J.H., Rinard, M.: Inference and enforcement of data structure consistency specifications. In: Proceedings of the 2006 International Symposium on Software Testing and Analysis, ISSTA 2006, pp. 233–244. ACM, New York (2006) 19. Elkarablieh, B., Khurshid, S.: Juzi. In: ACM/IEEE 30th International Conference on Software Engineering, ICSE 2008, pp. 855–858. IEEE (2008) 20. Jin, G., Song, L., Zhang, W., Lu, S., Liblit, B.: Automated atomicity-violation fixing. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, pp. 389–400. ACM, New York (2011) 21. Liu, P., Zhang, C.: Axis: automatically fixing atomicity violations through solving control constraints. In: Proceedings of the 2012 International Conference on Software Engineering, pp. 299–309. IEEE Press (2012) 22. Sumi, S., Higo, Y., Hotta, K., Kusumoto, S.: Toward improving graftability on automated program repair. In: 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 511–515. IEEE (2015) 23. Jiang, M., Chen, T.Y., Kuo, F.-C., Towey, D., Ding, Z.: A metamorphic testing approach for supporting program repair without the need for a test oracle. J. Syst. Softw. 126, 127–140 (2016) 24. Microsoft Zune affected by ‘bug’, December 2008. http://news.bbc.co.uk/2/hi/ technology/7806683.stm
25. Assiri, F.Y., Bieman, J.M.: Fault localization for automated program repair: effectiveness, performance, repair correctness. Softw. Qual. J. 25, 171–199 (2015) 26. CIL Intermediate Language. http://kerneis.github.io/cil/ 27. Miller, B.L., Goldberg, D.E.: Genetic algorithms, tournament selection, and the effects of noise. Complex Syst. 9(3), 193–212 (1995) 28. Dallmeier, V., Zeller, A., Meyer, B.: Generating fixes from object behavior anomalies. In: Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE 2009, pp. 550–554. IEEE Computer Society, Washington, DC, USA (2009) 29. Wei, Y., Pei, Y., Furia, C.A., Silva, L.S., Buchholz, S., Meyer, B., Zeller, A.: Automated fixing of programs with contracts. In: Proceedings of the 19th International Symposium on Software Testing and Analysis, ISSTA 2010, pp. 61–72. ACM, New York (2010) 30. Wilkerson, J.L., Tauritz, D.: Coevolutionary automated software correction. In: Genetic and Evolutionary Computation Conference, pp. 1391–1392 (2010) 31. Kim, D., Nam, J., Song, J., Kim, S.: Automatic patch generation learned from human-written patches. In: Proceedings of the 2013 International Conference on Software Engineering, pp. 802–811. IEEE Press (2013) 32. Le Goues, C., Dewey-Vogt, M., Forrest, S., Weimer, W.: A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. In: Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pp. 3–13. IEEE Press, Piscataway (2012) 33. Le Goues, C., Weimer, W., Forrest, S.: Representations and operators for improving evolutionary software repair. In: Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference, GECCO 2012, pp. 959–966. ACM, New York (2012) 34. Jones, J.A., Harrold, M.J.: Empirical evaluation of the tarantula automatic faultlocalization technique. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE 2005, pp. 273–282. ACM, New York (2005) 35. Qi, Y., Mao, X., Lei, Y., Dai, Z., Wang, C.: Does genetic programming work well on automated program repair? In: 2013 Fifth International Conference on Computational and Information Sciences (ICCIS), pp. 1875–1878. IEEE (2013) 36. Qi, Y., Mao, X., Lei, Y., Dai, Z., Qi, Y., Wang, C.: Empirical effectiveness evaluation of spectra-based fault localization on automated program repair. In: 2013 IEEE 37th Annual Computer Software and Applications Conference (COMPSAC), July 2013, pp. 828–829 (2013) 37. Pei, Y., Furia, C.A., Nordio, M., Meyer, B: Automated program repair in an integrated development environment. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE), vol. 2, pp. 681–684. IEEE (2015) 38. Liu, C., Yang, J., Tan, L., Hafiz, M.: R2Fix: automatically generating bug fixes from bug reports. In: 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation (ICST), pp. 282–291. IEEE (2013) 39. Kaleeswaran, S., Tulsian, V., Kanade, A., Orso, A.: Minthint: automated synthesis of repair hints. In: Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pp. 266–276. ACM, New York (2014) 40. Hofer, B., Wotawa, F.: Mutation-based spreadsheet debugging. In: 2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), November 2013, pp. 132–137 (2013)
GPU_MF_SGD: A Novel GPU-Based Stochastic Gradient Descent Method for Matrix Factorization Mohamed A. Nassar(&), Layla A. A. El-Sayed, and Yousry Taha Department of Computer and Systems Engineering, Alexandria University, Alexandria, Egypt
[email protected],
[email protected],
[email protected]
Abstract. Recommender systems are used in most modern applications. Providing real-time suggestions with high accuracy is considered one of the most crucial challenges that face them. Matrix factorization (MF) is an effective technique for recommender systems as it improves accuracy. Stochastic Gradient Descent (SGD) for MF is the most popular approach used to speed up MF. SGD is a sequential algorithm, which is not trivial to parallelize, especially for large-scale problems. Recently, many studies have proposed methods for parallelizing SGD. In this research, we propose GPU_MF_SGD, a novel GPU-based method for large-scale recommender systems. GPU_MF_SGD utilizes Graphics Processing Unit (GPU) resources by ensuring load balancing and linear scalability, and by achieving coalesced access to global memory without a preprocessing phase. Our method demonstrates a 3.1X–5.4X speedup over the state-of-the-art GPU method, CuMF_SGD. Keywords: Collaborative filtering (CF) · Matrix factorization (MF) · GPU implementation · Stochastic Gradient Descent (SGD)
1 Introduction Recently, recommender systems have become a popular tool used in various applications including Facebook, YouTube, Twitter, Email services, News services and Hotel reservation applications [1–5]. In recommender systems, users get a list of suggested items (i.e. movies, friends, news, advertisements, products, etc.). One of the most important challenges that face recommender systems is suggesting accurate recommendations in real time [6–8]. Recommender systems can be categorized into non-personalized filtering, content-based filtering (CBF), collaborative filtering (CF) and matrix factorization techniques [1, 3, 9–15]. Non-personalized recommender systems suggest items to a user based on the average of the ratings given to the items by other users. This category of recommender systems is trivial in terms of implementation, but it lacks personalization: recommended items are the same for all users regardless of their profiles [1, 4, 13]. In CBF, items are suggested to a user based on items rated previously by the user. CBF represents items and users’ behaviors as features to find similarity between users’
preferences and items. Defining features that represent items and users’ behaviors is a major problem in CBF [1, 14, 16]. CF is a process of suggesting items based on users’ collaboration or the similarity between items [1, 3, 9, 10]. CF overcomes the issue of CBF feature representation. However, CF cannot suggest items when there are no similarities between users or items. Moreover, CF performs slowly on huge datasets [17–20]. MF, a dimensionality reduction technique, is an advanced technique for recommender systems where the representation of users and items uses the same latent features. Predicting ratings is simply performed by the inner product of user-item feature vector pairs [1, 11, 14]. MF has many advantages over CF for the following reasons: (1) computations required for predictions are very simple and have negligible time complexity; (2) high accuracy is guaranteed even if there is no similarity between users or items; and (3) MF is scalable for large-scale recommender systems. In MF, a rating matrix R of size m × n is factorized into two low-rank feature matrices P (m × k) and Q (k × n), such that R ≈ P × Q, where k is the number of latent features, and m and n are the numbers of users and items, respectively. Figure 1 shows an example of matrix factorization, where the following optimization problem has to be solved:

$\min_{P,Q} \sum_{(u,v) \in R} \left[ \left( r_{u,v} - p_u^T q_v \right)^2 + \lambda_P \|p_u\|^2 + \lambda_Q \|q_v\|^2 \right]$   (1)

where $\|\cdot\|$ is the Euclidean norm, (u, v) ∈ R are the indices of users’ ratings, and $\lambda_P$ and $\lambda_Q$ are the regularization parameters for avoiding over-fitting. (1) is a difficult optimization problem [11, 16, 17]. To find P and Q, i.e. to build the model, it is required to perform expensive computations [21, 22].
Fig. 1. An example of matrix factorization where m = 4, n = 4, k = 2 [7].
Building/rebuilding the model of users’ ratings is complex in terms of computations. Therefore, much research has been directed at designing fast and scalable techniques to solve (1) [7, 8, 23–32]. Three main algorithms, Coordinate Descent (CD), Alternating Least Squares (ALS), and SGD, have been proposed to solve the matrix factorization problem efficiently [7, 33, 40, 41]. CD is shown to be vulnerable to getting stuck in local optima [22]. The authors of [7, 33] show that SGD consistently converges faster than ALS.
Moreover, SGD is also more practical in systems where new ratings are progressively entered into the system [7]. Therefore, we focus on improving SGD in this paper. The basic idea of SGD is to randomly select a rating $r_{u,v}$ from R, where u and v are the indices of R. Then, the $p_u$ and $q_v$ variables are updated by the following rules:

$p_u \leftarrow p_u + \gamma \left( e_{u,v} q_v - \lambda_P p_u \right)$   (2)

$q_v \leftarrow q_v + \gamma \left( e_{u,v} p_u - \lambda_Q q_v \right)$   (3)
where $e_{u,v}$ is the difference between the actual rating $r_{u,v}$ and the predicted rating $p_u^T q_v$, and $\gamma$ is the learning rate. Then another random instance $r_{u,v}$ is selected, and $p_u$ and $q_v$ are updated by applying rules (2) and (3), respectively. After finishing all ratings, the previous steps are repeated until an accepted Root Mean Square Error (RMSE) is reached [11] (a serial sketch of this procedure is given at the end of this section). The time complexity per iteration of SGD is O(|µ|k), where |µ| is the number of ratings. The overall SGD procedure takes hours. It is worth mentioning that there are two main streams of work to improve the performance of SGD. The first stream focuses on improving statistical properties to reduce the iterations required to converge [34–37]. The second stream works on improving the computations per iteration by proposing efficient parallel SGD methods [6–8, 30, 38–42]. In this research, we focus on the second stream, where state-of-the-art research on parallel SGD for MF is mainly categorized in terms of system type into (1) shared-memory systems and (2) distributed systems. Shared-memory systems are more efficient than distributed systems, as distributed systems depend on network connection bandwidth and SGD requires aggregation of parameters/data at the end of each iteration [7, 43–45]. Nowadays, all shared-memory systems are heterogeneous, including GPUs and/or Field-Programmable Gate Arrays (FPGAs) to accelerate computationally intensive applications [46–50]. Generally, the GPU has better performance than the FPGA for floating-point-based applications like SGD [46, 48], as the GPU comes with native floating-point processors. Our objective is to propose an efficient parallel SGD method based on the GPU, GPU_MF_SGD, which:
• Provides a highly scalable SGD implementation, i.e. achieves linear scalability when increasing the level of parallelism.
• Utilizes GPU resources, and ensures coalesced access of GPU global memory.
• Achieves load balancing across processing elements.
• Overcomes any preprocessing phase.
• Accesses ratings randomly in parallel.
• Reduces the probability of overwriting occurrence by processing the datasets through a predefined number of steps per iteration.
The remainder of the paper is organized as follows. In Sect. 2, we discuss existing parallelized SGD methods for MF. In Sect. 3, we present GPU_MF_SGD, our proposed efficient parallel method. Experiments and results are discussed in Sect. 4. Finally, conclusion and future work are discussed in Sect. 5.
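For reference, the serial per-rating update of rules (2) and (3) can be sketched as follows. This is a minimal C-style illustration with hypothetical data structures and parameter names; it is not the implementation evaluated in this paper.

```c
#include <stdlib.h>

/* One rating entry in COO form: user u rated item v with value r. */
typedef struct { int u, v; float r; } Rating;

/* Serial SGD epoch over all ratings, following update rules (2) and (3):
 *   p_u <- p_u + gamma * (e_uv * q_v - lambda_p * p_u)
 *   q_v <- q_v + gamma * (e_uv * p_u - lambda_q * q_v)
 * P is m x k and Q is n x k, both stored row-major. */
void sgd_epoch(const Rating *R, long n_ratings, float *P, float *Q, int k,
               float gamma, float lambda_p, float lambda_q) {
    for (long t = 0; t < n_ratings; t++) {
        const Rating *x = &R[rand() % n_ratings];  /* pick a random rating  */
        float *p = &P[(long)x->u * k];
        float *q = &Q[(long)x->v * k];
        float pred = 0.0f;
        for (int f = 0; f < k; f++) pred += p[f] * q[f];
        float e = x->r - pred;                     /* prediction error e_uv */
        for (int f = 0; f < k; f++) {
            float pf = p[f];                       /* keep old p_u for q_v  */
            p[f] += gamma * (e * q[f] - lambda_p * pf);
            q[f] += gamma * (e * pf   - lambda_q * q[f]);
        }
    }
}
```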
2 Background Although SGD is a sequential algorithm, many studies have proposed efficient parallel methods for it. Throughout this section, we discuss the state-of-the-art parallel SGD methods for MF and highlight the main issues associated with each one. 2.1
Hogwild
It was observed that, for randomly selected ratings, the updates of the feature matrices P and Q are independent, not sharing the same row or column, under the condition of a highly sparse rating matrix R. Figure 2 shows examples of dependent and independent updates.
Fig. 2. Examples of independent updates (ratings B and C) and dependent updates (ratings A and C) [38].
In Hogwild [41], concurrent threads randomly select ratings from R and update P and Q. To avoid updating dependent ratings at the same time, synchronization (an atomic operation) is required. Figure 3 shows the updating sequences of two threads, where a red dot indicates that two dependent random ratings are accessed and processed simultaneously. Hogwild showed that synchronization is not required when R is very sparse and the number of concurrent threads is small compared to the number of ratings (a minimal sketch of this lock-free scheme is given after Fig. 3). Hogwild was proposed for shared-memory systems; however, it has several issues, as follows:
• Memory discontinuity, where random access of shared memory degrades system performance.
• Inapplicability on the GPU, where random accesses to global memory are very expensive.
GPU_MF_SGD: A Novel GPU-Based Stochastic Gradient Descent Method
275
Fig. 3. An example shows updating sequences of two threads in Hogwild [40].
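A minimal sketch of the lock-free scheme, written in C with OpenMP purely for illustration (it is not Hogwild's original code): threads pick random ratings and update P and Q concurrently with no locks or atomics, relying on the sparsity of R to make conflicting updates rare.

```c
#include <stdlib.h>
#include <omp.h>

typedef struct { int u, v; float r; } Rating;

/* Hogwild-style epoch: threads update P (m x k) and Q (n x k) concurrently
 * with no locks or atomics; for a very sparse R, two threads rarely touch
 * the same row p_u or q_v at the same time. */
void hogwild_epoch(const Rating *R, long n_ratings, float *P, float *Q,
                   int k, float gamma, float lambda) {
    #pragma omp parallel
    {
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long t = 0; t < n_ratings; t++) {
            const Rating *x = &R[rand_r(&seed) % n_ratings];
            float *p = &P[(long)x->u * k], *q = &Q[(long)x->v * k];
            float e = x->r;
            for (int f = 0; f < k; f++) e -= p[f] * q[f];
            for (int f = 0; f < k; f++) {          /* unsynchronized writes */
                float pf = p[f];
                p[f] += gamma * (e * q[f] - lambda * pf);
                q[f] += gamma * (e * pf   - lambda * q[f]);
            }
        }
    }
}
```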
2.2
FSGD
FSGD [22] aimed to overcome the overwriting issue and memory discontinuity by introducing the following techniques:
• Partitioning the rating matrix R. FSGD divides R into blocks and assigns independent blocks to threads.
• Lock-free scheduling. Once a thread finishes processing a block, the scheduler assigns a new block that satisfies the following two criteria. First, it is a free block. Second, its number of past updates is the smallest among all free blocks.
• Partial random method. To overcome the issue of memory discontinuity, FSGD simply accesses ratings within blocks sequentially, but block selection is performed randomly.
• Random shuffling of R and sorting blocks. FSGD overcomes the issue of imbalanced distribution of ratings across blocks by randomly shuffling ratings and sorting the partitioned blocks.
Despite the popularity of FSGD, it has a complex preprocessing phase, which includes a complex scheduler, random shuffling of ratings and sorting each block by user identities. 2.3
GPUSGD
GPUSGD [6] is an SGD method based on matrix blocking using the GPU. It divides a rating matrix R into blocks that are mutually independent, and their corresponding variables are updated in parallel. Independent blocks run simultaneously using thread blocks of the GPU. The authors prove that all independent ratings inside each block can be tagged with the same tag number. Therefore, a preprocessing phase that includes tagging and sorting ratings is required to provide coalesced access and independent updates. The experimental results show that GPUSGD performs much better in accelerating matrix factorization compared with the existing state-of-the-art parallel methods. However, GPUSGD suffers from an intensive preprocessing phase (tagging, sorting and partitioning) and load imbalance across GPU blocks and threads.
2.4
CuMF_SGD
CuMF_SGD [7] is the most recent parallelized SGD method based on the GPU. Two schemes that are equivalent in terms of accuracy and performance (Batch-Hogwild and Wavefront-update) were proposed. CuMF_SGD overcomes the complexity and time consumed to schedule blocks when the number of thread blocks becomes large. CuMF_SGD utilizes GPU resources using half-precision (2 bytes), which does not affect the accuracy and improves memory bandwidth. In addition, it accesses global memory in a coalesced manner. CuMF_SGD exploits spatial data locality using the L1 cache. A preprocessing phase is necessary to shuffle ratings and partition data into batches. Wavefront-update simplifies the existing complex scheduling schemes [21, 22], which maintain a two-dimensional lookup table to find the coordinate (row and column) to update. Wavefront-update uses only a one-dimensional lookup table that maintains only columns. Figure 4 shows an example of four concurrent thread blocks (workers) working on R, which is partitioned into 4 × 8 blocks. At the first iteration (wave), workers are assigned to independent blocks and update the status of columns in the lookup table. After processing the independent blocks, workers need to check the status of the columns before processing other blocks. Wavefront-update requires a preprocessing phase of partitioning and maintains a scheduler. In addition, the scheduler cannot maintain the same number of updates per block if ratings are not uniformly distributed across blocks [22].
Fig. 4. Wavefront-update example where each parallel worker is assigned to a row and a randomized column update sequence [7].
Throughout this section, we introduced the recent existing parallel methods to enhance SGD for MF. In addition, we highlighted the main issues associated with each method. In the following section, we discuss our novel GPU-based method, which aims to overcome the main issues.
3 A Novel GPU-Based SGD Method for MF According to the most recent research and existing recommender systems, we can summarize the following observations.
• Observation 1. SGD for MF is memory bound, i.e. the number of floating-point operations is lower than the number of memory accesses by 37% in SGD [7]. Consequently, memory access utilization directly affects the performance of SGD. In addition, proposing an efficient GPU method can improve the performance of SGD, as the GPU has higher memory bandwidth and inter-device connection speed compared with the CPU.
• Observation 2. All proposed methods neglect the considerable execution time of the preprocessing phase. Recently proposed methods [6–8, 30] consist of a preprocessing phase and a processing phase. The preprocessing phase includes one or more of the following procedures: shuffling the dataset, partitioning the dataset into blocks, sorting the partitioned dataset and/or scheduling the execution of dataset blocks using a complex scheduler. For the most recent method, CuMF_SGD [7], we found that around 40% of the overall execution time (preprocessing time + processing time) is spent in the preprocessing phase for the 20 M MovieLens dataset when k = 32 and running on an NVIDIA Tesla K80 [51]. Therefore, overcoming the preprocessing phase enhances the overall performance of recommender systems.
• Observation 3. Rating matrices for recommender systems are highly sparse. For the 20 M MovieLens, Netflix, Yahoo and Hugewiki datasets, matrix densities are 0.11%, 1.17%, 0.29% and 0.15% respectively.
• Observation 4. Random parallel processing of the rating matrix R does not affect the accuracy when R is very sparse and the number of threads is lower than the number of ratings. Hogwild [40] proved that the overwriting issue, which may occur because of random parallel processing of the rating matrix R, does not require atomic operations and does not affect the accuracy.
• Observation 5. Existing methods suffer from scalability issues [7]. Due to the complex scheduler and/or required synchronization between processing elements, existing methods scale only to a limited number of processing elements/threads.
• Observation 6. Load imbalance results from imbalance in the distribution of ratings across dataset blocks. Recommender systems are highly dynamic systems [52] where the number of ratings, number of users and number of items change over time. Partitioning the rating matrix into blocks and assigning them uniformly across processing elements leads to load imbalance and therefore underutilized resources.
Based on the mentioned observations, we introduce an efficient SGD method for MF based on the GPU, GPU_MF_SGD. Before discussing the proposed method, it is worth mentioning that the representation of the rating matrix R is a coordinate list (COO) [53], i.e. R is represented as a one-dimensional array of length l, where l is the number of ratings in R. Each entry of R has the structure r_entry (u, i, r), where u is the user identity, i is the item identity and r is the rating of user u for item i. Figure 5 shows the code of the GPU_MF_SGD kernel. We describe the algorithm through the following main optimization techniques.
• Shared memory utilization. Instead of accessing the rating matrix R randomly from global memory (memory discontinuity) [40], we utilize shared memory, which is two orders of magnitude faster than global memory access, to shuffle R and improve system performance as follows [54–56]. We configure each thread block
Fig. 5. Example code of the GPU_MF_SGD kernel with highlighted optimization techniques, where K = 64.
to have th_size threads and a shared memory array (sh_rating) with a predefined length (no_r_b), where no_r_b/th_size = st (Step 4). st is the number of iterations required for each thread block to process sh_rating. There are two main reasons behind the idea of processing sh_rating in st steps: (1) to utilize resources, as for each thread block the available shared memory size is a multiple of the available threads; and (2) to reduce the matrix density by (100/st)%, thus reducing the probability of dependent updates. Shuffling R is guaranteed by coalesced access to global memory and random accesses to sh_rating using an offline-calculated array of random numbers (rand) (Steps 5, 7, 8, 9). Figure 6 shows an example of loading R into the shared memory of two thread blocks where st = 2, l = 8 and th_size = 2. It can be seen that two levels of shuffling are performed with negligible time complexity, as (1) each thread loads two ratings into shared memory, the first rating from the first half of R and the second rating from the second half of R; and (2) each thread accesses shared memory randomly.
• Coalesced access of P and Q. Although threads access ratings from shared memory in a coalesced manner (Steps 19, 20, 21), reads and writes of the rows of P and Q are performed randomly, which degrades performance drastically. To overcome this issue, we configure each group of 32 consecutive threads to complete
Fig. 6. An example of the process of loading rating array R into shared memory of thread blocks.
Fig. 7. Examples of non-coalesced and coalesced access for P.
computations of a rating (Steps 23, 24, 26, 27). Figure 7 shows two examples of non-coalesced and coalesced access of P rows where th_size = 2 and k = 2.
• Warp shuffle [6, 57]. Instead of using shared memory to synchronize and broadcast the dot product results of p and q, we use warp shuffle to broadcast the result of the dot product (Steps 29 to 34). The warp shuffle has better performance than shared memory as (1) it uses extra hardware support; (2) register operations are faster than shared memory operations; and (3) there is no need to synchronize between threads [7]. (A standalone sketch of this technique is given after Fig. 8.)
• Instruction-level parallelism (ILP) [7]. For K > 32, each thread is responsible for K/32 features. The instruction order is arranged to maximize ILP (Steps 23 to 40).
• Half-precision. As SGD for MF is a memory-bound algorithm, any enhancement to memory access will improve performance. New GPU architectures offer half-precision (2-byte) storage, which is fast to transform to float and does not affect the accuracy [7] (Steps 26, 27).
• Warp divergence avoidance [58]. The GPU has performance penalties with conditional statements, as different execution paths are generated. GPU_MF_SGD does not contain conditional branches, which improves overall performance.
• No preprocessing phases. All existing methods have a computationally intensive preprocessing phase, which includes sorting, partitioning, random shuffling, scheduling, etc. to ensure independent updates, random access, and load balance. GPU_MF_SGD does not include any preprocessing phase, as we guarantee a high probability of independent updates of P and Q by random access of ratings and processing R in predefined steps.
• Linear scalability. If we increase the number of threads, GPU_MF_SGD theoretically achieves linear scalability. Existing methods, in contrast, lack scalability due to the required synchronization and/or scheduling [6, 7, 22, 40].
Figure 8 shows the code of the GPU_MF_SGD overall procedure. First, we grid the GPU into one-dimensional thread blocks with a size of l/no_r_b, and organize each thread block into 1D threads of size th_size (Steps 4, 5). Then, the kernel execution is launched (Step 7). Finally, the RMSE is calculated (Step 9). Steps 7 and 9 are repeated until an accepted RMSE is reached.
Fig. 8. Example code of the GPU_MF_SGD overall procedure.
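As a concrete, standalone illustration of the warp-shuffle technique highlighted in Sect. 3, the following CUDA sketch (our own example, not the paper's kernel) computes a k = 64 dot product within a single warp, with each of the 32 lanes handling k/32 features, and broadcasts the result to all lanes using __shfl_xor_sync, with no shared memory and no synchronization.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One warp computes the dot product of two k-element vectors (k = 64 here,
// so each of the 32 lanes owns k/32 = 2 features), then spreads the result
// to every lane with warp shuffles -- no shared memory or __syncthreads().
__global__ void warp_dot(const float *p, const float *q, float *out, int k) {
    int lane = threadIdx.x & 31;
    float partial = 0.0f;
    for (int f = lane; f < k; f += 32)      // each lane handles k/32 features
        partial += p[f] * q[f];
    // butterfly reduction: after 5 steps every lane holds the full sum
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_xor_sync(0xffffffff, partial, offset);
    if (lane == 0) *out = partial;          // all lanes now hold the same value
}

int main() {
    const int k = 64;
    float hp[k], hq[k];
    for (int i = 0; i < k; i++) { hp[i] = 1.0f; hq[i] = 2.0f; }  // dot = 128
    float *dp, *dq, *dout, hout;
    cudaMalloc(&dp, k * sizeof(float));
    cudaMalloc(&dq, k * sizeof(float));
    cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(dp, hp, k * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dq, hq, k * sizeof(float), cudaMemcpyHostToDevice);
    warp_dot<<<1, 32>>>(dp, dq, dout, k);
    cudaMemcpy(&hout, dout, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f\n", hout);             // expected 128.0
    cudaFree(dp); cudaFree(dq); cudaFree(dout);
    return 0;
}
```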
4 Experiments and Results We implemented GPU_MF_SGD using the Compute Unified Device Architecture (CUDA). Different public datasets are used to evaluate performance and accuracy. It is worth mentioning that we compared our results with the state-of-the-art GPU method, CuMF_SGD [7], for the following reasons:
• CuMF_SGD outperforms all existing shared-memory methods by 3.1X–28.2X.
• CuMF_SGD source code is publicly available, unlike other existing GPU methods.
• CuMF_SGD has lower computational complexity for the preprocessing phase compared with other implementations.
• CuMF_SGD has consistent results and graphs, unlike other existing GPU methods.
4.1
Experimental Setup
We executed both implementations (CuMF_SGD and GPU_MF_SGD) on the high-performance computing (HPC) service provided by the Bibliotheca Alexandria [66]. Table 1 shows the specifications of the used platform.
Table 1. Specifications of the used platform
RAM size               128 GB
Operating system       CentOS 6.8
Number of CPUs         2
Scheduler              Slurm [67]
GPU used               NVIDIA Tesla K80
Number of GPU devices  2
We used common public datasets: MovieLens [59], Netflix [60, 61] and Yahoo!Music [62, 63]. Table 2 shows details of the datasets used in the experiments. We extracted a 10% random test sample from each dataset using GraphLab [64, 65].
Table 2. Details about the datasets used in experiments
Dataset        M         N        K    # Training set   # Test set
MovieLens      138493    27278    32   18000236         2000027
Netflix        480189    17770    64   90432454         10048051
Yahoo!Music    1823178   136735   128  646084797        71787199
The setup parameters for GPU_MF_SGD are as follows. We set th_size to 1024, which is the maximum number of threads available per thread block on the GPU. In addition, we chose st to be 2 to shuffle the dataset and achieve a high level of sparsity with reasonable complexity. Regarding the SGD parameters for both CuMF_SGD and GPU_MF_SGD, we used common parameters used by previous work. For the learning rate, we used the same learning rate scheduling technique used by [30], where $s_t$, the learning rate at iteration t, is reduced using the following formula:

$s_t = \frac{\alpha}{1 + \beta \cdot t^{1.5}}$   (5)

where α is the initial learning rate and β is a constant parameter. The parameters are shown in Table 3.
Table 3. The parameters used per dataset
Dataset        λ      α      β
MovieLens      0.05   0.08   0.3
Netflix        0.05   0.08   0.3
Yahoo!Music    0.05   0.08   0.2
4.2
Scalability Study
To study the scalability of both methods, we used the number of updates per second as the performance metric [7]:

$\#updates/s = \frac{\#Iterations \times \#Samples}{Elapsed\ Time}$   (6)
where #Iterations, #Samples and Elapsed Time indicate the number of iterations, the number of ratings in R, and the execution time in seconds, respectively. The scalability study of CuMF_SGD and GPU_MF_SGD for the MovieLens dataset is shown in Fig. 9. We have two curves for CuMF_SGD (CuMF_SGD_Pro and CuMF_SGD_Pre), where in CuMF_SGD_Pro the elapsed time is only the GPU execution time, and in CuMF_SGD_Pre the elapsed time is the execution time plus the preprocessing time. The GPU_MF_SGD implementation shows linear scalability, while CuMF_SGD has limitations in terms of scalability.
Fig. 9. # updates/s for different methods versus #threads for MovieLens.
Table 4. Training times and GPU_MF_SGD speedup Dataset
CuMF_SGD_Pro CuMF_SGD_Pre GPU_MF_SGD GPU_MF_SGD speedup CuMF_SGD_Pro CuMF_SGD_Pre
MovieLens 0.56 s Netflix 4.34 s Yahoo! 49.08 s Music
4.3
0.98 s 6.70 s 66.1 s
0.18 s 1.26 s 15 s
3.1X 3.4X 3.3X
5.4X 5.3X 4.4X
Training Time Speedup
We measured training time until convergence to an accepted RMSE [68, 69] (0.93, 0.92, 1.23) for MovieLens, Netflix, and Yahoo!Music respectively). Table 4 shows training times and GPU_MF_SGD speedup over both CuMF_SGD_Pro and CuMF_SGD_Pre. Results show that GPU_MF_SGD is 3.1X – 5.4X, 3.4X – 5.3X and 3.3X – 4.4X faster than CuMF_SGD for MovieLens, Netflix, and Yahoo!Music respectively. Generally, GPU_MF_SGD outperforms CuMF_SGD by 3.1X to 5.4X for all datasets.
Fig. 10. Convergence speed for different datasets.
284
4.4
M. A. Nassar et al.
Convergence Analysis
Figure 10 shows the RMSE on test set with respect to the training time. It is obvious that our method converges faster than CuMF_SGD and achieves better RMSE for all datasets. Therefore, GPU_MF_SGD is considered as the fastest SGD method for MF because it can do more updates per second, as shown in Fig. 9. Unlike all previous methods including CuMF_SGD, GPU_MF_SGD utilizes shared memory and does not require any preprocessing phase.
5 Conclusions and Future Work In this research, we proposed GPU_MF_SGD, an innovative GPU-based parallel SGD method for MF. Unlike previous methods, GPU_MF_SGD does not require any preprocessing phase such as sorting and/or random shuffling of the dataset. In addition, GPU_MF_SGD does not require any complex scheduler for load balancing of datasets across computational resources. In GPU_MF_SGD, utilization of computational resources, high scalability, and load balance are achieved. Our empirical study shows that GPU_MF_SGD provides the highest number of updates per second and is considered the fastest method. Evaluations on common public datasets show that GPU_MF_SGD runs 3.1X–5.4X faster than CuMF_SGD. In the GPU_MF_SGD method, we did not spend much effort on parameter tuning. We suggest studying different optimization techniques for parameter selection as future work. Furthermore, it would be extremely interesting to study the possibilities of overcoming the limitation of GPU global memory, i.e. when the size of the ratings is bigger than the global memory size. Therefore, for future work, scaling up our proposed method to run on multiple GPUs [6, 7, 70–72] is an interesting research point.
References 1. Ricci, F., et al.: Recommender Systems Handbook. Springer, New York (2011) 2. Ekstrand, M.D., et al.: Collaborative filtering recommender systems. Found. Trends Hum. Comput. Interact. 4(2), 81–173 (2011) 3. Poriya, A., et al.: Non-personalized recommender systems and user-based collaborative recommender systems. Int. J. Appl. Inf. Syst. 6(9), 22–27 (2014) 4. Aamir, M., Bhusry, M.: Recommendation system: state of the art approach. Int. J. Comput. Appl. 120, 25–32 (2015) 5. Recommender System. https://en.wikipedia.org/wiki/Recommender_system. Accessed 11 July 2017 6. Jin, J., et al.: GPUSGD: a GPU-accelerated stochastic gradient descent algorithm for matrix factorization. Concurr. Comput. Pract. Exp. 28, 3844–3865 (2016) 7. Xie, X., et al.: CuMF_SGD: parallelized stochastic gradient descent for matrix factorization on GPUs. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. ACM (2017)
8. Li, H., et al.: MSGD: a novel matrix factorization approach for large-scale collaborative filtering recommender systems on GPUs. IEEE Trans. Parallel Distrib. Syst. 29(7), 1530– 1544 (2018) 9. Nassar, M.A., El-Sayed, L.A.A., Taha, Y.: Efficient parallel stochastic gradient descent for matrix factorization using GPU. In: 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST). IEEE (2016) 10. Wen, Z.: Recommendation system based on collaborative filtering. In: CS229 Lecture Notes, Stanford University, December 2008 11. Leskovec, J., et al.: Mining of Massive Datasets, Chap. 9, pp. 307–340. Cambridge University Press, Cambridge (2014) 12. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009) 13. Kaleem, R., et al.: Stochastic gradient descent on GPUs. In: Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, pp. 81–89 (2015) 14. Konstan, J.A., Riedl, J.: Recommender systems: from algorithms to user experience. User Model. User Adap. Inter. 22(1), 101–123 (2012) 15. Anastasiu, D.C., et al.: Big Data and Recommender Systems (2016) 16. Melville, P., Sindhwani, V.: Recommender systems. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 829–838. Springer, New York (2011) 17. Kant, V., Bharadwaj, K.K.: Enhancing recommendation quality of content-based filtering through collaborative predictions and fuzzy similarity measures. J. Proc. Eng. 38, 939–944 (2012) 18. Ma, A., et al.: A FPGA-based accelerator for neighborhood-based collaborative filtering recommendation algorithms. In: Proceedings of IEEE International Conference on Cluster Computing, pp. 494–495, September 2015 19. Anthony, V., Ayala, A., et al.: Speeding up collaborative filtering with parametrized preprocessing. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015 20. Gates, M., et al.: Accelerating collaborative filtering using concepts from high performance computing. In: IEEE International Conference in Big Data (Big Data) (2015) 21. Wang, Z., et al.: A CUDA-enabled parallel implementation of collaborative filtering. Proc. Comput. Sci. 30, 66–74 (2014) 22. Gemulla, R., Nijkamp, E., Haas, P.J., Sismanis, Y.: Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2011) 23. Chin, W.-S., et al.: A fast parallel stochastic gradient method for matrix factorization in shared memory systems. ACM Trans. Intell. Syst. Technol. 6(1), 2 (2015) 24. Zastrau, D., Edelkamp, S.: Stochastic gradient descent with GPGPU. In: Proceedings of the 35th Annual German Conference on Advances in Artificial Intelligence (KI’12), pp. 193– 204 (2012) 25. Shah, A., Majumdar, A.: Accelerating low-rank matrix completion on GPUs. In: Proceedings of International Conference on Advances in Computing, Communications and Informatics, December 2014 26. Kato, K., Hosino, T.: Singular value decomposition for collaborative filtering on a GPU. IOP Conf. Ser. Mater. Sci. Eng. 10(1), 012017 (2010) 27. Foster, B., et al.: A GPU-based approximate SVD algorithm. In: Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, vol. 1, pp. 569– 578. Springer, Berlin (2012) 28. Yu, H.-F., et al.: Parallel matrix factorization for recommender systems. Knowl. Inf. Syst. 41 (3), 793–819 (2014)
29. Yu, H.F., Hsieh, C.J., et al.: Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Proceedings of the IEEE 12th International Conference on Data Mining, pp. 765–774 (2012) 30. Yun, H., Yu, H.-F., Hsieh, C.-J., Vishwanathan, S.V.N., Dhillon, I.: NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. Proc. VLDB Endow. 7(11), 975–986 (2014) 31. Yang, X., et al.: High performance coordinate descent matrix factorization for recommender systems. In: Proceedings of the Computing Frontiers Conference. ACM (2017) 32. Zadeh, R., et al.: Matrix completion via alternating least square (ALS). In: CME 323 Lecture Notes, Stanford University, Spring (2016) 33. Tan, W., Cao, L., Fong, L.: Faster and cheaper: parallelizing large-scale matrix factorization on GPUs. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016 (2016) 34. Aberger, C.R.: Recommender: An Analysis of Collaborative Filtering Techniques (2016) 35. Papamakarios, G.: Comparison of Modern Stochastic Optimization Algorithms (2014) 36. Toulis, P., Airoldi, E., Rennie, J.: Statistical analysis of stochastic gradient methods for generalized linear models. In: International Conference on Machine Learning, pp. 667–675 (2014) 37. Toulis, P., Tran, D., Airoldi, E.: Towards stability and optimality in stochastic gradient descent. In: Artificial Intelligence and Statistics, pp. 1290–1298 (2016) 38. Zhou, Y., Wilkinson, D., et al.: Large-scale parallel collaborative filtering for the Netflix prize. In: Proceedings of International Conference on Algorithmic Aspects in Information and Management (2008) 39. Xie, X., Tan, W., Fong, L.L., Liang, Y.: Cumf_sgd: fast and scalable matrix factorization (2016). arXiv preprint arXiv:1610.05838. https://github.com/cuMF/cumf_sgd 40. Tang, K.: Collaborative filtering with batch stochastic gradient descent, July 2015. http:// www.its.caltech.edu/*ktang/CS179/index.html 41. Niu, F., et al.: HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701, June 2011 42. Gemulla, R., et al.: Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 69–77 (2011) 43. Zhang, H., Hsieh, C.-J., Akella, V.: Hogwild++: a new mechanism for decentralized asynchronous stochastic gradient descent. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 629–638. IEEE (2016) 44. Zhang, C., Ré, C.: Dimmwitted: a study of main-memory statistical analytics. Proc. VLDB Endow. 7(12), 1283–1294 (2014) 45. Udell, M., et al.: Generalized low rank models. Found. Trends Mach. Learn. 9(1), 1–118 (2016) 46. CUDA C Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/ #axzz4FH9nydq8. Accessed 5 Sept 2016 47. Nunna, K.C., et al.: A survey on big data processing infrastructure: evolving role of FPGA. Int. J. Big Data Intell. 2(3), 145–156 (2015) 48. Nassar, M.A., El-Sayed, L.A.A.: Radix-4 modified interleaved modular multiplier based on sign detection. In: International Conference on Computer Science and Information Technology, pp. 413–423. Springer, Berlin (2012) 49. Nassar, M.A., El-Sayed, L.A.A.: Efficient interleaved modular multiplication based on sign detection. 
In: IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), November 2015
50. Karydi, E., et al.: Parallel and distributed collaborative filtering: a survey. J. ACM Comput. Surv. 49(2), 37 (2016) 51. Ma, X., Wang, C., Yu, Q., Li, X., Zhou, X.: A FPGA-based accelerator for neighborhoodbased collaborative filtering recommendation algorithms. In: 2015 IEEE International Conference on Cluster Computing (CLUSTER), pp. 494–495. IEEE (2015) 52. http://www.nvidia.com/object/tesla-k80.html. Accessed 22 July 2017 53. Lathia, N.: Evaluating collaborative filtering over time. Ph.D. thesis (2010) 54. Sparse Matrix. https://en.wikipedia.org/wiki/Sparse_matrix#Storing_a_sparse_matrix. Accessed 12 Feb 2017 55. http://supercomputingblog.com/cuda/cudamemoryandcachearchitecture/. Accessed 26 June 2017 56. GPU memory types – performance comparison. https://www.microway.com/hpc-tech-tips/ gpu-memory-types. Accessed 5 Sept 2015 57. Pankratius, V., et al.: Fundamentals of Multicore Software Development. CRC Press, Boca Raton (2011) 58. del Mundo, C., Feng, W.: Enabling efficient intra-warp communication for fourier transforms in a many-core architecture. In: Proceedings of the 2013 ACM/IEEE International Conference on Supercomputing (2013) 59. Han, T.D., Abdelrahman, T.S.: Reducing branch divergence in GPU programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, p. 3. ACM (2011) 60. Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 19 (2016) 61. Gower, S.: Netflix prize and SVD, pp. 1–10. http://buzzard.ups.edu/courses/2014spring/ 420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf (2014) 62. Bennett, J., Lanning, S.: The Netflix prize. In: Proceedings of KDD Cup and Workshop, p. 35 (2007) 63. Dror, G., Koenigstein, N., Koren, Y., Weimer, M.: The Yahoo! music dataset and KDDCup’11. In: Proceedings of KDD Cup 2011, pp. 3–18 (2012) 64. Zheng, L.: Performance evaluation of latent factor models for rating prediction. Ph.D. dissertation, University of Victoria (2015) 65. Low, Y., et al.: GraphLab: a new parallel framework for machine learning. In: Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI-10, pp. 340–349, July 2010 66. Chin, W.-S., et al.: A learning-rate schedule for stochastic gradient methods to matrix factorization. In: PAKDD, pp. 442–455 (2015) 67. https://hpc.bibalex.org/. Accessed July 2017 68. https://slurm.schedmd.com/. Accessed July 2017 69. Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P. (eds.) Recommender Systems Handbook, pp. 257–297. Springer, Boston (2011) 70. Ginger, T., Bochkov, Y.: Predicting business ratings on yelp report (2015). http://cs229. stanford.edu/proj2015/013_report.pdf 71. Hwu, W.: Efficient host-device data transfer. In: Lecture Notes, University of Illinois at Urbana-Champaign, December 2014 72. Bhatnagar, A.: Accelerating a movie recommender system using VirtualCL on a heterogeneous GPU cluster. Master thesis, July 2015
Effective Local Reconstruction Codes Based on Regeneration for Large-Scale Storage Systems Quanqing Xu3(&), Hong Wai Ng2, Weiya Xi1, and Chao Jin1 1
2
Data Storage Institute, A*STAR, Singapore 138632, Singapore {XI_Weiya,Jin_Chao}@dsi.a-star.edu.sg Nanyang Technological University, Singapore 639798, Singapore
[email protected] 3 Institute of High Performance Computing, A*STAR, Singapore 138632, Singapore
[email protected]
Abstract. We introduce Regenerating-Local Reconstruction Codes (R-LRC) and describe their encoding and decoding techniques in this paper. After that, their repair bandwidths for different failure patterns are investigated. We also explore an alternative of R-LRC, which gives R-LRC lower repair bandwidth. Since R-LRC is an extended version of Pyramid codes, the optimization of repair bandwidth for a single failure also applies to R-LRC. Compared with Pyramid codes, Regenerating-Local Reconstruction Codes have two benefits: (1) On average, they use around 2.833 blocks to repair 2 failures, while Pyramid codes use about 3.667 blocks; hence, they have lower I/O than Pyramid codes. (2) When there are 2 failures occurring at a common block group and a special block group, they require only around M/2, which is lower than the M required by Pyramid codes when k ≥ 2. In addition, we present an efficient interference alignment mechanism in R-LRC, which performs algebraic alignment so that the useless and unwanted dimension is decreased. Therefore, the network bandwidth consumption is reduced. Keywords: Local reconstruction codes · Regeneration code · Interference alignment · Maximum distance separable
1 Introduction Large-scale distributed storage systems have become increasingly popular with the rapid growth of storage capacity, computational resources and network bandwidth. They provide data reliability over long periods of time using data segmentation. However, each storage node storing the data may be individually unreliable. To ensure data availability, redundancy is required. In large-scale data centers, there are distributed storage systems built on erasure coding, e.g., Microsoft’s Windows Azure Storage [1], Facebook’s HDFS-Xorbas [2] and Hitchhiker [3], and Google [4]. Suppose that there are n storage nodes in the storage system. The simplest form of data redundancy is replication, which replicates the data file into n replicas, with each node storing a replica, incurring n times storage overhead. In addition, how to keep
consistency of replicas is also a difficult issue [5]. Another form of redundancy is erasure coding. A data file of size M to be stored is striped into k pieces A1, A2, …, Ak of equal size M/k. The k pieces are then encoded into n fragments A1, A2, …, An using an (n, k) Maximum Distance Separable (MDS) code, and the fragments are stored at n nodes. The original file can be recovered from any k-subset of the n encoded fragments (a toy illustration is given at the end of this section). Erasure coding incurs a lower storage overhead of n/k times, compared to n times for replication. In this paper, we propose Regenerating-Local Reconstruction Codes (R-LRC) and explain their encoding and decoding techniques. We investigate their repair bandwidths for different failure patterns, and explore an alternative of R-LRC that gives R-LRC lower repair bandwidth. We optimize the repair bandwidth of a single failure to apply to R-LRC, since R-LRC is an extended version of Pyramid codes. Compared with Pyramid codes, R-LRC has two benefits. Firstly, on average, R-LRC uses around 2.8 blocks to repair 2 failures, while Pyramid codes use about 3.6 blocks. Therefore, R-LRC has lower I/O than Pyramid codes. Secondly, when 2 failures occur at a common block group and a special block group, R-LRC requires only around M/2 repair bandwidth, which is lower than the M required by Pyramid codes when k ≥ 2. This paper presents an efficient interference alignment mechanism in R-LRC. The interference alignment mechanism performs algebraic alignment so that the effective dimension of unwanted information is reduced, and thus the network bandwidth consumption is decreased. The rest of the paper is organized as follows. Section 2 describes related work. Section 3 introduces Regenerating-Local Reconstruction Codes. We discuss repair bandwidth in R-LRC in Sect. 4. Section 5 describes an alternative R-LRC with lower repair bandwidth. We present interference alignment in R-LRC in Sect. 6. In Sect. 7 we conclude this paper.
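To make the "any k of n" recovery property concrete, the following toy C sketch (our own illustration under simplifying assumptions, not a construction from this paper) encodes a 2-symbol file into n = 4 fragments over GF(11) with a Vandermonde-style (4, 2) MDS code and decodes it from an arbitrary pair of surviving fragments.

```c
#include <stdio.h>

/* Toy (n, k) = (4, 2) MDS code over GF(11): a "file" of k = 2 symbols
 * (a0, a1) is encoded as c_i = a0 + a1 * x_i (mod 11) at 4 distinct
 * evaluation points x_i. Any 2 of the 4 fragments suffice to recover the
 * original symbols. This is a didactic sketch only. */

#define P 11
static const int x[4] = {1, 2, 3, 4};   /* distinct evaluation points */

static int inv_mod(int a) {             /* modular inverse via Fermat: a^(P-2) */
    int r = 1;
    for (int e = 0; e < P - 2; e++) r = (r * a) % P;
    return r;
}

int main(void) {
    int a0 = 7, a1 = 3;                 /* the two data symbols */
    int c[4];
    for (int i = 0; i < 4; i++)         /* encode: n = 4 fragments */
        c[i] = (a0 + a1 * x[i]) % P;

    /* decode from an arbitrary pair of surviving fragments (i, j) */
    int i = 0, j = 3;
    int d = ((x[i] - x[j]) % P + P) % P;
    int r1 = ((c[i] - c[j]) % P + P) % P;
    int rec_a1 = (r1 * inv_mod(d)) % P;
    int rec_a0 = ((c[i] - rec_a1 * x[i]) % P + P) % P;

    printf("fragments: %d %d %d %d\n", c[0], c[1], c[2], c[3]);
    printf("recovered from fragments %d and %d: a0=%d a1=%d\n",
           i, j, rec_a0, rec_a1);
    return 0;
}
```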
2 Related Work

Two alternative coding approaches were advocated to enable more efficient node repair, i.e., regenerating codes [6] and local reconstruction codes [1].
2.1 Regenerating Codes
One vital idea behind regenerating codes is the sub-chunk mechanism: each chunk is composed of a few sub-chunks. When a node storing a chunk fails, other nodes send in some of their sub-chunks for recovery. The efficiency of the recovery procedure is measured in terms of the overall bandwidth consumption. In many cases, regenerating codes can achieve a rather significant reduction in bandwidth compared with codes that do not employ the sub-chunk mechanism. In practice, however, a considerable overhead is caused by accessing extra storage nodes. In particular, coding solutions that do not rely on the sub-chunk mechanism and thus access fewer nodes are sometimes more attractive. Dimakis et al. [6] proposed regenerating codes that use smaller storage cost and repair bandwidth, and investigated the relationship between the per-node storage α and the repair bandwidth γ. They term the codes that achieve the minimum storage point Minimum Storage Regenerating (MSR) codes and those that achieve the minimum bandwidth point Minimum Bandwidth Regenerating (MBR) codes. CORE [7] (COllaborative REgeneration) explores the feasibility of applying network coding techniques for repairing lost data, but it does not address the issue of degraded reads. Previous practical work on regenerating codes focused either only on random codes [8] or on a specific system with a specific code [9].
2.2 Local Reconstruction Codes
A family of Local Reconstruction Codes (LRC) encodes a file into n data chunks in such a way that one can recover any chunk by accessing only r chunks even after some of the chunks are erased. LRC(k, l, r) is a class of maximally recoverable codes, and it tolerates up to r + 1 arbitrary failures. LRC provides low storage overhead: among all codes that can decode a single failure from k/l chunks and tolerate r + 1 failures, it requires the minimum number of parity chunks. Therefore, local reconstruction codes are in fact very similar to the CRL codes. Gopalan et al. [10] pioneered the theoretical study of locality by discovering a trade-off between code distance and information-symbol locality. Huang et al. [11] introduced (k + l + g, k) Pyramid codes, which explore the trade-offs between storage space and access efficiency. The codes are derived from existing MDS codes. All data blocks are partitioned into disjoint local groups, and all parity blocks are categorized into either local parity blocks or global parity blocks. Local parity blocks contain only data blocks in their local group; they are obtained through projection from the original parity blocks. An advantage of Pyramid codes is the optimization of repair bandwidth for a single failure. In distributed storage systems, there are studies on implementation and performance evaluation of codes with locality in [1, 2]. A class of codes with locality (called local reconstruction codes) related to Pyramid codes has been employed in Windows Azure Storage [1]. Sathiamoorthy et al. discuss the implementation of a class of codes with locality termed locally repairable codes in HDFS and compare their performance with Reed-Solomon codes [2].
2.3 Erasure Codes with Local Regeneration
The CRL codes are a novel class of Concurrent Regeneration codes with Local reconstruction, which enjoy three advantages [12]. Firstly, they minimize the network bandwidth for node repair. Secondly, they minimize the number of accessed nodes. Lastly, they have faster reconstruction than traditional erasure codes in distributed storage systems. The work shows how the CRL codes overcome the limitations of RS codes, and demonstrates analytically that they are optimal with respect to the trade-off between minimum distance and locality. Some works extend the designs and distance bounds to the case in which repair bandwidth and locality are jointly optimized, under multiple local failures [13, 14], and under security constraints [14].
3 Regenerating-Local Reconstruction Codes

In this section, we present how encoding and decoding work in (k, g, l) R-LRC codes.
3.1 Encoding
Suppose G_i is the k × (k + g + l) generator matrix of the data blocks in level i for 1 ≤ i ≤ a. The generator matrix G of R-LRC is the concatenation of the generator matrices of all levels, that is, G = (G1, G2, …, Ga)^T. Note that G has dimension ka × (k + g + l).
The following is an example of the generator matrix of a (6, 2, 2) R-LRC code with a = 2.
$$G_1 = \begin{pmatrix} I_{3\times 3} & 0_{3\times 3} & \mathbf{1}_{3\times 1} & 0_{3\times 1} & A_1 & C_1 \\ 0_{3\times 3} & I_{3\times 3} & 0_{3\times 1} & \mathbf{1}_{3\times 1} & B_1 & D_1 \end{pmatrix}, \qquad G_2 = \begin{pmatrix} 0_{3\times 3} & I_{3\times 3} & \mathbf{1}_{3\times 1} & 0_{3\times 1} & A_2 & C_2 \\ I_{3\times 3} & 0_{3\times 3} & 0_{3\times 1} & \mathbf{1}_{3\times 1} & B_2 & D_2 \end{pmatrix},$$
where A_i, B_i, C_i and D_i are 3 × 1 matrices for i = 1, 2. Therefore, the generator matrix of the (6, 2, 2) R-LRC is G = (G1, G2)^T. Suppose each local group has only one local parity block. Then in general, for 1 ≤ i ≤ a, G_i can be written as

$$G_i = \begin{pmatrix} I_{k/2\times k/2} & 0_{k/2\times k/2} & \mathbf{1}_{k/2\times 1} & 0_{k/2\times 1} & A_i & C_i \\ 0_{k/2\times k/2} & I_{k/2\times k/2} & 0_{k/2\times 1} & \mathbf{1}_{k/2\times 1} & B_i & D_i \end{pmatrix}$$
Suppose the data symbols are o = (o_1, o_2, …, o_a)^T, where o_i = (o_{i1}, o_{i2}, …, o_{ik}) for 1 ≤ i ≤ a. In other words, o_i is the row of data symbols stored at level i of each node. We apply the Hadamard product of block matrices in encoding:

$$o \circ G = \begin{pmatrix} o_1 \\ o_2 \\ \vdots \\ o_a \end{pmatrix} \circ \begin{pmatrix} G_1 \\ G_2 \\ \vdots \\ G_a \end{pmatrix} = \begin{pmatrix} o_1 G_1 \\ o_2 G_2 \\ \vdots \\ o_a G_a \end{pmatrix}$$

One can use the same generator matrix for all levels, that is, G_1 = G_2 = ⋯ = G_a, so that the product above becomes standard matrix multiplication:

$$o \circ G = \begin{pmatrix} o_1 G_1 \\ o_2 G_2 \\ \vdots \\ o_a G_a \end{pmatrix} = \begin{pmatrix} o_1 \\ o_2 \\ \vdots \\ o_a \end{pmatrix} G_1$$
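The level-wise encoding above is straightforward to prototype. The following is a minimal Python sketch that encodes each level's data row with that level's generator matrix over a small prime field; the field GF(11), the concrete generator matrix and the sample data are illustrative assumptions, not values prescribed by the construction.

```python
# Minimal sketch of level-wise R-LRC encoding over a prime field GF(p).
# The prime p, the generator matrix and the data values are illustrative only.
p = 11

def mat_vec(o_row, G):
    """Encode one level: row vector o_row (1 x k) times G (k x (k+g+l)) over GF(p)."""
    k, n = len(o_row), len(G[0])
    return [sum(o_row[r] * G[r][c] for r in range(k)) % p for c in range(n)]

def encode(levels, generators):
    """Hadamard (level-wise) product: level i of the data is encoded with G_i."""
    assert len(levels) == len(generators)
    return [mat_vec(o_i, G_i) for o_i, G_i in zip(levels, generators)]

# Example: k = 4 data symbols per level, one local parity per group, two global parities.
G1 = [
    [1, 0, 0, 0, 1, 0, 1, 2],
    [0, 1, 0, 0, 1, 0, 2, 3],
    [0, 0, 1, 0, 0, 1, 3, 4],
    [0, 0, 0, 1, 0, 1, 4, 5],
]
levels = [[1, 2, 3, 4], [5, 6, 7, 8]]      # a = 2 levels of data symbols
encoded = encode(levels, [G1, G1])         # same generator matrix for both levels
print(encoded)                             # each row is one level of every node's stored symbols
```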
Example: Suppose a (4, 2, 2) R-LRC and assume that a = 2,

$$G_1 = G_2 = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & a & h \\
0 & 1 & 0 & 0 & 1 & 0 & b & g \\
0 & 0 & 1 & 0 & 0 & 1 & c & r \\
0 & 0 & 0 & 1 & 0 & 1 & l & x
\end{pmatrix}, \qquad
o = \begin{pmatrix}
o_{11} & o_{12} & o_{13} & o_{14} \\
o_{21} & o_{22} & o_{23} & o_{24}
\end{pmatrix},$$

where a, b, c, h, g, r, l, x ∈ F_q are the global parity coefficients.
Then the encoded matrix is
$$o \circ G = \begin{pmatrix}
o_{11} & o_{12} & o_{13} & o_{14} & o_{11}+o_{12} & o_{13}+o_{14} & A & C \\
o_{21} & o_{22} & o_{23} & o_{24} & o_{21}+o_{22} & o_{23}+o_{24} & B & D
\end{pmatrix}$$

where A, B, C, D denote the resulting global parity symbols.
Note that each row of the encoded matrix corresponds to one level in each node. For example, o_{11} is stored at the first level of the first node.
3.2 Decoding
We choose arbitrary k columns of the generator matrix G and the corresponding columns of the encoded matrix. We compute the inverse of the k × k sub-matrix formed by the chosen generator columns and post-multiply the corresponding encoded columns by this inverse (the same as decoding in Reed-Solomon codes). The inverse exists due to the construction of the coefficients.
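A minimal sketch of this decoding step is shown below, continuing the illustrative GF(11) example from the encoding sketch; the chosen column indices and all concrete values are assumptions for illustration only.

```python
# Sketch of the decoding step over GF(p): choose any k surviving columns, invert the
# k x k sub-matrix of the generator, and recover the data row (as in RS decoding).
p = 11
G1 = [
    [1, 0, 0, 0, 1, 0, 1, 2],
    [0, 1, 0, 0, 1, 0, 2, 3],
    [0, 0, 1, 0, 0, 1, 3, 4],
    [0, 0, 0, 1, 0, 1, 4, 5],
]
data = [1, 2, 3, 4]
encoded = [sum(data[r] * G1[r][c] for r in range(4)) % p for c in range(8)]

def solve(A, b):
    """Solve the row-vector system x * A = b over GF(p), i.e. A^T x^T = b^T."""
    k = len(A)
    M = [[A[c][r] % p for c in range(k)] + [b[r] % p] for r in range(k)]   # rows of A^T | b
    for col in range(k):
        piv = next(r for r in range(col, k) if M[r][col] % p)              # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], p - 2, p)                                   # Fermat inverse
        M[col] = [v * inv % p for v in M[col]]
        for r in range(k):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]

cols = [0, 1, 4, 7]                                    # any k surviving columns
G_sub = [[G1[r][c] for c in cols] for r in range(4)]   # k x k sub-matrix of the generator
enc_sub = [encoded[c] for c in cols]                   # corresponding encoded symbols
print(solve(G_sub, enc_sub))                           # -> [1, 2, 3, 4]
```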
3.3 Example: R-LRC (6, 2, 2)
In R-LRC, we assume that there are only two local groups. Each local group contains l/2 local parities and k/2 data blocks. Figure 1 is an example of a (6, 2, 2) R-LRC.
Fig. 1. (6, 2, 2) R-LRC.
Suppose a file of size M is stored using the (6, 2, 2) R-LRC code. The file is fragmented into 12 data blocks A1, A2, B1, B2, C1, C2, D1, D2, E1, E2, F1, F2, where each data block is an element of the finite field F_q (a field containing q elements). Each data block has equal size M/(ka) = M/12 and each node stores two data blocks. Coefficients in parity blocks are taken from the same finite field F_q. The addition and multiplication operations are XOR and finite field multiplication, respectively. Each local group contains 6 data blocks and 2 parity blocks, where each parity block contains only local data blocks. Two global parity blocks do not belong to any local group. Define level a as the a-th block stored in each node. We assume that all the nodes have the same levels. For example, level 1 is the first block stored in each node, level 2 is the second block stored in each node, and so on. In Fig. 1, a = 2.
4 Repair Bandwidth in R-LRC

In this section, we proceed to discuss repair bandwidth in R-LRC.
4.1 Generalization of Repair Bandwidth at One Level
Our aim is to generalize the repair bandwidth of all recoverable failures at one level. We assume that there are only 2 local groups and each group contains k/2 data blocks and l/2 local parity blocks. Therefore, the generator matrix of each level in a (k, g, l) R-LRC has the form

$$\begin{pmatrix} I_{k/2\times k/2} & 0_{k/2\times k/2} & C_{k/2\times l/2} & 0_{k/2\times l/2} & A \\ 0_{k/2\times k/2} & I_{k/2\times k/2} & 0_{k/2\times l/2} & D_{k/2\times l/2} & B \end{pmatrix}$$
Note that both A and B are k/2 × g matrices. Hence, the generator matrix has dimension k × (k + g + l). Denote f as the total number of failures at one level. Clearly f ≤ (g + l) due to the Singleton bound. Assume that l/2 ≤ k/2 in each local group and (g + l) ≤ k. When failures occur, we only need to observe which level they fall into. Then from the generator matrix at that level, we perform recovery. In the following, we consider only data block failures. Denote f1 as the number of failed data blocks in the first local group and f2 as the number of failed data blocks in the second local group. Note that f = f1 + f2. We consider two cases:

Case (i): 0 < f_i ≤ l/2
Recovery can be performed locally. Without loss of generality, we assume i = 1. Choose k/2 − f1 columns from I_{k/2×k/2} corresponding to surviving data blocks and f1 columns from C_{k/2×l/2} corresponding to local parity blocks. Note that both I_{k/2×k/2} and C_{k/2×l/2} are on the same block row in the generator matrix. Thus, each chosen column has k/2 rows. Form a k/2 × k/2 matrix with all the chosen columns. Then, the matrix has full rank due to the conditions given in the following section, and is thus invertible. Hence all failed data blocks can be recovered. Note that the repair bandwidth is (M/(ka)) · (k/2).

Case (ii): l/2 < f_i ≤ k/2
Recovery cannot be done locally. Assume that g ≥ 2. After replacing the faulty blocks with the l/2 local parity blocks, we are still short of f_i − l/2 blocks for recovery. These blocks can be substituted by global parity blocks after using interference alignment [15] with a local parity block in another group (assuming that there is at least one surviving local parity block in another group). Therefore, the repair bandwidth is

$$\frac{M}{ka}\left[\left(\frac{k}{2}-f_i+\frac{l}{2}\right)+\left(f_i-\frac{l}{2}+1\right)\right]=\frac{M}{ka}\left(\frac{k}{2}+1\right)$$

Note that either f_i ≤ l/2 or f_i > l/2. If both 0 < f_1 ≤ l/2 and 0 < f_2 ≤ l/2, then the repair bandwidth is (M/(ka))·(k/2) + (M/(ka))·(k/2) = M/a = (M/(ka))·k. If either (f_1 ≤ l/2 and f_2 > l/2) or (f_1 > l/2 and f_2 ≤ l/2), then the repair bandwidth is at most (M/(ka))·(k/2) + (M/(ka))·(k/2 + 1) = (M/(ka))·(k + 1).
The minimum repair bandwidth is (M/(ka))·k, as the same local parity block may be used to perform recovery. If both f1 > l/2 and f2 > l/2, then the repair bandwidth is at most (M/(ka))·(k/2 + 1) + (M/(ka))·(k/2 + 1) = (M/(ka))·(k + 2). The minimum repair bandwidth is (M/(ka))·k; the reason is the same as above. The following are the assumptions made when calculating the repair bandwidth in Table 1.
(1) f ≤ (g + l) ≤ k.
(2) g ≥ 2.
(3) l/2 ≤ k/2.
(4) f_i ≤ l/2 + g for i = 1, 2.
(5) There are only two local groups.
(6) Both k and l are even so that each local group contains an equal number of data nodes and local parity nodes.
(7) f denotes only the number of failed data blocks.
(8) Exact repair.
(9) Systematic codes.
From Tables 1 and 2, the repair bandwidth is clearly lower when data striping is applied in the codes. In Table 2, the repair bandwidth is less than M only if one of the f_i is zero. Therefore, the codes need to be modified so that the repair bandwidth can be reduced. In Sect. 6, several methods to reduce the repair bandwidth are explored.
Table 1. f_i is the number of data block failures in the i-th local group, i = 1, 2

Number of failures, f = f1 + f2                        | Repair bandwidth in one level
f1 ≤ l/2 and f2 ≤ l/2                                  | (M/(ka)) · k
(f1 = 0 and f2 ≤ l/2) or (f1 ≤ l/2 and f2 = 0)         | (M/(ka)) · (k/2)
(f1 ≤ l/2 and f2 > l/2) or (f1 > l/2 and f2 ≤ l/2)     | (M/(ka)) · k
(f1 = 0 and f2 > l/2) or (f1 > l/2 and f2 = 0)         | (M/(ka)) · (k/2 + 1)
f1 > l/2 and f2 > l/2                                  | (M/(ka)) · k
Table 2. Repair bandwidth of R-LRC when a = 1

Number of failures, f = f1 + f2                        | Repair bandwidth in one level
f1 ≤ l/2 and f2 ≤ l/2                                  | M
(f1 = 0 and f2 ≤ l/2) or (f1 ≤ l/2 and f2 = 0)         | M/2
(f1 ≤ l/2 and f2 > l/2) or (f1 > l/2 and f2 ≤ l/2)     | M
(f1 = 0 and f2 > l/2) or (f1 > l/2 and f2 = 0)         | (M/k) · (k/2 + 1)
f1 > l/2 and f2 > l/2                                  | M
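The case analysis of Tables 1 and 2 can be wrapped in a small helper for experimentation. The sketch below returns the one-level repair bandwidth of Table 1, using the minimum values discussed in the text for the cases where a group has more than l/2 failures; the concrete parameters in the example are illustrative assumptions.

```python
from fractions import Fraction

def repair_bandwidth(M, k, l, a, f1, f2):
    """Repair bandwidth at one level per Table 1 (two local groups, f_i data-block
    failures in group i, f1 + f2 >= 1). Returned as a Fraction of the file size M."""
    unit = Fraction(M, k * a)              # size of one block at one level, M/(ka)
    heavy1, heavy2 = f1 > l // 2, f2 > l // 2
    if f1 == 0 or f2 == 0:                 # all failures confined to one group
        return unit * (k // 2) if not (heavy1 or heavy2) else unit * (k // 2 + 1)
    if not heavy1 and not heavy2:          # both groups repairable locally
        return unit * k
    # one or both groups exceed l/2 failures: minimum bandwidth from the text is (M/(ka))*k
    return unit * k

# Example: (k, g, l) = (6, 2, 2), a = 2, file size M = 1
print(repair_bandwidth(1, 6, 2, 2, 1, 0))   # 1/4  = (M/12)*3
print(repair_bandwidth(1, 6, 2, 2, 1, 1))   # 1/2  = (M/12)*6
print(repair_bandwidth(1, 6, 2, 2, 2, 0))   # 1/3  = (M/12)*4
```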
4.2 Existence of R-LRC
In this section, we discuss the existence of a set of coefficients that gives all the repair bandwidths in Table 1. Note that R-LRC requires only one encoding matrix instead of a separate encoding matrix for each of the a levels. In the discussion above, when f_i > l/2 for some i = 1, 2, the interference alignment technique is applied. This restricts the encoding matrix of the R-LRC to be of the form

$$(I_{k\times k} \quad L \quad H \quad J \quad K)$$

where

$$L=\begin{pmatrix}\mathbf{1}_{k/2\times 1} & E_{k/2\times (l/2-1)}\\ 0_{k/2\times 1} & 0_{k/2\times (l/2-1)}\end{pmatrix},\quad
H=\begin{pmatrix}0_{k/2\times 1} & 0_{k/2\times (l/2-1)}\\ \mathbf{1}_{k/2\times 1} & F_{k/2\times (l/2-1)}\end{pmatrix},$$

$$J=\begin{pmatrix}\mathbf{1}_{k/2\times (f_2-l/2)} & D_{k/2\times (f_1-l/2)}\\ C_{k/2\times (f_2-l/2)} & \mathbf{1}_{k/2\times (f_1-l/2)}\end{pmatrix},\quad
K=\begin{pmatrix}A\\ B\end{pmatrix},$$

and A and B are both k/2 × (g − f1 − f2 + l) matrices. From assumption (1), we have g − f1 − f2 + l ≥ 0; therefore, the matrix dimensions of A and B are well-defined. If g − f1 − f2 + l = 0, then both A and B vanish. If f_i ≤ l/2 for i = 1, 2, then 1_{k/2×(f1−l/2)}, 1_{k/2×(f2−l/2)}, C_{k/2×(f2−l/2)} and D_{k/2×(f1−l/2)} also vanish. Note that (1_{k/2×1}, E_{k/2×(l/2−1)}) and (1_{k/2×1}, F_{k/2×(l/2−1)}) contain the coefficients of the local parity blocks in the first and second groups respectively, while the matrices 1_{k/2×(f2−l/2)}, D_{k/2×(f1−l/2)}, A, C_{k/2×(f2−l/2)}, 1_{k/2×(f1−l/2)}, B contain the coefficients of the global parity blocks. The encoding matrix satisfies the following properties:

$$\operatorname{rank}\big(\mathbf{1}_{k/2\times 1}\ \ E_{k/2\times (l/2-1)}\big)=\frac{l}{2}$$
$$\operatorname{rank}\big(\mathbf{1}_{k/2\times 1}\ \ F_{k/2\times (l/2-1)}\big)=\frac{l}{2}$$
$$\operatorname{rank}\big(\mathbf{1}_{k/2\times 1}\ \ E_{k/2\times (l/2-1)}\ \ D_{k/2\times (f_1-l/2)}\big)=f_1$$
$$\operatorname{rank}\big(\mathbf{1}_{k/2\times 1}\ \ F_{k/2\times (l/2-1)}\ \ C_{k/2\times (f_2-l/2)}\big)=f_2$$

One can verify that the encoding matrix above gives all the repair bandwidths in Table 1.
5 Interference Alignment in R-LRC
5.1 MDS Property
When f_i > l/2, the interference alignment scheme is applied to recover data block failures. In this section, we show that there does not exist a code that has both the MDS property and the interference alignment property. Suppose a file of size M to be stored is segmented into k = 6 segments of equal size M/6. The data blocks are encoded into 10 encoded blocks using a (6, 2, 2) Pyramid code. The encoding matrix of the Pyramid code above is

$$G=\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & a_1 & 0 & a_2 & a_3\\
0 & 1 & 0 & 0 & 0 & 0 & b_1 & 0 & b_2 & b_3\\
0 & 0 & 1 & 0 & 0 & 0 & c_1 & 0 & c_2 & c_3\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & h_1 & h_2 & h_3\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & r_1 & r_2 & r_3\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & l_1 & l_2 & l_3
\end{pmatrix}$$
where a_i, b_i, c_i, h_i, r_i, l_i ∈ F_q for 1 ≤ i ≤ 3. Suppose data blocks A1, B1, C1 and D1 fail, as shown in Fig. 2. One can repair the failures above using the interference alignment technique, which requires the generator matrix to satisfy the following conditions:
$$\operatorname{rank}\begin{pmatrix}a_1 & a_2 & a_3\\ b_1 & b_2 & b_3\\ c_1 & c_2 & c_3\end{pmatrix}=3,\qquad
\operatorname{rank}\begin{pmatrix}h_1 & h_2 & h_3\\ r_1 & r_2 & r_3\\ l_1 & l_2 & l_3\end{pmatrix}=1$$
However, the same set of coefficients cannot be used when the symmetric failure pattern occurs (that is, blocks D1, E1, F1, A1 fail), as the recovery requires the generator matrix to satisfy
$$\operatorname{rank}\begin{pmatrix}a_1 & a_2 & a_3\\ b_1 & b_2 & b_3\\ c_1 & c_2 & c_3\end{pmatrix}=1,\qquad
\operatorname{rank}\begin{pmatrix}h_1 & h_2 & h_3\\ r_1 & r_2 & r_3\\ l_1 & l_2 & l_3\end{pmatrix}=3$$
Interference alignment is known to reduce repair bandwidth by aligning interference among downloaded blocks. However, when the technique is applied to MDS codes, the codes lose their MDS property. The required conditions of MDS codes and of the interference alignment technique conflict with each other, and hence they cannot be satisfied simultaneously.
Fig. 2. Data blocks A1, B1, C1, D1, E1, F1 and their coefficients are chosen from finite field Fq. Addition and multiplication are XOR and finite field multiplication respectively.
5.2 Maximization of Local Groups
Suppose a file of size M to be stored is segmented into k = 4 segments of equal size M/4. Each parity block stores only 2 data blocks, and there are only 2 parity blocks storing the same data blocks but with different coefficients. Figure 3 illustrates the setting. The encoding matrix G1 of the code above is

$$G_1=\begin{pmatrix}
0_{2\times 2} & I_{2\times 2} & \mathbf{1}_{2\times 1} & 0_{2\times 1} & P & 0_{2\times 1}\\
I_{2\times 2} & 0_{2\times 2} & 0_{2\times 1} & \mathbf{1}_{2\times 1} & 0_{2\times 1} & Q
\end{pmatrix}$$

where Q = (c, l)^T, P = (g, n)^T and c, l, g, n ∈ F_q. They satisfy the properties

$$\operatorname{rank}\begin{pmatrix}1 & c\\ 1 & l\end{pmatrix}=\operatorname{rank}\begin{pmatrix}1 & g\\ 1 & n\end{pmatrix}=2$$
The construction above can tolerate any 2 failures and at most 4 data block failures. One can observe that the construction above is actually a concatenation of two (4, 2) MDS codes. One may also notice that the number of (4, 2) MDS codes used corresponds to the number of local groups.
5.3 Generalization
Denote the matrices

$$Q=\begin{pmatrix}
1 & 0 & \cdots & 0\\
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1\\
0 & 0 & \cdots & 1
\end{pmatrix},\qquad
P=\begin{pmatrix}
l & 0 & \cdots & 0\\
c & 0 & \cdots & 0\\
0 & l & \cdots & 0\\
0 & c & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & l\\
0 & 0 & \cdots & c
\end{pmatrix}$$
where the k × k/2 matrix Q and the k × k/2 matrix P satisfy

$$\operatorname{rank}\begin{pmatrix}1 & c\\ 1 & l\end{pmatrix}=2$$

Define the k × 2k encoding matrix G1 as follows:

$$G_1=(\,I_{k\times k}\quad Q\quad P\,)$$
Suppose a file of size M to be stored is segmented into k segments of equal size M/k, where the segments are denoted as S = (a_1, a_2, …, a_k) ∈ F_q^k. The encoded symbols are obtained through the matrix multiplication C = S·G_1. We call codes with the above construction M-codes, where M stands for modified. We also call data blocks that are stored in the same parity node consecutive blocks. In Fig. 3, A and B are consecutive blocks.
Fig. 3. Data blocks A, B, C, D and their coefficients are chosen from the finite field F_q. Addition and multiplication are XOR and finite field multiplication respectively.
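To make the M-code construction concrete, the following sketch builds G1 = (I | Q | P) for k = 4 over a small prime field and repairs two consecutive blocks from their two parities alone; the field GF(11), the coefficients c = 2 and l = 3, and the sample data are illustrative assumptions, not values fixed by the construction.

```python
# Sketch of the M-code G1 = (I | Q | P) for k = 4 over GF(p); p and the coefficients
# c = 2, l = 3 (c != l, so each 2x2 parity pair is invertible) are illustrative choices.
p, k, c, l = 11, 4, 2, 3

I = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
Q = [[1 if j == i // 2 else 0 for j in range(k // 2)] for i in range(k)]                      # plain sums
P = [[(l if i % 2 == 0 else c) if j == i // 2 else 0 for j in range(k // 2)] for i in range(k)]
G1 = [I[i] + Q[i] + P[i] for i in range(k)]

S = [5, 7, 2, 9]                                                   # the k data segments
C = [sum(S[i] * G1[i][j] for i in range(k)) % p for j in range(2 * k)]

# Repair the consecutive blocks S[0], S[1] from their two parities only
# (2 blocks downloaded, i.e. 2M/k bandwidth instead of M for an MDS repair).
s01_sum, s01_mix = C[k], C[k + k // 2]                             # S0 + S1 and l*S0 + c*S1
inv = pow((c - l) % p, p - 2, p)
S1 = (s01_mix - l * s01_sum) * inv % p
S0 = (s01_sum - S1) % p
assert (S0, S1) == (S[0], S[1])
print("repaired", S0, S1, "from 2 parity blocks")
```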
5.4 Comparison Between M-Codes and MDS Codes
Denote f as the number of data block failures; clearly f ≤ k. For any tolerable failure pattern, MDS codes require a repair bandwidth of M, the size of the original file. For the M-codes, the repair bandwidth depends on the failure pattern. The repair bandwidth of recovering 1 or 2 consecutive blocks is (M/k)·2 = 2M/k, which is clearly less than M. Note that the repair bandwidth of the M-codes is proportional to f. However, M-codes do not have the MDS property due to the concatenation of (4, 2) MDS codes: if at least 3 failures occur in the same local group, the failure is not recoverable.
6 Theoretical Optimization in R-LRC

In this section, we present several methods to reduce the repair bandwidth when such failures occur. We assume that in R-LRC there are only 2 local groups, each group containing k/2 data nodes and l/2 local parity nodes. As shown in the previous section, if there are 2
failures that occur at different local groups, the repair bandwidth is at least the size of the original file M.
6.1 Optimization of Common Blocks
Downloaded blocks can be used more than once in repair without increasing the repair bandwidth. If the use of such blocks can be maximised, the repair bandwidth can be reduced. An example is given below to illustrate the idea before a general approach is given.

Definition 1 (Common block). A data block is called a common block if it is contained in all parity blocks.

Example 1. Suppose there are k = 4 data blocks, namely, A, B, C, D, and 4 parity blocks. Note that each data block is an element from the finite field F_{2^4} generated by the polynomial x^4 + x + 1 and has size M/4. Then the data blocks are assigned to parity blocks as follows:

C1 = A + B
C2 = 2A + B
C3 = A + B + C
C4 = A + B + D

Note that the coefficients of A, B, C and D are chosen from F_{2^4}. The encoding matrix has the form

$$\begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 2 & 1 & 1\\
0 & 1 & 0 & 0 & 1 & 1 & 1 & 1\\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}$$
Suppose data blocks A and B fail. Then they can be repaired using 2 blocks (C1, C2). Table 3 summarises the result of 2 failures and their repair bandwidth.
Table 3. (A, B) denotes the failure of data blocks A and B

Failed blocks                                  | Repair bandwidth
(A, B)                                         | (M/4) · 2
(A, C), (A, D), (B, C), (B, D), (C, D)         | (M/4) · 3
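A quick way to check the (A, B) row of Table 3 is to carry out the repair in F_{2^4} explicitly: from C1 = A + B and C2 = 2A + B one gets A = 3^{-1}(C1 + C2) and B = C1 + A, so only two parity blocks are downloaded. The sketch below implements GF(2^4) with the reduction polynomial x^4 + x + 1; the sample values of A–D are arbitrary illustrative assumptions.

```python
# GF(2^4) arithmetic with reduction polynomial x^4 + x + 1 (0b10011).
def gf16_mul(x, y):
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        if x & 0x10:            # reduce by x^4 + x + 1
            x ^= 0b10011
        y >>= 1
    return r

def gf16_inv(x):
    # brute-force inverse in the 15-element multiplicative group
    return next(y for y in range(1, 16) if gf16_mul(x, y) == 1)

A, B, C, D = 9, 5, 12, 7        # arbitrary sample data blocks in GF(2^4)
C1 = A ^ B                      # addition in GF(2^4) is XOR
C2 = gf16_mul(2, A) ^ B
C3 = A ^ B ^ C
C4 = A ^ B ^ D

# Repair A and B from the two common-block parities C1 and C2 only:
A_rec = gf16_mul(gf16_inv(3), C1 ^ C2)   # C1 + C2 = (1 + 2)A = 3A
B_rec = C1 ^ A_rec
assert (A_rec, B_rec) == (A, B)
print("repaired", A_rec, B_rec, "using 2 parity blocks")
```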
6.2 Generalization
Definition 2 (Special block). A special block is a data block contained in some, but not all, parity blocks.
In the example above, A and B are common blocks while C and D are special blocks. We assume that each parity block can contain at most 1 special block and that there are k/2 common blocks and k/2 special blocks. We fix the number of parity blocks to be k, where each of the first k/2 parity blocks contains k/2 common blocks and no special block, while each of the remaining k/2 parity blocks contains k/2 common blocks with coefficient 1 and 1 special block. Note that the coefficients of the common blocks in the first k/2 parity blocks are linearly independent and that the special blocks contained in each parity block are distinct. We can always arrange common blocks and special blocks in ascending order by their sequence numbers, that is, all common blocks come first, followed by the special blocks. Hence, the k × 2k encoding matrix has the form
$$\begin{pmatrix}
I_{k/2\times k/2} & 0_{k/2\times k/2} & A_{k/2\times k/2} & \mathbf{1}_{k/2\times k/2}\\
0_{k/2\times k/2} & I_{k/2\times k/2} & 0_{k/2\times k/2} & I_{k/2\times k/2}
\end{pmatrix}$$
where A has non-zero determinant and A_{k/2×k/2} = (1_{k/2×1} | B_{k/2×(k/2−1)}). Note that all entries of the matrix A are chosen from the finite field F_q, where F_q contains q elements. In short, we partition the data blocks into 2 groups: one group contains all common blocks and the first k/2 parity blocks, while the other contains all special blocks and the remaining k/2 parity blocks.
6.3 Repair Bandwidth of the Special Codes
Suppose a file of size M is to be stored. It is fragmented into k fragments of equal size M/k. Suppose there are f ∈ N failures, consisting of f1 ∈ N common blocks and f2 ∈ N special blocks. Clearly f1 + f2 = f and f1, f2 ≤ k/2. Each of the f2 special blocks can be repaired using (M/k)·2 = 2M/k repair bandwidth. Hence, the total repair bandwidth for repairing the f2 special blocks is (M/k)[2f2 − (f2 − 1)] = (M/k)(f2 + 1), as one parity block is used repeatedly in f2 − 1 recovery processes. To repair the f1 common blocks, note that the group containing the common blocks is a (k, k/2) MDS code. Therefore, the repair bandwidth for the f1 common blocks is M/2, and the total repair bandwidth for repairing f = f1 + f2 failures is (M/k)(k/2 + f2 + 1). We call codes with the above parity blocks the Special codes.
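The total above is easy to evaluate; a short helper (purely illustrative, with M and k as plain numbers) makes the comparison with the bandwidth M of a full MDS repair immediate.

```python
from fractions import Fraction

def special_code_bw(M, k, f2):
    """Total repair bandwidth (M/k) * (k/2 + f2 + 1) derived above, where f2 is the
    number of failed special blocks; the common-block part always costs M/2."""
    return Fraction(M, k) * (Fraction(k, 2) + f2 + 1)

print(special_code_bw(1, 8, 1))   # 3/4 of the file, versus M for an MDS repair
print(special_code_bw(1, 8, 2))   # 7/8
```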
7 Numerical Results and Analysis

In this section, we proceed to compare the Special codes with Pyramid codes. First, we show the repair bandwidth of two failures in the Special codes. Second, we demonstrate a repair bandwidth comparison between Special codes and Pyramid codes.
7.1 Repair Bandwidth of Two Failures with Different k
We compare Special codes with Pyramid codes (R-LRC with a = 1) under the same configuration, where we consider only two failures. For the Pyramid codes, we assume that they have k = 4 data blocks and two local groups, each group containing one local parity block, and that they have two global parity blocks. Figure 4 shows the repair bandwidth of two failures with different k in Special codes, where M is the repair bandwidth of Pyramid codes when k > 2. Compared to Pyramid codes, R-LRC can save 25%–62.5% of the repair bandwidth, as shown in Fig. 4.
Fig. 4. Repair bandwidth of two failures in Special codes.
7.2 Repair Bandwidth Comparison with Different File Sizes
Figure 5 compares Special codes with Pyramid codes in repair bandwidth with different file sizes for two failures. When the 2 failures occur at different local groups, Pyramid codes require M/4 · 4 = M repair bandwidth. When the 2 failures occur within a local group, they require M/4 · 3 = 3M/4 repair bandwidth from Table 1. Furthermore, in Pyramid codes there are 2 × 2 = 4 failure patterns that give M repair bandwidth, while 2 patterns give 3M/4 repair bandwidth. Hence, in Pyramid codes, the repair bandwidth M has probability 2/3 compared with 3M/4, which has probability 1/3. For the Special codes, the repair bandwidth 3M/4 has probability 5/6
Fig. 5. Repair bandwidth comparison with different file sizes for two failures.
compared with M/2, which has probability 1/6. On average, the Special codes use 2 × 1/6 + 3 × 5/6 ≈ 2.833 blocks in repairing 2 failures while the Pyramid codes use 4 × 4/6 + 3 × 2/6 ≈ 3.667 blocks. Hence, the Special codes have lower average repair bandwidth.
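These averages follow directly from the block counts and probabilities above; a two-line check (purely illustrative) confirms them.

```python
from fractions import Fraction

# Expected number of downloaded blocks for repairing 2 failures (block size M/4).
special = 2 * Fraction(1, 6) + 3 * Fraction(5, 6)   # 17/6
pyramid = 4 * Fraction(4, 6) + 3 * Fraction(2, 6)   # 22/6
print(float(special), float(pyramid))               # 2.833..., 3.666...
```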
8 Conclusions

In this paper, we present a new class of codes called Regenerating-Local Reconstruction Codes (R-LRC), which use a smaller granularity of data blocks and hence give lower repair bandwidth compared to Pyramid codes. The generator matrix of R-LRC is given. However, the matrix depends on the number of data block failures; this requires users to estimate the average number of failures occurring in the storage system before applying R-LRC. After that, we discuss a few alternatives that give lower repair bandwidth in R-LRC. In addition, we propose an efficient interference alignment mechanism that performs algebraic alignment so that the dimension of useless and unwanted information is decreased and the network bandwidth consumption is reduced. Our numerical results show that R-LRC can reduce the repair bandwidth by 25%–62.5% compared to Pyramid codes.
References 1. Huang, C., Simitci, H., Xu, Y., Ogus, A., Calder, B., Gopalan, P., Li, J., Yekhanin, S.: Erasure coding in windows azure storage. In: 2012 USENIX Annual Technical Conference, pp. 15–26 2. Sathiamoorthy, M., Asteris, M., Papailiopoulos, D.S., Dimakis, A.G., Vadali, R., Chen, S., Borthakur, D.: Xoring elephants: novel erasure codes for big data. PVLDB 6(5), 325–336 (2013) 3. Rashmi, K.V., Shah, N.B., Gu, D., Kuang, H., Borthakur, D., Ramchandran, K.: A “hitchhiker’s” guide to fast and efficient data reconstruction in erasure-coded data centers. In: ACM SIGCOMM 2014 Conference, pp. 331–342 (2014) 4. Ford, D., Labelle, F., Popovici, F.I., Stokely, M., Truong, V., Barroso, L., Grimes, C., Quinlan, S.: Availability in globally distributed storage systems. In: OSDI 2010, pp. 61–74 (2010) 5. Xu, Q., Arumugam, R.V., Yong, K.L., Mahadevan, S.: Efficient and scalable metadata management in EB-scale file systems. IEEE Trans. Parallel Distrib. Syst. 25(11), 2840–2850 (2014) 6. Dimakis, A.G., Godfrey, B., Wu, Y., Wainwright, M.J., Ramchandran, K.: Network coding for distributed storage systems. IEEE Trans. Inf. Theory 56(9), 4539–4551 (2010) 7. Li, R., Lin, J., Lee, P.P.C.: CORE: augmenting regenerating coding-based recovery for single and concurrent failures in distributed storage systems. In: Proceedings of IEEE Conference on Mass Storage Systems and Technologies (MSST 2013) (2013) 8. Duminuco, A., Biersack, E.: A practical study of regenerating codes for peer-to-peer backup systems. In: 29th IEEE International Conference on Distributed Computing Systems (ICDCS 2009), pp. 376–384 (2009) 9. Hu, Y., Chen, H.C.H., Lee, P.P.C., Tang, Y.: Nccloud: applying network coding for the storage repair in a cloud-of-clouds. In: FAST 2012, p. 21 (2012)
10. Gopalan, P., Huang, C., Simitci, H., Yekhanin, S.: On the locality of codeword symbols. IEEE Trans. Inf. Theory 58(11), 6925–6934 (2011) 11. Huang, C., Chen, M., Li, J.: Pyramid codes: flexible schemes to trade space for access efficiency in reliable data storage systems. TOS 9(1), 3 (2013) 12. Xu, Q., Xi, W., Yong, K.L., Jin, C.: Concurrent regeneration code with local reconstruction in distributed storage systems. In: The 9th International Conference on Multimedia and Ubiquitous Engineering, pp. 415–422 (2015) 13. Kamath, G.M., Prakash, N., Lalitha,V., Kumar, P.V.: Codes with local regeneration. In: Proceedings of the IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, pp. 1606–1610, July 2013 14. Rawat, A.S., Koyluoglu, O.O., Silberstein, N., Vishwanath, S.: Secure locally repairable codes for distributed storage systems. In: Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 2224–2228 (2013) 15. Wu, Y., Dimakis, A.G.: Reducing repair traffic for erasure coding-based storage via interference alignment. In: IEEE International Symposium on Information Theory, ISIT 2009, Seoul, Korea, Proceedings, 28 June–3 July 2009, pp. 2276–2280. IEEE (2009)
Blockchain-Based Distributed Compliance in Multinational Corporations’ Cross-Border Intercompany Transactions A New Model for Distributed Compliance Across Subsidiaries in Different Jurisdictions Wenbin Zhang1(&), Yuan Yuan1, Yanyan Hu1, Karthik Nandakumar1, Anuj Chopra1, Sam Sim2, and Angelo De Caro3 1
IBM Center for Blockchain Innovation, IBM Research, Singapore, Singapore {wenbin,idayuan,yanyanhu,nkarthik,achopra}@sg.ibm.com 2 IBM Tax & Intercompany Singapore, Singapore, Singapore
[email protected] 3 Industry Platforms and Blockchain, IBM Research Zurich, Zurich, Switzerland
[email protected]

Abstract. Multinational Corporations (MNCs) have been facing increasing challenges of compliance and audit in their complex crisscrossing intercompany transactions, which are subject to various regulations across the various jurisdictions in their global networks. In this paper, we investigate these challenges and apply Blockchain-based distributed ledger technology to address them. We propose a Blockchain-based compliance model and then build a Blockchain-based distributed compliance solution to enable automatic distributed compliance enforcement for an MNC's cross-border intercompany transactions across all jurisdictions of the MNC's network. This solution can help an MNC reduce the risk of compliance and audit, and enhance the capability of responding to various audits efficiently even after many years, thus improving the trust relationship and reputation with various auditors.

Keywords: Multinational Corporation · Transfer pricing · Intercompany transaction · Documentation · Compliance · Audit · Blockchain · Distributed ledger technology
1 Introduction
1.1 Background
The golden era of globalization during which Thomas Friedman famously declared that the "World is Flat"1 has seen Multinational Corporations (MNCs) building interconnected global supply chains with subsidiaries across many jurisdictions, transacting
1 The World Is Flat: A Brief History of the Twenty-first Century, Farrar, Straus and Giroux; 1st edition (April 5, 2005).
goods, services and financing with one another. These transactions among the subsidiaries of an MNC are called intercompany transactions, and the size of intercompany transactional flows can be very large for an MNC. For example, General Motors Company (GM) has its global headquarters in Michigan, United States, manufactures cars and trucks in 35 countries, distributes millions of vehicles and financial services globally, and has 19,000 dealers in over 125 countries2; and the value of intercompany transactions between GE and its associated party GE Capital was approximately $105 billion for 2013, $76 billion for 2014, and $143 billion for 2015, respectively, as disclosed in GE's 2015 FORM 10-K3. Yet, the new millennium is seeing an increasingly multipolar world which comes with renewed assertions of national sovereignty, including legislation, government agencies and rules that seek to regulate cross-border transactions. As a result, these complex crisscrossing networks of intercompany transactions are subject to increasing levels of accounting, finance, transfer pricing, exchange controls and other regulations and scrutiny by each local authority in the MNC's global network. The external authorities in each jurisdiction demand that each subsidiary of the MNC: (1) has proper documentation on a contemporaneous basis; (2) exercises proper controls in the whole process; (3) can account for the source and nature of each transaction; (4) can match payments for each transaction to its respective underlying delivery flow; (5) can reconcile the numerous accounting accruals and payments; and (6) can trace transactions and supporting documents end-to-end. All these are subject to external audits by the statutory account auditors, the revenue authorities and banking/capital control regulators and so on. When being audited in a jurisdiction, the MNC must provide sufficient evidence to prove to the auditors that it has complied with the relevant rules. This gives rise to two fundamental problems: (1) Trust problem between MNCs and auditors. (2) Provenance problem in establishing the veracity of intercompany transactions.
1.2 Challenge of Distributed Compliance with Different Jurisdictions and Rules Across a Multinational Corporation's Geography
As Booz & Co. writes in "Managing the Global Enterprise in Today's Multipolar World"4, today's global MNCs face issues such as how much freedom local entities should have to manage themselves, how to balance local autonomy with global scale and standardization, and how much central coordination is optimal. Global success in this multipolar world hinges on three management enablers: a rebalanced organizational structure and footprint; dispersed decision rights and tighter controls; and a fresh approach to leadership and talent management. This need for dispersed decision rights, so that each subsidiary or local team can best deal with and comply with local regulations, runs up against the tendency of MNCs to
2 GM's official website https://www.gm.com.
3 Available at https://www.ge.com/ar2015/assets/pdf/GE_2015_Form_10K.pdf.
4 https://www.strategyand.pwc.com/media/uploads/Strategyand-Managing-Global-Enterprise.pdf.
centralize controls and standardize processes. The problem is particularly pronounced in intercompany transactions where there are at least two and frequently multiple entities and jurisdictions to manage. In an MNC's supply chain for example, a procurement center in Country A may source raw materials for a factory in Country B to make certain components that are in turn sent to Country C which then combines other parts to assemble a product that is shipped out to Country D where the distribution arm of the MNC is situated. The compliance effort is dispersed amongst each of these entities subject to the various countries' regulatory authorities and rules. Yet, the MNC should maintain central oversight and control. A central CFO and financial controller, for instance, have the overall responsibility that, notwithstanding all these intercompany financial flows, the financial accounts in each country and, on a consolidated basis, for the MNC as a whole, are true and fair. This challenge of distributed compliance in a fundamentally centralized management set-up in the MNC is further complicated by the trend towards outsourcing and offshoring work to shared service centers to save costs. Shared service centers are commonly implemented in MNCs to further integrate specific operational services, such as accounting, HR, IT, legal, logistics, etc., and to supply these services to other subsidiaries/business units of the MNC. This manual handling of intercompany transactions through shared service centers involving hundreds of entities across the world has the result of separating the financial billing and payment flows from the underlying flow of goods or services, i.e. Subsidiary A may render a service to Subsidiary B but the costs of such a service are billed via Subsidiary C to Subsidiary B. Such costs are being aggregated and charged out by centralized teams at one or more global or regional hubs. Another issue for a typical MNC is the presence of many non-integrated financial, accounting, billing, invoicing, human resource systems, etc. This phenomenon is often a consequence of separate systems implemented over time or Mergers and Acquisitions (M&A) that cobble different companies with different systems and processes together. The local country and Shared Services teams draw information from, read and write data, store documents and bills using different accounting and finance systems and databases across the MNC. Furthermore, the audits of these various local subsidiaries may occur in different periods including many years later when staff members may have moved on, systems may have changed and the information may have been lost. Due to the above reasons, information, resolution and linkage of data may be lost in the transactional process, especially for transactions processed through the shared service centers. In addition, turnover of Hub personnel, system changes and deletion of data mean that centralized processes are prone to a single point of failure. Even if the information or document can be traced, the manual gathering and analysis process is inefficient and costly. The loss in information and linkages increases with the passage of time. These issues result in (1) wasted labor hours in retrieving information and documentation for audit compliance; (2) low process efficiencies; and (3) heightened risk in audits.
A different perspective of the distributed compliance problem is where MNCs are global service providers for clients who are obliged to comply with certain regulations and require the MNC and its subsidiaries to verify that they, their subsidiaries or even
suppliers comply with these regulations in their execution of their contracts. An example of this is where there is a list of sanctioned countries or companies. The MNC that contracts with the client in Country A should ensure that its subsidiaries, suppliers and sub-contractors outside Country A comply with the rules by refraining from dealing with the sanctioned countries or companies. In summary, distributed compliance at each node of a complex web of cross-border intercompany transactions is a key challenge facing most MNCs. Because these intercompany transactions go through an MNC's internal systems and processes, the MNC encounters:
(1) Loss of Data Lineage: Different systems across various jurisdictions that are not integrated, centralized processing and the time-lag between a transaction and the time of audit cause loss of links or resolution.
(2) Single Point of Failure: Dependency on individuals and silo systems that are prone to staff leave/turnover, data erasure and system changes/crashes.
(3) Lack of Trust from the Authorities: MNCs must keep records for years and prove provenance to external auditors/authorities to establish that they have complied with the rules in each country.
(4) Cross-border compliance that goes beyond the remit of a single jurisdiction and single contract: A regulator in any country cannot enforce its law in another country. Therefore, it often imposes its rules/requirements on an entity or its client in its jurisdiction, to comply with its rules across the network of the MNC or the client. This gives rise to the need for distributed compliance to demonstrate that the nodes (offshore subsidiaries of the MNC or the client) have complied with the rules of that country.
1.3 Contributions of This Paper
In this paper, we investigate these challenges in MNCs' distributed compliance arising from cross-border intercompany transactions, and propose a Blockchain-based solution to address them. Our solution consists of two parts: (1) the first part is a Blockchain-based compliance model which converts regulations in a jurisdiction into smart contracts to be deployed across the MNC's intercompany network; (2) the second part is the integration of the compliance model of the first part and a Blockchain-based intercompany transaction platform to form a Blockchain-based distributed compliance solution for an MNC. This solution can not only provide an immutable and contemporaneous record of transactions and evidence, but also enable an MNC to enforce compliance in the process of intercompany transactions across its whole intercompany supply chain, and to update compliance rules whenever there is a regulation change. When a transaction is being processed, it will be verified against compliance rules in each related jurisdiction even if the jurisdiction is not directly involved in the transaction, and its supporting documents will be recorded and verified as well. Besides benefiting compliance, this also enables MNCs to trace transactions and supporting documents end-to-end, transaction-by-transaction, and reconcile accounting entries even years after they have occurred. These capabilities of our solution can reduce the compliance and audit risk an MNC is facing, and enhance MNCs' responses to various
auditors, thus improving the trust relationship and reputation with auditors. There is further potential for MNCs to employ analytics on the end-to-end financial flows recorded on the Blockchain, and provide insights to CFOs and treasurers through a real-time dashboard of the transactions, enhancing overall finance controls.
2 Related Works
2.1 Existing Solutions
Numerous attempts have been made in the past to tackle the challenges described above, and a business process management (BPM) framework to deal with operational business processes has emerged. The most basic requirement is to have a common language between the main stakeholders (i.e. the users, the analysts and the developers) so that the business processes can be designed with a common understanding. A state-based approach to capture business processes was proposed in [1]. BPM has the potential to significantly increase productivity and save costs, and research on BPMs resulted in a plethora of methods, techniques, and tools to support the design, enactment, management, and analysis of operational business processes [2]. With the evolution of enterprise system architectures to automate business processes, a modelling-based approach to managing business processes was proposed. As businesses exist in society, they are subject to rules and regulations and thus need to comply with such rules, i.e., semantic constraints that stem from guidelines, regulations, standards, and laws. Thus, organizations need to have proper controls in place to comply with these rules and regulations [3]. This further increases the complexity of business processes, as evidence of compliance in such business processes needs to be captured and retained for varying periods of time, and compliance checks become an integral part of BPMs [4]. To compound the challenges, these rules and regulations are subject to change with changes in the global environment and variations in government policies and regulations [5]. Most of the existing business process models are not robust enough to account for such changes, even though several approaches have been presented for supporting automation of compliance management in such business processes. In [6], some of these approaches were studied to provide insights into their generalizability and evaluation; the study concluded that most of the approaches focus on a modelling technique supported by a limited set of compliance rules and that most of them may not be applicable for practical use. A framework was proposed in [7] for auditing business process models for compliance with legislative/regulatory requirements. In enterprise resource planning (ERP) environments, the audit of business process compliance is a complex task as audit-relevant context information about the ERP system, like application controls (ACs), needs to be considered to derive comprehensive audit results [8, 9]. Current compliance checking approaches neglect such information as it is not readily available in process models. Even if ACs are automatically analyzed with audit software, the results still need to be linked to related processes. Thus far, this linking is not methodically supported [5, 10]. Given these challenges, many MNCs may not be able to comply with the processes set to meet compliance guidelines, which leads to a lack of trust in the data shared by the
companies with auditors. Since the authorities generally audit for the possibility of fraud or hidden information/practices, they demand a lot of documents as evidence for the authenticity of a transaction. A solution to address the trust issues was proposed in [10], which relies on collaborative process execution using Blockchain and smart contracts. However, the work in [10] does not discuss intercompany transactions across an MNC's global footprint, the compliance-related challenges in each jurisdiction, and the need to achieve distributed governance. With the fast pace and frequent introduction of or revision to country rules governing financial transactions after the global financial crisis of 2008, transfer pricing rules after the G20-OECD Base Erosion and Profit Shifting project of 2013–15, and evolving accounting standards of both IFRS and US GAAP, the need to exercise robust governance over intercompany processes, maintain clear documentation and respond quickly and in an agile manner to changing compliance requirements will grow increasingly important for MNCs [11].
2.2 Blockchain-Based Solutions
Blockchain is a peer-to-peer distributed/shared ledger that immutably records a sequence of transactions without the need for any centralized or trusted entities [12, 13]. The basic idea is to group transactions into blocks and append the blocks to the distributed ledger one at a time based on a consensus mechanism [12, 13]. Each block also includes a cryptographic hash of the previous block, leading all the way to the genesis block (see Fig. 1). Given the assumption that the honest peers in the Blockchain network have greater computational power than the malicious ones, it is practically not feasible to tamper with transactions/blocks that are already recorded on the Blockchain without being detected. Smart contracts can be broadly defined as self-executing agreements written in code. By executing smart contracts on Blockchain nodes/peers, it is possible to ensure that the transactions that are recorded on the Blockchain satisfy all the terms agreed by the parties a priori. Thus, Blockchain technology powered with smart contracts can establish trust, accountability, and transparency, while streamlining business processes. It has the potential to vastly reduce the cost and complexity of getting things done, especially in multi-party transactions with minimal trust assumptions. It enables global business to transact with less friction and more trust, and brings enormous potential to transform global business across many industries. While the popularity of Blockchain is largely due to its successful application in cryptocurrencies such as Bitcoin [14], the underlying technology is generic and can be used to create a tamperproof audit trail of transactions in many different applications. Public Blockchain/DLT platforms like Bitcoin or Ethereum [15] enable all participating parties to manage the same shared ledger without a trusted party and have been reshaping the public domain. However, a public Blockchain is not suitable in the enterprise environment, as much higher efficiency, stronger control and data privacy are of paramount concern within an enterprise network. In this work, we employ a permissioned blockchain network, where the blockchain nodes are operated by known whitelisted entities. The identities for these entities (often defined by public and private cryptographic key pairs) are granted by issuing authorities recognized by the network. One example of such a permissioned blockchain network is the open-source Hyperledger
Fig. 1. A Blockchain is typically a simple linked list of blocks, which is built using hash pointers. Since each block includes a cryptographic hash H() of the previous block, the probability of tampering a transaction without detection is extremely low.
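As a minimal illustration of the hash-pointer structure in Fig. 1, the sketch below chains blocks with SHA-256 and shows that altering any recorded transaction breaks verification; it is a toy model for illustration only, not the block format of any particular platform.

```python
import hashlib, json

def block_hash(content):
    # Hash the block's content, which includes the hash of the previous block.
    return hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    prev = chain[-1]["hash"] if chain else "0" * 64      # genesis block has no predecessor
    block = {"prev": prev, "txs": transactions}
    block["hash"] = block_hash({"prev": prev, "txs": transactions})
    chain.append(block)

def verify(chain):
    for i, blk in enumerate(chain):
        if blk["hash"] != block_hash({"prev": blk["prev"], "txs": blk["txs"]}):
            return False
        if i and blk["prev"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, [{"from": "SubsidiaryA", "to": "SSC_EU", "amount": 100}])
append_block(chain, [{"from": "SSC_EU", "to": "SubsidiaryB", "amount": 100}])
print(verify(chain))                 # True
chain[0]["txs"][0]["amount"] = 999   # tamper with a recorded transaction
print(verify(chain))                 # False: the stored hash no longer matches
```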
Fabric [16]. Fabric has a modular architecture that allows network administrators to define their own constraints and then set up the protocols accordingly. Fabric also provides the following special features, some of which are used in our proposed solution to support our key requirements.
• Chaincode: Chaincode is the name used in Fabric to denote smart contracts. They provide a mechanism to define assets and instructions (business logic) to modify the assets. In addition, a Chaincode is also updatable, may retain state, and inherits confidentiality/privacy.
• Variable Confidentiality: Networks can limit who can view or interact at various levels of the environment. Applications built on top of Fabric can even impose their own confidentiality rules.
• Verifiable Identification: In cases where unlinkability is required, Fabric allows plugging in identity obfuscation mechanisms. If the users of a network grant permission, an auditor will be able to de-anonymize users and their transactions. This is particularly useful for external regulatory inspection, verification and analysis.
• Private Transactions: The details of a transaction, including but not limited to Chaincode, peers, assets, and volumes, can be encrypted. This limits the information availability to non-authorized actors on the network. Only specified actors can decrypt, view and interact/execute (with chaincode).
• Customizable Consensus Protocols: Fabric supports a pluggable consensus infrastructure that allows choosing the consensus mechanism that best fits the particular use-case.
3 Blockchain-Based Distributed Compliance in Cross-Border Intercompany Transactions

Blockchain-based distributed ledger technology can bring capabilities to an MNC, on top of the traditional central hubs and intercompany Enterprise Resource Planning (ERP) architecture, for governing and complying with the regulations applicable to intercompany transactions. It enables distributed governance and compliance without disrupting an MNC's daily operations. In the following, we shall first model the challenge of compliance in an MNC's cross-border intercompany transactions, and
then propose an approach of Blockchain-based distributed compliance for it (the technical details are given in the next section).
3.1 Model of MNC's Business Network
The business network of an MNC typically consists of the following types of stakeholders:
(1) Subsidiaries, such as manufacturers, assemblers, distributors, etc.
(2) Shared Service Centers in different jurisdictions, which integrate specific operational services, such as accounting, HR, IT, legal, logistics, procurement, etc., and supply these services to other subsidiaries/business units of the MNC.
(3) Regulators in each jurisdiction.
(4) Suppliers and clients across various jurisdictions.
For simplicity, this paper will consider only the first three types of stakeholders, namely, Subsidiary, Shared Service Center and Regulator. On this network, there are contracts/agreements, bills, payments and various documents being processed and transited among the stakeholders. In this paper, we regard all of these as transactions. The following example illustrates this model of an MNC's business network: product components are manufactured by a manufacturer, shipped to an assembler to be assembled into a product, and finally shipped to distributors. These subsidiaries of the MNC are in different jurisdictions, and there are shared service centers serving as regional centers to support subsidiaries in each region. As mentioned earlier, suppliers and clients will not be considered here. A simple scenario is shown in Fig. 2.
• Two connected shared service centers (SSC) in Europe (EU) and Asia Pacific (AP), which are regional centers of the MNC.
• Multiple subsidiaries (manufacturer, assembler, distributors) in different jurisdictions, each connecting with an SSC.
• One regulator in each jurisdiction.
3.2 Challenges of Compliance in MNC's Intercompany Transactions
Business processes and transactions are generally subject to compliance and audit by Regulators. Some examples of regulation rules to be complied with are:
• The entity, staff or type of transaction subject to regulations needs to be clearly identified.
• Proper processes and procedures must be put in place, or existing processes modified, to ensure the rules are followed. This includes instituting controls such as having a separate maker-checker of a document, and having tiered approval processes where higher risk or amounts involved require a higher level of approval; such approvals may be within the MNC or may involve external parties such as banks or authorities.
Fig. 2. A simple illustration of an MNC’s business network. In reality, there may be more subsidiaries surrounding a Shared Service Center, and there can be multiple Shared Service Centers in a region performing different functions.
• MNCs often need to follow or set down rules and procedures, constitute legal or specialist teams that interpret the rules, compliance teams that vet and filter the transactions against these rules, and internal audit functions that perform periodic checks and verification.
• The content of these rules may be (a) prescriptive, e.g. the contracts shall contain certain terms and conditions, transactions must follow certain steps, pricing must be at arm's length, revenue and costs should be matching; or (b) exclusionary, e.g. do not deal with entities or areas beyond a certain limit or from certain sanctioned countries.
Frequently, to allow external agencies or auditors to validate compliance, the MNC is required to (a) file, maintain and archive the relevant contracts or supporting documents, (b) demonstrate that there are proper controls and procedures, and (c) show that the controls are applied in a timely and adequate manner (authentications, verifications, approvals, etc.). In general, if some business processes and transactions are related to a jurisdiction, then they may need to be compliant with regulations in that jurisdiction. For example, if a product distributed in Jurisdiction A involves a service rendered or a device component manufactured in Jurisdiction Z, then the business processes/transactions of the service/device component could be subject to Jurisdiction A's compliance. Therefore, we make the following assumption:
Assumption: Processes and transactions involving a subsidiary in Jurisdiction A are subject to Jurisdiction A's compliance rules.
According to this assumption, transactions between Subsidiary Manufacturer and SSC EU (resp. Subsidiary Assembler and SSC AP) in Fig. 2 are subject to compliance and audit in each other's jurisdiction. The challenge is that a transaction initiated from the Manufacturer to SSC EU needs to comply with certain rules in the Distributors' jurisdictions, but the two Distributors have no control over the process of the transaction. In addition, there can be a considerable number of transactions processed each year, making
it costly to have human checks in the process. These transactions could only be verified after a certain time, which could be months or even years, when the Distributors need to trace the provenance of product components upon audit. This would put a subsidiary and the MNC under risk of audit failure, and the MNC could potentially suffer financial and reputational loss if there were non-compliant transactions.
3.3 Blockchain-Based Distributed Compliance for MNC
This challenge of compliance in MNC’s intercompany transactions can be addressed if there is a platform to link all these fragmented processes and controls such that each subsidiary can enforce automatic compliance in the entire transaction process regardless of the number of other subsidiaries owning pieces of the process. We propose the following idea of Blockchain-based distributed compliance: MNCs can build Blockchain-based distributed ledger to complement existing systems to process transactions, record supporting documents and enable automatic compliance in the process across the business network of the MNC. It can connect all the involved MNC’s subsidiaries, accounting, finance and intercompany departments, and internal auditors. External auditors and regulators in each jurisdiction can be connected to the distributed ledger, but through the local subsidiary of the MNC on a permissioned access basis that is safe and secure. Such a Blockchain-based distributed ledger can enable distributed compliance by building Smart Contracts with compliance rules of each Jurisdiction embedded into the process at each stage. This can ensure that each step of the process complies with the required rules, including changes in accounting and regulatory rules or rate changes. For example, if there is a change in an accounting convention, transfer pricing uplift or foreign exchange rate rule, smart contracts can reflect the change at the precise time the change goes into effect for multiple entities across the MNC without manually having to cascade such changes to each entity globally. The further potential to build real time dashboards and analytics on top of these blockchain transactions can provide the assurance to CFO, Treasurers, financial controllers and others responsible for central governance that compliance is executed throughout the network of MNC. The immutable records can then serve as verification to internal and external auditors. When a subsidiary is being audited, it can retrieve immutable transactional records and supporting documents from the ledger to prove control and provenance to the local auditors. To build such a Blockchain-based distributed ledger for the MNC’s business network, we shall employ a permissioned Blockchain with smart contract (see Fig. 3). Each subsidiary is a peer node whose ledger contains transactions related to it only. Each auditor/regulator can also be a peer node, but its ledger contains only the header and hash of each Block of the corresponding subsidiary’s ledger. This is generally sufficient for the regulator as the MNC needs to protect its necessary business privacy and the header and hash of each Block can ensure immutability. Another practical reason is that since a large number of companies may fall under the jurisdiction of a regulator, it would be costly for the regulator to hold the full ledgers of all these companies.
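As a small illustration of this design choice, the sketch below models a subsidiary keeping the full ledger while its regulator keeps only the block headers and hashes, which is enough to verify at audit time that a disclosed block has not been altered; all block fields and names are invented for illustration and are not tied to any specific platform.

```python
import hashlib, json

def digest(obj):
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

subsidiary_ledger = []      # full blocks: header + transactions
regulator_ledger = []       # header and hash only

def append_block(transactions):
    header = {"height": len(subsidiary_ledger),
              "prev": subsidiary_ledger[-1]["hash"] if subsidiary_ledger else "0" * 64}
    block = {"header": header, "txs": transactions,
             "hash": digest({"header": header, "txs": transactions})}
    subsidiary_ledger.append(block)
    regulator_ledger.append({"header": header, "hash": block["hash"]})   # no transaction details

def audit_check(block):
    """At audit time the subsidiary discloses a full block; the regulator re-hashes it
    and compares against the hash it has held since the block was committed."""
    expected = regulator_ledger[block["header"]["height"]]["hash"]
    return digest({"header": block["header"], "txs": block["txs"]}) == expected

append_block([{"invoice": "INV-001", "amount": 100}])
append_block([{"invoice": "INV-002", "amount": 250}])
print(audit_check(subsidiary_ledger[1]))            # True
subsidiary_ledger[1]["txs"][0]["amount"] = 999      # tampering after the fact
print(audit_check(subsidiary_ledger[1]))            # False
```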
To enforce compliance in the processes across the network of the MNC, each jurisdiction's compliance rules can be implemented as part of the smart contract, which will be deployed to all peer nodes. All transactions that take place will be verified by the smart contract for compliance with the rules. Each subsidiary can further update its local compliance rules across the network by updating the smart contract. For audit, the Regulator can access the subsidiary's ledger through an audit portal of that subsidiary.
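To make the idea concrete, the sketch below shows, outside any blockchain framework, the kind of per-jurisdiction check such a smart contract could perform on a transaction that touches several jurisdictions. The rule contents, document names and sanction list are invented for illustration; they are not taken from the paper and do not use any Hyperledger Fabric API.

```python
# Illustrative only: a framework-agnostic sketch of the per-jurisdiction
# compliance check that the compliance chaincode would perform.
# All data values below are invented example data.

SANCTIONED_PARTIES = {"BlockedCo Ltd"}                 # assumed example list
REQUIRED_DOCS = {                                      # assumed per-jurisdiction rules
    "JurisdictionA": {"invoice", "service_contract"},
    "JurisdictionB": {"invoice", "transfer_pricing_memo"},
}

def check_compliance(tx):
    """Return (compliant, reason) for a proposed intercompany transaction."""
    # Exclusionary rule: no sanctioned counterparties.
    if tx.get("counterparty") in SANCTIONED_PARTIES:
        return False, "counterparty is on a sanction list"
    # Prescriptive rule: every involved jurisdiction requires its own documents.
    for jurisdiction in tx.get("jurisdictions", []):
        missing = REQUIRED_DOCS.get(jurisdiction, set()) - set(tx.get("documents", []))
        if missing:
            return False, f"missing documents for {jurisdiction}: {sorted(missing)}"
    return True, "compliant"

if __name__ == "__main__":
    tx = {"counterparty": "SSC EU",
          "jurisdictions": ["JurisdictionA", "JurisdictionB"],
          "documents": ["invoice", "service_contract", "transfer_pricing_memo"]}
    print(check_compliance(tx))   # -> (True, 'compliant')
```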
4 Proposed Approach of Blockchain-Based Distributed Compliance In this section, we shall present the detailed design of a Blockchain-based solution for distributed compliance in cross-border intercompany transactions. We shall first build a Blockchain system of cross-border intercompany transactions for an MNC’s subsidiaries following the approach of [10], then propose a Blockchain-based compliance model and embed it into the Blockchain of intercompany transactions to have the Blockchain-based distributed compliance. 4.1
Blockchain-Based Cross-Border Intercompany Transactions
For the business network of an MNC, permissioned Blockchain should be used rather than public Blockchain. A suitable choice of permissioned Blockchain is Hyperledger Fabric [16] as it is particularly built for enterprises’ environment. The proposed architecture is shown in Fig. 4 following the active mediator approach of [10]. The Blockchain connects with existing systems and processes, coordinates the process execution among the MNC’s Subsidiaries by smart contract, and stores the transaction data and process status on the blockchain ledger. There are three major components: (1) the blockchain interface, (2) smart contract in terms of chaincode and endorsement policy, and (3) the ledger, which are explained in detail below. The client, i.e., blockchain interface, connects a Subsidiary’s existing systems and internal processes. Through this client, smart contracts, implemented as chaincode, can interact with the systems and processes outside the blockchain. Similar to [10], the client holds confidential information and runs on a full blockchain node, keeping track of the execution context and status of running business processes. The client calls external APIs if needed, receives API calls from external components, and updates the process state in the blockchain based on external observations. It further keeps track of data payload in API calls and keeps the data and documents in a storage database. The document storage can be either on or off the Blockchain. Chaincode and endorsement policy are specific key components of Hyperledger Fabric [16]. Chaincode is software defining an asset or assets, and the transaction instructions for modifying the asset(s). In other words, it’s the business logic processing transaction request with data (read set and input) and generates transaction result (write set). Chaincode enforces the rules for reading or altering key value pairs or other state database information. Chaincode functions execute against the ledger current state database and are initiated through a transaction proposal. Chaincode
Fig. 3. Blockchain for MNC’s intercompany transactions.
Fig. 4. Architecture design of Blockchain-based intercompany transactions.
execution results in a set of key value writes (write set) that can be submitted to the network and applied to the ledger on all peers. Endorsement policies are used to instruct a peer on how to decide whether a transaction is properly endorsed and therefore to be considered valid. When a peer receives a transaction, it invokes the VSCC (Validation System Chaincode) associated with the transaction’s Chaincode to make the following determinations:
(1) all endorsements of a transaction are valid (i.e. they are valid signatures from valid certificates over the expected message), (2) there is an appropriate number of endorsements, and (3) endorsements come from the expected source(s). The ledger is the sequenced, tamper-resistant record of all state transitions in the Fabric. State transitions are a result of chaincode invocations ('transactions') submitted by participating parties. Each transaction results in a set of asset key-value pairs that are committed to the ledger as creates, updates, or marked as deleted. The ledger comprises a blockchain ('chain') to store the immutable, sequenced record in blocks, as well as a state database to maintain the current Fabric state. There are additional capabilities of Hyperledger Fabric, especially channels, which can be easily incorporated in the solution. Due to space limitation, these capabilities will not be discussed in this paper.
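The following framework-free sketch restates these three checks in plain Python. The data shapes (dictionaries with 'endorser' and 'signature' fields) and the stubbed signature check are assumptions for illustration, not Fabric's actual VSCC code.

```python
# Illustrative sketch of the three validation checks described above.
# Real signature verification is stubbed out; structures are assumptions.

def signature_is_valid(endorsement):
    # Placeholder: a real implementation would verify the signature over the
    # endorsed payload against the endorser's certificate.
    return bool(endorsement.get("signature"))

def validate_endorsements(endorsements, policy):
    """policy = {'required': int, 'allowed_endorsers': set}  (assumed shape)."""
    # (1) every endorsement carries a valid signature from a valid certificate
    if not all(signature_is_valid(e) for e in endorsements):
        return False
    # (2) there is an appropriate number of endorsements
    if len(endorsements) < policy["required"]:
        return False
    # (3) endorsements come from the expected source(s)
    return all(e["endorser"] in policy["allowed_endorsers"] for e in endorsements)

if __name__ == "__main__":
    policy = {"required": 2, "allowed_endorsers": {"SubsidiaryB", "TaxDept"}}
    endorsements = [{"endorser": "SubsidiaryB", "signature": "sig1"},
                    {"endorser": "TaxDept", "signature": "sig2"}]
    print(validate_endorsements(endorsements, policy))   # -> True
```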
4.2 Blockchain-Based Compliance Model
We next propose a Blockchain-based compliance model which will be embedded into the above MNC's Blockchain-based intercompany transactions. The overall idea is to use two typical conversion tools to convert regulations into code which can be embedded into the Blockchain chaincode. Firstly, a domain specific language (DSL) template is created and applied to convert regulations into three parts, as shown in Fig. 5: compliance rules, parameters, and enforcement policies. A simple example of the DSL template is given in Fig. 6. Secondly, a smart contract translator is created to convert them into a Blockchain implementation including chaincode, constant state and endorsement policy:
(1) Chaincode implements the compliance logic, performing automatic compliance verification of a submitted transaction against the required compliance rules.
(2) Constant state includes the compliance parameters stored in the ledger (world state), which will be fetched by the compliance chaincode in the process of compliance verification.
(3) Endorsement policy is used to assert the validity of a transaction. If compliance is fulfilled, the transaction proposal will be endorsed. The endorsement policy needs to be satisfied in order for a transaction to be committed to the ledger.
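Since the DSL template of Fig. 6 is not reproduced here, the following is only a hypothetical illustration of how a regulation, once passed through such a template, might be represented as the three parts above before the translator turns them into chaincode, constant state and endorsement policy. All field names and values are invented.

```python
# Hypothetical, invented representation of one regulation after the DSL step.
# It is not the authors' template; it only mirrors the three-part split above.

regulation_as_dsl = {
    # becomes the compliance logic in chaincode
    "compliance_rules": [
        {"rule": "no_sanctioned_party", "applies_to": "counterparty"},
        {"rule": "documents_present", "applies_to": "supporting_documents"},
    ],
    # becomes constant state stored in the ledger (world state)
    "parameters": {
        "sanction_list": ["BlockedCo Ltd"],
        "required_documents": ["invoice", "service_contract"],
    },
    # becomes the endorsement policy asserting transaction validity
    "enforcement_policy": {
        "endorsers": ["SubsidiaryB", "TaxDept"],
        "min_endorsements": 2,
    },
}
```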
Fig. 5. Blockchain-based compliance model.
Fig. 6. An example of DSL template.
In addition, whenever there is a change of regulations in a Jurisdiction, the changes can be applied accordingly to the Blockchain implementation.
4.3 Blockchain-Based Distributed Compliance in MNC's Intercompany Transactions
We next integrate the previous Blockchain-based compliance model into the Blockchain-based intercompany transactions as shown in Fig. 7. (1) A new chaincode comprising the compliance logic and the business logic will be defined; (2) The compliance endorsement policy will be bound to the business logic; (3) The constant states of the compliance parameters are stored in the ledger. The process flow of a transaction is shown in Fig. 8. A user of the host peer submits a transaction proposal on behalf of a subsidiary. The required endorsing peers will then simulate the execution of the chaincode against the transaction proposal. If compliance is fulfilled, meaning that the simulation of the chaincode is successful, then the transaction proposal will be endorsed. The host peer collects all endorsing peers' endorsements and creates a transaction that is submitted to the ordering service. Once the transaction gets ordered, it is dispatched to all the committing peers. The committing peers verify the endorsements against the endorsement policy bound to the chaincode referenced by the transaction. If this check succeeds, the transaction is deemed valid and will be committed to the ledger (a simplified sketch of this flow is given at the end of this subsection). The design of Hyperledger Fabric supports easy updating of chaincode, making it easy to update compliance rules across the Blockchain network. When regulations in a Jurisdiction are changed, the MNC's Subsidiary in that Jurisdiction can
Fig. 7. Architecture design of Blockchain-based distributed compliance.
implement the changes and submit a compliance-rule-update transaction to the Blockchain to update the chaincode, which will be processed as a usual transaction. If it is endorsed by the required endorsers, the chaincode can be rebuilt to update the compliance rules. Our solution also includes Regulators/Auditors, as shown in the architecture design in Fig. 7. In each Jurisdiction where the MNC has a Subsidiary, a regulator/auditor can also be a peer node of this Blockchain network, but it only records the header and hash of each block of the ledger. This ensures immutability of the ledger and protects the MNC's necessary business privacy. For auditors to access the ledger and supporting documents, a Subsidiary can provide an Audit Portal for the Auditors in its Jurisdiction, under the MNC's necessary privacy control.
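To tie the pieces together, the sketch below walks through the endorse, order, validate and commit steps described in this subsection in a deliberately simplified, framework-free form. Peer behaviour is reduced to plain functions, the sanction list is a made-up example, and nothing here uses the real Hyperledger Fabric APIs.

```python
# Simplified illustration of the transaction flow only; not Fabric code.

def simulate_chaincode(tx):
    """Stand-in for chaincode simulation at an endorsing peer (compliance check)."""
    if tx.get("counterparty") in {"BlockedCo Ltd"}:      # assumed sanction list
        return False, "sanctioned counterparty"
    return True, "ok"

def process_transaction(tx, endorsing_peers, policy, ledger):
    endorsements = []
    for peer_id in endorsing_peers:
        ok, reason = simulate_chaincode(tx)
        if not ok:
            return f"rejected at endorsement by {peer_id}: {reason}"
        endorsements.append({"endorser": peer_id, "signature": f"sig-{peer_id}"})
    # After ordering, committing peers check the endorsements against the policy.
    allowed = policy["allowed_endorsers"]
    if len(endorsements) < policy["required"] or \
            not all(e["endorser"] in allowed for e in endorsements):
        return "rejected at validation: endorsement policy not satisfied"
    ledger.append({"tx": tx, "endorsements": endorsements})
    return "committed"

if __name__ == "__main__":
    ledger = []
    policy = {"required": 2, "allowed_endorsers": {"SubsidiaryB", "TaxDept"}}
    tx = {"counterparty": "SSC EU", "amount": 250_000}
    print(process_transaction(tx, ["SubsidiaryB", "TaxDept"], policy, ledger))
```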
4.4 Discussion on Implementation of the Proposed Solution
We have implemented a prototype of the proposed solution to assess the feasibility of Blockchain-based distributed compliance. We built our prototype on Hyperledger Fabric 1.0 to demonstrate the following scenario:
(1) The MNC's subsidiary A in Jurisdiction A' is making a payment to the Shared Service Center B in Jurisdiction B' for a service, and Subsidiary C in Jurisdiction C' is related to this payment as the service contributes to a contract between A and C.
(2) The compliance rules concern satisfying the sanction restrictions, i.e., no external parties on a sanction list, and the retention of certain supporting documents, both of which are country specific; namely, B and C require different sets of supporting documents according to their local regulations.
(3) The MNC's own control policy requires that (i) payments need to be signed off by Subsidiary B; (ii) if the payment is >US$100K, the Tax department of the MNC needs to sign off; and (iii) if the payment is >US$500K, the CFO Office needs to sign off.
Fig. 8. Transaction flow.
We configured this Blockchain network with five peer nodes representing the Subsidiaries, Shared Service Center, Tax Department and CFO Office. The sanction list and required set of supporting documents of each peer are stored in the ledger, the compliance rules are implemented as chaincode, and required approvals from peers are implemented as endorsement policy. We then tested a number of transactions and compliance rules (sanction list and required supporting documents). The implementation result showed that a transaction will be rejected with the correct reason, whenever there is a missing supporting document, or a party in the sanction list involved, or insufficient approvals gathered. This prototype implementation shows that it is feasible to implement our proposed solution of Blockchain-based distributed compliance for MNC’s intercompany transactions. Due to space limitation, more details of the implementation are omitted here. Meanwhile this prototype is also limited in terms of the scale and complexity. The proposed solution requires further development before it gets implemented in practice. The converting tools of regulations, document storage, security and privacy, performance issues require further investigation.
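As an illustration of how the prototype's rules could be expressed, the sketch below encodes the scenario's sanction-list check, per-entity document requirements and the three approval thresholds as plain Python. The data values and function names are invented; this is not the authors' chaincode.

```python
# Illustrative encoding of the prototype scenario's checks; all data invented.

SANCTION_LIST = {"BlockedCo Ltd"}
REQUIRED_DOCS = {"B": {"invoice", "service_agreement"},
                 "C": {"invoice", "contract_annex"}}

def required_approvers(amount_usd):
    approvers = {"SubsidiaryB"}                 # always required
    if amount_usd > 100_000:
        approvers.add("TaxDept")                # Tax sign-off above US$100K
    if amount_usd > 500_000:
        approvers.add("CFOOffice")              # CFO sign-off above US$500K
    return approvers

def verify_payment(payment):
    if payment["payee"] in SANCTION_LIST:
        return False, "party on sanction list"
    for entity, docs in REQUIRED_DOCS.items():
        missing = docs - set(payment["documents"])
        if missing:
            return False, f"missing documents for {entity}: {sorted(missing)}"
    missing_approvals = required_approvers(payment["amount_usd"]) - set(payment["approvals"])
    if missing_approvals:
        return False, f"insufficient approvals: {sorted(missing_approvals)}"
    return True, "transaction accepted"

if __name__ == "__main__":
    print(verify_payment({"payee": "SSC B", "amount_usd": 250_000,
                          "documents": ["invoice", "service_agreement", "contract_annex"],
                          "approvals": ["SubsidiaryB", "TaxDept"]}))
```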
5 Conclusion In this paper, we have discussed the problem of compliance and audit in an MNC's cross-border intercompany transactions, and investigated the challenges of compliance enforcement in the process of transactions across multiple jurisdictions. We proposed the solution of Blockchain-based distributed compliance to address this challenge. We first built Blockchain-based intercompany transactions by adjusting the approach of [10] to fit the MNC's scenario, then proposed a Blockchain-based compliance model, and finally integrated the two to achieve Blockchain-based distributed compliance. We also implemented a small prototype to show the feasibility of our solution. With this solution, any intercompany transaction in process between two Subsidiaries of an MNC can be verified automatically against the compliance rules of all related Jurisdictions. This reduces compliance and audit risk and addresses the challenges that MNCs are facing in cross-border intercompany transactions.
References 1. McDermid, D.: Integrated business process management: using state-based business rules to communicate between disparate stakeholders. In: Lecture Notes in Computer Science, vol. 2678, pp. 58–71 (2003). http://dx.doi.org/10.1007/3-540-44895-0_5 2. van der Aalst, W.M.P.: Business process management: a comprehensive survey. In: ISRN Software Engineering (2013) 3. Fdhila, W., Rinderle-Ma, S., Knuplesch, D., Reichert, M.: Change and compliance in collaborative processes. IEEE Xplore Document. (2015). http://ieeexplore.ieee.org/abstract/ document/7207349/ . Accessed 31 July 2017 4. Liu, Y., Muller, S., Xu, K.: A static compliance-checking framework for business process models. IBM Syst. J. 46, 335–361 (2007) 5. Schultz, M.: Enriching process models for business process compliance checking in ERP environments. In: vom Brocke, J., et al. (eds.) DESRIST 2013, LNCS, vol. 7939, pp. 120– 135 (2013) 6. Becker, J., Delfmann, P., Eggert, M., Schwittay, S.: Generalizability and applicability of model-based business process compliance-checking approaches—a state-of-the-art analysis and research roadmap. Bus. Res. 5, 221 (2012) 7. Ghose, A.K., Koliadis, G.: Auditing business process compliance. In: Proceedings of the International Conference on Service-Oriented Computing (ICSOC-2007). Lecture Notes in Computing Science, vol. 4749, pp. 169–180 (2007) 8. Kuhn, J.R., Sutton, S.G.: Continuous auditing in ERP system environments: the current state and future directions. J. Inf. Syst. 24, 91–112 (2010) 9. Mondéjar, R., García-López, P., Pairot, C., Brull, E.: Implicit BPM: a business process platform for transparent workflow weaving. In: Lecture Notes in Computer Science, vol. 8659, pp 168–183 (2014). http://dx.doi.org/10.1007/978-3-319-10172-9_11 10. Weber, I., Xu, X., Riveret, R., Governatori, G., Ponomarev, A., Mendling, J.: Untrusted business process monitoring and execution using blockchain. In: Lecture Notes in Computer Science, vol. 329–347 (2016). http://dx.doi.org/10.1007/978-3-319-45348-4_19 11. Action 13: Country-by-country reporting implementation package. (2015). https://www. oecd.org/ctp/transfer-pricing/beps-action-13-country-by-country-reporting-implementationpackage.pdf 12. Swan, M.: Blockchain: Blueprint for a New Economy. O’Reilly Media, Sebastopol (2015) 13. Tschorsch, F., Scheuermann, B.: Bitcoin and beyond: a technical survey on decentralized digital currencies. IEEE Commun. Surv. Tutor. 18(3), 2084–2123 (2016) 14. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system, May 2009 15. Wood, G.: Ethereum: a secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper, 151, April 2014 16. Hyperledger fabric. https://hyperledger-fabric.readthedocs.io/2017
HIVE-EC: Erasure Code Functionality in HIVE Through Archiving Aatish Chiniah(&) and Mungur Utam Avinash Einstein Faculty of ICDT, University of Mauritius, Reduit, Mauritius
[email protected]
Abstract. Most of the research being conducted in the area of cloud storage using Erasure Codes is mainly concentrated on either finding optimal solutions for lower storage capacity or lower bandwidth consumption. In this paper, our goal is to provide Erasure Code functionality directly from the application layer. For this purpose, we reviewed some application layer languages, namely Hive, Pig and Oozie, and opted to add EC support in Hive. We develop several Hive commands that allow Hive tables to be first archived and then encoded or decoded with different parameters, such as join and union. We test our implementation using the MovieLens dataset locally and on the cloud. We also compare the performance against a replicated system. Keywords: Cloud storage
Erasure code Hive Hadoop Archiving
1 Introduction Most cloud storage systems use replication to guard against hardware failures, which are quite frequent in data centers. Erasure codes (EC) have been proposed as an alternative; their main advantage compared to replication is that they provide higher fault-tolerance for lower overheads. However, as erasure codes were originally designed for a different environment (error control in the transmission of one-time messages over an erasure channel), they do not consider two of the essential constraints/properties of distributed storage systems [1]: (i) data is scattered among a large number of storage nodes connected through a network with limited bandwidth, and (ii) data has a long lifespan, during which its content may be updated. These constraints result in erasure codes being used mostly for archiving purposes [2]. Cloud systems running on Hadoop use an ecosystem as shown below in Fig. 1: Oozie [3], Pig [4] and Hive [5] are application layer languages that allow the manipulation of data stored in the physical layer. To this end, the queries are translated into MapReduce tasks that allow parallelism amongst clusters. Erasure code libraries like JErasure [6] and Reed-Solomon [7, 8] enable the deployment of erasure coding in the Hadoop environment. However, most manipulation still needs to be performed through commands. Even following the deployment of WebHDFS [9], which allows file system commands to be executed on Hadoop clusters through a web interface, functions for calling erasure code operations are non-existent. Given the recent successful deployment and adoption of Hive at Facebook, we implement Hive Query Language (HiveQL) queries that first of all perform archiving (not available at present) and then perform EC operations like encoding or decoding using specific schemes.
© Springer Nature Switzerland AG 2019 K. Arai et al. (Eds.): FICC 2018, AISC 887, pp. 321–330, 2019. https://doi.org/10.1007/978-3-030-03405-4_21
Fig. 1. Hadoop eco system
The rest of this paper is structured as follows: we first give a short overview of Hive and database archiving as background in Sect. 2. Related work is reviewed in Sect. 3, and the details of the new archiving HiveQL commands and their implementation are given in Sect. 4. The experimental results are presented in Sect. 5, future work is discussed in Sect. 6, and the paper is concluded in Sect. 7.
2 Background
2.1 Hive
Business Intelligence and Analytics is an important player in the world of cloud computing, as it generates very valuable opportunities for businesses in connection with the cloud. The more data is analyzed and processed, the more output can be generated. The amount of data available is growing exponentially to hundreds of petabytes. Even though the map-reduce programming model provides an interface to deal with such amounts of data, strict manipulation and high-throughput queries are still significant issues. Hive is an open-source data warehousing solution that runs on top of Hadoop, which provides HDFS as its file system. Hive allows query operations with SQL-like statements, implemented using HiveQL, which is a combination of SQL and map-reduce operations. Hive's data model is similar to a traditional RDBMS, that is, it stores data in tables, rows, columns and partitions. A partition is a group of rows from a given table. Hive supports data types such as integers, floats, doubles and strings as well as more complex structures like maps, lists and structs. Tables created using HiveQL are serialized and de-serialized using the SerDe Java interface [10]; as such, tables are converted into files to be stored in Hadoop. The query language in Hive is a subset of SQL. It supports all major query statements such as join, group by, aggregations, unions and create table. However, there is no INSERT INTO, UPDATE or DELETE. Apart from those query statements, HiveQL also has some extensions that allow analysis to be expressed as map-reduce programs.
As mentioned earlier, Hive is built on top of Hadoop (Map-Reduce and HDFS) and has several components, as shown in Fig. 2. In Hive itself, these components are present: Metastore, Driver, Query Compiler, Execution Engine, HiveServer, Command Line Interface, JDBC and ODBC. The Metastore holds the metadata about the locations of tables, columns, partitions and so on. The Driver component is the interface that connects Hive to the file system. It also manages sessions and related statistics. The Query Compiler is the component that transforms the HiveQL statements into a directed acyclic graph of map-reduce tasks. The Execution Engine performs the actions as compiled by the Query Compiler component. The HiveServer provides a Thrift interface which, together with JDBC/ODBC, allows Hive to be integrated with other applications.
Fig. 2. System architecture of Hive
There was also a Web Interface (HWI); however, it has been removed as of Hive 2.2.0, and now the WebHCat API can be linked to a webpage to retrieve information.
2.2 Pig
In [4] the authors propose another data processing environment used at Yahoo!, called Pig, and its related language Pig Latin. It is intended to sit between the declarative style of SQL and the low-level, procedural style of map-reduce.
On one hand, the issue with parallel databases with SQL interfaces is that they force developers away from their preferred technique of writing simple scripts towards writing declarative queries in SQL. On the other hand, the issue with map-reduce is that its one-input, two-stage data flow is extremely rigid and restrictive, which makes user code hard to adapt and maintain. Pig Latin combines the best of the two schemes: high-level declarative querying in the spirit of SQL and low-level, procedural programming of map-reduce. Furthermore, the paper presents a novel debugging environment that comes integrated with Pig, named Pig Pen, which makes it easier for users to develop and debug their programs in an incremental manner.
2.3 Database Archiving
Archiving is the process of compressing and storing part of an information system away from the original in order to maintain the response time of the system. Unlike backup, which makes an exact copy of the whole system so that it can be recovered if the main copy is ever damaged, archiving can be done for part of the data or all of it. Archiving is most often done on transactional data with the following requirements:
• Preserving records of background data used to inform or justify a significant decision.
• Preserving records of important events or transactions that were stored in a database.
• Preserving structured information of historical interest.
For transactional databases hosted on the cloud, it is therefore imperative to have an archival policy, to lighten the database and provide reasonable service time. Not doing so will lead to rapidly growing table sizes, and thus search time will grow exponentially. The archive process [11] is done in four steps, as shown in Fig. 3 below:
Fig. 3. Archiving process
Defining the archival period and how much of the data is still active is the first step. Then determine which tables of the database need to be archived. Finally, perform the archiving and export the generated file to the specified destination.
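As a rough illustration of these four steps, the hypothetical sketch below drives the selection and export of archivable tables. The function and field names are placeholders, and a real implementation would issue the actual HiveQL archive statement instead of the stand-in used here.

```python
# Hypothetical sketch of an archival-policy driver; names are placeholders.

from datetime import date, timedelta

def select_archivable_tables(tables, active_window_days, today=None):
    """Steps 1-2: fix the archival period and pick tables that fall outside it."""
    today = today or date.today()
    cutoff = today - timedelta(days=active_window_days)
    return [t for t in tables if t["last_accessed"] < cutoff]

def archive_and_export(table, destination):
    """Steps 3-4: perform the archiving, then export the generated file."""
    # Placeholder for the actual ARCHIVE statement and the file transfer.
    return f"archived {table['name']} -> {destination}"

if __name__ == "__main__":
    tables = [{"name": "orders_2015", "last_accessed": date(2016, 1, 10)},
              {"name": "orders_live", "last_accessed": date.today()}]
    for t in select_archivable_tables(tables, active_window_days=365):
        print(archive_and_export(t, "hdfs:///archive/"))
```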
3 Related Works Most of the works reviewed are based on relational database and only a few were found in the domain of archiving cloud based databases. Tools like RODADB [12] and SIARD [13] use an approach to preserve business system records by exporting the contents of all of the tables in the database in an open XML format. And later on the XML archive can be loaded to a different SQL database platform to allow ongoing access over time. RODADB now also offers the possibility to store the archive using cloud services such as AWS and SmartCloudPT. Commercial database archiving software tools like HPAIO [14] and CHRONOS [15] are primarily designed to purge data from large transactional databases to reduce storage costs and improve performance. They use a similar export-all-tables approach for retiring business systems, but they also have functionality to assemble ‘data objects’ (and so archival records) from their constituent columns and tables and extract these in XML format.
4 Implementation In order to have an archiving policy for a database running on Hive, we need to implement the following:
• Create the archive command in the CLI module of Hive.
• Have corresponding functions in the Web Interface linked to the CLI command, or implement a new module in the WebHCat API for the archive command.
• Add components to the Hive Compiler-Optimizer-Executor.
• Enable Erasure-Code functionalities in Hadoop (Raidnode).
4.1 Archive Command
We followed the general principle of the mysqldump [16] command to implement the Hive archive command. The syntax is as follows:
As with any HiveQL query, first the source is specified, then the main operation is called; in our case, it is the archive operation. Parameters after the keyword ARCHIVE indicate whether the output will be replicated or Erasure Coded. For 'rs', the scheme is predefined in the configuration file of the raidnode as either (3,2), (6,3) or (10,4). There is another optional parameter, which is used to specify whether the output will be
overwritten or not. The rest is a standard select statement. Most of the time, archival data is retrieved according to a timestamp, but it can also be done by ID. The first implementation allowed only one table to be archived at a time, so we added the join and union operations to be used in conjunction with the archive command, with the following syntax:
Using the join operation, tables can be grouped together, then filtered, archived and encoded as a single object. However, this might result in one object holding gigabytes of data, and encoding and decoding such a big object will consume bandwidth and computation intensively. A solution would be to use partitioning. Archiving a partitioned table requires three steps: first create the partitioned table, then populate it, and finally archive it.
This statement will create the partitioned table, and then to populate it, we need to use the following statement:
Then call the archive operation, joining other dependent tables with it, so that upon recovery the database will be fully functional.
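The original syntax listings referred to in this subsection are not reproduced in this text. Based solely on the prose description (an ARCHIVE keyword, a parameter choosing replication or an 'rs' scheme taken from the raidnode configuration, an optional overwrite flag, and an otherwise standard select with optional join), statements of roughly the following shape could be imagined. These strings are a guess for illustration only, not the authors' actual HiveQL extension.

```python
# Hypothetical reconstructions of the archive statements, kept as plain strings.
# Keyword order, clause names and column names are all assumptions.

archive_single_table = """
FROM sales_2015
ARCHIVE rs OVERWRITE
SELECT * WHERE sale_ts < '2016-01-01'
"""

archive_joined_tables = """
FROM sales_2015 s JOIN customers c ON (s.cust_id = c.id)
ARCHIVE rs
SELECT s.*, c.* WHERE s.sale_ts < '2016-01-01'
"""
```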
4.2 WebHCat API
The Hive GUI was implemented in HWI (Hive Web Interface); however, it has been deprecated as of the Hive 2.2.0 release. Instead, the WebHCat API is a middleware that can perform HDFS operations when called from a web interface. In our web service, we provide the necessary GUI to select and launch the archive operation. In turn, we implemented the archive function in the WebHCat API, which calls the raidnode after the partitioning and join have been done.
4.3 Hive Server
The Hive Server consists of the Compiler-Optimizer-Executor. We implement corresponding functions that link the operations from WebHCat to Hadoop.
4.4 Erasure Code in Hadoop
The raidnode needs to be configured through the raid.xml before the launch of Hadoop server. As mentioned before the schemes available are 3-2 or 6-3 or 10-4, where the first digit indicates the number of data blocks and the second indicates the number of parity blocks, thus number of failures that can be tolerated.
5 Experiments We benchmarked the implementation with experiments run on a cluster of 8 nodes, which has one powerful PC (4x3.2 GHz Intel processors with 8 GB of RAM and a 1 TB HDD) hosting the NameNode/RaidNode and 7 HP PCs acting as clients hosting DataNodes (each with a 3.2 GHz Intel processor, 4 GB RAM and a 500 GB HDD). The average bandwidth of this cluster is 12 MB/s. We ran two sets of experiments. In the first set, we compare the performance of archiving (filter, copy and encode) the two datasets obtained from MovieLens, of 19 GB and 2 GB respectively. We did our experiments locally and using Amazon EC2 with 20 nodes. In the second experiment, we measure the recovery performance under failures. From Fig. 4, it can be seen that replication is more performant; however, this scheme allows usage of only 33% of the available storage capacity. If EC (6,3) is used, a storage capacity usage of 66% can be achieved with the same level of fault-tolerance, that is, 3 block failures. For an even better percentage of storage capacity (71%), EC (10,4) can be used, with a tolerance of four failures.
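The storage-efficiency figures quoted above follow directly from the code parameters: a (k, m) scheme stores k data blocks plus m parity blocks, so the useful fraction of raw capacity is k / (k + m), while 3-way replication uses one third. The short check below reproduces the approximate percentages.

```python
# Quick check of the storage-efficiency figures quoted in the text.

def ec_efficiency(k, m):
    # k data blocks out of k + m stored blocks hold user data
    return k / (k + m)

print(f"3-way replication: {1/3:.1%} of raw capacity holds user data")
print(f"EC(6,3):  {ec_efficiency(6, 3):.1%}  (tolerates 3 block failures)")
print(f"EC(10,4): {ec_efficiency(10, 4):.1%}  (tolerates 4 block failures)")
```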
Fig. 4. Archive performance
As mentioned, in our second experiment we measure the recovery time in cases of failures that lead to the loss of blocks. These failures might also result in the loss of parity blocks. As such, for this experiment we used EC(10,4), which tolerates more failures, and the performance is shown in Fig. 5.
Fig. 5. Recovery performance
Again we find that the recovery performance of replication is better, but it can only tolerate three failures. EC (10,4), though it has the worst performance, can tolerate four failures with the optimal usage of storage capacity.
6 Future Works Having implemented the basic HiveQL support for EC, we now plan to add further queries that would take full advantage of the use of Erasure Codes. We also plan to make use of HiveQL queries to leverage the distribution of blocks based on the locality, availability or demand of blocks. Lastly, we plan to enhance NoSQL functionalities, such as leveraging data locality while executing Hadoop MapReduce tasks.
7 Conclusion In this work, we tackled the issue of applying erasure codes and having a fault-tolerant cloud-based archiving system. We implement a fully functional web interface connected to Hadoop and HDFS through the WebHCat API that allows archiving per table, joined table or partitioned table. The major advantages of our solution are especially pronounced in the user-friendliness of an HDFS-based system, and in storage efficiency through erasure codes, which provide more fault tolerance than replicated systems. Our current implementation of HIVE-EC runs in a local environment and has easily been migrated to the cloud, since it was built on Hadoop from the start. As a future enhancement, we intend to further elevate the system by taking advantage of the processing power of the clusters in a cloud infrastructure, and thus scale the system. Acknowledgement. We thank Associate Professor Anwitaman Datta from NTU, Singapore, for his constant support and expert reviews that greatly assisted the research.
References 1. Esmaili, K.S., Pamies-Juarez, L., Datta, A.: The CORE storage primitive: cross-object redundancy for efficient data repair & access in erasure coded storage. CoRR, vol. abs/1302.5192 (2013) 2. Pamies-Juarez, L., Oggier, F.E., Datta, A.: Data insertion and archiving in erasure-coding based large-scale storage systems. In: ICDCIT, pp. 47–68 (2013) 3. Islam, M., Huang, A.K., Battisha, M., Chiang, M., Srinivasan, S., Peters, C., Neumann, A., Abdelnur, A.: Oozie: towards a scalable workflow management system for hadoop. In: Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, p. 4. ACM (2012) 4. Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of map-reduce: the Pig experience. Proc. VLDB Endow. 2(2), 1414–1425 (2009) 5. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009) 6. Plank, J.S., Greenan, K.M.: Jerasure: A library in C facilitating erasure coding for storage applications–version 2.0. Technical Report UT-EECS-14-721. University of Tennessee (2014) 7. Beach, B.: Backblaze releases the Reed-Solomon Java Library for free. Backblaze Blog| Cloud Storage & Cloud Backup (2017). https://www.backblaze.com/blog/reed-solomon. Accessed 3 Aug 2017 8. GitHub: openstack/liberasurecode (2017). https://github.com/openstack/liberasurecode. Accessed 3 Aug 2017 9. Hadoop.apache.org: WebHDFS REST API (2017). https://hadoop.apache.org/docs/r1.0.4/ webhdfs.html. Accessed 10 July 2017 10. Chandole, N.S., Kulkarni, C.S., Surwase, M.D., Shelake, S.M.: Study of HIVE Tool for Big Data used in Facebook. Ijsrd.com (2017). http://ijsrd.com/Article.php?manuscript= IJSRDV5I30070. Accessed 1 Aug 2017 11. Fitzgerald, N.: Using data archiving tools to preserve archival records in business systems— a case study. iPRES (2013) 12. KEEP SOLUTIONS: RODA | Repository of Authentic Digital Objects (2017). http://www. keep.pt/produtos/roda/?lang=en. Accessed 22 Nov 2017 13. Loc.gov.: SIARD (Software Independent Archiving of Relational Databases) Version 1.0 (2017). https://www.loc.gov/preservation/digital/formats/fdd/fdd000426.shtml. Accessed 2 Aug 2017 14. Saas.hpe.com.: Application Archiving & Retirement Software, Structured Data | Hewlett Packard Enterprise (2017). https://saas.hpe.com/en-us/software/application-databasearchiving. Accessed 29 July 2017
15. Brandl, S., Keller-Marxer, P.: Long-term archiving of relational databases with Chronos. In: First International Workshop on Database Preservation (PresDB 2007), Edinburgh (2007) 16. Dev.mysql.com:. MySQL :: MySQL 5.7 Reference Manual :: 4.5.4 mysqldump—A Database Backup Program (2017). https://dev.mysql.com/doc/en/mysqldump.html. Accessed 9 Aug 2017
Self and Regulated Governance Simulation Exploring Governance for Blockchain Technology Hock Chuan Lim(&) Faculty of Engineering and Information Sciences, University of Wollongong, Dubai, United Arab Emirates
[email protected]
Abstract. Blockchain technology and blockchain applications sit at the crossroad of data science and Internet of Things applications where getting the governance right for this new technological paradigm is of core concern for leaders aspiring to realize smart city and living initiatives. In this research, we deploy computational simulation of self and regulated governance and extend the findings to the new blockchain technology ecosystems. We propose that getting the governance approach right is as important as getting the technological platform issues resolved. Keywords: Blockchain technology Simulation Data science Smart living Self-governance Regulated governance
1 Introduction Digital connected age comprises all the current developments of systems, devices and connectivity and these items are enclosed within a digital sphere, where we now heard of the buzz phrase Internet of Things (IoT), a concept of an entire digital sphere, connected Internet-based devices, one that allows for development of technological infrastructures and smart applications; in short for the exchange of Things [1]. One of the many developments of this digital age is in the continuing development of future technologies and applications that is geared towards greater connected living, a plan to allow for connection of devices from households and communities; for sharing and use of digital data and resources. Complementing the IoT grand vision is the related development of blockchain technology that continues to be a controversial topic among government, business and technology leaders [2]. As we continue to advance into future digital age, the important focus of governance and blockchain technology and data management become a primary concern for all government, business and technology leaders. This research applies simulation project developed with modern game engine to study the various governance approaches and presents the initial findings in the blockchain technology context. The paper discusses the governance options for blockchain technology and generates greater awareness of governance trend for future computing landscape. This paper is organized into the following major sections: Sect. 2 looks at functional definitions; Sect. 3 addresses methodology, design and model; © Springer Nature Switzerland AG 2019 K. Arai et al. (Eds.): FICC 2018, AISC 887, pp. 331–340, 2019. https://doi.org/10.1007/978-3-030-03405-4_22
Sect. 4 presents findings, results and discussions and Sect. 5 suggests future work and concluding remarks.
2 Functional Definitions In our modern digital space, one of the most important aspects that need to be addressed is that of governance. However, governance of new technologies for the modern digital world is easier said than done. Due to the evolving nature of emerging paradigms and computing platforms, in some instances, formalized governance approaches and models may not be readily available while for others, it is simply not taken into consideration. For simple illustration, take the new “kid” in town, the “blockchain technology”. Some says blockchain is the future of modern computing, others say, blockchain technology and applications have great foundational potential that underpin smart cities and smart living initiatives. Most agree that a concept of governance as well as blockchain is of great importance and yet at the same time, these areas are not well understood. Here we look at some basic functional definitions that will aid us in our model formulation and study. 2.1
IoT, Smart Living and Data Science
The Internet of Things (IoT) is a vision to make our digital living more immersive and pervasive; to allow for objects to be linked, connected and to allow for ease of communications and data transfer. One of the applications of IoT is the “smart city” that aimed at achieving connected living. This application is envisioned to allow for swift administrations and provision of services. Not only are we seeing the rise of IoT, we are also seeing great interest in newer technologies such as wearable technologies, devices and blockchain technology that is highly data-driven. These applications all contribute towards the trend of living in a connected digital sphere. As reported by [3, p. 29] “…IoT and wearable technology will challenge existing social, economic, and legal norms. In particular, these technologies raise a variety of privacy and safety concerns…”, particularly for smart city and smart living and data science (DS); and to this new trend, we see the same for blockchain technology that is closely based on the tenets of data science. These new technologies require a good dose of quality governance. 2.2
Blockchain Technology
A blockchain is a digital collection of book-keeping records. The two types of records, essentially, are transaction information and blocks. A block refers to a time-stamped data structure that is linked to the previous block and cannot be altered retroactively. A primary hypothesis is that blockchain technology creates a system of distributed consensus in the digital space [1]. Blockchain applications, for this paper, refer to software applications designed with blockchain technology as a core feature. Blockchain applications interface with users and back-end servers.
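A minimal sketch of the time-stamped, hash-linked structure described above is given below; it is illustrative only and far simpler than any production blockchain.

```python
# Minimal hash-chain illustration of time-stamped, linked blocks.

import hashlib
import json
import time

def make_block(transactions, prev_hash):
    header = {
        "timestamp": time.time(),
        "prev_hash": prev_hash,
        "tx_digest": hashlib.sha256(
            json.dumps(transactions, sort_keys=True).encode()).hexdigest(),
    }
    # The block hash is computed over the header fields above.
    header["hash"] = hashlib.sha256(
        json.dumps(header, sort_keys=True).encode()).hexdigest()
    return {"header": header, "transactions": transactions}

chain = [make_block(["genesis"], prev_hash="0" * 64)]
chain.append(make_block(["pay supplier 100"], prev_hash=chain[-1]["header"]["hash"]))

# Altering an earlier block's contents changes its hash and breaks the link
# recorded in every later block, which is what makes retroactive edits evident.
print(chain[1]["header"]["prev_hash"] == chain[0]["header"]["hash"])   # True
```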
2.3 Data Governance
Governance refers to the efficient and effective management of resources within an appropriate framework [4]. In the modern business world, it also requires the use and exercise of political, economic and administrative powers in managing an enterprise's affairs. A governing body is considered effective and good if it can achieve various levels of goals and commitments efficiently, effectively and economically. Due to the grand objective of connected living, IoT and wearables, we have recently seen a resurgence of interest in digital governance. Traditionally, governance "…is understood as the design of institutions and the structure of authority to allocate resources and coordinate or control activities in the society…" [5, p. 341]. It is possible to view the governance concept in terms of levels of goals and objectives. From a resource perspective, these goals include:
• Level 1 – Smooth and operational as designed
• Level 2 – Well-being
• Level 3 – Fair and equitable
The governance study for this project will only address the level 1 goal and will not address the level 2 and 3 goals due to the scoping of this study.
3 Methodology, Design and Model Blockchain technology and its related application development are still in their infancy. Yet, in the last few years, interest in blockchain has been growing in diverse fields such as finance, e-government, healthcare and smart cities, and to date some significant pilot projects are surfacing. Due to the evolving and dynamic nature of blockchain development, putting in place suitable governance models and approaches is a nontrivial effort. In order to facilitate a better understanding of the governance approaches for blockchain technology, it is useful to apply computational simulation as a study technique. Since there is a general lack of a proper network of blockchain data setup and data flow, our project applied the study of "crowd blocks" in a movement simulation for the study of blockchain governance. We next report on the simulation design and model.
3.1 Simulation Design
Where governance rules are determined and finalized, each blockchain entity can be conceptualized as a single particle entity or in crowd simulation metaphor, a block of crowd. Crowd “block” traffic management are found in most city living. In scenarios where the crowd is dense, movement and traffic flow if not well managed can become congested. While short duration of congestion does not pose a serious threat, there are instances wherein congestion can lead to fatal accidents and can trigger other movement related difficulties, for example, in a crisis outbreak and in a panic situation. Such a case is similar in principle to a “herd stampede”, where the mass of the crowd can lead to fatal accidents. Hence, crowd traffic presents interesting issues for research and study.
The traditional approach to managing crowd traffic is via governance or regulation – that is, the setting up of control and administration points. This is common practice, for example: setting up traffic signals to regulate the time to move and the flow density; deploying electronic travellators to speed up crowd velocity and move the crowd at a faster pace to avoid crowding and congestion; opening up additional paths for the dispersion of traffic and crowd; and introducing human agency to regulate and control the crowd. These traditional measures, while effective, are by no means without their cost. Some of the measures are costly to implement, while others, especially those involving human agency, may not be immediately feasible or will incur a high cost. Hence, there is motivation to apply alternative means such as self-regulatory measures. Self-regulation implies that the individual object is capable of sensing changes in the environment, making the right decision and carrying out timely actions. These aspects of crowd "block" governance measures have features that can be mapped to the abstract model of the blockchain environment. The key is to formulate the right model.
3.2 Simulation Model
We simplify the model to represent the random arrival of particle objects. These particle objects form groups or crowds as they grow in intensity, and they all try to cross a designated channel, similar in principle to a physical travellator. This travellator moves at a pre-designated speed. Here each particle object represents an autonomous entity with its own computing facilities, able to sense the environment and carry out simple decision-making actions. Each entity can be interpreted as a representation of one blockchain system, and the sum total of all the particle objects represents the entire digital ecosystem. The use of a particle object is an abstraction to model simple generic cases. In addition, each particle object can in principle represent changes to the dynamic blockchain technology, for example a public or private blockchain. The key attributes and behaviors of the particle object are shown in Fig. 1, in simple UML notation. The particle objects are spawned randomly and all particle objects are given the same simple goal: to move towards the end of the channel and clear the channel from start to end. The channel can allow two particle objects to be side by side, or it can allow one particle object to pass through it. The entry point of the channel only allows two particle objects at any one time and, to model a traffic control point, signals that can regulate particle object movement are located before the entry point of the channel. Congestion that forms at the entry position is monitored and displayed in the simulation as particle objects are spawned and as the flow of block traffic progresses. Simple rules are given to the particle objects to allow for the simulation of self-regulation and movement. These rules are based on the electronic distance (eDist) of each particle object from the others. Self-regulation is modelled as a form of agency self-awareness: for example, if a particle object sees that it is too near to another particle object, it will give way to that particle object; otherwise, it will try to speed up and move towards the end of the travellator. This aspect of self-regulation is then encoded in two simple rules:
Fig. 1. Simple UML for a particle object.
(1) Rule 1: Maintain a distance (eDist) from 4 neighboring particle objects (front, back, left and right).
(2) Rule 2: If (eDist is < threshold) give way, otherwise move to goal.
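A minimal sketch of how these two rules could be encoded outside the game engine is given below. The particle state, the eDist threshold value and the sensing interface are simplified assumptions; the actual behaviour in this work lives in Unreal Engine blueprint nodes.

```python
# Illustrative, engine-free encoding of the two distance-based rules.

E_DIST_THRESHOLD = 1.5   # assumed units; the real threshold is a tuning parameter

def step(particle, neighbour_distances, goal_x):
    """neighbour_distances: dict with 'front', 'back', 'left', 'right' (or None)."""
    # Rules 1 and 2: if any sensed neighbour is closer than the threshold, give way.
    too_close = any(d is not None and d < E_DIST_THRESHOLD
                    for d in neighbour_distances.values())
    if too_close:
        particle["vx"] = 0.0                     # give way (hold position)
    else:
        particle["vx"] = particle["speed"]       # move towards the goal
    particle["x"] = min(goal_x, particle["x"] + particle["vx"])
    return particle

if __name__ == "__main__":
    p = {"x": 0.0, "vx": 0.0, "speed": 0.5}
    sensed = {"front": 2.0, "back": None, "left": 0.8, "right": None}
    print(step(p, sensed, goal_x=10.0))          # gives way because left < threshold
```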
3.3 Simulation Engine
The simulation is coded in Unreal Engine 4, a game engine. The use of a game engine for simulation is not a new concept (see [6, 7]). Unreal Engine 4 as the simulation engine of choice allows for quick and rapid prototyping [8], running of the experimental trials, adjusting of parameters and simulation tuning. The coding of the simulation model is done mainly with Unreal Engine blueprints, and some basic C++ is used for display and administration. Simulation logic and rule movement are placed inside the necessary blueprint nodes, and the required visualization is captured both as metrics and as attributes for onward processing. Other tools are used to allow for the integration of batch processing of the simulated runs. Self-regulation involves the use of computational logic based on the assigned rules. The movement nodes are shown in Fig. 2. The computational logic follows similar Unreal Engine blueprint visual scripting [9], in that it is built into suitable blueprint nodes. For our experimental simulation modelling, we applied the nature-inspired "lateral line system" as the computational logic for sensing the environment and for the decision making needed for non-player characters (NPC) or other particle objects. Fishes and aquatic species use a "lateral line system" to help them sense their environment and navigate in their marine world [10, 11]. This is an important non-visual cue in an environment where visibility is limited. The lateral line sensory system concept is applied to allow for the sensing of distances and the storage of required distance data as part of the self-regulation mechanism. The lateral line sensors are located at the ends of the channel. In the blockchain world, this sensory concept refers to a world state where the rule-making is completed and governance is applied.
Fig. 2. Blueprint movement nodes.
4 Findings, Results and Discussions The findings and results are compiled from a series of simulation gameplays and, collectively, they form indicative results. Here, we deviate from the traditional results compilation and statistical analysis; instead, we provide a visualization of congestion formation as shown below:
4.1 Results
We show some example congestion formation scenes in Figs. 3 and 4. Figure 3 shows a single lane where Section A represents the start of the channel and Section B represents the end of the channel. The small blue cubes represent the simulated particle objects.
Fig. 3. Single lane.
Fig. 4. Congestion, time t increases from 0.
4.2 Initial Discussions
In regulatory theory, the concept of layered regulation is a common metaphor. Regulation has always been seen as a form of top-down activity aimed at being responsible and responsive. In the light of the International Telecommunication Union's (ITU) definition of the Internet of Things (IoT) as the "development of item identifications, sensor technologies and the ability to interact with the environment…" [12, p. 1], our current digital sphere is indeed highly connected. Signal and information processing and social human behavior take centre stage in the light of digital connections. This digital lifestyle calls for moderation and a digital culture that is acceptable for future generations. The IoT era has rekindled our sense of socialness and we ask: should it be regulated or should it be self-regulated? This remains an active debate that continues to call for deeper research and understanding. While this rudimentary simulation did not champion self-regulation as a winning strategy, it does suggest that, all things being equal, in selected contexts self-regulation could be leveraged for management use. What is interesting in our gameplay simulation is not solely the strength of one or the other approach, but rather the visualization of local congestion (where local congestion refers to the immediate vicinity of the primary particle object) and other salient observations. We will focus on highlighting discussions on these elements.
4.2.1 Implicit Theme
Our preliminary results confirm previous understanding about governance and self-regulation as a largely implicit theme where each has its own merits when it comes down to management and control [12–14]. Recent research does suggest forms of "new governance" [13] wherein a blurring of roles needs to be noted, while others have called for a clear governance framework in the light of IoT developments [12]. This is especially relevant in the scenario wherein "…[t]he levelling effect of social media and the Internet have changed the way citizens relate to each other and to their institutions, demanding a much more participatory and engaged style of leadership and more shared
models of authority…" [15, p. 1807]. In reality, the view that self-regulation is part of a governance framework and within an established structure is more important than one that views self-regulation as a distinct, separate strategy. This suggests a possible option for blockchain technology as well.
4.2.2 Social Dimensions
Our simulation highlights the social dimension: without sensing the other particle objects and without sensing the environment and the externalities, it would not have been possible to break out of the congestion. This observation is of particular interest as we have more elements connected. It is not about pure decision-making; instead, it is also about having suitable environmental signals and information in order to facilitate decision making and the taking of actions. What does this have to do with more connected devices in the age of digital living? One of the immediate implications is that these devices usually operate in a machine-to-machine computing mode and, as such, the ability to "sense" and how to "sense" will need to be addressed. This also raises the man-in-the-loop concern and how we can design our information systems to balance out the need for this social dimension. In essence, the importance of governance in our social environment should not be overlooked, especially the need for a suitable governance framework for service oriented architectures (SOA) and service level agreements (SLA) in the new IoT era.
4.2.3 Culture Assumptions
Sensing and decision making are not individual traits. Our simulation reveals that unless we include a general sense of wanting to resolve congestion and wanting to avoid being stuck in congested locations, the overall clearing of congestion will not happen. We call this a general culture of being socially nice and acceptable. This is an important assumption that requires careful consideration. We note two important concerns. Firstly, there are many real-world cultural traits, and not all are socially gracious and nice. It is not uncommon to be in a place where feelings of discomfort and marginalization exist. This is usually from the perspective of groups or communities that view themselves as "lower caste" or of lower social ranking, and different social behaviors may arise. Secondly, it takes time for culture to grow and build, and so even if the self-regulatory strategy clears the local congestion, in a real-world scenario time may be needed to grow such a culture. Hence, to assume a "culturally nice mentality" may be too simplistic. In our simulation, we have chosen to simplify this aspect. In our blockchain technology context, having appropriate rules that are culturally fair and balanced is an important assumption.
5 Future Works and Concluding Remarks We set out to model and study governance and self-regulation using experimental gameplay simulation. We found out and confirm previous understanding about governance and self-regulation and in the recent trend and pattern of “new governance”, a form of regulatory mind-set. Not only that, we uncovered additional observations and
interesting outcomes. Firstly, we noticed that game engines and experimental gameplay simulation can be a viable approach for simulation and research and, with the recent advances in game engine development, this is becoming even more attractive and warrants serious consideration. Secondly, nature has its own way of resolving problems and we can learn a lot from it. The nature-based ecosystem and the nature-inspired lateral line system are good examples. We have just touched the tip of the iceberg, and these naturally occurring mechanisms are worth our attention and further study. Next, while the model was intended to be simplified, we noticed the value of the sociocultural assumption. In the days of IoT, such sociocultural dimensions will need to be addressed. How they can be addressed may not be trivial, for example, how respective blockchain technology systems play out in the digital world. Careful attention to governance and self-regulation for blockchain technology will be helpful. Self-regulation would be useful if it is within an appropriate framework of governance; it must not be seen as a separate strategy, but rather as a continuous approach. Finally, we intend to continue with our experimental gameplay simulation and work towards batch distributed simulations to allow for rigorous statistical analysis, testing and evaluation.
Emergency Departments: A Systematic Mapping Review
Salman Alharethi(1), Abdullah Gani(2), and Mohd Khalit Othman(3)
(1) Department of Computer System and Technology, FSKTM, University of Malaya, Kuala Lumpur, Malaysia
[email protected]
(2) Centre for Mobile Cloud Computing, FSKTM, University of Malaya, Kuala Lumpur, Malaysia
[email protected]
(3) Department of Information System, FSKTM, University of Malaya, Kuala Lumpur, Malaysia
[email protected]
Abstract. Emergency services are essential and any person may require these services at some point in their lives. Emergency services are run by complex management and consist of many different parts. It is essential to establish effective procedures to ensure that patients are treated in a timely fashion. By obtaining real-time information, it is expected that intelligent decisions can be made. Hence, thorough analysis of problems concerning appropriate and effective operational management would help prevent patient dissatisfaction in the future. Mapping studies are utilized to scope and explore a research theme, whereas systematic reviews are utilized to combine evidence. Improvement strategies and quality measurements in the health care industry, specifically in emergency departments, are essential to assess patients’ level of satisfaction and the quality of the service provided based on patients’ experience. This paper surveys the methodologies utilized by researchers from 2010 onward, with an emphasis on patient satisfaction in the emergency services sector.

Keywords: Emergency department · Health care · Real-time algorithm · Overcrowding · Waiting time · Systematic mapping
1 Introduction
Scoping studies [1–5] involve searching the existing literature through taxonomy to identify similarities between search methods and paper collections. Emergency services or Emergency Departments (EDs) manage various types of severe emergencies through in/out-of-hospital medical care. For the assessment of health technology, it is essential to include a decision-analytic model. This technological
analysis needs to be updated at hospitals, as modeling methods are required to manage interactions between patients and ED staff as well as patient care pathways. System analyses are also required due to the complex nature of EDs and the various issues experienced with them. Numerous studies exist on mathematical models in health care, but these do not include mathematical models in EDs even though such models are vital to reduce the long waiting periods experienced in EDs. Major focuses for future research directions include the healthcare workforce, assets to be used during emergency conditions, effective processes, patient experiences within ED systems with respect to quality health care, and resource allocation using real-time algorithms.
1.1 Study Motivation
Defining the motivation for research and its processes is essential. Responsive collaboration and inventive problem solving allow researchers to take competitive actions and approaches. Saudi Vision 2030 focuses on economic diversification to achieve national goals by valuing performance and measuring sustainable action through its governance model. One of Saudi Vision 2030’s main goals is to implement “efficient and high-quality health care”, improving the quality of health care services by increasing the effectiveness and output of care and boosting the accessibility of health care services to citizens. Improved ED systems allow for better resource utilization, assets, and economic stability, all of which have long-lasting effects.
1.2 Knowledge Gap
The topics studied were classified based on ED mapping of 381,860 articles from 1864–2017. The same research methodology was used earlier in [32]. All ED activity was well represented in the results. The main problems and methods of EDs were classified while looking for themes. Gaps were found in the health care industry, emergency preparedness, quality of health care, performance measurement, and other areas, as shown in the answers to research questions RQ2, RQ3 and RQ6.
2 Background
Mathematical modeling techniques exist to map industrial engineering and operation processes or systems and provide a simple structure for real-world applications. Although EDs have limited resources, they provide acute care for a large percentage of the admitted patient population. Resource utilization, throughput, and wait times are parts of ED system behavior measurements. Overcrowding can occur in EDs if waiting periods are long, and this may increase patient mortality risks. In addition, patients may leave without being seen, resulting in them being readmitted to EDs later. Organizational, physical, and human factors must be considered in ED patients and environments. For instance, management systems, equipment, buildings, patients’ real-time algorithms, and their links must also be considered. Basic requirements include waiting areas and spaces that avoid overcapacity during
hazardous times. The following order is used to deal with patients: registration, triage, examination, X-rays and blood tests, evaluation, pharmacy, ED bed location and ED staff, handling, allocation, and discharging. ED wait times may be long due to overcrowding. In addition, demand might exceed capacity, the number of beds might be insufficient, capacity management might be suboptimal, and patient acuity and service demand may vary [2]. From 2000–2009 [2], discrete-event simulation was the most common method used in EDs, especially in the UK health care system. To a lesser degree, system dynamics has also been used to improve wait times in EDs (see Fig. 1).
Fig. 1. Methods used to solve EDs problems from 2000–2009.
EDs aim to meet an important health care objective, which is why they are considered the most critical part of the system. It is necessary for EDs to develop rational solutions and procedures for both normal and disaster scenarios. Simulation software aims to address prevention-related issues, reduce wait times, and predict variables related to disaster situations in EDs. Simulation models identify issues that occur in real situations, including those pertaining to patient flow, arrival patterns, and the infrequent extraction of optimal resources in emergency response domains. The sources used for gathering data include direct sampling, historic data, hospital databases, and observation. Simulation methods have been applied to enhance resources and reduce wait times by implementing cost analyses and introducing strategic policies [7].
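To make the queueing intuition concrete, the following toy discrete-event sketch (not taken from the cited studies; the arrival rate, service rate, and number of treatment bays are illustrative assumptions) estimates the mean ED waiting time for Poisson arrivals and exponential treatment times.

```python
import heapq
import random

def simulate_ed_wait(arrival_rate, service_rate, servers, n_patients=10000, seed=1):
    """Toy ED queue: Poisson arrivals, exponential service, `servers` parallel
    treatment bays, first-come-first-served. Returns the mean waiting time."""
    random.seed(seed)
    t = 0.0
    free_at = [0.0] * servers            # time at which each bay becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    for _ in range(n_patients):
        t += random.expovariate(arrival_rate)   # next patient arrival
        earliest = heapq.heappop(free_at)        # soonest available bay
        start = max(t, earliest)                 # treatment start time
        total_wait += start - t                  # time spent waiting
        heapq.heappush(free_at, start + random.expovariate(service_rate))
    return total_wait / n_patients

# Example: 10 arrivals/hour, average treatment of 15 min, 3 treatment bays.
print(simulate_ed_wait(arrival_rate=10, service_rate=4, servers=3))
```

Varying the number of bays or the arrival rate in such a sketch shows how quickly waiting times grow as demand approaches capacity, which is the effect the cited ED simulation studies aim to quantify.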
3 Methodology
The information is taken from recent, up-to-date sources to suggest and guide the study. Systematic mapping is used as a starting point to evaluate existing studies by subject and classify them in order to conduct a thematic evaluation. This systematic study comprehensively details previous research and is used to summarize a research area and detect research gaps. Systematic mapping is a preliminary study that allows researchers to review papers related to a certain theme [3], classify the research, and conduct a thematic evaluation. The systematic review process characterizes and summarizes existing research following a predefined protocol [4]. Therefore, offering an overview of a research field and identifying study gaps are the key targets of a mapping study.
3.1 Research Questions
To inform this study, the monitoring strategies used in [1, 3–6, 8, 32] were used to define the problems in EDs. The following research questions (RQs) were addressed:
• RQ1: Which techniques are used in EDs research?
• RQ2: Which topics are introduced in EDs?
• RQ3: When/where were studies published?
• RQ4: How do studies visualize their results?
• RQ5: What problems were addressed in existing studies?
• RQ6: How are studies classified/clustered?
The management of our research area is performed through mapping studies, and the RQs are developed to meet our aims systematically. Our aims for this systematic mapping study are (a) to obtain a general idea of the issues that require addressing in EDs, and (b) to review the approaches used in existing research.
3.2 Search for Primary Studies
The search was conducted in the following databases: ABI/INFORM [9, 10], Emerald [15, 16], IEEE Xplore [17–21], and ProQuest Dissertations and Theses Global [22–27]. These were chosen because they are comprehensive databases containing millions of publications, especially on EDs, engineering and computer science. Moreover, these databases are user friendly and have advanced search features. The identified keywords were as follows: emergency department, emergency medical care, emergency clinics, and methods. These were used to develop the following search strings:
• Set 1: Search terms related to scoping research on EDs (i.e., emergency department).
• Set 2: Search terms related to synonyms of the topic (e.g., emergency medical care and emergency clinics).
• Set 3: Search terms related to techniques (e.g., methods).
The keywords were classified based on the RQs and grouped into these sets. Each set was identified in the databases, and each search string can be found in Table 1. This study was systematized based on the date it was conducted: early 2017, late 2016. Table 2 shows the number of search results per database.

Table 1. Database searches
Database                                  Command search
ABI/INFORM, Emerald, IEEE Xplore,         (“emergency department” or “emergency medical care”
ProQuest Dissertations                    or “emergency clinics”) and (“methods”)

Table 2. Number of studies per database
Database                                  Search results   Date
ABI/INFORM                                103,025          1864–2017
Emerald                                   12,313           1898–2017
IEEE Xplore                               891              1924–2017
ProQuest Dissertations & Theses Global    265,631          1897–2017
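As a side note, the command search of Table 1 can be reproduced mechanically from the three keyword sets; the short Python sketch below is purely illustrative and is not part of the authors' tooling.

```python
# Keyword sets from Sect. 3.2; joining them reproduces the command search of Table 1.
set1 = ['"emergency department"']
set2 = ['"emergency medical care"', '"emergency clinics"']
set3 = ['"methods"']

query = "({}) and ({})".format(" or ".join(set1 + set2), " or ".join(set3))
print(query)
# -> ("emergency department" or "emergency medical care" or "emergency clinics") and ("methods")
```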
3.3 Study Selection
We excluded items based on several database features, as shown in Fig. 2. Quality assessment was based, in the first place, on an article’s citations; articles without citations were excluded in some cases. The following inclusion criteria were considered: studies focused on research methods for studying EDs, studies published between 2010 and 2016, and studies in the field of EDs. Finally, the following exclusion criteria were considered: studies not presented in full text, studies not peer reviewed, studies duplicating other work, and non-English studies. The numbers of articles included and excluded in the search process for each database are given in Fig. 2, and the final selected content is given in Table 4.
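A minimal sketch of how these inclusion and exclusion criteria could be applied programmatically is shown below; the record fields are hypothetical and only illustrate the screening logic, they are not the authors' actual selection tool.

```python
def passes_selection(study: dict) -> bool:
    """Apply the inclusion/exclusion criteria described above to one study record."""
    included = (
        study.get("focus") == "ED research methods"
        and 2010 <= study.get("year", 0) <= 2016
    )
    excluded = (
        not study.get("full_text", False)        # not presented in full text
        or not study.get("peer_reviewed", False)  # not peer reviewed
        or study.get("duplicate", False)          # duplicates other work
        or study.get("language") != "English"     # non-English study
    )
    return included and not excluded

# Example record (hypothetical values):
print(passes_selection({"focus": "ED research methods", "year": 2014,
                        "full_text": True, "peer_reviewed": True,
                        "duplicate": False, "language": "English"}))
```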
3.4 Data Extraction
The data extraction form used the modified template given in [32], which was updated to suit this study, as shown in Table 3. Each data area includes an item and a value. Data extraction was completed by the first author and reviewed by the second and third authors for validity and quality control.
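One lightweight way to represent a row of this extraction form in code is sketched below, assuming a hypothetical Python dataclass whose fields mirror Table 3 (this is not the instrument used by the authors).

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    """One row of the data extraction form (fields follow Table 3)."""
    study_id: int        # Study ID
    author: str          # Author Name
    year: int            # Year of publication (RQ3)
    eds_area: str        # Knowledge area in EDs (RQ2, RQ6)
    venue: str           # Journal name (RQ3)
    method: str          # Method used (RQ1)
    problem: str         # Problem identified (RQ5)
    visualization: str   # Style of presentation (RQ4)

record = ExtractionRecord(1, "Allnutt et al.", 2010, "Skills and Competencies",
                          "Australian Health Review", "Qualitative survey",
                          "Nurse practitioner role assessment", "Tables")
```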
3.5 Verification and Validation
The data collected has a strong degree of objectivity. This kind of validity is exposed to less risk than data obtained from quantitative analysis. To reduce this risk, a data compilation table was adapted to back the documented data; the table used in the data mining study
Fig. 2. Study selection process.
Table 3. Adapted data extraction table
Item                   RQ result                 RQ
Study ID               Number
Author Name            Name(s)
Year of publication    Calendar year             RQ3
EDs area               Knowledge area in EDs     RQ2, RQ6
Venue                  Journal name              RQ3
Method                 Method used               RQ1
Problem                Problem identified        RQ5
Visualization type     Style of presentation     RQ4
[32] to allow for reexamination. Data collection tables are used to document data and reduce risk. Further, data extraction can be rechecked, which also reduces risk. In this study, two different authors took these steps independently; when a common understanding is reached, the risk to validity decreases [3]. In this study, the information gathered was accurate and objective; thus, the risk was limited [1–8].
4 Results
Several publications published between 2010 and 2017 were identified and reviewed in each database; for more details, see Table 4. Other related data are given below, after Table 4, to answer the RQs in Sects. 4.1–4.6.
4.1 RQ1: Which Techniques Are Used in ED Research?
More than eight different methods and techniques were found to be used in EDs research. According to the ABI/INFORM database [9–14], literature reviews, interviews, and questionnaires are the main methods used in EDs research. According to the Emerald database [15, 16], queuing theory and focus groups or interviews (problem trees) are the main methods. According to the IEEE Xplore database [17–21], image processing/machine learning, neural network machine learning, and clustering and logistic regression algorithms are the main methods. According to the ProQuest Dissertations and Theses Global database [22–27], mixed methods, descriptive research, experimental research, and qualitative research are the main methods.
4.2 RQ2: Which Topics Are Introduced in EDs?
The topics screened were categorized based on ED research topics, and all ED activities are well represented. The main problems in EDs and the methods used to study them are covered by the mapping [9–27], and they are not influenced by a specific topic. Thus, research gaps were found in emergency preparedness, health care quality, patient satisfaction, performance measurement, and the health care industry, as shown in Figs. 3 and 4.
4.3 RQ3: When and Where Were Studies Published?
Many publications published between 2010 and 2016 were identified in each database. The earliest study identified was published in 1864. Interest in this field increased between 2010 and 2014 and dropped significantly in 2016. In this study, only peer-reviewed journals, conferences, and materials were included to answer this question. Figures 4, 5 and 6 provide an overview of the included articles and their targeted venues. Engineering, simulation, and process management account for only 2% of the total studies on EDs between 2010 and 2016.
4.4 RQ4: How Do Studies Visualize Their Results?
In this study, the visualization approaches of previous studies were identified (see Table 4). Most commonly, figures or graphs and tables are used to visualize data.
4 Morgans 2012 Emergency and Burgess Department or Ambulance Utilization 5 Rosenberg 2013 Skills and and Hickie Competencies
2011 Quality
3 Hanson
Australian Health Review
Australian Health Review
Australian Health Review
2012 Skills and Journal of Health Competencies Organization and Management
2 Fulop
ID Author Year Area in EDs Venue Name 1 Allnutt et al. 2010 Skills and Australian Health Competencies Review
Problem
Qualitative research using a review
Survey conducted as part of quantitative research using an information sheet and consent forms sent through email
Provided an ideal approach to community and home mental care
Assessment of nurse practitioner’s role as observed by a client along with their satisfaction with their nurse practitioner’s education, care, skill, and knowledge Investigation of how hybridity Qualitative research with interactive interviews to present can be utilized to re-speculate authority in services, as it accounts of how health care professionals describe leadership identifies change strategies that address initiative projects to grasp the utilization of various approaches Qualitative research using a Demonstrated that health care literature review centers need a structured strategy to enhance data quality and create a robust information culture that harnesses health information Qualitative research using a Defined and measured comprehensive literature review inappropriate emergency health service use in Australia
Method
Table 4. Extraction table
Text: Percentages and classification Text: Percentages and classification (continued)
Process map
Tables
Visualization Type Tables
2014 Management
2016 Quality: Process reengineering
2016 Quality: Engineering
8 Buttigieg et al.
9 Esfahani et al. 1
Australian Health Review
2010 Management
Method
Qualitative research on planning, hospital discharge, patient discharge, and discharge processes to conduct a systematic meta-review of controlled trials Health Organization Queueing theory to study the and Management time of arrival, exact time of triage, and total number of patients and arrival rates and system capacity measures and derive average queueing times and The theoretical relation between them Journal of Health Multiple case study on effective Organization and strategic planning and the project Management management methodologies of three units in Malta’s health care system, all of which are popular methods for improving the quality of health care services Segmentation methods, neural 38th Annual network/deep learning, and International convolutional neural networks Conference of the IEEE Engineering in classified into three groups as tracking-, model-, and filterMedicine and based Biology Society
Venue
Year Area in EDs
7 Lantz and Rosén
ID Author Name 6 Scott
Described vessel segmentation to ensure that the images obtained are of high quality by reducing their noise and enhancing their contrast
(continued)
Figures, tables, mathematical equations
Visualization Type Determined the relative efficacy Tables and of pre-discharge interventions to Text: reduce post-discharge problems Percentages and in adults classification Developed a technique based on Mathematical a queuing model to evaluate the equations and figures operational capacity of health (graph), tables services without process observation by appraising Skaraborg Hospital’s operative capacity during the triage process in the emergency department Tables, Determined the root causes of quality issues specific to the three figures, charts settings; objective trees were formed to suggest solutions to these quality issues
Problem
38th Annual International Conference of the IEEE Engineering Medicine and Biology Society 38th Annual International Conference of the IEEE Engineering Medicine and Biology Society 38th Annual International Conference of the IEEE Engineering Medicine and Biology Society 38th Annual International Conference of the IEEE Engineering Medicine and Biology Society
2016 Quality: Engineering
2016 Quality: Engineering
Venue
Year Area in EDs
13 Kadkhodaei et al.
2016 Quality: Engineering
12 Jamali et al. 2016 Utilization
11 Jafari et al.
ID Author Name 10 Esfahani et al. 2
Graphs, figures, mathematical equations Proposed a robust watermarking method where the watermark data are hidden to prevent the distortion of the region of interest
Experimental use of the robust watermark method in advanced image processing and in diagnostic/discrete Fourier transform
(continued)
Experimental algorithm with a Minimized problems in brain MR Graphs, method to join hybrid clustering images figures and logistic regression mathematical equations in
Graphs, figures, mathematical equations
Proposed an efficient prescreening mechanism for pigmented skin lesions
Visualization Type Figures, Proposed a method to enhance tables, the detection of melanoma through an analysis of enhanced mathematical equations images
Problem
Algorithms for digital image magnification of details and extraction features to detect in surfaces
in
Neural network, deep learning methods
Method
2013 Management
2000 Management
2014 Emergency Preparedness: Disaster Response
2011 Utilization Geometric Optimization
16 Gautam
17 Nikolai
18 Cheung
The University of Texas at Dallas
University of Notre Dame Ph.D. Thesis
Southern Illinois University Ph.D. Thesis
Queen’s University Ph.D. Thesis
Determined health beliefs and knowledge to determine the factors that predict demographic variables Coordinated new forms of Mixed-method study using quantitative research to observe, collective action to solve critical problems in crises at a specific collect, and analyze key time for a specific purpose to documents including past prioritize recommendations situation reports, after action reports, and exercise documents Proposed a method to determine problem severity and used and qualitative research to classification in the analysis of classify informal and formal data collected during evaluation interviews with emergency activities managers Algorithm to simulate the process Proposed a method to simplify problems and allow for their observation at different angles to find the shortest path to the solution with the fewest number of obstacles
(continued)
Figures, mathematical equations
Tables, charts
Tables
Evaluated public administration in real-world to identify failures and weaknesses associated with systems to reduce hazards Co-produced knowledge about a Tables, complex problem graphs
Mixed quantitative and qualitative examination of data from year (2008) using action research Multiple-method case study of systematic, scientific, systematic, and empirical knowledge Quantitative, cross-sectional, descriptive, correlational survey
University of Baltimore Ph.D. Thesis
2010 Emergency preparedness: operations
Visualization Type Tables
Problem
Method
Venue
Year Area in EDs
15 Donnelly
ID Author Name 14 Clark
Year Area in EDs
Venue
Method
Problem
Visualization Type Graphs, figures, mathematical equations
2013 Utilization: Resource Allocation
The University of California
Determined resource allocation Experimental use of Webster’s algorithm, real-time optimization and job scheduling with processors using real-time data methods, multi-user resource and proposed an online allocation (content-aware networking), adaptive Webster’s scheduling algorithm to maximize the quality of patient method, and simulation care methodology. Note: Extraction Table 4 columns include ID, Author Name, Year of Publication, Area of Knowledge in EDs, Venue of Publication, Method, Problem and Visualization Type.
ID Author Name 19 Pandit
Fig. 3. Subjects with a research gap in performance measurement.
Fig. 4. Overview of topics with research gaps in emergency preparedness and health care quality.
Fig. 5. Where studies were published.
Fig. 6. When studies were published.
4.5 RQ5: What Problems Were Addressed in Existing Studies?
Dynamic and iterative processes that decrease risks and exposure may be uncontrolled in some emergency management structures. Active and repetitive processes, which include parallel computing, dissemination, exchanges, and ethically sound knowledge applications in health care systems, can result in decreased service quality or inappropriate crisis management. Crisis management requires simulation, focus, memory, exceptions, people, authorities, and resources to be brought together at a specific time for a specific purpose. ED problems can be classified into major concerns, as shown in Table 4 and Fig. 8.
4.6 RQ6: How Are Studies Classified?
Figure 7 shows the classification of the content as scanned, whereas Fig. 8 shows the thematic cluster we built through a taxonomy of the content extracted from Table 4: skills and competencies, management, quality, emergency preparedness, and utilization of EDs. Furthermore, the classification of the scanned content showed that review-type papers were rare. Thus, systematic mapping and systematic review papers are appropriate to conduct.
Fig. 7. Classification of studies.
Fig. 8. Study thematic cluster
5 Conclusion
Various complex factors are present in the management of an emergency. It is necessary to use an analytical decision-making process so that a health technology can be evaluated based on its performance. This analytical system needs to be regularly updated, since modeling procedures are essential for the management of patient and staff interactions and patient care systems in hospitals. This need arises from the complicated nature of EDs and the problems that occur in them. Many studies have been conducted on mathematical models; however, few have been conducted on mathematical models in EDs. Such studies are vital to reduce wait times in EDs. Mapping research extracts vital issues and methods to devise
solutions [2]. Some mapping studies are currently being conducted [7]; however, fewer are being conducted on EDs research [17–21]. Important aspects for analysis include study selection quality and continuous research updates [8]. From our mapping study we have defined and explained the dynamic problems in EDs and approaches to manage these issues to attain positive outcomes. The objective of this research was to present a brief foundation as input to a systematic literature review, as in [8]; it is only to be used as secondary research. In developing nations, health care systems are quite poor, so it is important to manage issues and meet the demand for acute hospital-based health care. It is also necessary to manage the implementation risks of activity-based funding. The following are the solutions derived in this study. Emergency preparedness systems require continuous training and simulations along with information assessments. The primary factors are the people involved in, the authorities responsible for, and the assets to be used during emergency conditions. Patient experiences, patient satisfaction levels, effective procedures, patient safety, and quick-response programs should be major focuses.
6 Future Work
For ED simulation modeling, researchers should assess the present scenario and the research gaps identified in our study and in [7]. Multi-case studies of health care personnel should be carried out to determine workforce competence in terms of skills and capabilities [30]. EDs require leadership [10] within management [28] to ensure control in EDs [29]. Managing EDs and providing personnel with knowledge, regardless of the ED’s policies, structure, capacity, network, etc., is important to ensure informed analytic decision making through real-time algorithms and the effective management of emergency cases in normal and disaster situations using simulation models with decreased and controlled crowding. Analyzing techniques and utilizing the correct one to practice emergency procedures allows for their efficient implementation. Health care quality standards must be updated according to this systematic review. Figure 8 presents the features of and insights into the theme of the research to be conducted in the future within the context of emergency and risk management [31]. Future research should focus on the sustainability of implementing real-time data monitoring in EDs as well as the performance measurement of emergency systems.
References 1. Petersen, K., Vakkalanka, S., Kuzniarz, L.: Guidelines for conducting systematic mapping studies in software engineering: an update. Inf. Softw. Technol. 64, 1–18 (2015) 2. Lim, M., Nye, T., Bowen, J., Hurley, J., Goeree, R., Tarride, J.: Mathematical modeling: the case of emergency department waiting times. Int. J. Technol. Assess. Health Care 28(2), 93– 109 (2012) 3. Elberzhager, F., Münch, J., Nha, V.: A systematic mapping study on the combination of static and dynamic quality assurance techniques. Inf. Softw. Technol. 54(1), 1–15 (2012)
4. Paz, F., Pow-Sang, J.: A systematic mapping review of usability evaluation methods for software development process. Int. J. Softw. Eng. Appl. 10(1), 165–178 (2016) 5. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic mapping studies in software engineering. Eur. Assoc. Sci. Editors 8(26), 68–77 (2008) 6. Li, Z., Liang, P., Avgeriou, P.: Application of knowledge-based approaches in software architecture: a systematic mapping study. Inf. Softw. Technol. 55(5), 777–794 (2013) 7. Gul, M., Guneri, A.: A comprehensive review of emergency department simulation applications for normal and disaster conditions. Comput. Ind. Eng. 83, 327–344 (2015) 8. Kitchenham, B.: Procedures for performing systematic reviews. Keele University, Keele (2004) 9. Allnutt, J., Allnutt, N., O’Connell, J., Middleton, S., Hillege, S., Della, P., Gardner, G., Gardner, A.: Clients’ understanding of the role of nurse practitioners. Aust. Health Rev. 34 (10), 59–65 (2010) 10. Fulop, L.: Leadership, clinician managers and a thing called ‘hybridity’. J. Health Organ. Manag. 26(5), 578–604 (2012) 11. Hanson, R.: Good health information: an asset not a burden! Aust. Health Rev. 35(1), 9 (2011) 12. Morgans, A., Burgess, S.: Judging a patient’s decision to seek emergency healthcare: clues for managing increasing patient demand. Aust. Health Rev. 36(1), 110 (2012) 13. Rosenberg, S., Hickie, I.: Making activity-based funding work for mental health. Aust. Health Rev. 37(3), 277 (2013) 14. Scott, I.: Preventing the rebound: improving care transition in hospital discharge processes. Aust. Health Rev. 34(4), 445 (2010) 15. Lantz, B., Rosén, P.: Measuring effective capacity in an emergency department. J. Health Organ. Manag. 30(1), 73–84 (2016) 16. Buttigieg, S., Gauci, D., Dey, P.: Continuous quality improvement in a Maltese hospital using logical framework analysis. J. Health Organ. Manag. 30(7), 1026–1046 (2016) 17. Nasr-Esfahani, E., Samavi, S., Karimi, N., Soroushmehr, S., Ward, K., Jafari, M., Felfeliyan, B., Nallamothu, B., Najarian, K.: Vessel extraction in X-ray angiograms using deep learning. In: 38th Annual International Conference of IEEE Engineering in Medicine and Biology Society, Florida, USA, pp. 643–646 (2016) 18. Nasr-Esfahani, E., Samavi, S., Karimi, N., Soroushmehr, S., Jafari, M., Ward, K., Najarian, K.: Melanoma detection by analysis of clinical images using convolutional neural network. In: 38th Annual International Conference of IEEE Engineering in Medicine and Biology Society, Florida, USA, pp. 1373–1376 (2016) 19. Jafari, M., Samavi, S., Karimi, N., Soroushmehr, S., Ward, K., Najarian, K.: Automatic detection of melanoma using broad extraction of features from digital images. In: 38th Annual International Conference of IEEE Engineering in Medicine and Biology Society, Florida, USA, pp. 1357–1360 (2016) 20. Jamali, M., Samavi, S., Karimi, N., Soroushmehr, S., Ward, K., Najarian, K.: Robust watermarking in non-ROI of medical images based on DCT-DWT. In: 38th Annual International Conference of IEEE Engineering in Medicine and Biology Society, Florida, USA, pp. 1200–1203 (2016) 21. Kadkhodaei, M., Samavi, S., Karimi, N., Mohaghegh, H., Soroushmehr, S., Ward, K., Najarian, K.: Automatic segmentation of multimodal brain tumor images based on classification of super-voxels. In: 38th Annual International Conference of Engineering in Medicine and Biology Society, Florida, USA, pp. 5945–5948 (2016)
22. Clark, L.: Implementation of the National Incident Management System in New Jersey. Ph.D., University of Baltimore, School of Public Affairs, Baltimore, Maryland, USA (2010) 23. Donnelly, C.: Evaluation as a mechanism for integrated knowledge translation. Ph.D., Queen’s University, Faculty of Education, Kingston, Ontario, Canada (2013) 24. Gautam, Y.: A study of assessing knowledge and health beliefs about cardiovascular disease among selected undergraduate university students using health belief model. Ph.D. Southern Illinois University, Health Education, Carbondale, USA (2012) 25. Nikolai, C.: SimEOC: a virtual emergency operations center (VEOC) simulator for training and research. Ph.D. University of Notre Dame, Computer Science and Engineering, Indiana, USA (2014) 26. Cheung, Y.: Optimization problems in weighted regions. Ph.D., University of Texas, Computer Science, Dallas, USA (2011) 27. Pandit, K:. Real-time resource allocation and optimization in wireless networks. Ph.D., University of California, Computer Science, Davis, USA (2013) 28. Hjortdahl, M., Ringen, A., Naess, A., Wisborg, T.: Leadership is the essential non-technical skill in the trauma team: Results of a qualitative study. Scand. J. Trauma Resusc. Emerg. Med. 17(1), 48 (2009) 29. Pinkert, M., Bloch, Y., Schwartz, D., Ashkenazi, I., Nakhleh, B., Massad, B., Peres, M., BarDayan, Y.: Leadership as a component of crowd control in a hospital dealing with a masscasualty incident: lessons learned from the October 2000 riots in Nazareth. Prehosp. Disaster Med. 22(06), 522–526 (2007) 30. Harding, P., Prescott, J., Sayer, J., Pearce, A.: Advanced musculoskeletal physiotherapy clinical education framework supporting an emerging new workforce. Aust. Health Rev. 39 (3), 271 (2015) 31. World Health Organization: WHO’s six-year strategic plan to minimize the health impact of emergencies and disasters: 2014–2019. World Health Organization, Geneva, Switzerland (2015) 32. Almozayen, N., Othman, M., Gani, A., Alharethi, S.: Data mining techniques: a systematic mapping review. In: Saeed, F., Gazem, N., Patnaik, S., Balaid, A., Mohammed, F. (eds.) Recent Trends in Information and Communication Technology, pp. 66–77. Springer, Cham (2017)
Benchmarking the Object Storage Services for Amazon and Azure
Wedad Ahmed, Hassan Hajjdiab, and Farid Ibrahim
College of Engineering and Computer Science, Abu Dhabi University, Abu Dhabi, United Arab Emirates
[email protected], {hassan.hajjdiab, farid.ibrahim}@adu.ac.ae
Abstract. Cloud computing is increasingly being used as a new computing model that provides users rapid on-demand access to computing resources with reduced cost and minimal management overhead. Data storage is one of the most prominent cloud services and has attracted great attention in the research field. In this paper, we focus on the object storage of the Microsoft Azure and Amazon cloud computing providers. This paper reviews the object storage performance of both Microsoft Azure blob storage and the Amazon simple storage service. Security and cost models for both cloud providers are discussed as well.

Keywords: Cloud object storage · Microsoft Azure blob storage · Amazon S3 storage
1 Introduction
Cloud computing is becoming the mainstream for application development. The cloud, which is a metaphor for the internet, provides high-capability computing resources and storage services on demand. Cloud support could be represented in terms of software support, platform support, and developmental tools support. Cloud computing comes in many forms: platform as a service (PaaS), where developers build and deploy their applications using the APIs provided by the cloud; infrastructure as a service (IaaS), where a customer runs applications inside virtual machines (VMs), using the APIs provided by their chosen host operating systems; and software as a service (SaaS), which uses the web to deliver applications that are managed by a third-party vendor and whose interface is accessed on the client side. IaaS cloud providers are responsible for providing the data center and its infrastructure software at a reduced cost and with high availability. However, as a trade-off, cloud storage services do not provide strong guaranteed consistency. Due to replication, data could be inconsistent when a read immediately follows a write. Across the web, many vendors offer data storage that resides in the cloud. There are several vendors offering cloud services in the market today, such as Amazon, Google AppEngine and Microsoft Azure. Depending on access needs, one can access data stored in the cloud in three different ways: (1) using a web browser interface that enables moving files to and from the storage area; (2) through a mounted disk drive that looks like a local disk drive letter or mounted file system in the
computer; (3) for application developers, the storage services can be handled using a set of application program interface (API) calls. Cloud providers offer a variety of storage services such as object storage, block storage, file storage and VM disk storage. Object storage is designed for unstructured data, such as binary objects, user-generated data, and application inputs and outputs. It can also be used to import existing data stores for analytics, backup, or archive. Object storage processes data as objects and can grow indefinitely by adding nodes, which is what makes this kind of storage highly scalable and flexible. However, very high scalability sometimes affects the performance requirements. A single object could be used to store Virtual Machine (VM) images or even an entire file system or database. Object storage requires less metadata to store and access files, which reduces the overhead of managing metadata. The HTTPS/REST API is used as the interface to the object storage system. This paper benchmarks Microsoft Azure Blob Storage (MABS) and the Amazon simple storage service (Amazon S3). Amazon Simple Storage Service (S3) is a distributed data storage used to store and retrieve data as objects. A bucket is used as a container that holds an unlimited number of objects. Objects are stored in different sizes, from a minimum of 1 byte to a maximum of 5 TB. Each bucket belongs to one geographical location, including the US, Europe and Asia. There are three storage classes in AWS: (1) Standard, used for frequently accessed data; (2) Standard - Infrequent Access (IA), used for less frequently accessed data that still needs fast access; (3) Amazon Glacier, used for archiving and long-term backup. Microsoft Azure blob storage is used to store unstructured data in the form of objects. Blob storage is scalable and persistent. Microsoft replicates its data within the same data center or across different data centers in multiple world locations for maximum availability and to ensure durability and recovery. Each object has attribute-value pairs. The blob size in Microsoft ranges from 4 MB to 1 TB. Whereas Amazon uses a “bucket” to hold its objects, Microsoft uses containers to hold its blobs (objects), and Azure tables facilitate query-based data management. Azure blobs come in three types: block blobs, append blobs and page blobs. Block blobs are suitable for large volumes of blobs and for storing cloud objects. Append blobs consist of blocks but are optimized for append operations, and page blobs are best used for randomly accessible data such as random reads and writes. Azure implements two access tiers: (1) the Hot access tier, which is dedicated to objects that are more frequently accessed, at lower access cost; (2) the Cool access tier, which is dedicated to less frequently accessed objects, at lower storage cost. For example, one can put long-term backups in cool storage and on-demand live video in the hot storage tier [2, 3, 5–10]. This paper is organized as follows: Sect. 2 provides a literature review of both cloud providers. Sections 3 and 4 discuss security and the pricing model implemented in Amazon and Azure storage. Section 5 evaluates which cloud provider has better performance. In addition, it provides a quick comparison between the object storage services of Amazon and Microsoft. Finally, Sect. 6 concludes the paper and outlines future work.
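As an illustration of the API-based access mentioned above, the following minimal sketch uses the azure-storage-blob Python SDK to upload and download a block blob; the connection string, container and blob names are placeholders, and error handling is omitted.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; in practice this is obtained from the Azure portal.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="example-container", blob="report.bin")

# Upload a block blob, overwriting any existing blob with the same name.
blob.upload_blob(b"some payload", overwrite=True)

# Download it back into memory.
data = blob.download_blob().readall()
assert data == b"some payload"
```

The Amazon S3 API follows the same object-level pattern (PUT and GET on a key within a bucket), which is what makes direct performance comparisons between the two services meaningful.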
2 Literature Review
Li et al. [6] compared the performance of blobs (objects) in Amazon AWS (referred to as C1 in the figures) and Microsoft Azure (C2). They used two metrics for performance comparison: operation response time and time to consistency. The operation response time metric measures how long it takes for a storage request to finish: the response time of an operation is the time that extends from the instant the client begins the operation to the instant when the last byte reaches the client. The time to consistency metric measures the time between the instant when a datum is written to the storage service and the instant when all reads for the datum return consistent and valid results. Such information is useful to cloud customers, because their applications may require data to be immediately available with a strong consistency guarantee. The authors first write an object to a storage service, then repeatedly read the object and measure how long it takes before the read returns the correct result. Figure 1 shows the response time distributions for uploading and downloading one blob measured by their Java-based clients. They consider two blob sizes, 1 KB and 10 MB, to measure both the latency and the throughput of the blob store. The performance of the blob services for the selected providers depends on the type of operation: Microsoft has better performance than Amazon for download operations, but for upload operations Amazon performs considerably better than Microsoft. Figure 2 illustrates the time to download a 10 MB blob measured by non-Java clients. Compared to Fig. 1c, the non-Java clients perform much better for both providers, because it turns out that the Java implementation of their API is particularly inefficient.
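A minimal sketch of how these two metrics could be measured against Amazon S3 with boto3 is given below; the bucket and key names are hypothetical, and this illustrates the idea rather than the instrumentation actually used in [6].

```python
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "probe.bin"   # hypothetical names
payload = b"x" * 1024                          # 1 KB probe object

# Operation response time: wall-clock time for one upload and one download.
t0 = time.perf_counter()
s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload)
upload_s = time.perf_counter() - t0

t0 = time.perf_counter()
s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
download_s = time.perf_counter() - t0

# Time to consistency: write a new value, then poll until reads return it.
marker = str(time.time()).encode()
s3.put_object(Bucket=BUCKET, Key=KEY, Body=marker)
t0 = time.perf_counter()
while s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read() != marker:
    time.sleep(0.01)   # brief pause between polls
consistency_s = time.perf_counter() - t0

print(upload_s, download_s, consistency_s)
```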
Fig. 1. The cumulative distribution of the response time to download or upload a blob using Java-based clients.
Fig. 2. The cumulative distribution of the time to download a 10 MB blob using non-Java clients.
Figure 3 compares the scalability of the blob services under concurrent operations. The authors use non-Java clients since the Java implementation is inefficient, as mentioned above. The figure only shows download requests, since the upload results follow a similar trend. Amazon shows good scaling performance for small blobs, but it cannot handle a large number of simultaneous operations very well for large blob sizes.
Fig. 3. The blob downloading time from each blob service under multiple concurrent operations. The number of concurrent requests ranges from 1 to 32. Note that the x-axes are on a logarithmic scale.
Wada et al. [1] compared consistency between Amazon S3 and Microsoft’s blob storage. First, they measured the consistency of AWS S3 during an 11-day period. They updated an object in a bucket with the current timestamp as its new value. In this experiment, they measured five configurations: a write and a read run in a single thread, in different threads, in different processes, in different VMs, or in different regions. S3 provides two kinds of write operations: standard and reduced redundancy. A standard write operation targets a durability of at least 99.999999999%, while a reduced redundancy write targets a durability of at least 99.99%. Amazon S3 buckets offer eventual consistency for overwrite operations. However, stale data was never found in their study, regardless of the write redundancy option. It seems that staleness and inconsistency might be visible to an Amazon S3 consumer only when there is a failure in the nodes where the data is stored. The experiment was also conducted on Windows Azure blob storage for eight days. They took measurements for four configurations: a write and a read run in a single thread, in different threads, in different VMs, or in different regions. On Windows Azure blob storage, a writer updates a blob and a reader reads it. No stale data was found at all. It is known that all types of Windows Azure storage support strong data consistency, and the study in [4] confirms that. Persico et al. [9] assess the cloud-to-user network performance of Amazon S3 for remote data delivery using the standard HTTP GET method. They study the standard AWS S3 storage class in four different cloud regions distributed over four continents, namely the United States, Europe, Asia Pacific, and South Africa. In each region they created a bucket that contains files of various sizes, from 1 B to 100 MB. The Bismark platform is used to simulate clients worldwide. Bismark nodes (vantage points (VPs)/source regions) are distributed over 77 locations: the United States (US, 36 VPs), Europe (EU, 16), Central-South America (CSA, 4), the Asia-Pacific region (AP, 12), and South Africa (ZA, 9). The study confirms that the size of the downloaded object heavily affects the measured network performance, independently of the VP. Considering the goodput values averaged over all the source regions, the cloud regions reported 3562, 2791, 1445, and 2018 KiB/s, respectively, for objects of 100 MB size. Thus, the United States and Europe represent the best available choices for cloud customers. Often the best performance is obtained when the bucket and the VP are in the same geographic zone; however, [9] found that the US and EU cloud regions give better performance in terms of goodput (+45.5% on average), though sometimes this choice leads to suboptimal performance.
The same test was also repeated over time to understand whether there are other factors affecting the performance of both cloud providers such as traffic peaks during a day. They downloaded and uploaded 10 MB file making 3500 transactions in a weak. The performance of uploading file was the same for both
providers, but the download was faster than the upload. For both types of operations, no correlation was found between the time of day and the performance. Hill et al. [12] evaluated Windows Azure blob storage performance. They measured the maximum throughput in operations/s or MB/s for 1 to 192 concurrent clients. They analyzed the performance of downloading a single 1 GB blob and uploading a blob of the same size to the same container by multiple concurrent users. This test was run three times each day at different times, and there was no significant performance variation across the different days. The maximum throughput found was 393.4 MB/s for 128 clients downloading the same blob, and 124.25 MB/s for 192 clients uploading a blob to the same container. Obviously, as the number of clients increases, the per-client performance decreases. Figure 4 shows the average client bandwidth as a function of the number of concurrent users downloading the same blob. Both operations are sensitive to the number of concurrent clients accessing the object. For uploading an object into the same container, the data transfer rate decreased to 50% with 32 clients as compared to one client. The performance of downloading an object is limited by the client’s bandwidth; there was a drop of about 1.5 MB/s in per-client bandwidth when the number of concurrent clients doubled. The overall blob upload speed is much slower than the download speed. For example, the average upload speed is only ~0.65 MB/s for 192 VMs, and ~1.25 MB/s for 64 VMs.
Fig. 4. Average client blob download bandwidth as a function of the number of concurrent clients.
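A concurrency-scaling test of this kind can be scripted in a few lines; the sketch below (hypothetical bucket and key, boto3 client shared across threads for brevity) reports the aggregate download goodput as the number of concurrent clients grows.

```python
import time
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET, KEY = "example-bucket", "blob-10mb.bin"   # hypothetical names
s3 = boto3.client("s3")

def download_once(_):
    # Download the whole object and return its size in bytes.
    return len(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())

for clients in (1, 2, 4, 8, 16, 32):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        total_bytes = sum(pool.map(download_once, range(clients)))
    elapsed = time.perf_counter() - t0
    print(clients, total_bytes / elapsed / 2**20, "MiB/s aggregate")
```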
3 Cost
Both cloud providers have no fixed costs and apply a “pay-as-you-go” model. Expenses are charged according to three factors: (1) Raw storage: Amazon S3 and Windows Azure start charging $0.085 per GB per month; the total price increases as the amount of stored data increases, and vice versa. (2) Requests: Microsoft and Amazon charge
users according to the number of requests they make. Amazon S3 charges $0.005 per 1,000 PUT, COPY, or LIST requests and $0.004 per 10,000 GET requests. Windows Azure charges $0.005 per 100,000 requests. (3) Outgoing bandwidth: uploading to the cloud is free, but downloading is charged by size; both cloud providers charge $0.12 per GB, declining with the amount of capacity consumed. All in all, the cost is calculated based on three factors: stored object size, the number of download requests, and the volume of transferred traffic. In addition, other aspects determine the pricing model, such as the location of the object and the storage tier of the cloud provider: Hot or Cool access tiers in Microsoft Azure, or Standard, Infrequent Access, or Glacier in Amazon [9, 11].
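To illustrate how these three factors combine, the following rough estimator plugs in the prices quoted above (which are illustrative and may be outdated); it is not an official pricing calculator of either provider.

```python
def estimate_monthly_cost_usd(stored_gb, put_requests, get_requests, egress_gb,
                              storage_rate=0.085,      # $/GB/month (quoted above)
                              put_rate_per_1k=0.005,   # $ per 1,000 PUT/COPY/LIST
                              get_rate_per_10k=0.004,  # $ per 10,000 GET
                              egress_rate=0.12):       # $/GB downloaded
    """Rough S3-style monthly bill from the quoted prices (illustrative only)."""
    storage = stored_gb * storage_rate
    requests = (put_requests / 1000) * put_rate_per_1k + (get_requests / 10000) * get_rate_per_10k
    egress = egress_gb * egress_rate
    return storage + requests + egress

# Example: 500 GB stored, 100k PUTs, 1M GETs, 50 GB downloaded in a month.
print(round(estimate_monthly_cost_usd(500, 100_000, 1_000_000, 50), 2))
```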
4 Security
Cloud computing faces new and challenging security threats. Microsoft and Amazon are public cloud providers, and their infrastructure and computational resources are shared by several users and organizations across the world. Security concerns revolve around data Confidentiality, Integrity and Availability (CIA) [9, 10]. Amazon AWS confidentiality: Amazon uses AWS Identity and Access Management (IAM) to ensure confidentiality. To access AWS resources, a user should first be granted a permission that consists of an identity and unique security credentials. IAM applies the least-privilege principle. AWS Multi-Factor Authentication (MFA) and key rotation are used to ensure confidentiality as well. MFA is used to boost control over AWS resources and account settings for registered users: it requires a user to provide a code along with the username and password, and this code is sent to the user by an authenticating device or a special application on a mobile phone. For key rotation, AWS recommends that access keys and certificates be rotated regularly, so that users can alter security settings and maintain their data availability [9, 10]. Microsoft Azure confidentiality: Microsoft uses Identity and Access Management (IAM) so that only authenticated users can access their resources. A user needs to use his/her credit card credentials to subscribe to Windows Azure, and can then access resources using a Windows Live ID and password. In addition, Windows Azure uses isolation to provide protection, keeping data and containers logically or physically segregated. Moreover, Windows Azure encrypts its internal communication using SSL encryption and provides users with the choice of encrypting their data during storage or transmission. In addition, Microsoft Azure gives the choice to integrate the .NET “Cryptographic Service Providers (CSPs)” by extending its SDK with .NET libraries; thus, a user can add encryption, hashing and key management to their data [9, 10]. AWS integrity: An Amazon user can download/upload data to S3 using the HTTPS protocol through SSL-encrypted endpoints. In general, there are server-side encryption and client-side encryption in Amazon S3. On the server side, Amazon manages the encryption keys for its users. On the client side, a user can encrypt data before uploading it to Amazon S3 using a client-side encryption library. For integrity, Amazon uses a Hash-based Message Authentication Code (HMAC). A user has a private key that is used when the user makes a service request. This private key is used to create
an HMAC for the request. If the HMAC created from the user request matches the one computed at the server, then the request is authenticated and its integrity is maintained [9, 10]. Microsoft Azure integrity: Microsoft provides an architecture to ensure integrity. It uses a Cryptographic Cloud Storage Service with a searchable encryption schema to ensure data integrity. This service ensures that only authorized access is permitted and enables encryption (SSE and ASE) and decryption of files [9, 10]. Availability: Microsoft Azure has two storage tiers, each with different availability: 99.9% for the hot storage tier and 99% for the cool storage tier. Both the Amazon Standard and Standard-Infrequent Access storage tiers have 99.9% availability. To ensure high availability, both Microsoft and Amazon provide geo-replication. Geo-replication is a type of data storage replication in which the same data is stored on servers in multiple distant physical locations [9, 10].
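Returning to the HMAC-based integrity mechanism described above for AWS: the sketch below is a simplified illustration of signing and verifying a request with HMAC-SHA256 using Python's standard library; it is not the actual AWS Signature algorithm, and the key and request string are invented for the example.

```python
import hashlib
import hmac

def sign(secret_key: bytes, string_to_sign: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the canonicalized request."""
    return hmac.new(secret_key, string_to_sign, hashlib.sha256).hexdigest()

def verify(secret_key: bytes, string_to_sign: bytes, received_tag: str) -> bool:
    """Server-side check using a constant-time comparison."""
    return hmac.compare_digest(sign(secret_key, string_to_sign), received_tag)

request = b"GET\n/example-bucket/probe.bin\n2018-01-01T00:00:00Z"
tag = sign(b"user-secret-key", request)
assert verify(b"user-secret-key", request, tag)
```

Because only the holder of the secret key can produce a matching tag, a valid tag both authenticates the sender and guarantees that the signed request was not altered in transit.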
5 Results and Evaluation
In summary, there is no clear winner when it comes to evaluating the performance of the object storage of Amazon and Microsoft. Many combinations of factors can affect the throughput of object storage, such as:
• Object size
• The location of the cloud data center
• The location of clients with respect to their cloud resources
• Type of operation (download, upload, etc.)
• Concurrency level of both clients and operations
• File size
• Client's bandwidth
• Client's hardware
• Cloud provider policy
The following points summarize some of the overall differences and similarities between the two selected cloud providers:
• There are different types of access classes in both Microsoft Azure Blob Storage (MABS) and AWS S3. The Hot access tier in Microsoft corresponds to the Standard access tier in Amazon, and the Cool tier in Microsoft corresponds to the Amazon Glacier access type.
• While Microsoft Azure stores data in a container, AWS stores data in a bucket.
• Microsoft offers storage objects in three types: block blob, page blob and append blob, while Amazon does not classify its object storage.
• For security and access management, both Amazon and Microsoft allow only the account owner to access their data. However, account owners can make some objects public for sharing purposes.
• Both Microsoft and Amazon use Identity and Access Management to ensure security and encryption to ensure confidentiality. Both provide the facility of data encryption before downloading or while uploading data.
• While Amazon verifies the integrity of stored data using checksums, Microsoft provides an independent architecture to eliminate integrity issues using its Cryptographic Cloud Storage Service.
6 Conclusion
Computing-as-a-utility is now used by many organizations around the world. Cloud services, such as object storage, are the building blocks of cloud applications. Cloud storage allows objects of different natures to be shared or archived, and cloud object storage is mostly used for its high scalability and its ability to store different types of data. It is challenging to compare cloud providers, since they all provide almost the same features from an end user's point of view; however, each cloud provider has its own implementation features. In this context, a comparison of performance, cost, and security has been done for Amazon S3 and Microsoft Azure Blob storage. This paper presents several studies that tested the performance of object storage for both cloud providers. We have summarized many factors that affect the performance of storage operations from the end user's point of view. In addition, we presented a review of the cost and security models for both cloud providers. As future work, we are going to explore additional services provided by AWS and Microsoft Azure, and we will do our own practical experiment to test and compare the performance of Microsoft Azure Blob Storage and Amazon object storage.
An Improvement of the Standard Hough Transform Method Based on Geometric Shapes

Abdoulaye Sere1(B), Frédéric T. Ouedraogo2, and Boureima Zerbo1

1 Laboratory of Mathematics and Computer Science, University OUAGA 1 Prof. Joseph KI-ZERBO, BP 7021 av. C.D.Gaulle, Ouagadougou, Burkina Faso
[email protected],
[email protected]
2 Laboratory of Mathematics and Computer Science, Université de Koudougou, Koudougou, Burkina Faso
[email protected]
Abstract. The Hough Transform is a well known method in image processing for straight line recognition, and it is very popular for detecting complex forms, such as circles, ellipses and arbitrary shapes in digital images. In this paper, we are interested in the Hough Transform method that associates a point to a sine curve, named the Standard Hough Transform, applied to a large set of continuous points such as triangles, rectangles, octagons and hexagons, in order to overcome the time problem due to the small size of a pixel and to establish optimization techniques for the Hough Transform method in time complexity, with the main purpose of obtaining thick analytical straight line recognition according to some parameters. The proposed methods, named Triangular Hough Transform and Rectangular Hough Transform, consider an image as a grid, respectively represented as a triangular tiling or a rectangular tiling, and contribute accumulator data that reduce computation time while accepting limited noise in straight line detection. The analysis also deals with the case of geometric shapes, such as octagons and hexagons, where a tiling procedure of the image space is necessary to obtain new Hough Transform methods based on these forms.

Keywords: Hough transform · Analytical straight line · Pattern recognition

1 Introduction
Paul Hough introduced the Hough Transform method in 1962; it was used initially to recognize straight lines and later extended to complex form recognition in noisy pictures. The Hough Transform establishes a relation between an image space and a parameter space, like an isomorphism, to transpose a problem of complex form recognition in an image space into the detection of a high number of votes in a parameter space. It has been adapted to the recognition of digital circles, ellipses and generalized shapes in [2,4].
Some works concerning the analysis of errors generated by the application of the Hough Transform have been proposed in [3]: image space digitalization and parameter space digitalization have an impact on the precision of straight line recognition. The Standard Hough Transform is a Hough Transform method that associates a point (x, y) in an image space with a sine curve p = x·cos θ + y·sin θ in a parameter space. Maître in [4] has introduced several unified definitions of the Hough Transform method. Mukhopadhyay and others in [10] have also proposed a survey of the Hough Transform method that gives a broad view of the method. Moreover, several relations have been established between the Hough Transform method and other research areas; for instance, analyses combining Gröbner bases, fuzzy methods and the Hough Transform have been carried out. Han and others in [14] have proposed the Fuzzy Hough Transform for fuzzy object recognition. Nowadays, the Hough Transform is not only a theoretical method: there are also many applications of the Hough Transform method, such as mouth localization for audio-visual speech recognition in [6], action recognition in [7], building extraction in a picture in [12] and text segmentation in [13]. We also notice some adaptations of the Hough Transform method to detect thick lines in [11]. One can also find an implementation of the Hough Transform method in the OpenCV and CImg libraries. Other Hough Transforms have changed the original definition of the Hough Transform method to propose a definition in n-dimensional space: Martine Dexet and others in [1] have proposed an extension of the Hough Transform based on transforming a pixel into an area of straight lines in the parameter space, with an extension to n dimensions. Sere and others have introduced extensions of the Standard Hough Transform using object duals: an application of the Standard Hough Transform based on a square dual has been introduced in this sense in [9]. This paper focuses on an analysis of the Standard Hough Transform based on the rectangle dual, triangle dual, hexagon dual and octagon dual in order to improve computation time in thick analytical straight line recognition. This paper is organized as follows: the section named Preliminaries recalls the different analytical straight line definitions, the Rectangular Hough Transform and the Triangular Hough Transform that are needed for a good understanding of the sections concerning the algorithms. Our contributions take place in Sects. 3, 4, 5 and 6. Section 3 proposes techniques to apply the Standard Hough Transform to a hexagon or an octagon, where two propositions are given in this sense. Building accumulator data is analyzed in Sect. 4. We end the paper with Sects. 5 and 6, which present different algorithms to improve the Hough Transform method, illustrations and discussions about possible applications on real images.
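As a concrete illustration of the point-to-sine-curve mapping p = x·cos θ + y·sin θ recalled above, the following sketch accumulates votes for a set of image points. It is a minimal, generic Standard Hough Transform in Python/NumPy, not the authors' implementation; the resolution parameters are arbitrary choices for the example.

```python
# Minimal Standard Hough Transform accumulator (illustrative sketch).
import numpy as np

def standard_hough(points, n_theta=180, p_max=512.0, n_p=512):
    """Vote in a (theta, p) accumulator: each point (x, y) contributes
    the sine curve p = x*cos(theta) + y*sin(theta)."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_theta, n_p), dtype=np.int32)
    for x, y in points:
        p = x * np.cos(thetas) + y * np.sin(thetas)            # one p per theta
        idx = np.round((p + p_max) * (n_p - 1) / (2 * p_max)).astype(int)
        ok = (idx >= 0) & (idx < n_p)                           # keep in-range bins
        acc[np.arange(n_theta)[ok], idx[ok]] += 1
    return thetas, acc

# Example: three collinear points vote for the same (theta, p) cell.
thetas, acc = standard_hough([(10, 20), (20, 30), (30, 40)])
t_i, p_i = np.unravel_index(np.argmax(acc), acc.shape)
print("peak votes:", acc[t_i, p_i], "at theta =", thetas[t_i])
```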
2 Preliminaries
In this section, we recall the basic concepts of discrete geometry that are essential to a good understanding of the next sections.
Our purpose is straight line recognition through an extension of the Hough Transform method. We therefore recall the analytical straight line definition and the definitions of the Hough Transform method using the triangle dual and the rectangle dual.

2.1 Analytical Straight Line
An analytical straight line is a particular analytical straight hyperplane in 2-dimensional space, defined by Reveillès in [5].

Definition 1 (Analytical Straight Hyperplane): Let H be an analytical hyperplane in dimension n, denoted by μ ≤ Σ_{i=1}^{n} (A_i x_i) < μ + ω, with parameters A(A_1, A_2, ..., A_{n−1}, A_n) ∈ R^n, μ ∈ R, ω ∈ R and x_i ∈ Z. Then:
• H is called naive if ω = max_{1≤i≤n}(|A_i|)
• H is called standard if ω = Σ_{i=1}^{n}(|A_i|)
• H is called thin if ω < max_{1≤i≤n}(|A_i|)
• H is called thick if ω > Σ_{i=1}^{n}(|A_i|)
• H is called ∗-connected if max_{1≤i≤n}(|A_i|) …

… that belong to a contour, its dual will be computed in the accumulator: one will certainly have rectangles containing pixels inside that should not be in a contour (detected by a filter). The previous Theorem 1 is used to compute the dual of a rectangle. Figure 9 gives an example of virtual rectangles that contain real pixels. Each rectangle is represented by a blue border. Our aim is to use a virtual analytical straight line (standard definition) based on virtual rectangles to realize the detection of a thick analytical straight line that rests on real pixels. The reason is that the latter is included in the former. Let I be an image space. Let x_i be the length of a pixel side. In the definition of the rectangular tiling, a rectangle R has a width w_r and a height h_r, where w_r = k·x_i and h_r = k′·x_i with (k, k′) ∈ (N−{0})². A rectangular tiling is the result of the application of quasi-affine transformations based on two perpendicular straight lines. The parameters (k, k′) also determine the new quantization of the image space. Then, a rectangle will have (k × k′) square pixels inside. In reality,
Fig. 9. A set of rectangles with blue borders and a thin analytical straight line defined by −|4|/2 ≤ 4x − 7y + 13.5 ≤ |4|/2.
a pixel is very small to be seen with the eye. Hence, a rectangle having (k × k′) square pixels is a good way to optimize computation time, because one of the purposes of the recognition processing is to help image interpretation: a pixel must be representative to the eye. For more details, consider two representations in the coordinate systems (o, i, j) and (o, I, J), with the relations I = k·i and J = k′·j, in which we can define different analytical straight lines. A point M(X, Y) in (o, I, J) verifies OM = X·I + Y·J. In (o, i, j), one gets OM = kX·i + k′Y·j. Inversely, suppose that M(x, y) is represented in (o, i, j); we have OM = x·i + y·j and, in (o, I, J), OM = (x/k)·I + (y/k′)·J. The analytical straight line μ ≤ ax + by ≤ μ + ω in the system (o, i, j) becomes μ ≤ akX + bk′Y ≤ μ + ω in the system (o, I, J) with M(X, Y). Inversely, in this way, we can easily find the equation from one system to the other. When we change the coordinate system from (o, i, j) to (o, I, J), the nature of the analytical straight line depends on its thickness: Table 1 summarizes the relation between (o, i, j) and (o, I, J). The constraints in the column of (o, i, j) imply the constraints in the column of (o, I, J). Thus, in the last row of this table, a standard straight line in (o, I, J) corresponds to a thick analytical straight line in (o, i, j). Figure 10 presents an example of a thick analytical straight line based on several rectangles.

Table 1. Constraints on thickness values in two representations
(o, i, j)          (o, I, J)
ω ≤ max(a, b)      ω < max(ak, bk′)
ω = max(a, b)      ω < max(ak, bk′)
ω ≤ |a| + |b|      …
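To make Definition 1 and the coordinate-change discussion above concrete, the sketch below classifies the thickness of a 2D analytical straight line μ ≤ ax + by < μ + ω and shows how the classification can change when the grid is rescaled by (k, k′). This is a generic Python illustration written for this review of the definitions, not code from the paper.

```python
# Thickness classes of a 2D analytical straight line (Definition 1, n = 2).
def thickness_class(a, b, omega):
    m = max(abs(a), abs(b))      # max_i |A_i|
    s = abs(a) + abs(b)          # sum_i |A_i|
    if omega < m:
        return "thin"
    if omega == m:
        return "naive"
    if omega == s:
        return "standard"
    if omega > s:
        return "thick"
    return "between naive and standard"

def rescaled_class(a, b, omega, k, kp):
    # In (o, I, J) with I = k*i and J = k'*j, the same line has
    # coefficients (a*k, b*k') and the same width omega.
    return thickness_class(a * k, b * kp, omega)

# Example: a standard line in the original grid ...
print(thickness_class(4, -7, 11))        # 11 = |4| + |-7|  -> "standard"
# ... becomes thin once the grid is coarsened by k = k' = 3.
print(rescaled_class(4, -7, 11, 3, 3))   # max(12, 21) = 21 > 11 -> "thin"
```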
If d + D ≤ 1, the models are fitted with c ≠ 0; otherwise we set c = 0. Comparing the four models, the one having the smallest AIC value is selected as the “current” model and is denoted by ARIMA(p, d, q) if m = 1, or ARIMA(p, d, q)(P, D, Q)m if m > 1.
Step 2: Up to 13 variations on the current model are computed and compared, where:
• one of p, q, P and Q is allowed to vary by ±1 from the current model;
• p and q both vary by ±1 from the current model;
• P and Q both vary by ±1 from the current model;
• the constant c is included if the current model has c = 0, or excluded if the current model has c ≠ 0.
Whenever a model with lower AIC is found, it becomes the new “champion” model, and the procedure is repeated to look for challenger models to the champion model. The process ends when there is no model with lower AIC.
Hyndman and Khandakar [14] also instituted several constraints on the fitted models to avoid problems with convergence or near unit-roots. They are listed below:
• The values of p and q are not allowed to exceed specified upper bounds.
• The values of P and Q are not allowed to exceed specified upper bounds.
• We reject any model that is “close” to non-invertible or non-causal.
• If there are any errors arising in the non-linear optimization routine used for estimation, the model is rejected.
The proposed algorithm is guaranteed to return a valid model because the model space is finite and at least one of the starting models will be accepted. The selected model is then used to produce forecasts. In the next section, we will discuss using the neighbourhood search heuristics to look for the correct model.
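A compact way to picture the stepwise search in Steps 1 and 2 is the sketch below, which fits candidate ARIMA orders with statsmodels and keeps the neighbour with the lowest AIC. It is a simplified, non-seasonal illustration of the idea, not the actual forecast-package implementation (which also handles seasonal terms, the drift constant and the unit-root tests).

```python
# Simplified stepwise AIC search over (p, q) neighbours, non-seasonal case.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def fit_aic(y, order):
    try:
        return ARIMA(y, order=order).fit().aic
    except Exception:
        return np.inf          # treat non-converging models as invalid

def stepwise_search(y, d, max_p=5, max_q=5):
    # Step 1: start from a small set of initial models.
    start = [(2, d, 2), (0, d, 0), (1, d, 0), (0, d, 1)]
    current = min(start, key=lambda o: fit_aic(y, o))
    best_aic = fit_aic(y, current)
    improved = True
    while improved:
        improved = False
        p, _, q = current
        # Step 2: vary p and/or q by +/-1 around the current model.
        neighbours = [(p + dp, d, q + dq)
                      for dp in (-1, 0, 1) for dq in (-1, 0, 1)
                      if (dp, dq) != (0, 0)
                      and 0 <= p + dp <= max_p and 0 <= q + dq <= max_q]
        for cand in neighbours:
            aic = fit_aic(y, cand)
            if aic < best_aic:
                current, best_aic, improved = cand, aic, True
    return current, best_aic

# Example usage on a random walk:
# y = np.cumsum(np.random.randn(200)); print(stepwise_search(y, d=1))
```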
4 Maximum Likelihood Model Transverse (ML-MT) Approach

The Hyndman and Khandakar approach can be complex and involves testing the models against different criteria. In this section, we propose a novel neighbourhood search heuristic called the maximum likelihood model transverse approach. The following is a description of our approach:

Step 1: Build a (0,0,0) ARIMA model and a (0,1,0) ARIMA model.

Step 2: Calculate the AIC for both models and select the model with the smallest AIC. Set d to the value of d in the selected model.

Step 3: Create the ACF and PACF plots. If the (0,1,0) ARIMA model is selected, the plots are created using the data with one differencing.
• Only one differencing is used, as multiple differencing can have a serious impact on the final models.

Step 4: Check for the type of time series model using the rules of the Box-Jenkins method.
Checking for a Random Walk model
• If the maximum of lags 2 to 12 is less than 0.2, then the model is said to be a random walk model.
Checking for an AR model
• If lag 1 is greater than lag 2, and lag 2 is greater than lag 3.
• If lag 3 is greater than 0.05, with the mean of lags 4–12 less than 0.2.
Checking for an MA model
• If the mean of the absolute values of lags 1 to 3 is greater than 0.2.
• If lag 2 is greater than 0.2.
• If the mean of lags 4–12 is less than 0.2.
If all conditions are not met, the model is classified as an ARMA model.

Step 5: Initialization and search according to the case defined in Step 4 above.
Random Walk Model
• Both p and q are set to 0.
• Build the model.

AR Model
(1) Set p as 1. Build the base model with p.
(2) Set p as p + 1.
(3) Calculate the AIC for cases p and p + 1.
(4) If the p + 1 case has a lower AIC than p, set p as p + 1. Repeat steps 2 to 5.
(5) Else set p as p and exit.

MA Model
(1) Set q as 1. Build the base model with q.
(2) Set q as q + 1.
(3) Calculate the AIC for cases q and q + 1.
(4) If the q + 1 case has a lower AIC than q, set q as q + 1. Repeat steps 2 to 5.
(5) Else set q as q and exit.

ARMA Model
(1) Search for the lag with the highest spike in the ACF plot.
(2) If the lag is greater than 3 then set p as 3, else set p as the lag position value.
(3) Search for the lag with the highest spike in the PACF plot.
(4) If the lag is greater than 3 then set q as 3, else set q as the lag position value.
(5) Build the base model using the initial p, d and q.
(6) Build models with p and q both varying by ±1 from the current model.
(7) Compare the AIC and ensure the Ljung-Box test is not violated.
(8) If any of the models has a lower AIC than the base model without violating the Ljung-Box test, set that model as the base model and repeat steps 6–8. Else exit.
The process combines the rules of order identification with the Ljung-Box test and AIC to identify the best parsimonious model without model violation. In the next section, we will compare the ability of the new algorithm to accurately identify the order specifications of simulated models, compared to the Auto.Arima approach.
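The ACF-based classification rules in Step 4 can be expressed directly in code. The sketch below is a loose Python rendering of those rules using statsmodels' acf function; the thresholds follow the text, while details such as differencing and the subsequent Step 5 search are omitted.

```python
# Rule-based model-type check from the ML-MT Step 4 (illustrative sketch).
import numpy as np
from statsmodels.tsa.stattools import acf

def classify_series(y, nlags=12):
    r = acf(y, nlags=nlags)            # r[1] .. r[12] are lags 1..12
    lags = np.abs(r)
    # Random walk: lags 2..12 all small.
    if lags[2:13].max() < 0.2:
        return "random walk"
    # AR: decaying early lags, small tail.
    if (lags[1] > lags[2] > lags[3] and lags[3] > 0.05
            and lags[4:13].mean() < 0.2):
        return "AR"
    # MA: strong early lags, small tail.
    if (lags[1:4].mean() > 0.2 and lags[2] > 0.2
            and lags[4:13].mean() < 0.2):
        return "MA"
    return "ARMA"

# Example: an AR(1) series should usually be flagged as "AR".
# y = np.zeros(300)
# for t in range(1, 300):
#     y[t] = 0.7 * y[t - 1] + np.random.randn()
# print(classify_series(y))
```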
5 Experimental Results

The statistical software R [23] has an internal utility function for generating time series given an ARIMA model. A standard group of parameters must be provided in order to define the ARIMA model; the required ones are listed below (a Python sketch of an equivalent generator follows the list):
• The order P of the auto-regressive (AR) part and its coefficient set.
• The order Q of the moving average (MA) part and its coefficient set.
• The order of differentiation D.
• The number of samples to generate.
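As referenced above, the sketch below generates comparable pseudo-random ARMA data in Python with statsmodels' arma_generate_sample; it is an assumed stand-in for the R utility used by the authors, and the coefficients shown are arbitrary examples from the grid described in the next paragraph.

```python
# Generate a pseudo-random ARMA(1, 1) sample for testing order identification.
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

ar_coeff = [0.6]          # AR part (phi), one of the grid values in [-1, 1]
ma_coeff = [-0.3]         # MA part (theta)
# arma_generate_sample expects lag polynomials with the leading 1 and
# the AR signs negated: (1 - phi*L) and (1 + theta*L).
ar_poly = np.r_[1, -np.array(ar_coeff)]
ma_poly = np.r_[1, np.array(ma_coeff)]
y = arma_generate_sample(ar_poly, ma_poly, nsample=1000)
print(y[:5])
```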
After defining the parameter values, pseudo-random data can be generated to train and test the search algorithm. For any given values of P, D and Q, the values of the coefficients can significantly alter the results of the ACF and PACF functions applied to the generated series. In this case, the basic set of parameters for both the auto-regressive part and the moving average part is set as {−1, −0.9, −0.8, … −0.1, 0, 0.1, … 0.8, 0.9, 1}. Using these parameters, we generated 1,000 samples for each combination of p and q with varying parameters. To evaluate the accuracy of the model, we calculated the percentage of time that the model with the right specification is found using the methodology. The following are the results comparing our algorithm (Maximum Likelihood Model Transversal) with Hyndman and Khandakar's algorithm. From the simulated data results in Fig. 1, we can see that ML-MT is more accurate than Auto.Arima in some cases. Even for cases where it did not perform as well, it compared favourably. One interesting aspect is that the model performs strongly even for the more complicated ARIMA models, where the Auto.Arima approach tapers off. Towards the tail end of the model set, we can see that in some situations ML-MT defeats the Auto.Arima by a fair margin. Thus, we have some assurance of the accuracy of ML-MT in identifying the correct model.

Fig. 1. Comparison between Hyndman and Khandakar's approach and our ML-MT search approach (y-axis: % identified correctly; x-axis: ARIMA models of the form (p, d, q): 000, 002, 011, 100, 102, 111, 200, 202, 211; series: Proposed, Hyndman).
6 Real Dataset Results

After testing the approaches against simulated data, we also tested them using real life data covering a range of industries and fields. The M competition data set for time series forecasting is considered to be a benchmark for accuracy. To compare the accuracy of the models, we compute the MSE for the prediction and validation data and highlight which model did better. The results from the test are shown below in Table 1.

Table 1. M competition accuracy comparison using MSE comparison

Category   MSE comparison   # of series
DEMOGR     49.3%            144
INDUST     53.0%            236
MACRO1     51.8%            139
MACRO2     50.6%            180
MICRO1     51.6%            31
MICRO2     56.1%            139
MICRO3     51.5%            132
In Table 1, the test results are encouraging, as the proposed method performs better than the current approach across six of the seven categories. Below, Table 2 gives the comparison using MAPE.

Table 2. M competition accuracy comparison using MAPE comparison

Category   MAPE
DEMOGR     47.2%
INDUST     52.1%
MACRO1     53.2%
MACRO2     43.9%
MICRO1     51.6%
MICRO2     54.7%
MICRO3     49.2%
Table 2 shows the MAPE comparison, which indicates that the newly proposed method only performed better in 4 out of 7 categories. The M competition time series are quite small, and to further validate the model we look at real data sets in the healthcare environment. There are 4 time series from the MIT-BIH database [7] that provide measurements of patients' heart rates. Each series contains between 950 and 1800 evenly-spaced measurements of instantaneous heart rate from a single subject. The subjects were engaged in comparable activities for the duration of each series. The measurements occur at 0.5 s intervals, so that the length of each series is exactly 15 min. Another set of clinical data came from the UCR Time Series Data Mining website [6].
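For reference, the two error measures used in these comparisons can be computed as below. The helper names and the per-series "win percentage" aggregation are illustrative assumptions rather than the authors' exact evaluation code.

```python
# Error measures used in the comparisons (illustrative helpers).
import numpy as np

def mse(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean((actual - forecast) ** 2)

def mape(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def win_percentage(errors_proposed, errors_baseline):
    """Share of series on which the proposed method has the lower error."""
    wins = [p < b for p, b in zip(errors_proposed, errors_baseline)]
    return 100.0 * sum(wins) / len(wins)

# Example: proposed wins on 2 of 3 series -> 66.7%
print(win_percentage([1.0, 5.0, 0.3], [1.2, 4.0, 0.9]))
```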
Table 3. Healthcare datasets accuracy comparison

Data                              MSE (ML-MT)   MSE (Hyndman)
Health care statistics
Annual-nonfatal-disabling-mine-i  6366101.00    6546339.00
Annual-us-suicide-rate-per-10010  4.00          4.00
Monthly-reported-number-of-cases  0.28          3549.62
Monthly-us-polio-cases            5.17          5.17
Weekly-counts-of-the-incidence-o  1.17          1.17
UK-deaths-from-bronchitis-emphys  31264.43      28140.33
Clinical data
hr.207                            2.24          8.88
hr.237                            14.05         7.44
hr.7257                           1.04          7.23
hr.11839                          45.78         59.16
nprs43                            247669.40     259426.00
nprs44                            91079.95      98238.14
The healthcare statistics were downloaded from datamarket.com and are derived from McCleary and Hay [19]. The data cover a variety of health care statistics, from disease incidents to suicide counts. The results are shown in Table 3 and are encouraging. For the health care statistics, the forecast matches the results from Auto.Arima or outperforms them in all cases except one. For the clinical data, similar performance was observed for the ML-MT approach. This indicates that the approach is more accurate than Auto.Arima in terms of MSE for the real life data samples.
7 Discussion of Experimental Results

From the results, we can easily verify the superior performance of the ML-MT approach compared to Hyndman's approach in selecting the correct model. The difference is particularly pronounced for models with a simpler structure, such as ARIMA (0, 1, 0), and weaker for complex cases like ARIMA (2, 1, 1). The reason could be the level of difficulty in identifying the complex cases, where the candidate models in the search subspace are very similar and the correct model differs only slightly from the selected model. At the same time, a complex ARIMA structure might be fitted by a simpler structure given the data set size. This can pose a certain level of bias when it comes to model selection. The other difference is the measurement of performance. In Hyndman's paper, MSE is specifically used as the performance measurement. However, it is important to consider other factors when we do forecasting, especially if the technique yielded less
than desirable accuracy in identifying the model structure. In this case, the proposed technique has performed admirably in identifying the correct model as compared to Hyndman's approach, with good accuracy on real life health care data as well.
8 Conclusion

Using simulated data, we compared the performance of the proposed novel algorithm against the Auto.Arima approach. The proposed algorithm performs favourably compared to the Auto.Arima approach and, in certain cases, performs better by a fair margin. The novel approach also has better accuracy in identifying the order specifications of a time series model, even in the case of complex ARIMA models. A further comparison between the approaches was done using a combination of real world healthcare statistics data, clinical data and time series benchmarking tools. The novel approach outperforms the Auto.Arima approach in the majority of the cases tested. Further testing will be required to assess the approach's efficacy in the wider context of time series forecasting.

Acknowledgement. We wish to thank Michele Hibon and Makridakis Spyros for providing the M3 data; Hyndman [14, 15] for his R package, forecast, which provides us with the Auto.Arima function, and Mcomp, which allowed us to implement the M3 time series data easily in R; and the anonymous referees for insightful comments and suggestions.
References

1. Bergmeir, C., Hyndman, R.J., Benítez, J.M.: Bagging exponential smoothing methods using STL decomposition and Box-Cox transformation. Int. J. Forecast. 32(2), 303–312 (2016)
2. Box, G., Jenkins, G.: Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco (1970)
3. Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods, 2nd edn. Springer, New York (1991)
4. Dickey, D.A., Fuller, W.A.: Likelihood ratio statistics for autoregressive time series with a unit root. Econom. J. Econom. Soc., 1057–1072 (1981)
5. Durbin, J., Koopman, S.J.: A simple and efficient simulation smoother for state space time series analysis. Biometrika 89(3), 603–616 (2002)
6. Keogh, E., Lin, J., Fu, A.: HOT SAX: efficiently finding the most unusual time series subsequence. In: The Fifth IEEE International Conference on Data Mining (2005)
7. Goldberger, A.L., Rigney, D.R.: Nonlinear dynamics at the bedside. In: Glass, L., Hunter, P., McCulloch, A. (eds.) Theory of Heart: Biomechanics, Biophysics, and Nonlinear Dynamics of Cardiac Function, pp. 583–605. Springer, New York (1991)
8. Gomez, V.: Automatic model identification in the presence of missing observations and outliers. Technical report, Ministerio de Economía y Hacienda, Dirección General de Análisis y Programación Presupuestaria, working paper D-98009 (1998)
9. Gomez, V., Maravall, A.: Programs TRAMO and SEATS, instructions for the users. Technical report, Dirección General de Análisis y Programación Presupuestaria, Ministerio de Economía y Hacienda, working paper 97001 (1998)
10. Goodrich, R.L.: The Forecast Pro methodology, pp. 533–535 (2000)
11. Hannan, E.J., Rissanen, J.: Recursive estimation of mixed autoregressive-moving average order. Biometrika 69(1), 81–94 (1982)
12. Hendry, D.F., Doornik, J.A.: The implications for econometric modelling of forecast failure. Scott. J. Polit. Econ. 44(4), 437–461 (1997)
13. Hylleberg, S., et al.: Seasonal integration and cointegration. J. Econom. 44(1), 215–238 (1990)
14. Hyndman, R.J., Khandakar, Y.: Automatic time series for forecasting: the forecast package for R. No. 6/07. Monash University, Department of Econometrics and Business Statistics (2007)
15. Hyndman, R.J.: Data from the M-competitions. Comprehensive R Archive Network. http://cran.r-project.org/web/packages/Mcomp/
16. Kwiatkowski, D., et al.: Testing the null hypothesis of stationarity against the alternative of a unit root: how sure are we that economic time series have a unit root? J. Econom. 54(1–3), 159–178 (1992)
17. Liu, L.M.: Identification of seasonal ARIMA models using a filtering method. Commun. Stat. A Theor. Methods 18, 2279–2288 (1989)
18. Makridakis, S., Hibon, M.: The M3-competition: results, conclusions and implications. Int. J. Forecast. 16, 451–476 (2000)
19. McCleary, R., Hay, R.: Applied Time Series Analysis for the Social Sciences. Sage Publications, Beverly Hills (1980)
20. Melard, G., Pasteels, J.-M.: Automatic ARIMA modeling including intervention, using time series expert software. Int. J. Forecast. 16, 497–508 (2000)
21. Ord, K., Lowe, S.: Automatic forecasting. Am. Stat. 50(1), 88–94 (1996)
22. Smith, J., Yadav, S.: Forecasting costs incurred from unit differencing fractionally integrated processes. Int. J. Forecast. 10(4), 507–514 (1994)
23. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org (2008)
24. Reilly, D.: The AUTOBOX system. Int. J. Forecast. 16(4), 531–533 (2000)
SmartHealth Simulation Representing a Hybrid Architecture Over Cloud Integrated with IoT A Modular Approach Sarah Shafqat1,2(&), Almas Abbasi1, Tehmina Amjad1, and Hafiz Farooq Ahmad3 1 Department of Basic and Applied Sciences, International Islamic University (IIU), Islamabad, Pakistan
[email protected],
[email protected],
[email protected] 2 Founder and President DreamSofTech, Islamabad, Pakistan 3 Department of Computer Science, College of Computer Sciences and Information Technology (CCSIT), King Faisal University (KFU), Alahsa 31982, Kingdom of Saudi Arabia
[email protected]
Abstract. Every field is evolving in a new direction with technological advancement. In the case of healthcare, the traditional system is being transferred over to the cloud with the integration of the Internet of Things (IoT), inclusive of all the smart devices, wearable body sensors and mobile networks. This healthcare community cloud would be a beginning of context aware services being provided to patients at their home or at the place of a medical incident. This context aware platform, built over the cloud and IoT integrated infrastructure, would save cost as well as the time needed to reach a hospital, while ensuring the availability of services by qualified staff. In this paper, the researchers simulate the healthcare community cloud that is context aware of the patients based on their current medical condition. This simulation model is compared to another previously simulated SelfServ Platform in combination with a societal information system that used NetLogo.

Keywords: Hybrid cloud simulation · Learning Healthcare System · Healthcare analytics · Community cloud
1 Introduction

When the world steps towards innovation in healthcare, the role of technology comes to the front. It is understood that all fields of study and their activities generate enormous amounts of data due to digitization, the evolution of information technologies and
Special thanks to the International Islamic University and the Higher Education Commission of Pakistan for sponsoring the researchers to present this novel idea at FICC 2018.
the advent of the internet, which has become a widely used interface for daily interactions. The data being generated daily is so huge that it has become difficult to manage, not just in volume but also in variety and velocity; thus the term big data became known [1]. In 2008, Gartner used the term “Big Data” and foresaw the change it would bring to the way we work. Big data and analysis are correlated, as this whole world of digitized data has to be understood by turning it into meaningful information. Meanwhile, there has also been a huge transformation in the medical field because of cloud computing, which has shifted the medical paradigm beyond the boundaries of a hospital and made services available to anyone, anywhere. There are examples like robotic surgery and laparoscopic surgeries taking place in substitution for classical surgery. There are smart homes allowing patients to care for themselves, with monitoring through smart devices and with applications and software that analyze body signals through sensors and mHealth technologies for collecting data on biological, environmental and behavioral patterns. All this sensor-based monitoring gives higher accuracy. This huge medical data comes from all sources, including electronic medical records that contain diagnostic images, lab results, and biometric information to be stored and evaluated. From here the researchers grasped the idea of analyzing the data to reach better clinical decisions for doctors. Nowadays, Big Data Analytics [1] is seen as the only solution to the problems of the healthcare service sector, providing (i) the best service, (ii) monitoring of quality in hospitals, (iii) improvement of treatment processes, and (iv) earlier detection of disease. Until now a lot of work has been done in coming up with algorithms to classify and predict chronic diseases like heart disease, breast cancer, motor neuron disease, and diabetes. This big data is being analyzed for knowledge discovery and decision making by incorporating machine learning techniques [2], moving towards the formation of the Learning Healthcare System (LHS) [3]. The Learning Healthcare System (LHS) [3] is a term that is gaining momentum in recent days. It is the concept of enhancing medical care by analyzing and acquiring knowledge from previous evidence that can be in the form of Electronic Health Records (EHR) or other free text. Integration of the LHS with a big data infrastructure is essential to provide context aware medical assistance to patients at their home or any convenient point. Acknowledging the need for linking the LHS to a big data infrastructure, researchers cannot ignore that big data is generated through the cloud and wearable sensors that become part of the Internet of Things (IoT) [4]. The big data is generated as the cloud infrastructure plays an important role in providing a point of connection for healthcare staff, patients and other people that would help in building the context of the needed medical care, while patients use wearable sensors that transmit the medical data through a mobile application to the cloud based on some social interface. The neighbors and relatives facilitate and speed up this connection with the doctors and healthcare specialists.
The necessity of the cloud infrastructure is most felt when the data being generated from patients over social networks and IoT accumulates at the cloud interface, to be stored for further processing by healthcare analytics tools that convert this structured and unstructured data into meaningful yet understandable contextual information. Various scenarios have been considered for simulating healthcare community cloud infrastructures. In this paper, yet another scenario is considered for the simulation of the SmartHealth cloud. The SmartHealth cloud would be a beginning of context aware services being provided to patients at their
home or at the place of medical incident. This context aware platform built over the cloud and IoT integrated infrastructure would save cost as well as time to reach to hospital and ensuring the availability of services by the qualified staff. SmartHealth cloud is also referred as the healthcare community cloud that is context aware of the patients based on their current medical condition. This simulation model is compared to another previously simulated SelfServ Platform [6] and an agent-based societal information system for healthcare [8] that used NetLogo. The results depicted from these simulations would greatly assist in formation of standardized LHS.
2 Background

The healthcare system revolves around the prediction, detection, diagnosis, and treatment of various diseases. Specific to forming a medical diagnosis, there are some common associated risks and limitations [5]:
• To reach a well-established diagnosis, a physician is required to be well versed in some very experienced cases, and that experience does not come through completing academics but after lots of experience in a specialized field or disease.
• In the case of a new or rare disease, even experienced physicians feel they are at the same level as an entry level doctor.
• Humans are good at observing patterns or objects but fail when they need to find the associations or probabilities for their observations. Here computer statistics helps.
• Quality in diagnosis relates to the physician's talent and years of experience.
• A doctor's performance gets affected by emotions or fatigue.
• Training doctors in specialties is expensive as well as a long procedure.
• The medical field is always evolving: new treatments and new diseases come up with time. It is not easy for a physician to keep abreast of so much change and of new trends in medicine.
The step towards the formation of the Learning Healthcare System [3] is the realization that these problems can be conquered through computation. The use of computation, especially in medical diagnostics, is a complex task [5]. Until now, expecting a complete diagnostic system is understood as unrealistic. But, no matter how complex it is, advances are being made using artificial intelligence (AI) techniques. Computers have an advantage over humans as they do not get fatigued or bored. Computers update themselves within seconds and are rather economical. If an automated diagnostic system is made such that it takes care of routine clinical tasks in which patients are not too sick, then doctors are free to focus on serious patients and complex cases. The infrastructure for the LHS requires accessing all the data, whether it is EHR or free text, to apply healthcare analytics embedded with Natural Language Processing (NLP) capabilities. Integrated with a big data infrastructure, the implementation of the LHS seems possible at a personal level. This data is available in structured as well as unstructured form within clinical notes, with a detailed understanding of the patient's condition. For knowledge delivery in clinical apps, the use of NLP with other healthcare analytics techniques is a requirement. When the LHS is empowered with big data, the NLP techniques would be part of it.
The utilization of NLP in healthcare has been worked on since the 1980s. The Medical Language Extraction and Encoding System (MedLEE) [3], developed in the 1990s, is the most studied NLP system. The NLP-empowered IBM Watson is an independent analytical application that can be integrated with EHRs for the effectiveness of clinical practices. Mayo Clinic [3] has been actively conducting research to implement big data empowered clinical NLP in the LHS. These researchers used Apache Hadoop as the big data platform, HBase for fast key-based data retrieval, Apache Storm for a real-time distributed computation environment, and Elasticsearch for indexing and efficient retrieval of information. With the implementation of the big data empowered NLP infrastructure in its Unified Data Platform (UDP), it reads clinical documents in HL7 format through the Java Messaging System using an open source HL7-HAPI-based parser. A decision support system named MEA was implemented on top of this big data infrastructure for individualized care recommendations as part of clinical practice. Finally, Mayo Clinic has come up with a web-based application, AskMayoExpert (AME). There are 115 Care Process Models (CPMs) and 40 risk factor scoring tools for choosing the right intervention per CPM. But providers input patient data manually into these scoring tools and review the protocols in AME for defining the correct procedure for the patient [3]. To exhaustively utilize the knowledge assets in AME, there is a need to incorporate the knowledge mechanism into the clinical workflow of providers for individual patient care. Until now there is no mechanism for knowledge delivery in the context of patient-specific data. For point of care recommendations, there is a need to extract real-time information to be able to deliver patient-specific screening reminders, follow up calls, shared decision making tools, patient-specific links to AME and other resources. The success was in the formation of three working CPMs [3], namely hyperlipidemia, atrial fibrillation and congestive heart failure (CHF), in the pilot phase of MEA. SelfServ [6] is another platform that combines the social context of patients, their families and doctors, which can be termed a Service Oriented Community (SoC), with the complex-event processing (CEP) concept. Both these concepts were first introduced separately in two research projects, namely Little Sister, a Flemish project, and CASANET from Morocco. It was in recognition by Morocco that there was a wide spread of diabetes within its population. Diabetes, known to be a chronic disease, is spread all over the world and around 382 million people suffer from it, while about 50% of the affected population are unaware of their disease until it is diagnosed from visible symptoms that occur too late. The complex scenario that relates to the diagnosis of diabetes and the high cost attached to it led Morocco to initiate the CASANET project for CEP. Later the requirement was felt to capitalize on the benefits acquired by both projects to form an “ecosystem of care” that would enable the continuous monitoring of patients. The SelfServ platform is simulated using NetLogo and has tested SoC and CEP for (i) the efficiency gained when integrated with tailoring principles to some extent, and (ii) its effectiveness in trying to decrease waiting time at hospitals by delivering home care to elderly people managing complex conditions and providing access to healthcare facilities on a priority basis.
Finally, yet another agent-based societal information system for healthcare [8] has been simulated. Here there are basically two agents: a human agent and an assistant agent. The patient assistant agent is there to collect various feedback about healthcare
providers and to evaluate the care service given. The human agent acts in a patient role as well as in a health provider role. Four different strategies are designed for assisting a patient or attendants to find a health specialist, linked with three waiting strategies in three societal scenarios. Results demonstrate that the societal information system helps decrease the annual sick days of a person by 0.42–1.84 days, or 6.2–27.1%.
3 Related Work

In healthcare, decisions regarding disease diagnostics through classification are made by identifying patterns in the clinical practices followed by physicians [2]. It is not at all easy to comply with the needs of stakeholders in healthcare, whether they are patients, medical staff, administration or other governing bodies. The late adoption of big data approaches has left the medical field not fully prepared for its aspiration of future precision medicine designed for individualized healthcare using personalized information through learning. A systemized and accurate view of information about patients is not apparent to clinicians, and the knowledge of risks and benefits is often vague. Even if the evidence exists for a particular decision in a specific case, it is not always applicable to the patient. Predictions for personalized prognosis and response to treatment are required for improvement in informed decisions. Relationships between all factors, including the risks associated with drugs and devices, effective ways of prevention, diagnostics, prescriptions and related treatments, need to be apparent and known for an understanding of a patient's case in a healthcare system that is an integral part of society. Developing clusters of patient groups is one way to form taxonomies of diseases on the basis of their similarity because of shared characteristics and outcomes. Various machine learning algorithms are then applied over this data for analysis and for making meaning out of it. Empirical classification [2] is used for the selection of the best treatment strategy with predicted results. The knowledge of the best possible treatments is precedent to understanding the underlying mechanism of a disease and its response to treatment. This is a learning approach to reproduce consistent mechanisms that would work efficiently in diversified settings and patients. Inductive reasoning and pattern recognition [2], in contrast to deductive reasoning, are based on learning through observations to evolve conceptual models and tools for informed decisions. The approach is validated through testing the consistency of results and conclusions. Inductive reasoning is a little less certain in identifying causes and relates more to forming a confidence level than to reaching definitive conclusions. For example, there is no such experiment to validate that smoking causes cancer, but a high confidence level is gained through observation. A criterion is required that would assist researchers to interpret the results of millions and trillions of observations for rapid decision making. While retaining the complexity of patients and medical decision making, big data approaches highlight the interactions between all the underlying factors to reach the future of precision medicine, diminishing the effects of associated risks and outcomes. Researchers are good at finding patterns without knowledge of the outcome. Prior to coming up with our SmartHealth cloud model, several of the latest healthcare analytics systems have been reviewed. Relating to one such proposed system, we evaluated GEMINI [10]
and what it is composed of. GEMINI refers to “GEneralizable Medical Information aNalysis and Integration System”. It provides doctors with the capability of utilizing point of care analytics over the information gathered on patients through examining. GEMINI stores the information of patients coming in through various channels and forms a patient profile graph. These patient profile graphs are structured letting GEMINI infer an implicit view for administration and clinical trials to perform predictive analytics. To keep itself updated it keeps doctors interactive through feedback loop and managing its self-learning knowledge base. The case study presented to validate its usefulness is through predicting unplanned patients readmissions. Furthering its validation it has also taken into account the prediction of diabetes mellitus (DM) if it is controlled based on lab test for HbA1c. To form structured data GEMINI referred to the medical knowledge base of Unified Medical Language System and some other similar data sources. EPIC forms an important utility of GEMINI to perform analytics as it carefully categorizes clinical data in multiple nodes and help in scalable NLP processing and data analytics using MapReduce for entity extraction, Pregel for graphical modeling, deep learning for analytics, etc. Then another important healthcare cloud analytics system [11] is seen based on patient profiles to monitor diabetes mellitus (DM). This system again relates to converting patients’ data in patient profile graphs but has only added the ability of cloud to converge all the data coming in from various sources in raw form. The analytics are now an integrated part of the cloud for processing. Patient Electronic Health Records (EHR) is a good source to apply data analytics [12]. Data-driven healthcare promises the future direction for achieving effective personalized medical care to reach millions of patients through application of big data analytics over medical data that would transform healthcare. To represent EHR in understandable format to learn becomes a challenge because of various factors that are; data sparseness, temporality, disjoint, noise, and bias, etc. The problem is addressed by extracting features and phenotyping from EHRs applying a proposed deep learning approach. A temporal matrix is formed for each patient EHR that has time as one dimension and event on other. And then four layer convolutional neural networks model [12] is built to extract phenotypes to form prediction. First layer is having EHR matrices. The second is a one-sided convolutional layer to extract phenotypes from the first layer. Third layer is a max pooling layer that extracts most significant phenotypes by introducing sparsity on the detected phenotypes. Fourth layer is known to be fully connected softmax prediction layer. To deal with the temporality three different temporal fusion mechanisms have been investigated in the model that were; early fusion, slow fusion, and late fusion. Smooth temporal connectivity is important for prediction and every data sample is taken as collection of short, fixed size sub-frames where connectivity is formed based on time dimension further learning temporal features. Single-frame architecture is adopted as part of the proposed basic model [12]. It is to view EMR record as a static matrix. A normalization layer is added to it. Single-frame architecture evaluates the contribution of static representation of data towards classification accuracy. 
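The four-layer convolutional model summarized above (an EHR temporal matrix as input, a one-sided convolution, max pooling, and a fully connected softmax layer) can be pictured with a brief sketch. The following is a minimal, hypothetical Keras rendition for illustration only; the matrix dimensions, filter counts and the handling of temporal fusion are assumptions and do not reproduce the architecture of [12].

```python
# Illustrative 4-layer CNN over an EHR "time x event" matrix (not the original model).
import tensorflow as tf

n_events, n_time, n_classes = 64, 90, 2   # assumed sizes: event types, time steps, outcomes

model = tf.keras.Sequential([
    # Layers 1-2: the EHR matrix is treated as a sequence over time with one
    # channel per event type; a temporal (one-sided) convolution extracts
    # phenotype-like filters.
    tf.keras.layers.Conv1D(filters=32, kernel_size=5, activation="relu",
                           input_shape=(n_time, n_events)),
    # Layer 3: max pooling keeps only the strongest phenotype responses.
    tf.keras.layers.GlobalMaxPooling1D(),
    # Layer 4: fully connected softmax prediction layer.
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```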
The model is validated on various chronic diseases for predictive outcome in a real time scenario comprising of EHR data warehouse. It was clearly demonstrated [12] that temporal fusion connectivity within CNN-based model on real clinical data boosted the overall performance of prediction model. Still in [6] it is
clearly defined that trying to directly transform raw patients’ and clinical data in structured form to put it through analytics based on neural networks would not be a good measure. It is seen as compromising the accuracy of results as lots of details would be missed out at abstract level. Therefore, the study is formed to first perform data mining techniques over the data to get a detailed structured view. For this detailed view the proposed method for clustering is random subsequence method and its advanced form is random dynamic sub-sequence method. After the thorough realization of detailed structured view of data rigorous machine learning techniques should be applied for final more accurate results. For analyzing complex medical data, the use of artificial intelligence is known to be more capable [13]. The way artificial intelligence exploit and relate the complex datasets giving it a meaning to predict, diagnose and treat the particular disease in a clinical setup. Several artificial intelligence techniques with significant clinical applications have been reviewed. Its usefulness is explored in almost all fields of medicine. It was found that artificial neural networks were the most used technique while other analytical tools were also used and those included fuzzy expert systems, hybrid intelligent systems, and evolutionary computation to support healthcare workers in their duties, and assisting in tasks of manipulation with data and knowledge. It was concluded that artificial intelligent techniques have solution to almost all fields of medicine but much more trials are required in a carefully designed clinical setup before these techniques are utilized in real world healthcare scenarios. Artificial intelligence (AI) is termed to be part of science and engineering known to exhibit computational understanding to be said as behaving intelligently through the creation of such artifacts that form the stimulus in it. Alan Turing (1950) explained intelligent behavior [13] of a computer that could act as any human for cognitive tasks applying logics (right thinking) and this theory was later termed as ‘Turing Test’. To automate medical diagnosis, the artificial neural networks (ANN) approach is greatly studied in combination with fuzzy approach [5]. The experimental setup was created to analyze the medical diagnosis done by physicians and automate it with machine implementable format. Selecting symptoms of eight different diseases a dataset containing few hundreds of cases was created and Multi-layer perception (MLP) Neural Networks was applied. And, later results were discussed to conclude that effective symptoms selection for data fuzzification using neural networks could lead to automated medical diagnosis system. A diagnostic procedure is work of an art of specialized doctors and physicians. It starts from patient’s complaints and discussion with doctor that leads to perform some tests and examinations and on basis of results the patient’s status is judged and diagnosis is formed. Then the possible treatment is prescribed. And, patient remains under observation for some time where the whole diagnostic procedure is repeated and refined or even rejected if needed. We all are aware of the complexity of forming a medical diagnosis as even the profession requires twice the study than other professions. There is diversified symptoms history that is caused for diversified reasons. All these causes have to be included in the patient history. 
It has already been determined that artificial neural networks are nearly accurate in predicting various other chronic illnesses such as coronary artery disease, myocardial infarction, brain disease (multiple sclerosis and cerebrovascular disease), Down's Syndrome in unborn babies, benign and malignant ovarian tumors, liver cirrhosis or nephrotic syndrome, and inflammatory
bowel disease. Finally, reviewing Deep Patient Representation [16] based on unsupervised artificial neural networks (ANN) learning mechanism that iterates through multiple layers to reach a definite conclusion. It is again observed that it showed increased accuracy in predicting severe diabetes, cancers and schizophrenia. Moving ahead Deep learning algorithm and techniques would be exploited here as it is an emerging research paradigm not just limited to feature generation in images and speech applications, but also its application is seen in convolutional neural networks (CNNs) with high accuracy in classification. All these models are data driven but the choice of architecture is highly intuitive and depends on the parameters selection. Multi-node Evolutionary Neural Networks for Deep Learning (MENNDL) [14] is therefore presented to tackle this issue. This model does the automatic network selection over computational clusters where optimized hyper-parameterization is achieved through genetic algorithms [13]. These hyper-parameters need to be known before deep learning (DL) is applied to any dataset. Hyper-parameters can be searched manually, in a grid, or randomly. Random search proves to be better off as it is focused and fix the issue of under-sampling important dimensions as in grid search while it is as easy to implement and parallelize. But it suffers from non-adaptive and sometimes can also be out performed by manual or grid search. But genetic algorithms resolve this issue letting hyper-parameters to be searched randomly while learning through previous results to direct the search. While genetic algorithm is more towards covering shallow networks, deep learning algorithm is for deeper networks with multiple hidden layers holding multiple hidden units. Evolutionary algorithm [13] is applied on MENNDL for optimizing the hyper-parameters to eventually evaluate the fitness of the population where genes i.e. hyper-parameters to be optimized, are calculated by a single node handling selection, mutation and cross-over. ORNL’s Titan super computer platform is used to experiment MENNDL framework. MENNDL framework would allow application of deep learning algorithm on new applications with less difficulty. Synchronous evolutionary algorithm [14] would not be able to maximize the use of available resources where the hyper-parameters identified add to the computational complexity. Also, the time required for evaluation of individuals i.e. hyper-parameters set vary greatly as other nodes wait idly for one node to be evaluated for its assigned hyperparameters set. Its remedy could be found in asynchronous evolutionary algorithms [15] to maximize use of resources and better management of time. Thus proved to be the base of future machine learning framework in healthcare decision support system and our proposed SmartHealth cloud.
4 Existing Simulation Environments

Initially, we consider a medical diagnostic system [5] that was proposed based on several interviews conducted with expert doctors, obtaining diagnostic flow diagrams of various diseases and an associated list of symptoms. The dataset was created such that hundreds of patients were tested against 11 symptoms (features) and 9 diseases (classes). The first eight classes were associated with specific illnesses and the ninth one was defined for a normal/healthy person. A Multilayer Perceptron (MLP) neural network was used with the application of back propagation GDR training algorithms, and the simulation was
developed on MATLAB with the NETLAB toolbox. A three-layer feed-forward perceptron was kept to keep the structure simple and to focus on the hidden nodes, with the training iterations as variable parameters. The performance of the classifier was tested while changing the parameters. Also, the effect of the feature fuzzification rule [5] on the accuracy of diagnosis was investigated. With a k-folding scheme, the lack of data was overcome to give better accuracy for validation. So, the training procedure was repeated k = 5 times, with 80% as the training dataset and 20% as testing. The mean was taken over all the outcomes of the 5 tests. A best accuracy of 88.5% was achieved with 30 nodes at the hidden layer. Then, a membership-based fuzzification scheme [5] was applied to the dataset, converting it to a fuzzified set of symptoms. A linear membership function was selected for each symptom in consultation with experts. From three to five linguistic variables were linked to each symptom and the classification tests were repeated, and maximum performance was achieved with a diagnostic accuracy of 97.5%.

Moving towards simulating a cloud environment, distributed simulation (DS) is seen as an effective and most reliable option. Distributed simulation [7] is applicable for large-scale systems where multiple modules combine to form one system, or one system is divided into multiple modules to be simulated on multiple nodes. In the case of operations research and management sciences (OR/MS), which includes healthcare, manufacturing, logistics, commerce, etc., distributed simulation has rarely been applied. The barrier to applying DS over OR/MS systems [7] is the technical complexity and the gap between DS and OR/MS scenarios as visualized by the community. OR/MS systems are simulated using various techniques, from System Dynamics (SD) and Discrete Event Simulation (DES) to, more recently, Agent Based Simulation (ABS). Larger systems take longer to simulate, and it would be convenient if a hybrid model were simulated using a combination of various techniques to manage the heterogeneity and size of the system effectively. Using distributed simulation for simulating OR/MS systems would not compromise the details that could be lost otherwise. If distributed simulation [7] follows the IEEE recommended practice for the Distributed Simulation Engineering and Execution Process (DSEEP) standard, it could model the interoperability between models that use the High Level Architecture (HLA) standard. DS could distribute the load over a network, forming hybrid federations for capturing the heterogeneity of the whole system realistically. There are some examples representing details of implementing DS over OR/MS scenarios [7]. One case study is from the automotive industry, for a virtual manufacturing environment using the Distributed Manufacturing Simulation Adapter (DMS) developed by the National Institute of Standards and Technology (NIST). It used Arena (DES software), QUEST, Simul8 (DES software) and JADE as a multi-agent systems framework to simulate a supply chain network. Finally, it used HLA web services for model discovery and federation creation; thus the car manufacturing DS demonstrates enterprise interoperability. For various healthcare scenarios [7], there are various DS tools including Arena, NetLogo, FlexSim Healthcare (www.flexsim.com/flexsim-healthcare), simul8 (www.simul8.com), ProModel (www.promodel.com), simio (www.simio.com) and Repast Simphony (repast.sourceforge.net)
A case study [7] is also presented on simulating a hybrid model of the London Emergency Medical Services (EMS) using ABS and DES techniques. A semantic relationship between the two techniques was implemented, in which an agent in the ABS can be a resource or an entity in the DES; DES is event driven while ABS is time driven. In the London EMS, the Accident and Emergency (A&E) departments are closely related to the ambulance service and need to communicate their availability to the ambulance model efficiently. A patient object, or agent, is transferred from the ambulance service model to the entry point of the A&E department model. The ambulance should inform the A&E department model that a patient is on the way so that resources are pre-allocated and conflicts in hospital availability are avoided; in addition, all properties of the patient object must be known to the A&E department model beforehand. To simulate this case study, the Repast Simphony (RepastS) [7] open-source toolkit was used. RepastS is primarily an ABS tool, but it can be turned into a DES through explicit coding, since both are discrete-time simulations; the fundamental components of the DES are queues, work stations and resources. The Portico v2.0 RTI, which implements IEEE 1516-2000, was used as the interface. The whole simulation is nevertheless complex, spanning planning, development and experimentation phases and requiring substantial computation and resources; for example, experimenting with this simulation would use a complete LAN with 1 Gbps network cards.

NetLogo is seen as a less complex simulation tool that can model multi-agent environments in which, at times, the agents have system dynamics (SD) embedded within them. The SelfServ platform [6] combines the social context of patients, their families and doctors, acting as agents in what is termed a Service Oriented Community (SoC), with the complex event processing (CEP) concept. SelfServ [6] is simulated using NetLogo, and SoC and CEP were tested for (i) the efficiency gained when integrated with tailoring principles to some extent, and (ii) their effectiveness in decreasing waiting times at hospitals by delivering home care to the elderly managing complex conditions and by providing priority access to healthcare facilities. There is also the case of chemical synergy [17] developing between atoms and molecules when they collide: the atoms (molecules), acting as agents or systems, collaborate or collide with each other in their neighborhood and form a bonding known as a System of Systems (SoS). In this way the researchers validated the Boardman–Sauser SoS theory on the basis of collaborative SoS [17] using NetLogo. Another case study [8] simulates an agent-based societal information system for healthcare in NetLogo. Four different strategies are designed for assisting a patient or attendants in finding a health specialist, linked with three waiting strategies in three societal scenarios. The results demonstrate that the societal information system helps decrease the annual sick days of a person by 0.42–1.84 days, or 6.2–27.1%.

Finally, we evaluated the AnyLogic simulation tool [18] with respect to SD, DES and ABS. In comparison to tools such as Vensim, Stella and iThink (which are specifically for SD) and ProModel (which is mainly for DES), AnyLogic is able to model all three simulation approaches, whether SD, DES or ABS. Compared to other tools it requires programming in Java/C++, but it gives better results and can present them as 3D visuals.
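As a purely illustrative sketch of the ABS/DES coupling described above (an agent in the ABS acting as an entity handed to an event-driven queue), the following toy model uses SimPy; it is not the Repast Simphony implementation of [7], and the arrival and service rates, bed capacity and run length are arbitrary placeholders.

```python
# Toy ABS/DES coupling: ambulance "agents" (time driven) hand patients to an
# event-driven A&E bed queue. Not the Repast Simphony model of [7]; all rates
# and capacities are arbitrary. The ambulance process also stands in for the
# patient occupying the bed during treatment, to keep the sketch short.
import random
import simpy

def ambulance(env, name, ae_beds, waits):
    while True:
        yield env.timeout(random.expovariate(1 / 20))       # time to next pickup (min)
        arrival = env.now
        with ae_beds.request() as bed:                       # request an A&E bed
            yield bed
            waits.append(env.now - arrival)                  # waiting time before handover
            yield env.timeout(random.expovariate(1 / 30))    # treatment time (min)

random.seed(1)
env = simpy.Environment()
ae_beds = simpy.Resource(env, capacity=4)                    # A&E department resource
waits = []
for i in range(6):
    env.process(ambulance(env, f"amb{i}", ae_beds, waits))
env.run(until=8 * 60)                                        # simulate an 8-hour shift
print(f"patients treated: {len(waits)}, mean wait: {sum(waits) / len(waits):.1f} min")
```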
In [20], Role Activity Diagram (RAD) notation is used to build a Discrete Event Simulation (DES) model for process mapping of collaborative, complex healthcare service delivery; the tool used is AnyLogic. To build the DES model, a RAD
model is first created for the collaborative healthcare service delivery process. Extensive staff interviews are conducted to extract the significant terms to be translated into the RAD, and a matrix-based tool then determines the relationships between the terms in the form of 0s and 1s. Finally, a graphical representation of the RAD is automatically generated in MS Visio. A data model is then needed, associated with the RAD, to be transferred to the DES model. The data model [20] consists of detailed descriptions of (i) roles, (ii) activities and (iii) interactions and encapsulated processes, and eXtensible Markup Language (XML) is used to transform this data model to other domains (a sketch of such a data model serialized to XML is given at the end of this section). Each RAD attribute in the Visio diagram is linked with a data entry form that is in turn stored in text or XML format. The next step is to derive the DES model from the RAD model: AnyLogic, used for developing the DES, is integrated with the RAD software, and dynamic parameters are then provided to the DES in AnyLogic for validation. The case study chosen to provide these dynamic parameters is the MR scanning process in a radiology department [20]. The analysis [20] of this RAD-based DES model of the MR scanning process helped to identify bottlenecks, waiting times and low throughput while using minimum resources.

Dynamic Simulation Modeling (DSM) [19] has also been reviewed for integrating big data in health. Healthcare delivery systems are recognized as complex, involving various stakeholders communicating dynamically, and the purpose of DSM is to transform this complex scenario into mathematical representations as formal methods. DSM is differentiated from other simulation models by its explicit representation of states, of non-linear or special relationships among stakeholders, and of the changes in behavior that emerge as the system evolves over time. DSM [19] incorporates three common methods: Discrete Event Simulation (DES), System Dynamics (SD) and Agent Based Modeling (ABM). These methods cater for both the stochastic and the deterministic scenarios and processes of a complex system, incorporating diverse data types ranging from administrative records to mobile-device data. DSM uses data for empirical evaluation in two ways: first, for model parameterization, where data are inserted directly or indirectly; and second, for model calibration, where diverse data are used so that the model is consistent with its emergent behavior. Continuous incoming streams of big data would help the model evolve in new dimensions while mitigating the risks of highly inaccurate model parameters, missing processes or excessive levels of aggregation. Such a model would be able to anticipate the forward inclinations of the system over time, remain informed of the downstream trade-offs that would occur, and thus build confidence in the initial state estimates based on the latest available evidence. Table 2 of [19] gives an overview of five organizations developing such applications integrated with big data, where the potential for applying DSM is seen.

Seven major challenges are highlighted when integrating big data with these applications. The first is the accessibility of big data, which resides in various databases dispersed around the world and may be restricted by geographical policies. The second is the linking of data coming in from various sources: one patient may have records in multiple hospitals/clinics or with several providers, and health claim or insurance data may be held by several other parties. The third challenge concerns data reliability, normalization and cleaning.
Limitations in the form of under-coding or over-coding of diseases and procedures, further truncated by claims officers, cannot be ignored; cleaning and standardization of EMR data is therefore required but has received little attention. The fourth challenge is the maintenance and updating of databases
periodically, with a proper time stamp on each event. The fifth involves the ownership and control of data, and the sixth the privacy and security of patient data, which is integral when linking external databases. Finally, it is observed that building a dynamic simulation model catering for all of these underlying issues would not be an easy task, and if big data analytics is included to process this vast amount of heterogeneous data, the effort increases considerably.
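The RAD data model of [20] (roles, activities, interactions and encapsulated processes, exported to XML) is described only at a high level. The sketch below shows one way such a model might be represented and serialized; the element and attribute names, and the MR-scanning roles, are illustrative assumptions, not the schema used in [20].

```python
# Sketch of a RAD-style data model (roles, activities, interactions) serialized to
# XML, in the spirit of [20]. Element and attribute names are illustrative assumptions.
import xml.etree.ElementTree as ET

rad_model = {
    "roles": ["Radiographer", "Radiologist", "Scheduler"],
    "activities": [
        {"role": "Scheduler", "name": "BookMRSlot"},
        {"role": "Radiographer", "name": "PerformMRScan"},
        {"role": "Radiologist", "name": "ReportScan"},
    ],
    "interactions": [
        {"from": "Scheduler", "to": "Radiographer", "name": "SendWorklist"},
        {"from": "Radiographer", "to": "Radiologist", "name": "ForwardImages"},
    ],
}

root = ET.Element("RADModel", process="MRScanning")
for role in rad_model["roles"]:
    ET.SubElement(root, "Role", name=role)
for act in rad_model["activities"]:
    ET.SubElement(root, "Activity", role=act["role"], name=act["name"])
for inter in rad_model["interactions"]:
    ET.SubElement(root, "Interaction",
                  attrib={"from": inter["from"], "to": inter["to"], "name": inter["name"]})

print(ET.tostring(root, encoding="unicode"))   # XML form handed to the DES tooling
```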
5 Our Approach

We would simulate our SmartHealth cloud in AnyLogic [18], creating a hybrid model based on DSM [19] that includes System Dynamics (SD), Discrete Event Simulation (DES) and Agent Based Simulation (ABS). The simulation model would be detailed and would serve as an extension of the SelfServ platform [6] integrated with the societal information system [8], both simulated in NetLogo as discussed earlier. In our case study, SmartHealth cloud processes and events would be depicted through Discrete Event Simulation. Agent based simulation would further map the entities, namely doctors, patients and facilitators, as agents forming part of a social community in the cloud infrastructure. These agents would have different System Dynamics that communicate to evolve a complete diabetic diagnostic system. Our simulation design model and results would be compared with both approaches, with their parameters combined for evaluation.
5.1 Proposed SmartHealth Cloud Simulation Model
Our SmartHealth cloud would be composed of agents that are patients, doctors and facilitators. The patient agent would have system dynamics implemented within it that would read all the symptoms for diabetes and send an alert whenever a major incident is sensed. The facilitator agent would receive the alerts from nearby patients and, based on the problem, would recommend a doctor. The doctor would receive all the symptoms and test reports of the patient to perform diagnostics. If the patient is diagnosed with diabetes, he or she would be queued for further treatment; otherwise the patient would be sent to another related doctor or possibly discharged. All these agents are part of the SmartHealth cloud, which is assumed to take inputs from biosensors attached to the patients.
5.2 SmartHealth Cloud (Discrete Event Simulation)
The SmartHealth Cloud is simulated in AnyLogic (anylogic.com) with mainly three entities, patient, facilitator and doctor, that are treated as intelligent agents. These entities communicate, and the events that occur are shown in Fig. 1.
5.3 Facilitator Agent
Facilitator agents continuously update their list of available healthcare providers based on quality of service and ranking. Their role begins when a patient agent raises an alert and shares a critical case. The facilitator agent, using its system dynamics, applies a rigorous clustering algorithm to map patients to the most relevant diseases from which they may be suffering. The complex, informal temporal data must be given a structured form, and the best approach considered is sequential clustering [9]; within sequential clustering, the best approach considered is the Random Dynamic Subsequence method proposed in [9]. The facilitator agent then matches the patient to the most suitable doctor for the identified disease, choosing the one with the higher confidence level as outlined in the inductive reasoning approach of [2].
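To illustrate the facilitator's two steps (group an incoming symptom vector with the most relevant disease cluster, then pick the doctor with the highest confidence for that cluster), a minimal sketch follows. It is not the Random Dynamic Subsequence method of [9]: a plain k-means over placeholder symptom vectors stands in for the clustering, and the doctor registry and confidence scores are hypothetical.

```python
# Illustrative facilitator logic: cluster symptom vectors into candidate disease
# groups, then match the patient to the highest-confidence doctor for that group.
# NOT the Random Dynamic Subsequence method of [9]; data and scores are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
symptom_vectors = rng.random((200, 11))                 # historical patient records
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(symptom_vectors)

# Hypothetical registry: cluster id -> list of (doctor, confidence for that cluster)
doctor_registry = {
    0: [("dr_endocrinology", 0.92), ("dr_general", 0.70)],
    1: [("dr_cardiology", 0.88)],
    2: [("dr_general", 0.75)],
    3: [("dr_nephrology", 0.81)],
    4: [("dr_endocrinology", 0.67), ("dr_general", 0.66)],
}

def recommend_doctor(new_patient_vector):
    cluster = int(kmeans.predict(new_patient_vector.reshape(1, -1))[0])
    # Pick the doctor with the highest confidence for this cluster (cf. [2]).
    return max(doctor_registry[cluster], key=lambda d: d[1])

alert_vector = rng.random(11)                           # symptoms from a patient alert
print(recommend_doctor(alert_vector))
```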
Fig. 1. Smart health hybrid cloud simulation model.
5.4 Patient Agent
The patient agent receives data from various IoT biodevices in the form of medical history, current symptoms, profile history and family history. The patient agent has system dynamics to compute a profile graph [10, 11] for every patient and is responsible for raising an emergency alert whenever a criticality occurs.
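As a toy illustration of the system-dynamics loop inside the patient agent (integrate a stock from inflow/outflow rates and alert when it crosses a critical level), the following sketch uses a crude blood-glucose stock; the rates and thresholds are placeholders and not the profile-graph method of [10, 11].

```python
# Toy stock-and-flow loop for a patient agent: integrate a glucose "stock" from
# simple inflow/outflow rates and raise an alert above a threshold. The rates and
# the threshold are illustrative assumptions, not values from [10, 11].
def patient_agent(readings, alert_threshold=250.0, dt=1.0):
    """readings: iterable of (carb_intake_rate, insulin_effect_rate) per time step."""
    glucose = 110.0                                 # initial stock (mg/dL)
    alerts = []
    for t, (carb_in, insulin_out) in enumerate(readings):
        glucose += dt * (carb_in - insulin_out)     # stock-and-flow update
        if glucose > alert_threshold:
            alerts.append((t, glucose))             # would notify the facilitator agent
    return alerts

# One simulated stretch of hourly sensor-derived rates (placeholder numbers).
day = [(30, 20), (10, 15), (60, 25), (40, 20), (80, 30), (20, 35)]
print(patient_agent(day, alert_threshold=200.0))    # alerts at the hours crossing 200
```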
5.5 Doctor Agent
The doctor agent receives the patient information and lets its system dynamics perform the diagnostics; as mentioned in [9], the structured data is further understood using machine learning techniques [16]. If a diagnosis is reached, it is communicated to the patient agent together with a recommendation that the patient be admitted, go for a routine checkup, or be given a medical prescription. If a diagnosis is not reached and the consultation of another expert is needed, the patient profile is forwarded to the next most relevant doctor agent until a diagnosis is formed. For critical diagnoses, human doctors get involved to give feedback and update the doctor agent.
6 Evaluating the SmartHealth Cloud Model by Comparison

Comparing our model with [6, 8], the SelfServ platform [6] is simulated in NetLogo using a multi-agent simulation environment. It combines Complex Event Processing (CEP) with Service Oriented Communities (SoC) over a cloud environment, and it justifies its approach by demonstrating two case studies through a brief discussion of the preliminary results obtained. The model is reported to achieve a 10% improvement in sensitivity and a relatively marginal reduction in social costs and waiting times. While achieving strength in these metrics, other metrics were computed for the number of treated cases and the average and overall servicing times. In the example considered, the platform predicts a critical glucose level in order to alert the patient and the doctor in advance; the patient is then connected to the doctor to receive feedback on the required therapeutic treatment. It remains more of a telecare servicing model [6] that aims to carry out further simulations in order to arrive at a healthcare services platform.

In [8], the societal information system for health is simulated using NetLogo. It again takes the form of a multi-agent system, owing to the limitations of NetLogo. Its purpose is to connect patients over a societal information system so that they can be guided on health insurance issues and whether it is cost-effective, obtain assistance in finding a good healthcare provider, and get advice on cost-effective drugs and care, while the spread of disease symptoms and the available treatments are monitored. It mainly has two agents: a patient assistant and a human agent. The assistant agent collects various feedbacks about healthcare providers and evaluates the care service given, while the human agent maps both a patient role and a health provider role. The evaluation parameters consisted of the number of patients reported sick every day; the providers' capacity was designed to take eight patients per day. Various strategies were modeled with respect to whether a patient "waits" or "does not wait", and the number of days a patient's treatment took depended on the selection of the best strategy in the given scenario.

Acknowledgment. This paper would not have existed without the guidance and support of Dr Tehmina Amjad, Dr Almas Abbasi and Dr Hafiz Farooq Ahmad, who are regarded as eminent scholars in the field.
References

1. Boukenze, B., Mousannif, H., Haqiq, A.: Predictive analytics in healthcare system using data mining techniques. Comput. Sci. Inf. Technol., Hassan 1st University, Morocco and Cadi Ayyad University, Morocco (2016)
2. Krumholz, H.M.: Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff. 33(7), 1163–1170 (2014)
3. Kaggal, V.C., et al.: Toward a learning healthcare system – knowledge delivery at the point of care empowered by big data and NLP. Innov. Clin. Inform., Division of Information Management and Analytics, Mayo Clinic, Rochester, MN, USA (2016)
4. Botta, A., et al.: Integration of cloud computing and internet of things: a survey. Future Gener. Comput. Syst., University of Napoli Federico II, Italy (2016)
5. Moein, S., Moallem, P., Monadjemi, A.: A Novel Fuzzy-Neural Based Medical Diagnosis System. University of Isfahan, Isfahan (2009)
6. Florio, V., et al.: Towards a smarter organization for a self-servicing society. In: ACM DSAI 2016, Morocco & Belgium (2016)
7. Anagnostou, A., Taylor, S.J.E.: A distributed simulation methodological framework for OR/MS applications. Simul. Model. Pract. Theory 70, 101–119. Elsevier, Department of Computer Science, Brunel University London, UK (2017)
8. Du, H., Taveter, K., Huhns, M.N.: Simulating a societal information system for healthcare. In: Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 1239–1246. University of South California, USA (2012)
9. Zhao, J., et al.: Learning from heterogeneous temporal data in electronic health records. J. Biomed. Inform. 65, 105–119. Stockholm University, Sweden (2017)
10. Ling, Z.J., et al.: GEMINI: an integrative healthcare analytics system. Proc. VLDB Endow. 7(13). National University Health System, Singapore (2014)
11. Kulkarni, S.M., Babu, B.S.: Cloud-based patient profile analytics system for monitoring diabetes mellitus. IJITR, 228–231 (2015)
12. Cheng, Y., Wang, F., Zhang, P., Hu, J.: Risk prediction with electronic health records: a deep learning approach. Healthc. Anal. Res., IBM T.J. Watson Research Center and University of Connecticut (2016)
13. Ramesh, A., Kambhampati, C., Monson, J., Drew, P.: Artificial intelligence in medicine. The Royal College of Surgeons of England (2004)
14. Young, S.R., Rose, D.C., Karnowski, T.P., Lim, S., Patton, R.M.: Optimizing deep learning hyper-parameters through an evolutionary algorithm. MLHPC2015, ACM, 15–20 November, Oak Ridge National Laboratory (2015)
15. Alba, E., Giacobini, M., Tomassini, M., Romero, S.: Comparing Synchronous and Asynchronous Cellular Genetic Algorithms. University of Malaga, Spain and University of Lausanne, Switzerland (2000)
16. Miotto, R., et al.: Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, Icahn School of Medicine at Mount Sinai, New York, NY (2016)
17. Baldwin, W.C., Sauser, B.J., Boardman, J.: Revisiting "the meaning of of" as a theory for collaborative system of systems. IEEE Syst. J., University of North Texas, USA (2015)
18. Sumari, S., et al.: Comparing three simulation model using taxonomy: system dynamic simulation, discrete event simulation and agent based simulation. Int. J. Manag. Excell. 1(3). Faculty of Computing, Universiti Teknologi Malaysia (2013)
19. Marshall, D.A., et al.: Transforming healthcare delivery: integrating dynamic simulation modelling and big data in health economics and outcomes research. PharmacoEconomics, Springer, Clinical Engineering Learning Lab, Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, USA (2015)
20. Shukla, N., Keast, J.E., Ceglarek, D.: Role activity diagram-based discrete event simulation model for healthcare service delivery processes. Int. J. Syst. Sci., Online First 1–16, University of Wollongong, Wollongong, NSW, Australia, University of Wisconsin, Madison, USA (2015)
SDWM: Software Defined Wi-Fi Mobility for Smart IoT Carriers

Walaa F. Elsadek(&) and Mikhail N. Mikhail

Department of Computer Science and Engineering, American University in Cairo, P.O. Box 74, New Cairo 11835, Egypt
{walaa.farouk,mikhail}@aucegypt.edu
Abstract. More and more companies are advocating for additional unlicensed spectrum in next-generation Wi-Fi to cover the vast increase in internet-connected devices: an additional 50 billion devices are projected to be connected by 2020. Wi-Fi is regarded as the "Oxygen for Innovation" that will bring new bundles of services in the smart IoT era owing to its low-cost service delivery. This paper introduces a new mobility framework, called SDWM, based on Software Defined Networking (SDN), to extend residential/enterprise indoor real-time services across standard carriers and service providers with Network Functions Virtualization (NFV) in smart cities. An efficient data forwarding mechanism and a traffic offload technique are adopted to avoid core network congestion. Indoor services are extended over any type of infrastructure without enforcing a small cell setup. Mobility is achieved through an SDN overlay network that dynamically establishes a virtual path to the roaming mobile node's (MN) home network, using a new unique identifier that is forwarded during the DHCP IP allocation process. The distributed architecture simplifies integration with existing infrastructures, with unified access to both wireless and wired networks. A physical prototype illustrates how mobile nodes can roam freely across the wireless hotspots of carriers holding direct agreements with the home networks, while ensuring seamless accessibility to indoor services without violating the involved entities' security policies. Experimental results show clear improvements over existing mobility protocols and wireless controllers, owing to the elimination of tunnel overheads and VLAN/MPLS headers.

Keywords: Mobility · Indoor · CAPWAP · Wireless · Software Defined Networking (SDN) · Network Functions Virtualization (NFV) · Real time services · PMIP · vCPE
1 Introduction

The IDG Enterprise survey "Building the Mobile Enterprise" highlights the extent to which mobility has become a driving factor for business prosperity. The survey reveals that 64% of organizations regard mobile as a critical tool enabling fast decisions, facilitating internal communications, and enhancing customer retention, while 49% intend to leverage their Wi-Fi networks to handle more devices. Internal networks' reliability is lower in priority, while the major aspect attracting 63% of enterprises and 48% of small and medium-sized businesses (SMB) is communication
security. Intranet service becomes less important as employees use their smart phones to carry most data services, with the rise of COPE (Corporate Owned, Personally Enabled) and BYOD (Bring Your Own Device) policies. Thus, the Wi-Fi provider's role becomes limited to internet access, with continuous migration of services to private clouds [1].

For maximum benefit from both the unlicensed spectrum and the lower cost of Wi-Fi services, a new framework is required to enhance poor indoor coverage and capacity and to extend IoT smart residential/enterprise services across outdoor hotspots for seamless mobility. Existing mobility solutions that extend indoor services are almost entirely restricted to 4G Long Term Evolution (LTE), with limited offerings in scope, scalability, cost, and privacy. For example, a Distributed Antenna System (DAS) has wide scalability and coverage, but its deployment may require an LTE upgrade and is not cost competitive except for very large and multi-carrier venues. On the other hand, small cell setups such as femtocells or macrocells are cost effective compared to DAS but lack wide coverage and scalability [2]. Even Wi-Fi small cells adopting the Control and Provisioning of Wireless Access Points (CAPWAP) protocol serve as remote-controlled APs bridging indoor traffic through a centralized Wireless LAN Controller (WLC), without in-depth concern for the customer's security policy or WAN link congestion [3].

The current role of the Wi-Fi/DSL/Cable service provider is almost confined to providing megabytes at a good price, without any strategic impact on the customer's business or cost base. With about three-quarters of mobile voice and data sessions originating indoors, a new opportunity is open for service providers to raise their value by launching new bundles of services [2]. Solutions for indoor service accessibility must incorporate a strongly differentiated offering to create a profitable niche market. Moreover, compliance with a unified access methodology for both wireless and wired networks is a must for seamless integration with the enterprise physical network or private cloud services. Assume a voice IP phone application is installed on an employee's smart phone: Quality of Service (QoS) must be guaranteed, with seamless integration to the enterprise IP Private Branch Exchange (PBX), while moving across the enterprise's main wireless network or branches. The same scenario applies to cable providers in smart cities. Customers need to roam freely across Wi-Fi carriers' hotspots with continuous access to residential services such as VoIP, remote health monitoring, IPTV broadcast, video streaming, smart home services, etc. The emergence of the Software Defined Wireless Network (SDWN) facilitates such an achievement through the scalable, programmable Software Defined Networking (SDN) infrastructure.

This paper presents a novel LTE-independent framework, called SDWM, for offering residential/enterprise indoor service mobility across enterprise branches, Network Functions Virtualization (NFV) wireless operators, and next-generation SDWN carriers, with efficient usage of the scarce bandwidth interconnecting the various entities and ensured compliance with customers' security policies. OpenFlow SDN-based technology is adopted to dynamically establish a mobility overlay network that overcomes existing protocols' challenges and hides the complexities of the underlying infrastructure while guaranteeing seamless extension of indoor services.

This paper is organized as follows: Sect. 2 summarizes existing mobility deployment challenges.
Section 3 briefly reviews OpenFlow SDN-based technology and describes the structure of the SDWM mobility framework. Section 4 illustrates the SDN/NFV virtual
Customer Premises Equipment (vCPE) integration into the SDWM framework. Section 5 describes the SDWM prototype and analyzes the experimental results. Section 6 provides the summary and highlights the various contributions.
2 Mobility Deployment Challenges

Historically, WLANs started as standalone Access Points (APs) for fast internet access. Later, WLCs became widely adopted for centralized management of the distributed APs deployed within enterprises and campuses, giving wider coverage and lower interference liability across neighboring APs so as to avoid a decrease in overall capacity [4]. Moreover, centralized WLC deployments ensure session continuity for Mobile Nodes (MNs) roaming across neighboring APs. Initially, this encouraged vendors to develop proprietary WLCs, until CAPWAP was proposed in IETF RFC 5415/5416 as an extensible protocol for WLC management of agnostic Wireless Termination Points (WTPs) with an IEEE 802.11 binding specification [5, 6]. Even after the evolution to SDN-based networks, the Open Networking Foundation (ONF) favors a hybrid OpenFlow/CAPWAP deployment until the unified-access gap between wireless and wired networks in enterprises and large campuses is covered [7]. Despite this gap, existing WLCs and mobility protocols struggle in deployments across enterprises' main buildings and branches or Wi-Fi providers' hotspots, as stated below. The SDWM framework adopts OpenFlow SDN-based technology to solve the following challenges while ensuring unified access:

(1) Manual enforcement of VLAN configuration on the trunks connecting the WLC to the core network switches is required. There are two modes of deployment: Per-SSID and Per-User VLAN tagging. The former needs each VLAN to be statically mapped to a single AP SSID, while in the latter either the MN's MAC address or 802.1X authentication guides the WLC to dynamically force a VLAN TAG onto the MN's packets. Both modes are sufficient for VLANs with static configuration, or for a limited purchase by an enterprise of proprietary VoIP sets with known MAC addresses. For 802.1X authentication, there is no unique indexed identifier like the International Mobile Subscriber Identity (IMSI) used in cellular networks to support instant retrieval of the MN's profile when a predefined mobile set is changed. Thus, in smart cities with high-capacity coverage, or in enterprises with COPE and BYOD policies, both modes are inefficient for satisfying extensive indoor mobility requirements.

(2) The WLC provides mobility across enterprise buildings and branches with static VLANs configured on the trunks connecting remote sites. No efficient traffic offload mechanism is adopted despite the scarce bandwidth; this congests WAN links and limits the mobility of real-time applications that require guaranteed QoS, such as video streaming.

(3) The static VLAN nature of WLCs does not scale to carrier-grade hotspot deployments. It is hard to configure trunks with 2^12 ≅ 4096 VLANs, or even to scale from 2^12 to 2^20 IDs with Virtual Extensible LAN (VXLAN), to support indoor mobility for smart cities' homes and enterprises' offices [8]. Also, it is almost impossible for recent APs
with NFV to create virtual hotspots for each office in an enterprise, or for each home in a smart city, for indoor mobility.

(4) Inter-domain mobility is not supported across WLCs, as there is no method to negotiate VLAN configurations between WLCs under different administrative domains. Thus, real deployments are subject to a high liability of duplicate VLAN TAGs when extending indoor services under roaming Service Level Agreements (SLAs) across Wi-Fi carriers, smart city operators, enterprises, etc.

The first three challenges are the main reasons behind the wide adoption of Proxy Mobile IP (PMIP) in 4G LTE to extend indoor services in Local IP Access (LIPA) mobility offerings for residential/enterprise customers. PMIP has two modes of deployment: the flat domain model and the domain chaining model. In the flat domain model, a single tunnel is created to a known home IP address or domain name; each office/home registered for residential/enterprise service must state a public IP address, or a known private IP address, at which the service provider can terminate the tunnel. This solution is almost infeasible in practical deployments. To solve this issue, the domain chaining model adopts a hierarchical deployment of the flat domain model, which dramatically increases mobility latency because several tunnels are required. Thus, the PMIP internet draft for Wi-Fi providers expired without standardization [9, 10]. These challenges force expensive solutions using small cells, DAS, and hybrid deployments of CAPWAP and PMIP for indoor service extension, regardless of the highlighted violation of the residential/enterprise security policy [9, 11]. The existing challenges in mobility deployments serve as igniting factors for reconsidering existing solutions and reinventing a new mobility paradigm.
3 SDWM Framework

3.1 Overview
SDN is perhaps one of the most transformational shifts affecting the networking industry today, as it provides a directly programmable infrastructure through proprietary or open-source automation tools, including OpenStack, Puppet, and Chef. The architecture decouples the intelligent control plane from the forwarding plane controlling the underlying network infrastructure. The controller is logically centralized and offers a complete view of the network topology. The North Bound Interface (NBI) offers an open programmable network for administrators to rapidly deploy new applications and services. The South Bound Interface (SBI) provides full manipulation of users' packets and underlying resources such as routers and switches. OpenFlow is the first standard SDN protocol in the SBI to facilitate the relay of information and packets between the control and forwarding planes; it adopts the flow concept to identify network packets based on pre-defined match rules that are statically or dynamically programmed. This dramatically enhances network flexibility, availability, agility, reliability, and response to real-time threats through automated provisioning and dynamic orchestration of the offered services [12–14].

SDWM adopts OpenFlow SDN-based technology to propose a scalable mobility framework that extends residential/enterprise indoor services across Wi-Fi hotspots deployed by the enterprise, its branches, cable networks or Wi-Fi carriers without violating
the involved entities' security policies. No hardware is required in a fully SDN-migrated network, as the framework is provided as a service through the SDN application layer. An out-of-band overlay network is used for seamless integration with standard networks using conventional switching and routing techniques. The MN's home network is located through the mobility overlay network, which recursively establishes an SDN OpenFlow virtual path based on a new identifier sent during IP address allocation for a joining MN. OpenFlow virtual paths isolate the MN's packets without tunnel headers or packet TAGs as in VLAN or MPLS. There is no need for Multiple Service Set Identifiers (Multi-SSID) in the WLAN for VLAN mapping; a single SSID is possible, since the MN's packets are isolated in OpenFlow virtual paths. For instant retrieval of MNs' subscription profiles in carrier-grade deployments, Multi-SSID can still be used, not for VLAN mapping but for faster indexing, to minimize the search complexity over millions of registered profiles. The SDWM framework can be seamlessly integrated with a standard DSL/cable home network without a remote-controlled home AP deployment, a small cell setup, or full migration to an SDN-based topology, ensuring fast deployment compliant with the existing security policy.
3.2 User Equipment Mobility Subscription Identifier
In carrier-grade deployments, the profile allocation process would suffer tremendous delay during the MN's Join if the retrieval process were based on the MN's L2 hardware address. Thus, the proposed framework adopts a new identifier called the User Equipment Mobility Subscription Identifier (UEMS_ID). This identifier is either set manually in the DHCP client configuration or carried in the MN's username forwarded to the Authentication, Authorization, and Accounting (AAA) server, which then sets it in the DHCP messages. The UEMS_ID is set in the DHCP_CLIENT_ID of DHCP_REQUEST/DHCP_DISCOVERY; this is option field #61 in DHCP messages, specifying a unique client identifier within an administrative domain [15, 16].

UEMS_ID Proposed Format
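The exact UEMS_ID layout is specified in the paper's format figure; purely as an illustration of carrying such an identifier in DHCP option 61 (client identifier), a scapy sketch follows. The UEMS_ID string, MAC address and interface name below are hypothetical placeholders, not the format defined by SDWM.

```python
# Sketch: place a (hypothetical) UEMS_ID in DHCP option 61 (client identifier)
# of a DHCP_DISCOVER, as described in Sect. 3.2. Sending requires root privileges.
from scapy.all import Ether, IP, UDP, BOOTP, DHCP, sendp

UEMS_ID = b"UEMS:example-carrier:subscriber-0001"   # placeholder, not the paper's format
mac = "02:00:00:00:00:01"                            # hypothetical MN MAC address

discover = (
    Ether(src=mac, dst="ff:ff:ff:ff:ff:ff")
    / IP(src="0.0.0.0", dst="255.255.255.255")
    / UDP(sport=68, dport=67)
    / BOOTP(chaddr=bytes.fromhex(mac.replace(":", "")), xid=0x1234)
    / DHCP(options=[("message-type", "discover"),
                    ("client_id", UEMS_ID),          # DHCP option 61
                    "end"])
)

print(discover.summary())
# To transmit from a wireless interface (name is an assumption):
# sendp(discover, iface="wlan0", verbose=False)
```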
3.3 SDWM Mobility Overlay
A highly scalable distributed structure of three OpenFlow switch tiers is adopted to realize the mobility overlay. The design is fully aligned with ONF SDN Architecture 1.0, the ONF OpenFlow 1.4 specification, and RFC 7426 [17–19]. In a fully migrated SDN network, no hardware is required, as mobility is implemented as a service in the SDN application layer. In a standard network, mobility is achieved through an out-of-band SDN overlay network attached to the core switches. Each tier can be deployed on a separate switch arranged in a hierarchical tree structure; in a small deployment, a single OpenFlow switch carries the functions of all three tiers. Each entity of the mobility overlay has a unique identifier, called MOBILITY_ID, mapped to a unique multicast IP for message exchange. This identifier serves as an index for instantly locating the next-hop entity in the virtual path toward the MN's home network.
4 Mobility Entities

4.1 Mobility Access Switch Tier (AS)
This is the leaf node in the tree structure and the first mobility overlay tier. The AS tier is implemented as a service on all SDN OpenFlow access switches of the MN's home wired network. In a non-SDN residential/enterprise network, a single OpenFlow switch is connected to the home aggregation/core switches. If multiple domains are connected to the same OpenFlow AS, each is separately identified by both a MOBILITY_ID and a VLAN TAG/PORT. The home AP/switch/DSL installs separate firewall rules to restrict the traffic forwarded to the remote AS for maximum security, without SDN migration or the use of a remote-controlled AP. In a visited wireless network, the AS is an OpenFlow WLC managing the deployed APs. The WLC AS tier intercepts DHCP messages carrying a valid UEMS_ID from roaming MNs and forwards them to the next tier for home network identification.
4.2 Mobility Detector Switch Tier (DS)
In MN’s home network, DS represents mobility home agent that is responsible for spoofing roaming MN’s presence and relaying DHCP messages to home DHCP server. MN’s profile is stored in home DS’s database at the SDN controller indexed with home AS identifier. In visited networks, DS orchestrates the services offered to roaming MNs; internet, intranet, and home services. Foreign DS needs only to identify next hop in the virtual path toward home network without pre-configured MN’s profile. Next hop is retrieved by mapping UEMS_ID in DHCP message using a best match process with the mobility routing table of foreign DS. Next hop is either a parent node in upper tier or another child node in access tier representing the home broadcast domain. If next hop is AS then DS will perform dual functions of home/foreign DS. This situation is referred to intra-overlay mobility. 4.3
4.3 Relay Switch (RS) Tier
The third tier of OpenFlow switches is the tree root. The RS connects mobility overlays managed by the same SDN controller; this is referred to as inter-overlay mobility. The MG/RS tier facilitates seamless extension of the mobility service over any LAN/WAN infrastructure without violating the involved entities' policies or revealing internal structures.
4.4 Authentication Methods
There are two methods for authenticating MNs with a mobility subscription: advance authentication and inline authentication. In the advance method, MNs are authenticated before joining the WLAN using an AAA server such as Diameter or RADIUS. The UEMS_ID must be sent in the username field to guide AAA proxy authentication in redirecting the authentication request to the home AAA server [20, 21]. After successful authentication, the AAA server sets the DHCP_CLIENT_ID value in DHCP messages to the UEMS_ID. In the inline
method, the UEMS_ID must be set manually in the DHCP client configuration. This field guides the DHCP relay process provided through the mobility overlay to ensure correct forwarding to the home DHCP server. The home network intercepts the DHCP message and redirects the MN to be authenticated through a web proxy client/domain controller, giving a unified access policy for wireless and wired networks.
4.5 DHCP Relay Process
After successful authentication, the MN leases an IP address from the home DHCP server as if it were directly attached to the home network. The mobility overlay network uses the DHCP relay process to forward DHCP messages to the home DHCP server. The DHCP relay process performs a recursive match of the UEMS_ID indices against the routing tables of the mobility overlay entities to discover the next hop in the virtual path to the home network. This recursive discovery of the virtual path preserves the security policy of the involved entities. The process does not enforce a remote-controlled home AP or require a small cell setup; any DSL/cable network connection can be used. Home network discovery happens without any need for advance awareness of the enterprise's interior VLANs or IP configuration, and overlapping VLAN TAGs and IP subnets can exist, since routing is based on the UEMS_ID rather than the subnet.
4.6 Activation Phase
A valid DHCP offer from the home network activates the MN's roaming mobility profile on all mobility entities in the virtual path between the home network and the current point of attachment. With the first non-DHCP packet entering any mobility entity from an MN with an activated profile, OpenFlow rules are installed on the mobility switch, based on that profile, for wire-speed forwarding of the MN's subsequent standard packets. AAA and DHCP messages are continuously monitored by the SDN controller through the DHCP relay process, which ensures synchronization of the MNs' status. Home networks have full control of the packets forwarded to roaming MNs in the visited network, without any remote-controlled AP/switch deployment.
4.7 IP Allocation Process
The SDWM mobility framework assigns every MN three IP addresses, each with a different function.

4.7.1 Home Address (HA)
This address is leased from the home DHCP server to access indoor services. It is the only address registered on the MN's device from joining until disconnecting, or until the DHCP lease expires.

4.7.2 SDN Address (SA)
The SDN controller assigns this non-routable IP, based on the UEMS_ID, from an internal DHCP service. This facilitates application accessibility through the SDN NBI.
4.7.3 Care of Address (CoA)
The visited network assigns the MN a CoA for internet and legacy intranet service access. This avoids core network congestion while ensuring instant breakout of internet packets.
4.8 Mobility Virtual Paths
After MN’s activation, DS maps various MN’s IP to ensure effective orchestration of offered services. MN’s packets become divided into three paths as shown in Fig. 1.
Fig. 1. VCPE integration to SDN mobility overlay.
4.8.1 DHCP Path
This path is dedicated to the DHCP relay process. Both DHCP and authentication messages are relayed between roaming MNs in the visited network and their home DHCP or AAA servers. No OpenFlow rules are installed, so that the SDN controller remains aware of the MNs' Join/Disconnect status.

4.8.2 Home Path
The HA subnet identifies this path, used for indoor service packets. OpenFlow rules are installed on all mobility entities in the virtual path between the visited and home networks for wire-speed forwarding.

4.8.3 Internet Path
The HA is mapped to the CoA leased from the visited network. This provides instant breakout of selected internet packets and legacy services to avoid congesting the core network.
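As a hedged illustration of the three-way split above, the following sketch shows how a DS-like component could classify an activated MN's packets into the DHCP, home and internet paths. The addresses, profile fields and simplified packet view are assumptions, not the controller implementation.

```python
# Sketch of the Sect. 4.8 packet classification as it could be applied by the DS
# for an activated MN. Addresses and profile fields are hypothetical.
from ipaddress import ip_address, ip_network

mn_profile = {
    "HA": ip_address("10.20.12.7"),              # leased from the home DHCP server
    "home_subnet": ip_network("10.20.12.0/24"),  # destinations reached via the home path
    "CoA": ip_address("172.16.4.50"),            # leased from the visited network
}

def classify(pkt):
    """pkt: dict with optional 'udp_dport' and a 'dst' field (simplified packet view)."""
    if pkt.get("udp_dport") in (67, 68):
        return "DHCP path"                       # relayed and monitored, no OpenFlow rule
    if ip_address(pkt["dst"]) in mn_profile["home_subnet"]:
        return "home path"                       # wire-speed OpenFlow virtual path
    return "internet path"                       # HA mapped to CoA, local breakout

print(classify({"udp_dport": 67, "dst": "10.20.12.1"}))   # DHCP path
print(classify({"dst": "10.20.12.99"}))                    # home path
print(classify({"dst": "151.101.1.69"}))                   # internet path
```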
5 Virtual Customer Premises Equipment (vCPE)

The next-generation network is regarded as a transformational change in automated service delivery that shifts existing networks from "service provisioning through controlled ownership of infrastructures" to "a unified control framework through virtualization and dynamic programmability of multi-tenant networks and services" [22]. The prospects of such notable variations are a vast increase in the network's revenue-generating capabilities
and efficient control of operations. SDN facilitates this realization through logical centralization of the control functions, while NFV guides enterprise virtualization. vCPE is the logical extension of such a competitive service delivery model based on SDN/NFV. Wide deployment of vCPE is expected to create a remarkable simplification of service providers' processes for the different adopted business models, for example by eliminating unnecessary equipment at customers' premises. Figure 1 highlights the direct compatibility of the proposed SDWM framework with the vCPE layout proposed by the European Telecommunications Standards Institute (ETSI) [23, 24]. This example illustrates how a roaming MN at the hotspot of an SDN/NFV Wi-Fi service provider can concurrently access the internet and its home services. The access aggregation cloud of the vCPE is the meeting point for integrating SDWM. The mobility AS, acting as the WLC for the street Wi-Fi hotspots, detects a valid roaming UEMS_ID in the DHCP message of a joining MN. The message is automatically directed to the foreign DS, acting as the services orchestrator. The foreign DS relays the DHCP message to the home vCPE, where the MN is authenticated by the home web portal. When a valid DHCP offer from the home vCPE enters any mobility entity, the MN's mobility profile is activated. The foreign DS, acting as the service orchestrator, triggers the home vCPE's registered services and activates on-demand services at the visited Wi-Fi hotspot. The first non-DHCP packet entering any mobility entity installs OpenFlow rules based on the MN's subscription profile to ensure wire-speed forwarding of future packets.
6 Experiments

6.1 Overview
A prototype was created for inter-overlay mobility to extend indoor services. CARRIER_CA has two aggregation L3 switches, OH_SWL3 and OF_SWL3, for the two buildings, as shown in Fig. 2. The network is not SDN migrated but uses conventional switching and routing methods. Each aggregation has at least two VLANs and a single DHCP server.
Fig. 2. Prototype layout.
The two aggregation switches are connected through router links of 100 Mbps bandwidth, without extending VLANs across the two buildings; moreover, overlapping VLAN TAGs exist across the aggregation switches. Each building has its own SDWN AS WLC. A single SDN controller manages both mobility overlays, CA_OH and CA_OF, connected to the two aggregation points, and the connection speed between mobility entities is 100 Mbps. Two hosts, OH_MN1 and OH_MN2, moved from their home VLAN12, connected to switch OH_SWV2, to the AS WLC with ID:OF4, monitored by overlay EC_OF in the public hotspot. Roaming OH_MN1 at aggregation OF_SWL3 initiates sessions with OH_SERVER2, located in the home network VLAN12: OH_SWV2. OH_MN1 suffers an initial idle activation delay for virtual path activation, EC_OF → EC_OH; once the OpenFlow rules are installed, subsequent packets are forwarded at wire speed. The experiments analyze the latency of inter-overlay mobility versus that of the routed network connecting the two buildings. This is achieved by comparing the performance of OH_MN1 and OF_HOST1, located in the AS WLC with ID:OF4 and in VLAN11: OF_SWV1 respectively, both connected to the same aggregation switch OF_SWL3. OH_MN1 is an active mobility subscriber using the mobility overlay, while OF_HOST1 is a standard host on switch OF_SWV1 using the standard network. The SDWN OpenFlow mobility WLCs AS ID:OH4/ID:OF4 manage the deployed APs, distinguish roaming MNs from standard users based on the UEMS_ID in their DHCP messages, and then redirect their packets through the mobility overlay to their home VLANs.
6.2 Implementation
Experiments are performed on an Ubuntu Linux 64-bit machine running a single SDN controller, with Mininet used to create the mobility OpenFlow switches, WLCs, and APs. Each building has a mobility overlay connected to a standard Cisco 3640 L3 switch with at least two VLANs, standing in for smart homes and a Network Operation Center (NOC). A single DHCP server serves the building VLANs and APs, and a Cisco 3725 router connects the 3640 L3 switches of the two buildings.
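The exact Mininet scripts of the prototype are not reproduced in the paper. As a minimal sketch of how one building's overlay could be emulated (a single OpenFlow switch standing in for the AS/DS/RS tiers, two hosts, and a remote SDN controller), the following Mininet script is offered; host names, IP addresses and the controller location are illustrative assumptions.

```python
# Minimal Mininet sketch of one building's mobility overlay: a single OpenFlow
# switch standing in for the AS/DS/RS tiers, two hosts, and a remote SDN controller.
# Names, IPs and the controller address are illustrative, not the exact prototype.
from mininet.net import Mininet
from mininet.node import RemoteController, OVSSwitch
from mininet.cli import CLI

def build():
    net = Mininet(controller=RemoteController, switch=OVSSwitch)
    net.addController("c0", ip="127.0.0.1", port=6633)   # SDN controller location assumed
    s1 = net.addSwitch("s1")                              # overlay switch (AS/DS/RS roles)
    mn1 = net.addHost("oh_mn1", ip="10.0.12.11/24")       # roaming mobile node
    srv = net.addHost("oh_server2", ip="10.0.12.2/24")    # home VLAN12 server
    net.addLink(mn1, s1)
    net.addLink(srv, s1)
    net.start()
    CLI(net)                                              # e.g. `oh_mn1 ping oh_server2`
    net.stop()

if __name__ == "__main__":
    build()
```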
6.3 Results
6.3.1 Idle Activation Time Estimation
The idle activation time runs from the detection, at the CARRIER_CA WLC AS ID:OF4, of a DHCP_REQUEST message from the roaming OH_MN1 until a DHCP_ACKNOWLEDGMENT from the OH_DHCP server at VLAN2 is received at the same WLC. The average value over several runs is approximately 0.32 s. The established virtual path to the home network, EC_OF → EC_OH, is: Overlay CA_OF [WLC AS ↔ DS ↔ RS], Overlay CA_OH [RS ↔ DS ↔ AS VLAN2].

6.3.2 Standard Network Versus Mobility Overlay Latency
Figures 3 and 4 compare the latency of 100 pings sent from OH_MN1 to OH_SERVER2 using the mobility overlay against those sent by OF_HOST1 using the routed network. The DHCP relay feature presented by the SDWM mobility framework separates
between MN’s standard packets and DHCP messages that require to be monitored by the SDN Controller. Figure 3 shows the latency before the separation of MN’s standard packets and DHCP messages.
Fig. 3. Latency comparison - No DHCP relay feature.
Fig. 4. Latency comparison with DHCP relay feature.
In Fig. 3, the routed network introduces an average latency of ~17 ms with a deviation of 3–4 ms, while the mobility overlay network, without OpenFlow rules installed on the mobility switch, shows an average latency of ~47 ms with a deviation of ~15 ms. The large standard deviation in the mobility overlay network is reasonable, as it reflects the utilization of the SDN controller rather than any fixed problem facing the mobility
overlay network. With the relay feature, the OpenFlow rules improve the latency of the mobility overlay network, as shown in Fig. 4. The mobility overlay's performance reaches wire speed, with almost zero average latency and a deviation of ~0.1 ms, compared to ~17 ms average latency with a 3–4 ms deviation in the routed network.

6.3.3 Standard Network Versus Mobility Overlay Performance
Figure 5 reveals that without the installation of OpenFlow rules, the standard routed network and inter-overlay mobility have almost the same UDP performance, but the mobility overlay suffers from higher jitter. For TCP, the inter-overlay mobility throughput is slightly better.
Fig. 5. Standard network versus Mobility overlay - No OpenFlow.
In Fig. 6, the installation of OpenFlow rules dramatically improves the performance of inter-overlay mobility over the routed network. For TCP, the inter-overlay throughput exceeds double that of the routed network, which is almost wire speed. UDP performance, on the other hand, is similar, but the inter-overlay jitter is improved until it is almost negligible.
Fig. 6. Standard network versus Mobility overlay – OpenFlow.
7 Conclusion

Creating a carrier-grade mobility platform that activates smart indoor real-time services through a smart Wi-Fi carrier is no longer a dream with SDN in existing and next-generation networks. SDWM adopts a new routing mechanism, based on an identifier rather than a subnet, to establish SDN OpenFlow virtual paths. Such a structure avoids VLAN and VXLAN TAGs and tunnel headers while simplifying the adoption of SDN QoS for an enhanced user experience. Overlapping IP address and VLAN configurations are allowed across carriers and home networks without violating the involved entities' security policies. The established prototype proves the feasibility of the proposed framework, instantly connecting roaming users to indoor services within ~0.38 s. The TCP/UDP performance, as well as the latency/jitter encountered inside the dynamically established SDN mobility overlay, is better than in standard routed networks, not just NFV networks with VXLAN overheads.
References

1. IDG Enterprise: Tech Insights: Building the Mobile Enterprise Survey 2015. IDG Enterprise, Boston (2015)
2. Gabriel, C.: Small cells inside the enterprise – the "Who, What & Where". Maravedis-Rethink (2013)
3. Linegar, D.: Tomorrow Starts here Service Provider Wi-Fi and Small Cell. Cisco Systems Inc, White Paper (2014)
4. Noblet, S.B.: The Whys and Hows of Deploying Large-Scale Campus-wide Wi-Fi Networks. White Paper, Aruba Network (2012)
5. Calhoun, P., Montemurro, M., Stanley, D.: Control And Provisioning of Wireless Access Points (CAPWAP) Protocol Specification, IETF Proposed Standard, RFC 5415 (2009)
6. Calhoun, P., Montemurro, M., Stanley, D.: Control and Provisioning of Wireless Access Points (CAPWAP) Protocol Binding for IEEE 802.11, IETF Proposed Standard, RFC 5416 (2009)
7. Congdon, P., Perkins, C.: Wireless & Mobile – OpenFlow. Wireless & Mobile Working Group (WMWG) Charter Application, Open Networking Foundation (2015)
8. Mahalingam, M., et al.: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks, RFC 7348 (2014)
9. Gundavelli, S., Leung, K., Devarapalli, V., Chowdhury, K., Patil, B.: Proxy Mobile IPv6, IETF, RFC 5213 (2008)
10. Gundavelli, S., Pularikkal, B., Koodli, R.: Applicability of Proxy Mobile IPv6 for Service Provider Wi-Fi Deployments, Internet-Draft, expired April 2014 (2013)
11. Wakikawa, R., Gundavelli, S.: IPv4 Support for Proxy Mobile IPv6, IETF, RFC 5844, ISSN: 2070-1721 (2010)
12. Kolias, C., Ahlawat, S., Ashton, C., Cohn, M., Manning, S., Nathan, S.: OpenFlow™-Enabled Mobile and Wireless Networks, ONF Solution Brief, Open Networking Foundation (2013)
13. ONF: Software-Defined Networking: The New Norm for Networks, White Paper. Open Networking Foundation (2012)
14. Zhang, C., Addepalli, S., Murthy, N., Fourie, L., Zarny, M., Dunbar, L.: L4-L7 Service Function Chaining Solution Architecture, Open Networking Foundation, ONF TS-027 (2015)
15. Alexander, S., Droms, R.: Dynamic Host Configuration Protocol, IETF Standard, RFC 2131 (1997)
16. Alexander, S., Droms, R.: DHCP Options and BOOTP Vendor Extensions, IETF Standard, RFC 2132 (1997)
17. Betts, M., Fratini, S., Davis, N., Hoods, D., Dolin, R., Joshi, M., Dacheng, Z.: SDN Architecture, Issue 1, Open Networking Foundation, ONF TR-502 (2014)
18. Denazis, S., Koufopavlou, O., Haleplidis, E., Pentikousis, K., Hadi Salim, J., Meyer, D.: Software-Defined Networking (SDN): Layers and Architecture Terminology, RFC 7426 (2015)
19. Nygren, A., Pfaff, B., Lantz, B., Heller, B., Barker, C., Beckmann, C., …, Kis, Z.L.: The OpenFlow Switch Specification, Version 1.4.0, Open Networking Foundation, ONF TS-012 (2013)
20. Aboba, B., Vollbrecht, J.: Proxy Chaining and Policy Implementation in Roaming, RFC 2607 (1999)
21. Zorn, G., Network Zen: Diameter Network Access Server Application, IETF RFC 7155 (2014)
22. 5G-PPP: 5G Vision - The 5G Infrastructure Public Private Partnership: The Next Generation of Communication Networks and Services, 5G-Vision-Brochure-v1, European Commission (2015)
23. ETSI: Network Functions Virtualisation (NFV); Use Case #2 Virtual Network Function as a Service (VNFaaS), ETSI GS NFV 001, V1.1.1, pp. 15–20 (2013)
24. Odini, M.P., et al.: Network Functions Virtualization (NFV); Ecosystem; Report on SDN Usage in NFV Architectural Framework, ETSI GS NFV-EVE 005 V1.1.1, pp. 95–98 (2015)
Smart Eco-Friendly Traffic Light for Mauritius

Avinash Mungur(&), Abdel Sa'd Bin Anwar Bheekarree, and Muhammad Bilaal Abdel Hassan

Department of Information and Communication Technologies, University of Mauritius, Reduit, Mauritius
[email protected], {abdel.bheekarree,muhammad.abdel}@umail.uom.ac.mu
Abstract. Nowadays, going out anywhere, especially in urban areas, is becoming more and more of a headache. Leaving for work or returning from it during peak hours and being stuck in traffic for a long time is simply frustrating. Having traffic lights on the road during peak hours sometimes does not really help: traditional traffic lights do not account for the difference in vehicle density across lanes. Police officers are therefore tasked with controlling traffic at junctions which, if not controlled, can be chaotic. Furthermore, while controlling traffic, the police officers are exposed to harmful gas emissions, which can be disastrous for their health; commuters stuck in the traffic are exposed to these emissions as well. The problem, in consequence, is that traditional traffic lights neither react dynamically to changes in traffic density over time nor take into account the amount of pollutants to which drivers are exposed. To address these two problems, this paper proposes a smart traffic light that takes into account both the vehicle density in a lane and the level of vehicle emissions within each lane: if the level of vehicle emissions within a lane becomes harmful to health, the traffic light goes green; otherwise it remains red until a threshold number of vehicles has been reached in that lane. The smart traffic light, built on the Internet of Things, uses magnetic and gas sensors to detect the number of vehicles and the gas emission levels on each lane, respectively. Each lane has a set of these sensors connected to an Arduino, which in turn are all connected to a central Raspberry Pi. The Raspberry Pi, being connected to the Internet, does all the processing via Node-RED, a graphical interface for node.js. All data captured by the sensors are sent to the IBM Bluemix Cloud for analysis. With this system, a more fluid and dynamic traffic flow that takes vehicle emissions into account at the traffic light is envisaged.

Keywords: Smart traffic light · Internet of Things · IBM Bluemix · Gas emission · Congestion · Arduino · Raspberry Pi
1 Introduction

Traffic congestion is gradually becoming a real issue nowadays. Being stuck in traffic for hours after a long and hard day at work does not make things any better. In Mauritius, a small island with a population of around 1.3 million
people, the problem of traffic congestion is a very big issue. In the morning, people need to leave their house one or two hours earlier for a trip that would normally have lasted 20–30 min, and the situation repeats itself in the afternoon at around 4 pm when workers return home. Some measures have been taken to try to combat this problem. For example, policemen are dispatched during peak hours to monitor the traffic. From 2010, a series of road widening and junction improvement projects have been implemented between Pond Fer and Place D'Armes [1], and the introduction of the Verdun Motorway in 2015 helped ease traffic flow considerably. The Government is still looking at and investing in several other ways to lower traffic congestion in general [3].

However, when it comes to traffic lights, we clearly notice heavy congestion during the peak morning hours, between 7 am and 10 am, as well as in the afternoon from 3 pm to 6 pm. Every morning and afternoon, policemen are needed at almost every critical junction around the island to help vehicles move in a disciplined way; what, then, is the use of our current traffic lights if we always need policemen to ease traffic flow? Our standard traffic light system is in fact one of the sources of the congestion, because it operates only on timing. The problem with a timing-based traffic light system is that it is not dynamic: the lights can be green for a lane with no vehicles and red for a lane with vehicles. Hence the standard traffic light is not "smart" enough to distinguish lanes with vehicles from lanes without. Besides the congestion, there is a lot of harmful gas emission coming from the vehicles. According to the United Nations, the air quality in Mauritius is described as being one of the best in the whole world, but when it comes to what pollutes our air, transportation is near the top of the list [2]. As the number of vehicles on our roads increases, the amount of toxic gases released increases too. People stuck in traffic are forced to breathe that polluted air, which can be fatal to their health; from small babies to the elderly, a wide variety of people suffer greatly from those toxic gases, and no proper solution has been proposed to date.

As a result, this paper proposes a smart traffic light that takes into account both the number of vehicles and the level of gas emissions within a lane. If the gas emission within a lane reaches a threshold that is harmful to health, the traffic light goes green for that lane; otherwise, the traffic light goes green when the number of vehicles present in the lane reaches a threshold. To enable the traffic light to make informed decisions, gas and magnetic sensors are used. Each lane has a set of these sensors connected to an Arduino, which in turn are all connected to a central Raspberry Pi. The Raspberry Pi, being connected to the Internet, does all the processing via Node-RED [7], a graphical interface for node.js. All data captured by the sensors are sent to the IBM Bluemix Cloud for analysis.

This paper is organized as follows. Section 2 presents the related work. Section 3 describes the proposed architecture, and Sect. 4 details the algorithm and its implementation through the use of Node-RED. Section 5 illustrates the physical setup of the proposed architecture.
Section 6 describes how the threshold value for gas emission, for the smart traffic light, has been obtained and determined. Section 7 discusses some additional features which the proposed system offers and Sect. 8 concludes this paper.
2 Related Work

To the best of our knowledge, a smart traffic light system such as the one proposed in this paper has not been implemented. However, intensive research is being conducted on reducing vehicle emissions at traffic lights. In [4], the authors propose a traffic light system that improves traffic fluidity at intersections by using wireless communication with the vehicles. However, this solution targets the reduction of air pollution and does not aim to solve congestion. In [5], the authors propose a method to model fuel consumption and emissions in order to evaluate traffic management applications. The authors do not try to tackle the congestion and decongestion issues and the related health issues addressed in this paper. In [6], the authors propose a traffic light system that reduces vehicle emissions by controlling/modelling the acceleration of the vehicles. However, in [6], the authors do not provide a solution to alleviate congestion or make the traffic fluid at peak hours while taking into account the level of vehicle emission.
3 Proposed Architecture

This section provides an overview of the proposed architecture. Figure 1 shows the different components of the smart traffic light.
Fig. 1. Proposed architecture.
The proposed architecture consists of the following components:
• The Raspberry Pi: The Raspberry Pi is the core processing unit. It hosts the Node-RED tool [7], which contains the main logic and algorithms. The Raspberry Pi is also connected to an Arduino on each lane and uses the data received to drive the algorithms.
• The Arduino: The Arduino is a microcontroller board located on each lane. Each Arduino is connected to several sensors. The main task of the Arduino is to send live data from the sensors to the Raspberry Pi (a minimal sketch of this data path is given after this component list).
• The LED display: The LED display shows all the data collected.
• The Gas particle sensors: These sensors (MQ-135 pollutant sensors [8]) detect the pollutant levels on each lane, and the data are sent to the Arduino. The pollutant sensors are highly sensitive to nitrogen dioxide, sulphide, benzene and other harmful gases.
• The Hall Effect sensor: This sensor is used to simulate the detection of the vehicles on each lane and gives the count of vehicles. We "simulate" the detection because, in theory, each car would have to have magnetic properties. Due to budget limitations, the Hall Effect sensor was chosen, as other means were too expensive or too time consuming. Nevertheless, this sensor can be interchanged with another sensor for detecting the vehicles in a lane. The architecture proposed in this paper serves as a proof of concept.
• The Bluemix cloud: The Bluemix cloud [9] offers many services, such as databases and APIs, that are used as the backbone of the architecture. The cloud has many services designed for IoT projects, such as Node-RED and IBM Watson Analytics, which connect the Raspberry Pi to the cloud. The Cloudant NoSQL database also has APIs which work with IoT. The IBM Bluemix cloud is fed with real-time values generated by the sensors as well as the status of the traffic light colours. These values are projected on a map in real time, which enables users (such as the police) to monitor the situation. Furthermore, the use of IBM Bluemix enables the police to control the traffic lights remotely: from the website, commands can be sent to the Raspberry Pi to change the colour of the traffic light. This feature proves to be important in case of emergency.
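To make the per-lane data path concrete, the sketch below shows one way an Arduino could sample the MQ-135 output and the Hall Effect count and forward them to the Raspberry Pi over serial. The pin assignments, the one-second reporting interval and the CSV message format are our assumptions for illustration and are not taken from the deployed system.

// Minimal per-lane Arduino sketch (illustrative): reads the MQ-135 gas sensor
// on an analog pin, counts Hall Effect pulses as a proxy for vehicles, and
// sends both readings to the Raspberry Pi over the serial link.
// Pin numbers, the 1 s interval and the CSV format are assumptions.

const int GAS_PIN  = A0;   // MQ-135 analog output (assumed wiring)
const int HALL_PIN = 2;    // Hall Effect sensor on an interrupt-capable pin

volatile unsigned long vehicleCount = 0;

void onVehicleDetected() {          // ISR: one pulse ~ one simulated vehicle
  vehicleCount++;
}

void setup() {
  Serial.begin(9600);               // read by Node-RED on the Raspberry Pi
  pinMode(HALL_PIN, INPUT_PULLUP);
  attachInterrupt(digitalPinToInterrupt(HALL_PIN), onVehicleDetected, FALLING);
}

void loop() {
  int gasRaw = analogRead(GAS_PIN);          // 0..1023 raw pollutant level
  noInterrupts();
  unsigned long vehicles = vehicleCount;     // snapshot the counter safely
  interrupts();

  // One CSV line per second: <gasRaw>,<vehicleCount>
  Serial.print(gasRaw);
  Serial.print(',');
  Serial.println(vehicles);
  delay(1000);
}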
4 Node-RED

Node-RED is a node.js based tool for wiring together hardware devices such as a laptop or Raspberry Pi, APIs and online services. It is a browser-based editor that uses nodes as blocks of functionality and wires the nodes together to create simple or complex behaviour. JavaScript can be written to add more complexity to the programs through the dedicated JavaScript node. Node-RED is used to implement the algorithm/pseudocode (Fig. 2) on the Raspberry Pi. In terms of scalability, Node-RED makes it easier to add more features in the future. The pseudocode shows that as soon as the gas emission level is above a threshold and the light is red for a lane, the light is changed to green. The light remains green for a certain period of time and then changes back to red. However, if the gas emission threshold is not reached, the traffic light changes colour based on the number of vehicles in a particular lane: if the number of vehicles in a particular lane reaches a certain threshold, the traffic light changes accordingly. Figure 3 shows a snapshot of the algorithm/pseudocode as implemented in Node-RED. From Fig. 3, a set of nodes can be seen which comprises two parallel branches, each performing the same function for a different lane. In the first branch, the getReadingLane1 node is responsible for capturing the serial output of the Arduino; to capture the data from the port, the Arduino serial connection is read on /dev/ttyACM0. LaneOneOutput is a function node which captures the serial data and formats it so that it can be used in other nodes.
Fig. 2. Pseudocode snapshot.
Fig. 3. Pseudocode in Node-RED.
The outputs from the laneOneOutput and LaneTwoOutput functions enter a Check Gas Value function node. This function verifies whether the gas emission data are higher than the set threshold and takes the appropriate action by changing the traffic light colours. Furthermore, this function node also ensures that the gas emission reading is within the sensor's range of valid values, which helps in detecting whether the sensor is functioning properly or not.
If the values being read are too high, i.e. at an extremely higher level than expected, the sensor is deemed to be malfunctioning and the system is intelligent enough to switch to the magnetic sensor count (in this case it triggers the magnetic branch of Node-RED). Figure 4 shows a snapshot of the magnetic branch.
Fig. 4. Magnetic Node-RED snapshot.
One thing to consider is that the two sensor data streams are asynchronous and may not enter the function node at the same time to be compared. To solve this problem, node context is used: the function stores the data it captures within its context object. That way, even if the second branch has not sent its data yet, the data already received are stored, waiting to be compared as soon as the remaining data arrive. The function then checks whether the gas level is above the threshold; if so, it proceeds to either the second or the third output, depending on which lane has the highest gas level value. Otherwise, the message is routed to the first output, where the gas level is low and the magnetic sensors are used to control the traffic. Figure 3 also shows the Traffic Analytics branch. This branch is configured so that the Raspberry Pi sends the data to IBM Watson IoT, which in turn transmits the data to IBM Bluemix IoT for visual analytics and storage on the Cloudant cloud. To visualize the data correctly, a map is provided which shows, in real time, the traffic lights changing based on the pseudocode computation.
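For concreteness, the per-cycle lane-selection logic described above (and in Fig. 2) can be summarised as in the sketch below. It is written in C++ only for readability; in the deployed system the equivalent logic lives inside the Node-RED function nodes, and the threshold and sensor-range values shown here are placeholders rather than the values used in the field.

// Illustrative sketch of the Check Gas Value / magnetic fallback logic.
// Thresholds and the valid sensor range are placeholders, configured per junction.

struct LaneReading {
  double gasPpm;        // latest pollutant reading for the lane
  long   vehicleCount;  // latest magnetic (Hall Effect) count
};

const double GAS_THRESHOLD_PPM  = 45.0;    // placeholder, cf. Sect. 6
const double GAS_SENSOR_MAX_PPM = 1000.0;  // readings above this imply a faulty sensor

// Returns the index (0 or 1) of the lane that should receive the green light.
int selectGreenLane(const LaneReading& lane1, const LaneReading& lane2) {
  bool gasValid = lane1.gasPpm <= GAS_SENSOR_MAX_PPM &&
                  lane2.gasPpm <= GAS_SENSOR_MAX_PPM;

  if (gasValid &&
      (lane1.gasPpm > GAS_THRESHOLD_PPM || lane2.gasPpm > GAS_THRESHOLD_PPM)) {
    // Emission-driven branch: favour the lane with the higher gas level.
    return (lane1.gasPpm >= lane2.gasPpm) ? 0 : 1;
  }
  // Otherwise (or if a gas sensor reads out of range), fall back to the
  // magnetic branch and favour the lane with more waiting vehicles.
  return (lane1.vehicleCount >= lane2.vehicleCount) ? 0 : 1;
}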
5 Component Setup

This section illustrates the setup of the different components described in Sect. 3. Figure 5 shows the Raspberry Pi setup; the LEDs simulate the traffic light, and it is the Raspberry Pi which controls the traffic light. Figure 6 illustrates how the different sensors are connected to the Arduino and Fig. 7 shows the real system.
Fig. 5. Raspberry Pi set up.
Fig. 6. Arduino set up.
Fig. 7. Raspberry Pi and Arduino set up.
6 Determining the Gas Emission Threshold

After setting up the proposed architecture, the gas emission threshold had to be determined in order to be fed to the algorithm. To obtain the emission threshold, the emissions produced on the road at the traffic lights had to be collected. This value determines whether the emission produced at the traffic light during congestion is hazardous to health and, consequently, whether the traffic light for the lane needs to turn green. To collect the emission data, a junction was identified. The capital city, Port-Louis, was chosen because traffic density there is usually high and congestion is at its worst during workdays. The junction next to Les Casernes was selected (Fig. 8) and the apparatus was set up to capture emission data as shown in Fig. 9. Data were collected from 7 am till 6 pm, every minute, on a workday and are shown in the chart (Fig. 10). Note that one assumption made is that the wind is negligible.
Fig. 8. Chosen location.
Fig. 9. Equipment to collect pollutant data.
Fig. 10. Pollutant collection graph.
From the readings collected in Fig. 10, we find that during the morning, from 7 am to 10 am, the peak emission is slightly above 70 ppm, and from 2 pm till 5 pm the peak emission is around 69 ppm. From the collected emission data, and considering the exposure limits suggested in [11–13] for the three main pollutants (hydrogen sulphide: 20 ppm, nitrogen dioxide: 10–20 ppm and sulphur dioxide: 5 ppm), the ideal threshold at this particular junction would be between 40 ppm and 50 ppm. Note that the threshold is a configurable value which needs to be changed according to the environment/junction (a similar emission data collection should be carried out in order to determine the ideal threshold value for each junction). Nevertheless, the proposed architecture allows this value to be amended at any time, which shows the flexibility of the proposed smart traffic light.
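As a side note, the windowed peaks quoted above can be extracted from the minute-by-minute log with a few lines of code. The sketch below is our own illustration; the (minute-of-day, ppm) sample layout is an assumption and not the format used in the paper.

// Sketch of the windowed-peak computation behind the Fig. 10 analysis: given
// minute-by-minute pollutant samples, report the peak within a time window.
#include <utility>
#include <vector>

double peakPpm(const std::vector<std::pair<int, double>>& samples, // (minute, ppm)
               int startMinute, int endMinute) {
  double peak = 0.0;
  for (const auto& s : samples)
    if (s.first >= startMinute && s.first < endMinute && s.second > peak)
      peak = s.second;
  return peak;
}

// e.g. peakPpm(day, 7 * 60, 10 * 60) for the 7 am-10 am window.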
7 Website Dashboard

This section provides an overview of the additional features offered by the proposed architecture. A website has been created in which a Google Map allows the administrator (or other users) to view all the traffic lights and their status. There is also a Twitter section embedded in the website where the administrator can get an idea of what commuters are tweeting about the traffic lights and congestion in general. Furthermore, the website dashboard provides emergency features, sensor status and Twitter features, as illustrated in Subsects. 7.1, 7.2 and 7.3.

7.1 Emergency Features
In case of emergency, the proposed system enables the user/administrator to remotely control the traffic lights.
The proposed system also suggests which route to take and subsequently switches all nearby traffic lights to red to enable the emergency vehicle to pass through. Below is a simulation of how it works: (a) the route from the hospital (A) to the incident (B) is displayed (Fig. 11); (b) a moving marker represents the moving ambulance (Fig. 12); (c) when the ambulance is near a traffic light that is green, the light switches to red (Fig. 13).
Fig. 11. Route from hospital to incident.
Fig. 12. Traffic light turns red in the map.
Fig. 13. Actual change in led colour from green to red.
7.2 Gas Sensor Status
If ever a gas sensor fails, the system is immediately notified through the website dashboard. When the gas sensor icon is grey, the sensor is working; otherwise the icon turns red, as depicted in Figs. 14 and 15, respectively.
Fig. 14. Gas sensor working.
Fig. 15. Gas sensor faulty.
7.3 Twitter
Node-RED provides a functionality whereby, whenever an event happens, say a certain junction becomes very dense with vehicles, the system can tweet a warning so that drivers who use this road can find an alternative route. Furthermore, the proposed system enables commuters to send Twitter feeds to the website. Since the system uses Watson Analytics [10], the Twitter feeds received on the website can be analyzed, for example to find places where a traffic light has stopped working and to inform the administrator accordingly.
8 Conclusion

This paper details the design, architecture and process used to create a smart traffic light system with an eco-friendly aspect, which takes gas emission into account in order to regulate the flow of traffic. Using IoT-compatible platforms such as IBM Bluemix, devices such as gas sensors, Arduinos and a Raspberry Pi, and tools such as Node-RED, a smart traffic light system has been designed and implemented. With this paper we hope to contribute to solving the big problem that is traffic congestion and also to contribute to better health of the population by decreasing their exposure to pollutants. The proposed system is cost-effective, as it uses affordable equipment, and can be implemented in any country. A big potential of this proposed system is the integration of artificial intelligence (through Watson Analytics), which will learn traffic patterns, decide the gas emission threshold for different roads at different times, and communicate the data among other sensor nodes in order to create a network of connected traffic lights that will ease and smooth the traffic flow.
References 1. Caderassen, D., Chansraj, P.: Alleviating traffic congestion along the m1 corridor: an economic perspective. J. Inst. Eng. Maurit. (2013). http://rda.govmu.org/English/Pages/ HomeDocuments.aspx 2. United Nations: Mauritius Air Quality Overview (2015). http://www.unep.org/transport/ airquality/Mauritius.pdf 3. Nando, B.: The Road Traffic (Amendment) Bill (2015). http://mauritiusassembly.govmu.org/ English/bills/Documents/intro/2015/bill0615.pdf 4. Gradinescu, V., Gorgorin, C., Diaconescu, R., Cristea, V., Iftode, L.: Adaptive traffic lights using car-to-car communication. In: Proceedings of the IEEE 65th Vehicular Technology Conference (VTC2007-Spring), pp. 21–25. IEEE Computer Society Press (2007) 5. Akcelik, R., Besley, M.: Operating cost, fuel consumption, and emission models in aaSIDRA and aaMOTION. In: Proceedings of 25th Conference of Australian Institutes of Transport Research, University of South Australia, Adelaide, Australia (2003) 6. Dobre, C., Szekeres, A., Pop, F., Cristea, V., Xhafa, F.: Intelligent traffic lights to reduce vehicle emissions. In: Troitzsch, K.G., Möhring, M., Lotzmann, U. (eds.) Proceedings 26th European Conference on Modelling and Simulation ©ECMS (2012) 7. Node-Red, Running on Raspberry Pi. https://nodered.org/docs/hardware/RaspberryPi 8. MQ 135 Gas Sensor. http://www.china-total.com/Product/meter/gas-sensor/MQ135.pdf 9. IBM Bluemix. https://www.ibm.com/cloud-computing/bluemix/what-is-bluemix 10. IBM Watson Analytics. https://www.ibm.com/analytics/watson-analytics/us-en/ 11. Occupational Safety and Health Administration, Safety and Health ToPics: Hazards. https:// www.osha.gov/SLTC/hydrogensulfide/hazards.html 12. Ministry of Environment and NDU: Report of Technical Committee on Review of Environment Protection (Standards for Air) Regulations 1998 (2005) 13. United Nations Environment Programme: Report of the Sulphur Working Group of the Partnership for Clean Fuels and Vehicles (2006)
Logistics Exceptions Monitoring for Anti-counterfeiting in RFID-Enabled Supply Chains

Xiaoming Yao1, Xiaoyi Zhou1, and Jixin Ma2

1 College of Information Science and Technology, Hainan University, Haikou, China
[email protected], [email protected]
2 Department of Computing and Information Systems, University of Greenwich, London, UK
[email protected]
Abstract. In recent years, the radio frequency identification (RFID) technology has been used as a promising tool for anti-counterfeiting in RFID-enabled supply chains due to its track-and-trace abilities of contactless object identification with the unique electronic product code (EPC). While this system does improve performance in many ways, uncertainties of daily operations might bring about one or more logistics exceptions, which can further trigger other dependent exceptions. These exceptions could be well organized and exploited by adversaries to fool the system for counterfeiting, yet related reports are unfortunately very few. In this paper, we present our results focusing on the inter-dependencies between those logistics exceptions and the detecting intelligence in a resource-centric view. A cause-effect relational diagram is first developed to explicitly express the relations among those logistics exceptions by incorporating the taxonomy of exceptions and resource-based theory, and then an improved intelligent exception monitoring system is designed to achieve the goal of autonomous, flexible, collaborative and reliable logistics services. Finally, a case study of two typical logistics exceptions indicates that our proposed cause-effect diagram outperforms the extant approaches in the understanding of group logistics exceptions, which enables the designed monitoring system to perform well for anti-counterfeiting.

Keywords: RFID-enabled supply chains · Relational diagram · Anti-counterfeiting · Logistics exceptions · Exceptions monitoring
1 Introduction

In recent years, the radio frequency identification (RFID) technology has been used as a promising method for anti-counterfeiting in RFID-enabled supply chains due to its track-and-trace abilities of contactless object identification with the unique electronic product code (EPC). The e-pedigree that is dynamically updated at each node of the supply chain could build a complete view of the product history of its activities starting from manufacturing or even earlier and be examined and authenticated for data
consistency [1–3]. This sounds very effective when used for product anti-counterfeiting, since the "visibility of the e-pedigree" is able to remove the uncertainty along the supply chain and thus provide a "certain belief" about the product. However, some uncertainties due to the changing real world still exist and unexpected exceptions can inevitably occur. For example, a sudden change in the weather conditions may delay the delivery of the product and make the actual delivery date inconsistent with the scheduled one; similarly, mistakes by the related personnel may cause data errors, leading to a variety of exceptions that might occur in every possible phase of the business processes, as reported in [4, 5]. To provide a proper understanding of the exception situations, many techniques such as exception patterns [6], AND/OR trees [7], the fishbone diagram [8], and multi-perspective ontologies [9] have been proposed, but few researchers have dealt with the interdependencies of those exceptions derived from different perspectives, not to mention their respective relations with counterfeiting. On the other hand, most of the extant research on anti-counterfeiting focuses either on specific know-how for increasing the difficulty of breaking some patterns hidden in some physical place [10], or on patterns mined from the trajectory data acquired from the e-pedigree of the products [11, 12]. Only in [1] did the authors argue that some continuous exceptions might imply the possibility of counterfeiting, but no further details were given. In the case of supply chains, exceptions usually refer to deviations from the planned course of operations. For instance, failure of a commitment to deliver a product on time will cause a delivery exception; failure of a specific task (say, to send a package to the shipping port) or an unavailable resource (say, no drivers or no vehicles to perform a task) will cause similar exceptions. However, those deviations can be well organized and exploited by malicious attackers for effective counterfeiting. A track-and-trace system requires an accurate record of the product's e-pedigree at each node of the supply chain, and failure to do so will cause the failure of the e-pedigree, making supply chains the best entry place for fake products. Therefore, to provide a reasonable interpretation of those exceptions and of the relations between the uncertainties of the changing real world and those exceptions, this paper first develops a cause-effect relational diagram to express the relations among those logistics exceptions by incorporating the taxonomy of exceptions and the resource-based theory, and then an improved intelligent exception monitoring system is designed to achieve the goal of autonomous, flexible, collaborative and reliable logistics services. These results are explained and demonstrated with two typical logistics cases. The rest of the paper is organized as follows: the next section briefly reviews the relevant works on the taxonomy of logistics exceptions and the related detection intelligence using data mining and intelligent agents; Sect. 3 presents the cause-effect diagram of the logistics exceptions according to the resource-based view and the interdependent relations among those exceptions; Sect. 4 demonstrates and explains these results on two typical logistics cases in comparison with previously published approaches from our new perspective; and the paper concludes in Sect.
5 with a look ahead to future research.
2 Related Works

Most of the e-pedigree based track-and-trace systems only use the trajectory data of the product, i.e. the time series of the triplets (EPC number, time, location), to detect the relevant exceptions [11, 12]; a few also consider group events, such as the neighboring relations, as reported in [1]. To the best of our knowledge, the relations among those exceptions have rarely been investigated further. Therefore, we focus our discussion on the taxonomy of logistics exceptions and the respective detection intelligence using intelligent agents.

2.1 Taxonomy of Logistics Exceptions
Taxonomy is particularly useful in system design, since it improves the understanding of the environment from the taxonomical perspective and can be developed into a decision-making tool [13]. In [9], a multi-perspective approach to taxonomy is developed, where the taxonomy is built by adding an information categorization to the functional one. Meanwhile, a social ontology is developed to diminish uncertainties from the social dependencies, as in a logistics environment, of the senders, receivers and logistics service providers; and a dynamic ontology that deals with the changing of the business rules is also taken into consideration. Unfortunately, this multi-perspective approach is too coarse to provide more valuable information, because the social ontology only addresses the relations of social dependency, which is unable to cover uncertainties from the changing real world; and while the dynamic ontology refreshes the business rules and thus updates the relations among exceptions, it does not provide any direct connection to the real uncertainties that actually concern us. Therefore, in this paper, we go deeper and investigate their direct relations with those uncertainties.

2.2 Detection Intelligence of Logistics Exceptions
In RFID-enabled supply chains, the detection intelligence for logistics exceptions mainly adopts machine learning techniques, which often require a template tier for training purposes, as reported in [1, 14–16]. The inference tier usually consists of six engines, which include the EPC object event query engine (OEQE) from the renowned EPCglobal network, the network explorer engine (NEE) as a GUI interface, the item verifier engine (IVE), the case-based reasoning engine (CBRE) to retrieve similar cases with similar patterns from the case database, the data-mining engine (DME) to cluster similar patterns, and the intelligent agent engine (IAE) to detect abnormal patterns using rule-based reasoning. Obviously, to some extent, this can achieve the goal for some simple situations, but it fails when more sophisticated situations appear, for instance when one exception triggers other exceptions. Therefore, in this paper we need to develop the relations that link directly to the uncertainties due to the changing real world.
3 Cause-Effect Diagram of Logistics Exceptions A supply chain is typically a set of organizations, for instance, the supplier, logistics service provider, the retailer or end customers, linked together via physical and information (or more specifically data) flows to achieve the goal of moving the products to the end customers, as shown in Fig. 1.
Fig. 1. Illustration of a typical supply chain.
While the physical flow is mainly downstream, from the suppliers to the retailers, the data flow can be both downstream and upstream, because the end customers order from the suppliers and the delivery data in accordance with the order then travel down the chain. For simplicity, in this paper we focus our investigation on the logistics exceptions, and we limit our discussion to the internal business processes of the logistics service providers, while taking the data exchange across their borders into consideration.

3.1 Logistics Business Processes
As shown in Fig. 2, the typical logistics business processing of an organization can basically be viewed as two tiers that are logically and physically interconnected, i.e. the managerial business processes and the executive business processes. In the upper tier, the managerial personnel forecast the future business based on surveys of the relevant industries and on operational results from the executive business processing, make specific business plans according to the respective managing strategies to achieve competitive advantage [17], and then make the decisions on warehouse allocation and traffic dispatching to fulfill the commitment to on-time delivery. In the comparatively simpler lower tier, the staff work on several specific tasks under the business rules and regulations: to accept incoming orders by doing the respective data inputs and possible inspections; to place the packages into the right stocks of the warehouse; and to drive the designated vehicles on schedule along the planned route to short-term destinations for transitions or to the final address as stated on the order form. The interdependence between the two tiers needs more concrete and complicated analysis, essentially through the data sets exchanged between them at all stages.
Fig. 2. Illustration of typical logistics business processes.
In Fig. 3, an example is given describing the interdependencies of those processes, using the techniques reported in [9]. It includes:
• The managerial tier depends on the information on current orders, available stocks, dispatchable vehicles, and routing options for delivery to provide predictions for future planning according to current market conditions.
• The managerial tier depends on the general utility of the current resources and the dynamic distribution of the orders to make the scheduled plan and the strategy of concrete dispatching based on the forecasting results.
• The managerial tier schedules specific dispatching plans for resource allocation, which may be modified and adjusted according to feedback from the executive tier.
• The order information is input in the executive tier and shared with or sent to the managerial tier for resource allocation.
• Information on the warehouse is computed and shared with or sent to the managerial tier for resource allocation via the network of the logistics organization. Some necessary adjustment among warehouses at different network nodes in the same region should be made to meet the requirements of newly generated orders.
• Information on the warehouse and the route should be sent to the staff responsible for the delivery, together with the scheduled vehicle allocations.
• The delivery depends on the information on the orders and on the traffic (vehicle and routing) according to the scheduled dispatching.
Fig. 3. Example of interdependencies among those business processes.
3.2 Uncertainties Related to Outbound Logistics Exceptions
Logistics exceptions can be inbound or outbound, a classification made for different management purposes. In this paper, for simplicity, we focus our discussion on outbound logistics exceptions, but the approach we use can easily be adapted to inbound logistics exceptions. To begin with, the taxonomy of outbound logistics exceptions is re-organized and developed on the basis of the previous results reported in [9] using the resource-based view (RBV) [17, 18] strategy, as shown in Fig. 4.
Fig. 4. Taxonomy of outbound logistics exceptions in a resource-based view.
The resources of the logistics organizations can be categorized into five types: the warehouse, staff, vehicles, data, and others. Some exceptions can be found in several categories, because they may be caused by resources of different types. The taxonomy of outbound logistics exceptions in the resource-based perspective provides the necessary conditions for us to understand the uncertainties related to them.
Assertion 1: From the resource-based perspective, it is certain that an exception occurs only when at least one resource mismatches the situation. This assertion is obviously true, because exceptions cannot occur when all resources work well with the environment.
Assertion 2: From the resource-based perspective, it is certain that uncertainties can be defined as factors that are able to change the resources or make them malfunction. This assertion implies that the detection rules can be built by finding the sources that cause the exceptions. However, the factors that potentially cause exceptions may be very complicated and need classifying, i.e. a taxonomy of them is needed. Most of the resources can be parameterized. For instance, the warehouse can be characterized by its capacity and number of stocks; a vehicle can be described by its capacity and distance coverage; staff can be evaluated by their skills, personality, and numbers; data can be evaluated by their quality, relations, and dynamics; and other soft resources can also be measured in a similar way. Based on this understanding, the taxonomy of the uncertainties that may impact the resources in the logistics processes is developed with daily business logic and rules, and is shown in Fig. 5.
Fig. 5. Taxonomy of the uncertainties that impact logistics resources.
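To make this parameterisation concrete, the resource types could be modelled along the lines of the sketch below; the particular fields follow the examples given in the text and are our own illustration, not a schema defined in the paper.

// Illustrative parameterisation of the five resource types (cf. Fig. 4 and Sect. 3.2).
// Field choices are examples inferred from the text, not an official schema.
#include <string>
#include <vector>

struct Warehouse     { double capacity; int numberOfStocks; };
struct Vehicle       { double capacity; double distanceCoverageKm; };
struct Staff         { std::vector<std::string> skills; int headcount; };
struct DataAsset     { double quality; int relations; double updateRate; };
struct OtherResource { std::string name; double utilisation; };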
The relevant uncertainties can be further classified into four types for a deeper understanding, i.e. generic market state, specific market state, environmental factors, and managerial factors.
Generic Market State: The current condition of the generic market defines the basic state of the industry and thus determines the essential supply-demand relations, which in turn determine the general scale of the logistics organizations in a region. Usually, it gives a general description of the relevant industries in a nation or a region. This kind of state can now be effectively predicted using big data techniques.
Specific Market State: The current condition of the specific market defines the basic state of the industry in a specific region and thus determines the essential supply-demand relations in that place, which determine the actual scale of the logistics organization in the region. Local patterns of this kind might be predicted using local economic data, but some randomized factors still exist.
Environmental Factors: These mainly refer to geological and climate events that may cause road/bridge damage or route changes. They can be authenticated after the exceptions have occurred.
Managerial Factors: These mainly refer to the specific managerial strategies of the logistics organizations, including their planning and resource-dispatching abilities. They can be improved by training staff and by diminishing human errors with automation.
The taxonomy of those uncertainties and the detailed interpretations are given in Table 1.

Table 1. Taxonomy of uncertainties in their own perspective
• Generic market state. Subtypes: global change (rapid demand increase/reduction) and local change (rush chained demand). Interpretation: the developing state of the economies will have impacts on its global and local market.
• Specific market state. Subtypes: internal industries (strong/less competitive) and external industries (enhance/diminish the economies). Interpretation: social dependencies will have impacts on its developing trends.
• Environmental factors. Subtypes: natural disasters (bad weather, earthquake, etc.) and unexpected events (road/bridge damage, etc.). Interpretation: natural disasters and unexpected events or technical failures, etc.
• Managerial factors. Subtypes: planning (mistakes on resources) and dispatching (mistakes on strategies). Interpretation: bad managerial strategies will impact the specific efficiency, etc.

3.3 Cause-Effect Diagram of Logistics Exceptions
Based on the results reported in Figs. 4 and 5, a set of cause-effect diagrams of logistics exceptions can be developed in a resource-based view.
With regard to the inventory-management resource, the warehouse, the conventional logistics exceptions include rush orders, mismatched stock planning, goods tracking, capacity, and no/missed delivery. Its cause-effect diagram can be developed using the fishbone diagram technique reported in [8], as shown in Fig. 6.
Fig. 6. Cause-effect diagram of warehouse-related logistics exceptions.
Note that the mismatched stock plan defines exceptions in which the designated stocks are insufficient or excessive for the current daily demand, whereas a bad storage plan mainly refers to the case in which warehouses are planned in different places in the same region. In the case of an RFID-enabled system, the RFID tag data acquisition may cause the exception of no/missed tag reading, triggering a data inconsistency with the information flow, i.e. the missed data exception. Therefore, it is important to note that interdependencies among those resources exist. According to this diagram, it is easy to see that a mismatched stock plan may cause capacity exceptions, which may in turn cause the goods tracking exception and then trigger the no/missed delivery exception. Moreover, it is also easy to see that some exceptions can be diminished by means of an optimization process, while others depend on external uncertainties. The staff-related exceptions include allocation planning, no/missed delivery, wrong delivery and delivery holdings. The related cause-effect diagram is developed and shown in Fig. 7.
Fig. 7. Cause-effect diagram of staff-related logistics exceptions.
Obviously, the impacts that the managerial tier has on the executive tier can be seen from this diagram, and it is expected that good strategies may be developed based on surveys of the actual activities. The vehicle-related exceptions include vehicle allocation, vehicle scheduling, routing, capacity, and late/partial delivery. The corresponding cause-effect diagram is shown in Fig. 8. Note that some exceptions are caused by multiple factors which may not be listed exhaustively in the diagram; when required in practical applications, however, the diagram can be completed within the scope of the topic at hand. Similarly, the data-related exceptions include wrong data source, wrong data, missing data, wrong delivery and delivery holdings. The corresponding cause-effect diagram is shown in Fig. 9. While data-related exceptions may be brought about by multiple factors, they can be detected by comparing data from the information flow and the physical flow, respectively. The other-factors-related exceptions include long lead time, order changes, schedule changes, and delivery holdings, which stem mainly from the managerial capabilities. A cause-effect diagram of this kind is developed and shown in Fig. 10. The long lead time exception typically stems from complicated procedures that require confirmation from both the suppliers and the logistics service providers, which in turn triggers the delivery holdings exception.
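As an illustration of detecting data-related exceptions by comparing the information flow with the physical flow, a minimal sketch is given below; the record fields, matching rule and exception labels are our own assumptions for illustration, not the authors' implementation.

// Illustrative check of information flow vs. physical flow: every EPC expected
// from the order/e-pedigree data should have a matching RFID read, and any
// missing, misplaced or extra read is flagged as a data-related exception.
#include <map>
#include <string>
#include <vector>

struct ExpectedItem { std::string epc; std::string destination; };
struct TagRead      { std::string epc; std::string location; };

std::vector<std::string> dataExceptions(const std::vector<ExpectedItem>& expected,
                                        const std::vector<TagRead>& reads) {
  std::map<std::string, std::string> readsByEpc;
  for (const auto& r : reads) readsByEpc[r.epc] = r.location;

  std::vector<std::string> exceptions;
  for (const auto& e : expected) {
    auto it = readsByEpc.find(e.epc);
    if (it == readsByEpc.end()) {
      exceptions.push_back("missing data: no tag read for EPC " + e.epc);
      continue;
    }
    if (it->second != e.destination)
      exceptions.push_back("wrong data: EPC " + e.epc + " read at " + it->second);
    readsByEpc.erase(it);                       // consume the matched read
  }
  for (const auto& leftover : readsByEpc)       // reads with no expected item
    exceptions.push_back("wrong data source: unexpected EPC " + leftover.first);
  return exceptions;
}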
Fig. 8. Cause-effect diagram of vehicle-related logistics exceptions.
Fig. 9. Cause-effect diagram of data-related logistics exceptions
Fig. 10. Cause-effect diagram of others-related logistics exceptions.
3.4 An Improved Intelligent Exception Monitoring System
Proactive and effective logistics exception monitoring systems with intelligent agents have been widely reported in previous works [9], where the agents are expected to retrieve the necessary data from the data sources, check them for validity and accuracy, test them for exceptions based on business rules, report the testing results and update the system by adding new case records to the data repository. In our proposed system, new functionalities are added for more detailed inference: an exception relational parser, a big-data analytic engine, and a network explorer engine, which are listed in Table 2. The architecture of the monitoring system is shown in Fig. 11.

Table 2. Newly-added main functions
• Exception relational parser (ERP): provides services for parsing the cause-effect relations among the exceptions identified.
• Network explorer engine (NEE): explores, decomposes and integrates related data sources from the Internet for big-data analysis.
• Big-data analytic engine (BAE)*: provides services for eliminating the uncertainties using adaptable template-based reasoning and, if requested, produces reports of potential inferred events.
* Since this engine is beyond our topic, it will be discussed in the future.
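A minimal sketch of how the ERP could store and walk the cause-effect relations of Sect. 3.3 is given below; the data structure, the traversal and the example relations (taken from the warehouse diagram of Fig. 6) are our own illustration rather than the authors' implementation.

// Illustrative cause-effect store for the ERP: each detected exception maps to
// its candidate causes, so the monitor can walk back to root causes instead of
// testing every business rule. Example relations follow Fig. 6 (warehouse view).
#include <map>
#include <set>
#include <string>
#include <vector>

using CauseEffect = std::map<std::string, std::vector<std::string>>;

std::vector<std::string> rootCauses(const CauseEffect& g, const std::string& exception) {
  std::vector<std::string> roots, stack{exception};
  std::set<std::string> seen;
  while (!stack.empty()) {
    std::string e = stack.back(); stack.pop_back();
    if (!seen.insert(e).second) continue;            // avoid revisiting
    auto it = g.find(e);
    if (it == g.end() || it->second.empty()) { roots.push_back(e); continue; }
    for (const auto& cause : it->second) stack.push_back(cause);
  }
  return roots;
}

int main() {
  CauseEffect g{
    {"No/missed delivery", {"Goods tracking"}},
    {"Goods tracking",     {"Capacity"}},
    {"Capacity",           {"Mismatched stock plan"}},
  };
  rootCauses(g, "No/missed delivery");               // -> {"Mismatched stock plan"}
  return 0;
}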
Fig. 11. Architecture of logistics exceptions monitoring system.
4 Case Study and Discussions

In order to compare with [9], we adopt the two cases used there, i.e. (1) the case in which a "No delivery" exception is detected because of a wrong address, and (2) the case in which a planning exception can be identified at the instance of an unavailable planned driver and/or unavailable stocks. We compare our approach with that in [9] with respect to ontology development and business rules.

4.1 Ontology Development
According to [9], the static ontology (taxonomy) should first be determined using multi-level reasoning: at the first level, an exception is identified; at the second level, the delivery exception and/or planning exception can be located; and, depending on the cause of the exceptions, at the third level the "No delivery" or "delivery holdings" exceptions can be further classified. Similarly, the dynamic ontology can be developed. The dependencies among the senders, logistics service providers and receivers in those specific situations become very concrete; for instance, in the case of "No drivers", the senders turn out to be the planning managers and the receivers would be the drivers, and thus the social dependencies can be put in place. The reader should refer to [9] for more details.
In our proposed scheme, a slightly different taxonomy is developed for the two cases of "No delivery" due to a wrong address and of unavailable drivers and insufficient stocks, as shown in Fig. 12.
Fig. 12. Taxonomy of situational exceptions.
Obviously, the taxonomy of our scheme provides not only the necessary structured view of the exceptions but also the cause-effect relations among them. Whenever an exception is detected, the type of the related resources can be determined at the same time; in our test cases, these are the drivers, the stocks, and the data. The causes of bad planning and wrong data (the wrong address as specified) can further help us to focus on the most relevant situations.

4.2 Business Rules
In [9], a set of procedures is used as the business rules to test and monitor the exceptions according to the taxonomy previously developed. The pseudo-code from [9] is shown as follows:
Procedure Exception_monitoring_process
    IF delivery_scheduled_time >= delivery_actual_time
        THEN GOTO Procedure END
    ELSE
        RESET Exception
        DO WHILE Exception
            /* Check all the business rules */
            Procedure Check_address
            Procedure Check_delivery_instructions
            ...
            /* Check all the business rules */
        END DO
        IF Exception found
            Update system
            Send message    /* send reports */
        END IF
    END IF
Procedure END    /* End the process */

Procedure Check_address
    SET Exception = FALSE
    IF address = ''
        SET Exception = TRUE
        THEN GOTO Procedure 'END Address'
    ELSE
        DO WHILE Exception
            /* Check all possible factors of the address */
            IF (City = '') OR (City = FALSE)
                THEN SET Exception = TRUE
            END IF
            IF (Street = '') OR (Street = FALSE)
                THEN SET Exception = TRUE
            END IF
            IF (Property number = '') OR (Property number = FALSE)
                THEN SET Exception = TRUE
            END IF
        END DO
    END IF
Procedure 'END Address'    /* End the procedure */
Apparently, these business rules can cover and integrate all possible reasons for the single "No delivery" exception under the delivery instructions. However, from our proposed cause-effect diagram it is understood that this exception may be brought about by different resource-based exceptions. For instance, apart from the "wrong address" exception, the warehouse-based exceptions may also trigger the "No/missed delivery" exception. Therefore, in our scheme, once an exception is detected, the related resource is first determined by means of the business rules and the cause-effect diagram. As in the case of the wrong address, the data-related exceptions can then be checked. The basic logic is illustrated as follows:
Procedure Exception_monitoring_process
    IF delivery_scheduled_time >= delivery_actual_time
        THEN GOTO Procedure END
    ELSE
        RESET Exception
        DO WHILE Exception
            /* Check the resources' conditions based on cause-effect diagram */
            IF (address = '') OR (address = FALSE)
                SET Exception = TRUE
                THEN GOTO Procedure Check_data    /* check data-related exceptions */
            END IF
            IF (staff signature = '') OR (staff signature = FALSE)
                SET Exception = TRUE
                THEN GOTO Procedure Check_staff
            END IF
            IF (goods number = '') OR (goods number = FALSE)
                SET Exception = TRUE
                THEN GOTO Procedure Check_warehouse
            END IF
        END DO
        IF Exception found
            Send reports
        END IF
    END IF
Procedure END    /* End the process */

Procedure Check_data
    SET Exception = FALSE
    IF data_difference >= Threshold
        SET Exception = TRUE
        THEN GOTO Procedure 'END data'
    ELSE
        DO WHILE Exception
            /* Check the data-related causes */
            IF (Data = '') OR (Data = FALSE)    /* missing data or wrong data */
                THEN SET Exception = TRUE
            END IF
            ...    /* other situations based on the cause-effect diagram */
        END DO
    ...    /* other procedures */
Obviously, only with explicit cause-effect diagrams can all possible cause exceptions be listed and retrieved from the database for such examinations. For example, warehouse-related exceptions may also trigger the "No delivery" exception because of the "goods tracking" exception instead of the most common "wrong address", which is seldom seen nowadays. By comparing the two exception monitoring processes in their pseudo-code format, we find that (1) the second monitoring process bears explicit and complete logic for the related exceptions, thereby providing a better and more generic treatment of exception identification; and (2) the cause-effect diagram of the second monitoring process helps to reduce the computing complexity of the procedures and thus works well to improve the system's efficiency.
5 Conclusions

In this paper, we have investigated mainly the outbound logistics exceptions in the resource-based view, focusing on the taxonomy of the logistics exceptions from the perspective of the logistics resources, and their cause-effect relations, which we believe
would be critical in logistics exception monitoring. By comparing with [9], we explained and demonstrated that such an explicit representation of the cause-effect diagram can not only map the taxonomy directly into the procedures (the business rules) of exception identification, but also helps to reduce the computing complexity of the identifying procedures and thus works well to improve system efficiency, enabling the system to enhance its power of product anti-counterfeiting. Moreover, we also noticed that the system response to the exceptions can be greatly improved on the basis of these results, using predictive tools and big data analytical techniques, which will be further investigated in the near future. Acknowledgment. This work is supported by the Natural Science Foundation of China under grant No. 61462023.
References 1. Yao, X., Zhou, X., Ma, J.: Object event visibility for anti-counterfeiting in RFID-enabled product supply chains. In: Proceedings of the Science and Information Conference, London, UK, pp. 141–150, 28–30 July 2015 2. Musa, A., Gunasekaran, A., Yusuf, Y.: Supply chain product visibility: methods, systems and impacts. Expert Syst. Appl. 41, 176–194 (2014) 3. Schapranow, M., Muller, J., Zeier, A., Plattner, H.: Costs of authentic pharmaceuticals: research on qualitative and quantitative aspects of enabling anti-counterfeiting in RFIDaided supply chain. Pers. Ubiquit. Comput. 16, 271–289 (2012) 4. Wemmerlöv, U.: A taxonomy for service processes and its implications for system design. Int. J. Serv. Ind. Manag. 1(3), 20–40 (1990) 5. Svensson, G.: A conceptual framework of vulnerability in firms’ inbound and outbound logistics flows. Int. J. Phys. Distrib. Logist. Manag. 32(2), 110–134 (2002) 6. Russell, N., Aalst, W., Hofstede, A.: Workflow exception patterns. In: Dubois, E., Pohl, K. (eds.) CAiSE, Berlin, pp. 288–302 (2006) 7. Ozkohen, A., Yolum, P.: Predicting exceptions in agent-based supply chains. In: Engineering Societies in the Agents World, pp. 168–183 (2006) 8. Wang, M., Wang, H., Kit, K., Xu, D.: Knowledge-based exception handling in securities transactions. In: Proceedings of 37th Hawaii International Conference on Information System Sciences, Hawaii, 30074a, vol. 3 (2004) 9. Xu, D., Wijesooriya, C., Wang, Y., Beydoun, G.: Outbound logistics exception monitoring: a multi-perspective ontologies’ approach with intelligent agents. Expert Syst. Appl. 38, 13604–13611 (2011) 10. Yang, L., Peng, P., Dang, F., Wang, C., Li, X., Liu, Y.: Anti-counterfeiting via federated RFID tags’ fingerprints and geometric relationships. In: Proceedings of 34th IEEE Conference on Computer Communications (INFOCOM), Hong Kong, pp. 1966–1974, 26 April–1 May 2015 11. Wang, L., Ting, J., Ip, W.: Design of supply-chain pedigree interactive dynamic explore (SPIDER) for food safety and implementation of hazard analysis and critical control points (HACCPs). Comput. Electron. Agric. 90, 14–23 (2013) 12. Masciari, E.: SMART: stream monitoring enterprise activities by RFID tags. Inf. Sci. 195, 25–44 (2012)
13. Noy, N., Klein, M.: Ontology evolution: not the same as schema evolution. Knowl. Inf. Syst. 6(4), 428–440 (2004) 14. Ahmada, S., Lavin, A., Purdya, S., Agha, Z.: Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262, 134–147 (2017) 15. Kwok, S., Tsang, A., Ting, J., Lee, W., Cheung, B.: An intelligent RFID-based electronic anti-counterfeit system (InRECS) for the manufacturing industry. In: Proceedings of the 17th International Federation of Automatic Control (IFAC) World Congress, Seoul, Korea, pp. 5482–5487, 6–11 July 2008 16. Cheung, H., Choi, S.: Implementation issues in RFID-based anti-counterfeiting systems. Comput. Ind. 62, 708–718 (2011) 17. Barney, J.: Firm resources and sustained competitive advantage. J. Manag. 17(1), 99–121 (1991) 18. Chi, J., Sun, L.: IT and competitive advantage: a study from micro perspective. Modern Econ. 6(3), 404–410 (2015)
Precision Dairy Edge, Albeit Analytics Driven: A Framework to Incorporate Prognostics and Auto Correction Capabilities for Dairy IoT Sensors

Santosh Kedari1, Jaya Shankar Vuppalapati1, Anitha Ilapakurti2, Chandrasekar Vuppalapati2, Sharat Kedari2, and Rajasekar Vuppalapati2

1 Hanumayamma Innovations and Technologies Private Limited, HIG – II, Block – 2/Flat – 7, Baghlingumpally, Hyderabad 500 044, India
{skedari,jaya.vuppalapati}@hanuinnotech.com
2 Hanumayamma Innovations and Technologies, Inc., 628 Crescent Terrace, Fremont, CA, USA
{ailapakurti,cvuppalapati,Sharat,raja}@hanuinnotech.com

Abstract. The Oxford English Dictionary defines prognostics as "an advance indication of a future event, an omen". Generally, the term is confined to fortune or future telling, and is subjective or intuition driven. Data science, on the other hand, is beginning to enable the modelling and prediction of the health condition of a system and/or its components based upon current and historical system-generated data or status. The chief goal of prognostics is the precise estimation of the Remaining Useful Life (RUL) of equipment or devices. Through our research and through the industrial field deployment of our dairy IoT sensors, we emphatically conclude that prognostics is a vital marker in the lifecycle of a device that can be used as an inflection point to trigger auto-corrective, edge-analytics-driven behavior in dairy IoT sensors, so that the desired as-shipped functions can be achieved with precision. Having an auto-corrective capability plays a pivotal role in achieving the satisfaction of dairy farmers and in reducing the cost of maintaining the dairy sensors for the manufacturers, as these sensors are deployed in geographically different regions with intermittent or no network connectivity. Through this paper, we propose an inventive, small-footprint ML (Machine Learning) dairy edge that incorporates supervised and unsupervised models to detect prognostic conditions so as to infuse auto-corrective behavior and improve the precision of the dairy edge. The paper presents the industrial dairy sensor design and deployment as well as its data collection and certain field experimental results.

Keywords: Dairy sensors · Precision sensors · Prognostics · Dairy · Precision dairy edge · Prognosis approach · Open system architecture for condition based monitoring · OSA-CBM · Hanumayamma Innovations and Technologies
1 Introduction

Using system-generated data, current and/or historical, one can model and predict the health condition of a system and/or its components. This is the underlying and chief principle of prognostics [1]. The chief aim of prognostics is the precise estimation of the Remaining Useful Life (RUL) of equipment or devices [2]. Through our research and through the industrial field deployment of our dairy IoT sensors, we emphatically conclude that prognostics is a vital marker in the lifecycle of a device that can be used as an inflection point to trigger auto-corrective, edge-analytics-driven behavior in dairy IoT sensors, so that the desired as-shipped functions can be achieved with precision. IoT edge analytics is the enabler and catalyst for prognostics. The IoT architecture generally comprises three layers: the edge layer, the fog layer and the cloud layer. The edge layer usually contains sensors (in our case, our dairy sensor), actuators, and embedded devices that are typically the sources of the data. The fog layer is generally characterized as a network layer with network equipment such as gateways and other connected equipment with pre-deployed applications. The cloud layer includes compute and storage infrastructure with servers. In order to deliver rapid responses to the sources in the IoT, especially at the edge layer, edge analytics collects and aggregates the data at the source before relaying them to the cloud. Analyzing the data closer to the source enables edge analytics not only to provide swift responses but also to capture prognostic device markers that peek into rapidly changing device health conditions, data anomalies and abnormalities. Edge analytics with prognostics capabilities, in general, corroborates the core tenet of the Open System Architecture for Condition-Based Monitoring (OSA-CBM) principle: "reduce cost & improve the life of the device". We can draw parallels between the architectural processing components of edge analytics and those of OSA-CBM; the edge analytics core processing components, semantically, fall very much into the categories of components outlined in the OSA-CBM architecture, shown in Fig. 1. The processing components include: data procurement or acquisition, data operation or manipulation, state model or detection, health validation or assessment, prognostics, logic or decision support, and closed-loop response or presentation [3]. Devices with built-in prognostics capabilities can help in reducing the cost of device post-sales operations and engineering service efforts for the manufacturers. For instance, in the energy industry, a proven use case for the improvement of safety and the reduction of remediation cost has been achieved through the application of intelligent prognostics1. Similarly, Harpreet Singh2, Founder and Co-CEO of Experfy3, noted that "in order to maximizing return on the device investment, prognostic analytics for predictive maintenance is essential".
1 Spark recognition - http://www.iiconsortium.org/energy-summit/presentations/Industrial_Internet_Summit_SparkCognition.pdf
2 IoT and Prognostic Analytics for Predictive Maintenance - https://www.experfy.com/blog/iot-and-prognostic-analytics-for-predictive-maintenance
3 Experfy - https://www.experfy.com/blog/iot-and-prognostic-analytics-for-predictive-maintenance
Fig. 1. Open System Architecture for Condition-Based Maintenance (OSA-CBM, http://www.mimosa.org/mimosa-osa-cbm) architecture [4].
Thus, devices with built-in auto-corrective capabilities, infused with the help of edge processing and prognostic analytics, will improve device longevity and thus improve the overall device return on investment (ROI). As part of this paper, we present the data anomalies that our sensors have encountered in dairy fields, propose a self-correcting prognostic detection algorithm, and finally deploy these algorithms on a small-footprint (32 KB in-memory) edge embedded device. The structure of the paper is as follows: edge, hardware, prognostics, and machine learning concepts are discussed in Sect. 2; Sect. 3 presents the edge architecture; system design, field data analysis, prognostics and model construction are discussed in Sect. 4; and the conclusion, future work and acknowledgements are included in Sect. 5.
2 Understanding Precision Dairy IoT Sensor: Software, Analytics and Hardware

2.1 Embedded IoT Sensor with Temperature and Humidity (TH) Capabilities
For deploying a precision dairy sensor with prognostics capability, our sensor stack comprises the following hierarchy (Fig. 2):
1. Bluetooth Chip: Mobile Interface
2. Accelerometer: Forces of acceleration
3. Timer
4. Sensor: Retrieves temperature & humidity
5. Block with Pro Terminal
6. ROM Data Store
7. Microcontroller (ATmega328P)
8. Power Supply (bottom).

2.2 Sensors
The embedded device (Fig. 3) uses the Silicon Labs Si7020-A10 sensor4 architecture for measuring temperature and humidity, with a measurement accuracy of up to ±0.4 °C.
Fig. 2. TH device.
Fig. 3. Si7020 Shield (http://www.mouser.com/ds/2/368/Si7020-272416.pdf).
4 Si7020 – A10: http://www.mouser.com/ds/2/368/Si7020-272416.pdf
The specification of the Drain Supply Voltage (VDD) includes: 1.9 V ≤ VDD ≤ 3.6 V; TA = −40 to +85 °C (G grade) or −40 to +125 °C (Y grade).

2.3 Humidity Sensors5
The scientific definition of humidity is "the water vapor content in air, atmosphere or other gases". Generally, humidity measurements are expressed in a variety of terms and units; the three most commonly used are absolute humidity, dew point, and relative humidity (RH) [5]. Humidity data capture is more cumbersome than temperature data capture. Recent developments in semiconductor technology, however, have paved the way for three kinds of humidity sensors: capacitive, resistive, and thermal conductivity6 [6]. In its basic form, a thermal conductivity sensor is built using two negative temperature coefficient (NTC) thermistor elements in a DC bridge circuit (Figs. 4 and 5). One thermistor is sealed in dry nitrogen, while the second thermistor is exposed to the environment (see figure). The absolute humidity is obtained from the difference in resistance between the two thermistors [6].
Fig. 4. Thermal conductivity (or absolute) humidity sensors [5].
5 Materials Science and Engineering – A First Course by V. Raghavan, Fifth Edition, Thirty-Fourth Print, April 2007, Prentice-Hall of India Pvt Ltd.
6 Sensor Technology Handbook, Jon S. Wilson.
Fig. 5. Circuit for relative humidity and temperature measurement (Si7020 – A10 - http://www. mouser.com/ds/2/368/Si7020-272416.pdf).
The temperature and humidity circuit of the Si7020-A10 based sensors is shown in Fig. 5.

2.4 Machine Learning
(1) Decision Tree: A decision tree (DT), as the name implies, is a tree-like structure with the following data-to-node representations: (a) intermediate nodes represent attributes of the data, (b) leaf nodes represent outcomes, and (c) branches hold attribute values. Since no domain knowledge is needed to construct a decision tree, DTs are extensively used in the classification process [7].
Fig. 6. EEPROM code.
For a given set of data, identifying the root node of the decision tree is the primary step; Information Gain and Gini impurity are two methods to identify the root node. (a) Information Gain: Used to compute (using entropy and information) the root node and the branch nodes.
Fig. 7. Measuring sensor accuracy including hysteresis.
Fig. 8. Prognosis approach using feature map [9].
The formula for entropy is [7]:

ENT(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (1)

The attribute information is calculated using (2):

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)    (2)

Gain(A) = Info(D) - Info_A(D)
(b) Root Node: The attribute with the highest information gain (i.e., the entropy minus the information of that attribute) is chosen as the root node.
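As a hedged illustration of Eqs. (1)–(2) and the root-node rule above, the short Python sketch below computes entropy and information gain over a toy dataset; the attribute names and records are hypothetical, not taken from the dairy deployment.

# Minimal sketch: choosing the decision-tree root attribute by information gain.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attribute):
    # Info_A(D) = sum_j |D_j|/|D| * Info(D_j);  Gain(A) = Info(D) - Info_A(D)
    total = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return total - weighted

rows = [{"humidity": "high", "temp": "hot"}, {"humidity": "low", "temp": "hot"},
        {"humidity": "high", "temp": "mild"}, {"humidity": "low", "temp": "mild"}]
labels = ["fault", "ok", "fault", "ok"]
root = max(["humidity", "temp"], key=lambda a: info_gain(rows, labels, a))
print(root)   # the attribute with the highest information gain becomes the root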
2.5 Linear Model – Temperature and Humidity [8]
The forecasted temperature or humidity (TH_{ft}) at each time t is assumed to be linearly related to the observed temperature or humidity at the same time (TH_{ot}) as follows7:

TH_{ot} = a + d \, TH_{ft} + V_T    (3)

where V_T is a random variable.

(2) Temperature method [8]: The relation of observed to forecasted temperature is:

TH_{f2} = TH_{o1}; \quad TH_{f3} = c \, TH_{o2} + (1 - c) \, TH_{f2}; \quad TH_{f(t+1)} = c \, TH_{ot} + (1 - c) \, TH_{ft}    (4a)

or equivalently

TH_{ft} = TH_{f(t-1)} + c \, (TH_{o(t-1)} - TH_{f(t-1)})    (4b)

where c is a defined constant value between 0 and 1. A general formulation can then be given as follows (linear regression):

TH_{ft} = c \, TH_{o(t-1)} + c(1 - c) \, TH_{o(t-2)} + c(1 - c)^2 \, TH_{o(t-3)} + \ldots + c(1 - c)^{t-2} \, TH_{o1}    (5)
7 Correcting temperature and humidity forecasts using Kalman filtering: Potential for agricultural protection in Northern Greece – https://www.researchgate.net/publication/233997840_Correcting_temperature_and_humidity_forecasts_using_Kalman_filtering_Potential_for_agricultural_protection_in_Northern_Greece

Note: the optimal value for the constant c is around 0.6 (see footnote 7).
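The recursion in Eq. (4b) can be sketched in a few lines of Python; the observation values below are invented, and c = 0.6 simply follows the note above.

# Sketch of Eq. (4b): TH_f(t) = TH_f(t-1) + c * (TH_o(t-1) - TH_f(t-1)).
def forecast(observed, c=0.6):
    forecasts = [observed[0]]            # TH_f2 = TH_o1 (Eq. 4a)
    for obs in observed[1:]:
        prev = forecasts[-1]
        forecasts.append(prev + c * (obs - prev))
    return forecasts

observed_temp = [21.0, 22.5, 23.1, 22.0, 24.4]   # hypothetical hourly readings
print(forecast(observed_temp))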
3 System Overview

The embedded system, shown in Fig. 9, is built on top of a high-performance 8-bit Advanced Virtual RISC (AVR)-based microcontroller that combines flash memory (32 KB ISP) with read-while-write (RWW) capabilities, 1024 B of ROM, 2 KB of SRAM, 23 general-purpose I/O lines, 32 general-purpose working registers, and a serial programmable universal asynchronous receiver-transmitter. The device operates between 1.8–5.5 V [10].

3.1 System Architecture
The embedded system consists of the following components: Battery Power Supply, Microcontroller, EEPROM, Terminal Blocks, Timer Sensor, Accelerometer, Bluetooth Low Energy (BLE) and Temperature & Humidity Sensors.
(1) Battery Power Supply: Three AA batteries power the system, for a total input voltage of 5 V.
(2) Microcontroller: The embedded system is built on top of a high-performance 8-bit AVR RISC8.
(3) EEPROM: The EEPROM (Fig. 6) provides nonvolatile data storage support. A 0–255 record counter is stored in EEPROM location zero, and a 256–336 record counter (a maximum of 2 weeks at 24 records/day) is stored in EEPROM location one. Records start at EEPROM location two. Additionally, the data arrays are stored on the EEPROM and retrieved on command over Bluetooth Low Energy (BLE).
(4) Bluetooth: The Bluetooth module (Table 1) provides the interface to receive commands from the mobile application. In addition, the values stored in EEPROM are transferred back to the mobile application over Bluetooth.
(5) Humidity and Temperature: Temperature and humidity are calculated based on:

\%RH = \frac{125 \times RH\_Code}{65536} - 6    (6)
where %RH is the measured relative humidity value and RH_Code is the 16-bit word returned by the Si7020 [10]. In an ideal setting, the accuracy of the sensor is measured as follows: the sensor is placed in a temperature- and humidity-controlled chamber (Fig. 7) [10, 11].
(6) Prognosis and Decision Making with Self-Recovery Function [9]: There are several ways to perform prognosis (Fig. 8). One of them involves the development of a regression model, linear or non-linear. Once the model is available, the next step is to use the model to fit the feature values over a period of time (time series) and predict the feature values. The model is then used as a diagnosis procedure to evaluate how the system behaves in the future. To detect fault zones, the model is evaluated to check for feature deviations from normal values. In essence, model identification plays a pivotal role in assessing fault zones [9].

8 AVR RISC – Advanced Virtual Reduced Instruction Set Computer (http://www.atmel.com/products/microcontrollers/avr/)
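A minimal sketch of this regression-based prognosis idea is shown below, assuming NumPy is available; the linear trend model, the 2-sigma deviation threshold and the sample humidity series are illustrative choices, not the authors' deployed algorithm.

# Fit a linear trend to a window of feature values and flag fault zones where
# observations deviate strongly from the model.
import numpy as np

def fit_trend(values):
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, 1)      # simple linear regression
    return slope, intercept

def fault_zones(values, threshold=2.0):
    slope, intercept = fit_trend(values)
    predicted = slope * np.arange(len(values)) + intercept
    residuals = values - predicted
    return np.where(np.abs(residuals) > threshold * residuals.std())[0]

humidity = np.array([60.1, 60.4, 60.2, 60.8, 61.0, 74.5, 61.3, 61.6])  # one spike
print(fault_zones(humidity))   # indices whose deviation suggests a fault zone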
Fig. 9. Sensor block diagram.
Table 1. Bluetooth code

float Adafruit_HTU21DF::readHumidity(void) {
  Wire.beginTransmission(HTU21DF_I2CADDR);
  Wire.write(HTU21DF_READHUM);
  Wire.endTransmission();
  delay(50); // add delay between request and actual read!
  Wire.requestFrom(HTU21DF_I2CADDR, 3);
  while (!Wire.available()) {}
  uint16_t h = Wire.read();
  h <<= 8;                 // assemble the 16-bit RH code (remainder of the
  h |= Wire.read();        // listing reconstructed following Eq. (6))
  Wire.read();             // checksum byte, discarded here
  float hum = h;
  hum *= 125;
  hum /= 65536;
  hum -= 6;
  return hum;
}
There are also issues with mutual interpretability. Interjections that express feelings (such as “urggghhh”) might be deemed as irrelevant by the machine. It might be hard, if not impossible, for the machine to “master” contextual knowledge such as some exophoric references3 to historical figures (“the German dictator”, which refers to Hitler). The issue becomes more significant when dialects are used. For example, the negation analysis is based on Standard English usage, which might not be useful for other varieties of English. Speakers of certain dialects like African American Vernacular English (AAVE) usually employ double negatives to emphasize the negative meaning.
2 Multilinguality is a characteristic of tasks that involve the use of more than one natural language (Kay, n.d.).
3 Exophoric reference is referring to a situation or entities outside the text (University of Pennsylvania, 2006).
5 Conclusion and Future Work

In this study, we have built a machine learning model combining acoustic and linguistic features. As the results have shown, this model has significantly higher accuracy than models with only acoustic or only linguistic features. Under this model, excellent prediction can be achieved. Although limitations and challenges are real and a considerable amount of manual work is necessary, the positive results of this study clearly suggest the possibility of achieving fully automated audio sentiment analysis in the future. Based on the limitations and challenges discussed in Sect. 4, the following three main directions of research are proposed:
(1) From audio sentiment analysis towards video sentiment analysis, by incorporating facial expression features, and further towards multi-dimensional sentiment analysis by incorporating physiological features such as blood pressure and heart rate.
(2) From semi-automatic sentiment analysis towards fully automatic sentiment analysis, by reducing the amount of manual processing of data.
(3) From sentiment recognition towards emotion recognition, by enabling classification of specific emotions such as fear, anger, happiness and sadness.

Artificial Intelligence (AI) is becoming an increasingly interdisciplinary field. To achieve the above research goals, cross-discipline cooperation is crucial. Solutions to the challenges of language/emotion recognition and understanding can be inspired by diverse fields, from mathematics and the sciences, which provide us with quantitative methods and computational models, to the humanities and fine arts, which shed light on qualitative analysis and feature selection. From a neuroscience perspective, learning about how the human brain perceives and processes sentiments and emotions might inspire a better machine learning architecture for sentiment prediction. Mathematical modelling could be useful as well: the high complexity of emotions could be captured more comprehensively by mapping the emotion of each utterance into a multi-dimensional vector space. Linguistic theories also imply that language is meaningless without context (the socio-cultural background of the speaker, the conversation setting, and the general mood). It is a timely reminder for Natural Language Processing (NLP) researchers to go beyond content analysis – dissecting language as an isolated entity only made up of different parts of speech – and aim for "context analysis". Without being context-aware, AI will only be machines with a "high Intelligence Quotient (IQ)" but a "low emotional intelligence quotient (EQ)". Emotion theory in drama and acting also provides some insights for developing affective, sentient AI. For example, emotions can be conveyed through subtle means such as silence, cadence, and paralinguistic features (kinesics, i.e. body language, and proxemics, i.e. use of space). This gives us directions in selecting and extracting features salient to sentiment.

Acknowledgment. I would like to express my sincere gratitude to my project supervisor Professor Eddie Ng for being open-minded about my project topic, without which I could not have delved deep into my field of interest. His insightful suggestions and unwavering support have guided me through doubts and difficulties.
A Generic Multi-modal Dynamic Gesture Recognition System Using Machine Learning

G. Gautham Krishna1(B), Karthik Subramanian Nathan2, B. Yogesh Kumar1, Ankith A. Prabhu3, Ajay Kannan2, and Vineeth Vijayaraghavan1

1 Research and Outreach, Solarillion Foundation, Chennai, India
2 College of Engineering, Guindy, Chennai, India
3 SRM University, Chennai, India
{gautham.krishna,nathankarthik,yogesh.bkumar,ankithprabhu,ajaykannan,vineethv}@ieee.org
Abstract. Human computer interaction facilitates intelligent communication between humans and computers, in which gesture recognition plays a prominent role. This paper proposes a machine learning system to identify dynamic gestures using tri-axial acceleration data acquired from two public datasets. These datasets, uWave and Sony, were acquired using accelerometers embedded in Wii remotes and smartwatches, respectively. A dynamic gesture signed by the user is characterized by a generic set of features extracted across time and frequency domains. The system was analyzed from an end-user perspective and was modelled to operate in three modes. The modes of operation determine the subsets of data to be used for training and testing the system. From an initial set of seven classifiers, three were chosen to evaluate each dataset across all modes, rendering the system mode-neutral and dataset-independent. The proposed system is able to classify gestures performed at varying speeds with minimum preprocessing, making it computationally efficient. Moreover, this system was found to run on a low-cost embedded platform – Raspberry Pi Zero (USD 5), making it economically viable.

Keywords: Gesture recognition · Accelerometers · Feature extraction · Machine learning algorithms
1 Introduction
Gesture recognition can be defined as the perception of non-verbal communication through an interface that identifies gestures using mathematical, probabilistic and statistical methods. The field of gesture recognition has been experiencing rapid growth amidst increased interest shown by researchers in the industry. The goal of current research has been the quick and accurate classification of gestures with minimalistic computation, whilst being economically feasible.
Gesture recognition can find use in various tasks such as developing aids for the audio-vocally impaired using sign language interpretation, virtual gaming and smart home environments. Modern gesture recognition systems can be divided into two broad categories: vision-based and motion-based systems. The vision-based system proposed by Chen et al. in [1] uses digital cameras, and that proposed by Biswas et al. in [2] uses infrared cameras to track the movement of the user. For accurate classification of gestures, these systems require proper lighting, delicate and expensive hardware, and computationally intensive algorithms. On the other hand, motion-based systems use data acquired from sensors like accelerometers, gyroscopes and flex sensors to identify the gestures being performed by the user. Of late, most gesture recognition systems designed for effective interaction utilize accelerometers for cheaper and accurate data collection.

Accelerometer-based hand gesture recognition systems deal with either static or dynamic gestures, as mentioned in [3]. Static gestures can be uniquely characterized by identifying their start and end points, while dynamic gestures require the entire data sequence of a gesture sample to be considered. Constructing a dynamic gesture recognition system that is compatible with any user is difficult, as the manner in which the same gesture is performed varies from user to user. This variation arises because of the disparate speeds of the dynamic gestures signed by users. To tackle this problem, a common set of features which represent the dynamic nature of the gestures across various users should be selected.

The authors of this paper propose a gesture recognition system using a generic feature set, implemented on two public accelerometer datasets – uWave and Sony, as shown in [4] and [5], respectively. This system has been trained and tested across various classifiers and modes, giving equal importance to both accuracy and classification time, unlike most conventional systems. This results in a computationally efficient model for the classification of dynamic gestures which is compatible with low-cost systems.

The rest of the paper is organized as follows. Section 2 presents the related work in the area of gesture recognition. Section 3 states the problem statement of the paper. Sections 4 and 5 deal with the datasets and pre-processing used in building the model. Section 6 showcases the features extracted from the datasets. Section 7 discusses the different modes provided to the end-user. Section 8 describes the model used and the experiments performed. Section 9 enumerates the results analyzed in the paper. Section 10 concludes the paper.
2 Related Work
Since the inception of gesture recognition systems, there has been a plethora of research in this domain using accelerometers. Specifically, tri-axial accelerometers have been in the spotlight recently owing to their low-cost and low-power requirements in conjunction with their miniature sizes, making them ideal for embedding into consumer electronic platforms. Previous gesture recognition systems have also used sensors such as flex sensors and gyroscopes, but these have their own shortcomings. The glove-based system proposed by Zimmerman et al. in [6] utilizes flex sensors, which require intensive calibration on all sensors. The use of these sensors increases the cost of the system and also makes the system physically cumbersome. These shortcomings make inertial sensors like accelerometers and gyroscopes a better alternative. In this paper, datasets employing accelerometers were preferred over gyroscopes, as processing the data from gyroscopes results in a higher computational burden.

Contemporary gesture recognition systems employ Dynamic Time Warping (DTW) algorithms for classification of gestures. For each user, Liu et al. [4] employ DTW to compute the look-up table (template) for each gesture, but it is not representative of all users in the dataset, thereby not generalizing to the user-independent paradigm. To achieve a generic look-up table for each gesture that represents multiple users, the system proposed in [7] uses the concept of idealizing a template, wherein the gestures exhibiting the least cost when internal DTW is performed are chosen as the look-up table. Furthermore, DTW is performed again for gesture classification while testing, making the model computationally very expensive to be used in a low-cost embedded platform. The accelerometer-based gesture recognition system proposed in [8] uses continuous Hidden Markov Models (HMMs), but their computational complexity is commensurate with the size of the feature vectors, which increase rapidly. In addition, choosing the optimum number of states is difficult in multi-user temporal sequences, thereby increasing the complexity of estimating probability functions. Moreover, the length of the time series of acceleration values of a gesture sample is made equal and quantization is performed in the system proposed in [4]. This makes the data points either lossy or redundant. The proposed paper, however, utilizes the data points as provided in the datasets without any windowing or alteration, thereby decreasing the computational costs whilst not compromising on efficiency.

In the system employed in [9], Helmi et al. have selected a set of features that has been implemented on a dataset that contains gestures performed by a single user, making them gesture-dependent and not taking into account the concept of generic features encompassing multiple users. The need for a generic set of features which capture the similarities between gestures motivated the authors of this paper to implement a feature set that can be applied to any system.
3 Problem Statement
Previous works on haptic-based gesture recognition systems employed computationally expensive algorithms for identification of gestures with instrumented and multifarious sensors, making them reliant on specialized hardware. Moreover, most gesture recognition systems extract features that are model-dependent, and fail to provide the user a choice between accuracy and classification time. Thus a need arises for a flexible gesture recognition system that identifies dynamic gestures accurately with minimalistic hardware, along with a generic feature set.
To overcome this problem, this paper presents a machine-learning based dynamic gesture recognition system that utilizes accelerometers along with a generic set of features that can be implemented across any model with adequate gesture samples. The system provides the end-user the option to choose between accuracy and classification time, thereby giving equal importance to both.
4 Datasets
For this paper, the authors have chosen two public gesture datasets, viz. the uWave dataset (Du) and the Sony dataset (DS). The datasets were selected owing to their large user campaign of 8 users, with a multitude of gesture samples for a variety of gestures. Du encompasses an 8-gesture vocabulary with 560 gesture samples per gesture collected over a period of 7 days, while DS consists of a collection of 20 gestures with 160 gesture samples each. This shows the diverse nature of the datasets. Both datasets are characterized by U users signing NG gestures with SG samples per gesture, over ND days, with the total number of gesture samples being NGS per dataset, as shown in Table 1.

Table 1. Datasets characterized by their respective attributes

Dataset      Users (U)   Gestures (NG)   Samples per Gesture (SG)   Days (ND)   NGS
uWave (Du)   8           8               10                         7           4480
Sony (DS)    8           20              20                         –           3200
Du comprises 3-D accelerations (g-values) that were recorded using a Wii Remote. The start of a gesture is indicated by pressing the 'A' button on the Wii Remote and the end is detected by releasing the button, as mentioned in [4]. Du consists of four gestures which have 1-D motion while the remaining gestures have 2-D motion, as depicted in Fig. 1.

Fig. 1. uWave gestures.

DS was recorded using a tri-axial accelerometer of a first-generation Sony Smartwatch worn on the right wrist of the user. Each gesture instance was performed by tapping the smartwatch screen to indicate the start and end of the gesture. The data recorded from the smartwatch consists of timestamps from different clock sources of an Android device, along with the acceleration (g-values) measured across each axis, as mentioned in [5]. In DS, there are four gestures which have motions in 1-D while the remaining gestures have motions in 2-D, as illustrated in Fig. 2.
Fig. 2. Sony gestures.
Table 2. Confusion matrix for UM mode on DS ; Brown: Gesture signed, Yellow: Gesture classified, Green: Correct classifications, Red: Incorrect classifications 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 100 0 0 2.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 2.5 0 0 0 0
4 0 0 0 97.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0 2.5 0 0 0 0
6 0 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 97.5 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 95 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 5 100 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 100 2.5 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 97.5 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 97.5 0 0 0 0 0 0 0
14 0 0 0 0 0 0 2.5 0 0 0 0 0 2.5 97.5 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 2.5 97.5 2.5 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.5 92.5 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 0
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100
5 Dataset Preprocessing
To generalize the datasets, we eliminate the dependency on time from the datasets, which consist of timestamps along with the raw g-values. The set of g-values per gesture sample (γ), each with a cardinality of n_γ, is given by

\gamma = \{ g_{x_i}, g_{y_i}, g_{z_i} \}, \quad \forall i \in [1, n_\gamma]    (1)

where g_{x_i}, g_{y_i}, g_{z_i} are the g-values of the accelerometer. On account of the dynamic nature and varying speeds at which different gestures are signed by the users, n_γ varies between different gesture samples. From (1), an overall dataset (G) can be defined by

G = \bigcup_{i=1}^{N_{GS}} \gamma_i    (2)

Equation (2) is representative of both the DS and Du datasets. No pre-processing other than elimination of timestamps was done to either dataset, as the g-values in γ would otherwise be altered, thereby making the datasets lossy.
6 Feature Extraction

6.1 Feature Characterization
The proposed system uses the features mentioned in Table 3, which have already been utilized in previous accelerometer-based studies. From [10], the significance of Minimum, Maximum, Mean, Skew, Kurtosis and Cross correlation has been shown for Activity Recognition, which is a superset of Gesture Recognition. The inter-axial Pearson product-moment (PM) correlation coefficients as a feature have been signified in [11]. Spectral Energy as a feature has been illustrated in [12]. The aforementioned features were iteratively eliminated in various domains until the efficiency of the model was the highest.

Table 3. Feature characterization in time and frequency domains

Features \ Domain              Time    Frequency
                                       FFT     HT
Mean                           (T1)    ✗       (H1)
Skew                           (T2)    ✗       (H2)
Kurtosis                       (T3)    ✗       ✗
PM correlation coefficients    (T4)    ✗       ✗
Cross correlation              (T5)    ✗       ✗
Energy                         ✗       (F1)    (H3)
Minimum                        ✗       ✗       (H4)
Maximum                        ✗       ✗       (H5)
6.2 Domain Characterization
The set of extracted features was then applied in both the time and frequency domains. The features that were extracted in the time domain are Mean, Skew, Kurtosis, PM correlation coefficients and Cross correlation. These features along each axis of the gesture samples of a dataset (G) in the time domain constitute the TF vector, as given in (3).

TF = \bigcup_{j=1}^{5} \{ T_x^j, T_y^j, T_z^j \}    (3)
The time-domain sequences were converted to the frequency domain using the Fast Fourier Transform (FFT), and only Energy, as mentioned in [12], was used, since the other features from the FFT did not provide any significant improvements. The feature vector (FF) along each axis of the gesture samples of a dataset (G) is represented by (4).

FF = \{ F_x^1, F_y^1, F_z^1 \}    (4)

From [13], it can be seen that the Hilbert Transform (HT), hypothesized as a feature, was successful in providing a competitive recognition rate for camera-based handwriting gestures. This novel approach was used in the accelerometer-based gesture recognition model proposed by the authors. The Hilbert transform is a linear transformation used in signal processing that accepts a temporal signal as input and produces an analytic signal. The analytic signal consists of a real and an imaginary part, as explained in [13], where the negative half of the frequency spectrum is zeroed out. The real part represents the input signal and the imaginary part (y) is representative of the Hilbert-transformed signal, which is phase shifted by ±90°. For an input signal (x(t)), the analytic signal (x_a(t)) after the Hilbert transform is given by (5).

Table 4. Confusion matrix for UI mode on DS; Brown: Gesture signed, Yellow: Gesture classified, Green: Correct classifications, Red: Incorrect classifications
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 100 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0
2 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 95 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 65 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0
6 0 0 0 0 15 95 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 5 90 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 10 100 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 5 0 5 0 0 0 85 10 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 5 90 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 95 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 5 100 5 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 95 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 0
15 0 0 0 0 15 0 0 0 0 0 0 0 0 0 90 20 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 80 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 95 5 0 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 30
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 70
x_a = F^{-1}(F(x) \cdot 2U) = x + iy    (5)

where F denotes the Fourier transform, U the unit step function, and y the Hilbert transform of x.
This approach of using the Hilbert Transform for feature extraction is applied across all gesture samples, and the features Mean, Skew, Minimum, Maximum and Energy are calculated as shown in Table 3. The features Mean and Skew, which have been used in TF, and Energy, which is calculated in FF, are also used as Hilbert features, since reusing them here yields better performance for classifying different gestures. These Hilbert-transformed features along each axis of a dataset (G) make up the HF vector, which is given by Eq. (6).

HF = \bigcup_{j=1}^{5} \{ H_x^j, H_y^j, H_z^j \}    (6)
The FeatureSet (FS) for a single dataset (G) is formed from (3), (4) and (6) by appending all the feature vectors – TF, FF and HF – calculated across a dataset (G), as shown in (7).

FS = \bigcup_{j=1}^{N_{GS}} \{ TF \cup FF \cup HF \}_j    (7)
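A rough Python sketch of Eqs. (3)–(7) for a single gesture sample is given below; it relies on standard NumPy/SciPy calls (scipy.signal.hilbert, scipy.stats.skew and kurtosis), but the exact ordering of features and the random sample are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.signal import hilbert
from scipy.stats import skew, kurtosis

def time_features(sample):                    # TF: per-axis statistics, Eq. (3)
    feats = []
    for axis in range(3):
        x = sample[:, axis]
        feats += [x.mean(), skew(x), kurtosis(x)]
    for a, b in [(0, 1), (0, 2), (1, 2)]:     # inter-axial PM and cross correlations
        feats.append(np.corrcoef(sample[:, a], sample[:, b])[0, 1])
        feats.append(np.correlate(sample[:, a], sample[:, b])[0])
    return feats

def fft_energy(sample):                       # FF: spectral energy per axis, Eq. (4)
    return [np.sum(np.abs(np.fft.fft(sample[:, axis])) ** 2) / len(sample)
            for axis in range(3)]

def hilbert_features(sample):                 # HF: features of the Hilbert transform, Eq. (6)
    feats = []
    for axis in range(3):
        y = np.imag(hilbert(sample[:, axis]))     # imaginary part of the analytic signal
        feats += [y.mean(), skew(y), y.min(), y.max(), np.sum(y ** 2)]
    return feats

def feature_vector(sample):                   # one entry of FS, Eq. (7)
    return np.array(time_features(sample) + fft_energy(sample) + hilbert_features(sample))

gamma = np.random.randn(120, 3)               # stand-in tri-axial gesture sample
print(feature_vector(gamma).shape)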
7 End-User Modelling
This paper enables the end-user to select any one of three proposed modes of operation - User Dependent, Mixed User and User Independent. The User Dependent (UD ) mode is an estimator of how well the system performs when the train-test split is between the gestures of a single user. Mixed User (UM ) is representative of the complete set of gestures of all participants. The User Independent (UI ) mode employs a stratified k-fold cross validation technique which corresponds to training on a number of users and testing on the rest.
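The three modes can be sketched with scikit-learn utilities as follows; the array names X, y and groups, the synthetic data, and the use of LeaveOneGroupOut for the user-independent case are assumptions made for illustration, not the authors' code.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

X = np.random.randn(400, 33)                    # feature vectors (FS)
y = np.random.randint(0, 8, 400)                # gesture labels
groups = np.random.randint(0, 8, 400)           # which of the 8 users signed each sample

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)

# User Dependent (UD): 75%-25% split within a single user's samples
mask = groups == 0
Xtr, Xte, ytr, yte = train_test_split(X[mask], y[mask], test_size=0.25, random_state=0)
print("UD:", clf.fit(Xtr, ytr).score(Xte, yte))

# Mixed User (UM): 75%-25% split over all users' samples pooled together
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
print("UM:", clf.fit(Xtr, ytr).score(Xte, yte))

# User Independent (UI): train on some users, test on the held-out user
scores = [clf.fit(X[tr], y[tr]).score(X[te], y[te])
          for tr, te in LeaveOneGroupOut().split(X, y, groups)]
print("UI:", np.mean(scores))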
8 Experiment
The three modes UD, UM and UI, as explained in Sect. 7, were first trained and tested on an Intel Core i5 CPU @ 2.20 GHz, operating on Ubuntu 16.04. The proposed system was initially implemented using seven classification algorithms that were identified from [14–16] – Extremely Randomized Trees (Extra Trees), Random Forests, Gradient Boosting, Bagging, Decision Trees, Naive Bayes and Ridge Classifier. Upon further analysis, the seven classifiers were found to run on a low-cost Raspberry Pi Zero, and the accuracies and time taken for a gesture sample to be classified are catalogued for the three modes across both datasets, as shown in Table 5. From this set of seven classifiers, the Extra Trees (ET), Gradient Boosting (GB) and Ridge Classifier (RC) were chosen, based on the inferences from Sect. 9.1. Each of the three modes is evaluated using the three classifiers individually, and the efficiencies are noted and analyzed for both datasets Du and DS in Sects. 9.2 and 9.3, respectively.

Table 5. Average accuracies (Acc) for the datasets Du and DS and time taken (in seconds) for classification of a single gesture sample across all modes; Green: Highest efficiencies in all modes, Yellow: Least classification times taken for a gesture sample in all modes

uWave (Du)
Classifier \ Mode     UD Acc   UD Time   UM Acc   UM Time   UI Acc   UI Time
Extra Trees           97.76    0.6287    97.85    0.6873    82.49    0.6853
Random Forest         97.41    0.6991    95.45    0.73      77.91    0.6995
Gradient Boosting     93.75    0.0072    94.38    0.0076    75.64    0.0078
Bagging               92.74    0.1526    94.19    0.1527    76.64    0.1528
Decision Trees        89.55    0.0015    84.11    0.0053    66.73    0.0015
Naive Bayes           91.16    0.0273    71.96    0.0292    64.66    0.0273
Ridge Classifier      97.5     0.0013    83.84    0.0013    74.64    0.0013

Sony (DS)
Classifier \ Mode     UD Acc   UD Time   UM Acc   UM Time   UI Acc   UI Time
Extra Trees           95.88    0.6038    98.63    0.6538    75.1     0.669
Random Forest         95.25    0.6887    97.13    0.7041    70.13    0.686
Gradient Boosting     90.5     0.0055    95.5     0.0057    66.41    0.0057
Bagging               93.5     0.1673    93.37    0.1527    62.44    0.1529
Decision Trees        84.13    0.0014    86.25    0.0051    50.9     0.0015
Naive Bayes           91.38    0.0118    65.5     0.0116    54.35    0.0117
Ridge Classifier      94.13    0.0013    77.75    0.0014    61.59    0.0013

(UD = User Dependent, UM = Mixed User, UI = User Independent.)
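A hedged sketch of this classifier sweep is shown below using scikit-learn's implementations of the seven algorithms; hyper-parameters are left at library defaults and the data is synthetic, so the numbers it prints will not reproduce Table 5.

import time
import numpy as np
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = np.random.randn(1000, 33), np.random.randint(0, 8, 1000)
X_test, y_test = np.random.randn(400, 33), np.random.randint(0, 8, 400)

models = {"Extra Trees": ExtraTreesClassifier(), "Random Forest": RandomForestClassifier(),
          "Gradient Boosting": GradientBoostingClassifier(), "Bagging": BaggingClassifier(),
          "Decision Trees": DecisionTreeClassifier(), "Naive Bayes": GaussianNB(),
          "Ridge Classifier": RidgeClassifier()}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    start = time.perf_counter()
    model.predict(X_test[:1])                  # classify one gesture sample
    per_sample = time.perf_counter() - start
    print(f"{name}: acc={acc:.2%}, time per sample={per_sample:.4f}s")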
9 Results

9.1 Evaluation of Classifiers
The classifiers chosen in Sect. 8 were further analyzed for each dataset based on their computational characteristics. It can be inferred that ET yields the highest accuracy in all modes for both datasets Du and DS among the three chosen classifiers. Apart from being more accurate than Random Forest, ET is also computationally less expensive, as stated in [17]. It can also be observed that the time taken for classification of a gesture sample is the least for RC. GB, whilst yielding a significantly lower classification time than ET, provides reasonable accuracies and hence was chosen. Based on the users' preferences, a trade-off can be made between efficiency and classification time of a gesture sample among the three classifiers. To evaluate the performance of this model in all modes across both datasets, evaluation metrics – cross-validation scores and confusion matrices – were taken for the results obtained from the ET classifier, as it has the highest efficiency and an acceptable classification time.

9.2 uWave Dataset Analysis
In the UD mode on Du , each user’s data was divided into a randomized 75%–25% train-test split across all 8 gestures. The average recognition rate achieved over all 8 users was found to be 97.76%, using ET . The individual users’ efficiencies for the same can be observed in Table 6.
Table 6. Accuracies of 8 users for both datasets Du and DS in user dependent mode (UD) using extra trees classifier

Data \ User   1       2      3       4       5       6       7       8      Avg
Du            96.42   93.5   98.57   97.14   99.28   99.28   97.88   100    97.76
DS            98      92     94      93      98      100     95      97     95.88
The classification model implemented using UM mode on Du with a randomized 75%–25% train-test split provides an accuracy of 97.85%, and its confusion matrix is shown in Table 7.

Table 7. Confusion matrix for UM mode on Du; Brown: Gesture signed, Yellow: Gesture classified, Green: Correct classifications, Red: Incorrect classifications

Signed \ Classified   1       2       3       4       5       6      7      8
1                     99.29   0.71    0       0       0       0      0      0
2                     0.71    99.29   0       0       0       0      0      0
3                     0.71    0       97.86   1.43    0       0      0      0
4                     0       0       2.14    97.86   0       0      0      0
5                     0       0       0       0       97.14   2.86   0      0
6                     0       0       0       0       0       100    0      0
7                     1.43    1.43    0       0       0       0      95     3.57
8                     0       0.71    0       0       0       0      2.86   96.43
It was seen that on applying UI mode to Du, an average efficiency (average Leave-One-User-Out cross-validation score) of 82.49% was observed, while [4] achieved 75.4%. The confusion matrix for the last user as test, trained upon the first seven users, which yields an accuracy of 92.14%, is illustrated in Table 8.

Table 8. Confusion matrix for UI mode on Du; Brown: Gesture signed, Yellow: Gesture classified, Green: Correct classifications, Red: Incorrect classifications

Signed \ Classified   1       2      3       4       5     6       7       8
1                     95.71   4.29   0       0       0     0       0       0
2                     0       100    0       0       0     0       0       0
3                     0       0      82.86   17.14   0     0       0       0
4                     0       0      0       100     0     0       0       0
5                     0       0      0       0       100   0       0       0
6                     0       4.29   0       0       0     95.71   0       0
7                     0       1.43   0       0       0     0       70      28.57
8                     0       1.43   0       0       0     0       5.71    92.86
Table 9 shows that RC has the least classification time along with an accuracy of 97.5%, which is only marginally less than ET, thereby effectively capturing the similarities between gesture samples signed by the same user, i.e., in UD mode. GB provided accuracies and computational times which are intermediate between ET and RC across all three modes, thereby giving the user a choice to reduce classification time by a significant margin while retaining a reasonable accuracy.
Table 9. Accuracies (Acc) and time taken (in seconds) for classification of a single gesture sample for Du using the 3 selected classifiers across all modes

Classifier \ Mode   UD Acc   UD Time   UM Acc   UM Time   UI Acc   UI Time
ET                  97.76    0.6287    97.85    0.6873    82.49    0.6853
GB                  93.75    0.0072    94.38    0.0076    75.64    0.0078
RC                  97.5     0.0013    83.84    0.0013    74.64    0.0013

9.3 Sony Dataset Analysis
For all eight participants, the average accuracy for DS in UD mode was observed to be 95.88% using ET, where each user's data was divided into a randomized 75%–25% train-test split, as done for Du. The efficiencies across all individual users can be observed in Table 6. In the UM mode, the proposed model with a randomized 75%–25% train-test split on DS yields an efficiency of 98.625%. Table 2 shows the confusion matrix for the UM mode across all 20 gestures. An average 8-fold cross-validation score (average recognition rate across all users) of 75.093% was observed in the UI mode of DS. The confusion matrix trained upon the first seven users with the last user as test, which yields an accuracy of 91.75%, is shown in Table 4.

Figure 3 shows the behavior of all users as test (Leave-One-User-Out cross-validation scores) in the UI mode on DS. The large number of misclassifications for the second and seventh users indicates that both users did not perform the gestures in a manner similar to the rest, thereby reducing the overall efficiency across all classifiers.

Table 10. Accuracies (Acc) and time taken (in seconds) for classification of a single gesture sample for DS using the 3 selected classifiers across all modes

Classifier \ Mode   UD Acc   UD Time   UM Acc   UM Time   UI Acc   UI Time
ET                  95.88    0.6038    98.63    0.6538    75.1     0.669
GB                  90.5     0.0055    95.5     0.0057    66.41    0.0057
RC                  94.13    0.0013    77.75    0.0014    61.59    0.0013
Table 11. Best accuracies (Acc) and least times taken (in seconds) for classification of a single gesture sample across all modes for both datasets Du and DS

Dataset \ Mode   UD Acc (ET)   UD Time (RC)   UM Acc (ET)   UM Time (RC)   UI Acc (ET)   UI Time (RC)
Du               97.76         0.0013         97.85         0.0013         82.49         0.0013
DS               95.88         0.0013         98.63         0.0014         75.1          0.0013
Fig. 3. 8-fold cross-validation scores for all users in DS .
Owing to the generic FeatureSet (F S) implemented, the results observed in Table 10 for DS are analogous to the results observed in Du for all three chosen classifiers across all the three modes.
10 Conclusion
A Gesture Recognition system with the capability to operate in any of the three modes – User Dependent, Mixed User and User Independent, with a set of generic features was designed, validated and tested in this paper. The proposed system was tested on two public accelerometer-based gesture datasets – uWave and Sony. Table 11 showcases the best classifier for each category - Efficiency and Classification Time for a gesture sample, across all three modes of both the datasets. As can be seen in Table 11, Extremely Randomized Trees was observed to perform the best in terms of accuracy, while Ridge Classifier provided the least classification time which was around 500 times faster than ET for a gesture sample, irrespective of the mode or dataset. The end-users are given the flexibility to choose any combination of modes and classifiers according to their requirements. This system was implemented on a Raspberry Pi Zero priced at 5 USD making it a low-cost alternative. Acknowledgement. The authors would like to thank Solarillion Foundation for its support and funding of the research work carried out.
References
1. Chen, Q., Georganas, N.D., Petriu, E.M.: Real-time vision-based hand gesture recognition using HAAR-like features. In: 2007 IEEE Instrumentation & Measurement Technology Conference (IMTC 2007), Warsaw, pp. 1–6 (2007)
2. Biswas, K.K., Basu, S.K.: Gesture recognition using Microsoft Kinect. In: The 5th International Conference on Automation, Robotics and Applications, Wellington, pp. 100–103 (2011)
3. Mitra, S., Acharya, T.: Gesture recognition: a survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 37(3), 311–324 (2007)
4. Liu, J., Wang, Z., Zhong, L., Wickramasuriya, J., Vasudevan, V.: uWave: accelerometer-based personalized gesture recognition and its applications. In: 2009 IEEE International Conference on Pervasive Computing and Communications, Galveston, TX, pp. 1–9 (2009)
5. SmartWatch Gestures Dataset, Technologies of Vision, Fondazione Bruno Kessler. https://tev.fbk.eu/technologies/smartwatch-gestures-dataset
6. Zimmerman, T.G., Lanier, J., Blanchard, C., Bryson, S., Harvill, Y.: Hand gesture interface device. In: Proceedings of the SIGCHI/GI Conference on Human Factors in Computing Systems and Graphics Interface, pp. 189–192 (1986)
7. Hussain, S.M.A., Rashid, A.B.M.H.: User independent hand gesture recognition by accelerated DTW. In: 2012 International Conference on Informatics, Electronics & Vision (ICIEV), Dhaka, pp. 1033–1037 (2012)
8. Pylvänäinen, T.: Accelerometer based gesture recognition using continuous HMMs. In: IbPRIA, vol. 3522, pp. 639–646 (2005)
9. Helmi, N., Helmi, M.: Applying a neuro-fuzzy classifier for gesture-based control using a single wrist-mounted accelerometer. In: IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Daejeon, pp. 216–221 (2009)
10. Altun, K., Barshan, B., Tunçel, O.: Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognit. 43(10), 3605–3620 (2010)
11. Jing, L., Zhou, Y., Cheng, Z., Wang, J.: A recognition method for one-stroke finger gestures using a MEMS 3D accelerometer. IEICE Trans. 94-D(5), 1062–1072 (2011)
12. Wu, J., Pan, G., Zhang, D., Qi, G., Li, S.: Gesture recognition with a 3-D accelerometer. In: UIC, vol. 5585, pp. 25–38 (2009)
13. Ishida, H., Takahashi, T., Ide, I., Murase, H.: A Hilbert warping method for handwriting gesture recognition. Pattern Recognit. 43(8), 2799–2806 (2010)
14. Tencer, L., Reznáková, M., Cheriet, M.: Evaluation of techniques for signature classification from accelerometer and gyroscope data. In: ICDAR, pp. 1066–1070 (2015)
15. Aswolinskiy, W., Reinhart, R.F., Steil, J.J.: Impact of regularization on the model space for time series classification. In: Machine Learning Reports, pp. 49–56 (2015)
16. Yang, Y., Yu, Y.: A hand gestures recognition approach combined attribute bagging with symmetrical uncertainty. In: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, Sichuan, pp. 2551–2554 (2012)
17. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
A Prediction Survival Model Based on Support Vector Machine and Extreme Learning Machine for Colorectal Cancer

Preeti(B), Rajni Bala, and Ram Pal Singh

Department of Computer Science, Deen Dayal Upadhyaya College, University of Delhi, Delhi, India
[email protected],{rbala,rpsrana}@ddu.du.ac.in
Abstract. Colorectal cancer is the third largest cause of cancer deaths in men and the second most common in women worldwide. In this paper, a prediction model based on Support Vector Machine (SVM) and Extreme Learning Machine (ELM) combined with feature selection has been developed to estimate colorectal-cancer-specific survival after 5 years of diagnosis. Experiments have been conducted on a dataset of colorectal cancer patients publicly available from the Surveillance, Epidemiology, and End Results (SEER) program. The performance measures used to evaluate the proposed methods are classification accuracy, F-score, sensitivity, specificity, positive and negative predictive values and receiver operating characteristic (ROC) curves. The results show very good classification accuracy for 5-year survival prediction for the SVM and ELM models with an 80%–20% partition of the data and 16 features, which is very promising compared to the results of existing learning models.

Keywords: Colorectal cancer · Extreme Learning Machine (ELM) · Feature selection · Survival prediction · Surveillance, Epidemiology, and End Results (SEER) · Support Vector Machine (SVM)
1 Introduction
Cancer is a group of diseases in which cells of the body mutate irregularly and grow out of control over a short span of time. The term colorectal cancer (also known as colon cancer, rectal cancer or bowel cancer) refers to a tumor developed in parts of the large intestine (the colon or rectum), the lower end of the digestive tract. This disease is among the most common types of death-causing cancers due to the abnormal growth of cells, and it has the ability to invade or spread to other parts of the body.1 Furthermore, colorectal cancer is the third leading cause of cancer deaths in men and the second most common in women worldwide.2 Most colorectal cancers start as a polyp growth in the inner lining of the colon or rectum and grow towards the center. Most polyps are not cancerous; only certain types of polyp, called adenomas, can lead to cancer. Early diagnosis of this cancer, and its treatment by taking out the polyp while it is small, may prevent it from developing into cancer.3

The early diagnosis of this death-causing disease has become increasingly important in clinical research to improve the chances of survivability. The usage of machine learning techniques in medical diagnosis and prognosis can facilitate the clinical management of a patient and may aid physicians in the decision-making process. Since experts can make errors in some cases, machine learning based classification systems can help to minimize these errors. In this paper, we have developed a 5-year survival prediction model for colorectal cancer patients on the dataset publicly available from the Surveillance, Epidemiology, and End Results (SEER4) program. In this study, SVM and ELM classifiers, along with grid search and a symmetrical uncertainty based feature selection technique, have been used to classify the patients into survival and non-survival classes. The measures used to show the performance of the model are classification accuracy, F-score, sensitivity, specificity, positive and negative predictive values, and the ROC curve.

The rest of the paper is organized as follows. Section 2 summarizes the methods and results of previous research on colorectal cancer survival prediction. Section 3 reviews the basic theories of SVM and ELM. Section 4 describes the proposed methods along with a description of the dataset. Experimental results are presented in Sect. 5. Finally, Sect. 6 concludes the paper and outlines future scope.

1 http://www.cancer.gov/types/colorectal.
2 Related Work
There has been a lot of research on medical diagnosis of cancer, particularly with breast cancer and lung cancer data. A number of techniques/approaches based on machine learning have been studied for survival prediction of different types of cancer. Fathy [1] used an Artificial Neural Network (ANN) model for survival prediction of colorectal cancer patients, using ANNIGMA for feature subset search. Burke et al. [2] compared two different algorithms, the TNM staging system and ANN, for 5- and 10-year survival prediction for breast cancer; the ANN was found to be significantly more accurate than the TNM staging system. A comparison of nine different algorithms for 5-year survival prediction of colorectal cancer patients was performed by Gao et al. [3]. Saleema et al. [4] studied the effect of sampling techniques (traditional, stratified, balanced stratified) in classifying prognosis variables (survival, metastasis, stage) and used them to build classification models for breast cancer, respiratory cancer and a mixed cancer dataset (both). Saleema et al. identified prognosis labels for cancer prediction that include patient survival, number of primaries and age at diagnosis, by performing a study on breast and colorectal cancer patients [5]. It has been found that the lymph node (LN) ratio is an important prognostic factor for overall survival, cancer-specific survival and disease-specific survival for colon cancer patients [6]. Swanson et al. have also identified lymph nodes as the prognostic factor in the 5-year survival analysis of T3N0 colon cancer patients [7]. A comparative study of 7 classification techniques has been performed using ensemble voting with SMOTE pre-processing to provide 1-, 2- and 5-year survival prediction for a colon cancer dataset [8]. A study on colon cancer patients using two different versions of staging, namely the AJCC fifth and sixth editions, has been performed for 5-year survival prediction [9]; it was found that survival prediction using AJCC sixth edition staging is more distinct than that using the fifth edition. Lundin et al. evaluated the accuracy of ANNs in predicting 5-, 10- and 15-year breast-cancer-specific survival [10]. Wang et al. performed a study on conditional survival for rectal cancer patients [11]. Endo et al. compared seven different classification algorithms, namely logistic regression, ANN, Naïve Bayes, Bayes Net, Decision Tree with Naïve Bayes, ID3 and J48, to predict 5-year breast cancer survival and found logistic regression to be more accurate than the others [12]. Fradkin et al. performed a study on SEER data for 8-month survivability of lung cancer patients [13].

It is evident from the study of related work on medical survival prediction models that colorectal cancer has not been experimented with as much as other types of cancer. In this paper, two different machine learning algorithms are proposed for survival prediction of colorectal cancer patients.

2 http://www.cancer.org/research/cancerfactsstatistics/.
3 http://www.cancer.gov/research/progress/snapshots/colorectal.
4 www.seer.cancer.gov.
3 Brief Theories About SVM and ELM

3.1 Support Vector Machine
SVM is a supervised learning algorithm used for classification of input vectors into different classes. The basic principle followed by SVM is to find a maximum-margin hyperplane that separates the sets of training examples of different outcomes. The hyperplane that maximizes the distance to the closest examples is the optimal hyperplane, and it forms the decision surface for classification. SVM is a kernel-based technique that uses a kernel function to map the input vectors to a high-dimensional space [14]. The mapping function in SVM can either be linear or non-linear. The data points closest to the optimal hyperplane are known as support vectors, and the distance between these vectors and the decision surface is called the margin. Consider the problem of separating a set of training vectors belonging to two linearly separable classes, (x_i, y_i), where x_i \in \mathbb{R}^n, y_i \in \{+1, -1\} and i = 1, 2, \ldots, n. The formulation of the SVM with kernel function K(x, y) and regularization parameter C is given by

\min_{\alpha} \left[ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right]    (1)

s.t. \quad \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 < \alpha_i < C, \quad i = 1, 2, \ldots, N

where C is a user-defined positive constant. In this study, the RBF kernel has been used, which is defined as

K(x_i, x_j) = \exp\{ -\| x_i - x_j \|^2 / (2\sigma^2) \}    (2)

The optimal values of the parameters C and σ used in the experiments are chosen by a grid-search approach.
The optimal value of parameter C and σ used in experiments is chosen by grid-search approach. 3.2
Extreme Learning Machine
Extreme learning machine (ELM) is a learning algorithm for training single layer feed-forward neural networks (SLFNs). SLFNs have been using gradient descent-based methods for training purpose. Gradient descent is a popular back-propagation learning algorithm used to train neural networks. It tunes the parameters of network iteratively. Gradient learning algorithm for parameter optimization suffers majorly from three problems: (1) it may easily converge to local minima; (2) improper and iterative learning steps would lead to very slow convergence of the algorithm; (3) may suffer from overfitting and underfitting problems that leads to less generalized solution. Unlike these traditional learning algorithms, ELM trains SLFN by randomly assigning weights from input layer to hidden layer and biases. After input weights have been chosen randomly, output weights are determined analytically through minimization of the loss function. Given a training set S = {(xi , ti ) | xi = [xi1 , · · · , xin ]T ∈ IRn , ti = [ti1 , · · · , tim ]T ∈ IRm , i = 1, 2, · · · , N }. ELM is mathematically modeled as [15] ˜ N i=1
βi g(xj ) =
˜ N
βi g(wi .xj + bi ) = oj , j = 1, · · · , N
(3)
i=1
where, wi = [wi1 , · · · , win ]T is the weight vector connecting input layer and hidden layer, n is the number of features, m is number of classes, N is number ˜ is number of hidden nodes and g(x) is activation funcof training samples, N tion (even non-differentiable). The steps used by ELM for training SLFN are as follows: ˜ , input weights wi and biases bi are randomly assigned (1) For i = 1, 2, · · · , N and fixed. (2) The hidden layer output matrix H of neural networks is calculated using H(w1 , · · · , wN˜ , b1 , · · · , bN , x1 , · · · , xN ) ⎡ ⎤ g(w1 · x1 + b1 ) · · · g(wN˜ · x1 + bN˜ ) ⎢ ⎥ .. .. =⎣ ⎦ . ··· . g(w1 · xN + b1 ) · · · g(wN˜ · xN + bN˜ )
˜ N ×N
(4)
620
Preeti et al.
(3) The output weight vector β connecting hidden nodes and output nodes is calculated using β = H † T where T = [ti1 , · · · , tim ]T and H † is the Moore Penrose generalized inverse of matrix H [16]. In this study, radial basis function has been used as the activation function.
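The three ELM training steps can be condensed into the following NumPy sketch; the specific RBF parameterization and the random data are assumptions made for illustration, not the authors' code.

import numpy as np

class SimpleELM:
    def __init__(self, n_hidden=50, gamma=1.0, seed=0):
        self.n_hidden, self.gamma = n_hidden, gamma
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # RBF hidden node centered at w_i with impact factor b_i (one common ELM variant)
        d = ((X[:, None, :] - self.W[None, :, :]) ** 2).sum(-1)
        return np.exp(-self.gamma * self.b * d)

    def fit(self, X, T):
        self.W = self.rng.standard_normal((self.n_hidden, X.shape[1]))   # step (1)
        self.b = self.rng.uniform(0.5, 1.5, self.n_hidden)               # step (1)
        H = self._hidden(X)                                              # step (2)
        self.beta = np.linalg.pinv(H) @ T                                # step (3)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# one-hot targets for a binary survival problem, as in this study
X = np.random.randn(200, 16)
T = np.eye(2)[np.random.randint(0, 2, 200)]
model = SimpleELM(n_hidden=40).fit(X, T)
print(model.predict(X[:5]).argmax(axis=1))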
4 Methodology and Experiments
The survival prediction framework being followed is depicted in Fig. 1. It shows several steps followed as separated blocks to achieve the survival rates.
Fig. 1. Survival prediction system flow.
4.1 Dataset Description and Pre-processing
The data used for this study is of cancer patients from the Surveillance, Epidemiology, and End Results (SEER) Program Research Data (2000–2012), National Cancer Institute. Since a few of the given features of the dataset have missing values, the dataset cannot be used directly for building a survival prediction model. The pre-processing steps performed are as follows:
(1) Select all the records that lie in the period of interest (2004–2007).
(2) Remove all the features that are either completely blank (not applicable, not available or not recorded) or have a constant value for all cases.
(3) Remove those attributes which provide unique identification to each patient record (e.g. PatientId, RegistryId).
(4) There are several derived attributes (e.g. year of birth, age, age recode) representing no new information. Since keeping one of these attributes is sufficient, the others are removed.
(5) There are a number of attributes that provide recoding of previously defined attributes (site recode, state-country recode and so on). So, only the recoded (nominal) attributes are kept and the others are removed.
(6) The dataset consists of two types of attributes: numeric and categorical. A MATLAB function is written to convert the selected categorical text attributes to nominal codes (1, 2, ...). For example, Primary site (C18–C20, C26).
(7) The 'Survival Months' feature, used as the classification variable, is recoded as 'survivability', a binary attribute in which the value 1 depicts survival for greater than or equal to 60 months and 0 represents survival of less than 60 months (a short illustrative sketch of steps (1) and (7) follows Table 1).

After performing the study on the SEER data and applying these pre-processing steps to the colorectal cancer dataset, 30 features were left. These 30 features are described in Table 1. The complete analysis has been done using this set of features and 10000 patient samples from the 2004 to 2007 period.

Table 1. Features used in this study with details

Feature no.   Name of feature                     Description
F1            Marital Status                      Represents marital status of the patient at the time of diagnosis
F2            Race                                Code to depict race
F3            Sex                                 Gender of the patient
F4            Age at diagnosis                    Age of the patient at the time of diagnosis
F5            Sequence Number                     Number and sequence of all primary tumors reported over the lifetime
F6            Primary Site                        Site where primary tumor originated
F7            Grade                               I, II, III, IV
F8            Diagnostic Confirmation             Best method used to confirm the presence of tumor being reported
F9            Type of Reporting Source            Identify source document for confirmation
F10           Regional nodes Positive             Exact number of regional lymph nodes examined by the pathologist found to contain metastases
F11           Regional nodes Examined             Total no of regional lymph nodes examined
F12           CS Tumor size                       Define tumor size in a unit
F13           CS Extension                        Extension of tumor
F14           CS Lymph nodes                      Involvement of lymph nodes
F15           CS Mets at DX                       Information on distant metastasis
F16           Derived AJCC-6T                     Represents T component
F17           Derived AJCC-6N                     Represents N component
F18           Derived AJCC-6M                     Represents M component
F19           Derived AJCC-6 stage group          Represents Stage group component
F20           Derived SS2000                      SEER summary stage 2000
F21           RX Summ-Surg prim site              Surgery of primary site
F22           RX Summ-Scope Reg LN Sur            Scope of regional lymph nodes surgery
F23           Number of primaries                 Represents actual number of tumors
F24           First malignant primary indicator   Yes or no
F25           Summary stage                       Simplified version of stage: in situ, localized, regional, distant, & unknown
F26           Behavior Recode                     Recoded based on Behavior ICD-O-3
F27           Histology Recode                    Recoded based on histologic type ICD-O-3
F28           SEER historic stage A               Extent of disease
F29           Vital Status                        Status of the patient
F30           Age recode                          Recoding of the patient's age
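The sketch below illustrates steps (1) and (7) with pandas; the column names are assumptions, since SEER field names vary between extracts, and the authors' processing was done in MATLAB.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Year of diagnosis": [2003, 2004, 2006, 2007, 2008],
                   "Survival months":   [70,   12,   65,   59,   80]})

df = df[df["Year of diagnosis"].between(2004, 2007)]                  # step (1)
df["survivability"] = np.where(df["Survival months"] >= 60, 1, 0)     # step (7)
print(df)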
4.2 Data Denoising
The SEER data of colorectal cancer is used for this experimental study. We excluded the data before 2004 and after 2007 because of two reasons. First, there are several attributes included in data from 2004 period, so those attributes were not supported by data in 2000–2003 period. Second is to perform study for 5-year survival of patients and 2007 is kept as upper edge to data. Since the follow-up cutoff date for this dataset is December 31, 2012. After preprocessing, a total of 47, 731 samples were obtained. Experiments have been performed on 10000 samples obtained by randomly selecting 2500 records from each year (2004–2007) of truncated data. The class distribution for survived class is 4628 (45.35%) and that for not survived is 5372 (54.65%) which seems to be balanced. The resulting set of features has values in different ranges. Therefore, feature scaling method is used to scale them up in the range of [0, 1] using MATLAB user defined function. 4.3
4.3 Feature Selection
Feature selection is an important issue in building a classification system. It is advantageous to limit the number of input features of a classifier in order to obtain a model that is both predictive and less computationally intensive. In the area of medical diagnosis, a small feature subset also means lower test and diagnostic costs. Two classes of feature selection methods are commonly used, namely filter and wrapper methods. Here, Symmetrical Uncertainty, a correlation-based feature selection method, has been used [17]. This method ranks the features according to their correlation with the classification variable. The value of symmetrical uncertainty for each feature is calculated using (5).

Symmetrical Uncertainty Coefficient = 2.0 × Information Gain / (H(X) + H(Y))    (5)
where H(X) and H(Y) represent the entropy of an input feature and of the classification variable, respectively. The entropy H(X) of each feature is defined in (6). The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the information gain, defined in (7).

H(X) = − Σ_{x∈X} p(x) log2 p(x)    (6)

Information Gain = H(X) − H(X | Y)    (7)
where H(X | Y) is the entropy of variable X after observing the classification variable Y, given by (8).

H(X | Y) = − Σ_{i,j} p(x_i, y_j) log2 ( p(x_i, y_j) / p(y_j) )    (8)
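A minimal NumPy sketch of Eqs. (5)-(8) for nominal (already discretized) features is shown below; it is an illustration rather than the authors' MATLAB implementation.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (base 2) of a discrete variable, Eq. (6)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU = 2 * IG(X; Y) / (H(X) + H(Y)), Eqs. (5), (7), (8)."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy([f"{a}|{b}" for a, b in zip(x, y)])   # joint entropy H(X, Y)
    info_gain = h_x - (h_xy - h_y)                       # H(X) - H(X | Y)
    return 2.0 * info_gain / (h_x + h_y) if (h_x + h_y) > 0 else 0.0
```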
Table 2. Relative importance of features based on symmetrical uncertainty

S.No. | Symmetrical uncertainty (SU) | Attribute no.
1 | 0.5950 | F29
2 | 0.1201 | F15
3 | 0.1144 | F18
4 | 0.0872 | F25
5 | 0.0842 | F28
6 | 0.0693 | F20
7 | 0.0672 | F19
8 | 0.0450 | F17
9 | 0.0359 | F16
10 | 0.0351 | F10
11 | 0.0319 | F14
12 | 0.0261 | F13
13 | 0.0197 | F4
14 | 0.0146 | F7
15 | 0.0125 | F1
16 | 0.0109 | F12
17 | 0.0081 | F5
18 | 0.0049 | F24
19 | 0.0037 | F21
20 | 0.0035 | F27
21 | 0.0026 | F2
22 | 0.0024 | F26
23 | 0.0023 | F11
24 | 0.0022 | F6
25 | 0.0008 | F23
26 | 0.0006 | F9
27 | 0.0005 | F22
28 | 0.0005 | F3
29 | 0.0001 | F8
The value of the symmetrical uncertainty coefficient (SUC) lies in the range [0, 1]. The two edge values, 0 and 1, depict a completely independent feature and one variable completely predicting the other, respectively. The symmetrical uncertainty coefficient is computed between each input feature and the classification variable, and the features with SUC > 0.01 are selected for further processing. Table 2 lists the selected attributes along with their SUC.

4.4 Setting Up the Model Parameters
In order to achieve better classification accuracy, certain parameters of the classification algorithm need to be tuned. The selection of the kernel function and its appropriate parameters greatly improves the classification accuracy for SVM as well as ELM. Based on the size of the dataset and preliminary experiments, the RBF kernel is found to be best suited for building a prediction system with both classifiers. As discussed in Sect. 3, the SVM technique with an RBF kernel has mainly two user-defined parameters, C and σ, so in this study a grid search is performed to find the best pair of these two parameters: different combinations of (C, σ) are used to build a model and the pair yielding the best results is chosen to train the final model. For the ELM technique, the radial basis function is selected as the activation function, and a grid search is performed to find the best pair of activation function parameter and number of hidden neurons. To generalize the steps followed, the dataset D is divided into training and test sets using an 80–20 ratio.
(1) For the SVM algorithm, a grid search over (C, σ) is performed with C = {2^-1, 2^0, ..., 2^4} and σ = {2^-4, 2^-3, ..., 2^3, 2^4}; for ELM, the grid search covers number of hidden neurons = {10, 20, ..., 490, 500} and activation function parameter = {2^-6, 2^-5, ..., 2^0, 2^1}. (2) Choose the set of parameters that gives the best classification rate for each of the two algorithms, SVM and ELM. (3) Use the best parameters to build the final model on the training dataset.
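For the SVM part of this search, a hedged scikit-learn sketch is given below; it is not the authors' MATLAB implementation, the 5-fold cross-validation used to score candidate pairs is an assumption, and the synthetic arrays stand in for the real training split. ELM is not available in scikit-learn, so the analogous loop over hidden-neuron counts and activation parameters is omitted.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the 80% training split of the SEER features.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((800, 9)), rng.integers(0, 2, 800)

# scikit-learn's RBF kernel is parameterized by gamma; with gamma = 1 / (2 * sigma**2)
# the sigma grid {2^-4, ..., 2^4} translates into the gamma values below.
sigmas = 2.0 ** np.arange(-4, 5)
param_grid = {"C": 2.0 ** np.arange(-1, 5), "gamma": 1.0 / (2.0 * sigmas ** 2)}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
best_svm = search.best_estimator_   # (C, sigma) pair used to train the final model
```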
4.5 Modelling with Feature Selection
To create a prediction model, the data pre-processing and denoising steps described above are followed first. Thereafter, the SUC of every attribute is computed and the attributes are ranked in descending order. A subset of features is obtained by taking the top N SUC values from the ordered list, where N ranges from 2 to 15. Using each subset of features, a grid search is carried out to find the optimized parameter values for the SVM and ELM classifiers. The obtained parameters are used to build two different classification models, one for SVM and one for ELM. Each model is trained on the training set and then used to predict the labels of the test set. This process is repeated by adding the attribute with the next SUC in the ordered list (a minimal sketch of this incremental loop is given below).
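A compact sketch of the incremental loop is shown below; the data is synthetic, the columns are assumed to be pre-sorted by SUC rank, and the fixed (C, σ) values merely stand in for the per-subset grid search described above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.random((800, 16)), rng.integers(0, 2, 800)   # placeholder split
X_test, y_test = rng.random((200, 16)), rng.integers(0, 2, 200)

accuracies = []
for n in range(2, 17):                  # models #1..#15 use the top 2..16 ranked features
    cols = list(range(n))               # columns assumed ordered by decreasing SUC
    clf = SVC(kernel="rbf", C=2.0, gamma=2.0 ** -4)   # stand-in for tuned parameters
    clf.fit(X_train[:, cols], y_train)
    accuracies.append((n, clf.score(X_test[:, cols], y_test)))
```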
4.6 Measures of Performance Evaluation
The performance measure used for the proposed methods is the test accuracy on the 80%–20% partition of the dataset, where test accuracy represents the fraction of samples in the test set classified correctly by the obtained model. Other measures include sensitivity, specificity, positive predictive value, negative predictive value, and ROC curves. ROC curves [18] are a pictorial representation of sensitivity on the Y-axis against 1-specificity on the X-axis. The area under the ROC curve (AUC) is used as a measure of the classifier's performance.
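These measures follow directly from the confusion matrix of the test predictions; a small sketch (using scikit-learn only for the confusion matrix and AUC) is given below as an illustration.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluation_summary(y_true, y_pred, y_score):
    """Sensitivity, specificity, PPV, NPV and AUC for a binary survivability model."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "auc": roc_auc_score(y_true, y_score),
    }
```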
5 Results and Discussion
The experiments have been conducted on SEER data of colorectal cancer patients with 10,000 records. The obtained dataset is divided into an 80%–20% partition, that is, 8000 records are used for training and 2000 for testing. The correlation of features with the classification variable depicted in Table 2 is computed by applying the SUC method on the training set. Further, grid search is used to find the optimal parameters for SVM and ELM. On the basis of the ranking, the feature with the next SUC is added to the set of attributes and the resulting dataset is used to train and test the algorithms. Features are selected up to a threshold of 0.01, so features with SUC less than 0.01 are not used for modelling. This incremental approach yields 15 different models for each of the two classifiers. Tables 3 and 4 depict the classification accuracies obtained with the two classifiers.
Table 3. Classification accuracies for SVM learning algorithm

Model No. | Subset of the Attributes | Test Accuracy on 80%-20% Partition | Training Time (s) | C, σ
#1 | F29,F15 | 90.94 | 0.4588 | 2, 2^-4
#2 | F29,F15,F18 | 90.99 | 0.5645 | 2, 2^-4
#3 | F29,F15,F18,F25 | 91.25 | 0.8849 | 1, 2^-3
#4 | F29,F15,F18,F25,F28 | 90.80 | 0.9437 | 2, 2^-4
#5 | F29,F15,F18,F25,F28,F20 | 91.25 | 1.1464 | 8, 2^-4
#6 | F29,F15,F18,F25,F28,F20,F19 | 91.00 | 1.0033 | 16, 2^-4
#7 | F29,F15,F18,F25,F28,F20,F19,F17 | 91.00 | 1.4741 | 4, 2^-3
#8 | F29,F15,F18,F25,F28,F20,F19,F17,F16 | 91.8 | 1.6265 | 2, 2^-4
#9 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10 | 91.05 | 1.6214 | 8, 2^-2
#10 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14 | 91.25 | 1.7483 | 16, 2^1
#11 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13 | 91.5 | 2.1636 | 4, 2^-3
#12 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4 | 90.65 | 2.0329 | 4, 2^-1
#13 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4,F7 | 90.85 | 2.1461 | 8, 2^-2
#14 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4,F7,F1 | 91.45 | 2.3187 | 2, 2^-4
#15 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4,F7,F1,F12 | 90.6 | 4.6501 | 2, 2^1
For the SVM classifier, the highest 5-year survival prediction accuracy, 91.8%, is achieved with model #8, built using 9 features. With the ELM classifier, model #6 outperforms the other models with 92.20% classification accuracy. Since ELM uses random initialization of weights and biases, each ELM configuration was run 10 times with the optimal parameters; the best and average test accuracies over the 10 runs are shown in Table 4.
Table 4. Classification accuracies for ELM algorithm

Model No. | Subset of the Attributes | Test Accuracy on 80%-20% partition (best, average) | Training Time (s) | No of Hidden Neurons | Activation Fn Param
#1 | F29,F15 | 90.05, 90.05 | 0.0594 | 20 | 2^-6
#2 | F29,F15,F18 | 90.70, 90.70 | 0.0391 | 20 | 2^-6
#3 | F29,F15,F18,F25 | 90.60, 90.42 | 0.0078 | 10 | 2^-6
#4 | F29,F15,F18,F25,F28 | 91.10, 87.16 | 0.0047 | 10 | 2^-5
#5 | F29,F15,F18,F25,F28,F20 | 90.55, 89.47 | 0.0125 | 10 | 2^-1
#6 | F29,F15,F18,F25,F28,F20,F19 | 92.80, 92.74 | 0.5265 | 100 | 2^-4
#7 | F29,F15,F18,F25,F28,F20,F19,F17 | 91.40, 91.3 | 0.6469 | 110 | 2^-1
#8 | F29,F15,F18,F25,F28,F20,F19,F17,F16 | 91.10, 90.14 | 0.0156 | 10 | 2^-2
#9 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10 | 90.7, 90.6 | 0.075 | 30 | 2^-4
#10 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14 | 90.8, 89.27 | 0.0156 | 10 | 2^-1
#11 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13 | 91.30, 91.00 | 0.0812 | 30 | 2^-5
#12 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4 | 91.55, 91 | 0.7109 | 130 | 2^-5
#13 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4,F7 | 91.2, 91.1 | 0.0828 | 30 | 2^-4
#14 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4,F7,F1 | 91.95, 91.79 | 1.9172 | 250 | 2^-6
#15 | F29,F15,F18,F25,F28,F20,F19,F17,F16,F10,F14,F13,F4,F7,F1,F12 | 92.15, 92.09 | 1.7078 | 190 | 2^-4
Table 5 compares the proposed methods with previous work on 5-year survival prediction for colorectal cancer. Gao et al. used SVM with BSFS on 20 variables to compute 5-year survival prediction [3]. The results obtained using the proposed methods are better than those of the related work. It can also be observed that, among the two classifiers used in this study, the classification accuracy obtained with ELM is comparable to that of the SVM classifier.
Table 5. Comparison of classification accuracies obtained using our method and other classifiers

Reference | Method | Result
Fathy et al. [1] | ANN | 84.73%
Fathy et al. [1] | ANN + ANNIGMA | 86.5%
Gao et al. [3] | SVM + BSFS | AUC = 81.34%
Our method | SVM + Feature Selection | 91.8%
Our method | ELM + Feature Selection | 92.20%
Table 6. Sensitivity, specificity, positive predictive value & negative predictive value for the best models

Measure | Support Vector Machine (SVM) | ELM
Sensitivity (recall) | 86.28 | 88.38
Specificity | 97.15 | 97.13
Positive predictive value | 97.46 | 96.79
Negative predictive value | 84.89 | 89.51
F1 score | 0.915 | 0.924
However, the training time needed to build the model for the ELM classifier is 0.0156 s, which is much less than the time taken by the SVM classifier. Table 6 presents the values of the different performance measures for models #8 and #6 obtained using SVM and ELM, respectively. The ROC curve for model #8 using the SVM classifier is shown in Fig. 2; the area under the ROC curve is computed to measure classifier performance, and the obtained AUC for the SVM classifier is 0.91307 for the 80%–20% training-test partition of the dataset. The ROC curve for model #6 with the ELM classifier is shown in Fig. 3; the corresponding AUC is 0.941 for the same partition.
Fig. 2. ROC curve for best model obtained using SVM algorithm.
Fig. 3. ROC curve for best model obtained using ELM algorithm.
6 Conclusion and Future Work
In this paper, a survival prediction system based on SVM and ELM with grid search and feature selection is proposed to predict colorectal cancer patients' 5-year survival from the date of diagnosis. We used a feature selection technique to construct N models with different subsets of features using SUC. The experiments have been performed on the SEER dataset, which provides data for different types of cancer. The proposed ELM-based method for model #6 provides promising results, with the highest classification accuracy of 92.20% on the 80–20% training-test partition, sensitivity of 88.38%, specificity of 97.13%, F1-score of 0.924 and AUC of 0.941. The corresponding results for the SVM-based classifier (model #8) are a classification accuracy of 91.80% on the 80–20% training-test partition, sensitivity of 86.28%, specificity of 97.15%, F1-score of 0.915 and AUC of 0.91037. The results are very promising and outperform the existing methods, which indicates the possibility of their use by physicians to support decisions on their cases. Future work includes the exploration of new datasets and new classification schemes to stabilize these accuracies on larger datasets. Further, we plan to carry out similar studies on other types of cancer as well.
References 1. Fathy, S.K.: A predication survival model for colorectal cancer. In: Proceedings of the 2011 American Conference on Applied Mathematics and the 5th WSEAS International Conference on Computer Engineering and Applications, ser. AMERICANMATH’11/CEA’11, pp. 36–42. World Scientific and Engineering Academy and Society (WSEAS), Stevens Point (2011) 2. Burke, H.B., Goodman, P.H., Rosen, D.B., Henson, D.E., Weinstein, J.N., Harrell, F.E., Marks, J.R., Winchester, D.P., Bostwick, D.G.: Artificial neural networks improve the accuracy of cancer survival prediction. Cancer 79(4), 857–862 (1997)
3. Gao, P., Zhou, X., Wang, Z.-N., Song, Y.-X., Tong, L.-L., Xu, Y.-Y., Yue, Z.-Y., Xu, H.-M.: Which is a more accurate predictor in colorectal survival analysis? Nine data mining algorithms vs. the TNM staging system. PLOS ONE 7(7), 1–8 (2012) 4. Saleema, J.S., Bhagawathi, N., Monica, S., Shenoy, P.D., Venugopal, K.R., Patnaik, L.M.: Cancer prognosis prediction using balanced stratified sampling. CoRR, vol. abs/1403.2950 (2014) 5. Saleema, J.S., Sairam, B., Naveen, S.D., Yuvaraj, K., Patnaik, L.M.: Prominent label identification and multi-label classification for cancer prognosis prediction. In: TENCON 2012 IEEE Region 10 Conference, pp. 1–6 (2012) 6. Berger, A.C., Sigurdson, E.R., LeVoyer, T., Hanlon, A., Mayer, R.J., Macdonald, J.S., Catalano, P.J., Haller, D.G.: Colon cancer survival is associated with decreasing ratio of metastatic to examined lymph nodes. J. Clin. Oncol. 23(34), 8706–8712 (2005) 7. Swanson, R.S., Compton, C.C., Stewart, A.K., Bland, K.I.: The prognosis of T3N0 colon cancer is dependent on the number of lymph nodes examined. Ann. Surg. Oncol. 10, 65–71 (2003) 8. Al-Bahrani, R., Agrawal, A., Choudhary, A.: Colon cancer survival prediction using ensemble data mining on SEER data. In: IEEE International Conference on Big Data, pp. 9–16 (2013) 9. O’Connell, J.B., Maggard, M.A., Ko, C.Y.: Colon cancer survival rates with the new American joint committee on cancer sixth edition staging. JNCI J. Natl. Cancer Inst. 96(19), 14–20 (2004) 10. Lundin, M., Lundin, J., Burke, H., Toikkanen, S., Pylkk¨ anen, L., Joensuu, H.: Artificial neural networks applied to survival prediction in breast cancer. Oncology 57(4), 281–286 (1999) 11. Emery, R., Wang, S.J., Fuller, C.D., Thomas, C.R.: Conditional survival in rectal cancer: a seer database analysis. Gastrointest. Cancer Res. (GCR) 1(3), 84–89 (2007) 12. Endo, T.S.A., Tanaka, H.: Comparison of seven algorithms to predict breast cancer survival( contribution to 21 century intelligent technologies and bioinformatics). Biomed. Fuzzy Hum. Sci. Off. J. Biomed. Fuzzy Syst. Assoc. 13(2), 11–16 (2008) 13. Fradkin, D., Schneider, D., Muchnik, I.: Machine learning methods in the analysis of lung cancer survival data. DIMACS Technical report 2005–35 (2006) 14. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press (2000) 15. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1), 489–501 (2006) 16. Rao, C.R., Mitra, S.K.: Generalized inverse of matrices and its applications (1971) 17. Hall, M.A.: Correlation-based feature selection for machine learning (1999) 18. Bradley, A.P.: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)
Sentiment Classification of Customer’s Reviews About Automobiles in Roman Urdu Moin Khan and Kamran Malik(&) Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
[email protected],
[email protected] Abstract. Text mining is a broad field having sentiment mining as its important constituent in which we try to deduce the behavior of people towards a specific item, merchandise, politics, sports, social media comments, review sites, etc. Out of many issues in sentiment mining, analysis and classification, one major issue is that the reviews and comments can be in different languages, like English, Arabic, Urdu, etc. Handling each language according to its rules is a difficult task. A lot of research work has been done in English Language for sentiment analysis and classification but limited sentiment analysis work is being carried out on other regional languages, like Arabic, Urdu and Hindi. In this paper, Waikato Environment for Knowledge Analysis (WEKA) is used as a platform to execute different classification models for text classification of Roman Urdu text. Reviews dataset has been scrapped from different automobiles’ sites. These extracted Roman Urdu reviews, containing 1000 positive and 1000 negative reviews are then saved in WEKA attribute-relation file format (ARFF) as labeled examples. Training is done on 80% of this data and rest of it is used for testing purpose which is done using different models and results are analyzed in each case. The results show that Multinomial Naïve Bayes outperformed Bagging, Deep Neural Network, Decision Tree, Random Forest, AdaBoost, k-NN and SVM Classifiers in terms of more accuracy, precision, recall and F-measure. Keywords: Sentiment analysis Classification Automobiles WEKA Roman urdu
Customer reviews
1 Introduction With the increase in computer usage and advancements in internet technology, people are now using their computers, laptops, smart-phones and tablets to access web and establishing their social networks, doing online businesses, e-commerce, e-surveys etc. They are now openly sharing their reviews, suggestions, comments and feedback about a particular thing, product, commodity, political affair and other viral news. Most of these are shared publicly and can be easily accessed from the web. Out of all those opinions, classifying the number of positive and negative opinions is a difficult task [1]. If you are planning to buy a particular product or choosing some institution, it’s really difficult without any prior feedback regarding it. Similarly for the producers or service © Springer Nature Switzerland AG 2019 K. Arai et al. (Eds.): FICC 2018, AISC 887, pp. 630–640, 2019. https://doi.org/10.1007/978-3-030-03405-4_44
providers, it’s a difficult ask for them to alter their SOPs without any review about their products or services from the customers. They can ask their customers to provide feedback via an e-survey, social media page or hand-written reviews. There will be so many opinions and reviews but their categorization as positive and negative is difficult. Some machine learning should be done to overcome this situation and to take betterment decisions later on. The most important aspect in opinion mining is the sentiment judgment of the customer after extracting and analyzing their feedback. Growing availability of opinion-rich resources like online blogs, social media, review sites have raised new opportunities and challenges [2]. People can now easily access the publicly shared feedbacks and reviews which help in their decision making. A lot of issues are involved in opinion mining. A major one is the handling of dual sense of the words, i.e. some words can depict a positive sense in a particular situation and a negative sense in the other. For example, the review: “the outer body of this car is stiff” shows positivity and hence the word stiff comes here in positive sense. Now consider another review “the steering wheel of this car is stiff” shows negativity and hence the word stiff interpreted here in negative sense. Another issue is the understanding of sarcastic sentences e.g. “Why is it acceptable for you to be a bad driver but not acceptable for me to point out”. One more issue is faced in the sentences which have both positive and negative meaning in them e.g. consider the sentences: “The only good thing about this car is its sporty look” and “Difficult roads often lead to beautiful destinations”. Another commonly faced issue in opinion mining is the analysis of opinions, reviews and feedbacks shared on social media sites, blogs and review sites, which lack context and are often difficult to comprehend and categorize due to their briefness. Also they are mostly shared in the native language of the users; therefore to tackle each language according to its orientation is a challenging task [3]. So far a lot of research work has been done on sentiment analysis in English language but limited work has been done for other languages being spoken around the globe. Urdu language, evolved in the medieval ages, is an Indo-Aryan language written in Arabic script and now had approx. 104 million speakers around the globe [4]. Urdu can also be written in the Roman script but this representation does not have any specific standard for the correctness (correct spelling) of a word i.e. a same word can be written in different ways and with different spellings by different or even by the same person. Moreover, no one to one mapping between Urdu letters for vowel sounds and the Roman letters exist [5]. This research paper aims to mine the polarity of the public reviews specifically related to the automobiles and are written in Roman Urdu extracted from different automobiles review sites. The collected reviews dataset is used to train the machine using different classification models and then to assign the polarity of new reviews by using these trained classification models.
2 Related Work Kaur et al. in 2014 [6] proposed a hybrid technique to classify Punjabi text. N-gram technique was used in combination with Naïve Bayesian in which the extracted features of N-gram model were supplied to Naïve Bayesian as training dataset; testing data was then supplied to test the accuracy of the model. The results showed that the accuracy of this model was better as compared to the existing methods. Roman Urdu Opinion Mining System (RUOMiS) was proposed by Daud et al. in 2015, in which they suggested to find the polarity of a review using natural language processing technique. In this research, a dictionary was manually designed in order to make comparisons with the adjectives appearing in the reviews to find their polarity. Though the recall of relevant results was 100% but RUOMiS categorized about 21.1% falsely and precision was 27.1%. Jebaseeli and Kirubakaran [7] proposed M-Learning system for the prediction of opinions as positive, negative and neutral using three classification algorithms namely Naïve Bayes, KNN and random forest for opinion mining. The efficiency of the stated algorithms was then analyzed using training dataset having 300 opinions equally split i.e. 100 opinions for each classification class (positive, negative and neutral). SVD technique was used in the preprocessing step to remove the commonly and rarely occurring words. 80% of the opinions were supplied as training dataset and rest was used for testing purpose. Highest accuracy was achieved by random forest classifier at around 60%. Khushboo et al. presented opinion mining using the counting based approach in which positive and negative words were counted and then compared for the English language. Naïve Bayesian algorithm was used in this study. It is suggested that if the dictionary of positive and negative words is good, then really good results are returned. In order to increase the accuracy, change can be made in terms of parameters which are supplied to Naïve Bayesian algorithm. Opinion mining in Chinese language using machine learning methods was done by Zhang et al. [8] using SVM, Naïve Bayes Multinomial and Decision Tree classifiers. Labeled corpus was trained using these classifiers and specific classification functions were learnt. The dataset comprises of Amazon China (Amazon CN) reviews. The best and satisfied results were achieved using SVM with string kernel. A grammatical based model was developed by Syed et al. [9] which focused on grammatical structure of sentences as well as morphological structure of words; grammatical structures were extracted on basis of two of its substituent types, adjective phrases and nominal phrases. Further assignment was done by naming adjective phrases as Senti-Units and nominal phrases as their targets. A striking accuracy of 82.5% was achieved using shallow parsing and dependency parsing methods. Pang et al. suggested the method to classify the documents on overall sentiment and not on the topic to determine the polarity of the review i.e. positive or negative. Movie reviews were supplied as dataset for training and testing purposes. It was found out that standard machine learning techniques performed pretty well and surpasses the human produced baselines. However, traditional topic-based categorization didn’t produce good results using Naïve Bayes, maximum entropy classification and support vector machines.
Classification of opinions, using the sentiment analysis methodologies, posted on the web forum in Arabic and English languages was proposed by Abbasi et al. [10]. In this study, it was proposed that in order to handle the linguistic characteristics of Arabic language, certain feature extraction components should be used and integrated which returned very good accuracy of 93.62%. However, the limitation of this system was the domain specification as only the sentiments related to hate and extremist groups’ forums were classified by this system; because vocabulary of hate and extremist words is limited, so it isn’t difficult to determine the polarity i.e. positive and negative words in the opinion.
3 Methodology

The model proposed in this paper is divided into four main steps. First of all, reviews written in Roman Urdu were scraped from different automobile sites [11, 12], as only automobile reviews are targeted in this study. A training dataset was created and documented in text files using these scraped reviews, containing 800 positive and 800 negative reviews. The dataset is converted into the native format of WEKA, the Attribute-Relation File Format (ARFF), and the converted training dataset is loaded in the WEKA explorer mode to train the machine. Different models are developed by applying different classifiers, and the results are analyzed and compared. The methodology comprises the steps shown in Fig. 1.
3.1 Material
The dataset used comprises 2000 automobile reviews in Roman Urdu with an equal number of positive and negative reviews. 1600 example points are used for training the machine and the remaining 400 are used for testing the accuracy of the models trained via the different classifiers. This training sentiment corpus was labeled prior to training the classification models.
3.2 Data Preprocessing
The purpose of this step is to ensure that only relevant features get selected from the dataset. Before forwarding the data to model training and classification, the following steps were performed.

3.2.1 Data Extraction
The extraction task includes the scraping of reviews from the automobile sites. Users freely post their comments and reviews on these sites, which are mostly multilingual, for example, "Honda cars ka AC bohut acha hota hai", "imported cars k spare parts mehengy milty hn", etc. This is because English has a great influence on the Urdu-speaking community, and also because most automobile-related terminology is used as-is in other local languages, including Urdu. That is why most of the Roman Urdu reviews contain words of both the English and Urdu languages.
Fig. 1. Proposed model.
3.2.2 Stop-Words Removal
Words which are non-semantic in nature are termed stop-words and usually include prepositions, articles, conjunctions and pronouns. As they hold very little or no information about the sentiment of the review, they are removed from the data [13, 14]. A list of Urdu stop-words was taken [15] and converted to Roman Urdu script.

3.2.3 Lower-Case Words
All word tokens are converted to lower case before they are added to the corpus in order to shift all words to the same format, so that prediction becomes easier. The reason for this conversion is to follow the same standard for all tokens/words in the dataset.

3.2.4 Development of Corpus
All the extracted reviews and comments are stored in a text file which includes 1000 positive and 1000 negative reviews. In this study, 800 positive and 800 negative reviews are used as the training dataset and the remaining 400 reviews (200 positive and 200 negative) are used as the testing dataset.
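A minimal Python sketch of the lower-casing and stop-word removal steps is shown below; the stop-word set is a tiny illustrative sample, not the converted list of [15].

```python
# Tiny illustrative sample of Roman Urdu stop-words (the real list comes from [15]).
roman_urdu_stopwords = {"ka", "ki", "ke", "hai", "hain", "aur", "se"}

def preprocess(review: str) -> str:
    """Lower-case a review and drop non-semantic stop-words."""
    tokens = review.lower().split()
    return " ".join(t for t in tokens if t not in roman_urdu_stopwords)

print(preprocess("Honda cars ka AC bohut acha hota hai"))
# -> "honda cars ac bohut acha hota"
```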
3.2.5 Conversion of Data into ARFF
Attribute-Relation File Format (ARFF) is an ASCII text input file format [16] of WEKA. These files have two important sections, i.e. header information and data information. The dataset text files were converted to ARFF format by using the TextDirectoryLoader command in the Simple CLI mode of WEKA; the elements of the text files are saved as strings with their relevant class labels. For example:

> java weka.core.converters.TextDirectoryLoader filename.txt > filename.arff
3.3 Classification
Before doing the classification part, all the string attributes are converted into set of attributes, depending on the word tokenizer, to represent word occurrence [17] using StringToWordVector filter. The set of attributes is determined by the training data. Sentiment classification can be binary or multi-class sentiment classification. Binary classification has two polarities, e.g. good or bad, positive or neutral etc. In this paper, binary classification is done using machine learning algorithms. As training dataset is required for supervised machine learning algorithms, having feature vectors and their respective class labels, and a testing dataset [18]. A set of rules can be learnt by the classifiers from the training corpus prior to the testing process using the trained models. Multinomial Naïve Bayes, Bagging, Deep Neural Network, Decision Tree, Random Forest, AdaBoost, k-NN and SVM Classifiers are used to learn and classify the Roman Urdu dataset. In WEKA, Classify tab is used to run different classifiers on the dataset. In this study, machine was trained using six classifiers and corresponding six models were built. These generated models are then used to predict about the testing data’s polarity (positive or negative). 3.3.1 Deep Neural Network Deep neural networks are Multi-Layer Perceptron Network. A number of neurons are connected to other neuron with the help of hidden layers. DNNs are also known as function approximator, such as Fourier or Taylor. With enough layers of non-linear function approximator the DNNs can approximate any function. DNNs use non-linear function such as logistic, tan-hyperbolic etc. to compute the big fig function same with the principle of Fourier Series where sin and cos functions are used to determine the function. The coefficients in the DNN represent the same purpose such the coefficients of the Fourier series. As they have multiple layers in between input and output layers, complex non-linear relationships can be modeled using them. With deep learning using DNNs can be utilized to achieve best solutions to natural language processing, speech and image recognition problems. 3.3.2 Decision Tree Classifier Decision Tree builds classification models in form of a tree like structure. The dataset is broken down into smaller chunks and gradually the corresponding decision tree is incrementally developed. The final outcome of this process depicts a tree with leaf
nodes or decision nodes. It shows that if a specific sequence of events, outcomes or consequences is occurred, then which decision node has the maximum likelihood to occur and what class will be assigned to that sequence. The core logic behind decision trees is the ID3 algorithm. ID3 further uses Entropy and Information Gain to construct a DT. 3.3.3 Bagging Classifier Bootstrap Aggregating Classifier is also known as the Bagging Classifier. This is a boosting and ensemble algorithm which combines the output of weak learners Decision Tree in most cases. It is known to have commonly used to avoid overfitting by reducing the variance. It creates m bootstrap samples which are used by classifiers and then the output is made by voting of the classifier. 3.3.4 Random Forests Random Forests is also an ensemble algorithm which combines the output of decision trees. For regression it takes the mean of the output while for classification it goes for majority voting. It is remedy of decision tree’s very common problem i.e. overfitting. It selects many bootstrap samples consisting of random features and thus is prone to lesser overfitting. 3.3.5 Multinomial Naïve Bayes Classifier If discrete features like word counts for text classification are present, then the multinomial naïve bayes is very suitable. Integer feature count is normally required by the multinomial distribution. However, in practice, fractional counts such as tf-idf may also work. 3.3.6 K-NN K-Nearest Neighbor simply stores all available scenarios and then classifies new unseen scenarios using the similarity measure e.g. distance functions like Euclidean, Manhattan, Minkowski, etc. It is a lazy learning technique learning technique as computation is deferred and approximation is done locally until classification. It can be used both for classification and regression purposes. In WEKA, it can be used by the name IBk algorithm. 3.3.7 AdaBoost Algorithm Ada Boost is an adaptive boosting algorithm which is used to combine the output of weak learner algorithms. It works by tempering the misdiagnosed subsequent weak learners over previously misclassified records. This algorithm is sensitive to noise relatively is less prone to the overfitting. It works by calculating the weighted out of each learner in order to calculate the final output. Moreover, it only works on features that play predictive role and reduces the dimensionality of the data. It usually utilizes Decision tree as the weak learner. Decision tree with its own parameters can be set as a weak learner. The learning rate is compromised with the number of estimator trees.
3.3.8 SVM
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. Given labeled training data, the SVM algorithm outputs an optimal boundary which classifies new cases. It uses the kernel trick to transform the data and then finds an optimal boundary between the possible outputs based on that transformation.
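As a hedged illustration of the overall workflow (word-vector features plus a Multinomial Naïve Bayes classifier), the following Python/scikit-learn sketch mirrors what is done here with WEKA's StringToWordVector filter and classifiers; it is not the authors' WEKA setup, and the tiny in-line reviews are placeholders for the real corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Placeholder reviews and 1/0 polarity labels standing in for the 1600/400 split.
train_reviews = ["gaari ki speed bohut achi", "engine awaz karta aur kharab"]
train_labels = [1, 0]
test_reviews, test_labels = ["speed bohut achi"], [1]

model = make_pipeline(CountVectorizer(), MultinomialNB())   # word counts + MNB
model.fit(train_reviews, train_labels)
print(accuracy_score(test_labels, model.predict(test_reviews)))
```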
3.4 Testing of Models
The testing dataset, comprising 200 positive and 200 negative reviews, was supplied to the models trained on the training dataset. The same preprocessing steps were performed on the testing dataset and an ARFF file was obtained after executing them. It is then supplied to WEKA by selecting the Supplied test set option in the Classify tab and providing the test data ARFF file. After the file was loaded, each trained model was re-evaluated by right-clicking on it and selecting the Re-evaluate model on current test set option. The results were noted down for all the classifiers.
3.5 Results Analysis and Comparison
The classification results of all the classifiers used to classify the 400 new example points (testing dataset), together with their accuracies, are shown in Table 1. The results of the testing process are analyzed and a comparison is made among the classifiers in order to identify the classifier which performed best in the testing process and classified the testing reviews most accurately. The analysis is performed using the standard evaluation measures, i.e. precision, recall and F-measure (Table 2). On the basis of these evaluations, the classifiers are compared to declare the best one among all.

Table 1. Accuracies of classifiers

Classifier | Total testing reviews | Correctly classified | Incorrectly classified | Accuracy (%)
Deep Neural Network | 400 | 328 | 72 | 82
Decision Tree | 400 | 303 | 97 | 75.75
Bagging | 400 | 338 | 62 | 84.5
Random Forests | 400 | 315 | 85 | 78.75
Multinomial Naïve Bayes | 400 | 359 | 41 | 89.75
k-NN | 400 | 288 | 112 | 72
AdaBoost | 400 | 335 | 65 | 83.75
SVM | 400 | 306 | 94 | 76.5
Table 2. Summarized results of evaluation measures

Classifier | Precision | Recall | F-Measure
Deep Neural Network | 0.88 | 0.92 | 0.9
Decision Tree | 0.83 | 0.9 | 0.86
Bagging | 0.89 | 0.94 | 0.91
Random Forests | 0.85 | 0.95 | 0.9
Multinomial Naïve Bayes | 0.93 | 0.96 | 0.95
k-NN | 0.82 | 0.86 | 0.84
AdaBoost | 0.91 | 0.92 | 0.92
SVM | 0.87 | 0.87 | 0.87
4 Conclusion

In this study, we reviewed multiple text classification models to classify Roman Urdu reviews related to automobiles using the Waikato Environment for Knowledge Analysis (WEKA). The dataset contained 1000 positive and 1000 negative reviews. 80% of the review data was labeled and supplied to WEKA for machine training, and different classification models were learnt. After that, the testing data was supplied and the trained models were re-evaluated. The results showed that Multinomial Naïve Bayes performed best among all the classifiers in terms of accuracy, precision, recall and F-measure (Fig. 2). On the computation side, Multinomial Naïve Bayes has better efficiency in learning and classification than the Decision Tree classifier [19], and consequently than the classifiers which use decision trees at the back end; in our research these are the Decision Tree, Bagging, Random Forests and AdaBoost classifiers, which fall in the category of decision trees. The main reason Multinomial Naïve Bayes learns and classifies better than decision trees is that it provides a good probability estimate for the correct class, which enables it to perform the correct classification [20].
Fig. 2. Comparison of results of classifiers on testing dataset.
5 Future Work • In this study, we have targeted the Roman Urdu reviews related to automobiles; this approach can be extended and can be used for other fields as well like hotels reviews, taxi service reviews that are provided in Roman Urdu. • We can train our machine to guess the exact area of interest from the given Roman Urdu review, e.g. if a person is talking about the engine of the automobile, our system should point that out. • Many reviews are shared with neutral polarity. Currently we have done this research to handle binary sentiment classification of the reviews i.e. positive and negative. This can be extended to handle multi-class sentiment classification; in this way, we can handle reviews having neutral polarity as well.
References 1. Khushboo, T.N., Vekariya, S.K., Mishra, S.: Mining of sentence level opinion using supervised term weighted approach of Naïve Bayesian algorithm. Int. J. Comput. Technol. Appl. 3(3), 987 (2012) 2. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends® Inf. Retr. 2(1–2), 1–135 (2008) 3. Rashid, A., Anwer, N., Iqbal, M., Sher, M.: A survey paper: areas, techniques and challenges of opinion mining. IJCSI Int. J. Comput. Sci. Issues 10(2), 18–31 (2013) 4. Katsiavriades, K., Qureshi, T.: The 30 Most Spoken Languages of the World. Krysstal, London (2002) 5. Ahmed, T.: Roman to Urdu transliteration using wordlist. In: Proceedings of the Conference on Language and Technology, vol. 305, p. 309 (2009) 6. Kaur, A., Gupta, V.: N-gram based approach for opinion mining of Punjabi text. In: International Workshop on Multi-disciplinary Trends in Artificial Intelligence, pp. 81–88. Springer, Cham (2014) 7. Jebaseel, A., Kirubakaran, D.E.: M-learning sentiment analysis with data mining techniques. Int. J. Comput. Sci. Telecommun. 3(8), 45–48 (2012) 8. Zhang, C., Zuo, W., Peng, T., He, F.: Sentiment classification for chinese reviews using machine learning methods based on string kernel. In: Third International Conference on Convergence and Hybrid Information Technology, ICCIT’08, 2008, vol. 2, pp. 909–914. IEEE, November 2008 9. Syed, A.Z., Aslam, M., Martinez-Enriquez, A.M.: Associating targets with SentiUnits: a step forward in sentiment analysis of Urdu text. Artif. Intell. Rev. 41(4), 535–561 (2014) 10. Abbasi, A., Chen, H., Salem, A.: Sentiment analysis in multiple languages: feature selection for opinion classification in web forums. ACM Trans. Inf. Syst. 26(3), 12 (2008) 11. https://www.pakwheels.com 12. https://www.olx.com.pk/vehicles/ 13. Wilbur, W.J., Sirotkin, K.: The automatic identification of stop words. J. Inf. Sci. 18(1), 45– 55 (1992) 14. Fox, C.: A stop list for general text. In: ACM SIGIR forum, vol. 24, no. 1–2, pp. 19–21. ACM, September 1989 15. R. NL.: Ranks nl webmaster tools (2016). http://www.ranks.nl/stopwords/urdu 16. http://www.cs.waikato.ac.nz/ml/weka/arff.html
640
M. Khan and K. Malik
17. http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/ StringToWordVector.html 18. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 79–86. Association for Computational Linguistics, Chicago, July 2002 19. Amor, N.B., Benferhat, S., Elouedi, Z.: Naive bayes vs decision trees in intrusion detection systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 420– 424. ACM, March 2004 20. Domingos, P., Pazzani, M.:. Beyond independence: conditions for the optimality of the simple bayesian classifier. In: Proceedings of the 13th International Conference on Machine Learning, pp. 105–112, Chicago, July 1996
Hand Gesture Authentication Using Depth Camera

Jinghao Zhao and Jiro Tanaka
Graduate School of IPS, Waseda University, Kitakyushu, Fukuoka, Japan
[email protected], [email protected]
Abstract. Nowadays humans are more concerned about their privacy because the traditional text password has become weaker at defending against various attacks. Meanwhile, somatosensory devices have become popular, which makes gesture authentication possible. This research tries to use humans' dynamic hand gestures to build an authentication system, which should have few limitations and be natural. In this paper, we describe a depth camera based dynamic hand gesture authentication method and design a template updating mechanism for the system. In the case of simple gestures, the average accuracy is 91.38%, and in the case of complicated gestures, the average accuracy is 95.21%, with a 1.65% false acceptance rate. We have also evaluated the system with the template updating mechanism.
Keywords: Gesture authentication · Three-dimensional hand gesture · Depth camera
Introduction
In recent years, how to protect human’s privacy attracted the attention of people. Text password is the most frequently way that we used in our delay life, but recent research shows that such kind of password become weaker and need to be longer and more complex to be safety [1], which make it very inconvenient and unnatural. Besides, even we used such kinds of text password, it also can be easily stolen or copied by smudge attack or shoulder surfing attack [2,3]. In order to make up for the shortage of traditional text password, biometric information is used as a new kind of authentication methods. Biometric information consists of physical and behavioral characteristics. Physical characteristics are referred to the static features of human body, such as iris, fingerprints or DNA, behavioral characteristics means dynamic information from human’s behave, including gait, gesture, typing and so on. According to the source of biometric information, it also can be divided into congenital features and acquired features. Current biometric authentication technologies are mainly based on computer vision technology with RGB camera or wearable sensor such as accelerometer embedded in wristbands or smart phone, those methods have already achieved c Springer Nature Switzerland AG 2019 K. Arai et al. (Eds.): FICC 2018, AISC 887, pp. 641–654, 2019. https://doi.org/10.1007/978-3-030-03405-4_45
good results, but both still have demerits. The computer vision approach can achieve high accuracy, but it places high demands on equipment, and users often need to perform in a special environment. The wearable sensor approach is portable and usually has no environment limitation, but it loses some accuracy because of the sensor itself, and some sensors have difficulty with power charging. Hence, how to combine the good points of each method to make a more natural authentication system becomes more important.
2 Background
We introduce the basic processes in biometric authentication: first, we discuss some biometric techniques and applications; then we introduce some authentication research using hand gestures; finally, we give some research on evaluating an authentication system.
2.1 Biometric Techniques
Biometric techniques such as the iris [4] and fingerprint [5] are used in authentication; other researchers have also used the face or palm print as authentication methods.
2.2 Hand Gesture Authentication
As a piece of biometric information unique to individuals, the hand gesture has been shown by several studies to perform well in authentication systems. Liu et al. presented an accelerometer-based personalized gesture recognition method in which the acceleration data of a hand gesture is sensed by a mobile phone and Dynamic Time Warping (DTW) is used to analyze and match the input gesture. Kratz et al. [6] presented an air-auth hand gesture authentication system using the Intel Creative Senz3D, together with some improvements and assessment methods. Chahar et al. used Leap Motion and a neural network to build a dynamic hand gesture authentication system [7]. These studies show that dynamic hand gestures can achieve high accuracy in authentication.
2.3 Evaluation Methods
Jain et al. have proposed seven basic rules for an authentication system [8], which are universality, uniqueness, permanence, measurability for any biometric system and performance, acceptability, circumvention in a practical biometric system. Aumi et al. [9] discussed security performance of their system by the test of shoulder-surfing attacks and smudge attacks. Yang et al. [10] discussed how to evaluate the permanence and memorability of their research.
3 Goal and Approach
In our research, we aim to find a robust authentication method using dynamic human hand gestures. Traditional text passwords are not friendly to users, while some biometric authentication methods, such as iris and fingerprint methods, have high equipment requirements and cannot be pervasive. Therefore, we try to find a natural and acceptable authentication method for users that also has relatively high performance. To achieve our goals, we take a series of approaches. We use a new generation depth sensor, Leap Motion, which is lighter but keeps high accuracy. In order to use it with good results, we designed a series of preprocessing methods such as smoothing and normalization. Then, we came up with a feature extraction method based on clustering and filtering. Finally, we used a Hidden Markov Model (HMM) to solve the classification of identities. Instead of using static hand shape or other static biometric information, we use the dynamic hand gesture as the core element for authentication. Because a dynamic gesture contains more information than a static posture, it is harder to copy or imitate; meanwhile, it has lower equipment requirements than traditional fingerprint or iris authentication methods, so it can be used in more situations.
4 System Implementation
To implement hand gesture authentication, we designed our system using Leap Motion and a series of algorithms including data processing, feature extraction and classification.
4.1 Prototype
Leap Motion is used as our hardware platform. We built our system with following prototype in Fig. 1.
Fig. 1. System platform.
The system works in the following process: (1) First, users are asked to perform their gesture over the Leap Motion camera to register their account and "gestural password"; all registered data is stored as the user's own template. (2) Then, when a user wants to access his account, he just chooses his account and performs the registered gesture, and the system automatically returns whether the current user can pass or not. A passed verification gesture is stored as a temporary template waiting to be used to update the current template. (3) After a period of time, the accuracy of the current template drops; therefore, the system merges the temporary template with the current template to form a new template. (4) Sometimes users access their account frequently, so the system also updates the template when the number of successful accesses reaches a threshold. The flowchart in Fig. 2 shows the sequence of system operation.
Fig. 2. Process of system.
4.2 Data Preprocessing
(1) Data Acquisition: After the platform is built, we first obtain the raw data of the hand gesture. Leap Motion returns the gesture information automatically; the basic gesture information is recorded frame by frame in the format below:

frame = {frameid, timestamp, palmposition, fingertipposition, fingertipspeed, etc.}    (1)
where the position or speed part consists of data on the x, y and z axes. For each feature, we extract the data and arrange it as a time series:

f = {(x1, y1, z1), ..., (xn, yn, zn)}    (2)

Here, n is the frame length of the gesture, and each tuple (x, y, z) is the feature data in a frame. After that we obtain the raw data of each gesture, in the format that Fig. 3 shows:
Fig. 3. Raw data.
(2) Data Smoothing: Because noise remains in the raw data, we use a Kalman filter to eliminate it. The basic idea of the Kalman filter is that the current state at time t is related to the previous state at time t−1 [11,12]. In process control, suppose we have a system which can be described by the following linear stochastic difference equation,

s_t = A s_{t−1} + B U(t) + C    (3)

where A and B are system parameters, s_t is the feature data at time t, U(t) is the external control, and C is the measurement noise from the device. We can only measure the output of the device,

y_t = H s_t + N_t    (4)

where y is the output value, H is a parameter of the system, and N is the noise in reading data from the device. Obviously, the output value contains two kinds of noise: one is the measurement error C, the other is the reading error N. We use Q and R to represent the covariances of C and N.
(1) Set s_{t|t−1} = A s_{t−1} + B U(t) to estimate the current value s_{t|t−1} based on the previous optimal s_{t−1}. Since there is no external control in our system, the parameter B can be 0, and the parameters A and H in (3) and (4) can be treated as 1. (2) Set P_t = A P_{t−1} A^T + Q to calculate the covariance of the estimated s_t; here P_t is the covariance of the current state. (3) Use Kg_t = P_t H^T (H P_t H^T + R)^{−1} to calculate the Kalman gain Kg_t at the current time. (4) Use s_t = s_{t|t−1} + Kg_t (y_t − H s_{t|t−1}) to obtain the current optimal s_t from the estimated value s_{t|t−1} and the observed value y_t. (5) Finally, we update the covariance by P_t = (1 − Kg_t H) P_t. The smoothed result is shown in Fig. 4.
Fig. 4. Smoothed data.
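A minimal one-dimensional sketch of this filter (with A = H = 1 and B = 0, as stated above) is given below; the Q and R values are illustrative assumptions rather than the tuned values of the system.

```python
import numpy as np

def kalman_smooth(y, Q=1e-4, R=1e-2):
    """Scalar Kalman filtering of one noisy coordinate sequence (A = H = 1, B = 0)."""
    s = np.zeros(len(y))
    s[0], P = y[0], 1.0
    for t in range(1, len(y)):
        s_pred, P_pred = s[t - 1], P + Q           # predict from the previous state
        K = P_pred / (P_pred + R)                  # Kalman gain
        s[t] = s_pred + K * (y[t] - s_pred)        # correct with the measurement y[t]
        P = (1.0 - K) * P_pred                     # update the covariance
    return s

noisy_x = np.cumsum(np.random.randn(200)) + 0.5 * np.random.randn(200)
smooth_x = kalman_smooth(noisy_x)
```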
(3) Data Normalization: To make the authentication data easy to match against the template data, we use

nf_i^k = (f_i^k − mean(f^k)) / σ(f^k)    (5)

to normalize (x_k, y_k, z_k) of the feature data in frame k, where mean(f^k) is the mean of the smoothed feature f and σ(f^k) is the standard deviation of the smoothed sequence. After that, the smoothed data is changed into the normalized sequence

n = {(nx_1, ny_1, nz_1), ..., (nx_n, ny_n, nz_n)}    (6)
The normalized data is shown in Fig. 5.

4.3 Feature Extraction
In order to collect suitable features, a kind of cluster method is used in our system. First, we put all the features given by Leap Motion into a feature set X, including speed, track, and other information of the fingertips, palm and bones:

X = {F1, F2, ..., Fn}    (7)
Fig. 5. Normalized data.
Here, F represents one kind of feature, such as fingertip location, speed and so on. Then, we build an empty set Y to store the optimal feature combination. The feature extraction steps are: (1) Test each feature of X, find the feature Fi that performs best, put it into the optimal set Y, and remove it from X. (2) Then, for the remaining features in X, pick up one feature Fi each time, combine it with the features of Y, and test the new set. If the performance is better than with the current optimal set Y, the new set replaces Y; otherwise Y is not changed and Fi is discarded. (3) Repeat Step 2 until the feature set X becomes empty.

Table 1. Accuracy of using different features
Feature | Palm center position | Fingertip position | Fingertip speed | Wrist position
Accuracy | 84.5% | 88.4% | 86.2% | 83.2%
Feature | Metacarpal bone | Proximal bone | Intermediate bone | Distal bone
Accuracy | 84.3% | 81.6% | 84.3% | 81.2%
Table 1 shows the accuracy when only one feature is used for classification. The position of the fingertip performs best among all features, so we put it into the optimal set Y first and remove it from the feature set X. Then, for the features remaining in X, we picked one feature each time, took the union of Y and this feature, tested the accuracy and obtained the result in Fig. 6. Therefore, the best combination of two features is fingertip position and fingertip speed. Since the accuracy of a feature set with three features is not better than that of fingertip position and speed, we decided to use the combination of these two features.
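A compact sketch of this greedy forward-selection procedure is given below; evaluate is a hypothetical callback that trains and tests a classifier with the supplied feature subset and returns its accuracy.

```python
def greedy_forward_selection(features, evaluate):
    """Steps (1)-(3): grow the optimal set Y, keeping a feature only if it helps."""
    remaining = list(features)
    best = max(remaining, key=lambda f: evaluate([f]))   # step (1): best single feature
    Y, best_acc = [best], evaluate([best])
    remaining.remove(best)
    for f in remaining:                                  # steps (2)-(3)
        acc = evaluate(Y + [f])
        if acc > best_acc:
            Y, best_acc = Y + [f], acc
    return Y, best_acc
```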
4.4 Classification
We use the trajectory and speed of the fingertip as the features supplied to the classifier. (1) First, the gesture information of the k-th frame is arranged in the following format:

f^k = (p_x^k, p_y^k, p_z^k, v_x^k, v_y^k, v_z^k)    (8)

where p is the position of one fingertip and v is the speed of that fingertip.
Fig. 6. Accuracy of using two features.
(2) Then we put the sequence into the classifier and use a Hidden Markov Model to build the model for each feature. The classifier works when f^k is input: first it calculates the state transition probability for each f^k, then it uses the training set to calculate the output probability of each state. By calculating the probabilities for each state, the classifier builds a complete hidden Markov model for the gesture data.
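The paper does not name an HMM implementation; as an assumption, the sketch below uses the hmmlearn library to train one Gaussian HMM per registered user on the 6-dimensional frame vectors of Eq. (8) and to verify a probe gesture by its log-likelihood. The enrollment arrays and the decision threshold are placeholders.

```python
import numpy as np
from hmmlearn import hmm   # library choice is an assumption, not the authors' tool

# Each enrollment gesture: a (frames x 6) array of (px, py, pz, vx, vy, vz), Eq. (8).
gestures = [np.random.randn(80, 6) for _ in range(20)]   # placeholder registration data
X, lengths = np.vstack(gestures), [len(g) for g in gestures]

user_model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
user_model.fit(X, lengths)

def verify(sample, threshold=-500.0):
    """Accept if the log-likelihood under the user's HMM exceeds an illustrative threshold."""
    return user_model.score(sample) >= threshold
```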
4.5 Template Updating
We use a double-threshold method to update the gesture template: one threshold is the number of successful accesses, the other is the age of the template. In our research, the template updating mechanism works in the following sequence. (1) First, the system builds the template for each user and records the last access date and the access count. (2) If a user's accepted-access count reaches the threshold, the newly accepted gestures are attached to the existing template. (3) If the access count has not reached the threshold, but a long time has passed since the template was built or last updated, the system also merges the new gestures into the template. (4) Finally, the system updates the timestamp and access count to be used next time.
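A minimal sketch of this double-threshold logic is shown below; the two threshold values are assumptions for illustration, not the values used in the system.

```python
import time

ACCESS_THRESHOLD = 10            # successful accesses before a merge (assumed value)
AGE_THRESHOLD = 30 * 24 * 3600   # template age in seconds before a merge (assumed value)

def maybe_update_template(template, pending, accepted_count, built_at):
    """Merge accepted gestures into the template when either threshold is reached."""
    expired = time.time() - built_at > AGE_THRESHOLD
    if accepted_count >= ACCESS_THRESHOLD or expired:
        return template + pending, [], 0, time.time()   # merge, reset counters/timestamp
    return template, pending, accepted_count, built_at
```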
5 Experiments
We mainly conducted experiments on the accuracy of the original system and of the template updating mechanism.
5.1 Evaluation Standard
To evaluate an authentication system, the first thing to consider is the false acceptance rate (FAR), because performance usually depends on the ability to defend against attacks. The system should also keep good performance on the false rejection rate (FRR). The false acceptance rate is the probability that the system accepts a non-authorized person, while the false rejection rate is the probability that the system incorrectly rejects a genuine person. In an identification system, the false rejection rate is usually considered more important and should be decreased, but in a verification system, to obtain high security, the false acceptance rate should be emphasized to prevent attacks. In that case, the false rejection rate can be relatively high in order to reduce the false acceptance rate.
5.2 Accuracy of General Authentication
In this part, we invited 4 users aged 22–24; none of them had experience with this kind of system, but all knew how to use the Leap Motion. The experiment steps are as follows: (1) Each user registers a gesture and performs it 20 times; the system stores the gesture as their template. Users can see the gestures of the other users. (2) After the template is built, each user tries to pass the authentication with their own hand gesture 20 times; this step yields the false rejection rate of the system. (3) Then, each user tries to imitate the other users' gestures and attack their "accounts"; from this the false acceptance rate is recorded. (4) Finally, the total error is calculated using all data from Steps 2 and 3. We ran the experiments with simple and complicated gestures separately; Figs. 7 and 8 show examples of the two kinds of gestures in 2D. The accuracy results of the two groups are shown in Fig. 9. As Fig. 9(a) shows, for simple gestures the average accuracy is 91.38%, with a false acceptance rate of 3.62% and a false rejection rate of 6.57%. For complicated gestures, as shown in Fig. 9(b), the average accuracy is 95.21%, with a 1.65% false acceptance rate and a 4.82% false rejection rate.
5.3 Accuracy of Template Updating Mechanism
For the template updating experiments, we designed the following steps: (1) First, we invited 10 participants (aged 21–24) who were also unfamiliar with this system but had experience using the Leap Motion. (2) The 10 users were then asked to perform their gestures, which were registered as gesture passwords.
Fig. 7. Simple gesture.
Fig. 8. Complicated gesture.
(3) Then, the 10 users were asked to access their accounts immediately after registration; 5 of them used the system with the template updating mechanism (Group T) and the others (Group O) used the general system without template updating. We recorded the accuracy together with FAR and FRR. (4) After one week, two weeks, and one month, they were asked to access their accounts again as in Step 2, and we recorded the accuracy. (5) Finally, we compared these accuracy records and drew our conclusions.
Fig. 9. Accuracy result.
Fig. 10. Accuracy comparison of template updating mechanism.
The accuracy comparison for the template updating mechanism is shown in Fig. 10. Initially, the two groups have almost the same accuracy (T = 95.34%, O = 95.72%). After one week, the accuracies of the two groups are still roughly equal and have not changed much, but after two weeks and one month, the accuracy of Group O drops, while Group T keeps a relatively high accuracy. Hence, the template updating mechanism is shown to be effective in this system.
6 Related Work
Chahar et al. [7] built a Leap password system using the Leap Motion and hand gestures. They proposed an aLPhabet framework that uses static hand and finger shape data, such as length and width, combined with the time each user takes to perform a whole gesture, to verify the user's identity. They used the Levenshtein algorithm to estimate the similarity between gestures, weighted each kind of feature to rank its importance, and finally combined Naive Bayes, Neural Network, and Random Decision Forest classifiers, averaging the scores to obtain the classification result. They kept a 1% FAR and achieved an accuracy of about 81%. Compared with Chahar's work, our system obtains a higher accuracy at the cost of a somewhat higher FAR; moreover, we use dynamic hand gestures, which are more distinctive and harder to copy with tools, and we conducted attack and safety experiments to demonstrate the stability of our system. Aumi et al. [9] used dynamic hand gestures with an Intel Senz3D depth camera and used Dynamic Time Warping (DTW) as their classification method. They also ran threat experiments, such as a shoulder-surfing attack, and obtained high accuracy when a very low DTW threshold was set. In addition, they designed a template updating mechanism based on counting successful accesses. Compared with Aumi's work, we use the Leap Motion, which is lighter and easier to use, and we propose a double-threshold template updating mechanism based on both elapsed time and access count, which reduces the threat posed by false acceptances.
7 Conclusion and Future Work
7.1 Conclusion
Our research produced a high-accuracy human authentication system based on dynamic hand gestures. The general accuracy results show that our system reaches over 95% accuracy with a 1.65% false acceptance rate for complicated dynamic hand gestures, an improvement in accuracy over previous research [13]. In addition, we introduced a template updating mechanism; by using this mechanism, the system
maintains its authentication accuracy as time passes, which demonstrates the good permanence and robustness of our system. The experimental results also make it clear that complicated gestures perform better than simple gestures, for the same reason as with text passwords, while gestures remain easier to remember. Hence, we recommend using a complicated gesture, such as a personal signature, in our system. After the experiments, most participants felt that our system is easy to use and offers better security than a traditional password, which supports the usability of our method.
7.2 Future Work
In our research, every user had to perform the gesture more than 20 times to build their template, which is not convenient in practice; our next step is to find a method that reduces the repetition while keeping the template accurate and robust. Another open problem is the template updating method. In our design, we used a double-threshold mechanism based on time and template length to reduce the influence of false acceptances and improve permanence, but such methods cannot completely eliminate the threat of false acceptance. We will investigate how to improve the template updating mechanism to make the system safer.
References 1. Melicher, W., Kurilova, D., Segreti, S.M., Kalvani, P., Shay, R., Ur, B., Bauer, L., Christin, N., Cranor, L.F., Mazurek, M.L.: Usability and security of text passwords on mobile devices. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 527–539. ACM (2016) 2. Aviv, A.J., Gibson, K.L., Mossop, E., Blaze, M., Smith, J.M.: Smudge attacks on smartphone touch screens. Woot 10, 1–7 (2010) 3. Lashkari, A.H., Farmand, S., Zakaria, D., Bin, O., Saleh, D., et al.: Shoulder surfing attack in graphical password authentication. arXiv preprint arXiv:0912.0951 (2009) 4. Daugman, J.: How iris recognition works. IEEE Trans. Circuits Syst. Video Technol. 14(1), 21–30 (2004) 5. Maltoni, D., Maio, D., Jain, A., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer Science & Business Media (2009) 6. Kratz, S., Aumi, M.T.I.: Airauth: a biometric authentication system using in-air hand gestures. In: CHI 2014 Extended Abstracts on Human Factors in Computing Systems, pp. 499–502. ACM (2014) 7. Chahar, A., Yadav, S., Nigam, I., Singh, R., Vatsa, M.: A leap password based verification system. In: 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1–6. IEEE (2015) 8. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 14(1), 4–20 (2004) 9. Aumi, M.T.I., Kratz, S.: Airauth: evaluating in-air hand gestures for authentication. In: Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices & Services, pp. 309–318. ACM (2014)
10. Yang, Y., Clark, G.D., Lindqvist, J., Oulasvirta, A.: Free-form gesture authentication in the wild. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 3722–3735. ACM (2016) 11. Faragher, R.: Understanding the basis of the Kalman filter via a simple and intuitive derivation [lecture notes]. IEEE Signal Process. Mag. 29(5), 128–132 (2012) 12. Welch, G., Bishop, G.: An introduction to the Kalman filter (1995) 13. Mannini, A., Sabatini, A.M.: Machine learning methods for classifying human physical activity from on-body accelerometers. Sensors 10(2), 1154–1175 (2010)
FSL-BM: Fuzzy Supervised Learning with Binary Meta-Feature for Classification
Kamran Kowsari1(B), Nima Bari2, Roman Vichr3, and Farhad A. Goodarzi4
1 Department of Computer Science, University of Virginia, Charlottesville, VA, USA
[email protected]
2 Department of Computer Science, The George Washington University, Washington, USA
[email protected]
3 Data Mining and Surveillance and Metaknowledge Discovery, Fairfax, USA
[email protected]
4 Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, USA
[email protected]
Abstract. This paper introduces a novel real-time Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) algorithm for big data classification tasks. The study of real-time algorithms addresses several major concerns, namely accuracy, memory consumption, the ability to relax assumptions, and time complexity. Attaining a fast computational model that provides fuzzy logic and supervised learning is one of the main challenges in machine learning. In this paper, we present FSL-BM as an efficient solution for supervised learning with fuzzy logic processing over a binary meta-feature representation, using the Hamming distance and a hash function to relax assumptions. While many studies over the last decade have focused on reducing time complexity and increasing accuracy, the novel contribution of the proposed solution is the integration of the Hamming distance, hash functions, binary meta-features, and binary classification into a real-time supervised method. The Hash Table (HT) component gives fast access to existing indices and allows new indices to be generated in constant time, so the method matches or outperforms existing fuzzy supervised algorithms. In summary, the main contribution of this technique for real-time fuzzy supervised learning is to represent the hypothesis through binary input as a meta-feature space and to create a fuzzy supervised hash table to train and validate the model.
Keywords: Fuzzy logic · Supervised learning · Binary feature · Learning algorithms · Big data · Classification task
1 Introduction and Related Works
Big Data analytics has become feasible thanks to recent developments in powerful hardware, software, and algorithms; however, these algorithms still need to be fast and reliable [1]. Real-time processing, relaxing assumptions, and accuracy still remain key challenges. Big Data fuzzy supervised learning has been the main focus of recent research efforts [2]. Many algorithms have been developed in the supervised learning domain, such as Support Vector Machines (SVM) and Neural Networks. Deep learning techniques such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Neural Networks (DNN), and Neural Networks (NN) are inefficient for fuzzy classification tasks in a binary feature space [3,4], although deep learning can be very efficient for multi-class classification tasks [5]. In fuzzy deep neural networks, the last layer of the network (the output layer) is activated by a Boolean-style output such as a sigmoid function, and their limitation lies in their inability to produce reliable results for all possible outcomes. Time complexity, memory consumption, the accuracy of learning algorithms, and feature selection remain four critical challenges for classifier algorithms. The key contribution of this study is a solution that addresses all four critical factors in a single robust and reliable algorithm while retaining linear processing time. The history of machine learning shows significant development, particularly in the area of Supervised Learning (SL) applications [6]. Many supervised and semi-supervised learning algorithms were developed with Boolean logic rather than fuzzy logic, and therefore these existing methods cannot cover all possible variations of results. Our approach offers an effective Fuzzy Supervised Learning (FSL) algorithm with linear time complexity. Some researchers have approached fuzzy clustering by using more supervised than unsupervised methods. Work done in 2006 and 2017 [7,8] provided new algorithms with fuzzy logic implemented in the Support Vector Machine (SVM), introducing a new fuzzy membership function for nonlinear classification. In the last two decades, many research groups have focused on neural networks using fuzzy logic [9] or neuro-fuzzy systems [10] with several hidden layers. In 1992, Lin and his group worked on the Fuzzy Neural Network (FNN); their contribution is based on the back-propagation algorithm and a real-time learning structure [11]. Our work focuses on a mathematical model of binary learning with the Hamming distance applied to supervised learning. Between 1979 and 1981, NASA1 developed the Binary Golay Code (BGC) as an error-correction technique based on the Hamming distance [12,13]. The goal of these research projects, dating back to 1969, was error correction using the Golay Code for communication between spacecraft and Earth. Computer scientists and electrical engineers have also used fuzzy logic techniques for Gilbert burst-error correction over radio communication [14,15]. BGC utilizes 24 bits; however,
1 The National Aeronautics and Space Administration.
a perfected version of the Golay Code algorithm works in linear time complexity using 23 bits [16,17]. The algorithm implemented in this research study was inspired by the Golay Code clustering hash table [17–20]. This research offers two main differences and improvements: (i) it works with n features, whereas the Golay code is limited to 23 bits; and (ii) our method uses supervised learning, while the Golay Code approach is an unsupervised algorithm that is essentially a fuzzy clustering method. The Golay code generates a hash table with six indices for fuzzily labelling Binary Features (BF), whereas FSL-BM is a supervised learning technique that encodes and decodes data points into two labels, or into fuzzy classifications, by using probability or similarity. Between 2014 and 2015, several studies addressed using the Golay Code Transformation Hash Table (GCTHT) to construct a 23-bit meta-knowledge template for Big Data discovery, which allows meta-feature extraction for clustering structured and unstructured data (text-based and multimedia) [19,21]. In 2015, according to [18], the FuzzyFind Dictionary (FFD) was generated using GCTHT, and its accuracy improved from 86.47% (GCTHT) to 98.2% [18]. In this research, our meta-features and feature selection are similar to our previous work based on Golay Code clustering, but we now introduce a new algorithm that supports more than 23 features. Furthermore, existing supervised learning algorithms are challenged to provide proper and accurate labeling [22] for unstructured data. Nowadays, most large-volume datasets available to researchers and developers contain data points belonging to more than a single label or target value. Due to their time complexity and memory consumption, existing fuzzy clustering algorithms such as genetic fuzzy learning [23] and fuzzy C-means [24] are not very applicable to Big Data. Therefore, a new method of fuzzy supervised learning is needed to process, cluster, and assign labels to unlabeled data with faster time complexity, less memory consumption, and higher accuracy on unstructured datasets. In short, the new contributions and unique features of the algorithm proposed in this paper are an efficient technique for fuzzy learning, linear time complexity, and powerful prediction owing to its robustness. The baselines of this paper are the Fuzzy Support Vector Machine (FSVM) [25] and the original Support Vector Machine (SVM). The paper is organized as follows: Sect. 2: Fuzzy Logic for Machine Learning; Sect. 3: Pre-Processing, including Sect. 3.1: Meta-Knowledge and Sect. 3.2: Meta-Feature Selection; Sect. 4: Supervised Learning, including Sect. 4.1: Pipeline of Supervised Learning by Hamming Distance and how we train our model; Sect. 5: Evaluation of the Model; and finally Sect. 6: Experimental Results.
2 Fuzzy Logic for Machine Learning
Fuzzy logic methods in machine learning are becoming more popular among researchers [26,27] in comparison with Boolean and traditional methods. The main difference introduced by fuzziness in clustering and classification, for both
supervised and unsupervised learning, is that each data point can belong to more than one cluster. Fuzzy logic, in our case, is extended to handle the concept of partial truth, where the truth value may range between completely true [1] and completely false [0]. We claim that such an approach is well suited to the proposed binary-stream meta-knowledge representation of data [28,29], which leads to meta-features. We therefore apply fuzzy logic as a comparative notion of truth (or of finding the truth) without needing to fully represent the syntax, semantics, axiomatization, and truth-preserving deduction, while still reaching a degree of completeness [30]. We extend many-valued logic [31–34], based on the paradigm of inference under vagueness, where the truth value may range between completely true (correct outcome, correct label assignment) and false (false outcome, opposite label assignment); at the same time, the proposed method handles partial truth, where the label assignment can be either of {1, 0}. Through an optimization process of discovering meta-knowledge and determining meta-features, we offer a binary output representation as input to a supervised machine learning process that is capable of scaling. Each unique data point is assigned a binary meta-feature representation, which is then converted into hash keys that uniquely represent the meta-features present in the record. In the next step, the hash function looks up the supervised hash table to assign an outcome, that is, the correct label. Fuzziness is introduced through the hash function's selection of multiple (fuzzy) hash representations [32–34]. This fuzziness compensates for the inaccuracy in determining a meta-feature and its representation in the binary data stream. As we represent these meta-features as a binary choice of {1, 0}, we provide the binary classification outcome as {1, 0} through the designation of labels [32–34]. There must be some number of meta-features (n = i) such that a record with n meta-features counts as "result m" whilst a record with n + 1 or n − 1 does not. Therefore, there must be some point where the defined and predicted output (outcome) ceases. Let ∃n(…n…) assert that some number n satisfies the condition …n…. We can then represent the sequence of reasoning as follows:

$$Fa_1 \sim \forall n, \tag{1}$$
$$Fa_n\, \exists (Fa_n \sim Fa_{n+1})\, Fa_i, \tag{2}$$
where i can be arbitrarily large. If we paraphrase the above expressions using the Hamming Distance (HD), there must be a number of meta-features (n = i) such that a record with n meta-features counts as result m while a record with (n + HD) or (n − HD) does not. Whether the argument is taken to proceed by addition or by subtraction [35,36] depends entirely on how one views the series of meta-features [37]. This is the key foundation of our approach: it provides the background to apply and evaluate many-valued truth logics against the standard two-valued (meta-)logic, where truth and falsity, i.e., yes and no, are represented within the channeled stream of data. The system optimizes (through supervised) training on the selection of meta-features to turn fuzziness of logic into logical certainty; thus we are combining the optimization
learning methods through statistical learning (meta-features) and logical (fuzzy) learning to provide the most efficient machine learning mechanism [31–34].
Fig. 1. How fuzzy supervised learning works on fuzzy datasets. Here w indicates the degree of fuzziness; for example, w = [0.2, 0.8] means that the data point belongs 20% to label −1 and 80% to label +1.
Figure 1 indicates how fuzzy logic works in supervised learning for two classes. The red circles are assigned only to label −1 and the blue stars belong only to label +1, but the diamond shapes do not have a specific color, meaning their color lies between blue and red. If we have k classes or categories, a data point can belong to all k categories:

$$W = [w_1, w_2, \ldots, w_k] \tag{3}$$
$$\sum_{i=0}^{k} w_i = 1 \tag{4}$$

where k is the number of categories, W is the label vector of a data point, and $w_i$ is the percentage of membership in class i.
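Equations (3) and (4) simply say that each data point carries one membership weight per class and that the weights form a distribution. A short Python illustration (not from the paper) is:

```python
import numpy as np

def fuzzy_label(raw_weights):
    """Normalize per-class membership weights so they sum to 1 (Eq. 4)."""
    w = np.asarray(raw_weights, dtype=float)
    return w / w.sum()

# A point that is 20% label -1 and 80% label +1, as in Fig. 1.
W = fuzzy_label([0.2, 0.8])
```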
3 Pre-Processing
Because of the hash function, the order of the features is critical for this learning technique, since it defines the feature space [18,20,38]. Therefore, we use a feature selection process that consists of meta-feature collection, meta-feature learning, and meta-feature selection. The n features that build the meta-knowledge template offer unique added value in that they provide clusters of interrelated data objects in fast, linear time. The meta-knowledge template is a pre-processing construct in which each feature can be assigned either yes or no, as in binary logic. In other words, given a template F = f1, f2, f3, ..., fn, each f is a single feature representing one bit along the n-bit string. It should be noted that developing meta-knowledge is associated with
the quality methodology of ontology engineering. An ontology provides a common language for a specific domain while specifying definitions of, and relationships among, terms. It is also important to note that the meta-knowledge template is by no means constructed randomly; this opportunity appears to be unique and unprecedented. In the following sections, we explain the process of building the meta-knowledge based on the specific feature selections that define the meta-knowledge questions.
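Since every meta-feature in the template F = f1, ..., fn is a yes/no answer, a record can be packed into a single n-bit integer that later serves as the hash key. The Python sketch below is an illustrative encoding, not the authors' code; the fixed bit order corresponds to the critical feature order mentioned above.

```python
def encode_meta_features(answers):
    """Pack a list of n binary meta-feature answers (0/1) into one integer key.

    The bit order must stay fixed, because it defines the feature space
    that the hash function operates on.
    """
    key = 0
    for bit in answers:
        key = (key << 1) | (1 if bit else 0)
    return key

# A 6-feature template F = f1..f6 answered (1, 0, 1, 1, 0, 0) -> key 0b101100 = 44.
key = encode_meta_features([1, 0, 1, 1, 0, 0])
```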
3.1 Meta-Knowledge
Meta-knowledge can be defined as knowledge extracted from the feature representation, or as perfect feature extraction and pre-selected knowledge from unstructured data [39–41]. Meta-knowledge, or perfect feature extraction, allows a deeper study of the features for the purpose of more precise knowledge. It can be utilized in any application or program to obtain more insightful results based on advanced analysis of the data points. In the early 1960s, researchers were challenged to find a solution for large, domain-specific bodies of knowledge [42]. Collecting and utilizing knowledge from these large data repositories has been a major challenge, and meta-knowledge systems have been developed to overcome this issue; how to represent this knowledge has remained an open research question. Our meta-knowledge template approach with n features can therefore make processing large data sets significantly easier and faster.
Algorithm 1. Generating the list of meta-features
1: for i = 1 to 2^f − 1 do
2:   for j = 1 to $\sum_{i=0}^{h} \binom{f}{i}$ do
3:     ▷ statistical meta-feature determination
4:     if $(U_c, U_j) = \sum_{k=1}^{f} Err_k \le \min(e)$ then    ▷ statistical error
5:       if $\eta_i$ is Null then
6:         $\eta_{i,j} \leftarrow \eta_{i,new}$    ▷ create meta-feature (domain knowledge)
7:       else
8:         $\eta_{i,j} \leftarrow \eta_{i,j} + \Psi$    ▷ add tested meta-feature
9:       end if
10:    end if
11:  end for
12: end for
3.2 Meta-Learning
According to [43], meta-learning is a very effective technique for supporting data mining. In regression and classification problems, Meta-Feature (MF) and
meta-learning algorithms have been used in data mining and machine learning applications. It is important to note that the results obtained from data mining and machine learning are directly linked to the success of a well-developed meta-learning model. In this research, we define meta-learning as the process that helps the selected features use the right machine learning algorithms to build the meta-knowledge. Combining machine learning algorithms with pattern recognition allows us to study the correlations among meta-features and to select the features that are most indicative of the result or goal.
4 Supervised Learning
There are generally three popular learning settings in the machine learning community: supervised learning, unsupervised learning, and semi-supervised learning. Unsupervised learning, or data clustering, creates labels for unlabeled data points, using methods such as the Golay Code, K-means, and weighted unsupervised learning [44–46].
Fig. 2. How FSL-BM is generated. From left to right: binary input is extracted from unstructured big data; then meta-features or meta-knowledge are generated; and finally, the fuzzy hash table is created for use in supervised learning.
In supervised learning, typically more than 80% of the data points are used for training and the rest are used for testing or evaluating the algorithm, as with the Support Vector Machine (SVM) and Neural Networks. Semi-supervised learning uses labels generated by supervised learning on part of the data in order to label the remaining data points [47–49]; it is a combination of supervised and unsupervised learning. Overall, the contribution of this paper is shown in Fig. 2, which concludes with meta-feature learning in the pre-processing step, after which the input features are ready for the learning algorithm described below.
4.1 Pipeline of Supervised Learning Using Hamming Distance
In the pipeline of this algorithm (Fig. 3), all possible combinations of the input binary features are created; the algorithm improves the training matrix by using the Hamming distance and ultimately improves the results by meta-feature selection
and meta-knowledge discovery. As shown in Fig. 1, the algorithm is divided into two main parts: (i) the training algorithm, which entails feature selection, Hamming-distance detection, and updating of the training matrix; and (ii) the testing part, in which the hash function, built over the meta-features in their critical order, converts the test input into indices, where each index carries one or more labels.
Fig. 3. Pipeline of generating supervised training hash table using hamming distance.
An explicit enumeration of all possible data points is not feasible. The Supervised Hash Table (SHT) is a hash table with 2^f elements, where f is the number of binary features and the indices range over {0, ..., 2^f − 1}. The SHT elements are created from the training data sets by expanding each training point up to a given Hamming distance. In Eq. (5), h is the Hamming distance bound, which can be 1, 2, 3, or more depending on the number of training data points, and f is the number of features. The segmentation of the stream data sets can be 20–32 bits, and φ is the number of training data points:

$$\varphi \sum_{k=0}^{h} \binom{f}{k} = \varphi\left[\binom{f}{0} + \binom{f}{1} + \cdots + \binom{f}{h}\right] \tag{5}$$
4.2 Data Structure
The data structure is one of the implementation criteria for learning algorithms. According to [18,20], the Golay code, the Golay Code Transformation Matrix, the Golay Code Clustering Hash Table, the FuzzyFind Dictionary, and the Supervised Hash Table all use a hash table, which is among the most efficient structures for direct data access with constant time complexity. A hash table is an efficient technique for knowledge discovery, giving constant-time, direct access to indices. In addition, a hash function can convert any unstructured or structured input into binary features. This data structure is used to reduce the computational time complexity of the supervised learning algorithm.

Algorithm 2. Generating the Supervised Hash Table
1: for c = 1 to φ do
2:   for j = 1 to $\sum_{i=0}^{e} \binom{f}{i}$ do
3:     ▷ statistical meta-feature determination
4:     if $HD(U_c, U_j) = \frac{\sum_{k=1}^{f} \delta(C_{c_k} - C_{j_k})}{f} \le e$ then    ▷ HD is the Hamming distance
5:       if $w_i$ is Null then
6:         $w_{i,j} \leftarrow w_{i,new}$
7:       else
8:         $w^{*}_{i,j} \leftarrow w_{i,j} + \zeta$    ▷ fuzzy logic of the training model
9:       end if
10:    end if
11:  end for
12: end for
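A compact way to read Eq. (5) and Algorithm 2 together: every training key seeds all keys within Hamming distance h of itself, and each seeded entry accumulates a (possibly fuzzy) label weight. The Python sketch below is a simplified illustration under these assumptions; it is not the released C++/C# implementation.

```python
from itertools import combinations
from collections import defaultdict

def hamming_ball(key, f, h):
    """Yield every f-bit key within Hamming distance h of `key` (cf. Eq. 5)."""
    yield key
    for d in range(1, h + 1):
        for bits in combinations(range(f), d):
            flipped = key
            for b in bits:
                flipped ^= 1 << b          # flip d selected bit positions
            yield flipped

def build_supervised_hash_table(training_data, f, h=1):
    """training_data: iterable of (binary key, label) pairs; returns the SHT."""
    sht = defaultdict(lambda: defaultdict(float))   # key -> {label: weight}
    for key, label in training_data:
        for neighbour in hamming_ball(key, f, h):
            sht[neighbour][label] += 1.0            # accumulate fuzzy label weight
    return sht

# Example with f = 4 features: two training points, Hamming radius 1.
table = build_supervised_hash_table([(0b1010, +1), (0b1000, -1)], f=4, h=1)
```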
4.3 Hamming Distance
The Hamming Distance (HD) is used to measure the similarity of two binary variables [50]. The Hamming distance between two code words is equal to the number of bits in which they differ; for example, the Hamming distance between 0000 0100 0000 1000 1001 (16521) and 0000 0010 1110 0100 1111 (11855) is 9. In the proposed algorithm, we use HD values of 1, 2, and 3. The algorithm can handle larger volumes of data using fuzzy logic, depending on the hardware it runs on. The number of bits is represented as binary meta-features. Our algorithm generates n bits as the feature space (e.g., for 32 bits, about 4 billion unique data points); in this paper, we test the algorithm with 24 binary inputs, i.e., 2^24, which gives nearly 16 million unique records.

$$HD(U_c, U_j) = \frac{\sum_{k=1}^{f} \delta(C_{c_k} - C_{j_k})}{f} \tag{6}$$
4.4 Generating the Supervised Hash Table
Generating the Supervised Learning Hash Table is the main part of this technique. The main loop runs over all training data points to create all possible entries. The Hamming distance bound e can be 2, 3, or even larger when there are many features and only a small number of training data points. After the Hamming distance is computed, the Supervised Hash Table is updated with labels. In Algorithm 2, the main loop over all φ training data points uses:

$$\varphi(C_c, C_j) = \frac{\sum_{k=1}^{f} \delta(C_{c_k} - C_{j_k})}{f} \le e \tag{7}$$
$$w^{*}_{i,j} \leftarrow w_{i,j} + \zeta \tag{8}$$
Equation (7) is the Hamming distance over the features, and e indicates the maximum allowed HD. If an entry is assigned two or more labels, the corresponding vector of the Supervised Hash Table keeps all of those labels, meaning the record uses fuzzy labeling.
5 Evaluating the Model
In supervised learning with the Hamming distance technique, the Supervised Hash Table is built during the training phase; if enough training data is used, this hash table covers all possible feature inputs. To evaluate the trained model, an unlabeled data set is fed to the FSL-BM algorithm as binary input, and all unlabeled data points are encoded in the same space as the trained model, as discussed in Sect. 3. After applying the hash function, the correct index is assigned to each data point, and finally each unlabeled point receives its correct label(s). In the feature representation, the binary input features are converted into hash keys by the meta-feature-selected hash function and looked up in the Supervised Hash Table. Some data points have more than one label, hence the fuzzy logic: each data point can belong to more than one label. As presented in Algorithm 3, the main loop runs from 0 to m − 1, where m is the number of test data points, and the maximum fuzziness is the maximum number of labels each point can be assigned.

Algorithm 3. Testing data points using the Supervised Hash Table
1: for i = 1 to m − 1 do
2:   for x = 0 to max fuzziness label do
3:     if $w_{i,j} \ne$ null then    ▷ label x is available and exists in the training label list
4:       $Prediction_{i,j} \leftarrow w_{(hash\ index),j}$
5:       add the label to the label list of i
6:     end if
7:   end for
8: end for
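Testing therefore amounts to a constant-time lookup: encode the unlabeled point into its hash key and read back whatever label weights the Supervised Hash Table stores at that index. A minimal sketch, assuming a table of the form built in the earlier example:

```python
def predict(sht, key):
    """Return normalized fuzzy label weights stored for an encoded test point.

    An empty dict means the key was never reached during training.
    """
    weights = sht.get(key, {})
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()} if total else {}
```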
6 Experimental Results
Although time complexity is one of the most important evaluation criteria, the hardware used to implement and test the algorithms also matters. The times listed in Table 1 and Fig. 4 were measured with a single-threaded implementation; a multi-threaded implementation would reduce them further. Another significant factor is memory complexity, which is linear in this case, O(n). Figure 5 indicates how the test values are predicted for the validation data points; with m test data points the time complexity is O(m). All of the empirical and experimental results of this study shown in Table 1 were obtained on a single processor. The source code, implemented in C++ and C#, will be released on GitHub and on our lab website. C++ and C# were used to test the proposed algorithms on a system with a Core i7 CPU and 12 GB of memory.

Table 1. FSL-BM accuracy
          Dataset 1                    Dataset 2
          Accuracy   Fuzzy measure     Accuracy   Fuzzy measure
SVM       89.62      NA                90.42      NA
FSL-BM    93.41      0.23              95.59      0.86
Fig. 4. Left: accuracy for an online data stream, which increases as the volume of training data grows. Right: training and testing time are both linear; generating the FSL-BM table is near-linear, while validation testing is faster because it uses hash-table indexing (time is shown on a log scale, in seconds).
6.1 Data Set
We test our algorithm in two ways, the first being empirical data. The data set used for this algorithm has 24 binary features. As reported in Table 1, we test our algorithm on the following data sets: the first data set includes 3,850 training data points and 2,568 validation points, and the second data set includes 3,950 training data points and 2,560 validation points. We also test accuracy and time complexity on a randomly generated data set, as shown in Fig. 4.

Fig. 5. Pipeline of testing our results using the supervised training hash table and the Hamming distance.
6.2 Results
We test and evaluate our algorithm in two ways: on a real data set from IMDB and on a randomly generated data set. (i) Results on the IMDB data set: testing a new algorithm on different kinds of data sets is critical. We compare our algorithm against traditional supervised learning methods such as the Support Vector Machine (SVM). The proposed algorithm is validated on two different data sets with 23 binary features. Regarding Table 1, the total accuracy on data set 1 with 23 binary features is 93.41%, with correct accuracy 93.1%, fuzziness accuracy 92.87%, fuzziness 0.23%, Boolean error 6.8%, and fuzzy error 7.1%; for the second data set, the total accuracy is 95.59%, with correct accuracy 94.4%, fuzziness accuracy 96.87%, fuzziness 0.86%, and error 4.4%. Regarding Table 1, these results
show that binary supervised learning using the Hamming distance is more accurate on the same data sets. On the first data set we obtain 93.41% accuracy with FSL-BM while SVM reaches 89.62%, and on the second data set (with 100 training data points) the accuracies are 95.59% and 90.42%, respectively. (ii) Results on the randomly generated online data set: Fig. 4 shows two aspects of FSL-BM: the technique is suitable for online learning, and the model scales to large volumes of data. Figure 4 indicates how the model is learned from large binary meta-feature data sets while retaining fast time complexity.
7 Conclusion and Future Works
The proposed algorithm (FSL-BM) is well suited to big data streams in which the data points are converted to binary features. It is comparable with similar algorithms such as the Fuzzy Support Vector Machine (FSVM) and other methods. In this paper, we presented a novel supervised learning technique that uses the Hamming distance to find the nearest vector, together with meta-features, meta-knowledge discovery, and meta-learning to improve accuracy. A hash table and hash function are used to improve the computational time, and the results indicate that our method offers better accuracy, memory consumption, and time complexity. Fuzziness is another strength of this algorithm, useful for fuzzy unstructured data sets; real data sets are often fuzzy, with each training data point carrying more than one label. As future work, we plan to automate the feature selection process dynamically and to create a meta-feature selection library for public use. The algorithm can be particularly useful for many kinds of binary data points in binary big data stream analysis. Fuzzy supervised learning with binary features is a robust algorithm that can be used in big data mining, machine learning, and related fields. The authors plan to implement and release Python, R, and MATLAB source code for this study, and to optimize the algorithm with different techniques so that it can also be used in other fields such as image, video, and text processing.
References 1. Brazdil, P., Carrier, C.G., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer (2008) 2. Fatehi, M., Asadi, H.H.: Application of semi-supervised fuzzy c-means method in clustering multivariate geochemical data, a case study from the dalli cu-au porphyry deposit in central iran. Ore Geol. Rev. 81, 245–255 (2017) 3. Qiu, X., Ren, Y., Suganthan, P.N., Amaratunga, G.A.: Empirical mode decomposition based ensemble deep learning for load demand time series forecasting. Appl. Soft Comput. 54, 246–255 (2017)
4. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006) 5. Kowsari, K., Brown, D.E., Heidarysafa, M., Jafari Meimandi, K., Gerber, M.S., Barnes, L.E.: Hdltex: hierarchical deep learning for text classification. In: IEEE International Conference on Machine Learning and Applications(ICMLA). IEEE (2017) 6. Ashfaq, R.A.R., Wang, X.-Z., Huang, J.Z., Abbas, H., He, Y.-L.: Fuzziness based semi-supervised learning approach for intrusion detection system. Inf. Sci. 378, 484–497 (2017) 7. Jiang, X., Yi, Z., Lv, J.C.: Fuzzy SVM with a new fuzzy membership function. Neural Comput. Appl. 15(3–4), 268–276 (2006) 8. Chen, S.-G., Wu, X.-J.: A new fuzzy twin support vector machine for pattern classification. Int. J. Mach. Learn. Cybern. 1–12 (2017) 9. Chen, C.P., Liu, Y.-J., Wen, G.-X.: Fuzzy neural network-based adaptive control for a class of uncertain nonlinear stochastic systems. IEEE Trans. Cybern. 44(5), 583–593 (2014) 10. Sajja, P.S.: Computer aided development of fuzzy, neural and neuro-fuzzy systems. Empirical Research Press Ltd. (2017) 11. Lin, C., Lee, C.G.: Real-time supervised structure/parameter learning for fuzzy neural network. In: IEEE International Conference on Fuzzy Systems, pp. 1283– 1291. IEEE (1992) 12. Thompson, T.M.: From Error-Correcting Codes Through Sphere Packings to Simple Groups, vol. 21. Cambridge University Press, Cambridge (1983) 13. West, J.: Commercializing open science: deep space communications as the lead market for shannon theory, 1960–73. J. Manage. Stud. 45(8), 1506–1532 (2008) 14. Bahl, L., Chien, R.: On gilbert burst-error-correcting codes (corresp.). IEEE Trans. Inf. Theor. 15(3), 431–433 (1969) 15. Yu, H., Jing, T., Chen, D., Berkovich, S.Y.: Golay code clustering for mobility behavior similarity classification in pocket switched networks. J. Commun. Comput. USA 4 (2012) 16. Rangare, U., Thakur, R.: A review on design and simulation of extended golay decoder. Int. J. Eng. Sci. 2058 (2016) 17. Berkovich, E.: Method of and system for searching a data dictionary with fault tolerant indexing, US Patent 7,168,025, 23 January 2007 18. Kowsari, K., Yammahi, M., Bari, N., Vichr, R., Alsaby, F., Berkovich, S.Y.: Construction of fuzzy find dictionary using golay coding transformation for searching applications. Int. J. Adv. Comput. Sci. Appl. 1(6), 81–87 19. Bari, N., Vichr, R., Kowsari, K., Berkovich, S.Y.: Novel metaknowledge-based processing technique for multimediata big data clustering challenges. In: 2015 IEEE International Conference on Multimedia Big Data (BigMM), pp. 204–207. IEEE (2015) 20. Kowsari, K.: Investigation of fuzzy find searching with golay code transformations, Master’s thesis. The George Washington University, Department of Computer Science (2014) 21. Bari, N., Vichr, R., Kowsari, K., Berkovich, S.: 23-bit metaknowledge template towards big data knowledge discovery and management. In: 2014 International Conference on Data Science and Advanced Analytics (DSAA), pp. 519–526. IEEE (2014) 22. Kamishima, T., Fujiki, J.: Clustering orders. In: International Conference on Discovery Science, pp. 194–207. Springer (2003)
23. Russo, M.: Genetic fuzzy learning. IEEE Trans. Evol. Comput. 4(3), 259–273 (2000) 24. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984) 25. Qin, G., Huang, X., Chen, Y.: Nested one-to-one symmetric classification method on a fuzzy svm for moving vehicles. Symmetry 9(4), 48 (2017) 26. Wieland, R., Mirschel, W.: Combining expert knowledge with machine learning on the basis of fuzzy training. Ecol. Inform. 38, 26–30 (2017) 27. Prabu, M.J., Poongodi, P., Premkumar, K.: Fuzzy supervised online coactive neuro-fuzzy inference system-based rotor position control of brushless DC motor. IET Power Electron. 9(11), 2229–2239 (2016) 28. Gama, J.: Knowledge Discovery from Data Streams. CRC Press (2010) 29. Learning from Data Streams. Springer (2007) 30. H¨ ohle, U., Klement, E.P.: Non-classical logics and their applications to fuzzy subsets: a handbook of the mathematical foundations of fuzzy set theory, vol. 32. Springer (2012) 31. Zalta, E.N., etal.: Stanford Encyclopedia of Philosophy (2003) 32. Forrest, P.: The Identity of Indiscernibles (1996) 33. Logic, F.: Stanford Encyclopedia of Philosophy (2006) 34. Pinto, F., Soares, C., Mendes-Moreira, J.: A framework to decompose and develop meta features. In: Proceedings of the 2014 International Conference on Metalearning and Algorithm Selection, vol. 1201. CEUR-WS. org, pp. 32–36 (2014) 35. Cargile, J.: The sorites paradox. Br. J. Philos. Sci. 20(3), 193–202 (1969) 36. Malinowski, G.: Many-valued logic and its philosophy. In: Gabbay, D.M., Woods, J. (Eds.) The Many Valued and Nonmonotonic Turn in Logic, series: Handbook of the History of Logic North-Holland, vol. 8, pp. 13 – 94 (2007). http://www. sciencedirect.com/science/article/pii/S1874585707800045 37. Dinis, B.: Old and new approaches to the sorites paradox, arXiv preprint arXiv:1704.00450 (2017) 38. Yammahi, M., Kowsari, K., Shen, C., Berkovich, S.: An efficient technique for searching very large files with fuzzy criteria using the pigeonhole principle. In: 2014 Fifth International Conference on Computing for Geospatial Research and Application (COM. Geo), pp. 82–86. IEEE (2014) 39. Evans, J.A., Foster, J.G.: Metaknowledge. Science 331(6018), 721–725 (2011) 40. Handzic, M.: Knowledge management: through the technology glass. World scientific, vol. 2 (2004) 41. Qazanfari, K., Youssef, A., Keane, K., Nelson, J.: A novel recommendation system to match college events and groups to students, arXiv:1709.08226v1 (2017) 42. Davis, R., Buchanan, B.G.: Meta-level knowledge. In: Rulebased Expert Systems, The MYCIN Experiments of the Stanford Heuristic Programming Project, BG Buchanan and Shortliffe, E. (Eds.). Addison-Wesley, Reading, pp. 507–530 (1984) 43. Vilalta, R., Giraud-Carrier, C.G., Brazdil, P., Soares, C.: Using meta-learning to support data mining. IJCSA 1(1), 31–45 (2004) 44. Alassaf, M.H., Kowsari, K., Hahn, J.K.: Automatic, real time, unsupervised spatiotemporal 3D object detection using RGB-D cameras. In: 2015 19th International Conference on Information Visualisation (IV), pp. 444–449. IEEE (2015) 45. Kowsari, K., Alassaf, M.H.: Weighted unsupervised learning for 3D object detection. Int. J. Adv. Comput. Sci. Appl. 7(1), 584–593 (2016) 46. Qazanfari, K., Aslanzadeh, R., Rahmati, M.: An efficient evolutionary based method for image segmentation, arXiv preprint arXiv:1709.04393 (2017)
47. Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised learning. In: Chapelle, O. et al. (eds.) IEEE Transactions on Neural Networks [book reviews], vol. 20, no. 3, pp. 542–542 (2009) 48. Chapelle, O., Chi, M., Zien, A.: A continuation method for semi-supervised SVMS. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 185–192. ACM (2006) 49. Chapelle, O., Sindhwani, V., Keerthi, S.S.: Branch and bound for semi-supervised support vector machines. In: NIPS, pp. 217–224 (2006) 50. Choi, S.-S., Cha, S.-H., Tappert, C.C.: A survey of binary similarity and distance measures. J. Syst. Cybern. Inform. 8(1), 43–48 (2010)
A Novel Method for Stress Measuring Using EEG Signals
Vinayak Bairagi(&) and Sanket Kulkarni
Department of E&TC, AISSMS's Institute of Information Technology, Pune, India
[email protected], [email protected]
Abstract. Stress is one of the major contributing factors that lead to various diseases, including cardiovascular disease. To avoid this, stress monitoring is essential for clinical intervention and disease prevention. In the present study, the feasibility of exploiting Electroencephalography (EEG) signals to monitor stress during mental arithmetic tasks is investigated. This paper presents a novel hardware system, along with a software system, that provides a method for determining the stress level with the help of the Theta sub-band of EEG signals. The proposed system performs signal processing of the EEG, recognizing peaks of the Theta sub-band above a certain threshold value, and uses first-order difference information to identify the peaks. The proposed method of EEG-based stress detection can be used as a quick, noninvasive, portable and handheld tool for determining the stress level of a person.
Keywords: Mental stress · Electroencephalography (EEG) · Theta sub-band
1 Introduction
Stress is basically an emotional disturbance caused by a change of environment or demanding circumstances. People suffer from stress in their daily lives. Biological, social, and psychological factors all affect the human stress level. Stress can arise from worrying, impaired judgment, negativity, loss of appetite, loss of confidence, and headaches; it can also arise from work pressure. Stress leads to mental disturbance as well as changes in the functioning of other body parts, especially the heart [1, 2]. It is recognized that stress is one of the important factors causing chronic disorders and related productivity losses. An individual's performance, attitude toward life, and willingness to work are influenced by stress. Numerous health problems are linked to chronic stress. Several research findings have shown a correlation between exposure to stress and risk factors such as cardiovascular disease [3]. Stress can be observed when people are engaged in challenging work, coping with a strained relationship, or taking part in intense competition [4]. Many other health problems are also related to stress [3, 4].
Research Grant support from: BIRAC GYTI, India.
People working under high pressure at work or in life get depressed. The causes of mental stress are shown in Fig. 1.
Fig. 1. Causes of mental stress: worrying, negativity, headaches, and loss of confidence.
On the other hand, a moderate amount of stress is also needed, because it helps a person stay focused and alert, as shown in Fig. 2. In such a case, stress can be a way to improve an individual's performance and achieve success. Assessing and monitoring stress is a challenging task, since everyone experiences stress.
Fig. 2. Stress-performance curve (performance vs. stress level, with maximum performance at a moderate stress level).
Signs of stress can be monitored by medical tests, psychoanalysis, and biosignals. Headaches, tense muscles, insomnia, and rapid heartbeat are some of the symptoms of high stress. There are several novel wearable devices that help people check their stress and plan their daily lives, such as Olive Spire, Breath Acoustics, and Gizmodo, integrated with various biosensors. Electroencephalogram (EEG) signals are widely used in the diagnosis of mental and neurodegenerative diseases such as epilepsy, Alzheimer's disease, and many more, and they are now used extensively in bioengineering research [5–7]. EEG-based interfaces are used in many applications, including psychophysiology and psychology. The EEG signal measures the activity within the brain.
It is basically classified into five different bands. The measurement of spontaneous electrical activity on the scalp or brain is termed the electroencephalogram. The amplitude of the EEG signal is about 100 µV when measured on the scalp, and about 1–2 mV when measured on the surface of the brain. The bandwidth of this signal ranges from 1 Hz to 50 Hz. Evoked potentials are those components of the EEG that arise in response to a stimulus (which may be electric, auditory, visual, etc.). These signals are usually below the noise level and thus not readily distinguished, so a train of stimuli and signal averaging must be used to improve the signal-to-noise ratio (SNR). Brain-Computer Interface (BCI) based stress monitoring using EEG comes in two types. The first is the invasive type, which involves attaching the EEG electrodes directly to the brain tissue; the patient's brain gradually adapts its signals to be sent through the electrodes. The second is the non-invasive type, which involves placing electrodes on the patient's scalp and taking readings [8, 9]. Figure 3 shows the EEG electrode placement according to the standard 10–20 system recommended by the American EEG Society. Let us now examine in detail the methodology of stress monitoring using EEG-based systems.
Fig. 3. International 10–20 EEG Electrode Placement System [5].
2 Materials and Methods
A standard EEG database for stress level determination is not available. In the present study, a database of 50 EEG signals from subjects aged 23–28 years was obtained. The EEG signals were recorded after the subjects performed some mental and physical tasks. All subjects had no history of mental disease or head injury. A Recorders and Medicare Systems (RMS) EEG machine was used to record EEG signals from 14 channels (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4), placed on the scalp following the international 10–20 system, sampled at 128 Hz with a bandwidth from 0.16 Hz to 43 Hz.
2.1 Preprocessing and Need of Filter
Raw EEG is contaminated with noise of different forms and from different sources during signal acquisition. As the EEG has a very small amplitude, filtering out unwanted noise is a critical step in extracting useful information. In the present study, two primary artifacts are eliminated: power-line noise and ocular artifacts arising from body movement. The preprocessing of the EEG data was performed offline with MATLAB (2013a) and the EEGLAB toolbox. Filters are used in the proposed design to select the frequency range of 0.1–47 Hz. A low-pass filter (LPF) at 47 Hz is used to avoid the need for a notch filter against interference introduced by the power supply line; another reason for the 47 Hz cutoff is that the EEG signal frequencies of interest lie between 1 and 47 Hz (Table 1).
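For the 0.1–47 Hz pass band mentioned above, a zero-phase Butterworth band-pass filter is one common choice. The SciPy sketch below is an illustrative equivalent of such preprocessing, not the authors' MATLAB/EEGLAB pipeline; the filter order and the synthetic test signal are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128.0   # sampling rate of the recorded EEG (Hz)

def bandpass_eeg(signal, low=0.1, high=47.0, order=4):
    """Zero-phase Butterworth band-pass filter over the EEG pass band."""
    b, a = butter(order, [low, high], btype="bandpass", fs=FS)
    return filtfilt(b, a, signal)

# Example: filter ten seconds of synthetic single-channel data.
channel = np.random.randn(10 * int(FS))
filtered = bandpass_eeg(channel)
```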
2.2
Frequency