Algorithms and Architectures for Parallel Processing

The four-volume set LNCS 11334-11337 constitutes the proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2018, held in Guangzhou, China, in November 2018. The 141 full and 50 short papers presented were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on Distributed and Parallel Computing; High Performance Computing; Big Data and Information Processing; Internet of Things and Cloud Computing; and Security and Privacy in Computing.




LNCS 11336

Jaideep Vaidya Jin Li (Eds.)

Algorithms and Architectures for Parallel Processing 18th International Conference, ICA3PP 2018 Guangzhou, China, November 15–17, 2018 Proceedings, Part III


Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany


More information about this series at http://www.springer.com/series/7407


Editors Jaideep Vaidya Rutgers University Newark, NJ, USA

Jin Li Guangzhou University Guangzhou, China

ISSN 0302-9743    ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-05056-6    ISBN 978-3-030-05057-3 (eBook)
https://doi.org/10.1007/978-3-030-05057-3
Library of Congress Control Number: 2018962485
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues
© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Welcome to the proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2018), which was organized by Guangzhou University and held in Guangzhou, China, during November 15–17, 2018. ICA3PP 2018 was the 18th event in a series of conferences devoted to research on algorithms and architectures for parallel processing. Previous iterations of the conference include ICA3PP 2017 (Helsinki, Finland, November 2017), ICA3PP 2016 (Granada, Spain, December 2016), ICA3PP 2015 (Zhangjiajie, China, November 2015), ICA3PP 2014 (Dalian, China, August 2014), ICA3PP 2013 (Vietri sul Mare, Italy, December 2013), ICA3PP 2012 (Fukuoka, Japan, September 2012), ICA3PP 2011 (Melbourne, Australia, October 2011), ICA3PP 2010 (Busan, Korea, May 2010), ICA3PP 2009 (Taipei, Taiwan, June 2009), ICA3PP 2008 (Cyprus, June 2008), ICA3PP 2007 (Hangzhou, China, June 2007), ICA3PP 2005 (Melbourne, Australia, October 2005), ICA3PP 2002 (Beijing, China, October 2002), ICA3PP 2000 (Hong Kong, China, December 2000), ICA3PP 1997 (Melbourne, Australia, December 1997), ICA3PP 1996 (Singapore, June 1996), and ICA3PP 1995 (Brisbane, Australia, April 1995). ICA3PP is now recognized as the main regular event in the area of parallel algorithms and architectures, which covers many dimensions including fundamental theoretical approaches, practical experimental projects, and commercial and industry applications. This conference provides a forum for academics and practitioners from countries and regions around the world to exchange ideas for improving the efficiency, performance, reliability, security, and interoperability of computing systems and applications. ICA3PP 2018 attracted over 400 high-quality research papers highlighting the foundational work that strives to push beyond the limits of existing technologies, including experimental efforts, innovative systems, and investigations that identify weaknesses in existing parallel processing technology. Each submission was reviewed by at least two experts in the relevant areas, on the basis of their significance, novelty, technical quality, presentation, and practical impact. According to the review results, 141 full papers were selected to be presented at the conference, giving an acceptance rate of 35%. Besides, we also accepted 50 short papers and 24 workshop papers. In addition to the paper presentations, the program of the conference included four keynote speeches and two invited talks from esteemed scholars in the area, namely: Prof. Xuemin (Sherman) Shen, University of Waterloo, Canada; Prof. Wenjing Lou, Virginia Tech, USA; Prof. Witold Pedrycz, University of Alberta, Canada; Prof. Xiaohua Jia, City University of Hong Kong, Hong Kong; Prof. Xiaofeng Chen, Xidian University, China; Prof. Xinyi Huang, Fujian Normal University, China. We were extremely honored to have them as the conference keynote speakers and invited speakers. ICA3PP 2018 was made possible by the behind-the-scene effort of selfless individuals and organizations who volunteered their time and energy to ensure the success


of this conference. We would like to express our special appreciation to Prof. Yang Xiang, Prof. Weijia Jia, Prof. Yi Pan, Prof. Laurence T. Yang, and Prof. Wanlei Zhou, the Steering Committee members, for giving us the opportunity to host this prestigious conference and for their guidance with the conference organization. We would like to emphasize our gratitude to the general chairs, Prof. Albert Zomaya and Prof. Minyi Guo, for their outstanding support in organizing the event. Thanks also to the publicity chairs, Prof. Zheli Liu and Dr Weizhi Meng, for the great job in publicizing this event. We would like to give our thanks to all the members of the Organizing Committee and Program Committee for their efforts and support. The ICA3PP 2018 program included two workshops, namely, the ICA3PP 2018 Workshop on Intelligent Algorithms for Large-Scale Complex Optimization Problems and the ICA3PP 2018 Workshop on Security and Privacy in Data Processing. We would like to express our sincere appreciation to the workshop chairs: Prof. Ting Hu, Prof. Feng Wang, Prof. Hongwei Li and Prof. Qian Wang. Last but not least, we would like to thank all the contributing authors and all conference attendees, as well as the great team at Springer that assisted in producing the conference proceedings, and the developers and maintainers of EasyChair. November 2018

Jaideep Vaidya Jin Li

Organization

General Chairs
Albert Zomaya, University of Sydney, Australia
Minyi Guo, Shanghai Jiao Tong University, China

Program Chairs
Jaideep Vaidya, Rutgers University, USA
Jin Li, Guangzhou University, China

Publication Chair
Yu Wang, Guangzhou University, China

Publicity Chairs
Zheli Liu, Nankai University, China
Weizhi Meng, Technical University of Denmark, Denmark

Steering Committee
Yang Xiang (Chair), Swinburne University of Technology, Australia
Weijia Jia, Shanghai Jiaotong University, China
Yi Pan, Georgia State University, USA
Laurence T. Yang, St. Francis Xavier University, Canada
Wanlei Zhou, Deakin University, Australia

Program Committee Pedro Alonso Daniel Andresen Cosimo Anglano Danilo Ardagna Kapil Arya Marcos Assuncao Joonsang Baek Anirban Basu Ladjel Bellatreche Jorge Bernal Bernabe Thomas Boenisch

Universitat Politècnica de València, Spain Kansas State University, USA Universitá del Piemonte Orientale, Italy Politecnico di Milano, Italy Northeastern University, USA Inria, France University of Wollongong, Australia KDDI Research Inc., Japan LIAS/ENSMA, France University of Murcia, Spain High-Performance Computing Center Stuttgart, Germany


George Bosilca Massimo Cafaro Philip Carns Alexandra Carpen-Amarie Aparicio Carranza Aniello Castiglione Arcangelo Castiglione Pedro Castillo Tzung-Shi Chen Kim-Kwang Raymond Choo Mauro Conti Jose Alfredo Ferreira Costa Raphaël Couturier Miguel Cárdenas Montes Masoud Daneshtalab Casimer Decusatis Eugen Dedu Juan-Carlos Díaz-Martín Matthieu Dorier Avgoustinos Filippoupolitis Ugo Fiore Franco Frattolillo Marc Frincu Jorge G. Barbosa Chongzhi Gao Jose Daniel García Luis Javier García Villalba Paolo Gasti Vladimir Getov Olivier Gluck Jing Gong Amina Guermouche Jeff Hammond Feng Hao Houcine Hassan Sun-Yuan Hsieh Chengyu Hu Xinyi Huang Mauro Iacono Shadi Ibrahim Yasuaki Ito Mathias Jacquelin Nan Jiang Lu Jiaxin

University of Tennessee, USA University of Salento, Italy Argonne National Laboratory, USA Vienna University of Technology, Austria City University of New York, USA University of Salerno, Italy University of Salerno, Italy University of Granada, Spain National University of Tainan, Taiwan The University of Texas at San Antonio, USA University of Padua, Italy Federal University, UFRN, Brazil University Bourgogne Franche-Comté, France CIEMAT, Spain Mälardalen University and Royal Institute of Technology, Sweden Marist College, USA University of Bourgogne Franche-Comté, France University of Extremadura, Spain Argonne National Laboratory, USA University of Greenwich, UK Federico II University, Italy University of Sannio, Italy West University of Timisoara, Romania University of Porto, Portugal Guangzhou University, China University Carlos III of Madrid, Spain Universidad Complutense de Madrid, Spain New York Institute of Technology, USA University of Westminster, UK Université de Lyon, France KTH Royal Institute of Technology, Sweden Telecom Sud-Paris, France Intel, USA Newcastle University, UK Universitat Politècnica de València, Spain National Cheng Kung University, Taiwan Shandong University, China Fujian Normal University, China University of Campania Luigi Vanvitelli, Italy Inria, France Hiroshima University, Japan Lawrence Berkeley National Laboratory, USA East China Jiaotong University, China Jiangxi Normal University, China


Edward Jung Georgios Kambourakis Gabor Kecskemeti Muhammad Khurram Khan Dieter Kranzlmüller Michael Kuhn Julian Kunkel Algirdas Lančinskas Patrick P. C. Lee Laurent Lefevre Hui Li Kenli Li Dan Liao Jingyu Liu Joseph Liu Yunan Liu Zheli Liu Jay Lofstead Paul Lu Amit Majumdar Tomas Margalef Stefano Markidis Alejandro Masrur Susumu Matsumae Raffaele Montella Francesco Moscato Bogdan Nicolae Francesco Palmieri Swann Perarnau Dana Petcu Salvador Petit Riccardo Petrolo Florin Pop Radu Prodan Zhang Qikun Thomas Rauber Khaled Riad Suzanne Rivoire Ivan Rodero Romain Rouvoy Antonio Ruiz-Martínez Françoise Sailhan Sherif Sakr Giandomenico Spezzano


Kennesaw State University, USA University of the Aegean, Greece Liverpool John Moores University, UK King Saud University, Saudi Arabia Ludwig Maximilian University of Munich, Germany University of Hamburg, Germany German Climate Computing Center, Germany Vilnius University, Lithuania The Chinese University of Hong Kong, SAR China Inria, France University of Electronic Science and Technology of China, China Hunan University, China University of Electronic Science and Technology of China, China Hebei University of Technology, China Monash University, Australia Jiangxi Normal University, China Nankai University, China Sandia National Laboratories, USA University of Alberta, Canada University of California San Diego, USA Universitat Autonoma de Barcelona, Spain KTH Royal Institute of Technology, Sweden Chemnitz University of Technology, Germany Saga University, Japan University of Naples Parthenope, Italy University of Campania Luigi Vanvitelli, Italy Argonne National Laboratory, Germany University of Salerno, Italy, Italy Argonne National Laboratory, USA West University of Timisoara, Romania Universitat Politècnica de València, Spain Rice University, USA University Politehnica of Bucharest, Romania University of Klagenfurt, Austria Beijing Institute of Technology, China University Bayreuth, Germany Zagazig University, Egypt Sonoma State University, USA Rutgers University, USA University of Lille, France University of Murcia, Spain CNAM, France The University of New South Wales, Australia ICAR-CNR and University of Calabria, Italy


Patricia Stolf John Stone Peter Strazdins Hari Subramoni Gang Sun Zhizhuo Sun Frederic Suter Yu-An Tan Ming Tao Andrei Tchernykh Massimo Torquati Tomoaki Tsumura Didem Unat Vladimir Voevodin Feng Wang Hao Wang Yu Wei Sheng Wen Jigang Wu Roman Wyrzykowski Yu Xiao Ramin Yahyapour Fang Yan Zheng Yan Laurence T. Yang Wun-She Yap

IRIT, France University of Illinois at Urbana-Champaign, USA The Australian National University, Australia The Ohio State University, USA University of Science and Technology of China, China Beijing Institute of Technology, China CNRS, France Beijing Institute of Technology, China Dongguan University of Technology, China CICESE Research Center, Mexico University of Pisa, Italy Nagoya Institute of Technology, Japan Koç University, Turkey Moscow University, Russia Wuhan University, China Shandong Normal University, China Nankai University, China Swinbourne University of Technology, China Guangdong University of Technology, China Czestochowa University of Technology, Poland Shandong University of Technology, China University of Göttingen, Germany Beijing Wuzi University, China Xidian University, China St. Francis Xavier University, Canada Universiti Tunku Abdul Rahman, Malaysia

Contents – Part III

Big Data and Information Processing

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation in Cooperative Cognitive Radio Networks . . . 3
  Xinxiang Zhang, Jigang Wu, and Long Chen
QoS-Driven Service Matching Algorithm Based on User Requirements . . . 17
  Mengying Guo and Xudong Yang
Research on Overload Classification Method for Bus Images Based on Image Processing and SVM . . . 28
  Tingting Li, Yongxiong Sun, Yanhua Liang, Yujia Zhai, and Xuan Ji
Accurate Acoustic Based Gesture Classification with Zero Start-Up Cost . . . 44
  Haojun Ai, Liangliang Han, Yifeng Wang, and Liang Liao
An Approach of Collecting Performance Anomaly Dataset for NFV Infrastructure . . . 59
  Qingfeng Du, Yu He, Tiandi Xie, Kanglin Yin, and Juan Qiu
An Axiomatization for BSP Algorithms . . . 72
  Yoann Marquer and Frédéric Gava
Efficient and Secure Outsourced Linear Regression . . . 89
  Haomiao Yang, Weichao He, Qixian Zhou, and Hongwei Li
New Multi-objectives Scheduling Strategies in Docker SwarmKit . . . 103
  Tarek Menouer, Christophe Cérin, and Étienne Leclercq
Internet Performance Prediction Framework Based on PingER Dataset . . . 118
  Wei Zhang, Xiaofei Xing, Saqib Ali, and Guojun Wang
MS-RAID: An Energy-Saving Data Layout for CDP . . . 132
  Jingyu Liu, Ziyao Zhang, Lu Liu, and Xin Chai
Incentivizing Multimedia Data Acquisition for Machine Learning System . . . 142
  Yiren Gu, Hang Shen, Guangwei Bai, Tianjing Wang, Hai Tong, and Yujia Hu
Toward Performance Prediction for Multi-BSP Programs in ML . . . 159
  Victor Allombert, Frédéric Gava, and Julien Tesson
Exploiting the Table of Energy and Power Leverages . . . 175
  Issam Raïs, Laurent Lefèvre, Anne-Cécile Orgerie, and Anne Benoit
A Semantic Web Based Intelligent IoT Model . . . 186
  Chao Qu, Ming Tao, Jie Zhang, Xiaoyu Hong, and Ruifen Yuan
Accelerating CNNs Using Optimized Scheduling Strategy . . . 196
  Rui Xu, Sheng Ma, Wenwu Li, and Yang Guo
Data Analysis of Blended Learning in Python Programming . . . 209
  Qian Chu, Xiaomei Yu, Yuli Jiang, and Hong Wang
APs Deployment Optimization for Indoor Fingerprint Positioning with Adaptive Particle Swarm Algorithm . . . 218
  Jianhui Zhao, Jun Li, Haojun Ai, and Bo Cai
Deployment Optimization of Indoor Positioning Signal Sources with Fireworks Algorithm . . . 229
  Jianhui Zhao, Shiqi Wen, Haojun Ai, and Bo Cai
A Study of Sleep Stages Threshold Based on Multiscale Fuzzy Entropy . . . 239
  Xuexiao Shao, Bin Hu, Yalin Li, and Xiangwei Zheng
Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels . . . 249
  Jing Liu, Kun Han, Wenhua Wu, Shu Wang, and Xiao Yu
Facial Shape and Expression Transfer via Non-rigid Image Deformation . . . 257
  Huabing Zhou, Shiqiang Ren, Yong Zhou, Yuyu Kuang, Yanduo Zhang, Wei Zhang, Tao Lu, Hanwen Chen, and Deng Chen
P-Schedule: Erasure Coding Schedule Strategy in Big Data Storage System . . . 270
  Chao Yin, Haitao Lv, Tongfang Li, Yan Liu, Xiaoping Qu, and Sihao Yuan
Answer Aggregation of Crowdsourcing Employing an Improved EM-Based Approach . . . 280
  Ran Zhang, Lei Liu, Lizhen Cui, Wei He, and Hui Li

Internet of Things and Cloud Computing

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data Using Apache Spark in Cloud . . . 293
  Cheng Yang, Weidong Bao, Xiaomin Zhu, Ji Wang, and Wenhua Xiao
Task Offloading in Edge-Clouds with Budget Constraint . . . 311
  Lei He, Hongli Xu, Haibo Wang, Liusheng Huang, and Jingyi Ma
Motion Trajectory Sequence-Based Map Matching Assisted Indoor Autonomous Mobile Robot Positioning . . . 327
  Wenping Yu, Jianzhong Zhang, Jingdong Xu, and Yuwei Xu
Towards the Independent Spanning Trees in the Line Graphs of Interconnection Networks . . . 342
  Baolei Cheng, Jianxi Fan, Xiaoyan Li, Guijuan Wang, Jingya Zhou, and Yuejuan Han
POEM: Pricing Longer for Edge Computing in the Device Cloud . . . 355
  Qiankun Yu, Jigang Wu, and Long Chen
Mobility Analysis and Response for Software-Defined Internet of Things . . . 370
  Zhiyong Zhang, Rui Wang, Xiaojun Cai, and Zhiping Jia
DStore: A Distributed Cloud Storage System Based on Smart Contracts and Blockchain . . . 385
  Jingting Xue, Chunxiang Xu, Yuan Zhang, and Lanhua Bai
Towards an Efficient and Real-Time Scheduling Platform for Mobile Charging Vehicles . . . 402
  Qi Liu, Jinyang Li, Xiaoshan Sun, Junjie Wang, Yang Ning, Wei Zheng, Jian Li, and Hengchang Liu
SoProtector: Securing Native C/C++ Libraries for Mobile Applications . . . 417
  Ning Zhang, Guangquan Xu, Guozhu Meng, and Xi Zheng
CloudPT: Performance Testing for Identifying and Detecting Bottlenecks in IaaS . . . 432
  Ameen Alkasem, Hongwei Liu, and Decheng Zuo
Smart Grid Power Trading Based on Consortium Blockchain in Internet of Things . . . 453
  Dong Zheng, Kaixin Deng, Yinghui Zhang, Jiangfan Zhao, Xiaokun Zheng, and Xinwei Ma
Energy-Efficient Offloading in Mobile Edge Computing with Edge-Cloud Collaboration . . . 460
  Xin Long, Jigang Wu, and Long Chen
Quantitatively Investigating Multihop Localization Errors in Regular 2-D Sensor Networks . . . 476
  Bing Jia, Baoqi Huang, Tao Zhou, and Wuyungerile Li
Optimizing WiFi AP Placement for Both Localization and Coverage . . . 489
  Yu Tian, Baoqi Huang, Bing Jia, and Long Zhao
PLZMA: A Parallel Data Compression Method for Cloud Computing . . . 504
  Xin Wang, Lin Gan, Jingheng Xu, Jinzhe Yang, Maocai Xia, Haohuan Fu, Xiaomeng Huang, and Guangwen Yang
A Caching-Based Parallel FP-Growth in Apache Spark . . . 519
  Zhicheng Cai, Xingyu Zhu, Yuehui Zheng, Duan Liu, and Lei Xu
Contextual-Field Supported Iterative Representation for Face Hallucination . . . 534
  Kangli Zeng, Tao Lu, Xiaolin Li, Yanduo Zhang, Li Peng, and Shenming Qu
A Cancelable Multi-Biometric Template Generation Algorithm Based on Bloom Filter . . . 547
  Lin You and Xun Li
Streaming ETL in Polystore Era . . . 560
  Nabila Berkani and Ladjel Bellatreche
Communication-Aware Prediction-Based Online Scheduling in High-Performance Real-Time Embedded Systems . . . 575
  Baptiste Goupille-Lescar, Eric Lenormand, Nikos Parlavantzas, and Christine Morin
Predicting SDC Vulnerability of Instructions Based on Random Forests Algorithm . . . 593
  LiPing Liu, LinLin Ci, and Wei Liu
Hybrid Cloud Architecture for Cross-Platform Interoperability in Smart Homes . . . 608
  Ming Tao, Chao Qu, Wenhong Wei, Bin Zhou, and Shuqiang Huang
Conflict-Free Block-with-Stride Access of 2D Storage Structure . . . 618
  Rui Song, Guozhao Zeng, Sheng Liu, and Haiyan Chen
Graph-Based Indoor Localization with the Fusion of PDR and RFID Technologies . . . 630
  Jie Wu, Minghua Zhu, Bo Xiao, and Yunzhou Qiu
UAV 3D Mobility Model Oriented to Dynamic and Uncertain Environment . . . 640
  Na Wang, Nan Di, Fei Dai, and Fangxin Liu
Acquiring Hidden Space via Modifying Block Bitmap for Android Devices . . . 651
  Wang Lianfang, Huang Hong, Li Yuanzhang, and Zhang Li
Interest Relevance-Based Caching Design in Content-Centric Networking . . . 661
  Guozhi Zhang, Jiqiang Liu, Xiaolin Chang, and Yang Yang

Author Index . . . 673

Big Data and Information Processing

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation in Cooperative Cognitive Radio Networks

Xinxiang Zhang, Jigang Wu(B), and Long Chen

Guangdong University of Technology, Guangzhou 510006, China
zxx [email protected], [email protected], [email protected]

Abstract. Cooperative cognitive radio networks have been proposed to address the spectrum starvation problem and enhance the transmission rate of mobile devices. Most works assume that one user could afford the whole spectrum and neglect the selfish nature of the participants, which is not practical. Based on group buying, a two-stage auction mechanism named TAMSA is proposed to guarantee the quality of service and improve the utilization ratio of spectrum resources. TAMSA is an incentive mechanism involving the primary users (P U s) and relay nodes. TAMSA can also reduce the cost of the secondary users (SU s) and increase utilities for both P U s and relay nodes. In the first stage, SU s submit their budgets, valuations and demands for spectrum resources to relay nodes in group buying, and relay nodes calculate revenues and determine the winning SU s. In the second stage, we execute a VCG auction between the relay nodes and P U s, with a maximum-weighted-matching algorithm. TAMSA can effectively allocate spectrum resources to meet the demands of SU s. We show that TAMSA is truthful, individually rational and computationally efficient. Extensive simulation results show that TAMSA outperforms a random algorithm by 256% in terms of average utility of P U s. TAMSA is able to improve the average utility of SU s and relay nodes significantly, by up to 213% and 10 times respectively. TAMSA is further improved by 28.33% and 78.65% in terms of average utility of P U s over TASG and TACC, respectively.

Keywords: Spectrum allocation · VCG auction · Incentive mechanism · Cooperative cognitive radio networks

1 Introduction

With the explosive growth of smart phones, wearable devices and the Internet of Things (IoT), the demand for higher data rates and lower latency keeps increasing. Spectrum resource is one of the most valuable resources for wireless communication devices. However, many spectrum resources have been allocated to licensed users. On one hand, existing un-used spectrum resources have become scarce. On the other hand, some used spectrum resources have not been fully utilized, such as


radio and TV channels, resulting in spectrum holes [1–3]. Cognitive radio is proposed to solve the above problems, guaranteeing the Quality of Service (QoS) of mobile devices and improving the utilization ratio of spectrum resources. To enhance the performance of cognitive radio networks (CRNs), cooperative cognitive radio networks (CCRNs) were proposed [4]. In CCRNs, there are two kinds of users: one is the spectrum holder, that is, the primary user (licensed user), denoted as P U ; the other is the secondary user (unlicensed user), represented by SU [5]. Mobile devices with cognitive functions can dynamically detect and utilize the idle spectrum resources. CCRNs allow SU s to access the licensed spectrum occupied by P U s to improve spectrum utilization [6,7], but SU s must not cause strong interference to the normal communication of P U s. CCRNs can improve the utilization ratio of spectrum resources by spectrum reuse.

Auction plays an important role in spectrum resource allocation, and there have been numerous studies on spectrum allocation using auctions [8–10]. Most prior works design single-seller and multi-buyer auctions with homogeneous channels. In [1] and [4], the authors design truthful auctions for trading homogeneous channels between a seller and multiple SU s. Besides, a distributed resource allocation algorithm is adopted, and direct or cooperative transmission can be selected with multiple sellers and multiple buyers [5]. Many studies assume that P U s are willing to share their idle spectrum resources. In reality, P U s are usually selfish, hence it is necessary to provide incentives for P U s to participate. The Vickrey-Clarke-Groves (VCG) auction guarantees the truthfulness of the auction process, which provides a new idea for resource allocation and can effectively guarantee the economic returns of the participants. A McAfee-based auction mechanism has been proposed, which considers the cooperative transmission of relay nodes and ensures the maximum benefit of P U s, but it does not consider the revenues of relay nodes [7]. In existing works [11–13], the authors propose VCG-based auction mechanisms to maximize the utility of P U s and guarantee truthfulness. However, the objective of maximizing the benefit of P U s neglects the specific demands of SU s for spectrum resources.

In recent years, double auctions [10] and combinatorial auctions [11] have been considered in spectrum resource allocation. However, most works neglect cooperative data transmission by relay nodes. Inspired by the popular group-buying services on the Internet, the authors in [13] and [14] propose auction algorithms based on group buying, which encourage SU s to voluntarily group together to acquire spectrum resources in spectrum auctions. The group-buying algorithm can effectively reduce the payments of the SU s. In [12–14], the spectrum resources are equally distributed to the winning SU s. Besides, in [15], a multiple-input and multiple-output method is proposed for CRNs with cooperative communication. It allows SU s to help data transmission for P U s and obtain the opportunity to transmit data for themselves, but the mechanism has a higher requirement on hardware configuration. In this work, we reduce the payment of SU s with group buying, and we allocate spectrum resources according to the specific demands of the SU s.


In order to effectively allocate spectrum resources and encourage P U s to share spectrum resources in the auction we design, we have to solve the following challenges. (1) Running applications on mobile devices are heterogeneous, so the budget and demand of each SU are different; besides, how to reduce the cost of SU s is a challenge. (2) Spectrum holders and relay nodes should be incentivized because of their selfish nature; therefore, incentives should be designed for both P U s and relay nodes. (3) The auction should be truthful, budget balanced, individually rational and computationally efficient; hence, the auction mechanism should ensure the above properties.

Different from previous works, we focus on investigating an incentive auction mechanism for efficient spectrum resource allocation in CCRNs. TAMSA provides an incentive for both P U s and relay nodes to participate in the auction. Besides, in this scenario, TAMSA is based on group buying to reduce the payments of SU s, and TAMSA allocates spectrum resources according to the specific demands of SU s. The main contributions of this work are summarized as follows.

• To reduce the payment of SU s effectively, we propose an auction algorithm based on group buying for the specific demands of spectrum resources. The auction mechanism is applicable to heterogeneous networks. The economic properties of truthfulness, budget balance, individual rationality and computational efficiency are proved.
• We design an incentive mechanism to encourage spectrum holders to share their idle spectrum resources, and to encourage relay nodes to transmit data cooperatively.
• Numerous numerical results demonstrate that TAMSA is superior to the algorithm Random by 256% in terms of average utility of P U s. The average utility of relay nodes and SU s in TAMSA outperforms Random by 10 times and 213% respectively. TAMSA is further improved by 28.33% and 78.65% in terms of average utility of P U s over TASG and TACC, respectively.

2 System Model and Problem Formulation

In this section, we not only focus on the system model, but also formulate the problem to be studied. We also introduce the related economic properties that the auction scheme should follow. The basic notations are shown in Table 1.

2.1 System Model

In this paper, we consider a cognitive network with multiple primary users and multiple secondary users. Besides, in order to improve the channel transmission rate, we take relay nodes into account. In this scenario, as in [16], we assume all nodes stay static in a given auction period. The TAMSA scheme aims to maximize the social welfare in a spectrum auction, which also encourages both P U s and SU s to participate. To maximize the utilization of spectrum resources, the incentive mechanism should properly assign the matching between the spectrum

resources and the demands of SU s. Trading between P U s and SU s should meet certain requirements to benefit both parties, so P U s need to be incentivized to provide resources, and the demands of SU s should be satisfied.

Table 1. Notations for system model.

Notation    Meaning
P U s       Set of primary users
SU s        Set of secondary users
R_i         The ith relay node, where i ∈ [1, M]
S_i         The ith group, where i ∈ [1, n_i]
s_i^j       The jth secondary user in the ith group, 1 ≤ i ≤ M, 1 ≤ j ≤ n_i
d_i^j(k)    Demand of s_i^j for the kth channel (PU_k), 1 ≤ k ≤ M
b_i^j(k)    The bid of s_i^j for the kth channel
v_i^j(k)    The valuation of s_i^j for the kth channel
A_k         Ask or reserve price of the kth channel
S_i^w       Set of winning secondary users, 1 ≤ w ≤ n_i
R_i^w       Set of winning relay nodes
PU_i^w      Set of winning primary users
p_i^j(k)    The payment of s_i^j for the kth channel
p_c(k)      The clearing price
F_i(k)      S_i(k)'s payment for the kth relay node
P_i(k)      The ith relay node R_i(k)'s payment for PU_k
B_i(k)      The bid of the ith relay node R_i(k) for PU_k
u_i^j       The utility of s_i^j
U_{PU_k}    The utility of PU_k
U_{R_k}     The utility of R_k

The proposed network model is shown in Fig. 1, which is a hierarchical auction consisting of m P U s and ni SU s. The P U s possess M heterogeneous channels, and each primary user has a reserve price Ak , where k ∈ [1, M ], which is the lowest price at which the P U is willing to sell the kth channel. The P U s have different reserve prices Ak for the spectrum, and we assume each relay node can buy at most one channel. In the ith group Si , where i ∈ [1, M ], there are n SU s and Si = {s1i , s2i , . . . , sni }, n ≤ ni . Each sji has a bid or budget bji (k) and a valuation vij (k) for the kth channel P Uk . In order to improve the utilization of spectrum resources, each sji submits its demand for spectrum dji (k) to P Uk . The spectrum resource is allocated according to the specific demands of the SU s.

Fig. 1. Auction model.

We design an incentive mechanism to improve the utilities of P U s and relay nodes. TAMSA is a two-stage hierarchical auction, consisting of two single-round sealed-bid auctions, called the stage I auction and the stage II auction respectively. In stage I, the auction is conducted between relay nodes and the groups of secondary users Si ; in stage II, the auction is conducted between P U s and relay nodes Ri , and the P U s sell their spectrum resources to relay nodes. The relay node Ri (k) gathers the bids and demands from the ith group Si . Then the stage II auction is executed: Ri (k) submits the bid Bi (k) to P Uk , and P Uk gives the reserve price Ak , where k ∈ [1, M ], noting that Bi (k) ≥ Ak . The relay node Ri (k) determines the winners in group Si (k) after gathering the ith group members' bids; the set of winning SU s is denoted by Siw (k), where Siw (k) ⊆ Si , and the gathered bid is Fi (k). We assume that each group pays for at most one relay node at a time, because one relay node serving multiple groups might cause transmission delay. If it wins in this auction, relay node Ri will allocate spectrum resources to Siw (k).
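To summarize the quantities exchanged in the two stages, the following minimal data-model sketch may help; it is our own illustration, not taken from the paper, and all type and field names are ours.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SecondaryUser:
    """s_i^j: reports a bid/budget, a demand and a private valuation per channel k."""
    bid: Dict[int, float]        # b_i^j(k)
    demand: Dict[int, float]     # d_i^j(k)
    valuation: Dict[int, float]  # v_i^j(k)


@dataclass
class RelayNode:
    """R_i: gathers one group's reports and forwards an aggregated bid B_i(k) to a PU."""
    group: List[SecondaryUser] = field(default_factory=list)


@dataclass
class PrimaryUser:
    """PU_k: owns one licensed channel with ask (reserve) price A_k."""
    reserve_price: float
```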

2.2 Problem Formulation

The system will determine the payment of the winners. To achieve fairness, payments of winners should be proportional to the workloads of their demands. The payment of sji (k) is formulated as

$$p_i^j(k) = p_c(k) \cdot d_i^j(k), \quad 1 \le i \le M,\ 1 \le j \le n_i,\ 1 \le k \le M, \tag{1}$$

where pc (k) is the clearing price. Let uji denote the utility of secondary user sji , for each sji ∈ Siw . Accordingly, the utility of sji is defined as

$$u_i^j = \begin{cases} v_i^j(k) - p_i^j(k), & \text{if } s_i^j \in S_i^w \text{ and } p_i^j(k) \le b_i^j(k), \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

What calls for special attention is that the payment of sji (k) should not be higher than the budget bji (k), k ∈ [1, M ], sji ∈ Siw . The relay node Ri (k) calculates the finance Fi (k) collected from the SU s. Hence the utility of relay node Ri is

$$U_{R_i} = \begin{cases} F_i(k) - P_i(k), & \text{if } R_i(k) \in R_i^w, \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$

where Pi (k) is the payment of the relay node to the P U s. In order to encourage spectrum holders to share spectrum resources, each P Uk has a reserve price Ak . The payment Pi (k) of relay nodes should be higher than the reserve price Ak , so the utility of P Uk is defined as

$$U_{PU_k} = \begin{cases} P_i(k) - A_k, & \text{if } PU_k \in PU_k^w \text{ and } R_i \in R_i^w, \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

In this auction, the spectrum owners P Uk allocate spectrum resources to the SU s. The speed of channel transmission is increased by the relay nodes cooperatively.
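For readers who prefer code, Eqs. (1)–(4) translate directly into a few small Python helpers. This is our own illustrative sketch; the function and parameter names are not from the paper.

```python
def su_payment(p_c, demand):
    """Eq. (1): a winning SU pays the clearing price times its own demand."""
    return p_c * demand


def su_utility(valuation, payment, budget, wins):
    """Eq. (2): utility of a secondary user."""
    return valuation - payment if wins and payment <= budget else 0.0


def relay_utility(collected_F, paid_P, wins):
    """Eq. (3): utility of a relay node."""
    return collected_F - paid_P if wins else 0.0


def pu_utility(paid_P, reserve_A, pu_wins, relay_wins):
    """Eq. (4): utility of a primary user."""
    return paid_P - reserve_A if pu_wins and relay_wins else 0.0
```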

2.3 Economic Properties

In this section, we present several economic properties that we would like to achieve. An auction will not be executed until these economic properties are satisfied.

Definition 1 (Truthfulness). An auction is truthful if bidding the true valuation is a dominant strategy, i.e., every participant's utility is maximized by reporting its true valuation, and no bidder can improve its utility by misreporting. In this paper, this means that each sji submits its true valuation to Ri , and each relay node Ri reports its true valuation to the kth primary user P Uk .

Definition 2 (Budget Balance). An auction is budget balanced for the auctioneers if the total payment from buyers is not less than the total revenue paid to sellers. In our mechanism, the stage I auction is conducted in the form of groups, and we ensure the utilities of the auctioneers are nonnegative: the payments that the relay nodes receive from the groups are no less than the amount paid to the P U s.

Definition 3 (Individual Rationality). An auction is individually rational if the utility of each participant is nonnegative. In the TAMSA scheme, the utilities of SU s, relay nodes Ri and P U s, i.e., uji , URi and UP Uk , are nonnegative.

Definition 4 (Computational Efficiency). An auction mechanism is computationally efficient if it terminates in polynomial time. In our auction mechanism, the selection of winning SU s, the matching of P U s and relay nodes, and the computation of the clearing price and payments can all be completed in polynomial time.
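As a quick illustration of how the last three properties can be checked on a simulated outcome, the snippet below (ours, purely illustrative) verifies individual rationality and budget balance for one auction run; it assumes the utilities and payment totals have already been computed elsewhere.

```python
def check_properties(su_utilities, relay_utilities, pu_utilities,
                     collected_F, paid_P, reserves_of_sold_channels):
    """Return (individual_rationality, budget_balance) for one auction outcome."""
    individual_rationality = all(
        u >= 0 for u in list(su_utilities) + list(relay_utilities) + list(pu_utilities))
    # buyers' total payments must cover what relays pay, which must cover the reserves
    budget_balance = sum(collected_F) >= sum(paid_P) >= sum(reserves_of_sold_channels)
    return individual_rationality, budget_balance
```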

3 Two-Stage Auction Mechanism

In this section, we propose a truthful two-stage auction framework called TAMSA for cognitive radio networks, as shown in Fig. 1. TAMSA consists of two sub-auctions and satisfies the following properties: truthfulness, budget balance, individual rationality and computational efficiency.

3.1 Stage I Auction

In this stage, the ni secondary users are randomly divided into multiple groups. The groups submit their bids or budgets to the relay nodes separately. Relay nodes conduct the auction and decide the winning group members virtually. Then relay nodes calculate the payment of each winner and determine the final winners. A relay node will allocate channels to SU s if it obtains spectrum resources in the stage II auction. We first introduce the algorithm to buy the spectrum by group and decide the winners (GBDW); the details are as follows. Firstly, relay node Ri collects the bid vector b1i , b2i , · · · , bni i , demand d1i , d2i , · · · , dni i and valuation vi1 , vi2 , · · · , vini from SU s in Si as previously mentioned. We design an algorithm to calculate the budget vector Fi (k) for P Uk . Then, relay nodes decide the winners in the best cost-performance way and calculate the optimal unit price for each group. The relay node Ri sells at most a 1/2 time fraction to Si for maximizing the revenue. Inspired by the work in [16], we sort the vector of b/d in descending order; then we can get the optimal unit price for group Si , denoted as OPT(b/d),

$$OPT(b/d) = \max_{1 \le i \le |b|} i \cdot \frac{b_i}{d_i}, \tag{5}$$

where |b| denotes the length of the array, and bi and di denote the ith budget and demand separately. The detail of the algorithm is shown in Algorithm 1. It should be noted that the clearing price is extracted from the group to ensure truthfulness. Relay nodes select the maximum integer m by OPT(b/d), and then eliminate the SU s with the smallest budgets and lowest valuations. Fi (k) is the gathered bid from those winning SU s, and P Uk charges Ri (k) less than Fi (k) for trading the kth channel.

In the following example, we show how Algorithm 1 calculates the clearing price and determines the winners. We assume that there are 5 SU s in group i, and their budget and demand vectors are as follows: b = {2, 3, 7, 6, 8}, d = {1, 2, 3, 2.5, 4}, so b/d = {2, 1.5, 2.33, 2.4, 2}, which can be obtained by Algorithm 1. We sort b/d in descending order and calculate OPT(b/d) to get the maximum m; hence we get m = 4 and the clearing price is pc = 8/4 = 2. The payment s1i has to make to the ith relay node is p1i = pc × d1i = 2 × 1 = 2. In the same way, the payments of the other 4 secondary users can be calculated separately, which are 4, 6, 5 and 8. Therefore, the winners in the ith group are s1i , s3i , s4i and s5i , and the amount collected by the ith relay node is 21.
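Algorithm 1 below formalizes this procedure. As a runnable illustration (our own sketch, not the authors' code), the following Python function reproduces the worked example above; variable names are ours, and we use non-strict comparisons in the budget/valuation check so that the numbers match the example (m = 4, pc = 2, winners s1, s3, s4, s5, and a total of 21 collected). The valuations are not given in the example, so we assume values large enough not to bind.

```python
def gbdw(bids, demands, valuations):
    """Sketch of GBDW (Algorithm 1) for one group: clearing price, winners, payments."""
    n = len(bids)  # assumes n >= 2
    # sort SU indices by cost performance b/d in descending order
    order = sorted(range(n), key=lambda j: bids[j] / demands[j], reverse=True)
    # choose the bid-independent m (1 <= m <= n-1) that maximizes m * (b_m / d_m)
    best_m = max(range(1, n),
                 key=lambda m: m * bids[order[m - 1]] / demands[order[m - 1]])
    p_c = bids[order[best_m - 1]] / best_m        # clearing price p_c = b_m / m
    winners, payments = [], {}
    for j in range(n):
        pay = p_c * demands[j]                    # payment proportional to demand, Eq. (1)
        if pay <= bids[j] and pay <= valuations[j]:   # <= matches the paper's worked example
            winners.append(j)
            payments[j] = pay
    return p_c, winners, payments, sum(payments.values())


if __name__ == "__main__":
    b = [2, 3, 7, 6, 8]
    d = [1, 2, 3, 2.5, 4]
    v = [10, 10, 10, 10, 10]   # assumed valuations, large enough not to exclude anyone
    print(gbdw(b, d, v))       # p_c = 2.0, winners 0, 2, 3, 4, total collected 21.0
```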

Algorithm 1. GBDW: Group Buying and Decide Winners
Input: Sorted vector of b/d and the valuation.
Output: The revenue of relay nodes, S_i^w and the payment of secondary users.
1: Let 1 ≤ m ≤ n_i − 1 be a bid-independent integer.
2: Search for the maximum m in b/d that gives the maximum OPT(b/d).
3: p_c = b_i^m / m.
4: S_i^w(k) ← ∅
5: F_i(k) ← ∅
6: for j ← 1 to n_i do
7:     p_i^j(k) ← p_c · d_i^j(k)
8:     if p_i^j(k) < b_i^j(k) and p_i^j(k) < v_i^j(k) then
9:         S_i^w(k) ← S_i^w(k) ∪ s_i^j(k)
10:        F_i(k) ← F_i(k) + p_i^j(k)
11:    end if
12: end for
13: return F_i(k), S_i^w(k)

3.2 Stage II Auction

In this procedure, the auction is conducted between P U s and relay nodes, and relay nodes compete for the idle spectrum resources of P U s. According to previous research, the McAfee auction mechanism cannot be utilized since it only suits the scenario where homogeneous goods are traded [17]. In order to ensure the truthfulness of the auction mechanism and apply it to heterogeneous networks, we design a spectrum resource allocation algorithm, SRA, based on the VCG auction mechanism. The detail of SRA is shown in Algorithm 2. We apply the VCG-based auction mechanism to maximize the social welfare, that is, the total utility of all the participating bidders. A relay node assigns spectrum resources to Siw when it wins the primary user. Relay node Ri needs to pay the reward Pi for winning P Uk , which is calculated by algorithm SRA. We use the bid of relay node Bi (k) and the reserve price Ak to construct a weighted complete bipartite graph, where the weight is (Bi (k) − Ak ). Maximum-Weighted-Matching (MWM) can optimize the total utility of all participators in this auction. To ensure truthfulness, we apply the VCG-based auction to calculate the payments of relay nodes. The details are as follows.

Algorithm 2. SRA: Spectrum Resource Allocation
Input: B_i(k), A_k, for all 1 ≤ i ≤ n_i and 1 ≤ k ≤ M.
Output: R^w, PU^w, P_i.
1: W ← ∅, E* ← ∅, P_i ← ∅  // W is the edge set in the matching graph.
2: Create a weighted complete bipartite graph G = (R, PU, W, w) with the weight w(R_i, PU_k) = B_i(k) − A_k if B_i(k) ≥ A_k.
3: E* ← Maximum-Weighted-Matching(W).
4: for each (R_i, PU_k) ∈ E* do
5:     R^w ← R^w ∪ {R_i}, PU^w ← PU^w ∪ {PU_k}
6:     W' ← W \ (R_i, PU_k), R' ← R \ {R_i}
7:     G_{−i} ← (R', PU, W', w)
8:     E*_{−i} ← Maximum-Weighted-Matching(W')
9:     P_i ← w(E*_{−i}) − (w(E*) − w(R_i, PU_k)) + A_k
10: end for
11: return R^w, PU^w, P_i.
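The sketch below illustrates the matching-and-payment logic of SRA in Python. It is our own illustration, not the authors' implementation; we use SciPy's rectangular assignment solver as the Maximum-Weighted-Matching step, and all function names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def max_weight_matching(weights, feasible):
    """Maximum-weight matching between relays (rows) and PUs (columns).
    Infeasible pairs (bid below reserve) contribute zero weight and are dropped."""
    w = np.where(feasible, weights, 0.0)
    rows, cols = linear_sum_assignment(w, maximize=True)
    pairs = [(int(r), int(c)) for r, c in zip(rows, cols) if feasible[r, c]]
    return float(sum(w[r, c] for r, c in pairs)), pairs


def sra(B, A):
    """Sketch of SRA (Algorithm 2): VCG-style payments over a maximum-weight matching.
    B[i][k] is relay i's bid for channel k, A[k] is PU k's reserve price."""
    B, A = np.asarray(B, dtype=float), np.asarray(A, dtype=float)
    weights = B - A[None, :]          # edge weight w(R_i, PU_k) = B_i(k) - A_k
    feasible = B >= A[None, :]        # an edge exists only if B_i(k) >= A_k
    w_star, matching = max_weight_matching(weights, feasible)
    payments = {}
    for (i, k) in matching:
        keep = [r for r in range(B.shape[0]) if r != i]   # remove relay i and re-match
        if keep:
            w_minus_i, _ = max_weight_matching(weights[keep, :], feasible[keep, :])
        else:
            w_minus_i = 0.0
        # P_i = w(E*_{-i}) - (w(E*) - w(R_i, PU_k)) + A_k
        payments[i] = w_minus_i - (w_star - weights[i, k]) + A[k]
    return matching, payments


# Tiny illustration with two relays and two PUs (the numbers are arbitrary):
# matching, payments = sra(B=[[12.0, 15.0], [14.0, 11.0]], A=[10.0, 11.0])
```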

4 Theoretical Analysis

In this section, we prove that TAMSA satisfies truthfulness, individual rationality, budget balance and computational efficiency.

Theorem 1. TAMSA is truthful in the network.

Proof. In the following, we focus on proving the dominant strategy for SU s. A buyer sji (k) ∈ Si will submit its true bid and demand, because they reflect its true demand for spectrum resources.

For sji ∈ Si , it cannot improve its utility by changing its valuation and budget, by the first branch of Eq. (2). Besides, inspired by [18], the clearing price pc is randomly generated from the optimal price ratio. For Ri ∈ Rw , it can obtain the maximum utility max(Bi (k) − Ak ) if it gets the spectrum resource in this auction. If Ri ∉ Rw , it will fail during this auction and cannot get the spectrum resource, because (Bi (k) − Ak ) < 0. If relay node Ri submits an untruthful bid, the result will not change when Bi (k) < Fi (k). When Bi (k) > Fi (k), the utility of the relay node URi (k) = Fi (k) − Pi = Fi (k) − Bi (k) ≤ 0 if it submits an untruthful bid. Therefore, neither relay nodes nor SU s can improve their utility by submitting untruthful bids.

Theorem 2. TAMSA is individually rational and budget balanced.

Proof. For SU s, the utility of sji (k) is calculated by vij (k) − pji (k) > 0, for all sji ∈ Siw , and we have proved the individual rationality of SU s. Then we prove that relay nodes are also individually rational. For relay node Ri , the minimum payment price for relay node Ri (k) is Ak for all Ri ∈ Riw , with Bi (k) ≤ Fi (k) and P Uk ∈ P U w . Besides, the utility of primary user UP Uk = Fi (k) − Pi ≥ Bi (k) − Ak > 0. Therefore, both buyers and sellers are willing to participate in the auction. They can all gain nonnegative utility, and the TAMSA mechanism is individually rational and budget balanced.

Theorem 3. TAMSA is computationally efficient.

Proof. We now analyze the time complexity of TAMSA. In Algorithm 1, the time complexity of the sorting process is O(ni log ni ). In Algorithm 2, it takes O(max{ni , M }^3 ) time to apply the maximum-weighted-matching algorithm, and the time complexity of computing the payments is O(ni · max{ni , M }^3 ). Hence, TAMSA is computationally efficient.

5 Numerical Results

In this section, we evaluate the performance of TAMSA. For the heterogeneous network structure we designed, this is the first incentive scheme proposed for the specific demands of secondary users, and there are no existing auction schemes to compare with directly. Instead, we design an upper bound (Upper) and a random algorithm (Random) for TAMSA to compare with. Meanwhile, we also simulate the algorithms TASG and TACC for comparison. The algorithm Upper uses the bids of buyers as the payment to maximize the revenue. In TASG and TACC, secondary users are divided into two sets randomly and the winning set is selected using the other side. TASG is based on the VCG mechanism, and TACC sorts the reserve prices Ak of primary users in ascending order and the budgets Bi (k) of relay nodes in descending order. The experiment tool is MATLAB, and the results are averaged over 100 repetitions. We consider a heterogeneous network as shown in Fig. 1. We assume that the number of P U s is M = 5, there are 5 relay nodes participating in this auction, and the number of SU s ni varies from 20 to 120 with an increment of 20. We assume that the valuations of secondary users vij (k) and budgets bji (k) follow uniform distributions, with ranges denoted as U (50, 150) and U (5, 10) respectively. The reserve price Ak complies with U (10, 20), following [15–18].
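The parameter generation described above can be reproduced with a few lines of NumPy (our sketch; the random seed and the way instances are fed to the algorithms are assumptions, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(2018)      # seed is arbitrary
M = 5                                  # number of PUs and of relay nodes
for n_su in range(20, 121, 20):        # number of SUs varies from 20 to 120 in steps of 20
    valuations = rng.uniform(50, 150, size=(n_su, M))   # v_i^j(k) ~ U(50, 150)
    budgets = rng.uniform(5, 10, size=(n_su, M))        # b_i^j(k) ~ U(5, 10)
    reserves = rng.uniform(10, 20, size=M)              # A_k ~ U(10, 20)
    # ...run TAMSA, Random, TASG and TACC on this instance; average over 100 repetitions
```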

5.1 Simulation Results

We first investigate the running time of TAMSA, and the results are shown in Figs. 2 and 3. From Fig. 2, we can see that the running time is no more than 0.35 s even if the number of SU s becomes large, i.e., when there are 120 SU s. From Fig. 3, we can see that the algorithm Random runs fastest, since Random selects the winning secondary users Siw randomly.

Fig. 2. Running time of TAMSA (a).

Fig. 3. Running time of TAMSA (b).

For the TACC auction mechanism, the reserve prices of primary users Ak are sorted in ascending order, and the budgets of relay nodes Bi (k) are sorted in descending order to guarantee the utility of P U s. Besides, TACC needs to match every primary user and relay node, so the algorithm runs the slowest. The running times of algorithms TAMSA and TASG are not large, because they use the maximum-weighted-matching algorithm to complete the matching between the winning P U s and relay nodes.

Next, to validate Theorem 2 regarding individual rationality and budget balance of TAMSA, we show the truthfulness in Fig. 4. In this auction, the payment Pi (k) of relay nodes is not higher than the amount Fi (k) collected from SU s, and each winning primary user P Uiw receives a payment not less than its reserve price Ak from the auctioneer. From the experimental results in Fig. 4, we can see that the utility remains nonnegative when relay nodes submit truthful bids. But when a relay node submits an untruthful bid, its utility rapidly decreases and will continue to be negative.

Figure 4 depicts the difference in utility. Relay nodes submit truthful bids when the bid of the relay node is less than 50. When the bid is greater than 50, the difference between truthful and untruthful bids is presented. The utilities of relay nodes and P U s are nonnegative when relay nodes submit truthful bids, because the bid of the relay node is less than the amount collected from SU s and greater than the reserve price of the P U s, i.e., Bi (k) ≤ Fi (k) and Bi (k) ≥ Ak . The utility of a relay node is negative, that is, Bi (k) > Fi (k), if it submits an untruthful bid. In summary, as seen in Fig. 4, the utility of relay nodes cannot be improved by submitting untruthful bids.

Fig. 4. Truthfulness of TAMSA.

Fig. 5. Average utility of PUs with the number of SUs.

Figure 5 shows how the utility of primary users UP Uk varies with the number of SU s. With the increasing number of SU s, the average utility of P U s achieved by the five algorithms increases gradually. On average, the proposed algorithm TAMSA improves the utility of P U s by 256% compared with the algorithm Random. TASG is about 217% better than Random, and TACC achieves about 156% more utility than Random on UP Uk . TAMSA is further improved by 28.33% and 78.65% over TASG and TACC in terms of average utility of P U s, respectively. That is because both TAMSA and TASG apply the maximum-weighted-matching algorithm to match P U s and relay nodes to ensure the maximum benefit. Besides, both TAMSA and TASG use the VCG-based auction mechanism to ensure the truthfulness of the algorithm. The difference between TASG and TAMSA is that TAMSA selects the winning set of SU s with the optimal cost performance, while TASG selects the winning set of one subset of SU s using the bids of the other subset. The optimal cost performance can enhance the revenue of P U s. In TACC, although the utility of P U s can be increased, its maximization cannot be guaranteed.

Fig. 6. Average utility of Relay nodes with the number of SUs.

Fig. 7. Average utility of SUs with the number of SUs.

Figure 6 depicts the average utility of relay nodes with a varying number of SU s. We can see that TAMSA outperforms Random by about 10 times on average, while TASG and TACC are about 7 times and 6.6 times better than Random, respectively. TAMSA is further improved by 44.59% and 64.22% over TASG and TACC in terms of average utility of relay nodes, respectively. The reason is that both TAMSA and TASG use the VCG auction mechanism to calculate the payments Pi (k) of relay nodes. In Algorithm 2, we see that the payment of a relay node is effectively reduced on the premise of guaranteeing the primary user's revenue, so the utility of relay nodes is improved.

Figure 7 shows the relationship between the average utility of SU s and the number of SU s. The average utility of SU s in TAMSA outperforms Random by 213%, TASG improves the average utility of SU s by up to 181% over Random, and TACC achieves about 115% more utility than Random. TAMSA is improved by 16.99% and 85.73% over TASG and TACC in terms of average utility of SU s, respectively. That is because TAMSA selects the winning set Siw with the optimal cost performance, and the payment of SU s is calculated according to their specific demands, so TAMSA effectively improves the utility of SU s. Algorithms TASG and TACC calculate the payment of one subset of SU s using the bids of the other subset. TASG adopts the optimal single-price auction to reduce the payment of SU s. In TACC, the payment of SU s is the average value of the winning SU s.

From the above experiments, we can see that TAMSA is suitable for heterogeneous networks, where the utilities of the participants can be maximized at the same time. Algorithm TAMSA gains higher social welfare than the algorithms Random, TASG and TACC. Hence, TAMSA can be deployed in real situations, and it can effectively improve the utilization of spectrum resources.

6 Conclusion

In this paper, we have proposed a two-stage truthful auction mechanism for spectrum allocation (TAMSA) in cognitive radio networks with multiple primary users, multiple secondary users and relay nodes. We have investigated an incentive mechanism to encourage spectrum holders to share their idle spectrum resources and to encourage cooperative data transmission, improving the utilization of the spectrum resources. TAMSA is a two-stage auction mechanism. In the first stage, SU s submit budgets and valuations for spectrum resources to relay nodes; relay nodes calculate the payments of SU s and determine the winning set Siw . In the second stage, relay nodes submit bids to P U s to compete for spectrum resources. We have proved that TAMSA satisfies truthfulness, individual rationality and computational efficiency. Extensive simulation results show that TAMSA outperforms the random algorithm by 256% in terms of average utility of P U s. TAMSA is able to improve the average utility of SU s and relay nodes significantly, by up to 213% and 10 times respectively. The performance of TAMSA is further improved by 28.33% and 78.65% in terms of average utility of P U s over TASG and TACC, respectively. Numerical results validated our theoretical analysis and demonstrated the improvement in auction mechanism efficiency.

Acknowledgment. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702115 and 61672171, the Natural Science Foundation of Guangdong, China under Grant No. 2018B030311007, and the Major R&D Project of the Educational Commission of Guangdong under Grant No. 2016KZDXM052. This work was also supported by the China Postdoctoral Science Foundation under Grant No. 2017M622632.

References

1. Zheng, Z., Wu, F., Tang, S., et al.: AEGIS: an unknown combinatorial auction mechanism framework for heterogeneous spectrum redistribution in noncooperative wireless networks. IEEE/ACM Trans. Netw. 24(3), 1919–1932 (2016)
2. Zhu, Y., Li, B., Li, Z., et al.: Truthful spectrum auction design for secondary networks. In: INFOCOM, pp. 873–881. IEEE, Orlando, FL, USA (2012)
3. Chen, L., Huang, L., Xu, H., et al.: Optimal channel allocation for multi-PU and multi-SU pairs in underlay cognitive radio networks. Int. J. Ad Hoc Ubiquitous Comput. 27(1), 19–33 (2018)
4. Wang, X., Huang, L., Xu, H., et al.: Truthful auction for resource allocation in cooperative cognitive radio networks. In: 24th International Conference on Computer Communication and Networks, pp. 1–8. IEEE, Las Vegas, NV, USA (2015)
5. Wang, X., Huang, L., Xu, H., et al.: Social welfare maximization auction for secondary spectrum markets: a long-term perspective. In: 13th IEEE International Conference on Sensing, Communication, and Networking, pp. 1–9. IEEE, London, UK (2016)
6. Shen, F., Li, D., Lin, P.H., et al.: Auction based spectrum sharing for hybrid access in macro-femtocell networks under QoS requirements. In: IEEE International Conference on Communications, pp. 3335–3340. IEEE, London, UK (2015)
7. Wang, H., Liu, Z., Cheng, Z., et al.: Maximization of link capacity by joint power and spectrum allocation for smart satellite transponder. In: 23rd Asia-Pacific Conference on Communications, pp. 1–6. IEEE, Perth, WA, Australia (2017)
8. Jia, J., Zhang, Q., Zhang, Q., et al.: Revenue generation for truthful spectrum auction in dynamic spectrum access. In: 10th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 3–12. ACM, New Orleans, Louisiana, USA (2009)
9. Liu, Y., Tao, M., Huang, J.: An auction approach to distributed power allocation for multiuser cooperative networks. IEEE Trans. Wirel. Commun. 12(1), 237–247 (2012)
10. Shi, W., Zhang, L., Wu, C., et al.: An online auction framework for dynamic resource provisioning in cloud computing. IEEE/ACM Trans. Netw. 24(4), 2060–2073 (2016)
11. Feng, Z., Zhu, Y., Zhang, Q., et al.: TRAC: truthful auction for location-aware collaborative sensing in mobile crowdsourcing. In: INFOCOM, pp. 1231–1239. IEEE, Toronto, ON, Canada (2014)
12. Wu, F., Vaidya, N.: A strategy-proof radio spectrum auction mechanism in noncooperative wireless networks. IEEE Trans. Mob. Comput. 12(5), 885–894 (2013)
13. Lee, C., Wang, P., Niyato, D.: A real-time group auction system for efficient allocation of cloud internet applications. IEEE Trans. Serv. Comput. 8(2), 251–268 (2015)
14. Lin, P., et al.: Groupon in the Air: a three-stage auction framework for spectrum group-buying. In: INFOCOM, pp. 2013–2021. IEEE, Turin, Italy (2013)
15. Advaita, A., Gali, M.M., Chu, T.M.C., et al.: Outage probability of MIMO cognitive cooperative radio networks with multiple AF relays using orthogonal space-time block codes. In: Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 84–89. IEEE, Rome, Italy (2017)
16. Yang, D., Xue, G., Zhang, X.: Group buying spectrum auctions in cognitive radio networks. IEEE Trans. Veh. Technol. 66(1), 810–817 (2017)
17. Yang, D., Fang, X., Xue, G.: Truthful auction for cooperative communications. In: IEEE International Conference on Communications, pp. 1–10. IEEE, Ottawa, ON, Canada (2011)
18. Chen, L., Wu, J., Zhang, X.X., et al.: TARCO: two-stage auction for D2D relay aided computation resource allocation in HetNet. IEEE Trans. Serv. Comput. PP(99), 1 (2017)

QoS-Driven Service Matching Algorithm Based on User Requirements

Mengying Guo(B) and Xudong Yang

School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
{mengying 1204,xdyang}@bupt.edu.cn

Abstract. Quality of Service (QoS) is an important factor that should be considered in service matching. There are two problems in most existing solutions. Firstly, most QoS models are static models described by determinate values or probability distributions, ignoring the impact of the time factor; however, most QoS attributes, such as response time and reliability, are time-dependent. Secondly, the service selection criteria of most QoS-driven service matching algorithms are based on service performance, while user requirements and the load of services are not considered. In this paper, we propose a Time-Segmented QoS Model (TSQM) to model QoS dynamically. Based on this model, a Service Matching algorithm based on user QoS request and Priority (QPSM) is proposed, in which the priority of user requests is used to control the load of the services. Simulation results show that the algorithm achieves a higher response rate and a better load-balancing effect.

Keywords: Service matching · QoS · Dynamic QoS model · Service model · Load balancing

1 Introduction

SOA (Service-Oriented Architecture) has provided a possibility for IoT (Internet of Things) systems to build distributed applications from loosely coupled services [1]. In this way, IoT services can be provided to different systems as web services. With the number of IoT services increasing rapidly, selecting services among the numerous registered services has become difficult [2]. The characteristics of IoT services determine that service function and service quality must be taken into account simultaneously when performing service matching. QoS (Quality of Service), measured by criteria such as delay, response time, reliability, availability and cost [3], has been a crucial factor in selecting services from numerous services with the same functions. The results of service matching depend not only on the matching degree to user requirements but also on the QoS attributes of the service itself. QoS-aware service selection is a complex multi-criterion decision problem, which is NP-hard, and it is still a challenging research topic [4].


There have been many reasonable selection models and effective matching algorithms for QoS-aware service selection. In these models and algorithms, service matching is treated as an optimization problem over service selection whose objective is to find the best service. However, the fact that the actual requirements of users are not considered is unacceptable for some users, because the matched services may have the best overall performance but still fail to satisfy the user requirement for a particular QoS attribute. Another problem of these models is that the QoS attributes are represented only with a single-valued or probabilistic model, and the influence of time is not taken into account. Because service QoS attributes change dynamically with time and user load, a static model cannot accurately represent the QoS values, and it therefore seriously affects the accuracy of the matching results. In this paper, by splitting time and dynamically modeling each time period, we propose a Time-Segmented QoS Model (TSQM) that can represent QoS attributes more accurately. Based on our model, a Service Matching algorithm based on user QoS request and Priority (QPSM algorithm) is proposed. In this algorithm, the single QoS performance and the comprehensive QoS performance provided by services are considered simultaneously, and the load of each service is controlled according to priority so that the user load can be balanced across services. The rest of the paper is organized as follows. Section 2 introduces the related work on service matching technology. Section 3 details the TSQM model, and Section 4 presents the QPSM algorithm. Section 5 shows simulation results that demonstrate the feasibility and effectiveness of the QPSM algorithm, and Section 6 concludes this paper.

2 Related Work

QoS-based service matching can usually be divided into two relatively independent processes, service selection and service ranking [5]. Service selection ensures the most basic functional requirements and QoS requirements of users or systems; service ranking is a further optimization on this basis. The models and algorithms for service selection can be divided into service-function-based selection and service-quality-based selection according to the selection criteria. In service-function-based models, concepts such as semantics or ontology are used to build service models [6,7]. Service-quality-based selection can be divided into single QoS performance selection models and comprehensive QoS performance selection models [5], and also into single-value models and probability-based selection models [8–10]. Service function is one of the requirements that should be satisfied in the process of service matching. The fundamental purpose of service matching is to select the most appropriate service for the user based on the user's service request. More and more models describe and define services based on the semantic web and ontologies in order to understand the functional requirements of users more intelligently. A new resource model describing IoT resources in multiple dimensions was proposed in [6]. Based on this model, a resource matching algorithm that
selects suitable resources according to the similarity between semantically matched resources was also proposed. In [7] the authors proposed a QoS-based dynamic service composition method for the semantic IoT: according to a context-added QoS ontology, after dynamic semantic annotation of the services in the semantic internet of things, candidate service sets are dynamically selected and combined to provide more accurate services. Service quality is another requirement that should be satisfied in the process of service matching. The QoS attributes of services significantly affect the comprehensive evaluation of services, so QoS-based service selection is a viable scheme for IoT service selection. In most studies, such as [8,9], single-valued or probabilistic models are used to model each dimension of QoS, and the optimal services are selected by comparing service performance. In the process of QoS-aware service matching, not only the overall performance of the service but also each user requirement on QoS should be considered. In [10] the authors proposed a service discovery algorithm based on a multi-stage service matching algorithm, in which each QoS attribute is assigned a different weight, the QoS constraints are determined according to user requests, and finally the most suitable service is selected. The QoS of web services changes dynamically with factors such as network conditions, user load and time. A static model constructed solely from historical data cannot accurately represent these dynamic changes; therefore, the time factor must be considered when modeling.

3 Service Model

In a complete process of service matching, both the function and the quality of a service should be taken into consideration. Assume that the virtual service set S is known and that all services in S can satisfy the functional requirements requested by the user. Next, the QoS modeling and service matching are discussed further.

3.1 Time-Segmented QoS Model Definition

The TSQM model is a time-segmented QoS-based model. According to the changes of QoS attributes over time, the QoS change period can be divided into time periods with different intervals, and the QoS model can be constructed separately in each time period.

Definition. The TSQM model for a service can be represented as a triple (ET, P, QM), where

• $ET = [T_0, T_0 + T)$ is the effective period of QoS, $T_0$ is the start time of the effective period, and $T$ is the period with which the QoS attributes are updated.
• $P = \{P_1, P_2, \cdots, P_N\}$ is the set of time periods of $ET$, with $P_i = [t_i, t_{i+1})$ and $\bigcup_i P_i = ET$.


• $QM = \langle Q_1, Q_2, \cdots, Q_N \rangle$ is a sequence of QoS models, where $Q_i = (f_{DELAY_i}, f_{REST_i}, f_{REL_i}, f_{USA_i}, f_{COST_i})$ is the QoS vector of time period $P_i$, and $f_{DELAY_i}, f_{REST_i}, f_{REL_i}, f_{USA_i}, f_{COST_i}$ are the probability distribution functions of delay, response time, reliability, availability, and cost.

Given a service, its QoS model at time $t$ can be represented as $Q(t) = (f_{DELAY_t}, f_{REST_t}, f_{REL_t}, f_{USA_t}, f_{COST_t})$, where $t \in [t_i + kT, t_{i+1} + kT)$, $k = 0, 1, \cdots$

The TSQM model shows that the QoS of a service changes with time. The model can be flexibly extended according to different user requirements, and the number of QoS attributes in each time period can be one or more. In this paper, delay, response time, reliability, availability and cost are selected as the QoS attributes.
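To make the triple concrete, the following is a minimal Python sketch of a TSQM record and the lookup of Q(t). It is an illustration, not the authors' implementation: the class layout, field names, and the use of already-collapsed per-period values (rather than full distribution functions) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TSQM:
    """Time-Segmented QoS Model (ET, P, QM) for one service."""
    t0: float                     # start of the effective period ET = [t0, t0 + T)
    T: float                      # length of ET, i.e. the QoS update period
    boundaries: List[float]       # period boundaries t_0 < t_1 < ... < t_N spanning ET
    qm: List[Dict[str, float]]    # Q_i per period P_i = [t_i, t_{i+1}), e.g. 999-criterion values

    def q(self, t: float) -> Dict[str, float]:
        """Q(t): the QoS vector of the period containing t (the model repeats every T)."""
        local = self.t0 + (t - self.t0) % self.T
        for i in range(len(self.boundaries) - 1):
            if self.boundaries[i] <= local < self.boundaries[i + 1]:
                return self.qm[i]
        raise ValueError("t falls outside the effective period")

# Example with two periods and illustrative attribute values.
svc = TSQM(t0=0.0, T=24.0, boundaries=[0.0, 8.0, 24.0],
           qm=[{"response_time": 0.2, "reliability": 0.99},
               {"response_time": 0.6, "reliability": 0.97}])
print(svc.q(30.0))   # hour 30 maps back to hour 6 of the cycle -> first period
```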

3.2 Detailed Description of the Model

QoS Model. A QoS model of a service contains k QoS attributes. These attributes can be the five non-functional attributes defined in the TSQM model, and they can also be extended according to user requirements. The QoS of service $S_i$ corresponds to a set of QoS vectors consisting of a probability distribution function for each time period. In order to compare QoS performance more easily, the probability distribution function in each time period is converted into a determined value using the 999 criterion (choose a value that 99.9% of the data satisfies as the QoS value of the current time period), i.e., $f_{QoS_i} \to q_i$. For clarity, the QoS attributes mentioned below default to the QoS attributes within a certain time period. The QoS attributes of service $S_i$ can be represented as a vector $Q_i = (q_{i1}, q_{i2}, \cdots, q_{ik})$, where $q_{ik}$ is the value converted from the probability distribution function of the k-th QoS attribute. We assume that the virtual service set consists of n candidate services, $S = \{S_1, S_2, \cdots, S_n\}$, and their QoS attributes can be represented as an $n \times k$ matrix:

$$M = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1k} \\ q_{21} & q_{22} & \cdots & q_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \cdots & q_{nk} \end{bmatrix} \quad (1)$$

Because of the differences in the ranges of QoS values and in their effects on the comprehensive service performance, the QoS values should be normalized by min-max normalization [11]. According to their impact on the comprehensive performance of the service, QoS attributes can be classified into positive effect attributes and negative effect attributes. The larger the value of a positive effect attribute (such as reliability, availability or reputation) or the smaller the value of a negative effect attribute (such as cost or response time), the better the overall performance of the service.


Assuming that the range of $q_i$ is $[\min(q_i), \max(q_i)]$, positive and negative effect attributes are normalized by formulas (2) and (3), respectively:

$$q_i' = \begin{cases} \dfrac{q_i - \min(q_i)}{\max(q_i) - \min(q_i)}, & \max(q_i) - \min(q_i) \neq 0 \\ 1, & \max(q_i) - \min(q_i) = 0 \end{cases} \quad (2)$$

$$q_i' = \begin{cases} \dfrac{\max(q_i) - q_i}{\max(q_i) - \min(q_i)}, & \max(q_i) - \min(q_i) \neq 0 \\ 1, & \max(q_i) - \min(q_i) = 0 \end{cases} \quad (3)$$
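As a concrete illustration of formulas (2) and (3), the hedged NumPy sketch below normalizes an n × k QoS matrix M column by column; the boolean list marking which attributes are positive-effect, and the example values, are assumptions made for the example only.

```python
import numpy as np

def normalize_qos(M, is_positive):
    """Min-max normalize an n x k QoS matrix column by column.

    is_positive[j] is True for positive effect attributes (formula (2))
    and False for negative effect attributes (formula (3)).
    """
    M = np.asarray(M, dtype=float)
    lo, hi = M.min(axis=0), M.max(axis=0)
    span = hi - lo
    out = np.empty_like(M)
    for j in range(M.shape[1]):
        if span[j] == 0:                  # max(q) - min(q) = 0 -> value 1
            out[:, j] = 1.0
        elif is_positive[j]:              # formula (2)
            out[:, j] = (M[:, j] - lo[j]) / span[j]
        else:                             # formula (3)
            out[:, j] = (hi[j] - M[:, j]) / span[j]
    return out

# Example: 3 services, 2 attributes (reliability: positive, response time: negative).
print(normalize_qos([[0.9, 120], [0.8, 100], [0.95, 150]], [True, False]))
```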

All QoS values are distributed in [0, 1] after normalization. The comprehensive performance of a service is enhanced with the increase of each QoS value; that is, the larger the QoS value, the better the service performance.

Service Request. A service request sent from the user to the service platform when service discovery is performed can be represented as $Req = \{Q_{req}, M_{req}\}$, where $Q_{req} = (\alpha_1, \alpha_2, \cdots, \alpha_k)$ is a QoS request vector and $\alpha_1, \alpha_2, \cdots, \alpha_k$ are the user's expected values for the k attributes $q_{i1}, q_{i2}, \cdots, q_{ik}$. The values in the request vector are normalized by formula (2) or (3), giving $\alpha_1', \alpha_2', \cdots, \alpha_k'$, so that $Q_{req}$ is converted to $Q_{req}'$. The priority vector is $M_{req} = (m_1, m_2, \cdots, m_j)$, $j \in \{1, 2, \cdots, k\}$, where j means that the j-th attribute in $Q_{req}$ is a priority attribute of the request Req. $M_{req}$, which may include one or more priority attributes, is defined by the user requirements and fully reflects the user's preference over the QoS attributes of the target service. The user requirement emphasizes the importance of the j-th attribute $q_j'$ of the target service, and $q_j'$ is expected to satisfy the requirement of $\alpha_j'$ in $Q_{req}'$ as much as possible, i.e., $q_j' \geq \alpha_j'$.

Priority. The priority of a service request depends on $\alpha_j'$ in the normalized QoS request vector $Q_{req}'$. Suppose h is the user's expected value of a certain QoS attribute, i.e., $h = \alpha_j'$. The priority of the request is calculated by formula (4):

$$Prior(h) = \begin{cases} 1, & h \in [0, T_1) \\ 2, & h \in [T_1, T_2] \\ 3, & h \in (T_2, 1] \end{cases} \quad (4)$$

$T_1$ and $T_2$ are single-performance thresholds used to determine the priority of the service request. The values of $T_1$ and $T_2$ are in the range [0, 1], with $T_1 \leq T_2$. The priority of the service request Req is divided into three levels, 1, 2 and 3, representing low, medium and high priority, respectively. According to the request priority, different matching strategies are selected. The matching strategy set can be represented as $MS = \{MS_H, MS_M, MS_L\}$, where $MS_H$, $MS_M$ and $MS_L$ indicate the matching strategies of the different priorities.
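Formula (4) transcribes directly into a few lines of Python; the threshold values in the example call are assumptions for illustration, not values fixed by the paper.

```python
def prior(h: float, t1: float, t2: float) -> int:
    """Priority of a request given the normalized expected value h of its
    priority attribute (formula (4)); requires 0 <= t1 <= t2 <= 1."""
    if 0.0 <= h < t1:
        return 1          # low priority
    if t1 <= h <= t2:
        return 2          # medium priority
    return 3              # high priority (h in (t2, 1])

print(prior(0.9, t1=0.2, t2=0.8))   # -> 3
```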


QoS Performance Evaluation Value. The QoS performance evaluation value is classified into the request performance evaluation value $QoS_{req}$ and the service performance evaluation value $QoS_{ser}$. $QoS_{req}$ is derived from the user's expected QoS values and can be represented as

$$QoS_{req} = \|Q_{req}'\|^2 = \alpha_1'^2 + \alpha_2'^2 + \cdots + \alpha_k'^2 = \sum_{i=1}^{k} \alpha_i'^2,$$

where $Q_{req}' = (\alpha_1', \alpha_2', \cdots, \alpha_k')$ is the QoS request vector after normalization. The $QoS_{ser}$ of service $S_i$ can be represented as

$$QoS_{ser}(i) = \|Q_i'\|^2 = q_{i1}'^2 + q_{i2}'^2 + \cdots + q_{ik}'^2 = \sum_{j=1}^{k} q_{ij}'^2,$$

where $Q_i' = (q_{i1}', q_{i2}', \cdots, q_{ik}')$ is the QoS attribute vector after normalization.

The Utility of Service Matching. $U(i)$ is the utility of the service matching algorithm when service $S_i$ is selected as the target service satisfying the request Req. It is classified into the single performance utility value $U_S(i)$ and the comprehensive service utility value $U_C(i)$. $U_S(i)$ is the ratio of a certain QoS attribute of Req to that of $S_i$, given by formula (5). $U_C(i)$ is the ratio of the overall performance evaluation value of Req to that of $S_i$, given by formula (6). $U(i)$ is the weighted sum of $U_S(i)$ and $U_C(i)$, given by formula (7).

$$U_S(i) = \begin{cases} h / q_{ij}', & h < q_{ij}' \\ q_{ij}' / h, & h \geq q_{ij}' \end{cases} \quad (5)$$

$$U_C(i) = \begin{cases} QoS_{req} / QoS_{ser}(i), & QoS_{req} < QoS_{ser}(i) \\ QoS_{ser}(i) / QoS_{req}, & QoS_{req} \geq QoS_{ser}(i) \end{cases} \quad (6)$$

$$U(i) = \mu \times U_S(i) + (1 - \mu) \times U_C(i) \quad (7)$$

The weighting factor $\mu$ lies in the range [0, 1]. The impact of $U_S(i)$ and $U_C(i)$ on $U(i)$ can be adjusted through $\mu$. In the matching process, the greater the utility, the better the service matches the user requirements.
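The evaluation values and the matching utility of formulas (5)–(7) can be computed as sketched below; the function names are illustrative, and the inputs are assumed to be already-normalized, non-zero values in (0, 1].

```python
def qos_eval(vec):
    """Sum of squared normalized QoS values: QoS_req or QoS_ser(i)."""
    return sum(v * v for v in vec)

def utility(h, q_ij, q_req, q_srv, mu):
    """U(i) = mu * U_S(i) + (1 - mu) * U_C(i), formulas (5)-(7).

    h, q_ij: requested and offered value of the priority attribute.
    q_req, q_srv: normalized request and service QoS vectors.
    """
    u_s = h / q_ij if h < q_ij else q_ij / h                   # formula (5)
    e_req, e_srv = qos_eval(q_req), qos_eval(q_srv)
    u_c = e_req / e_srv if e_req < e_srv else e_srv / e_req    # formula (6)
    return mu * u_s + (1 - mu) * u_c                           # formula (7)

print(utility(h=0.8, q_ij=0.9, q_req=[0.8, 0.7], q_srv=[0.9, 0.6], mu=0.2))
```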

4 Service Matching Algorithm

QoS-based service matching algorithms can be roughly classified into two methods: single-QoS performance matching and overall-QoS performance matching. In the QPSM algorithm, service selection and matching are performed according to user-defined priority attributes and QoS, so the service most suitable to the user requirements can be matched. The QPSM algorithm is presented as Algorithm 1.


Algorithm 1. QoS-based service matching algorithm (QPSM)
Input: S // service set; Req // user requirements
Output: Ser_match // all services that suit the user
  Initialize Req, S and its corresponding QoS attribute matrix M;
  Determine the priority of the request;
  Compose the priority service set Ser_prior: q'_ij >= h;
  Compose the candidate service set Ser_wait: QoS_ser(i) >= QoS_req;
  while Req is not empty do
    if Prior(h) = 3 then
      if Ser_prior = ∅ then Ser_match <- null
      else Ser_match <- the largest QoS_ser(i) from Ser_prior
    if Prior(h) = 1 then
      if Ser_wait = ∅ then Ser_match <- the largest QoS_ser(i) from S
      else Ser_match <- the minimum QoS_ser(i) from Ser_wait
    if Prior(h) = 2 then
      if Ser_prior ≠ ∅ and Ser_wait = ∅ then Ser_match <- the largest QoS_ser(i) from Ser_prior
      if Ser_prior = ∅ and Ser_wait ≠ ∅ then Ser_match <- the largest q'_ij from Ser_wait
      if Ser_prior = ∅ and Ser_wait = ∅ then Ser_match <- the largest U(i) from S
      if Ser_prior ≠ ∅ and Ser_wait ≠ ∅ then
        if Ser_inter = Ser_prior ∩ Ser_wait ≠ ∅ then Ser_match <- the largest U(i) from Ser_inter
        else if Ser_union = Ser_prior ∪ Ser_wait ≠ ∅ then Ser_match <- the largest U(i) from Ser_union
  return Ser_match;

The main idea of the algorithm is to select the matching strategy corresponding to the priority of the user request and then select the service most suitable for the user. The priority of a user request is determined by the specified priority attributes, and different matching strategies are adopted according to that priority. When the request priority is high, the target service must completely satisfy the priority attributes of the user requirements. When the request priority is low, the service with the smallest service performance evaluation value that still satisfies the user's request performance evaluation value is selected, so the load of the entire service system is balanced and an optimized matching of resources is achieved. When the request priority is medium, the user request and the service performance are weighed against each other, and the service selection is determined by the utility of service matching. Ser_match, the matching service set, is composed of the services selected for each priority attribute. When the number of priority attributes is more than one, a conflict of matching policy selection may occur; the merging of matching services merges the services in Ser_match, and finally the most suitable service is selected for the user. Algorithm 2 shows the whole procedure of merging matching services.

Algorithm 2. Merge matching services
Input: Ser_match // matching service set
Output: Ser_result // the most suitable service for the user
  Initialize α' ∈ {α'_1, ..., α'_k}, i ∈ {1, ..., n}, j ∈ {1, ..., k};
  for Ser_match ≠ ∅ do
    if num(Prior(α') = 3) >= 1 then
      if num(Ser_match(q'_ij >= α'_j)) >= 2 then Ser_result <- the largest U(i) from Ser_match(q'_ij >= α'_j)
      if num(Ser_match(q'_ij >= α'_j)) = 1 then Ser_result <- Ser_match(q'_ij >= α'_j)
      if num(Ser_match(q'_ij >= α'_j)) = 0 then Ser_result <- null
    if num(Prior(α') = 3) = 0 then
      if num(Ser_match) >= 2 then Ser_result <- the largest U(i) from Ser_match
      else Ser_result <- Ser_match
  return Ser_result;
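For readers who prefer running code, the following Python sketch condenses the per-priority selection of Algorithm 1 for a single priority attribute; the multi-attribute merge of Algorithm 2 is omitted, and the data layout, helper lambdas, and the simplified handling of the medium-priority sub-cases are assumptions rather than a faithful transcription of the pseudocode.

```python
def qpsm_match(services, req, j, t1, t2, mu):
    """Simplified sketch of the per-priority selection in Algorithm 1.

    services: list of (name, q) pairs, q a normalized QoS vector with values in (0, 1].
    req: normalized QoS request vector; its j-th entry is the priority attribute value h.
    """
    h = req[j]
    prio = 3 if h > t2 else (2 if h >= t1 else 1)            # formula (4)
    perf = lambda s: sum(v * v for v in s[1])                # QoS_ser(i)
    eval_req = sum(v * v for v in req)                       # QoS_req
    ser_prior = [s for s in services if s[1][j] >= h]
    ser_wait = [s for s in services if perf(s) >= eval_req]

    def util(s):                                             # formulas (5)-(7)
        u_s = h / s[1][j] if h < s[1][j] else s[1][j] / h
        u_c = min(eval_req, perf(s)) / max(eval_req, perf(s))
        return mu * u_s + (1 - mu) * u_c

    if prio == 3:                                            # high: must meet the priority attribute
        return max(ser_prior, key=perf) if ser_prior else None
    if prio == 1:                                            # low: lightest adequate service
        return min(ser_wait, key=perf) if ser_wait else max(services, key=perf)
    inter = [s for s in ser_prior if s in ser_wait]          # medium priority
    if inter:
        return max(inter, key=util)
    if ser_prior or ser_wait:
        return max(ser_prior + ser_wait, key=util)
    return max(services, key=util)
```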

5 Experiment Analysis

The main purpose of the QPSM algorithm is to select the most suitable service for the user according to the user-defined QoS request. In order to verify the feasibility and effectiveness of the algorithm, it is compared with two other QoS-based matching algorithms, Single-QoS and Overall-QoS, in four aspects: response rate, load, average single-performance value and overall-performance value.


All the experiments were conducted on a computer with a 3.2 GHz Intel Core 2 Duo CPU and 12 GB RAM. The data used for the experiments derived from two sources: a data set containing 1000 actual services with 5 QoS values, and a randomly generated user request data set. The purpose of the first experiment is to evaluate the response rate of the algorithm, that is, the ratio of successfully matched and returned requests to the total requests. In this experiment, 100 services are selected for matching and 1000 service requests are randomly generated. The response rates of the three algorithms are shown in Fig. 1. As the number of user requests increases, the response rate of each algorithm tends to be stable. The QPSM algorithm outperforms the other algorithms with the highest response rate, at about 96%, whereas the response rate of the Single-QoS algorithm [8] is the lowest, at about 88%. The reason is that the Single-QoS algorithm fails to respond when no candidate service satisfies the QoS constraints, and the Overall-QoS algorithm [10] fails to respond when the overall performance is lower than the requested performance. In the QPSM algorithm, the matching results are found through a comprehensive consideration of user requirements and service performance.

Fig. 1. The response rate of the algorithm with the number of user requests (x-axis: number of service requests, 0–1000; y-axis: response rate; curves: Overall-QoS, QPSM, Single-QoS)

The second experiment evaluates the effect of load balancing, indicated by the number of times that services with different QoS performance respond to requests. In this experiment, 5 candidate services with the same function and different QoS are selected and 1000 service requests are randomly generated. The distributions of service load using the traditional UDDI algorithm [5] and the QPSM algorithm are compared, and the load distributions of the QPSM algorithm with different single-performance thresholds T1 and T2 are tested. Figure 2 shows that the QPSM algorithm outperforms the UDDI algorithm in terms of load balancing for the same number of service requests. The greater the difference between T1 and T2, the better the load balancing, because a greater difference between T1 and T2 means that more service requests are judged to be of medium priority.

Fig. 2. Distribution of service matching load rate (x-axis: candidate services S1–S5; y-axis: load rate (%); bars: UDDI, QPSM with T1 = 0.5, T2 = 0.8, and QPSM with T1 = 0.2, T2 = 0.8)

The third experiment evaluates the average single-performance value and the overall-performance value of the services. In this experiment, 1000 services are used for matching, and 1000 user requests with high demands on response time and reliability are randomly generated. The weight μ in the service matching utility U(i) is taken as μ = 0.2 and μ = 0.8, respectively.

Fig. 3. Service single-performance and overall-performance with the number of user requests ((a) average reliability; (b) average response time; (c) overall service performance; x-axis: number of service requests)

Figure 3 shows that the larger μ is, the higher the average reliability of the matched service, the shorter the response time, and the lower the overall service performance value. This is because the value of μ determines the proportion of the single performance utility US(i) and the comprehensive service utility UC(i) in the matching utility U(i), which in turn affects the final service selection. Users can select an appropriate μ according to their requirements.

6 Conclusion

Due to the uncertainty caused by the dynamic change of service QoS and the ambiguity of user requirements, current service matching algorithms have some limitations. In order to describe QoS attributes more accurately, we propose a time-segmented QoS model that takes time into consideration. Based on this model, a service matching algorithm based on user QoS requests and priority is also proposed. In this algorithm, user requirements and QoS performance preferences are fully considered, and the most suitable service is selected according to user-defined service requests and priorities, which is better suited to users with specific requirements. Finally, experimental results indicate that the proposed algorithm achieves a higher response rate and a better load-balancing effect.

References
1. Benslimane, D., Dustdar, S., Sheth, A.: Services mashups: the new generation of web applications. IEEE Internet Comput. 12(5), 13–15 (2008)
2. He, Q., Yan, J., Jin, H., Yang, Y.: Quality-aware service selection for service-based systems based on iterative multi-attribute combinatorial auction. IEEE Trans. Softw. Eng. 40, 192–215 (2014)
3. Zhao, S., Wu, G., Zhang, S.: Review of QoS research in SOA. Comput. Sci. 36(4), 16–20 (2009)
4. Klein, A., Ishikawa, F., Honiden, S.: SanGA: a self-adaptive network-aware approach to service composition. IEEE Trans. Serv. Comput. 7(3), 452–464 (2014)
5. Guo, D., Ren, Y., Chen, H.: A QoS constrained web service selection and ordering model. J. Shanghai Jiaotong Univ. 41(6), 870–875 (2007)
6. Zhao, S., Zhang, Y., Yu, L., Cheng, B., Ji, Y., Chen, J.: A multidimensional resource model for dynamic resource matching in internet of things. Concurr. Comput. Pract. Exp. 27(8), 1819–1843 (2015)
7. Li, L., Liu, N., Li, G.: A QoS-based dynamic service composition method in semantic internet of things. Appl. Res. Comput. 33(3), 802–805 (2016)
8. Zeng, L., Benatallah, B., Ngu, A.H.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-aware middleware for web services composition. IEEE Trans. Softw. Eng. 30(5), 311–327 (2004)
9. Cardoso, J., Sheth, A., Miller, J., Arnold, J., Kochut, K.: Quality of service for workflows and web service processes. Web Semant. Sci. Serv. Agents World Wide Web 1(3), 281–308 (2004)
10. Jia, B., Li, W., Zhou, T.: A centralized service discovery algorithm via multi-stage semantic service matching in internet of things. In: 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), pp. 422–427 (2017). https://doi.org/10.1109/CSE-EUC.2017.82
11. Chen, L., Yang, J., Zhang, L.: Time based QoS modeling and prediction for web services. In: Kappel, G., Maamar, Z., Motahari-Nezhad, H.R. (eds.) ICSOC 2011. LNCS, vol. 7084, pp. 532–540. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25535-9_38

Research on Overload Classification Method for Bus Images Based on Image Processing and SVM

Tingting Li1, Yongxiong Sun2(✉), Yanhua Liang1, Yujia Zhai2, and Xuan Ji2

1 College of Software, Jilin University, Changchun 130012, China
2 College of Computer Science and Technology, Jilin University, Changchun 130012, China
[email protected]

Abstract. The speed and efficiency of manually screening bus images for overloading are relatively low, which wastes a large amount of human resources. Therefore, an overload classification method for bus images based on image processing and support vector machine is proposed to intelligently identify whether an image shows an overloaded bus. Based on this consideration we have done the following work. Firstly, the bus images were preprocessed, including image enhancement using the histogram equalization method and image segmentation using an improved Otsu algorithm. Secondly, the features of the segmented images were extracted by the Kirsch edge detection operator to establish an image feature sample library. Finally, an appropriate kernel function and parameters were chosen to establish a classifier model based on the support vector machine, which is trained on the sample library to classify the bus images. Theoretical analysis and experimental results show that, within the finite range of parameters considered, the average classification accuracy of the polynomial kernel function is better than those of the Gaussian kernel function and the Sigmoid kernel function. When the parameter d of the polynomial kernel function is 4, the classification accuracy is 93.68%, and the classification performance is stable without significant rises or falls. This conclusion was verified in an actual application.

Keywords: Bus overload · Image segmentation · Image feature extraction · Support vector machine · Image classification

1 Introduction

Bus overload refers to the number of passengers in a vehicle exceeding the authorized number of passengers. Bus overload is a direct threat to the safety of the passengers; once a traffic accident occurs, it leads to casualties and has a significant influence on society [1]. In order to prevent vehicles from overloading as much as possible, the public security, transportation, highway and other departments take active measures. On the one hand, they actively publicize the danger of overloading to enhance the safety awareness of passengers. On the other hand, they use different kinds of advanced technology to supervise overloading, such as installing driving recorders, cameras and other monitoring equipment in buses [2].


These measures not only reduce the waste of manpower and material resources, but also allow overloaded illegal vehicles to be investigated with evidence and punished. At present, most provinces and cities in China still use manual recognition to classify the images photographed by a camera installed in the bus in order to determine whether the bus is overloaded. Although the accuracy of manual identification is high, its efficiency is low, so the manual identification method cannot meet current regulatory needs [3]. In order to solve this problem, an overload classification method for bus images based on image processing and support vector machine (SVM) is proposed. Compared with existing manual recognition methods, this method can automatically recognize overloaded bus images, which saves a lot of human resources and greatly improves the speed and quantity of illegal image identification [4].

2 Pretreatment of Bus Images

The purpose of this paper is to classify bus images to detect overloaded buses by using image processing and support vector machine. Image preprocessing is the precondition of image classification. It plays an important role in classifying overloaded bus images. The experimental data are derived from the historical data of the transportation department, Jilin City, Jilin Province.

Fig. 1. First group of bus image enhancement effect graph and contrast histogram


2.1 Histogram Equalization

The images in this paper were taken by cameras installed in the bus, and their quality is poor. Thus, image enhancement is necessary before image segmentation. We used histogram equalization to enhance the original images, which makes the gray-level distribution of the whole image tend to be uniform. The process is as follows:

Fig. 2. Second group of bus image enhancement effect graph

Calculate the number of pixels $n_i$, $i = 0, 1, \ldots, L-1$, for each gray level of the original bus image, where L is the total number of gray levels.

1. Calculate the original image histogram:

$$P(r_i) = \frac{n_i}{n} \quad (1)$$

where n is the total number of pixels of the original image.

2. Calculate the cumulative distribution function:

$$s_k(r_k) \approx \sum_{i=0}^{k} P(r_i) \quad (k = 0, 1, \ldots, L-1) \quad (2)$$

3. Calculate the output gray level, which can be written in the form

$$g_k = \mathrm{INT}\left[ (g_{\max} - g_{\min}) s_k(r_k) + g_{\min} + 0.5 \right] / (L-1) \quad (3)$$

where $k = 0, 1, \ldots, L-1$ and INT[·] is the rounding operator. When $g_{\min} = 0$ and $g_{\max} = L-1$, formula (3) can be written as

$$g_k = \mathrm{INT}\left[ (L-1) s_k(r_k) + 0.5 \right] / (L-1) \quad (4)$$

4. Obtain the output image by modifying the original image according to the mapping between the gray-level function $r_k$ of the original image and the output gray-level function $g_k$.

5. Apply this image enhancement to the two groups of original bus images; the corresponding results are shown in Figs. 1 and 2.
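The steps above can be sketched in a few lines of NumPy; this is a hedged illustration, not the authors' implementation, and it assumes an 8-bit grayscale input with L = 256.

```python
import numpy as np

def histogram_equalize(img, L=256):
    """Histogram equalization following formulas (1), (2) and (4)."""
    hist = np.bincount(img.ravel(), minlength=L)   # n_i
    p = hist / img.size                            # formula (1): P(r_i) = n_i / n
    s = np.cumsum(p)                               # formula (2): cumulative distribution
    g = np.floor((L - 1) * s + 0.5) / (L - 1)      # formula (4): normalized output levels
    out = g[img]                                   # map every input gray level r_k -> g_k
    return (out * (L - 1)).astype(np.uint8)        # back to 0..L-1 for display
```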

2.2 Image Segmentation

In order to classify overloaded bus images using a support vector machine, we need to extract the target area from the background area to obtain training data. Thus, it is important to segment the target area from the original image. Threshold segmentation is one of the earliest image segmentation methods and is simple and effective. It includes the maximum between-class variance method, the minimum cross-entropy threshold method, the maximum entropy threshold method and the maximum correlation threshold method [5].

Fig. 3. Comparison of image segmentation effects of four segmentation methods

By analyzing the bus images, we regard the aisle in the image as the target area and the surrounding passengers as the background area.


We then processed the same image using the four traditional segmentation methods mentioned above; the corresponding results are shown in Fig. 3. As shown in Fig. 3, all four segmentation methods lead to noise and holes, which greatly affects feature extraction. Therefore, in this paper, we first process the bus images using threshold segmentation, and then the closed operation is repeatedly used to remove noise and fill holes. The closed operation is a more advanced morphological transformation that combines the dilation and erosion algorithms [6]: the dilation operation is first applied to the segmented images, and the erosion operation is then applied to the result. We processed the images in Fig. 3 with three closed operations, and the results are shown in Fig. 4.

Fig. 4. Effect graph using threshold segmentation and closed operation

As shown in Fig. 4, the traditional maximum correlation threshold segmentation method gives the worst results, while the traditional Otsu method has the best effect and can effectively separate the target areas from the background areas, apart from a few connected pixels.

Fig. 5. Gray histogram of graph a in Fig. 3

In this paper, we select the middle aisle of the bus images as the target of the training samples.


Figure 3(a) is a normal bus image in which the aisle region accounts for one fifth of the original image and the non-aisle region accounts for four fifths. Thus, compared with the background area, the target area is much smaller. The gray histogram of graph (a) is shown in Fig. 5. As shown in Fig. 5, there are fewer pixels on the left, the gray distribution of the middle pixels is uniform, and the gray level at the far right almost reaches a peak. This means the pixels of the target area are concentrated on the left, while the pixels of the background are concentrated in the middle and on the right, so the gray scale of the background area is larger than that of the target area. Owing to the small variance of the target area and the large variance of the background area, the traditional Otsu method biases the threshold toward the large-variance area, which makes the calculated threshold larger than the ideal threshold and yields poor segmentation results. In order to improve the quality of image segmentation and the accuracy of identifying overloaded bus images, we modify the traditional Otsu method. The original Otsu formula can be written in the form

$$\sigma(t) = \omega_0(\mu_0 - \mu)^2 + \omega_1(\mu_1 - \mu)^2 = \omega_0 \omega_1 (\mu_1 - \mu_0)^2 \quad (5)$$

where $\omega_0$ is the probability of the target class and $\omega_1$ is the probability of the background class; that is, the target area and the background area are weighted [7]. In this paper, we adjust the weighting by lowering and raising the powers of $\omega_0$ and $\omega_1$. The improved Otsu formula is

$$\sigma(t) = \omega_0^{\alpha}(\mu_0 - \mu)^2 + \omega_1^{\beta}(\mu_1 - \mu)^2 = \omega_0^{\alpha} \omega_1^{\beta} (\mu_1 - \mu_0)^2 \quad (6)$$

where $\alpha$ represents the proportion of the background area in the whole image and $\beta$ is the reciprocal of $\alpha$, which keeps the algorithm from being biased toward the target class.

Fig. 6. Comparison between traditional Otsu algorithm and improved Otsu algorithm


By modifying the original formula, we ensure that the threshold will not be too high when the variance of one class is larger than that of the other, and at the same time the gray levels of the two classes are more balanced. The results of the traditional Otsu algorithm and the improved Otsu algorithm are shown in Fig. 6. As shown in Fig. 6, with the improved Otsu algorithm the passengers in the background area are not classified into the target area, and the target area is effectively separated from the background area. Therefore, in this paper, we use the improved Otsu algorithm together with the closed operation to segment the bus images, which resolves the effects of noise and holes and provides a good basis for feature extraction.
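A hedged NumPy sketch of the weighted between-class criterion in formula (6) is given below. The exhaustive 256-bin threshold search is a standard choice assumed here, and since the paper does not pin down how α is computed at each candidate threshold, the sketch takes α as the background class probability ω1 (an assumption to be adjusted if α is instead the fixed background proportion of the image).

```python
import numpy as np

def improved_otsu(gray):
    """Threshold maximizing w0**alpha * w1**beta * (mu1 - mu0)**2, per formula (6)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_sigma = 0, -1.0
    for t in range(1, 255):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        alpha = w1            # assumed: background proportion at this threshold
        beta = 1.0 / alpha
        sigma = (w0 ** alpha) * (w1 ** beta) * (mu1 - mu0) ** 2
        if sigma > best_sigma:
            best_t, best_sigma = t, sigma
    return best_t
```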

3 Bus Image Feature Extraction

After image enhancement and segmentation, we select the Kirsch operator to extract features from the segmented images and build an image feature database, which is used to classify the bus images with the support vector machine. The Kirsch operator computes a convolution (derivative) for each pixel using eight templates. The eight templates represent eight directions and give the maximal response to the eight specific edge directions of the image; the output of the Kirsch operator is the maximum over the eight directions. Kirsch is an effective edge detection operator that can significantly suppress noise during edge detection [8]. Assume the original 3 × 3 sub-image is as shown in Fig. 7.

a3   a2   a1
a4  (i,j)  a0
a5   a6   a7

Fig. 7. A 3 × 3 sub-picture of the original image

The gradient of the edge is

$$G(i, j) = \max\left\{1, \max_{k}\left|5S_k - 4T_k\right|\right\}, \quad k = 1, 2, \ldots, 8 \quad (7)$$

where $S_k = x_{k+1} + x_{k+2} + x_{k+3}$, $T_k = x_{k+4} + x_{k+5} + \cdots + x_{k+8}$, and k from 1 to 8 indexes the eight direction templates shown in Fig. 8. The Kirsch operator is based on the fact that the gray scale of non-edge pixels is smaller than the threshold, while the gray scale of edge pixels is larger than the threshold. When detecting image edges, we first use a low threshold to binarize the original images and then detect the target and background areas. The target area and the background area can be effectively divided by the boundary regions whose gray scale is larger than the threshold [9]. Using the method described above, we preprocess the two groups of original bus images and extract the corresponding features. The results are shown in Fig. 9.
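A minimal NumPy/SciPy sketch of the eight-direction response follows. It uses the classical Kirsch compass kernels with coefficients 5 and −3 (the formula as printed in the paper has 5S_k − 4T_k, so the −3 entries would need adjusting if that is intended); the kernel rotation helper and the use of scipy.ndimage.convolve are implementation assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def rotate45(k):
    """Rotate the outer ring of a 3x3 kernel by one position (45 degrees)."""
    r = k.copy()
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    vals = [k[i, j] for (i, j) in ring]
    vals = vals[-1:] + vals[:-1]
    for (i, j), v in zip(ring, vals):
        r[i, j] = v
    return r

def kirsch_edges(gray):
    """Maximum absolute response over the eight Kirsch compass kernels, floored at 1."""
    g = np.asarray(gray, dtype=float)
    k = np.array([[5, 5, 5],
                  [-3, 0, -3],
                  [-3, -3, -3]], dtype=float)
    responses = []
    for _ in range(8):
        responses.append(np.abs(convolve(g, k, mode="nearest")))
        k = rotate45(k)
    return np.maximum(np.max(responses, axis=0), 1.0)   # max(1, max_k |.|)
```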


Fig. 8. Eight directions template

Fig. 9. Two sets of bus original images and features extraction

Fig. 10. Unloaded and overloaded image extraction aisle shape effect

For the classification of bus images, we are only concerned with the information of the target area.


In order to reduce computation and improve the classification accuracy, we need to avoid the influence of non-target areas after extracting the image outlines. In this paper, we apply simple processing to the extracted outline images and then extract the shape of the aisle position as sample data for the image feature database. Figure 10 shows the extracted aisle shapes. In this paper, we processed 551 bus images; some results are shown in Fig. 11.

Fig. 11. Part of the bus image feature samples

4 Image Classification Based on Support Vector Machine

By analyzing the image features of the target area, we find that the features of a target area in which passengers are standing in the aisle differ significantly from the features of a target area in which the aisle is empty. Therefore, we can recognize overloaded bus images from the shapes of the feature images of the target area. We divide the training data into two parts, a positive training set and a negative training set, where the positive training set stores outline feature samples from non-overloaded bus images and the negative training set stores outline feature samples from overloaded bus images. We can use a support vector machine to classify bus images after constructing the two training sets. The support vector machine is very effective for linear classification problems [10]. For a nonlinear classification problem, we can transform it into a linear problem by a nonlinear transformation function, which makes it linearly separable in a high-dimensional space [11]. For a nonlinear classification problem, solving for the optimal classification surface is equivalent to the following problem:

$$\text{Minimize } \phi(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad (8)$$

$$\text{Subject to } y_i\left[(w^T x_i) + b\right] - 1 + \xi_i \geq 0, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, n \quad (9)$$

where C > 0 is a penalty coefficient. This is a quadratic programming problem that can be solved by the Lagrange method and transformed into the following dual problem:

$$\text{Maximize } Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad (10)$$

$$\text{Subject to } \sum_{i=1}^{n} y_i \alpha_i = 0 \quad (11)$$

$$0 \leq \alpha_i \leq C, \quad i = 1, 2, \ldots, n \quad (12)$$

The weight coefficient for the optimal classification surface is:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad (13)$$

It can be seen that the weight coefficient of the optimal classification surface is a linear combination of the training samples. From formula (12), the smaller the penalty coefficient C, the smaller the Lagrange multipliers; likewise, from formula (13), the smaller the $\alpha_i$, the smaller $\|w\|$, which means a larger interval between the two classes and better generalization performance of the SVM. Thus, the smaller C is, the larger the interval between the two classes and the better the generalization performance, at the cost of reduced accuracy; conversely, the larger C is, the smaller the interval and the poorer the generalization performance, but the higher the accuracy. Therefore, the penalty coefficient affects both the generalization performance and the accuracy of the SVM, and its value should be selected appropriately. In this paper, we classify bus images based on SVM by choosing an appropriate kernel function. The type of kernel function significantly affects the performance of the SVM. Three common kernel functions are used in this paper: the polynomial kernel function, the Gaussian kernel function and the Sigmoid kernel function [12]. They can be written in the following forms. Polynomial kernel function:

$$K(x_i, x_j) = \left[ (x_i \cdot x_j) + 1 \right]^d \quad (14)$$

where d is the degree of the polynomial. Gaussian kernel function:

$$K(x_i, x_j) = \exp\left( -\sigma \|x_i - x_j\|^2 \right) \quad (15)$$

Sigmoid kernel function:

$$K(x_i, x_j) = \tanh\left( \sigma (x_i \cdot x_j) + c \right) \quad (16)$$

Among them, the polynomial kernel function depends on d, and the Gaussian kernel function depends on σ.


In this paper, we compare the accuracy of the SVM model with the different kernel functions on a large number of testing images and finally choose the optimal classifier. We classify overloaded bus images based on image processing and support vector machine as follows. Firstly, we select training samples from the standard sample library and preprocess the selected images, including histogram equalization, image segmentation and the closed operation. Secondly, we extract the edge features of the preprocessed images and build a feature sample training set. Then, we select an appropriate kernel function and parameters and train a support vector machine model on the training set. Finally, we use the trained model to predict the class labels of the testing set and calculate the accuracy of the model. The whole flow chart is shown in Fig. 12.

Fig. 12. Image overload and classification based on image processing and SVM
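To reproduce the kernel comparison, one classifier per kernel can be trained as sketched below with scikit-learn (an assumed tooling choice, not the authors'). Note that sklearn parameterizes the polynomial kernel as (gamma·⟨x, x'⟩ + coef0)^d and the sigmoid kernel as tanh(gamma·⟨x, x'⟩ + coef0), so gamma and coef0 stand in for the paper's σ and c, and X_train/y_train are placeholders for the extracted aisle-shape features and overload labels.

```python
from sklearn.svm import SVC

def make_classifiers(C=100):
    """One SVC per kernel, roughly mirroring formulas (14)-(16)."""
    return {
        "poly_d4": SVC(kernel="poly", degree=4, gamma=1.0, coef0=1.0, C=C),
        "rbf_0.5": SVC(kernel="rbf", gamma=0.5, C=C),
        "sigmoid": SVC(kernel="sigmoid", gamma=1.0, coef0=-0.3, C=C),
    }

# clf = make_classifiers()["poly_d4"]
# clf.fit(X_train, y_train)            # X_train: feature vectors, y_train: 0/1 labels
# accuracy = clf.score(X_test, y_test)
```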

5 Experiments and Results

The purpose of this paper is to divide bus images into non-overloaded images and overloaded images based on image processing and support vector machine. It is difficult to determine which type of kernel function is best when the feature mapping is unknown, so the performance of the model is strongly related to the choice of kernel function. At present, many researchers make this choice based on the generalization error of the classifier estimated over a great many experiments [13]. In this paper, 897 complete bus images are used as the sample database, which includes 36 obstructed images and 861 normal images. In order to analyze the experimental results, the 861 normal images are selected as the standard dataset. The dataset consists of two types of images, of which 669 are non-overloaded and 192 are overloaded, and the resolution of each image is 352 × 288. We divide the dataset into training and testing sets using the hold-out ("set aside") method [14], a popular sampling method in which the dataset D is divided into two mutually exclusive sets, a training set S and a testing set T.


After training a model on the training set S, the testing set T is used to calculate the test error and thereby estimate the generalization error of the model. In this paper, 426 non-overloaded images and 125 overloaded images are randomly selected from the standard 861-image dataset as the training set, and the remaining 310 images (243 non-overloaded and 67 overloaded) form the testing set, ensuring that the testing samples are not used in the training process. We select accuracy as the evaluation indicator, i.e., the proportion of correctly classified samples in the testing set. Each experiment is repeated over 5 random divisions, the evaluation result is the mean of the five runs, and results are reported to two decimal places. The purpose of this experiment is to observe the classification accuracy of the classifier under different parameters of the different kernel functions and to select the kernel function and parameters most suitable for this project. For the polynomial kernel function, d takes the values 1, 2, 3, 4 and 5. For the Gaussian kernel function, σ takes the values 0.1, 0.5, 1, 2 and 5. For the Sigmoid kernel function, σ = 1 and c takes the values 0.5, 0.3, 0.1, 0, −0.1, −0.3 and −0.5. Meanwhile, according to the literature [15], the penalty factor C is set to 100. The classification accuracies of the three kernel functions with different parameters are given below (Table 1).

Table 1. Classification accuracy of polynomial kernel function with different parameters

Group      d=1     d=2     d=3     d=4     d=5
1          80.65   84.84   89.68   93.23   89.68
2          80.00   85.48   91.61   93.55   89.68
3          81.61   87.10   90.32   94.19   90.00
4          80.97   85.16   90.97   93.87   89.68
5          82.26   86.45   88.71   93.55   89.03
Mean (%)   81.10   85.81   90.26   93.68   89.61

Figure 13 shows the average classification accuracy of the polynomial kernel function with different parameters.

Fig. 13. Mean accuracy curve of polynomial kernel function with different parameters

As can be seen from the trend of the curve in Fig. 13, the classification accuracy of the polynomial kernel function differs across its parameter values.


As d increases, the classification accuracy of the model first increases and then decreases. When d is 4, the classification effect is the best, reaching 93.68%. For the experimentally selected values of d, the average classification accuracy fluctuates within the limited range of 81.10%–93.68%, and the performance of the model is relatively stable (Table 2).

Table 2. Classification accuracy of RBF (Gaussian) kernel function with different parameters (columns correspond to σ = 0.1, 0.5, 1, 2, 5)

Group      σ=0.1   σ=0.5   σ=1     σ=2     σ=5
1          67.74   90.32   85.16   74.19   70.97
2          68.39   89.68   83.87   75.81   70.00
3          70.32   88.71   84.52   74.84   69.35
4          69.35   90.32   84.52   75.81   71.61
5          67.42   90.97   83.23   75.16   70.00
Mean (%)   68.64   90.00   84.26   75.16   70.39

The average classification accuracy curve of the Gaussian kernel function with different parameters is shown in Fig. 14.

Fig. 14. Mean accuracy curve of RBF kernel function under different parameters

Fig. 15. Mean accuracy curve of Sigmoid kernel function under different parameters


It can be seen from Fig. 14 that the Gaussian kernel function has different classification accuracy with different parameters. Within the limited range of parameters selected, when σ is 0.1 the classification effect is poor, when σ is 0.5 the classification effect is the best, and as σ increases further the classification accuracy drops and is not very stable. The average accuracy curve of the Sigmoid kernel under different parameters is shown in Fig. 15. Analyzing the experimental data in Table 3 and the average accuracy curve in Fig. 15, the classification accuracy of the Sigmoid kernel function fluctuates in the range of 72.32%–88.77%, and when c is −0.3 the classification accuracy is the best. Moreover, when c takes a negative value, the classification accuracy is better than for a positive value, which accords with the analysis of the Sigmoid kernel in ref. [16].

Table 3. Classification accuracy of Sigmoid kernel function under different parameters

Parameter c   Group 1   Group 2   Group 3   Group 4   Group 5   Mean (%)
−0.5          80.65     82.26     81.61     83.87     80.65     81.81
−0.3          88.06     87.10     89.68     90.32     88.71     88.77
−0.1          84.52     85.16     85.48     85.16     83.87     84.84
0             78.06     79.03     78.71     80.00     78.71     78.90
0.1           76.13     75.81     75.81     77.42     76.77     76.39
0.3           80.65     80.65     81.29     80.00     80.97     80.71
0.5           71.94     70.97     72.58     72.58     73.55     72.32

Comprehensively analyzing the three kernel functions selected in this paper, the classification performance of the polynomial kernel function is obviously better than that of the other two kernel functions. For the Gaussian kernel function, only when σ is 0.5 does the classification accuracy reach 90.00%; for other values of σ the classification effect is not stable. The classification performance of the Sigmoid kernel function is also unstable and oscillates. Among the three, the polynomial kernel function has the highest average classification accuracy, up to 93.68%, and its classification performance is relatively stable. In general, using the polynomial kernel function with parameter d = 4 is the best choice for the bus image overload classification in this paper, although it should be noted that this is the best choice only within the limited range of kernel functions and parameters considered. It can be seen from the above experiments that the average success rate of bus overload classification using the image classification method based on support vector machines reaches 93.68%, and when applying it to the traffic visualization system in Jilin Province, the accuracy rate still reaches about 93%. So the use of image processing and support vector machine technology can achieve bus overload detection.

6 Conclusion

In this paper, based on image enhancement, improved threshold segmentation and closed-operation processing of images of passengers photographed inside the bus, feature extraction is performed on the preprocessed image samples to establish a training set; an appropriate kernel function and parameters are then selected, the SVM model is established and trained on the training set, the automatic classification of input images is completed, and overloaded images are intelligently identified. For the images used in this paper, the comparative analysis of multiple sets of experiments shows that when the polynomial kernel function parameter d is 4, the classification accuracy is the highest. Increasing the recognition speed and efficiency for overloaded bus images can save a lot of human resources and increase the penalty rate for violations, so the bus image overload classification method based on image processing and support vector machine has great value. However, there is still a gap to the ideal classification accuracy of 100%; how to further improve the classification accuracy is future work.

References
1. Ding, C.: The effect of overloaded cars and the tire pressure on the stress distribution of the road. Int. J. Intell. Inf. Manag. Sci. 5(3), 264–267 (2016)
2. Wang, W.L., Lu, C.Z., Li, Y.R.: Basic economic measures in long-term effective mechanism for administering overload and oversize of motor vehicles. Int. J. Intell. Inf. Manag. Sci. 24(6), 148–152 (2007)
3. Zhang, Z., Cheng, W., Wu, L., et al.: Study on circular traffic signs recognition method based on invariant moments and SVM. J. Electron. Meas. Instrum. 31(5), 773–779 (2017)
4. Zhao, G.Q., Wang, F.J.: Car train overload signal monitoring system optimization modeling research. Comput. Simul. 33(11), 162–163 (2016)
5. Wu, Y.Q., Meng, T.L., Wu, S.H.: Research progress of image thresholding methods in recent 20 years (1994–2014). J. Data Acquis. Process. 30(1), 1–23 (2015)
6. Yan, J.Z., Lin, S., Sing, B.K.: Change-based image cropping with exclusion and compositional features. Int. J. Comput. Vis. 114(1), 74–87 (2015)
7. Venmathi, A.R., et al.: Kirsch compass kernel edge detection algorithm for micro calcification clusters in mammograms. Middle East J. Sci. Res. 24(4), 1530–1535 (2016)
8. Liu, D.H., Zhang, Y.D., Li, X., et al.: Adaptive thresholding method under the dynamic environment. J. Comput. Appl. 36(S2), 152–156 (2016)
9. Venmathi, A.R., Venmathi, E.N., Ganesh, N.K.: Kirsch compass kernel edge detection algorithm for micro calcification clusters in mammograms. Middle East J. Sci. Res. 24(4), 1530–1535 (2016)
10. Thang, P.Q., Thuy, N.T., Lam, H.T.: A modification of solution optimization in support vector machine simplification for classification. In: Bhateja, V., Nguyen, B.L., Nguyen, N.G., Satapathy, S.C., Le, D.-N. (eds.) Information Systems Design and Intelligent Applications. AISC, vol. 672, pp. 149–158. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7512-4_15
11. Zhi, J., Sun, J., Wang, Z., Ding, W.: Support vector machine classifier for prediction of the metastasis of colorectal cancer. Int. J. Mol. Med. 41(3), 1419–1426 (2018)
12. Mcdonald, G., Macdonald, C., Ounis, I.: A study of SVM kernel functions for sensitivity classification ensembles with POS sequences. In: SIGIR 2017, pp. 1097–1100 (2017)
13. Yang, L., Wang, Y.: Survey for various cross-validation estimators of generalization error. Appl. Res. Comput. 32(5), 1287–1290 (2011)
14. Zhou, Z.H.: Machine Learning, 2nd edn. Tsinghua University Press, Beijing (2016)
15. Yu, Z., Wong, H.S., Wen, G.: A modified support vector machine and its application to image segmentation. Image Vis. 29(1), 29–40 (2016)
16. Hsuan, T.L., Chih, J.L.: A study on Sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Submitt. Neural Comput. 27(1), 15–23 (2003)

Accurate Acoustic Based Gesture Classification with Zero Start-Up Cost

Haojun Ai(1,2,3), Liangliang Han(4), Yifeng Wang(1), and Liang Liao(5,6)

1 School of Cyber Science and Engineering, Wuhan University, Wuhan, Hubei, China ({aihj,whuyifeng}@whu.edu.cn)
2 Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Beijing, China
3 Collaborative Innovation Center of Geospatial Technology, Wuhan, China
4 Aerospace System Engineering Shanghai, Shanghai, People's Republic of China
5 ChangZhou Municipal Public Security Bureau, Changzhou, China
6 Key Laboratory of Police Geographic Information Technology, Ministry of Public Security, Beijing, China

Abstract. Acoustic gesture recognition based on the Doppler effect has garnered much research attention. The accuracy of gesture recognition and potential false positives are the main factors that limit the widespread use of gestures. To this end, we propose a novel gesture classification method based on the acoustic Doppler effect that does not require any custom hardware, only the speaker and microphone of a laptop. An effective sound field is built by a high-frequency sound wave emitted from the speaker, and the wave reflected by hand motion is captured by the microphone. We design a set of five features: three of them are stable and invariant across different people, so even new users can operate our system with zero start-up cost and no training; the remaining two are highly correlated with the velocity of a gesture and its distance from the computer, which reduces potential false positives in detection. In addition, a classifier based on multistage decision rules is designed to identify the 11 defined gestures. The user experience feedback from an HCI experiment shows that our system has good usability, and the numerical experiments with 10 users show that our system not only keeps potential false positives low but also achieves a classification accuracy of up to 99.09%.

Keywords: Doppler effect · Gesture classification · Acoustic · HCI

1 Introduction

For years, device-free gesture recognition [2,9,10,17] has developed rapidly. The widely used mobile phones and PCs already have audio input and output components composed of speakers and microphones, so Doppler-based gesture recognition, as a new human-machine interface, has attracted the attention of researchers [1,7,8,14,15].


Many studies have tried to use machine learning methods [11,13,19–21] to improve the accuracy of gesture recognition. For example, Ai et al. [1] obtained an HMM model for each gesture by training on the feature vectors of the samples, finally achieving a recognition accuracy of 95% for 18 gestures. Dolphin [16] extracted the effective frequency bins around the peak and normalized them to form a feature vector; the chosen classifier was Liblinear (a large-scale linear classifier), with 93% recognition accuracy. In addition, neural networks [11], Bayesian classifiers [19], etc. were also used in some studies. Although machine learning significantly improves classification accuracy, it also introduces problems such as increased computational complexity and time consumption. Besides, potential false positives in gesture detection are a key issue that restricts the widespread use of gestures in HCI. Most acoustic-based hand gesture classification methods show good robustness in an unmanned environment [1,14,16], but if people walk around, they are prone to false positives in detection [1,7,16].

In this paper, we extract three generally stable and invariant features to characterize a gesture, and two other features that reduce false positives when detecting gestures. Furthermore, we design a classifier based on multistage decision rules to categorize the 11 predefined gestures with high accuracy and few false positives. The main contributions of this paper are summarized as follows:

– We extract five features from the Doppler shift to characterize a gesture and design a classifier based on multistage decision rules to identify all gestures, which keeps a high precision during gesture recognition.
– Two of the features are the bandwidth and the amplitude of the shift, which reflect the velocity of a gesture and its distance from the computer. Appropriate thresholds can effectively identify far-range walking and slow motions, thereby reducing the potential false positives in detection.
– The remaining three features are the direction, the count of direction changes and the distance. They are stable and invariant properties of a gesture, so they generally do not change when the same gesture is performed by different people. Hence users can operate our system with zero start-up cost and no training.

2 Feature Extraction

The theoretical basis of gesture identification is a well-known phenomenon: the Doppler effect [18]. When a moving object approaches the signal source (the speaker), the frequency of the signal perceived by the receiver (the microphone) becomes larger [3], whereas the perceived frequency decreases when the object moves away from the wave source.


The Doppler shift fR caused by a movement can be calculated by the equations:

fR = (1 + Δv / c) × f    (1)

Δf = fR − fS    (2)

where Δv and c respectively represent the velocity of the object and of sound in air, and fS is the pilot tone transmitted from the speaker. Since the speaker and the microphone remain stationary and are located on the same laptop, the velocity of the receiver and of the source is not considered.

2.1 Signal Analysis

In this paper, an effective sound field is formed by a high-frequency signal of 18 kHz emitted from the speaker. When the operator moves a hand in it, the reflected frequency shift is captured by the microphone. According to the characteristics of the Doppler frequency shift [6], all signal processing is carried out in the frequency domain. We set the sampling frequency of the microphone to 44.1 kHz, and a 2048-point FFT is performed to obtain the frequency-domain characteristics of the sound. In the informal test of SoundWave [8], the fastest gesture reached 3.9 m/s. Here we conservatively estimate the fastest speed as 6 m/s; the corresponding maximum frequency shift Δfmax = 318 Hz is calculated according to Eq. 1, so the effective frequency range to the left of the emitted peak is [17682, 18000] Hz and the effective range to the right is [18000, 18318] Hz.
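As a concrete illustration of this setup, the following Python sketch (assuming NumPy and a single 2048-sample frame of microphone audio as input; all names are ours, not the authors') computes the magnitude spectrum and keeps only the effective band around the 18 kHz pilot tone. The MAX_SHIFT constant reproduces the arithmetic above: (6 m/s / 340 m/s) × 18000 Hz ≈ 318 Hz.

```python
import numpy as np

FS = 44100            # microphone sampling rate (Hz)
N_FFT = 2048          # FFT size used in the paper
F_PILOT = 18000.0     # pilot tone emitted by the speaker (Hz)
MAX_SHIFT = 318.0     # ~ (6 m/s / 340 m/s) * 18000 Hz, the maximum Doppler shift

def spectrum(frame):
    """Magnitude spectrum of one 2048-sample analysis frame."""
    windowed = frame * np.hamming(len(frame))
    mag = np.abs(np.fft.rfft(windowed, n=N_FFT))
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)
    return freqs, mag

def effective_band(freqs, mag):
    """Keep only the bins in [17682 Hz, 18318 Hz] around the pilot tone."""
    mask = (freqs >= F_PILOT - MAX_SHIFT) & (freqs <= F_PILOT + MAX_SHIFT)
    return freqs[mask], mag[mask]
```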

Fig. 1. (a) Positive shift in the frequency spectrum generated by a towards-gesture. (b) Time-frequency map caused by a moving hand. The hand moves towards and away from the device alternately from the 4th to the 8th second, with no motion in the remaining time.

We set the length of the analysis window to 50 ms, so the frequency-domain view is refreshed every 50 ms. The frequency spectrum is like a micro-image, reflecting the instantaneous changes in frequency caused by gestures and containing many tiny details (Fig. 1(a)). A time-frequency graph is generated by adding time information to the spectrum; as seen in Fig. 1(b), it expresses the direction and distance of gestures at the macro level.

2.2 Feature Extraction

After obtaining the spectrum of the signal collected by the microphone, we extract five features — the bandwidth and the amplitude of the frequency shift, the direction and the count of direction changes, and the moving distance of a gesture — to form a feature vector x:

x = (x(1), x(2), ..., x(i), ..., x(n))^T    (3)

where x(i) represents the ith feature and n = 5. The overall flow is shown in Fig. 2. Next, we explain each feature of the frequency shift in detail.

Bandwidth (x(1)). x(1) is the bandwidth of the emitted peak, obtained by scanning the frequency bins at 30% of the tone amplitude, extracted with the same method as SoundWave [8]. x(1) is a measure of the absolute velocity of the hand movement and divides the hand velocity into different levels (Fig. 3). By setting an appropriate threshold θv, false positives caused by unintended slow motions of users can be effectively filtered out.
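A minimal sketch of the bandwidth feature, assuming the magnitude spectrum and the index of the pilot peak are already available; the outward scan at 30% of the peak amplitude follows the SoundWave-style description above, but the function name and arguments are our own illustration.

```python
def bandwidth_x1(mag, peak_idx, bin_hz, threshold=0.3):
    """x(1): scan left and right from the pilot peak while the amplitude stays
    above `threshold` times the peak amplitude; return the covered width in Hz."""
    limit = threshold * mag[peak_idx]
    left = right = peak_idx
    while left > 0 and mag[left - 1] >= limit:
        left -= 1
    while right < len(mag) - 1 and mag[right + 1] >= limit:
        right += 1
    return (right - left) * bin_hz
```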

Fig. 2. The processing flow of the sound signal: capture ultrasound → Hamming window → FFT → extract bandwidth → determine direction → calculate distance.

Fig. 3. Bandwidth in frequency spectrum caused by different velocity gestures.


Amplitude of Frequency Shift (x(2)). x(2) is the highest amplitude that the frequency shift reaches, expressed as a percentage of the amplitude of the tone peak Apeak. The shifts caused by performing the same gesture at far and near range differ significantly, mainly in x(2), as illustrated in Fig. 4: the farther a gesture is performed from the device, the lower x(2) is. Therefore, by setting a high amplitude threshold hupper, gestures can be divided into two categories: near-range gestures Gnear and far-range gestures Gfar. In this paper, we set hupper = 70% × Apeak; x(2) > hupper holds in the frequency spectrum of Gnear, while x(2) < hupper holds for Gfar.

Fig. 4. L is the distance from the location of the gesture to the computer, and V represents the velocity level of a gesture. The noise in the surrounding environment is about 45 dB. (a) No gesture was performed. (b–c) High-amplitude shift caused by a gesture performed at near range; the bandwidth x(1) in (c) is much larger than that in (b). (d–e) Lower-amplitude shift caused by a fast gesture performed far from the computer.

To summarize, x(1) is the bandwidth covered by the frequency shift on the horizontal axis at a specific amplitude, which reflects the gesture velocity, and x(2) is the amplitude that the frequency shift reaches on the vertical axis, so gestures can be divided into two categories based on the distance of the gesture from the computer. Identifying slow or far-range motions as false alarms improves system robustness.

Direction (x(3)). x(3) represents the direction of the gesture, determined by the energy difference between the right and left sides of the peak. When the frequency shift is positive, the energy on the right of the peak increases, whereas a negative shift causes the energy on the left side to increase.

Fig. 5. (a) The red line area shows a positive shift occurs on the right of pilot peak, x(3) > 0, meaning a towards-gesture. (b) No movement and no frequency shift, x(3) is near zero. (Color figure online)

Define the energy on the left, Eleft, as the integral of the spectrum over the left effective range:

Eleft = ∫_{fS − Δfmax}^{fS} f(x) dx    (4)

Similarly, define the energy on the right, Eright:

Eright = ∫_{fS}^{fS + Δfmax} f(x) dx    (5)

The feature x(3) is then the difference between the right and left energy:

x(3) = Eright − Eleft    (6)


where Δfmax = 318 Hz and f(x) is the amplitude of the shift at each effective frequency bin. As illustrated in Fig. 5(a), if x(3) is positive, the hand moves towards the device, while a negative value means it moves away. No movement occurred if x(3) is near zero (Fig. 5(b)).
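The energy-difference feature of Eqs. 4–6 can be approximated by summing the magnitude over the effective bins on each side of the pilot peak. This is only a sketch under our own naming; the paper does not prescribe an implementation.

```python
import numpy as np

def direction_x3(freqs, mag, f_pilot=18000.0, max_shift=318.0):
    """x(3): energy to the right of the pilot peak minus energy to the left
    (Eqs. 4-6), approximated by summing magnitudes over the effective bins."""
    e_left = mag[(freqs >= f_pilot - max_shift) & (freqs < f_pilot)].sum()
    e_right = mag[(freqs > f_pilot) & (freqs <= f_pilot + max_shift)].sum()
    return float(e_right - e_left)   # > 0: towards the device, < 0: away
```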

Fig. 6. When the frequency shift property goes from ① to ②, one change of hand direction is detected. Similarly, the transition from ② to ③ is also one change.

Fig. 7. The area ② of a long-distance gesture is obviously larger than the area ① of a short-distance gesture.

Count of Direction Changes (x(4)). When the sign of x(3) flips from positive to negative (or vice versa) within a gesture, one change of the gesture direction is recorded. In Fig. 6, the number of changes of motion direction x(4) is 5; each crossing of the frequency shift across the peak marks one change.

Distance (x(5)). x(5) is calculated by integrating the frequency shift over time, and indicates the moving distance of a gesture within one direction change, distinguishing long- and short-distance gestures (Fig. 7). Distance = time × velocity; the time information can be obtained directly from the time-frequency map, so the key is the velocity, which is proportional to the frequency shift according to Eq. 2. We use the following equation to make a rough calculation of x(5):

x(5) = (c / fS) × ∫_{t1}^{t2} Δf dt    (7)

where t1 and t2 respectively represent the start point and the next direction-change point of the gesture within one change of the gesture direction. In Fig. 8, an informal test shows the x(5) distribution of different gestures, where short- and long-distance gestures were each performed 100 times by 10 participants. The result confirms that long- and short-distance gestures have a clear boundary value, so we initially set the long/short distance threshold DL/S to 500, which makes the two types of gestures clearly distinct and ensures a high sensitivity when distinguishing them.
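A rough sketch of how x(4) and x(5) could be accumulated over the 50 ms analysis frames, assuming a per-frame sequence of x(3) values and mean frequency shifts (our own assumed input format); the c/fS factor follows Eq. 7, and the long/short threshold DL/S is applied to the resulting value as in the paper.

```python
def count_and_distance(frames, frame_s=0.05, c=340.0, f_s=18000.0):
    """Accumulate x(4) and x(5) over a gesture.
    `frames` is a sequence of (x3, mean_shift_hz) pairs, one per 50 ms frame."""
    changes = 0          # x(4): number of direction changes
    distance = 0.0       # x(5): rough moving distance per Eq. 7
    prev_sign = 0
    for x3, mean_shift_hz in frames:
        sign = (x3 > 0) - (x3 < 0)
        if prev_sign != 0 and sign != 0 and sign != prev_sign:
            changes += 1
        if sign != 0:
            prev_sign = sign
        distance += (c / f_s) * abs(mean_shift_hz) * frame_s
    return changes, distance
```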

Fig. 8. Histogram of the x(5) distribution (horizontal axis: distance x(5) in Hz; vertical axis: quantity of gestures) for short- and long-distance gestures.

Fig. 9. Graphic representation of G1–G5: Towards and Away (long distance), Tap-Towards and Tap-Away (short distance), and the Both-Handed Seesaw gesture.

3 Gesture Definition and Classification

The designed gestures are not only accessible body language for HCI, but also easily distinguishable from each other.

3.1 Gesture Definition

Based on the proposed five features, we define a simple set that contains 11 gesture actions: G = {G1, G2, ..., Gj, ..., GN}, where N = 11 and Gj represents the jth gesture in the set. All gesture descriptions are listed in Table 1, and Fig. 9 shows the motion graphics of G1–G5, where G1 and G2 are long-distance gestures, while the tap gestures resemble mouse clicks and are therefore short-distance motions. The remaining gestures G6–G11 are compound gestures. Users need to perform gestures at a certain velocity; a constant velocity is not required, only that the instantaneous velocity reaches the velocity threshold, which users can adjust according to their own habits.

3.2 Hand Gesture Classification

In this section, we classify the gestures step by step based on different features until every gesture is categorized. The system first detects G5 (BHS), because it causes significant shifts on both sides of the tone peak simultaneously, a clear distinction from the remaining 10 gestures. Then, we classify the remaining 10 gestures using a classifier designed on multistage decision rules (Fig. 10). Table 2 lists the feature values of these 10 gestures.

Table 1. Definition of gestures

Number | Gesture | Description
G1  | T   | Towards: move hand towards the microphone for a long distance
G2  | A   | Away: move hand away from the device for a long distance
G3  | TT  | Tap-Towards: swipe hand towards then away, like clicking a mouse once, short and quick
G4  | TA  | Tap-Away: same action as G3, in the opposite direction
G5  | BHS | Both-Handed-Seesaw: move both hands from the two sides to the middle simultaneously, then separate
G6  | TtA | Towards-then-Away: swipe hand towards for a long distance, then away back to the origin
G7  | AtT | Away-then-Towards: same gesture as G6, only in the opposite direction
G8  | DTT | Double-Tap-Towards: perform G3 twice
G9  | DTA | Double-Tap-Away: perform G4 twice
G10 | TTT | Triple-Tap-Towards: perform G3 three times
G11 | TTA | Triple-Tap-Away: perform G4 three times
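The multistage decision rules of Fig. 10 and Table 2 can be sketched as a chain of feature tests. The following Python illustration assumes the detection stage has already gated out slow or far-range motions using θv and hupper and has recognized G5 from its two-sided shift; the exact rule ordering inside the paper's classifier may differ.

```python
def classify_gesture(x3_first_sign, x4_changes, x5_distance, d_ls=500):
    """Map the features of Table 2 to one of the 10 single-handed gestures.
    x3_first_sign: +1 if the first movement is towards the device, -1 if away;
    x4_changes: count of direction changes; x5_distance: distance measure;
    d_ls: the long/short boundary D_L/S used in the paper."""
    towards = x3_first_sign > 0
    long_distance = x5_distance >= d_ls
    if x4_changes == 0:                      # single long swipe
        return "T" if towards else "A"
    if x4_changes == 1:
        if long_distance:                    # there-and-back swipe
            return "TtA" if towards else "AtT"
        return "TT" if towards else "TA"     # single tap
    if x4_changes == 3:                      # double tap
        return "DTT" if towards else "DTA"
    if x4_changes == 5:                      # triple tap
        return "TTT" if towards else "TTA"
    return None                              # no rule matched
```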

4 Evaluation and Results

We evaluated the system performance experimentally. The system was developed on a laptop PC running Windows 10, using its built-in microphone and speaker without any customized hardware (Fig. 11), so the direction of any gesture performed by the user is the same with respect to the microphone and the speaker. Note that gestures within 0.8 m and people walking within 2 m of the computer can all cause significant frequency shifts; the experimental scene has a noise level of 45 dB.

4.1 Numerical Experiment

We conducted a numerical experiment to evaluate the robustness of the system in terms of three aspects: false positives, false negatives and classification accuracy.


Fig. 10. Identifying gestures with the classifier: when a gesture is detected, the classifier applies the features x(4), x(3) and x(5) in turn as the decision rule for each stage.

Fig. 11. Device deployment in the experimental environment.

Potential False Positives. A false alarm refers to a gesture being erroneously detected although no gesture was executed. Experiments were conducted in the following two common living environments. In the first, the user only sat in front of the computer for normal typing and thinking, while no one walked around; in half an hour, the number of potential false positives was 6, all of them single-tap actions, since these gestures are short and simple. In the second case, the user performed no actions, while three participants located about 1.5 m from the computer walked around for half an hour; the system finally detected 4 false positives, all of them caused by participants walking quickly.

Table 2. The list of features for all gestures

Gesture | x(3)    | x(4) | x(5)
T       | Towards | 0    | Long distance (L)
A       | Away    | 0    | L
TT      | Towards | 1    | Short distance (S)
TA      | Away    | 1    | S
TtA     | Towards | 1    | L
AtT     | Away    | 1    | L
DTT     | Towards | 3    | S
DTA     | Away    | 3    | S
TTT     | Towards | 5    | S
TTA     | Away    | 5    | S

False Negative. A false negative means that no gesture is detected although a deliberate gesture was actually performed. In our experiment, 10 users (marked U1–U10) participated and performed each gesture 100 times, resulting in 11,000 (100 × 11 × 10) gesture samples. Several false negative errors occurred in the process, as shown in Fig. 12. The false negative rates of nine users are all less than 1%, but the rate of U2 is as high as 1.1% (Fig. 12). Why? We found an interesting pattern: U2 tends to move four fingers in parallel instead of sliding the whole palm, resulting in a smaller frequency shift, which may explain the higher false negative rate.

Fig. 12. The false negative rate of each user during the gesture sample test.

Fig. 13. The confusion matrix of the gesture classification (overall recognition rate = 99.09%).


Classification Accuracy. The effective gestures successfully detected in the above experiment were then used to measure classification precision. Since the samples were all labeled, we could easily calculate the final classification accuracy (Fig. 13), which reaches 99.09%. A few samples were misidentified, mostly because of occasional confusion in deciding between long- and short-distance gestures: different people have their own preferences when performing hand gestures, so it is very difficult to classify gestures with 100% accuracy by choosing a single long/short distance threshold DL/S. However, this does not contradict our claim, because the experimental results show that our method already identifies gestures of different distances with very high accuracy.

4.2 Gesture Usability Test

Research on gesture usability focuses on five main principles [4,5,12]: learnability, efficiency, memorability, errors, and coverage. Among them, the low error rate (99.09% accuracy) and coverage (zero start-up cost and no training) have already been verified in Sect. 4.1. Next, we mapped the gesture set to a common remote controller, taking the MI smart TV remote controller as an example (Fig. 14). Each gesture operates one button; there are 11 buttons on the controller, corresponding to our 11 kinds of gestures. Ten users (U1–U10) performed gestures to simulate the remote controller and operate the MI TV freely. We collected a total of 151 gesture samples from the 10 users, of which 2 were missed detections and 1 was misidentified. We further recorded the user experience to evaluate the usability of the system for gesture classification.

Fig. 14. MI TV remote controller.

Every participant indicated that the system is particularly efficient, as they could smoothly operate the TV with high precision. Six participants remarked specifically on learnability, since they were asked to observe the demo and learn the gestures for only 2–3 minutes before operating the TV. Besides, eight participants described the gestures as memorable and learnable, since the meaning of the gestures is easy to understand, so they can remember and perform them easily. However, two participants noted that the gesture actions and the functions of the menu are not very closely related, which increases the memory burden.

Finally, our method shows better performance in many respects (Table 3) in comparison with the state of the art. A computer with one speaker and one microphone meets all our hardware requirements. In addition, the experiments do not require users to provide gesture samples in advance and involve no training. Meanwhile, the results of the numerical experiments verify that our system is robust: it not only produces few potential false positives, but also keeps the false negative rate within 1%, and finally achieves about 99% classification accuracy for the defined 11 gestures.

Table 3. Comparison to the existing sound-based methods

Methods                  | SoundWave [8] | Dolphin [16] | Multiwave [15] | Our method
Number of speakers       | 1             | 1            | ≥2             | 1
Needing training?        | NO            | YES          | YES            | NO
Improve false positives? | YES           | NO           | NO             | YES (>SoundWave)
Test false negatives?    | NO            | NO           | NO             | YES
Accuracy                 | 94.5%         | 93%          | 93.9%          | 99%

5 Conclusion

In this paper, we proposed a gesture set for HCI based on the Doppler effect. The sound field is generated by a speaker, and the signal reflected by the moving hand is captured by a microphone. We extract the five most robust features from the Doppler shift and classify a gesture set containing 11 gestures with a classifier based on multistage decision rules. Compared with the state of the art, the proposed features better mitigate the effects of potential false positives, and our method achieves high accuracy when classifying all gestures with no training. Finally, the experimental results illustrate that our gesture set performs very well in terms of usability, including high accuracy, few false positives, learnability, memorability and zero start-up cost.

Acknowledgment. We thank the participants for taking part in the user study. This work is partially supported by The National Key Research and Development Program of China (2016YFB0502201).


References

1. Ai, H., Men, Y., Han, L., Li, Z., Liu, M.: High precision gesture sensing via quantitative characterization of the doppler effect. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 973–978. IEEE (2016)
2. Asadzadeh, P., Kulik, L., Tanin, E.: Gesture recognition using RFID technology. Pers. Ubiquit. Comput. 16(3), 225–234 (2012)
3. Aumi, M.T.I., Gupta, S., Goel, M., Larson, E., Patel, S.: Doplink: using the doppler effect for multi-device interaction. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 583–586. ACM (2013)
4. Bevan, N., Curson, I.: Methods for measuring usability. In: Howard, S., Hammond, J., Lindgaard, G. (eds.) Human-Computer Interaction INTERACT 1997. ITIFIP, pp. 672–673. Springer, Boston, MA (1997). https://doi.org/10.1007/978-0-387-35175-9_126
5. Cabral, M.C., Morimoto, C.H., Zuffo, M.K.: On the usability of gesture interfaces in virtual reality environments. In: Proceedings of the 2005 Latin American Conference on Human-Computer Interaction, pp. 100–108. ACM (2005)
6. Chen, K.Y., Ashbrook, D., Goel, M., Lee, S.H., Patel, S.: Airlink: sharing files between multiple devices using in-air gestures. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 565–569. ACM (2014)
7. Fu, B., Karolus, J., Grosse-Puppendahl, T., Hermann, J., Kuijper, A.: Opportunities for activity recognition using ultrasound doppler sensing on unmodified mobile phones. In: Proceedings of the 2nd International Workshop on Sensor-based Activity Recognition and Interaction, p. 8. ACM (2015)
8. Gupta, S., Morris, D., Patel, S., Tan, D.: Soundwave: using the doppler effect to sense gestures. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1911–1914. ACM (2012)
9. Jeong, J., Jang, Y.: Max-min hand cropping method for robust hand region extraction in the image-based hand gesture recognition. Soft Comput. 19(4), 815–818 (2015)
10. Kellogg, B., Talla, V., Gollakota, S.: Bringing gesture recognition to all devices. NSDI 14, 303–316 (2014)
11. Molchanov, P., Gupta, S., Kim, K., Kautz, J.: Hand gesture recognition with 3D convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–7 (2015)
12. Nielsen, M., Störring, M., Moeslund, T.B., Granum, E.: A procedure for developing intuitive and ergonomic gesture interfaces for HCI. In: Camurri, A., Volpe, G. (eds.) GW 2003. LNCS (LNAI), vol. 2915, pp. 409–420. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24598-8_38
13. Paramonov, P., Sutula, N.: Simplified scoring methods for HMM-based speech recognition. Soft Comput. 20(9), 3455–3460 (2016)
14. Pittman, C., Wisniewski, P., Brooks, C., LaViola Jr., J.J.: Multiwave: doppler effect based gesture recognition in multiple dimensions. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 1729–1736. ACM (2016)
15. Pittman, C.R., LaViola Jr., J.J.: Multiwave: complex hand gesture recognition using the doppler effect. In: Proceedings of the 43rd Graphics Interface Conference, pp. 97–106. Canadian Human-Computer Communications Society (2017)


16. Qifan, Y., Hao, T., Xuebing, Z., Yin, L., Sanfeng, Z.: Dolphin: ultrasonic-based gesture recognition on smartphone platform. In: 2014 IEEE 17th International Conference on Computational Science and Engineering (CSE), pp. 1461–1468. IEEE (2014)
17. Rautaray, S.S., Agrawal, A.: Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev. 43(1), 1–54 (2015)
18. Seddon, N., Bearpark, T.: Observation of the inverse doppler effect. Science 302(5650), 1537–1540 (2003)
19. Suk, H.I., Sin, B.K., Lee, S.W.: Hand gesture recognition based on dynamic bayesian network framework. Pattern Recogn. 43(9), 3059–3072 (2010)
20. Xiao, Q., Siqi, L.: Motion retrieval based on dynamic Bayesian network and canonical time warping. Soft Comput. 21(1), 267–280 (2017)
21. Xiao, Q., Song, R.: Motion retrieval based on motion semantic dictionary and HMM inference. Soft Comput. 21(1), 255–265 (2017)

An Approach of Collecting Performance Anomaly Dataset for NFV Infrastructure

Qingfeng Du, Yu He, Tiandi Xie, Kanglin Yin, and Juan Qiu

School of Software Engineering, Tongji University, Shanghai, China; Software Engineering R&D Centre, Tongji University, Jishi Building, Shanghai, China
{du cloud,rainlf,xietiandi,14 ykl,Juan qiu}@tongji.edu.cn — https://github.com/XLab-Tongji

Abstract. Network Function Virtualization (NFV) technology is widely used in industry and academia. Meanwhile, it brings many challenges to the reliability of NFV applications, such as anomaly detection, anomaly localization and anomaly prediction, all of which require a large amount of anomaly data. This paper designs a method for collecting anomaly data from Infrastructure as a Service (IaaS) and constructs an anomaly database for NFV applications. Three types of anomaly datasets are created for anomaly studies: workload with performance data, faultload with performance data, and violations of the Service Level Agreement (SLA) with performance data. In order to better simulate anomalies in a production environment, we use Kubernetes to build a distributed environment, and a fault injection system is used to accelerate the occurrence of anomalies. Our aim is to provide more valuable anomaly data for reliability research in NFV environments.

Keywords: Anomaly database · NFV · Kubernetes · IaaS · Clearwater · Performance monitoring · Fault injection

1 Introduction

Network Function Virtualization (NFV) is becoming more and more popular, and many Communication Service Providers (CSPs) have begun to migrate applications to NFV environments [1]. Anomaly detection and anomaly localization are very important for providing better network services, and in some circumstances it is also necessary to predict anomalies. This requires analyzing the rules and connections in a large amount of anomaly data, but in a production environment the cost of collecting such data is high, so it is meaningful to collect anomaly data for research in an experimental environment.


At present, there are many databases for anomaly data, such as the KDD CUP 99 dataset (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), the NAB dataset (https://github.com/numenta/NAB) and the Yahoo Webscope S5 dataset (https://webscope.sandbox.yahoo.com/catalog.php?datatype=s). All of these can serve as benchmarks for evaluating anomaly detection algorithms, but they also have some restrictions, such as a single label and data redundancy. On this basis, we collect anomaly data from three different perspectives. In an NFV environment, the cause of a failure is not single, so multiple types of fault tags are necessary to describe different exceptions accurately. Our method uses a fault injection system to specify the fault type of the anomaly data, making the datasets better suited to multi-class classification problems in machine learning [2]. In addition, the malfunction of system resources can lead to system anomalies, and the workload pressure exerted by users can also lead to anomalous system behavior [3]. In a production environment, an increase in users may be a more important factor leading to anomalous service than the occurrence of hardware anomaly events, so our method also collects anomaly data under different workloads. In NFV applications, the typical quality-of-service index is the Service Level Agreement (SLA, https://en.wikipedia.org/wiki/Service-level_agreement); when a violation of the SLA occurs, it represents an anomalous service. Our method therefore also collects performance data under different SLA levels, which helps researchers analyze the relationship between the occurrence of an SLA violation and the IaaS performance data of a system. Finally, we propose several machine learning models based on supervised learning to detect the SLAs of VNFs and anomalies in IaaS, and compare the experimental results of each model. The comparison shows that our anomaly database has reference value for anomaly detection in a VNF environment.

The paper is organized as follows: Sect. 2 introduces the technical background and our related work on the construction of the anomaly database. Section 3 introduces the architecture of the data collection. Section 4 shows the implementation of our experiment. Section 5 provides a classical case study of the Clearwater project (http://www.projectclearwater.org/) and gives a detailed description of the building of the anomaly database. Finally, we summarize the contribution and discuss future work in Sect. 6.

2 Background and Related Work

With the development of Internet applications and the maturity of hardware virtualization, Infrastructure as a Service (IaaS) [4] provides the underlying hardware support for this architecture. It means network providers do not need to care about the details of the underlying hardware devices and can concentrate on providing upper-level services. In this context, Virtual Network Functions (VNFs) represent any virtual execution environment configured to provide a given network service. VNFs are often structured in several components, each one hosted on a single VM.

The existing anomaly databases collect a lot of anomaly data in different fields. The KDD CUP 99 dataset is used for network attack diagnosis. Each of its records indicates whether or not the system was under attack at that moment, which means there is only one label in the dataset: normal or anomaly. Even though Mahbod Tavallaee and his collaborators further optimized the KDD CUP 99 dataset into NSL-KDD, it still has the same limitation [5]. This paper provides a disturbance system to specify the type of fault load and to analyze the influence of different fault types on the performance of the tested system. Markus Thill presents a comparative study in which several online anomaly detection algorithms are compared on the large Yahoo Webscope S5 anomaly benchmark [6], but that dataset is better suited to time-series analysis and remains limited for classifying different faults. We present a new approach to collecting performance data with fault labels, which has advantages for the classification problems in anomaly detection. In this paper, we integrate the common single-fault time-series analysis problem and the multiple-fault classification problem in complex systems, propose the corresponding performance data collection system and disturbance system, and then establish varied datasets in our anomaly database, providing a reference for fault analysis in different scenarios. The details are shown on our site (https://github.com/XLab-Tongji).

3 Architecture of Data Collection

This section outlines the framework of our performance data collection. In order to accurately collect data with a fault-type label, the framework consists of three systems: the target application system (target system), the disturbance system and the performance monitoring system (monitoring system), as shown in Fig. 1.

3.1 Target System

The target system is an NFV application system, i.e. a software implementation of network functions that can be deployed on a network functions virtualization infrastructure (NFVI). The NFVI is the totality of all hardware and software components that build the environment where VNFs are deployed.

3.2 Disturbance System

The core function of the disturbance system is fault injection [7,8]; it is used to accelerate the occurrence of anomaly events in the target system, such as hardware performance bottlenecks and SLA violations. In this paper, we use the Linux stress tool stress-ng [9] to simulate system pressure and implement the fault injection function. In order to produce different types of disturbance, we inject the following fault types into the target system (a sketch of the corresponding stress-ng invocation is given below):

– CPU stress fault
– MEMORY stress fault
– IO stress fault

Each type of fault injection consumes the corresponding system resource as much as possible to ensure that anomaly events occur. In most situations, anomaly diagnosis of platforms or systems targets single-point failures [10], so we use a strategy that ensures only one type of disturbance occurs on only one virtual machine at a time. When a fault is injected, the disturbance system also records a log of the injection, including the start time, the duration, the fault type and the target virtual machine. After the monitoring system collects the performance data, these logs can be used to tag the data.

Fig. 1. Architecture of the performance data collection
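As a sketch of how such disturbances can be generated with stress-ng, the snippet below wraps three illustrative command lines (worker counts and memory sizes are our own choices, not the paper's settings); in the test bed the command would be executed on the selected target VM, e.g. over SSH.

```python
import subprocess

# Illustrative stress-ng templates for the three fault types (worker counts
# and sizes are examples, not the paper's exact settings).
FAULT_COMMANDS = {
    "cpu":    ["stress-ng", "--cpu", "0"],                     # load all CPUs
    "memory": ["stress-ng", "--vm", "2", "--vm-bytes", "90%"], # memory pressure
    "io":     ["stress-ng", "--io", "4"],                      # I/O pressure
}

def inject(fault_type, duration_s):
    """Run one disturbance of the given type for `duration_s` seconds on the
    local machine (on the test bed this would run on the chosen target VM)."""
    cmd = FAULT_COMMANDS[fault_type] + ["--timeout", f"{duration_s}s"]
    subprocess.run(cmd, check=True)
```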

3.3 Monitoring System

There are many mature IaaS-layer monitoring solutions at present, such as Zabbix (https://www.zabbix.com/), Nagios (https://www.nagios.org/) and Cacti (https://www.cacti.net/). Considering our experimental environment and the items to be monitored, we use Zabbix to monitor the system and collect performance data online. Zabbix is an enterprise open-source monitoring software for networks and applications with a client/server model; the Zabbix agent is installed in the VMs. Experience shows that agent-based monitoring is more accurate than agent-less monitoring and can describe the performance model of a system more precisely [11]. Table 1 shows the performance model used in our approach. Zabbix agents collect these metrics from the VMs and store them in the Zabbix MySQL database. We also offer a Java application to download these performance data through the RESTful API of the Zabbix server.

Table 1. Zabbix monitoring metrics
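A minimal sketch of downloading item history through the Zabbix JSON-RPC API with Python instead of the authors' Java application; the server address, credentials and item id are placeholders, and parameter names such as "user" in user.login may differ across Zabbix versions.

```python
import requests

ZABBIX_URL = "http://zabbix-server/api_jsonrpc.php"   # placeholder address

def zabbix_call(method, params, auth=None):
    """Minimal JSON-RPC call to the Zabbix API."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params,
               "auth": auth, "id": 1}
    resp = requests.post(ZABBIX_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]

token = zabbix_call("user.login", {"user": "Admin", "password": "zabbix"})
history = zabbix_call("history.get", {
    "itemids": ["23296"],        # hypothetical item id of one monitored metric
    "history": 0,                # 0 = numeric float values
    "time_from": 1521448560,
    "time_till": 1521458230,
    "output": "extend",
    "sortfield": "clock",
}, auth=token)
print(len(history), "samples downloaded")
```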

4 Implementation

This section presents the implementation of our test bed environment. It includes the infrastructure, the Kubernetes platform, the monitoring system, the attacker system and the clearwater-docker NFV application running on the Kubernetes platform, as shown in Fig. 2.

4.1 Infrastructure

The virtualized platform is a VMWare ESXi machine with 64 CPUs, 128 GB of memory and 2 TB of disk, which can host multiple virtual machines on one physical machine. In this paper, we create 10 VMs on it; every VM has 2 CPUs, 8 GB of memory and 20 GB of disk. The VMs are connected through a 1000 Mbps virtualized network and run the Docker environment (version 17.03.2-ce), which can run most Docker containers.


4.2 Kubernetes

Kubernetes is a powerful container management platform, and we use it to deploy the Clearwater project as described below. We use the Rancher scheme (https://rancher.com/) to deploy the Kubernetes platform on the VMs, because it makes the deployment easy. The installation steps are as follows:

1. Confirm that the network between the newly created virtual machines is working;
2. Select one host as the Rancher server host and deploy the latest version of the Rancher server Docker image on it;
3. Wait until the Rancher server is running correctly, then access the Rancher server page on port 80 of the host;
4. Create a new environment for the test bed based on the Kubernetes template;
5. Add all other VMs to this environment and wait for the Rancher server to add them to the Kubernetes platform automatically.

Fig. 2. Deployment of the test bed

4.3 Monitoring and Attack System

The monitoring system consists of the Zabbix server host and the Zabbix agents. A Zabbix agent was installed on each VM when it was created and connects to the Zabbix server through the web page configuration. Once the connection is set up, the agent begins to collect performance data and reports it to the server at a set time interval. The attacker host is also an independent host; it executes the attack scripts we provide to perform fault injection into the VMs.

4.4 NFV Application

The NFV application is a distributed computing system running an NFV workload. Here we utilise the Clearwater project, an open-source implementation of an IMS for cloud platforms. It provides SIP-based (Session Initiation Protocol) voice and video calling and messaging applications, and it implements the key standardized interfaces and functions of an IMS (except a core network), which enables industries to easily deploy, integrate and scale an IMS [3]. The Clearwater project is consequently well suited for NFV-related studies. It consists of about 10 components, each of which plays its own unique role in the system; the relationships between the components are shown in Fig. 3. Due to the Docker deployment scheme, every Clearwater Docker container is configured to allow unlimited use of host resources.

Fig. 3. Architecture of the clearwater project

Bono (Edge Proxy): The Bono nodes form a horizontally scalable SIP edge proxy providing both a SIP IMS Gm compliant interface and a WebRTC interface to clients. Client connections are load balanced across the nodes. The Bono node provides the anchor point for the client's connection to the Clearwater system, including support for various NAT traversal mechanisms. A client is therefore anchored to a particular Bono node for the duration of its registration, but can move to another Bono node if the connection or client fails.

Sprout (SIP Router): The Sprout nodes act as a horizontally scalable, combined SIP registrar and authoritative routing proxy, and handle client authentication and the ISC interface to application servers. The Sprout nodes also contain the built-in MMTEL application server.

Dime (Diameter Gateway): Dime nodes run Clearwater's Homestead and Ralf components. Homestead (HSS Cache) provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or pull the data from an IMS compliant HSS over the Cx interface. Ralf provides an HTTP API that both Bono and Sprout can use to report billable events that should be passed to the CDF (Charging Data Function) over the Rf billing interface.

Vellum (State Store): Vellum is used to maintain all long-lived state in the deployment. It does this by running a number of cloud-optimized, distributed storage clusters including Cassandra, etcd, Chronos and Memcached.

Homer (XDMS): Homer is a standard XDMS used to store MMTEL service settings documents for each user of the system.

Ellis: Ellis is a sample provisioning portal providing self sign-up, password management, line management and control of MMTEL service settings.

As introduced before, Bono, Sprout and Homestead are the core modules of the Clearwater project; they work together to control sessions initiated by users, so our data collection mainly focuses on these three modules. When the experiment begins, Clearwater runs normally to generate normal data, or runs overloaded to generate anomaly data. While the system is running normally, the attacker host can also execute attacks to disturb the system, producing anomaly data and recording the corresponding log. Meanwhile, the monitoring system monitors the VMs' performance metrics and collects all normal and anomaly data to establish the database.

5 Case Study

This section introduces a classic Clearwater case study. Starting from normal system operation, the system is disturbed by overload stress and by fault injection respectively to produce the anomaly datasets, and machine learning algorithms with good performance in anomaly detection [12–15] are selected to verify the usability of the datasets. In order to produce a normal workload, we use the officially recommended tool clearwater-sip-stress-coreonly (see the Clearwater stress testing documentation at https://clearwater.readthedocs.io/en/stable/). It controls the working stress of the system through three parameters:

– subscriber count: the number of subscribers to emulate;
– duration: the number of minutes to run stress for;
– multiplier: an optional multiplier for the VoLTE load profile (the default of 1 means 1.3 calls and 24 re-registers per subscriber per hour; passing 2 means 2.6 calls and 48 re-registers per subscriber per hour).

We chose 500 subscribers, 60 minutes and a multiplier of 450 for the experiment. At this point, the system can reach a 100% successful call rate; when the work stress continues to increase, the successful call rate begins to decline. We therefore mark this point as an engineering level point x, meaning the system is running at full workload under the current configuration.

5.1 Workload Module

As described above, we use the engineering level point x as the standard for producing workload, and test the performance data of the system under 0.8x, 1x, 1.5x, 2x and 2.5x pressure respectively. The structure of the collected dataset is shown in Table 2.

5.2 Faultload Module

In this paper, we focus on single-point faults, which means that at any time only one type of fault is injected into one VM. The 0.8x engineering level is chosen as the normal system workload so that the anomalies generated by fault injection can be observed easily. The process of fault injection is shown in Fig. 4.

Fig. 4. Fault injection process

Within a specified time period, the fault injection program selects a random fault type, a random target virtual machine and a random injection period to start a disturbance process. This process continues until the total time consumed by fault injection reaches the stipulated time period, as described in Algorithm 1. The disturbance system also records an injection log while injecting the fault; the key information includes the timestamp, the fault type, the target host and the injection duration. As described in Algorithm 2, we use the fault injection log to indicate which fault injection stage each performance data record belongs to: normal, cpu fault, memory fault or io fault. The result of this data processing is shown in Table 3.


Algorithm 1. Fault Inject Controller
Input: vm_list, inject_type_list, duration_list, duration
1: timer = 0
2: while timer < duration do
3:   inject_vm = random(vm_list)
4:   inject_type = random(inject_type_list)
5:   inject_duration = random(duration_list)
6:   timer += inject_duration
7:   inject(inject_vm, inject_type, inject_duration)
8:   sleep(pause)
9: end while
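A Python rendering of Algorithm 1, assuming an inject_fn callable (for example the stress-ng wrapper sketched in Sect. 3.2) is supplied by the caller; the pause between injections and the log format are illustrative.

```python
import random
import time

def fault_inject_controller(vm_list, inject_type_list, duration_list,
                            total_duration, inject_fn, pause=60):
    """Python rendering of Algorithm 1: keep picking a random VM, fault type
    and period until the injection time budget is spent, logging each step."""
    log = []
    timer = 0
    while timer < total_duration:
        vm = random.choice(vm_list)
        fault = random.choice(inject_type_list)
        duration = random.choice(duration_list)
        timer += duration
        log.append({"timestamp": int(time.time()), "fault": fault,
                    "vm": vm, "duration": duration})
        inject_fn(vm, fault, duration)     # single-point disturbance
        time.sleep(pause)                  # gap before the next injection
    return log
```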

In order to collect the anomalous SLA data, the workload module and the faultload module work together to disturb the system. We calculate the SLA level of the system from the percentage of successful requests (PSR). When PSR ≥ 90%, the system is in good condition, marked as level 2. When 50% ≤ PSR < 90%, the system is in an unhealthy condition, marked as level 1. When PSR < 50%, the system is in bad condition, marked as level 0. The structure of this dataset is shown in Table 4.

Table 2. Dataset A

Timestamp  | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | Workload level
1521448560 | 70%        | 73%        | ... | 69%        | 77%        | ... | 66%        | 69%        | ... | 1
1521448565 | 73%        | 73%        | ... | 68%        | 75%        | ... | 70%        | 74%        | ... | 1
...        | ...        | ...        | ... | ...        | ...        | ... | ...        | ...        | ... | ...
1521458230 | 98%        | 99%        | ... | 97%        | 100%       | ... | 95%        | 97%        | ... | 2

5.3 Dataset Verification

This part introduces four widely used machine learning algorithms, namely support vector machines, nearest neighbors, naive Bayes and random forests, and uses them to locate outliers in the system performance data.

Algorithm 2. Data Labeled Controller
Input: performance_data, injection_log
1: labeled_data = []
2: while performance_data.has_next() do
3:   data = performance_data.next()
4:   data_label = label(data, injection_log)
5:   labeled_data.append(data_label)
6: end while

Table 3. Dataset B

Timestamp | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | Normal | CPU | MEMORY | IO
152263940 | 70%        | 73%        | ... | 69%        | 77%        | ... | 66%        | 69%        | ... | 1      | 0   | 0      | 0
152263945 | 73%        | 73%        | ... | 68%        | 75%        | ... | 70%        | 74%        | ... | 1      | 0   | 0      | 0
152263950 | 73%        | 100%       | ... | 69%        | 79%        | ... | 72%        | 73%        | ... | 0      | 1   | 0      | 0
...       | ...        | ...        | ... | ...        | ...        | ... | ...        | ...        | ... | ...    | ... | ...    | ...
152267680 | 71%        | 74%        | ... | 70%        | 75%        | ... | 99%        | 72%        | ... | 0      | 0   | 0      | 1

Table 4. Dataset C

Timestamp  | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | SLA level
1521448560 | 90%        | 72%        | ... | 92%        | 74%        | ... | 85%        | 91%        | ... | 2
1521448565 | 85%        | 77%        | ... | 83%        | 75%        | ... | 73%        | 88%        | ... | 1
...        | ...        | ...        | ... | ...        | ...        | ... | ...        | ...        | ... | ...
1521458230 | 66%        | 68%        | ... | 92%        | 89%        | ... | 87%        | 79%        | ... | 0

Table 5. Validation results of anomaly dataset

Service   | Measure   | Nearest neighbors | SVM  | Naive Bayes | Random forest
Dataset A | Precision | 0.98              | 0.89 | 0.95        | 0.97
Dataset A | Recall    | 0.97              | 0.88 | 0.93        | 0.96
Dataset A | F1-score  | 0.97              | 0.87 | 0.93        | 0.98
Dataset B | Precision | 0.93              | 0.90 | 0.96        | 0.99
Dataset B | Recall    | 0.92              | 0.91 | 0.95        | 0.98
Dataset B | F1-score  | 0.93              | 0.89 | 0.97        | 0.99
Dataset C | Precision | 0.94              | 0.87 | 0.89        | 0.98
Dataset C | Recall    | 0.97              | 0.93 | 0.91        | 0.96
Dataset C | F1-score  | 0.96              | 0.92 | 0.94        | 0.97

There are 737 records in dataset A and in dataset B. We employed the first 80% of them as the training set; having trained the learning methods, the remaining 20% were used as the test set to validate each model. The validation results are shown in Table 5. The precision, recall and F1-score of each model reach high values, and because the datasets pose a multi-class classification problem, the random forest model achieves the best results.
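The validation procedure can be reproduced with scikit-learn along the following lines; the CSV file name and the label column are placeholders for the collected datasets, and the 80/20 split follows the description above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

df = pd.read_csv("dataset_a.csv")               # placeholder file name
X, y = df.drop(columns=["label"]), df["label"]  # "label" column is assumed

split = int(len(df) * 0.8)                      # first 80% train, last 20% test
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

models = {
    "Nearest neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="weighted")
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```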

6 Conclusion and Future Work

In this paper, we describe an approach to deploying the NFV application Clearwater through the Kubernetes platform. On this basis, we use a disturbance system and a monitoring system to collect performance data of IaaS-layer devices under an NFV application scenario and build an anomaly database. Three categories of anomaly datasets with specified labels are collected: workload with performance data, faultload with performance data and SLA level with performance data. The details of the anomaly database can be accessed on our website (https://github.com/XLab-Tongji/ADNFVI). Using several widely used machine learning algorithms, we verified these datasets and obtained high accuracy, which means the datasets have reference value for anomaly detection. In the future, we will investigate more anomaly scenarios and causes, and build corresponding anomaly datasets to analyze them. We hope this work provides guidance for the detection of anomalies in different scenarios.

References

1. Liu, J., Jiang, Z., Kato, N., Akashi, O., Takahara, A.: Reliability evaluation for NFV deployment of future mobile broadband networks. IEEE Wirel. Commun. 23(3), 90–96 (2016)
2. Pieters, M., Wiering, M.: Comparison of machine learning techniques for multi-label genre classification. In: Verheij, B., Wiering, M. (eds.) BNAIC 2017. CCIS, vol. 823, pp. 131–144. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76892-2_10
3. Sauvanaud, C., Lazri, K., Kaâniche, M., Kanoun, K.: Anomaly detection and root cause localization in virtual network functions. In: 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 196–206. IEEE (2016)
4. Bhardwaj, S., Jain, L., Jain, S.: Cloud computing: a study of infrastructure as a service (IAAS). Int. J. Eng. Inf. Technol. 2(1), 60–63 (2010)
5. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, pp. 1–6. IEEE (2009)
6. Thill, M., Konen, W., Bäck, T.: Online anomaly detection on the Webscope S5 dataset: a comparative study. In: 2017 Evolving and Adaptive Intelligent Systems (EAIS), pp. 1–8, May 2017
7. Natella, R., Cotroneo, D., Madeira, H.S.: Assessing dependability with software fault injection: a survey. ACM Comput. Surv. (CSUR) 48(3), 44 (2016)
8. Delvaux, J., Verbauwhede, I.: Fault injection modeling attacks on 65 nm arbiter and RO sum PUFs via environmental changes. IEEE Trans. Circuits Syst. I: Regular Papers 61(6), 1701–1713 (2014)
9. King, C.: Stress-ng (2018)
10. Wang, Y., Li, X.: Achieve high availability about point-single failures in OpenStack. In: 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 01, pp. 45–48, December 2015
11. Aversa, R., Panza, N., Tasquier, L.: An agent-based platform for cloud applications performance monitoring. In: 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, pp. 535–540, July 2015


12. Buczak, A.L., Guven, E.: A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 18(2), 1153–1176 (2016)
13. Iglesias, F., Zseby, T.: Analysis of network traffic features for anomaly detection. Mach. Learn. 101(1–3), 59–84 (2015)
14. Kulkarni, A., Pino, Y., French, M., Mohsenin, T.: Real-time anomaly detection framework for many-core router through machine-learning techniques. ACM J. Emerg. Technol. Comput. Syst. (JETC) 13(1), 10 (2016)
15. Erfani, S.M., Rajasegarar, S., Karunasekera, S., Leckie, C.: High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recogn. 58, 121–134 (2016)

An Axiomatization for BSP Algorithms

Yoann Marquer and Frédéric Gava

Laboratory of Algorithms, Complexity and Logic (LACL), University of Paris-East, Créteil, France
[email protected], [email protected]

Abstract. Gurevich's thesis stipulates that sequential abstract state machines (asms) capture the essence of sequential algorithms. On the other hand, the bulk-synchronous parallel (bsp) bridging model is a well-known model for hpc algorithm design: it provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine, and the assumptions of the bsp model thus provide portable and scalable performance predictions on most hpc systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture bsp algorithms.

Keywords: bsp · asm · Parallel algorithm · hpc · Postulates · Cost model

1 Introduction

1.1 Context of the Work

Nowadays, hpc (high performance computing) is the norm in many areas, but it remains more difficult to have well-defined paradigms and a common vocabulary than in the traditional sequential world. The problem arises from the difficulty of getting a taxonomy of computer architectures and frameworks: there is a zoo of definitions of systems, languages, paradigms and programming models. Indeed, in the hpc community, several terms can be used to designate the same thing, so that misunderstandings are easy. We can cite parallel patterns [5] versus algorithmic skeletons [8]; shared memory (pram) versus thread concurrency and direct remote access (drma); asynchronous send/receive routines (mpi, http://mpi-forum.org/) versus communicating processes (π-calculus). In the sequential world, it is easier to classify programming languages within their paradigm (functional, object oriented, etc.) or by using some properties of the compilers (statically or dynamically typed, abstract machine or native code execution). This is mainly due to the fact that there is an overall consensus on what sequential computing is. Formal semantics of sequential languages have often been studied, and there are now many tools for testing, debugging, cost analysis, software engineering, etc. In this way, programmers can implement sequential algorithms using these languages, which properly characterize the sequential algorithms.


This consensus is only informal: everyone intuitively agrees on what constitutes a sequential algorithm. And now, half a century later, there is a growing interest in formally defining the notion of algorithm [10]. Gurevich introduced an axiomatic presentation (largely machine independent) of the sequential algorithms in [10]. The main idea is that there is no language that truly represents all sequential algorithms: every algorithm textbook presents algorithms in its own way, and programming languages give too much detail. The axiomatic definition [10] of the algorithms has been mapped to the notion of abstract state machine (asm, a kind of Turing machine with the appropriate level of abstraction): every sequential algorithm can be captured by an asm. This allows a common vocabulary about sequential algorithms, and it has been studied by the asm community for several years.

A parallel computer, or a multi-processor system, is a computer composed of more than one processor (or unit of computation). It is common to classify parallel computers (flynn's taxonomy) by distinguishing them by the way they access the system memory (shared or distributed). Indeed, the memory access scheme heavily influences the programming method of a given system. Distributed memory systems are needed for computations using a large amount of data which does not fit in the memory of a single machine.

The three postulates for sequential algorithms are mainly consensual. Nevertheless, to our knowledge, there is no such work for hpc frameworks: first, due to the zoo of (informal) definitions, and second, due to a lack of realistic cost models of common hpc architectures. In hpc, the cost measurement is not based on the complexity of an algorithm but rather on the execution time, measured using empirical benchmarks. Programmers benchmark load balancing, communication (size of data), etc. Using such techniques, it is very difficult to explain why one code is faster than another and which one is more suitable for one architecture or another. This is regrettable because the community is failing to obtain a rigorous characterization of sub-classes of hpc algorithms. There is also a lack of study of the algorithmic completeness of hpc languages, which is the basis from which to specify what can or cannot be effectively programmed. Finally, taking into account all the features of all hpc paradigms is a daunting task that is unlikely to be achieved [9]. Instead, a bottom-up strategy (from the simplest models to the most complex) may be a solution that could serve as a basis for more general hpc models.

1.2  Content of the Work

Using a bridging model [20] is a first step towards this solution because it simplifies the task of algorithm design and programming, simplifies the reasoning about costs, and ensures a better portability from one system to another. A bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. We conscientiously limit our work to the bulk-synchronous parallel (bsp) bridging model [1,18] because it has the advantage of being endowed with a simple model of execution. We leave more complex models to future work. Moreover, there are many different libraries and languages for programming bsp algorithms, for example the bsplib for c [11] or java [17], bsml [?], pregel [12] for big data, etc.

Concurrent asms [3] try to capture the more general definition of asynchronous and distributed computations. We promote a rather different "bottom-up" approach consisting of restricting the model under consideration, so as to better highlight the algorithm execution time (which is often too difficult to assess for general models) and, more generally, to formalize the algorithms of a bridging model at their natural level of abstraction, instead of using a more general model and then restricting it with arbitrary hypotheses.

As a basis for this work, we first give an axiomatic definition of bsp algorithms (algoBSP) with only four postulates. Then we extend the asm model [10] of computation (asmBSP) for bsp. Our goal is to define a convincing set of parallel algorithms running in a predictable time and to construct a model that computes these algorithms only. This can be summarized by algoBSP = asmBSP. An interesting and novel point of this work is that the bsp cost model is preserved.

1.3  Outline

Many definitions used here are well known to the asm community. Recalling all of them would be too long, but they are available in the online technical report [22]. The remainder of this paper is structured as follows. In Sect. 2 we first recall the bsp model and define its postulates. Then, in Sect. 3, we give the operational semantics of asmBSP and the main result. Section 4 concludes, gives some related work and a brief outlook on future work.

2  Characterizing BSP Algorithms

2.1  The BSP Bridging Model of Computation

As the ram model provides a unifying approach that can bridge the worlds of sequential hardware and software, Valiant sought [20] a unifying model that could provide an effective (and universal) bridge between parallel hardware and software. A bridging model [20] reduces the gap between an abstract execution (programming an algorithm) and concrete parallel systems (using a compiler and designing/optimizing a physical architecture). The direct-mode bsp model [1,18] is a bridging model that simplifies the programming of various parallel architectures using a certain level of abstraction. The assumptions of the bsp model are to provide portable and scalable performance predictions on hpc systems. Without dealing with low-level details of hpc architectures, the programmer can thus focus on algorithm design only. The bsp bridging model describes a parallel architecture, an execution model for the algorithms, and a cost model which makes it possible to predict their performance on a given bsp architecture.


A bsp computer can be specified by p uniform computing units (processors), each capable of performing one elementary operation or accessing a local memory in one time unit. Processors communicate by sending a data to every other processor in g time units (the gap, which reflects network bandwidth inefficiency), and a barrier mechanism is able to synchronise all the processors in L time units ("latency", the ability of the network to deliver messages under a continuous load). Such values, along with the processor's speed (e.g. Mflops), can be empirically determined by executing benchmarks. The time g is thus the time for collectively delivering a 1-relation, which is a collective exchange where every processor receives/sends at most one word. The network can deliver an h-relation in time g × h.

A bsp computation is organized as a sequence of supersteps (see Fig. 1). During a superstep, the processors may perform computations on local data or send messages to other processors. Messages are available for processing at their destinations by the next superstep, and each superstep is ended with the barrier synchronisation of the processors.

(Fig. 1. A bsp super-step: on processors p0, p1, p2, p3, local computations are followed by communication, a barrier, and the next super-step.)

The execution time (cost) of a superstep s is the sum of the maximum of the local processing times, the data delivery time and the global synchronisation time. It is expressed by the following formula:

Cost(s) = w_s + h_s × g + L,  where w_s = max_{0≤i<p}(w_s^i) and h_s = max_{0≤i<p}(h_s^i),

with w_s^i the local computation time on processor i during superstep s and h_s^i the maximal number of words sent or received by processor i during that superstep.
Axiomatic Characterization of BSP Algorithms

Postulate 1 (Sequential Time). A bsp algorithm A is given by: 1. A set of states S(A); 2. A set of initial states I(A) ⊆ S(A); 3. A transition function τA : S(A) → S(A). We follow [10] in which states, as first-order structures, are full instantaneous descriptions of an algorithm. Definition 1 (Structure). A (first-order) structure X is given by: 1. A (potentially infinite) set U(X) called the universe (or domain) of X 2. A finite set of function symbols L(X) called the signature (language) of X 3. For every symbol s ∈ L(X) an interpretation sX such that: (a) If c has arity 0 then cX is an element of U(X) X (b) If f has an arity α 0 then f is an application: U(X)α → U(X)

76

Y. Marquer and F. Gava

In order to have a uniform presentation [10], we considered constant symbols in L(X) as 0-ary function symbols, and relation symbols R as their indicator function χR . Therefore, every symbol in L(X) is a function. Moreover, partial functions can be implemented with a special symbol undef , and we assume in this paper that every L(X) contains the boolean type (¬, ∧) and the equality. We also distinguish dynamic symbols whose interpretation may change from one state to another, and static symbols which are the elementary operations. Definition 2 (Term). A term of L(X) is defined by induction: 1. If c has arity 0, then c is a term 2. If f has an arity α > 0 and θ1 , . . . , θα are terms, then f (θ1 , . . . , θα ) is a term The interpretation θ

X

of a term θ in a structure X is defined by induction on θ: X

def

= cX 1. If θ = c is a constant symbol, then θ 2. If θ = f (θ1 , . . . , θα ) where f is a symbol of the language L(X) with arity α > 0 and θ1 , . . . , θα are terms, then θ

X

def

X

X

X

= f (θ1 , . . . , θα )

A formula F is a term with the particular form true|false|R (θ1 , . . . , θα ) |¬F X |(F1 ∧ F2 ) where R is a relation symbol (ie a function with output true or X false ) and θ1 , . . . , θα are terms. We say that a formula is true (resp. false) in X X X X if F = true (resp. false ). A bsp algorithm works on independent and uniform  computing units. Therefore, a state St of the algorithm A must be a tuple Xt1 , . . . , Xtp . To simplify, we annotate tuples from 1 to p and not from 0 to p − 1. Notice that p is not fixed for the algorithm, so A can have states using different size of “p-tuples” (informally p, the number of processors). In this paper, we will simply consider that this number is preserved during a particular execution. In other words: the size of the p-tuples is fixed for an execution by the initial state of A for such an execution.   If X 1 , . . . , X p is a state of the algorithm A, then the structures X 1 , . . . , X p will be called processors or local memories. The set of the independent local memories of A will be denoted by M (A). We now define the bsp algorithms as the objects verifying the four presented postulates. The computation for every processor is done in parallel and step by step. An execution of A is a sequence of states S0 , S1 , S2 , . . . such that S0 is an initial state and for every t ∈ N, St+1 = τA (St ). Instead of defining a set of final states for the algorithms, we will say that a state St of an execution is final if τA (St ) = St , that is the execution is: S0 , S1 , . . . , St−1 , St , St , . . . We say that an execution is terminal if it contains a final state. We are interested in the algorithm and not a particular implementation (eg, the variables’ names), therefore in the postulate we will consider the states up to multi-isomorphism. → − Definition 3 (Multi-isomorphism). ζ is a multi-isomorphism between two     → − states X 1 , . . . , X p and Y 1 , . . . , Y q if p = q and ζ is a p-tuple of applications

An Axiomatization for BSP Algorithms

77

ζ1 , . . . , ζp such that for every 1 ≤ i ≤ p, ζi is an isomorphism between X i and Y i . Postulate 2 (Abstract States). For every bsp algorithm A: 1. The states of A are p-tuples of structures with the same finite signature L(A); 2. S(A) and I(A) are closed by multi-isomorphism; 3. The transition function τA preserves p, the universes and commutes with multi-isomorphisms. For a bsp algorithm A, let X be a local memory of A, f ∈ L(A) be a dynamic α-ary function symbol, and a1 , . . . , aα , b be elements of the universe U(X). We say that (f, a1 , . . . , aα ) is a location of X, and that (f, a1 , . . . , aα , b) is an update on X at the location (f, a1 , . . . , aα ). For example, if x is a variable then (x, 42) is an update at the location x. But symbols with arity α > 0 can be updated too. For example, if f is a one-dimensional array, then (f, 0, 42) is an update at the location (f, 0). If u is an update then X ⊕ u is a new structure of signature L(A) and universe U(X) such that the interpretation of a function symbol f ∈ L(A) is:  → b if u = (f, − a , b) X⊕u − def → f (a) = X − → f ( a ) otherwise → − where we noted a = a1 , . . . , aα . For example, in X ⊕ (f, 0, 42), every symbol has the same interpretation than in X, except maybe for f because f X⊕(f,0,42)

X⊕(f,0,42)

(0) =

X

42 and f (a) = f (a) otherwise. We precised “maybe” because it may X be possible that f (0) is already 42. X → → If f (− a ) = b then the update (f, − a , b) is said trivial in X, because nothing − → → has changed. Indeed, if (f, a , b) is trivial in X then X ⊕ (f, − a , b) = X. If Δ is a set of updates then Δ is consistent if it does not contain two distinct updates with the same location. Notice that if Δ is inconsistent, then → → there exists (f, − a , b), (f, − a , b ) ∈ Δ with b = b and, in that case, the entire set of updates clashes:  → b if (f, − a , b) ∈ Δ and Δ is consistent X⊕Δ − def → f (a) = X − → f ( a ) otherwise If X and Y are two local memories of the same algorithm A then there exists Y → X → → a unique consistent set Δ = {(f, − a , b) | f (− a ) = b and f (− a ) = b} of non trivial updates such that Y = X ⊕ Δ. This Δ is called the difference between the two local memories, and is denoted by Y X.  →  − Let X = X 1 , . . . , X p be a state of A. According to the transition function → − → − → − τA , the next state is τA ( X ), which will be denoted by (τA ( X )1 , . . . , τA ( X )p ). − → → − def We denote by Δi (A, X ) = τA ( X )i X i the set of updates done by the i-th → − → − − → def − → − → processor of A on the state X , and by Δ(A, X ) = (Δ1 (A, X ), . . . , Δp (A, X ))

78

Y. Marquer and F. Gava

→ − → − the “multiset” of updates done by A on the state X . In particular, if a state X → − → − → − → − − → is final, then τA ( X ) = X , so Δ(A, X ) = ∅ . LetA be a bsp algorithm  and T be  a set of terms of L(A). We say that two states X 1 , . . . , X p and Y 1 , . . . , Y q of A coincide over T if p = q and for every 1 ≤ i ≤ p and for every t ∈ T we have t

Xi

=t

Yi

.

Postulate 3 (Bounded Exploration for Processors). For every bsp algo− → → − rithm A there exists a finite set T (A) of terms such that for every state X and Y , → − − → → − − → if they coincide over T (A) then Δ(A, X ) = Δ(A, Y ), i.e. for every 1 ≤ i ≤ p, − → − → we have Δi (A, X ) = Δi (A, Y ). T (A) is called the exploration witness [10] of A. If a set of terms T is finite then its closure by subterms is finite too. We assume that T (A) is closed by subterms and the symbol “true” should always be in the exploration witness [10]. The interpretations of the terms in T (A) are called the critical elements and we prove in [22] that every value in an update is a critical element:   Lemma 1 (Critical Elements). For every state X 1 , . . . , X p of A, ∀i 1 ≤ − → → → i ≤ p, if (f, − a , b) ∈ Δi (A, X ) then − a , b are interpretations in X i of terms in T (A).

That implies that for every step of the computation, for a given processor, only a bounded number of terms are read or written (amount of work).   Lemma 2 (Bounded Set of Updates). For every state X 1 , . . . , X p of the − → algorithm A, for every 1 ≤ i ≤ p, |Δi (A, X )| is bounded. Notice that for the moment we make no assumption on the communication between processors. Moreover, these three postulates are a “natural” extension of the ones of [10]. And by “natural”, we mean that if we assume that p = 1 then our postulates are exactly the same: Lemma 3 (A Single Processor is Sequential). A bsp algorithm with a unique processor (p = 1) is a sequential algorithm. Therefore algoseq ⊆ algoBSP . We now organize the sequence of states into supersteps. The communication between local memories occurs only during a communication phase. In order to do so, a bsp algorithm A will use two functions compA and commA indicating if A runs computations or if it runs communications. Postulate 4 (Supersteps phases). For every bsp algorithm A there exists two applications compA : M (A) → M (A) commuting with isomorphisms, and   commA : S(A) → S(A), such that for every state X 1 , . . . , X p : ⎧  ⎨ compA (X 1 ), . . . , compA (X p ) if ∃1 ≤ i ≤ p  1  such that compA (X i ) = X i τA X , . . . , X p =   ⎩ otherwise commA X 1 , . . . , X p

An Axiomatization for BSP Algorithms

79

A BSP algorithm is an object verifying these four  and we denote  postulates, by algoBSP the set of the bsp algorithms. A state X 1 , . . . , X p will be said in a computation phase if there exists 1 ≤ i ≤ p such that compA (X i ) = X i . Otherwise, the state will be said in a communication phase. This requires some remarks. First, at every computation step, every processor which has not terminated performs its local computations. Second, we do not specified the function commA in order to be generic about which bsp library is used. We discuss in Sect. 3.3 the difference between commA and the usual communication routines in the bsp community. → − → − → − Remember that a state X is said to be final if τA ( X ) = X . Therefore, → − according to the fourth postulate, X must be in a communication phase which is like a final phase that would terminate the whole execution as found in mpi. We prove that the bsp algorithms satisfy, during a computation phase, that every processor computes independently of the state of the other processors: Lemma 4 (No Communication during Computation Phases). For    every states X 1 , . . . , X p and Y 1 , . . . , Y q in a computing phase, if X i and − → − → Y j have the same critical elements then Δi (A, X ) = Δj (A, Y ). 2.3

Questions and Answers

Why not using a bsp-Turing machine to define an algorithm? It is known that standard Turing machines could simulate every algorithm. But we are here interested in the step-by-step behavior of the algorithms, and not the input-output relation of the functions. In this way, there is not a literal identity between the axiomatic point of view (postulates) of algorithms and the operational point of view of Turing machines. Moreover, simulating algorithms by using a Turing-machine is a low-level approach which does not describe the algorithm at its natural level of abstraction. Every algorithm assumes elementary operations which are not refined down to the assembly language by the algorithm itself. These operations are seen as oracular, which means that they produce the desired output in one step of computation. But I think there is too much abstractions: When using bsplib, messages received at the past superstep are dropped. Your function commA does not show this fact. We want to be as general as possible. Perhaps a future library would allow reading data received n supersteps ago as the BSP+ model of [19]. Moreover, the communication function may realize some computations and is thus not a pure transmission of data. But the exploration witness forbids doing whatever: only a finite set of symbols can be updated. And we provide a realistic example of such a function which mainly correspond to the bsplib’s primitives [22]. And why is it not just a permutation of values to be exchanged? The communications can be used to model synchronous interactions with the environment (input/output or error messages, etc.) and therefore make appear or disappear values.

80

Y. Marquer and F. Gava

And when using bsplib and other bsp libraries, I can switch between sequential computations and bsp ones. Why not model this kind of feature? The sequential parts can be modeled as purely asynchronous computations replicated and performed by all the processors. Or, one processor (typically the first one) is performing these computations while other processors are “waiting” with an empty computation phase. In [2,3,15,16], the authors give more general postulates about concurrent and/or distributed algorithms? Why not using their works by adding some restrictions to take into account the bsp model of execution? It is another solution. But we think that the restrictions on “more complex” postulates is not a natural characterization of the bsp algorithms. It is better for a model to be expressed at its natural level of abstraction in order to highlight its own properties. For example, there is the problematic of the cost model which is inherent to a bridging model like bsp: It is not clear how such restrictions could highlight the cost model. Fine. But are you sure about your postulates? I mean, are they completely (and not more) defined bsp algorithms? It is impossible to be sure because we are formalizing a concept that is currently only intuitive. But as they are general and simple, we believe that they correctly capture this intuitive idea. We prove in the next section that a natural operational model for bsp characterizes exactly those postulates. Would not that be too abstract? The bsp model is supposed to be a bridging model. We treat algorithms at their natural level of abstraction, and not as something to refine to machines: We explicitly assume that our primitives may not be elementary for a typical modern architecture (but could be so in the future) and that they can achieve a potentially complex operation in one step. This makes it possible to get away from a considered hardware model and makes it possible to calculate the costs in time (and in space) in a given framework which can be variable according to what is considered elementary. For example, in an Euclidean algorithm, it is either the Euclidean division that is elementary or the subtraction. If your bsp algorithm uses elementary operations which can not be realized on the bsp machine considered, then you are just not at the right level abstraction. Our work is still valid for any level of abstraction.

3

BSP-ASM Captures the BSP Algorithms

The four previous postulates define the bsp algorithms from an axiomatic viewpoint but that does not mean that they have a model, or in, other words, that they are defined from an operational point of view. In the same way that the model of computation asm captures the set of the sequential algorithms [10], we prove in this section that the asmBSP model captures the bsp algorithms.

An Axiomatization for BSP Algorithms

3.1

81

Definition and Operational Semantics of ASM-BSP

Definition 4 (ASM Program [10]) def

Π = f (t1 , . . . , tα ) := t0 | if F then Π1 else Π2 endif | par Π1 . . . Πn endpar where f has arity α; F is a formula; θ1 , . . . , θα , θ0 are terms of L(X). Notice that if n = 0 then par Π1 . . . Πn endpar is the empty program. If in if F then Π1 else Π2 endif the program Π2 is empty we will write simply if F then Π1 endif. An asm machine [10] is thus a kind of Turing machine using not a tape but an abstract structure X. Definition 5 (ASM Operational Semantics)  X X X def Δ(f (θ1 , . . . , θα ) := θ0 , X) = (f, θ1 , . . . , θα , θ0 ) def

Δ(if F then Π1 else Π2 endif, X) = Δ(Π

i , X) i = 1 if F is true on X where i = 2 otherwise def

Δ(par Π1 . . . Πn endpar, X) = Δ(Π1 , X) ∪ · · · ∪ Δ(Πn , X) Notice that the semantics of the par is a set of updates done simultaneously, which differs from an usual imperative framework. A state of a asmBSP machine is a p-tuple of memories (X 1 , . . . , X p ). We assume that the asmBSP programs are spmd (single program multiple data) which means that at each step of computation, the asmBSP program Π is executed individually on each processor. → − Therefore Π induces a multiset of updates Δ and a transition function τΠ :  def    − → Δ(Π, X 1 , . . . , X p ) = Δ(Π, X 1 ), . . . , Δ(Π, X p )   def   τΠ X 1 , . . . , X p = X 1 ⊕ Δ(Π, X 1 ), . . . , X p ⊕ Δ(Π, X p ) → − → − If τΠ ( X ) = X , then every processor has finished its computation steps. In that case we assume that there exists a communication function to ensure the communication between processors. Definition 6. An asmBSP machine M is a triplet (S(M ), I(M ), τM ) such that: 1. S(M ) is a set of tuples of structures with the same finite signature L(M ); S(M ) and I(M ) ⊆ S(M ) are closed by multi-isomorphism; 2. τM : S(M ) → S(M ) verifies that there exists a program Π and an application commM : S(M ) → S(M ) such that:  → − → − → − → − τΠ ( X ) if τΠ ( X ) = X τM ( X ) = → − commM ( X ) otherwise

82

Y. Marquer and F. Gava

3. commM verifies that: → − → − → − (1) For every state X such that τΠ ( X ) = X , commM preserves the universes and the number of processors, and commutes with multi-isomorphisms (2) There exists a finite set of terms T (commM ) such that for every state → − → − → − → − → − → − X and Y with τΠ ( X ) = X and τΠ ( Y ) = Y , if they coincide over → − − → → − − → T (commM ) then Δ(M, X ) = Δ(M, Y ). → − We denote by asmBSP the set of such machines. As before, a state X is said → − → − → − → − → − → − → − final if τM ( X ) = X . So if X is final then τΠ ( X ) = X and commM ( X ) = X . The last conditions about the communication function may seem arbitrary, but they are required to ensure that the communication function is not a kind of magic device. For example, without these conditions, we could imagine that commM may compute the output of the algorithm in one step, or solve the halting problem. Moreover, we construct an example of commM in [22] (Section D). 3.2

The BSP-ASM Thesis

We prove that asmBSP captures the computation phases of the bsp algorithms in three steps. First, we prove that during an execution, each set of updates is the interpretation of an asm program (Lemma 8 p.16 [22]). Then, we prove an equivalence between these potentially infinite number of programs (Lemma 9 p.17). Finally, by using the third postulate, we prove in Lemma 10 p.18 that there is only a bounded number of relevant programs, which can be merged into a single one. Proposition 1 (BSP-ASMs capture Computations of BSP Algorithms). For every bsp algorithm A, there exists an asm program ΠA such → − → − − → → − − → that for every state X in a computation phase: Δ(ΠA , X ) = Δ(A, X ). Theorem 1. algoBSP=asmBSP (The proof is available in [22], Section C p.20). 3.3

Cost Model Property and the Function of Communication

There is two more steps in order to claim that asmBSP objects are the bsp bridging model algorithms: (1) To ensure that the duration corresponds to the standard cost model and; (2) To solve issues about the communication function. Cost Model. If the execution begins with a communication, we assume that − → no computation is done for the first superstep. We remind that a state Xt is in a computation phase if there exists 1 ≤ i ≤ p such that compA (Xti ) = Xti . The computation for every processor is done in parallel, step by step. So, the cost in def time of the computation phase is w = max1≤i≤p (wi ), where wi is the number of steps done by the processor i (on processor X i ) during the superstep. Then the state is in a communication phase, when the messages between the processors are sent and received. Notice that commA may require several

An Axiomatization for BSP Algorithms

83

steps in order to communicate the messages, which contrasts with the usual approach in bsp where the communication actions of a superstep are considered as one unit. But this approach would violate the third postulate, so we had to consider a step-by-step communication approach, then consider these actions as one communication phase. asmBSP exchanges terms and we show in [22] how formally define the size of terms. But we can imagine a machine that must further decompose the terms in order to transmit them (in bits for example). We just assume that the data are communicable in time g for a 1-relation. So, during the superstep, the communication phase requires h × g steps. It remains to add the cost of the synchronization of the processors, which is assumed in the usual bsp model to be a parameter L. Therefore, we obtained a cost property which is sound with the standard bsp cost model. A Realization of the Communication. An example of a communication function for the standard bsplib’s primitives bsp_get, bsp_put, bsp_send bsp_move is presented in [22] (Section D). Proposition 2 (Communication). A function of communication, with routines for distant readings/writings and point-to-point sendings, performing an h-relation and requiring at most h exchanges can be designed using asm. One may argue that the last postulate allows the communication function to do computations. To avoid it, we assume that the terms in the exploration witness T (M ) can be separated between T (Π) and T (commM ) such that T (Π) → is for the states in a computation phase, and that for every update (f, − a , b) i of a processor X in a communication phase, either there exists a term t ∈ Xi T (commM ) such that b = t , or there exists a variable v ∈ T (Π) and a processor Xi

X j such that b = tvX j (representation presented in Section D p.24). To do a computation, a term like x+1 is required, so the restriction to a variable prevents the computations of the terms in T (Π). Or course, the last communication step should be able to write in T (Π), and the final result should be read in T (Π).

4 4.1

Conclusion and Future Work Summary of the Contribution

A bridging model provides a common level of understanding between hardware and software engineers. It provides software developers with an attractive escape route from the world of architecture-dependent parallel software [20]. The bsp bridging model allows the design of “immortal” (efficient and portable) parallel algorithms using a realistic cost model (and without any overspecification requiring the use of a large number of parameters) that can fit most distributed architectures. It has been used with success in many domains [1]. We have given an axiomatic definition of bsp algorithms by adding only one postulate to the sequential ones for sequential algorithms [10] which has been

84

Y. Marquer and F. Gava

widely accepted by the scientific community. Mainly this postulate is the call of a function of communication. We abstract how communication is performed, not be restricting to a specific bsp library. We finally answer previous criticisms by defining a convincing set of parallel algorithms running in a predictable time. Our work is relevant because it allows universality (immortal stands for bsp computing): all future bsp algorithms, whatever their specificities, will be captured by our definitions. So, our asmBSP is not just another model, it is a class model, which contains all bsp algorithms. This small addition allows a greater confidence in this formal definition compared to previous work: Postulates of concurrent asms do not provide the same level of intuitive clarity as the postulates for sequential algorithms. But our work is limited to bsp algorithms even if it is still sufficient for many hpc and big-data applications. We have thus revisited the problem of the “parallel ASM thesis” i.e., to provide a machine-independent definition of bsp algorithms and a proof that these algorithms are faithfully captured by asmBSP . We also prove that the cost model is preserved which is the main novelty and specificity of this work compared to the traditional work about distributed or concurrent asms. 4.2

Questions and Answers About this Work

Why do you use a new model of computation asmBSP instead of asmsonly? Indeed, each processor can be seen as a sequential asm. So, in order to simulate one step of a bspalgorithm using several processors, we could use pids to compute sequentially the next step for each processor by using an asm. Even if such a simulation exists between these two models, what you mean, a “sequentialization” (each processor, one after the other) of the bsp model of execution, cannot be exactly the function of transition of the postulates. Moreover, in order to stay bounded, having p exploration witness (one for each sequential asm) induces p to be a constant for the algorithm. In our work, p is only fixed of each execution, making the approach more general when modeling algorithms. Is another model possible to characterize the bsp algorithms? Sure. This can be more useful for proving some properties. But that would be the same set, just another way to describe it. So, reading the work of [3], a distributed machine is defined as a set of pairs (a, Πa ) where a is the name of the machine and Πa a sequential asm. Reading your definition, I see only one Π and not “p” processors as in the bsp model. I thus not imagine a bsp computer as it is. You are absolutely right but we do not model a bsp computer, our work is about bsp algorithms. The asmBSP program contains the algorithm which is used on each “processor” (a first-order structure as explain before). These are the postulates (axiomatic point of view) that characterize the class of bsp algorithms rather than a set of abstract machines (operational point of view). That is closer to the original approach [10]. We also want to point out that, unlike [3], we are not

An Axiomatization for BSP Algorithms

85

limited to a finite (fixed) set of machines: In our model, an algorithm is defined for p = 1, 2, 1000, etc. And we are not limited to point-to-point communications. Ok, but with only a single code, you cannot have all the parallel algorithms... We follow [4] about the difference between a PARallel composition of SEQuential actions (PAR of SEQ) and a SEQuential composition of PARallel actions (SEQ of PAR). Our asmBSP is SEQ(PAR). This leads to a macroscopic point of view1 which is close to a specification. Being a SEQ(PAR) model allows a high level description of the bsp algorithms. So, why are you limited to spmd computations? Different codes can be run by the processors using conditionals on the “id” of the processors. For example “if pid=0 then code1 else code2” for running “code1” (e.g. master part) only on processor 0. Again, we are not limited to spmd computations. The asm program Π fully contains the bsp algorithm, that is all the“actions” that can be performed by any processors, not necessarily the same instructions: Each processor picks the needed instruction to execute but there could be completely different. Only the size of Π is finite due to the exploration witness. For example, it is impossible to have a number of conditionals in Π that depends of p. Indeed, according to Lemma 4, during a computation phase, if two processors coincide over the exploration witness, then they will use the same code. And according to Postulate 3, the exploration witness is bounded. So, there exists only a bounded number c of possible subroutines during the computation phase, even if pc. Notice that processors may not know their own ids and there is no order in p-tuples; We never use such a property: Processors are organized like a set and we use tuples only for convenience of notation. We are using p-tuples just to add the bsp execution model in the original postulates of [10]. Ok, but I cannot get the interleavings of the computations as in [3]? Your model seems very synchronous! The bsp model makes the hypothesis that the processors are uniform. So if one processor can perform one step of the algorithm, there is no reason to lock it just to highlight an interleaving. And if there is nothing to do, it does nothing until the phase of communication. Our execution model is thus largely “asynchronous” during the computation phases. Speaking about communication, why apply several times the function of communication? When designing a bsp algorithm, I use once a collective operation! An asm is like a Turing machine. It is not possible to perform all the communications in a single step: The exploration witness forbids doing this. Our function of communication performs some exchanges until there are no more.

1

Take for example a bsp sorting algorithm: First all the processors locally sort there own data, and then, they perform some exchanges in order to have the elements sorted between them. One defines it as a sequence of parallel actions and being also independent to the number of processors.

86

Y. Marquer and F. Gava

What happens in case of runtime errors during communications? Typically, when one processor has a bigger number of super-steps than other processors, or when there is an out-of-bound sending or reading, it leads to a runtime error. The bsp function of communication can return a ⊥ value. That causes a stop of the operational semantics of the asmBSP . 4.3

Related Work

As far as we know, some work exists to model distributed programs using asms [15] but none to convincingly characterize bsp algorithms. In [6], authors model the p3l set of skeletons. That allows the analyze of p3l programs using standard asm tools but not a formal characterization of what p3l is and is not. The first work to extend asms for concurrent, distributed, agent-mobile algorithms is [2]. Too many postulates are used making the comprehension hard to follow or worse (loss of confidence). A first attempt to simplify this work has been done in [16] and again simplified in [7] by the use of multiset comprehension terms to maintain a kind of bounded exploration. Then, the authors prove that asms captures these postulates. Moreover, we are interested in distributed (hpc) computations more than parallel (threading) asms. We want to clarify one thing. The asm thesis comes from the fact that sequential algorithms work in small steps, that is steps of bounded complexity. But the number of processors (or computing units) is unbounded for parallel algorithms, which motivated the work of [2] to define parallel algorithms with wide steps, that is steps of unbounded complexity. Hence the technicality of the presentation, and the unconvincing attempts to capture parallel algorithms [3]. Extending the asms for distributed computing is not new [3]. We believe that these postulates are more general than ours but we think that our extension still remains simple and natural for bsp algorithms. The authors are also not concerned about the problem of axiomatizing classes of algorithms using a cost model which is the heart of our work and the main advantage of the bsp model. 4.4

Future Work

This work leads to many possible work. First, how to adapt our work to a hierarchical extension of bsp [21] which is closer to modern hpc architectures? Second, bsp is a bridging model between hardwares and softwares. It could be interesting to study such a link more formally. For example, can we prove that the primitives of a bsp language can truly “be bsp” on a typical cluster architecture? Thirdly, we are currently working on extending the work of [13] in order to give the bsp algorithmic completeness of a bsp imperative programming language. There are some concrete applications: There are many languages having a bsp-like model of execution, for example pregel [12] for writing large-graph algorithms. An interesting application is proving which are bsp algorithmically complete and are not. bsplib programs are intuitively bsp. mapreduce is a

An Axiomatization for BSP Algorithms

87

good candidate to be not [14]. Similarly, one can imagine proving which languages are too expressive for bsp. mpi is intuitively one of them. Last, the first author is working on postulates for more general distributed algorithm ` a la mpi. In any case, studying the bsp-ram (such as the communication-oblivious of [19]) or mapreduce, would led to define subclasses of bsp algorithms.

References 1. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004) 2. Blass, A., Gurevich, Y.: Abstract state machines capture parallel algorithms. ACM Trans. Comput. Log. 4(4), 578–651 (2003) 3. B¨ orger, E., Schewe, K.-D.: Concurrent abstract state machines. Acta Inf. 53(5), 469–492 (2016) 4. Boug´e, L.: The data parallel programming model: a semantic perspective. In: Perrin, G.-R., Darte, A. (eds.) The Data Parallel Programming Model. LNCS, vol. 1132, pp. 4–26. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-617361 40 5. Cappello, F., Snir, M.: On communication determinism in HPC applications. In: Computer Communications and Networks (ICCCN), pp. 1–8. IEEE (2010) 6. Cavarra, A., Zavanella, A.: A formal model for the parallel semantics of p3l. In: ACM Symposium on Applied Computing (SAC), pp. 804–812 (2000) 7. Ferrarotti, F., Schewe, K.-D., Tec, L., Wang, Q.: A new thesis concerning synchronised parallel computing –simplified parallel ASM thesis. Theor. Comput. Sci. 649, 25–53 (2016) 8. Gonz´ alez-V´elez, H., Leyton, M.: A survey of algorithmic skeleton frameworks. Softw. Pract. Exp. 40(12), 1135–1160 (2010) 9. Gorlatch, S.: Send-receive considered harmful: myths and realities of message passing. ACM TOPLAS 26(1), 47–56 (2004) 10. Gurevich, Y.: Sequential abstract-state machines capture sequential algorithms. ACM Trans. Comput. Log. 1(1), 77–111 (2000) 11. Hill, J.M.D., McColl, B., et al.: BSPLIB: the BSP programming library. Parallel Comput. 24, 1947–1980 (1998) 12. Malewicz, G., et al.: pregel: a system for large-scale graph processing. In: Management of data, pp. 135–146. ACM (2010) 13. Marquer, Y.: Algorithmic completeness of imperative programming languages. Fundamenta Informaticae, pp. 1–27 (2017, accepted) 14. Pace, M.F.: BSP vs MAPREDUCE. Procedia Comput. Sci. 9, 246–255 (2012) 15. Prinz, A., Sherratt, E.: Distributed ASM- pitfalls and solutions. In: Ait Ameur, Y., Schewe, K.D. (eds.) ABZ 2014. Lecture Notes in Computer Science, vol. 8477, pp. 210–215. Springer, Heidelberg (2014) 16. Schewe, K.-D., Wang, Q.: A simplified parallel ASM thesis. In: Derrick, J., et al. (eds.) ABZ 2012. LNCS, vol. 7316, pp. 341–344. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30885-7 27 17. Seo, S., et al.: HAMA: an efficient matrix computation with the MAPREDUCE framework. In: Cloud Computing (CloudCom), pp. 721–726. IEEE (2010) 18. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and answers about BSP. Sci. Program. 6(3), 249–274 (1997)

88

Y. Marquer and F. Gava

19. Tiskin, A.: The design and analysis of bulk-synchronous parallel algorithms. PhD thesis. Oxford University Computing Laboratory (1998) 20. Valiant, L.G.: A bridging model for parallel computation. Comm. ACM 33(8), 103–111 (1990) 21. Valiant, L.G.: A bridging model for multi-core computing. J. Comput. Syst. Sci. 77(1), 154–166 (2011) 22. Marquer, Y., Gava, F.: An ASM thesis for BSP. Technical report (2018). https:// hal.archives-ouvertes.fr/hal-01717647

Efficient and Secure Outsourced Linear Regression Haomiao Yang(B) , Weichao He(B) , Qixian Zhou(B) , and Hongwei Li School of Computer Science and Engineering and Center for Cyber Security, University of Electronic Science and Technology of China, Chengdu, China {haomyang,hongweili}@uestc.edu.cn, [email protected], [email protected]

Abstract. The linear regression, as a classical machine learning algorithm, is often used to be a predictor. In the era of big data, the data owner can outsource their linear regression task and data to the cloud server, which has powerful calculation and storage resources. However, outsourcing data may break the privacy of the data. It is a well-known method to encrypt them prior to uploading to the cloud by using the homomorphic encryption (HE). Nevertheless, it is a difficult problem to apply the linear regression protocol in the encrypted domain. With this observation, we propose an efficient and secure linear regression protocol over outsourced encrypted data by using the vector HE, named ESLR, and in our protocol, we further present a privacy-preserving gradient descent method. Security analysis shows that our protocol can guarantee the confidentiality of data. And compared to the linear regression over plaintexts, our proposal can achieve almost the same accuracy and efficiency over ciphertexts. Keywords: Machine learning · Homomorphic encryption Linear regression · Gradient descent

1

Introduction

Predictive modeling is an essential tool in decision making processes in domains such as policy making, medicine, law enforcement, and finance. Considering a hospital would like to use a cloud service which provide predictive service to analyze the patient’s condition so as to improve the quality of care and reduce costs. Due to ethical and legal requirements, the hospital might be restricted to use such service [3,4,12]. Like the hospital, many organizations are collecting ever-increasing data for mining to improve decision-making and productivity. However, they may have no powerful resources to deal with such large-scale data. To solve this problem, an attractive business model is that a service provider, which has powerful platforms and advanced analytic skills, provides such services. Organizations who need the calculation resource can outsource their computational tasks to such powerful service providers. However, because the data c Springer Nature Switzerland AG 2018  J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 89–102, 2018. https://doi.org/10.1007/978-3-030-05057-3_7

90

H. Yang et al.

may contain sensitive information, outsourcing data to public clouds directly raises privacy concerns. In current implementations, the learning algorithm must see all user data in the clear in order to build the predictive model. In this paper, we consider whether the learning algorithm can operate in encrypted domains, thereby allowing users to retain control of their data. For medical data, this allows for a model to be built without affecting user privacy. For the book and movie preferences, letting users keep control of their data, can reduce the risk of future unexpected embarrassment in case of a data breach at the service provider. Roughly speaking, there are three existing approaches to essure the privacy when the server mines the user data. The first lets users split their data among multiple servers by using secure multi-party computation [2,5,9]. These servers, then, run the learning algorithm using a distributed protocol. Privacy is assured as long as a majority of servers do not collude. The second is based on differential privacy protection, where the learning algorithm is executed over data containing noise [6,7,13]. And the third is based on homomorphic encryption, where the learning algorithm is executed over encrypted data [15]. Distributed linear regression is not suitable for outsourced model. In distributed linear regression, every party must take part in computation. Consequently, the secure multi-party computation may be inefficient. In addition, there may be a great loss of accuracy and can not fully guarantee the security of data by using the differential privacy protection. In this work, we choose homomorphic encryption for our privacy-preserving machine learning algorithm. As we know, homomorphic encryption (HE) allows operations on encrypted data, which provides a possible solution for linear regression over ciphertexts. In our work, we propose an efficient and secure linear regression protocol over encrypted data for outsourced environments, namely ESLR, where the cloud performs linear regression processing over encrypted data. The challenge is how to apply the linear regression algorithm over ciphertexts, while maintaining high accuracy and performance. To address these challenges, we exploit the vector HE (VHE) recently presented by Zhou and Wornell [17]. Unlike the fully HE (FHE), VHE only needs to support somewhat homomorphic encryption. As a result, it is much more efficient than many existing FHE schemes. For example, it is orders of magnitude faster than HELib [10], which is a very famous FHE implementation. Especially by designing ingeniously, VHE can be used in the privacy-preserving gradient descent. Different from existed works, our contributions are twofold as follows. (1) Firstly, ESLR reconstructs linear regression clustering process in the domain of ciphertext by taking advantage of the vector encryption, which allows low computation and communication cost. What’s more, we proposed a scheme that can apply privacy-preserving gradient descent method over ciphertext domain efficiently. To our best knowledges, it’s very efficient for the optimization algorithm over encrypted data. Experiments shows that ESLR achieves almost the same accuracy compared to the plaintext algorithm.

Efficient and Secure Outsourced Linear Regression

91

(2) Secondly, security analysis demonstrates that ESLR achieves the confidentiality of data, ensuring the privacy of the data owner. In addition, we give the definition of loss function, which is needed for optimization over ciphertext domain. This paper is organized as follows: The problem formulation is described in Sect. 2. The constructions of linear regression protocol are proposed in Sect. 3, followed by further discusses in Sect. 4. Then we give the security analysis and performance evaluation in Sects. 5 and 6, respectively. Finally, the conclusion is presented in Sect. 7.

2

Problem Statement

In this section, we give the problem statement, including system model and threat model, design goals, notations and preliminaries. 2.1

System Model and Threat Model

We give our system model concentrating on how to achieve secure liner regression over encrypted data in outsourced environments. As shown in Fig. 1, we proposed a classical outsourced system model, mainly consisting of two parties. The one is the data owner, and the other is the service provider. We primarily consider the service provider as an “honest-but-curious” server in our model. We   assume the public matrix H and encrypted data D (D is the encryption of the data D) have been outsourced to the cloud, and the confidentiality of the data

Cloud Server

Public matrix H

Data D

Data owner

Authorization of S

Fig. 1. System model

Data user

92

H. Yang et al.

will be protected by the underlying encryption primitive. After that, the server  will implement the regression algorithm based on D . That is, the data owner  outsources his encrypted data D , and the service provider runs the proposed  protocol over D . Finally the service provider returns the predicted results to the data owner. 2.2

Design Goals

The overarching goal is to enable liner regression algorithm to be performed over encrypted data. What’s more, for an efficient and secure liner regression protocol, we consider the following requirements to be necessary. – Accuracy: Enable secure linear regression over encrypted data in outsourced environments and achieve high accuracy. – Security: Protect privacy of linear regression process. – Efficiency: Process large amount of data with practical performance. 2.3

Overview of Standard Linear Regression and Gradient Descent

In this section, we give a brief introduction about standard linear regression algorithm [16]. In statistics, linear regression equation is a regression analysis using least square function to find the relationship between one or more independent variables and dependent variables. This function is a linear combination of one or more model parameters called regression coefficients. Linear regression with only one independent variable is called simple regression, and it is called multiple regression with greater than one independent variable. Like all forms of regression analysis, linear regression also focuses on the probability distribution of x and y. Given a random sample (xi1 , xi2 , ..., xip , yi ), we have one hypothetical regression output yi , and hypothetical regression inputs xi1 , xi2 , ..., xip . So a multivariate linear regression model is expressed as yi = w1 x1 + w2 x2 + · · · + wd xd + b. For a data set D = [(x 1 , y1 ), (x 2 , y2 ), · · · , (x n , yn )], the goal of linear regression is to get the regression coefficients θ = [w1 , w2 , · · · , wd , b] such that the loss function get the minimum value. We define the loss function as J(θ) = (

n 1  T  ) (θ x i − yi )2 . 2n i=1

Further, we formulate the problem as Algorithm 1. The gradient descent method [8] is one of the iterative methods, which can be used to solve the least squares problem. Gradient descent is one of the most commonly used methods in solving the model parameters of machine learning algorithm (unconstrained optimization problem). The other method commonly used is the least square method. When solving the minimum value of the loss function, we can get the minimum value of loss function and the model parameters through the gradient descent method.

Efficient and Secure Outsourced Linear Regression

93

Algorithm 1. Standard linear regression Input: data set D = {(x 1 , y1 ), (x 2 , y2 ),· · · , (x n , yn )} and threshold t Output: θ = [w1 , w2 , · · · , wd , b]  T 2 1 1: Define the loss function J(θ) = ( 2n ) n i=1 (θ x i − yi ) 0 2: Generating the θ randomly 3: repeat ) , where θ k is the value of kth iteration and α is the iteration 4: θ k = θ k−1 − α ∂J(θ ∂θ step. 5: until J(θ k+1 ) − J(θ k ) < t 6: return θ

2.4

Notations and Preliminaries

In this section, we review the preliminaries that are necessary for our work. First, we give notations used throughout the paper as illustrated in Table 1. Table 1. Notations Notation Meaning a

To round a to the nearest integer, for a ∈ R

a

To round each entry ai to the nearest integer, for a vector a ∈ Rn

a



To be a binary representation for a vector a ∈ Zn

We outline the VHE scheme as suggested by Zhou and Wornell [17] that encrypts integer vectors to allow computation of arbitrary polynomials in the encrypted domain. For our purpose of ESLR, we only consider the fundamental operations below and more details are referred to [17]. – VHE.KG(λ): Input a security parameter λ, choose l, m, n, p, q, w ∈ Z, and the distribution χ where l = log2 (q − 1), w(p − 1) < q, q  p, and m < n, construct S = [I , T ] ∈ Zm×n with I ∈ Zm×m as the identity matrix, and output the secret key S and the public parameters Param = (l, m, n, p, q, w, χ). – VHE.E (x,S ): Input a secret key S ∈ Zm×n and a plaintext vector x ∈ Zm , output a ciphertext c ∈ Zn that satisfies Sc = wx + e where w is a large integer, | S | w, and e is an error term with |e| < w/2. – VHE.D(c,S ): Input a ciphertext vector c ∈ Zn and a secret key S ∈ Zm×n , output a plaintext x ∈ Zm that satisfies x = Sc/w. For the VHE scheme, the key switching is an important operation in the   encrypted domain. Given two secret keys S ∈ Zm×n and S ∈ Zm×n , and

94

H. Yang et al.

the ciphertext c ∈ Zn which decrypts to the plaintext x ∈ Zm with S , we calcu   late a matrix M ∈ Zn ×nl producing a new ciphertext c ∈ Zn so as to decrypt   c to the same x with S . In specific, this key switching task can be divided two   steps: M ← VHE.KSM (S , S ) and c ← VHE.KS (M,c). Furthermore, as inferred by [17], for the plaintext x , the ciphertext c, and the key-switching matrix M , the following equation holds. c = M (wx )∗ In addition, it is obvious that VHE supports the operation of the addition in ciphertexts domain as S (c 1 + c 2 + · · · + c n ) = w(x 1 + x 2 + · · · + x n ) + e. 2.5

Privacy-Preserving Inner Product

In this section, we present a new technique of computing the inner product of two vectors. For simplication, we can assume that there are two vectors x 1 and x 2 which are encrypted to c 1 and c 2 using the vector homomorphic encryption of VHE. The challenge is how to calculate the inner product on ciphertext domain. To tackle the problem, a matrix H is essential to be calculated. By solving equation AM = I ∗ , we have a matrix A. Then we can get the matrix H from H = AT A. We can prove that c T H c = w2 x T x . Hence, we can calculate the inner product in ciphertex domain, and will later discuss the security of this method.

3

Proposed Protocol

In this section, we will propose the protocol for linear regression over encrypted items in outsourced environments using VHE. 3.1

Reformulating the Problem

In this section, we give a brief introduction about our problem again. We supposed that the data owner owns a database D that can be thought to be a big table of n records x 1 , x 2 , · · · , x n . The record x i = [xi1 · · · xim ] includes m attributes. Because the resources of the data owner is limited, so the data owner encrypts his database D record-wise, and then outsources the encrypted  database D to the cloud. After that, the service provider will apply the linear regression over encrypted data sets, and return back the results to the data owner. In this protocol, the service provider know nothing to the plaintext.

3.2 Linear Regression Over VHE

With this preparatory work in place, we first discuss the problem of regression over encrypted data. In order to make our protocol faster and simpler, we only consider the confidentiality of the data attributes. Suppose the dataset D = {(x1, y1), (x2, y2), · · · , (xn, yn)}, which is known only to the data owner, is encrypted to D′ = {(c1, y1), (c2, y2), · · · , (cn, yn)}. The relation between plaintext and ciphertext satisfies S ci = w xi + ei for i = 1, 2, · · · , n. When the service provider gets the encrypted data set D′ from the data owner, it applies the linear regression protocol over D′. The whole process is divided into three phases: Preparation, Regression, and BackResults.
– Preparation(D, λ). The security parameter λ and the data set D are taken as input. The data owner generates a secret key S and a key-switching matrix M, used for every record, which satisfies c = M(wx)∗, where c is the ciphertext of x. The data owner needs to calculate the key-switching matrix M only once and can then reuse it to encrypt every record x. Since key switching is the most expensive operation of VHE, reusing the same matrix M for all records saves a large amount of encryption overhead. Then, the data owner calculates the matrix H, which is used to define the loss function over encrypted data. Since wx = I∗(wx)∗ holds, the data owner solves the matrix equation AM = I∗ to obtain the matrix A, and finally gets the matrix H as H = AᵀA.

Finally, the data owner uploads the encrypted data set D′ and the matrix H to the service provider.

– Regression(D′, H). The service provider gets the encrypted data set D′ = {(c1, y1), (c2, y2), · · · , (cn, yn)} and the matrix H from the data owner and applies the regression algorithm, which consists of the following steps: (1) Generate a vector θ′ randomly and choose a threshold t. (2) Define the loss function over encrypted data as

J′(θ′) = (1/2n) Σ_{i=1}^{n} ( (1/w²) θ′ᵀ H ci − yi )².


(3) Update θ′ by the gradient descent method as below:

θ′_k = θ′_{k−1} − α ∂J′(θ′)/∂θ′,

where θ′_k is the value at the k-th iteration. (4) Repeat step (3) until the value of the loss function satisfies the condition |J′(θ′_k) − J′(θ′_{k−1})| < t.
– BackResults(θ′). From the Regression phase the cloud obtains the encrypted parameters θ′ and returns them to the data owner.
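A minimal sketch of the Regression phase is given below, assuming the service provider already holds the ciphertext matrix C (one ci per row), the labels yi and the matrix H; the function name, the fixed step size α and the threshold t are illustrative choices rather than part of the protocol specification.

```python
import numpy as np

def encrypted_gradient_descent(C, y, H, w, alpha=0.01, t=1e-6, max_iter=10000):
    """Gradient descent on J'(theta') = (1/2n) * sum(((1/w^2) theta'^T H c_i - y_i)^2)."""
    n, dim = C.shape                      # C: one ciphertext per row
    theta = np.zeros(dim)
    prev = np.inf
    for _ in range(max_iter):
        preds = (C @ H @ theta) / w**2    # (1/w^2) * theta'^T H c_i for every record i
        loss = np.mean((preds - y) ** 2) / 2
        if abs(prev - loss) < t:          # stopping rule |J'(k) - J'(k-1)| < t
            break
        theta -= alpha * (C @ H).T @ (preds - y) / (n * w**2)   # gradient step
        prev = loss
    return theta                          # encrypted parameters returned in BackResults
```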

4 Discussion

We have shown how to achieve a basic protocol for linear regression over encrypted data in an outsourced environment. In this section, we give a correctness analysis of our protocol and briefly explain how the encrypted results can be used.

4.1 Loss Function over Encrypted Data

In this section, we show the correctness of the loss function over encrypted data by verifying that the following equation holds:

J′(θ′) = (1/2n) Σ_{i=1}^{n} ((1/w²) θ′ᵀ H ci − yi)²
       = (1/2n) Σ_{i=1}^{n} ((1/w²) w² θᵀ xi − yi)²
       = (1/2n) Σ_{i=1}^{n} (θᵀ xi − yi)²
       = J(θ)

As we can see, the loss function on the encrypted data is equal to the loss function on the plaintext.

4.2 Encrypted Parameters

In this section, we discuss the relationship between the encrypted parameters θ′ and the encrypted data. First of all, we analyze the loss function on the plaintext, which is given by:

J(θ) = (1/2n) Σ_{i=1}^{n} (θᵀ xi − yi)²


Gradient descent is one of the most commonly used methods for solving for the model parameters θ = [θ1, θ2, . . . , θd]: by minimizing the loss function we obtain both its minimum value and the model parameters. The iterative update is

θ := θ − α ∂J(θ)/∂θ,

which, written componentwise, is

θj := θj − (α/n) Σ_{i=1}^{n} (θᵀ xi − yi) · xij,  for j = 1, . . . , d,

or, in vector form,

θ := θ − (α/n) Σ_{i=1}^{n} (θᵀ xi − yi) xi,

where α is the step size. Note that θ is a linear combination of the xi when the initial value is set to the zero vector. Linear combinations are supported by vector homomorphic encryption, and thus we can obtain the corresponding results in the encrypted domain.

5 Security Analysis

In this section, we give the security analysis for ESLR, focusing on the encrypted database D′ = {c1, c2, · · · , cn} and the matrix H. The honest-but-curious cloud server cannot threaten the privacy of the data owner, i.e., the cloud cannot recover the plaintext database D = {x1, x2, · · · , xn}. First of all, ci is the ciphertext of xi under the VHE encryption, for i = 1, 2, · · · , n. For convenience, we omit the subscripts and write c = VHE.E(x, S), where S is the secret key. Therefore, we can ensure the confidentiality of x as long as the encryption scheme VHE is secure and the secret key S is not known by the cloud. Of course, we may suppose that the secret key S is stored privately by the data owner, and thus the cloud cannot obtain it. Hence, we focus on the security of VHE. As shown in [17], the security of VHE reduces to the learning with errors (LWE) problem. It is well known that the LWE problem is as hard to solve as several worst-case lattice problems [14]. As a result, the intractability of LWE assures the security of VHE. However, in order to evaluate the distance of two ciphertext vectors, we introduce a special matrix H. It is natural to ask whether H may bring a certain


unknown privacy risk. For example, on the one hand, to calculate H, we first solve the equation I∗ = AM to obtain A and then compute H = AᵀA. On the other hand, according to VHE, for the ciphertext c and the plaintext x, c = M(wx)∗ holds. The cloud knows H and c. If the cloud combines the following equations, it seems that it could recover the plaintext x:

H = AᵀA
I∗ = AM
c = M(wx)∗

In the following, we show that this is not the case: the analysis demonstrates that the cloud still cannot recover the plaintext x from the ciphertext c by exploiting H. Indeed, for any random orthogonal matrix Q satisfying QᵀQ = I, where I is the identity matrix, we have

(QA)ᵀ(QA) = AᵀQᵀQA = AᵀIA = AᵀA = H.

It is clear that the equation H = AᵀA has infinitely many solutions for A, since Q can be chosen arbitrarily. Therefore, the cloud cannot extract the matrix A from the matrix H. Furthermore, without knowing A, the cloud cannot obtain M, and hence cannot recover the plaintext x from the ciphertext c. As a result, we achieve the privacy of the database D.
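This argument can be checked numerically; the small NumPy snippet below (with arbitrary toy dimensions) shows that a randomly chosen orthogonal Q produces a different matrix QA with exactly the same product AᵀA, so H alone does not determine A.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
H = A.T @ A

# Any orthogonal Q yields another matrix A' = Q A with the same product A'^T A' = H.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A_prime = Q @ A
assert np.allclose(A_prime.T @ A_prime, H)
assert not np.allclose(A_prime, A)   # yet A' differs from A
```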

6 Performance Evaluation

In this section, we evaluate the proposed linear regression protocol. Our data sets come from the UCI repository [1], and the experimental environment includes a data owner and a service provider. Python is used on a Windows 10 machine with an i3-4130 CPU @1.40 GHz and 4 GB RAM for the user, and the server is a Linux machine with an Intel Xeon E5-2430 v2 CPU @2.5 GHz and 16 GB RAM running Ubuntu 14.04 LTS. The user acts as the data owner and data user, and the server acts as the service provider. In the following, we conduct simulation experiments in terms of time cost, accuracy, and communication overhead.

6.1 Time Cost and Accuracy

Firstly, we evaluate the time cost by comparing the running time between the plaintext and ciphertext domains. As illustrated in Fig. 2, we choose four data sets from the UCI repository to verify our protocol, and can see that the linear regression

Fig. 2. Comparison of running time between plaintext and ciphertext ((a) dataset 1, (b) dataset 2, (c) dataset 3, (d) dataset 4)

Fig. 3. Comparison between real results and predicted results in encrypted domain ((a) dataset 1, (b) dataset 2, (c) dataset 3, (d) dataset 4)


on ciphertext is a little slower than that on plaintext. However, the result is acceptable, and the data sets yield almost the same results in the plaintext and ciphertext domains. Then, we show the comparison of accuracy between the real results and the predicted results of the four different data sets in the ciphertext domain. As illustrated in Fig. 3, we can see that the predicted results almost coincide with the actual results in the ciphertext domain. Furthermore, we choose the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R Squared (R-S) as indexes of linear regression to evaluate our model. As seen in Table 2, compared to the results in the plaintext domain, our protocol achieves almost the same prediction performance. This shows that our model performs well in the ciphertext domain.

Table 2. Prediction accuracy in the plaintext and ciphertext domains
Data                    | MSE    | RMSE  | MAE   | R-S
Plaintext on dataset 1  | 9.964  | 3.156 | 2.427 | 0.838
Ciphertext on dataset 1 | 15.253 | 3.905 | 2.873 | 0.768
Plaintext on dataset 2  | 0.007  | 0.081 | 0.068 | 0.972
Ciphertext on dataset 2 | 0.025  | 0.157 | 0.133 | 0.895
Plaintext on dataset 3  | 23.999 | 4.899 | 3.790 | 0.497
Ciphertext on dataset 3 | 24.987 | 4.999 | 3.897 | 0.475
Plaintext on dataset 4  | 19.134 | 4.374 | 3.499 | 0.782
Ciphertext on dataset 4 | 20.134 | 4.564 | 3.619 | 0.768

6.2 Communication Cost

In this section, we discuss the communication cost of our protocol. In our protocol, the communication cost mainly comes from the ciphertexts and the matrix H, which is used to define the loss function. Firstly, for n records with m dimensions each, encrypting the data items produces O(m(n + 1)) communication traffic. Secondly, the matrix H generates O((n + 1)²) communication traffic. In total, the protocol therefore produces O((m + n + 1)(n + 1)) communication traffic in the encrypted domain. On the other hand, the complexity of the plaintext stage is O(mn) for the same data sets. In fact, m is often far greater than n because of the dimension disaster (curse of dimensionality) problem [11]. Hence, the communication traffic of the plaintext and ciphertext stages is almost the same when m is far greater than n and m is large enough.

7 Conclusion

In this paper we have proposed an efficient and secure linear regression protocol over encrypted data using vector homomorphic encryption. In particular, we have given a practical solution to the challenging problem of privacy-preserving gradient descent. Performance evaluation shows that the protocol has high accuracy and low computation and communication cost. Many machine learning algorithms are based on the gradient descent method; in future work, we will apply this method to other machine learning algorithms. Acknowledgement. Our work is supported by the National Key Research and Development Program of China (2017YFB0802003), the National Natural Science Foundation of China (U1633114) and the Sichuan Science and Technology Program (2018GZ0202).

References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
2. Ben-David, A., Nisan, N., Pinkas, B.: FairplayMP: a system for secure multi-party computation. In: Proceedings of the 15th ACM Conference on Computer and Communications Security, pp. 257–266. ACM (2008)
3. Dankar, F.K., El Emam, K.: The application of differential privacy to health data. In: Proceedings of the 2012 Joint EDBT/ICDT Workshops, pp. 158–166. ACM (2012)
4. Centers for Disease Control and Prevention, et al.: HIPAA privacy rule and public health. Guidance from CDC and the US Department of Health and Human Services. MMWR Morb. Mortal. Wkly. Rep. 52(Suppl. 1), 1–17 (2003)
5. Du, W., Atallah, M.J.: Secure multi-party computation problems and their applications: a review and open problems. In: Proceedings of the 2001 Workshop on New Security Paradigms, pp. 13–22. ACM (2001)
6. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1
7. Dwork, C., Roth, A., et al.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
8. Fletcher, R., Powell, M.J.: A rapidly convergent descent method for minimization. Comput. J. 6(2), 163–168 (1963)
9. Goldreich, O.: Secure multi-party computation. Manuscript. Preliminary version, pp. 86–97 (1998)
10. Halevi, S., Shoup, V.: HElib (2014). https://github.com/shaih/HElib
11. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
12. Lee, L.M., Gostin, L.O.: Ethical collection, storage, and use of public health data: a proposal for a national privacy protection. JAMA 302(1), 82–84 (2009)
13. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2007, pp. 94–103. IEEE (2007)


14. Regev, O.: On lattices, learning with errors, random linear codes, and cryptography. J. ACM 56(6), 1–40 (2009)
15. van Dijk, M., Gentry, C., Halevi, S., Vaikuntanathan, V.: Fully homomorphic encryption over the integers. In: Gilbert, H. (ed.) EUROCRYPT 2010. LNCS, vol. 6110, pp. 24–43. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13190-5_2
16. Wold, S., Ruhe, A., Wold, H., Dunn III, W.: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5(3), 735–743 (1984)
17. Zhou, H., Wornell, G.: Efficient homomorphic encryption on integer vectors and its applications. In: Information Theory and Applications Workshop (ITA) 2014, pp. 1–9. IEEE (2014)

New Multi-objectives Scheduling Strategies in Docker SwarmKit
Tarek Menouer, Christophe Cérin(B), and Étienne Leclercq
{tarek.menouer,christophe.cerin,etienne.leclercq}@lipn.univ-paris13.fr
University of Paris 13, Sorbonne Paris Cité, LIPN/CNRS UMR 7030, 93430 Villetaneuse, France

Abstract. This paper presents new multi-objectives scheduling strategies implemented in Docker SwarmKit. Docker SwarmKit is a container toolkit for orchestrating distributed systems at any scale. Currently, Docker SwarmKit has one scheduling strategy, called Spread. Spread relies on a single objective to select, from a set of cloud nodes, one node to execute a container. However, the containers submitted by users to be scheduled in Docker SwarmKit are configured according to multi-objectives criteria, such as the number of CPUs and the memory size. To better address the multi-objectives configuration of containers, we introduce the concept and the implementation of new multi-objectives scheduling strategies adapted to Cloud Computing environments and implemented in Docker SwarmKit. The principle of our multi-objectives strategies consists in selecting a node which offers a good compromise between multi-objectives criteria to execute a container. The proposed scheduling strategies are based on a combination of the PROMETHEE and Kung multi-objectives decision algorithms in order to place containers. The implementation in Docker SwarmKit and the experiments with our new strategies demonstrate the potential of our approach under different scenarios. Keywords: Systems software · Scheduling and resource management · Container technology · Cloud computing · Application of parallel and distributed algorithms

1 Introduction

Nowadays, cloud computing is the commercial subscription to external services. Its principle is based on a pay-for-use model that can cover different elements such as the requested application, data storage capacity, memory processing and number of users. Different forms of cloud computational resources exist, such as virtual machines (VMs), containers, or bare-metal resources, each having their own characteristics. Container technology is relatively new in production systems but it is not a new concept, and it has grown increasingly popular in cloud environments.


Docker SwarmKit [25] is a toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, container scheduling and more. In the container context, it selects the first Docker container that must be executed using the classical FIFO (First In First Out) strategy. Then, it chooses the appropriate cloud node from a set of nodes using the Spread scheduling strategy. The principle of Spread is to execute a container on the node having the least number of containers. Spread is a mono-objective scheduling strategy. However, the containers scheduled by Docker SwarmKit are configured with respect to multi-objectives criteria, like the number of used CPUs and the size of used memory. To take this multi-objectives configuration into consideration, we present in this paper new multi-objectives scheduling strategies implemented in Docker SwarmKit. The goal is to address the problem of companies that manage a private infrastructure of nodes, i.e. a cloud platform, and would like to optimize the scheduling of several containers submitted online by users. In this paper, for the sake of simplicity, each container is scheduled by taking into consideration two criteria: (i) the number of CPUs and (ii) the memory size. Indeed, the overall motivation for such multi-objectives scheduling strategies comes from the industrial Fonds Unique Interministériel (FUI-22) Wolphin¹ project, a collaborative industrial project oriented towards the themes of orchestration and optimization of the execution of containers. Ultimately, the project, supervised by Alterway², aims to provide an efficient solution for hypervision and invoicing of container-oriented infrastructures. In fact, in the Wolphin project the Alterway company would like to improve the Docker SwarmKit scheduler to optimize the scheduling of containers submitted online by users. AlterWay would like to reduce the cost of the infrastructure by choosing the most appropriate nodes so as to maximize the number of executed containers during a time period. Each container should be executed on a node with a good compromise between its availability in number of CPU cores and its free memory. This paper demonstrates that we have room for improvements in the Docker SwarmKit toolkit. We propose to particularize two multi-objectives decision algorithms called PROMETHEE and Kung. These algorithms are used to select, for each submitted container, the node that must execute it, according to a "good" compromise between multiple criteria. This is a first step towards high-dimensional decision support for scheduling containers inside the concrete Docker SwarmKit toolkit. The essence of cloud computing is precisely to be able to deal with the challenging problem of multiple objectives in heterogeneous and dynamic environments for the benefit of the user and/or the platform. The organization of the paper is as follows. Section 2 presents some related works. Section 3 describes our multi-objectives scheduling strategies based on the PROMETHEE and Kung algorithms. Section 4 shows a comparative example between the proposed multi-objectives scheduling strategies and the Spread

¹ https://www.alterway.fr/wolphin-2-0-laureat-du-fui-22/
² https://www.alterway.fr


strategy, which is the default SwarmKit scheduling strategy. Section 5 introduces extensive experiments that validate our strategies. Finally, a conclusion and some future works are given in Sect. 6.

2 Related Work

In the literature, many problems of resource allocation, or placement of users' containers or requests, refer to the same class of scheduling problems. They generally consist in associating a user's container with one or several computing cores on a particular node. Most of these problems are NP-hard [20]. In this general context, we present in the forthcoming subsection several proposed scheduling systems and computing frameworks. We also present, in Subsect. 2.2, some multi-objectives studies proposed in the literature. Subsection 2.3 briefly discusses machine learning techniques for large-scale multi-objectives optimization. Then, we conclude this section with a positioning in Subsect. 2.4.

2.1 Containers Scheduling and Cloud Computing

In the literature, some frameworks have been proposed to schedule containers in cloud computing [5,17,23,24]. To position our work from an industrial point of view, we document, as examples of concrete projects, the schedulers inside Google Kubernetes [24], Docker SwarmKit [25] and Apache Mesos [23]. Google Kubernetes [24] is a scheduling framework which provides an orchestration system for Docker containers based on the pod concept. Pods are groups of one or more containers, such as Docker containers. They are always co-located, co-scheduled and run in a shared context, and they run on the same physical or virtual machine (node). The principle of Google Kubernetes scheduling can be summarized in two steps. The first step filters out the nodes that do not meet certain requirements of the pod. The second step ranks the remaining nodes using priorities to find the best fit to execute the pod. A priority is a key/value pair representing the name of the priority, from the list of existing ones, and its weight. For each remaining node, a priority function gives a score ranging from 0 to 10. Each priority function is weighted by a positive number, and the final score of each node is calculated by adding up all the weighted scores. When the scores of all nodes have been calculated, Google Kubernetes chooses the node with the highest score to run the container. Docker SwarmKit [25] is an important container scheduling framework developed by Docker. It follows two steps to choose which node will execute a container. First, it uses filters to select the suitable nodes to execute the container according to the number of waiting CPU cores and the free memory. Then, it chooses, according to the Spread scheduling strategy, the most suitable node to execute the selected container. The principle of the Spread strategy is to execute a container on the node having the least number of containers. The goal of


Spread is to give a "good" load balancing of containers between all nodes of the infrastructure. The Mesos system [23], for example, delegates control over scheduling to the frameworks, because many frameworks already implement sophisticated scheduling [9]. The Apache Mesos [23] framework has native Docker support which offers many features in terms of scheduling, such as constraints, service discovery and load balancing [9]. It is based on four elements to schedule containers on the cluster. Zookeeper, for example, helps Marathon to find the address of the Mesos master. Marathon starts, monitors and scales the containers. The Mesos master sends the tasks assigned to a node and informs Marathon if a node has some free resources. The Mesos slaves represent the set of nodes used to execute containers. There also exist some studies related to resource management, such as those presented in [5,11,15]. Choi et al. [5] propose a framework which provides useful resource management functions and, more importantly, makes it possible to apply customized scheduling in a local environment. By using this framework, cloud providers or researchers can optimize resources for their purposes. Jimenez et al. [11] introduce a resource monitoring agent for the resource management of container environments. The advantage of their approach is that it allows the monitor to assign the resources of each container through the proposed agent. Medel et al. [15] innovate with a client-side scheduling approach in Kubernetes that aims to reduce the resource contention phenomenon in container technologies. The principle of the authors' approach is to make use of application characterization in terms of resource usage, and to extend the Kubernetes scheduler so that it can take better allocation decisions on containers based on such characterization. The application characterization consists in dividing applications into two categories, namely high and low usage of resources. The classification of applications is delegated to the client or developer, who provides the category that fits the application best.
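As an illustration of the two-step filter-and-score flow shared by the schedulers above, the following is a schematic Python sketch (not the actual Kubernetes or SwarmKit code); the priority function and weights are placeholders of our own.

```python
def schedule(pod, nodes, priorities):
    """Generic filter-and-score scheduling: drop infeasible nodes, then pick the
    node with the highest weighted score (schematic, not a real scheduler)."""
    feasible = [n for n in nodes if n["cpu"] >= pod["cpu"] and n["mem"] >= pod["mem"]]
    if not feasible:
        return None
    def score(node):
        # Each priority function returns a score in [0, 10]; scores are weighted and summed.
        return sum(weight * fn(node, pod) for fn, weight in priorities)
    return max(feasible, key=score)

# Placeholder priority: favour nodes with more free CPU, scaled to 0..10.
least_cpu_used = lambda node, pod: 10.0 * node["cpu"] / node["capacity_cpu"]
nodes = [{"cpu": 8, "mem": 32, "capacity_cpu": 16}, {"cpu": 2, "mem": 8, "capacity_cpu": 16}]
print(schedule({"cpu": 2, "mem": 4}, nodes, [(least_cpu_used, 1.0)]))
```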

2.2 Short Overview of Multi-objectives Related Problems

Combinatorial and discrete optimization problems such as routing, task allocation, and scheduling are important optimization applications in the real world. Traditionally, the time required to solve a combinatorial problem may increase exponentially in the worst case, thereby making such problems computationally too costly. Moreover, if the optimization involves multiple objectives, the process becomes more complex and difficult to solve [22]. Xing et al. [21] present a simulation model to solve a multi-objectives Flexible Job-Shop Scheduling Problem (FJSSP). The FJSSP is very important in the fields of combinatorial optimization and production management. Through their experiments, the authors showed that multi-objectives evolutionary algorithms are very effective for solving the FJSSP.


Knowles et al. [12] propose a Pareto archived evolution strategy to solve multi-objectives optimization problems. The algorithm introduces a Pareto ranking-based selection method and couples it with a partition scheme in objective space. It uses two different archives to save non-dominated solutions. Chang et al. [4] proposed a new algorithm, called the sub-population genetic algorithm II, to solve multi-objectives combinatorial problems. The algorithm develops a mechanism to exchange information among sub-populations. Once a sub-population reaches a better non-dominated solution, the other sub-populations apply it directly in their search space. In this way, all individuals in the same population are guided to search toward the true Pareto front.

2.3 Multi-dimensional Search

Machine learning algorithms for large-scale multi-objectives optimization may also be considered as techniques to accelerate the search for solutions in multi-dimensional spaces. We assume that solving large-scale multi-objectives scheduling problems on large-scale systems remains challenging. Such general techniques from the field of machine learning are surrogate meta-models, multi-armed bandits [14], landscape analysis [6] and online/offline automatic algorithm selection and configuration [2]. Our work may be considered as practical work to investigate the limits of known multi-objectives optimization techniques for solving concrete problems inside the popular Docker SwarmKit. Once the limits are isolated and understood, we can better choose, in the future, another appropriate technique for multi-dimensional spaces.

2.4 Positioning

To the best of our knowledge, all of the studies proposed previously in the context of scheduling in the cloud use a mono-objective strategy to select the node which executes a container. The novelty of this paper is to improve the Docker SwarmKit scheduling system with new multi-objectives strategies that select, for each submitted container, a node that will execute it. Indeed, this paper is an extension of a preliminary paper presented in [3], whose context was a naive mono-objective scheduling strategy implemented in Docker Swarm.

3 Multi-objectives Scheduling Strategies

In the following, we start by presenting the PROMETHEE scheduling strategy and then present the Kung scheduling strategy. After introducing each multi-objectives scheduling strategy, we give an illustrative example that explains how it operates.

3.1 PROMETHEE Scheduling Strategy

The first proposed scheduling strategy is based on the PROMETHEE II (Preference Ranking Organization METHod for Enrichment Evaluations) algorithm [18]. PROMETHEE II is a multi-objectives decision algorithm that builds an outranking between different alternatives [18]. It is used here because it allows us to choose a node that will execute a container with a "good" compromise between (i) the number of waiting CPUs and (ii) the unused memory space. Indeed, PROMETHEE II has been used successfully to solve many problems [1]. In our case, it is based on a pairwise comparison of the possible decisions (nodes) along the number of waiting CPUs and free memory size criteria. Each criterion can be evaluated according to two functions (minimization or maximization). The use of the PROMETHEE II algorithm requires two pieces of information for each criterion: a weight and a preference function. In our context, the weight of every criterion is the same and equal to 1. The preference function translates the difference between the evaluations obtained by two possible nodes on a criterion into a preference degree ranging from 0 to 1. In [10], six basic preference functions have been proposed; in this work, for the sake of simplicity, we use the usual preference function. To summarize, the PROMETHEE II algorithm is composed of four steps [19] and is used as follows:

1. Compute, for each pair of possible nodes (node_a and node_b) and for each criterion (number of waiting CPUs or free memory size), the value of the preference degree. Let g_j(node_a) be the value of criterion j for node_a. We note d_j(node_a, node_b) = g_j(node_a) − g_j(node_b) the difference of the values of criterion j for node_a and node_b, and P_j(node_a, node_b) the preference degree of criterion j for node_a and node_b. The preference function used in this paper to compute these preference degrees is defined as

   P_j(d_j) = 0 if d_j ≤ 0,   P_j(d_j) = 1 if d_j > 0.

2. Compute, for each pair of possible nodes, a global preference index. Let C be the set of considered objective criteria (number of waiting CPUs and free memory size) and w_j the weight associated with criterion j. The global preference index for a pair of possible nodes node_a and node_b is computed as

   π(node_a, node_b) = Σ_{j∈C} w_j × P_j(node_a, node_b).

3. Compute, for each possible node, the positive outranking flow φ+(node_a) and the negative outranking flow φ−(node_a). Let A be the set of nodes, of size n. The positive and negative outranking flows of a node are computed as

   φ+(node_a) = (1/(n−1)) Σ_{x∈A} π(node_a, x)

   and

   φ−(node_a) = (1/(n−1)) Σ_{x∈A} π(x, node_a).

4. Compute the net outranking flow to establish a complete ranking between the nodes. The ranking is based on the net outranking flow φ(node_a), computed as φ(node_a) = φ+(node_a) − φ−(node_a).

In our work, since both criteria are minimized, the first node returned by PROMETHEE II is the node with the smallest net outranking flow.

Example of How the PROMETHEE Scheduling Strategy Works: Assume that at time t0 we have a container Cx which needs 8 CPUs and 8 GB of memory. We also assume that, among all nodes of the infrastructure, only four nodes (node_a, node_b, node_c and node_d) can execute Cx. The availability of each node in terms of number of waiting CPUs and free memory size is presented in Table 1.

Table 1. Nodes configurations in terms of waiting CPUs and free memory size
Node | Number of waiting CPUs | Memory size (GB)
na   | 10 | 10
nb   | 20 | 40
nc   | 30 | 40
nd   | 40 | 50

As explained before, to select the first node that must execute the container Cx using the PROMETHEE scheduling strategy (with a minimization function on all multi-objectives criteria), we start by computing, for each pair of nodes, the difference values d(node_i, node_j) of the multi-objectives criteria and the corresponding preference degrees P(node_i, node_j). Then, the system calculates the global preference index π(node_i, node_j). For example, for the first pair of nodes (node_a, node_b), the difference value of the waiting-CPUs criterion is d(node_a, node_b) = 10 − 20 = −10. In this case the difference value is negative, so, using our usual preference function, the preference degree equals 0. As the weight of all criteria is the same and equal to 1 in our work, the global preference index of this first pair of nodes is π(node_a, node_b) = 1 × 0 + 1 × 0 = 0. Finally, to obtain the ranking of the nodes and select the node which will execute the container, our strategy calculates the positive and negative outranking flows and the net outranking flow. Table 3 shows how our strategy calculates these different parameters. For example, for node_a, the positive outranking flow (φ+) is (1/2)(1 + 1) = 1, the negative outranking flow (φ−) is (1/2)(2 + 1) = 1.5, and the net outranking flow φ = φ+ − φ− is 1 − 1.5 = −0.5. Using the PROMETHEE strategy, node_a is the first selected node, with the minimum net outranking flow.


Table 2. Computing the difference values, preference degrees and preference index value for each pair of nodes
Pair of nodes | Difference (CPUs) | Difference (Memory) | Pref. degree (CPUs) | Pref. degree (Memory) | Weight (CPUs) | Weight (Memory) | Preference index value
d(na, nb) | −10 | −30 | 0 | 0 | 1 | 1 | 0
d(na, nc) | −20 | −30 | 0 | 0 | 1 | 1 | 0
d(na, nd) | −30 | −40 | 0 | 0 | 1 | 1 | 0
d(nb, na) |  10 |  30 | 1 | 1 | 1 | 1 | 2
d(nb, nc) | −10 |   0 | 0 | 0 | 1 | 1 | 0
d(nb, nd) | −20 | −10 | 0 | 0 | 1 | 1 | 0
d(nc, na) |  20 |  30 | 1 | 1 | 1 | 1 | 2
d(nc, nb) |  10 |   0 | 1 | 0 | 1 | 1 | 1
d(nc, nd) | −10 | −10 | 0 | 0 | 1 | 1 | 0
d(nd, na) |  30 |  40 | 1 | 1 | 1 | 1 | 2
d(nd, nb) |  20 |  10 | 1 | 1 | 1 | 1 | 2
d(nd, nc) |  10 |  10 | 1 | 1 | 1 | 1 | 2

Table 3. Computation of the net outranking flow for each node
Node | φ+  | φ−  | φ    | Rank
na   | 0   | 3   | −3   | 1
nb   | 1   | 1.5 | −0.5 | 2
nc   | 1.5 | 1   | 0.5  | 3
nd   | 3   | 0   | 3    | 4
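As a companion to this example, the following is a minimal NumPy sketch of the four steps with the usual preference function and unit weights; it is our own illustrative implementation, not the code inside Docker SwarmKit, and on the Table 1 data it ranks node_a first (the intermediate flow values follow the formulas above).

```python
import numpy as np

def promethee_rank(values, minimize=True):
    """Rank alternatives (rows) over criteria (columns) with PROMETHEE II,
    usual preference function and unit weights."""
    n = len(values)
    diff = values[:, None, :] - values[None, :, :]   # d_j(a, b) for every pair
    pref = (diff > 0).astype(float)                  # usual preference function P_j
    pi = pref.sum(axis=2)                            # global preference index (weights = 1)
    phi_plus = pi.sum(axis=1) / (n - 1)              # positive outranking flow
    phi_minus = pi.sum(axis=0) / (n - 1)             # negative outranking flow
    phi = phi_plus - phi_minus                       # net outranking flow
    # With minimization criteria, the node with the smallest net flow comes first.
    order = np.argsort(phi) if minimize else np.argsort(-phi)
    return order, phi

# Nodes of Table 1: (waiting CPUs, free memory in GB) for node_a .. node_d.
nodes = np.array([[10, 10], [20, 40], [30, 40], [40, 50]], dtype=float)
order, phi = promethee_rank(nodes)
print(order, phi)   # node_a (index 0) is ranked first
```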

3.2 Kung Scheduling Strategy

The second multi-objectives scheduling strategy is based on the Kung algorithm [13]. It is among the best algorithms used in the multi-objectives criteria context [7]. As presented in [7], the Kung algorithm first sorts the population (the nodes that can execute a container) according to the first criterion (number of waiting CPUs). Thereafter, the set of nodes is recursively halved into a Top half (T) and a Bottom half (B) subset of nodes. As T is better than B on the first objective (number of waiting CPUs), we check B for domination against T. The solutions of B which are not dominated by solutions of T are merged with the members of T to form a merged set of nodes M. In our context, we use a minimization function, which means that a solution x1 is better than another solution x2 if the value of x1 is smaller than the value of x2. The complete algorithm can be summarized in two steps:
– Sort the nodes in descending order of preference on the number of waiting CPUs criterion and rename the population P, of size N.


– Front(P): if |P| = 1, return P as the output of Front(P). Otherwise, set T = Front(P^1 … P^{|P|/2}) and B = Front(P^{|P|/2+1} … P^{|P|}). If the i-th non-dominated solution of B is not dominated by any non-dominated solution of T, create the merged set M = T ∪ {i}. Finally, return M as the output of Front(P).
We say that a solution x1 dominates another solution x2 if two conditions are satisfied:
1. Solution x1 is no worse than x2 in all multi-objectives criteria;
2. Solution x1 is strictly better than x2 in at least one objective criterion.
If a solution x1 dominates another solution x2, then the solution x2 is dominated by the solution x1. In our context, the goal of the Kung algorithm is to select a set of nodes with a "good" compromise between the availability of CPU cores and the free memory. Then, our strategy returns the first node that can execute the container from the set of nodes returned by the Kung strategy.

Fig. 1. Example of Kung strategy (Color figure online)

Example of How the Kung Scheduling Strategy Works: Assume that at time t0 we have a container Cx which needs 8 CPUs and 8 GB of memory. We also assume that, among all nodes of the infrastructure, only four nodes (node_a, node_b, node_c and node_d) can execute Cx. The availability of each node in terms of number of waiting CPUs and memory size is presented in Table 1 (the same table as presented previously in Sect. 3.1). As explained before, to select the first node that must execute the container Cx using the Kung strategy, we start by ordering all nodes according to the value of the waiting-CPUs criterion. Then, the set of nodes is recursively halved into Top (T) and Bottom (B) subsets of nodes, as shown in red in Fig. 1. After applying the second step of the Kung algorithm, as shown in blue in Fig. 1, the selected node is node_a. We note that the Kung and PROMETHEE scheduling strategies give the same result and both choose node_a to execute the container Cx.
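A minimal Python sketch of the Kung-based selection is shown below; it is our own illustrative implementation, and it assumes the population is sorted best-first on the waiting-CPUs criterion (ascending values, since both criteria are minimized). On the Table 1 data the non-dominated front reduces to node_a.

```python
import numpy as np

def dominates(a, b):
    """a dominates b (minimization): no worse in all criteria, better in at least one."""
    return np.all(a <= b) and np.any(a < b)

def front(P):
    """Kung's recursive algorithm: return the non-dominated subset of P."""
    if len(P) == 1:
        return P
    half = len(P) // 2
    T = front(P[:half])               # top half (better on the first, sorted criterion)
    B = front(P[half:])
    M = list(T)
    for b in B:                       # keep B's solutions not dominated by any of T
        if not any(dominates(t, b) for t in T):
            M.append(b)
    return M

# Nodes of Table 1: (waiting CPUs, free memory in GB), sorted best-first on the
# first criterion before the recursion, as assumed above.
nodes = np.array([[10, 10], [20, 40], [30, 40], [40, 50]], dtype=float)
P = nodes[np.argsort(nodes[:, 0])]
print(front(list(P)))                 # -> [array([10., 10.])], i.e. node_a
```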

4 Comparative Example Between Scheduling Strategies

Figures 2 and 3 show a comparison between the scheduling of three containers with the Spread strategy (Fig. 2) and with a multi-objectives strategy (PROMETHEE or Kung) (Fig. 3).

Fig. 2. Scheduling with Spread strategy

Fig. 3. Scheduling with multi-objectives strategy

The principle of the Spread strategy is to execute a container on the node having the least number of containers. For example, with n nodes, Spread selects nodes in the following order: node_{i%n}, node_{(i+1)%n}, node_{(i+2)%n}, · · · . In this comparison we suppose that we have 2 nodes with the same configuration (24 waiting CPUs and 90 GB of memory). We also suppose that we have 3 containers with the following configurations:
– Container 1: 16 CPUs and 60 GB of memory;
– Container 2: 8 CPUs and 30 GB of memory;
– Container 3: 24 CPUs and 90 GB of memory.
In Fig. 2, container 1 is executed on node_a and container 2 is executed on node_b. When container 3 arrives, with the Spread strategy it cannot be executed because no node has enough resources to execute it. However, in Fig. 3, the first container is executed on node_a. After that, the PROMETHEE and Kung strategies select node_a to execute container 2. When container 3 arrives, it is directly executed on node_b.
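The example can be replayed with the small sketch below, where Spread picks the feasible node with the fewest running containers and the multi-objectives strategies are approximated by picking the feasible node with the least free resources (which is what the PROMETHEE and Kung rankings select in this example); the node and container sizes are those of the example, everything else is our own simplification.

```python
def schedule(containers, pick):
    """Place containers one by one; return the chosen node index (or None) per container."""
    free = [[24, 90], [24, 90]]                    # two identical nodes (CPUs, GB)
    counts = [0, 0]                                # running containers per node
    plan = []
    for cpu, mem in containers:
        feasible = [i for i, (c, m) in enumerate(free) if c >= cpu and m >= mem]
        if not feasible:
            plan.append(None)                      # container cannot be executed
            continue
        i = pick(feasible, free, counts)
        free[i][0] -= cpu; free[i][1] -= mem; counts[i] += 1
        plan.append(i)
    return plan

containers = [(16, 60), (8, 30), (24, 90)]
spread = lambda feas, free, counts: min(feas, key=lambda i: counts[i])
# Approximation of the multi-objectives strategies: feasible node with the least free resources.
multi = lambda feas, free, counts: min(feas, key=lambda i: (free[i][0], free[i][1]))
print(schedule(containers, spread))   # [0, 1, None] -> container 3 rejected (Fig. 2)
print(schedule(containers, multi))    # [0, 0, 1]    -> all containers placed (Fig. 3)
```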

5 Experimental Evaluation

In this section we present experiments with our multi-objectives scheduling strategies implemented in Docker SwarmKit, to check whether they meet our expectations. These experiments are run on the Grid5000 platform [8], an experimental large-scale testbed for distributed computing in France. For our experimental evaluation, we reserved an infrastructure composed of a total of 128 computing cores, distributed over 4 nodes (Intel Xeon CPU); each node contains 32 cores and 130 GB of memory. The following experimental evaluation is performed with the submission of 18 containers, each with an execution time equal to 3 minutes. Each container is submitted by one of the following three users, and each user has a particular container configuration:
– User 1: for each container, he needs 30 CPUs and 120 GB of memory,
– User 2: for each container, he needs 20 CPUs and 80 GB of memory,
– User 3: for each container, he needs 10 CPUs and 40 GB of memory.
The evaluation of our multi-objectives scheduling strategies is based on two container submission types: (i) containers submitted at the same time, i.e. each user submits 6 containers at the same time; and (ii) containers submitted online with a fixed frequency equal to 1 minute, i.e. every minute, 3 containers are submitted by the 3 different users. The first type of experiment, with submission at the same time, stresses the scheduling system. The second type of experiment, with online submission, represents a "normal" operating mode.

5.1 Distribution of Containers in Different Nodes

In this subsection we present the distribution of containers in our 4 nodes according to the submission type and the three scheduling strategies: (i) Spread; (ii) PROMETHEE; and (iii) Kung. Containers Submitted at the Same Time: Figures 4, 5 and 6 show the distribution of containers submitted at the same time in 4 nodes using Spread strategy (Fig. 4), PROMETHEE strategy (Fig. 5) and Kung strategy (Fig. 6).

Fig. 4. Distribution of containers submitted at the same time in 4 nodes using Spread strategy


Fig. 5. Distribution of containers submitted at the same time in 4 nodes using PROMETHEE strategy

Fig. 6. Distribution of containers submitted at the same time in 4 nodes using Kung strategy

Using the PROMETHEE and Kung strategies, we note that the container load on each used node is higher than the container load obtained with the Spread strategy. This is a good property of our implementation, as expected.

Fig. 7. Distribution of containers submitted online in 4 nodes using Spread strategy

Containers Submitted Online: Figures 7, 8 and 9 show the distribution of containers submitted online in 4 nodes using the Spread strategy (Fig. 7), the PROMETHEE strategy (Fig. 8) and the Kung strategy (Fig. 9). We can make the same remark as for the previous experiment, i.e. the container load with PROMETHEE and Kung is higher than the container load with the Spread strategy.

5.2 Comparison of Performance

In this subsection we compare the performance of 3 scheduling strategies (Spread, PROMETHEE and Kung) in 4 nodes according to the submission type. Table 4 shows a comparison of running time between Spread, PROMETHEE and Kung scheduling strategies according to the submission type. We note that the running time obtained with Spread strategy is always the longest. However, the running time of PROMETHEE and Kung strategies is almost the same.


Fig. 8. Distribution of containers submitted online in 4 nodes using PROMETHEE strategy


Fig. 9. Distribution of containers submitted online in 4 nodes using Kung strategy

Table 4. Comparison of performance between the three scheduling strategies
Scheduling strategy | Submission at the same time | Online submission
Spread      | 747.49 (s) | 747.29 (s)
PROMETHEE   | 558.62 (s) | 622.26 (s)
Kung        | 559.11 (s) | 621.46 (s)

We note also that sometimes the PROMETHEE running time is better (submission at the same time), and sometimes the Kung running time is better (submission online).

6 Conclusion

We have presented, in this paper, new multi-objectives scheduling strategies for Docker SwarmKit. Our new scheduling strategies are based on the PROMETHEE and Kung multi-objectives algorithms. The principle of our strategies is to select, from a set of nodes, a node to execute a container by taking into consideration multi-objectives criteria: (i) the number of waiting CPUs and (ii) the free memory size. The goal is to execute a container on a node which offers a good compromise between the availability of CPU cores and the free memory size. Currently, Docker SwarmKit uses a simple FIFO (First In First Out) strategy to select the first container to be executed from a set of containers saved in a queue. As a perspective, we propose to use the same principle as our multi-objectives strategies to select the first container to be executed from a queue of containers. We have previously presented in [16] a new scheduling and resource management system based on an economic model. To choose the node that must execute a request, the system presented in [16] uses the Bin Packing strategy. As another perspective, we propose to use our new multi-objectives scheduling


strategies in the system proposed in [16] and to compare the performance of the Bin Packing and multi-objectives scheduling strategies. Acknowledgments. This work is funded by the French Fonds Unique Ministériel (FUI) Wolphin Project. We thank the Grid5000 team for their help in using the testbed.

References
1. Behzadian, M., Kazemzadeh, R., Albadvi, A., Aghdasi, M.: PROMETHEE: a comprehensive literature review on methodologies and applications. Eur. J. Oper. Res. 200(1), 198–215 (2010)
2. Cáceres, L.P., Pagnozzi, F., Franzin, A., Stützle, T.: Automatic configuration of GCC using irace. In: Lutton, E., Legrand, P., Parrend, P., Monmarché, N., Schoenauer, M. (eds.) EA 2017. LNCS, vol. 10764, pp. 202–216. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78133-4_15
3. Cérin, C., Ben-Abdaallah, W., Saad, W., Menouer, T.: A new Docker Swarm scheduling strategy. In: 7th International Symposium on Cloud and Service Computing, Kanazawa, Japan (2017)
4. Chang, P.-C., Chen, S.-H.: The development of a sub-population genetic algorithm II (SPGA II) for multi-objective combinatorial problems. Appl. Soft Comput. 9(1), 173–181 (2009)
5. Choi, S., Myung, R., Choi, H., Chung, K., Gil, J., Yu, H.: GPSF: general-purpose scheduling framework for container based on cloud environment. In: IEEE iThings and IEEE GreenCom and IEEE CPSCom and IEEE SmartData (2016)
6. Daolio, F., Liefooghe, A., Vérel, S., Aguirre, H.E., Tanaka, K.: Problem features versus algorithm performance on rugged multiobjective combinatorial fitness landscapes. Evol. Comput. 25(4), 555–585 (2017)
7. Ding, L., Zeng, S., Kang, L.: A fast algorithm on finding the non-dominated set in multi-objective optimization. In: The 2003 Congress on Evolutionary Computation, CEC 2003, vol. 4, pp. 2565–2571, December 2003
8. Grid5000: https://www.grid5000.fr/
9. Grillet, A.: Comparison of containers schedulers. Medium (2016)
10. Brans, J.-P., Mareschal, B.: PROMETHEE methods. In: Multiple Criteria Decision Analysis: State of the Art Surveys. International Series in Operations Research & Management Science, vol. 78 (2005)
11. Jimenez, L.L., Simon, M.G., Schelén, O., Kristiansson, J., Synnes, K., Åhlund, C.: CoMA: resource monitoring of Docker containers. In: Proceedings of the 5th International Conference on Cloud Computing and Services Science (CLOSER 2015) (2015)
12. Knowles, J.D., Corne, D.W.: M-PAES: a memetic algorithm for multiobjective optimization. In: Proceedings of the 2000 Congress on Evolutionary Computation, CEC00 (Cat. No. 00TH8512), vol. 1, pp. 325–332 (2000)
13. Kung, H.T., Luccio, F., Preparata, F.P.: On finding the maxima of a set of vectors. J. ACM 22(4), 469–476 (1975)
14. Li, K., Fialho, Á., Kwong, S., Zhang, Q.: Adaptive operator selection with bandits for a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 18(1), 114–130 (2014)


15. Medel, V., Tolón, C., Arronategui, U., Tolosana-Calasanz, R., Bañares, J.Á., Rana, O.F.: Client-side scheduling based on application characterization on Kubernetes. In: Pham, C., Altmann, J., Bañares, J.Á. (eds.) GECON 2017. LNCS, vol. 10537, pp. 162–176. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68066-8_13
16. Menouer, T., Cérin, C.: Scheduling and resource management allocation system combined with an economic model. In: The 15th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2017) (2017)
17. Peinl, R., Holzschuher, F., Pfitzer, F.: Docker cluster management for the cloud - survey results and own solution. J. Grid Comput. 14(2), 265–282 (2016)
18. Deshmukh, S.C.: Preference ranking organization method of enrichment evaluation (PROMETHEE). Int. J. Eng. Sci. Inven. 2, 28–34 (2013)
19. Taillandier, P., Stinckwich, S.: Using the PROMETHEE multi-criteria decision making method to define new exploration strategies for rescue robots. In: International Symposium on Safety, Security, and Rescue Robotics (2011)
20. Ullman, J.: NP-complete scheduling problems. J. Comput. Syst. Sci. 10(3), 384–393 (1975)
21. Xing, L.-N., Chen, Y.-W., Yang, K.-W.: Multi-objective flexible job shop schedule: design and evaluation by simulation modeling. Appl. Soft Comput. 9(1), 362–376 (2009)
22. Zhou, A., Qu, B.-Y., Li, H., Zhao, S.-Z., Suganthan, P.N., Zhang, Q.: Multiobjective evolutionary algorithms: a survey of the state of the art. Swarm Evol. Comput. 1(1), 32–49 (2011)
23. The Apache Software Foundation: Apache Mesos. http://mesos.apache.org/
24. Kubernetes scheduler. https://kubernetes.io/
25. SwarmKit. https://github.com/docker/swarmkit/

Internet Performance Prediction Framework Based on PingER Dataset
Wei Zhang, Xiaofei Xing(&), Saqib Ali, and Guojun Wang
School of Computer Science and Technology, Guangzhou University, Guangzhou 510006, People's Republic of China
[email protected]

Abstract. The Internet performance directly affects the scalability, reliability and availability of online applications. A delay of a few milliseconds may cause companies to lose millions of dollars. Therefore, Internet measurements are carried out to capture the performance of Internet links worldwide. Most Internet performance monitoring frameworks are active in nature, i.e., they can only capture the real-time performance of the Internet links. Thus, these monitoring frameworks are unable to forecast the near-future performance of the Internet links in a region. Such estimates are quite critical for network administrators to carry out bandwidth-extensive experiments between different sites, for policy makers to suggest future upgrades to Internet infrastructures, or for streaming service providers to enhance the quality of service for their customers. Therefore, we analyze different machine learning algorithms, including Multiple Linear Regression, Random Forest, Gradient Boosting, and eXtreme Gradient Boosting, to predict the performance of Internet links using the PingER (Ping End-to-End Reporting) dataset for countries such as China, India and Japan. Our experimental results show that Multiple Linear Regression achieves better Internet performance prediction accuracy than the other methods. Our work can be utilized by Internet service providers, streaming service providers or policymakers for the design, deployment, and evaluation of next-generation Internet infrastructure.

Keywords: Multiple linear regression · Internet performance · Prediction · PingER

1 Introduction
Internet traffic is increasing every day. The Internet is used in a wide variety of applications including corporate, education, entertainment, news, games, and social networking. These applications place heavy demands on end-to-end link performance in terms of scalability, reliability, and speed. A delay of several hundred milliseconds may cause companies to lose millions of dollars, and may cause the game industry to lose a large number of users. For example, Singla [1] mentioned that a delay of 100 ms will cause Amazon to lose 1% of sales; a 500-ms delay in the search response will result in a 1.2% decrease in Bing's revenue; and a 250-ms delay can be enough for users to switch to a competitor. Reducing latency therefore improves the user experience. On the other hand, the performance of the Internet is also directly related to a country's key economic development


indicators. According to the World Bank, a 10% increase in Internet speed is associated with a 1.3% growth in a country's economy. Therefore, the performance of the Internet plays an important role in our daily lives [2]. Internet performance directly affects the reliability and availability of Internet links. Therefore, Internet measurements are carried out to capture the performance of links worldwide. The key Internet performance metrics include throughput, jitter, delay, packet loss, reachability, directivity, etc. Many Internet performance monitoring frameworks are offered in the literature, for example, SamKnows [3], BISmark [4], Dasu [5], Netradar [6], Portolan [7], RIPE Atlas [8], and perfSONAR [9], originally partially based on the PingER architecture [10]. These frameworks use different tools, i.e., ping, mtr, cron, ntp, dig, netstat, iperf, and traceroute, to mine the performance of the Internet links in real time. The findings of these frameworks are critical for Internet administrators and managers to fine-tune their infrastructures. Most of the above Internet performance monitoring platforms are active. They only capture the real-time performance of Internet software or hardware regarding congestion, bottleneck links, queue overflows, and errors [11]. However, these frameworks do not provide any information on the future performance of Internet links. Internet performance prediction is necessary for the optimization of resources for extensive bandwidth experiments conducted between research centers, laboratories, and universities. In addition, such prediction is also crucial for Internet managers, content service providers, and policy makers to make future decisions on upgrading the Internet infrastructure in a region. In this paper, since Internet performance is unstable and hard to predict, and traditional Internet performance analysis only analyzes the current performance parameters and forms a performance log as a basis for analyzing the Internet operating conditions, we focus on Internet performance prediction. We use historical Internet performance monitoring data from the PingER platform. First, we preprocess the data. Then, we use machine learning algorithms such as Multiple Linear Regression, Random Forest, Gradient Boosting, and XGBoost to build Internet performance prediction models. Finally, we use the Root Mean Square Error (RMSE), the error rate and other indicators to compare and analyze the prediction accuracy of the different models, and finally find a suitable prediction algorithm for the PingER Internet performance data. The remainder of the paper is organized as follows. Related work is discussed in Sect. 2. Sections 3 and 4 mainly introduce the PingER framework and data. The proposed approach for predicting Internet performance is explained in Sect. 5. Section 6 presents the results and discussion. Finally, Sect. 7 concludes the paper.

2 Related Work
Internet performance prediction is usually based on an observed sequence of measurements, so methods for Internet traffic prediction can also be used for Internet performance prediction. Currently, common prediction methods are Least Squares and Regression methods, including Auto-Regressive Moving Average (ARMA), Autoregressive Integrated Moving


Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), Time Series Seasonal Analysis (TSSA), etc. [12–17]. With the maturity of machine learning and data mining algorithms and their strong performance in various fields, many researchers have applied data mining methods to the prediction of Internet traffic in recent years. The study by Zheng [18] proposed an Internet traffic prediction model (ET-SVM) integrating an inclusion test and support vector machines: several single models are used to predict the Internet traffic, the merit of each model is measured by the Root Mean Square Error (RMSE) of its predictions, the appropriate single models are then selected by the inclusion test, and finally the predictions of the single models are combined by a Support Vector Machine to obtain the final network traffic prediction. The study by Chen [19] proposed an Internet traffic prediction model (ELM-LSSVM) that combines an extreme learning machine and a least squares support vector machine to improve the prediction accuracy of Internet traffic. The study by Hammami [20] gives a classification model based on a flow prediction algorithm. The study by Liu [21] applied a Back Propagation (BP) neural network to Internet traffic prediction. The study by Cui [22] used an Elman neural network instead of a BP neural network and achieved better results. The difference between the Elman and BP networks is that the Elman network feeds the output of the hidden layer back to the input layer as the input of the next step; thanks to this characteristic, the Elman neural network can better capture the dynamic characteristics of a time series and is thus better adapted to time-series prediction. In this paper, we use four algorithms to build prediction models for the Internet performance data, then use the Root Mean Square Error (RMSE) to judge the quality of the results, and finally select the optimal model for Internet performance prediction.

Random Forest Algorithm

The random forest algorithm was first proposed by Breiman and Cutler in 2001 [23]. It use the data sampled from the training set to construct the basic model, and the random forest samples all the attributes and extracts some of the attributes for input. Basic model. In order to reduce the generalization error, random forest algorithm has two layers of sampling, one layer is attribute sampling and the other is training set sampling. The specific algorithm of random forest is used in this paper as follows. This paper deals with a training set of 364 samples of one attribute, and trains K decision trees. The classification result with the most votes will be used as the output of the random forest. 2.2
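A minimal sketch (not the authors' code) of such a forest with scikit-learn; the synthetic data, the number of trees, and the `max_features` setting are illustrative assumptions that simply make the two sampling layers visible.

```python
# Random forest regressor with both bootstrap (training-set) sampling and
# attribute sampling per split, as described above. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(364, 5))                     # 364 samples, 5 lag features (illustrative)
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=364)

forest = RandomForestRegressor(
    n_estimators=100,                             # K trees
    bootstrap=True,                               # training-set sampling
    max_features="sqrt",                          # attribute sampling
    random_state=0,
).fit(X, y)
print(forest.predict(X[:3]))                      # averaged tree outputs
```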

2.2 Gradient Boost Algorithm

Gradient Boosting is a method of implementing Boosting [24, 25]. Its main idea is that each new base model is built along the negative gradient of the loss function of the model established so far. The loss function describes how badly the model fits: the greater the loss, the more error-prone the model. If each step makes the loss function keep decreasing, the model is constantly improving, and the most effective way to decrease the loss is to move in the direction of its negative gradient. The advantages of this algorithm include that no feature normalization is required, features are selected automatically, the model is interpretable, and multiple loss functions can be used.

2.3 XGBoost Algorithm

XGBoost (eXtreme Gradient Boosting) is a massively parallel boosted-tree method that extends the gradient boosting algorithm [26]. Under the same conditions, XGBoost is more than 10 times faster than comparable algorithms [27]. XGBoost constructs trees in parallel with multiple CPU threads and supports platforms such as Yet Another Resource Negotiator (YARN) and the Message Passing Interface (MPI) for distributed computing, which further improves the training speed. Its main advantages are efficiency and higher accuracy.

3 The PingER Framework

This paper is based on the PingER framework developed at the Stanford Linear Accelerator Center (SLAC). Originally, it was designed to support the data-intensive High Energy Nuclear and Particle (HENP) physics experiments taking place at SLAC, the Brookhaven National Laboratory (BNL), and the European Center for Particle Physics (CERN). For the last fifteen years, however, the focus of the project has been to measure, store, and analyze the historical end-to-end performance of Internet links worldwide [28–30].

PingER consists of more than 50 Monitoring Agents (MAs) active in 20 countries, as shown in Fig. 1. The PingER measurement cycle is activated by the MAs every half hour. Each MA has a list of remote sites of interest. During each cycle, it sends a set of 100-byte ping requests and 1000-byte ping requests to each target in its remote-site list. The initial 100-byte ping is normally discarded because it is used to prime the routing caches. The cycle for a remote site stops when the MA has received 10 ping responses or has issued 30 ping requests. The raw data collected for each set of pings consists of the MA name and IP address, the target remote-site name and IP address, the payload size, time stamp, packets sent, packets received, minimum, maximum, and average Round Trip Time (RTT), followed by the sequence numbers and actual RTTs of the received packets. The data is publicly available through a web server (at each monitoring site) running a Common Gateway Interface (CGI) program.

The main host at SLAC works as a central data storage repository. It fetches all the raw data collected by each MA and stores it in a database on a daily basis. The data is analyzed to extract sixteen Internet performance metrics, e.g., round trip time (average, maximum, and minimum), packet loss, jitter, unreachability, throughput, directivity, unpredictability, and quiescence for each day, month, and year. Daily, monthly, and yearly summary reports are then compiled for each MA and remote-site pair. Currently, PingER's data warehouse holds about 60 GB of Internet performance data, stored in more than 100,000 text files at a compression ratio of 5:1.


Fig. 1. PingER’s MA and remote sites around the world (the red dot in the figure represents the monitoring agents and green represents the remote site). (Color figure online)

4 The PingER Data

As mentioned before, PingER has a long history and a large body of data in the field of network performance monitoring, but it currently offers no prediction service. The main advantages of using PingER data are that plenty of historical data exists and that it is easy to use. Historical performance data is essential for predicting future Internet performance: PingER has been operating since 1995 and has continuously monitored the Internet performance of over 700 sites [31]. Historical data can be viewed hourly, daily, monthly, or yearly for any monitoring-host/remote-site pair, and the data is compressed into files named after the performance metric. In addition, all performance data is displayed on the PingER visualization web page, so users can easily download it and conduct experimental analysis [32]. Furthermore, the PingER monitoring framework has retained a large user base since its adoption and supports Internet development in many regions. However, the platform does not predict Internet performance, so adding performance prediction to PingER is highly worthwhile.

4.1 Data Sources

The number of Internet users in Asia keeps increasing, and users in China, India, and Japan account for 66% of all Asian Internet users, as shown in Table 1 [33]. Measuring and predicting the Internet performance of these three countries is therefore important for understanding Internet conditions in Asia as a whole. This paper uses the average round-trip time as the experimental metric; the three selected links are shown in Table 2.


Table 1. Asia Internet user and population data

Asia     Population       Internet users   Penetration (% population)   Users (% Asia)
China    1,415,045,928    772,000,000      54.6%                        38.1%
India    1,354,051,854    462,124,989      34.1%                        22.8%
Japan    127,185,332      118,626,672      93.3%                        5.9%

Table 2. Data sources

Monitoring-site              Remote-site       Country
EDU.SLAC.STANFORD.PINGER     CN.EDU.N1         China
EDU.SLAC.STANFORD.PINGER     IN.MITPUNE.WWW    India
EDU.SLAC.STANFORD.PINGER     JP.U-TOKYO.AC     Japan

4.2 Data Pre-processing

The downloaded files are in tab-separated (.tsv) format, with a total of 1095 records for the three links. Each record contains the name of the performance metric, the monitoring host, the remote site, the date, and other related information, and missing values are marked with a dot (.). We first convert the source files to comma-separated (.csv) files and then replace each missing value with the average value of its link. The data distribution after this replacement is shown in Fig. 2.
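A sketch of this pre-processing step with pandas; the file and column names are placeholders, since the real PingER export layout differs from this simplified single-link view.

```python
# Convert a per-link .tsv export to .csv and replace "." (missing) values
# with the link's mean average RTT.
import pandas as pd

df = pd.read_csv("CN.EDU.N1_avg_rtt.tsv", sep="\t", na_values=".")   # "." marks missing values
df["avg_rtt"] = df["avg_rtt"].fillna(df["avg_rtt"].mean())           # per-link mean substitution
df.to_csv("CN.EDU.N1_avg_rtt.csv", index=False)
```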

Fig. 2. The raw data distribution of China, India and Japan

5 Proposed Approach

In this paper, the average round-trip time from the PingER monitoring framework is selected as the basic metric for Internet performance prediction for the three countries, i.e., China, India, and Japan. The prediction proceeds through data collection, missing-value processing, feature selection, algorithm selection, model building, prediction, and model evaluation; the overall process is shown in Fig. 3.


Fig. 3. Predicting process

5.1 Select the Characteristic Variable

Features are constructed manually from the original data set. Analysis and visualization of the average round-trip time show that the current day's Internet performance is correlated with the performance of the previous days. After experimenting with different numbers of features, we found that using five features gave the best prediction results. That is, the average round-trip times of the previous five days serve as the independent variables x1, x2, x3, x4, and x5, and the average round-trip time of the current day serves as the dependent variable Y. For example, May 19, 2018 provides the dependent variable, while May 14, 15, 16, 17, and 18 provide the independent variables.
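A short sketch of building these five lag features with pandas; the file and column names, and the ordering of x1 as the most recent previous day, are assumptions.

```python
# Build x1..x5 (previous five days' average RTT) and Y (current day's value).
import pandas as pd

df = pd.read_csv("CN.EDU.N1_avg_rtt.csv")
for k in range(1, 6):
    df[f"x{k}"] = df["avg_rtt"].shift(k)       # value k days before the current day
df["Y"] = df["avg_rtt"]
samples = df.dropna()                          # the first five days lack a full history
X, y = samples[["x1", "x2", "x3", "x4", "x5"]], samples["Y"]
```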

5.2 Establish a Multiple-Linear Regression Model

In this paper, we use multiple linear regression, random forest, gradient boosting, and XGBoost to build the training models. The training data is fed into each model, the trained models are then used to predict the test data set, and finally the Root Mean Square Error (RMSE) is used to evaluate the quality of the results.

The basic task of multiple linear regression analysis is to establish a regression equation between the dependent variable and several independent variables from their actual observations; to test the significance of the linear effect of each independent variable on the dependent variable; to choose the independent variables with a significant linear influence and establish the optimal regression equation; to evaluate the relative importance of each independent variable; and to determine the deviation of the optimal regression equation [34, 35]. Studying the relationship between two or more independent variables and one dependent variable under the assumption of linear correlation is called multiple linear regression analysis, and the resulting mathematical equation is a multiple linear regression model, an extension of the one-dimensional linear regression model.

Let the dependent variable $y$ and the independent variables $x_1, x_2, \ldots, x_{m-1}$ have $n$ groups of observed data, where $y$ is an observable random variable influenced by the $m-1$ non-random factors $x_1, x_2, \ldots, x_{m-1}$ and a random factor $\varepsilon$. If $y$ and $x_1, x_2, \ldots, x_{m-1}$ have the linear relationship shown in Eq. (1):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{m-1} x_{m-1} + \varepsilon \qquad (1)$$

where $y$ is the dependent variable, $x_1, x_2, \ldots, x_{m-1}$ are the independent variables, $\beta_0, \beta_1, \ldots, \beta_{m-1}$ are $m$ unknown parameters, and $\varepsilon$ is an unobservable random error term with mean 0 and variance $\sigma^2 > 0$; it is generally assumed that $\varepsilon \sim N(0, \sigma^2)$. For $n\ (n \ge p)$ independent observations, the $n$ data samples satisfy Eq. (2):

$$\begin{cases} y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_{m-1} x_{1,m-1} + \varepsilon_1 \\ y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_{m-1} x_{2,m-1} + \varepsilon_2 \\ \quad\vdots \\ y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_{m-1} x_{n,m-1} + \varepsilon_n \end{cases} \qquad (2)$$

where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are mutually independent and follow the distribution $N(0, \sigma^2)$. To facilitate mathematical processing, Eq. (2) is written in matrix form, which turns Eq. (1) into Eq. (3):

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \qquad (3)$$

5.3 Parameter Calculation

The parameters $\beta_0, \beta_1, \ldots, \beta_{m-1}$ of the regression equation are unknown. Estimating them with the sample statistics $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{m-1}$ gives the estimated multiple regression equation shown in Eq. (4):

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_{m-1} x_{m-1} \qquad (4)$$

The least squares method is then used to obtain the values of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{m-1}$, i.e., the sum of squared residuals is minimized to solve for the parameters of the regression equation. The resulting parameters are shown in Table 3.

Table 3. The parameters of the regression equation

         β̂0        β̂1       β̂2       β̂3        β̂4        β̂5
China    31.9188    0.6263    0.069    −0.0823   0.08785   0.1135
India    43.7882    0.8644    0.0965   0.0933    −0.02727  −0.0023
Japan    5.2331     0.813     0.0013   0.07288   −0.1283   0.2002
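A sketch of the least-squares fit behind Table 3, using scikit-learn for convenience (any ordinary-least-squares routine yields the same estimates). The file and column names are placeholders, and the 338-day training window follows Sect. 6.

```python
# Fit the five-lag linear model and print the intercept and coefficients.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("CN.EDU.N1_avg_rtt.csv")
for k in range(1, 6):
    df[f"x{k}"] = df["avg_rtt"].shift(k)
data = df.dropna()
X, y = data[["x1", "x2", "x3", "x4", "x5"]], data["avg_rtt"]

ols = LinearRegression().fit(X.iloc[:338], y.iloc[:338])
print("beta_0        :", ols.intercept_)
print("beta_1..beta_5:", ols.coef_)
```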

The corresponding regression equations for the three links are therefore given in Eqs. (5)–(7):

China: $Y = 31.9188 + 0.6263x_1 + 0.069x_2 - 0.0823x_3 + 0.08785x_4 + 0.1135x_5$  (5)

India: $Y = 43.7882 + 0.8644x_1 - 0.0965x_2 + 0.0933x_3 - 0.02727x_4 - 0.0023x_5$  (6)

Japan: $Y = 5.2331 + 0.813x_1 + 0.0013x_2 + 0.07288x_3 - 0.1283x_4 + 0.2002x_5$  (7)

Correlation Coefficient Check

The Root Mean Square Error (RMSE) is used in this paper to judge the quality of the prediction results. RMSE is the square root of the mean squared difference between the predicted and real values [36]; it is a quantitative trade-off measure, as shown in Eq. (8):

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{m=1}^{N}(x_m - \hat{x}_m)^2}{N}} \qquad (8)$$

where $x_m$ is the real value and $\hat{x}_m$ is the predicted value. Clearly, the smaller the RMSE, the better the prediction. As can be seen from Table 4, the multiple linear regression model is superior to the other three algorithms on the Internet performance data. Therefore, this paper uses multiple linear regression to predict Internet performance.
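A sketch of the comparison behind Table 4: train the four candidate models on the same window and rank them by RMSE on the held-out days. Hyper-parameters are library defaults rather than the authors' settings, xgboost is assumed to be installed, and the file name is a placeholder, so exact Table 4 numbers are not expected.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

df = pd.read_csv("CN.EDU.N1_avg_rtt.csv")
for k in range(1, 6):
    df[f"x{k}"] = df["avg_rtt"].shift(k)
data = df.dropna()
X = data[["x1", "x2", "x3", "x4", "x5"]].to_numpy()
y = data["avg_rtt"].to_numpy()
X_tr, X_te, y_tr, y_te = X[:338], X[338:], y[:338], y[338:]   # split as in Sect. 6

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(random_state=0),
    "Gradient boosting": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:18s} RMSE = {np.sqrt(np.mean((pred - y_te) ** 2)):.4f}")
```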


Table 4. RMSE values for the four algorithms

                    Linear regression   Random forest   Gradient boosting   XGBoost
JP.U-TOKYO.AC.N1    0.4016              0.627           0.6608              25.4598
IN.MITPUNE.WWW      1.81101             3.1409          5.3875              63.0183
CN.EDU.N1           2.5147              6.6665          6.9849              35.2993

6 Results and Discussion

According to the theories and methods above, the multiple linear regression model is finally chosen to predict Internet performance. The average RTT values of the three links to China, India, and Japan are used, with 365 data points per link. The first 338 data points are fed into the regression model as the training set to obtain the prediction equation, and the last 27 data points are used as test data. The prediction results for one of the links are listed in Table 5.

Table 5. Predicted and real values

Date        Real value   Predicted value   Error rate
2018/4/23   163.727      166.200           1.51%
2018/4/24   163.216      166.748           2.16%
2018/4/25   170.055      170.020           −0.02%
2018/4/26   165.683      166.935           0.76%
2018/4/27   168.441      168.969           0.31%
2018/4/28   168.379      168.875           0.29%
2018/4/29   163.766      166.437           1.63%
2018/4/30   163.755      166.264           1.53%
2018/5/1    164.781      167.080           1.40%
2018/5/2    163.775      166.238           1.50%
2018/5/3    166.923      167.928           0.60%
2018/5/4    183.442      178.659           −2.61%
2018/5/5    169.242      168.890           −0.21%
2018/5/6    164.638      166.801           1.31%
2018/5/7    167.415      169.225           1.08%
2018/5/8    164.198      168.036           2.34%
2018/5/9    163.873      166.524           1.62%
2018/5/10   164.964      166.817           1.12%
2018/5/11   166.781      168.129           0.81%
2018/5/12   164.964      166.583           0.98%
2018/5/13   163.984      166.054           1.26%
2018/5/14   164.408      166.480           1.26%
2018/5/15   164.017      166.301           1.39%
2018/5/16   163.838      166.006           1.32%
2018/5/17   168.782      169.167           0.23%
2018/5/18   187.569      181.341           −3.32%
2018/5/19   174.982      172.483           −1.43%

Using the obtained multiple linear regression equations, Internet performance prediction and analysis are carried out. The real and predicted values of the three links are shown in Figs. 4, 5, and 6. The average errors between the predicted and real values for the three links are 0.59%, 0.54%, and 0.12%, respectively, so the prediction accuracy is high and the model can be used to predict Internet performance.

Fig. 4. Comparison of predicted and real values of CN.EDU.N1 (average round-trip time, ms)

Fig. 5. Comparison of predicted and real values of JP.U-TOKYO.AC.N1 (average round-trip time, ms)

Fig. 6. Comparison of predicted and real values of IN.MITPUNE.WWW (average round-trip time, ms)


7 Conclusion

This paper predicts the performance of Internet links based on data collected through the PingER end-to-end Internet monitoring framework. The main performance indicator is the average round-trip time, measured for 365 days from the SLAC monitoring host in the USA to the target countries China, India, and Japan. In the first step, we pre-process the data by replacing missing values with average values, converting the file format, and extracting the key features. The data set is then divided into two parts, with the first 338 days as the training set and the last 27 days as the test set. Multiple Linear Regression, Random Forest, Gradient Boosting, and XGBoost are used to establish prediction models, which are evaluated with the RMSE. In the end, Multiple Linear Regression gave the best results for predicting the Internet performance of the links, so it is an effective model for this task. Such predictions can help network administrators, policy makers, and network service providers leverage the existing Internet infrastructure effectively and design high-performance next-generation Internet infrastructure.

Acknowledgments. This work is supported in part by the CERNET Innovation Project under Grant No. NGII20170102, the Natural Science Foundation of China under Grant Nos. 61772007 and 61632009, the Guangdong Natural Science Foundation of China under Grant No. 2016A030313540, and the Guangzhou Science and Technology Program under Grant No. 201707010284.

References 1. Singla, A., Chandrasekaran, B., Godfrey, P.B., Maggs, B.: The Internet at the speed of light. In: Proceedings of the 13th ACM Workshop on Hot Topics in Networks - HotNets-XIII, pp. 1–7 (2014) 2. Ali, S., Cottrell, R.L., Nveed, A.: Pinger Malaysia-internet performance measuring project: A case study (No. SLAC-PUB-16462). SLAC National Accelerator Lab., Menlo Park, CA, United States (2016) 3. Samknows Homepage. https://www.samknows.com/. Accessed 30 May 2018 4. Sundaresan, S., Burnett, S., Feamster, N., de Donato, W.: BISmark: A testbed for deploying measurements and applications in broadband access networks. In: Proceedings 2014 USENIX Annual Technical Conference (USENIX ATC 2014), pp. 383–394 (2014) 5. Sánchez, M., Otto, J.: Dasu: Pushing Experiments to the internet’s edge. In: Proceedings of USENIX Association, pp. 487–499 (2013) 6. Sonntag, S., Manner, J., Schulte, L.: Netradar – Measuring the wireless world. In: Wireless Network Measurements, pp. 29–34 (2013) 7. Faggiani, A., Gregori, E., Lenzini, L., Luconi, V., Vecchio, A.: Smartphone-based crowdsourcing for network monitoring: opportunities, challenges, and a case study. IEEE Commun. Mag. 52, 106–113 (2014) 8. Bajpai, V., Eravuchira, S.J., Schönwälder, J.: Lessons learned from using the RIPE atlas platform for measurement research. ACM SIGCOMM Comput. Commun. Rev. 45, 35–42 (2015)


9. Hanemann, A., et al.: PerfSONAR: A service oriented architecture for multi-domain network monitoring. In: Benatallah, B., Casati, F., Traverso, P. (eds.) ICSOC 2005. LNCS, vol. 3826, pp. 241–254. Springer, Heidelberg (2005). https://doi.org/10.1007/11596141_19 10. Matthews, W., Coffrell, L.: The PingER project: active Internet performance monitoring for the HENP community. IEEE Commun. Mag. 38, 130–136 (2000) 11. Paxson, V.: End-to-end Internet packet dynamics. IEEE/ACM Trans. Netw. 7, 277–292 (1999) 12. Wu, C.L., Chau, K.W., Li, Y.S.: Methods to improve neural network performance in daily flows prediction. J. Hydrol. 372, 80–93 (2009) 13. Zhou, D., Chen, S., Dong, S.: Network Traffic Prediction Based on ARFIMA Model. arXiv Prepr. arXiv1302.6324, vol. 9, pp. 106–111 (2013) 14. Shang, P., Li, X., Kamae, S.: Nonlinear analysis of traffic time series at different temporal scales. Phys. Lett. Sect. A Gen. At. Solid State Phys. 357, 314–318 (2006) 15. Nury, A.H., Hasan, K., Alam, M.J.: Bin: Comparative study of wavelet-ARIMA and wavelet-ANN models for temperature time series data in northeastern Bangladesh. J. King Saud Univ. Sci. 29, 47–61 (2017) 16. Yin, H., Lin, C., Sebastien, B., Li, B., Min, G.: Network traffic prediction based on a new time series model. Int. J. Commun Syst 18, 711–729 (2005) 17. Karunasinghe, D.S.K., Liong, S.Y.: Chaotic time series prediction with a global model: artificial neural network. J. Hydrol. 323, 92–105 (2006) 18. Weiyong, Z., Guangli, F.: Network traffic combination forecasting based on encompassing tests and support vector machine. Comput. Eng. Appl. 15, 84–87 (2013) 19. Hongxing, C.: Network traffic prediction based on extreme learning machine and least square support vector machine. Comput. Eng. Appl. 51(24), 73–77 (2015) 20. Hammami, C., Jemili, I., Gazdar, A., Belghith, A.: Hybrid live P2P streaming protocol. Procedia Comput. Sci. 32, 158–165 (2014) 21. Hodge, V.J., Austin, J.: A Survey of outlier detection methodoligies. Artif. Intell. Rev. 22, 85–126 (2004) 22. Cui, F.: Study of traffic flow prediction based on BP neural network. In: 2010 2nd International Workshop on Intelligent Systems and Applications, pp. 1–4 (2010) 23. Breiman, L.: Random forest. Mach. Learn. 45, 5–32 (2001) 24. Guelman, L.: Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert Syst. Appl. 39, 3659–3667 (2012) 25. Parker, C., Fern, A., Tadepalli, P.: Gradient boosting for sequence alignment. In: Proceedings of the National Conference on Artificial Intelligence, vol. 21, no. 1, p. 452. (2006). AAAI Press, Menlo Park. MIT Press, Cambridge, London (1999) 26. Chen, T., He, T.: XGBoost: eXtreme Gradient Boosting. R Packag. version 0.4-2, pp. 1–4 (2015) 27. Chen, T., Guestrin, C.: XGboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2016) 28. Ali, S., Wang, G., Cottrell, R.L., Masood, S.: Internet performance analysis of south asian countries using end-to-end internet performance measurements. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications, 2017 IEEE International Conference on Ubiquitous Computing and Communications, pp. 1319–1326 (2017)


29. Ali, S., Wang, G., Cottrell, R.L., Anwar, T.: Detecting anomalies from end-to-end internet performance measurements (PingER) using cluster based local outlier factor. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications, 2017 IEEE International Conference on Ubiquitous Computing and Communications, pp. 982– 989 (2017) 30. Mal, A., Sabitha, A.S., Bansal, A., White, B., Cottrell, L.: Analysis and clustering of PingER network data. In: Proceedings of 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), pp. 268–273 (2016) 31. Ali, S., Wang, G., White, B., Cottrell, R.L.: A Blockchain-based decentralized data storage and access framework for PingER. In: 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), pp. 1303–1308. IEEE (2018) 32. Ali, S., Wang, G., Xing, X., Cottrell, R.L.: Substituting missing values in end-to-end Internet performance measurements using k-Nearest neighbors. In: 2018 IEEE 16th International Conference on Dependable, Autonomic and Secure Computing, 16th International Conference on Pervasive Intelligence and Computing, 4th International Conference on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pp. 919–926. IEEE (2018) 33. Internet Word Stats Homepage. https://www.Internetworldstats.com/stats3.htm. Accessed 11 June 2018 34. Guo, H., Wang, X., Gao, Z.: Uncertain linear regression model and it’s application. J. Intell. Manuf. 28, 559–564 (2017) 35. Sun, H., Liu, H., Xiao, H., He, R., Ran, B.: Short term traffic forecasting using the local linear regression model. Transp. Res. Rec., 143–150 (2003) 36. Kumar, S., Gangwar, S.S.: Intuitionistic fuzzy time series: an approach for handling nondeterminism in time series forecasting. IEEE Trans. Fuzzy Syst. 24, 1270–1281 (2016)

MS-RAID: An Energy-Saving Data Layout for CDP

Jingyu Liu1,3, Ziyao Zhang1,3, Lu Liu2(✉), and Xin Chai1,3

1 School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
[email protected]
3 Hebei Province Key Laboratory of Big Data Calculation, Tianjin 300130, China

Abstract. Continuous data protection (CDP) provides an unlimited-granularity recovery point objective (RPO) and a nearly instant recovery time objective (RTO). Its demand on the storage system fluctuates greatly: higher storage bandwidth is required when the system is active and lower bandwidth when it is inactive. RAID, which is normally used in storage systems and provides fixed performance, may therefore face either a performance bottleneck or high power consumption. This paper proposes MS-RAID, a multi-level dynamic mapping storage scheme based on S-RAID. MS-RAID provides grouping strategies at several levels and meets real-time dynamic load requirements by changing the number of disks working in parallel. When the throughput rises, MS-RAID switches to a higher-level disk group to avoid a performance bottleneck; when the throughput falls, it switches to a lower-level disk group to save energy. Experiments show that MS-RAID is a more energy-efficient data layout that saves more energy and delivers better performance than S-RAID.

Keywords: RAID · CDP · Energy-saving · Storage · Data layout

1 Introduction

In the era of Big Data, data volume grows exponentially [1, 2]. IDC predicts that global data volume will double every two years; by 2020 the world's data storage capacity is expected to reach 44 ZB, and China's total data volume will increase to 8.06 ZB, accounting for 18% of the world's total [3]. Data affects every aspect of society, such as government decisions, corporate operations, and personal lives, so data reliability receives more and more attention. Continuous Data Protection (CDP) backs up data automatically and saves every version of the data, which allows data to be restored to any point in time and provides fine-grained restorable objects. CDP needs a large amount of storage space, which makes data centers grow continuously. However, CDP does not place a fixed requirement on the bandwidth of the storage system; the load is volatile and has obvious time characteristics. When the system is active (usually during the day), high storage bandwidth is required; when the system is inactive (usually at night), low storage bandwidth suffices. Current storage systems do not take this responsiveness to dynamic loads into account.


The load characteristics of the CDP system are thus poorly served: when the load increases, performance bottlenecks may occur, and when the load decreases, additional energy is wasted.

Data centers deploy RAID [4] to provide large capacity and high efficiency for the whole storage system, but energy consumption grows dramatically as data centers expand and can no longer be ignored. Much research has focused on energy saving in disk storage systems. The DRPM [5] (Dynamic Rotations Per Minute) algorithm uses multi-speed disks whose rotational speed follows the real-time workload to save energy. DPPDL [6] (Dynamic Partial-parallel Data Layout) adjusts disk parallelism dynamically according to the system load, which reduces energy consumption to a certain extent but wastes part of the disk space. Xu [7] proposed the SpringFS algorithm, an elastic multilevel load allocation method that assigns replicas among servers according to workload changes to reduce energy consumption. Studies [8, 9] show that 80% of the total cost of large data centers comes from the energy consumption of disk storage systems. Li [10–12] proposed S-RAID, a data layout for continuous data storage that saves data-center energy; it is effective, but its application to CDP has not been optimized.

According to the throughput characteristics of CDP data, this paper proposes Multiple S-RAID (MS-RAID), an energy-saving data layout with multi-stage mapping of the storage space that balances energy consumption and performance. The contributions of this paper mainly include:

1. A data layout, MS-RAID, in which the disk array is divided into multilevel storage spaces that provide different access performance and suit the dynamic load features of CDP data access;
2. The use of the Data Increment (DIP) algorithm in MS-RAID to optimize the number of I/O operations and reduce the write penalty caused by small writes;
3. A parameter-based disk state control algorithm that sets a state parameter for each disk and adjusts it dynamically according to I/O accesses, thereby controlling the disk states.

Experiments simulating a 32-way video monitoring system show that the proposed data layout avoids over-provisioned performance while still meeting high performance requirements, achieving efficient and energy-saving data transmission.

2 MS-RAID: Multilevel Mapping Data Layout

2.1 MS-RAID Data Layout

RAID5 is a storage solution that balances storage performance, data security, and storage cost. S-RAID5 groups the data chunks on each stripe and uses parallel access within a group, which is conducive to putting non-working disks to sleep while still meeting the performance requirements of the system. However, S-RAID5 cannot adapt well to CDP storage characteristics, because CDP places dynamic requirements on system performance. Based on S-RAID5, MS-RAID5 optimizes for the access characteristics of CDP data and uses a multi-level grouping strategy to meet the dynamic load demand and maximize the energy-saving effect for CDP.

MS-RAID, as shown in Fig. 1, applies a multilevel grouping to the data chunks on the same stripe of the storage system. The number of data chunks per group differs between levels, providing multiple performance levels. Low-level groups, which contain fewer disks, have low performance and low energy consumption; high-level groups, which contain more disks, have high performance and high energy consumption. When the system is idle and the performance requirement is low, a low-level group runs and the high-level groups stand by to save energy. Conversely, when the system is busy and the performance requirement is high, a high-level group runs to provide the required performance and the low-level groups stand by.
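A toy sketch (not the authors' implementation) of this group-switching idea: pick the lowest-level group whose aggregate bandwidth still covers the current load, so idle higher-level groups can stay in standby. The per-disk bandwidth and group sizes are illustrative assumptions.

```python
PER_DISK_MBPS = 60                       # assumed sequential bandwidth of one disk

groups = {0: 2, 1: 3}                    # group id -> number of data disks (as in Fig. 2)

def select_group(load_mbps):
    for gid in sorted(groups):           # try the low-energy groups first
        if groups[gid] * PER_DISK_MBPS >= load_mbps:
            return gid
    return max(groups)                   # fall back to the largest group when overloaded

print(select_group(40))    # light load  -> group 0 (2 disks running)
print(select_group(150))   # heavy load  -> group 1 (3 disks running)
```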

Fig. 1. Schematic of MS-RAID

Because a dedicated parity disk is the bottleneck of system performance in RAID4, RAID5 is adopted in this paper. An MS-RAID5 array with N disks (N ≥ 3) is divided into N stripes. Stripe_i denotes a stripe of the array and Parity_i its parity chunk. X(i, j) denotes the physical storage chunk of the array, where i is the stripe index and j the disk index, 0 ≤ i, j ≤ N−1, and D(i, j) denotes a logical data chunk. D(i, j) can be expressed as formula (1):

$$D(i, j) = \begin{cases} X(i, j), & i + j < N - 1 \\ X(i, j + 1), & i + j \ge N - 1 \end{cases} \qquad (1)$$

The parity chunk Parity_i of the same stripe can be expressed as formula (2):

$$Parity_i = X(i, N - 1 - i) \qquad (2)$$
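A small sketch of what formulas (1) and (2) imply for the physical layout; the printed layout for N = 6 can be checked against Fig. 2.

```python
# On stripe i the parity chunk sits in physical column N-1-i, and logical data
# chunk j skips over that column.
def parity_column(i, N):
    return N - 1 - i                        # Parity_i = X(i, N-1-i)

def data_column(i, j, N):
    return j if i + j < N - 1 else j + 1    # D(i, j) = X(i, j) or X(i, j+1)

N = 6
for i in range(N):
    data_cols = [data_column(i, j, N) for j in range(N - 1)]
    print(f"stripe {i}: parity on disk {parity_column(i, N)}, data chunks on disks {data_cols}")
```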


In order to adapt to the dynamic requirements of CDP, the disk array is organized into multilevel groups: the N−1 data chunks on each stripe are divided into Q groups with S_q chunks in group q (Q ≥ 2, S_q ≥ 1). The chunk allocation of the groups satisfies formula (3):

$$\sum_{q=0}^{Q} S_q = N - 1 \qquad (3)$$

Figure 2 shows an MS-RAID5 array with 6 disks organized into two-level groups: Grp0 consists of Disk0 and Disk1, and Grp1 consists of Disk2, Disk3, and Disk4. The parity chunks are evenly distributed over all disks, as shown in Fig. 2. When the system is idle, Grp1 is put into the standby state and Grp0 runs to meet the requirements; data is written into the sub-data chunks D0–D7, D20–D27, and so on. When the system is busy, Grp1 runs and Grp0 is put into the standby state; data is written into the sub-data chunks D8–D19, D28–D39, and so on. With more disks running, a higher storage bandwidth is available.

Fig. 2. The 6-disk two-level MS-RAID5 data layout

2.2 Read and Write

MS-RAID calculates parity data using the DIP algorithm [12] to avoid the write penalty when writing data into the array. Computing the parity of a stripe no longer requires reading the original data chunks; only the newly written data and the original parity data are XOR-ed, as in formula (4):

$$P = \oplus\, Disk_{write} \qquad (4)$$

where P denotes the parity data and Disk_write denotes the data to be written.


When writing to a stripe begins, the parity data is initialized as the XOR of the first and second chunks. As shown in Fig. 3, the initialization of the Stripe0 sub-parity chunk P0 is:

P0 = D0 ⊕ D1    (5)

Fig. 3. Initialize parity chunk

When data is written into different chunks of the same stripe, the old data does not need to be read back as in RAID or S-RAID; only the old parity data is read. The new parity P′0 (shown in Fig. 4) is calculated as in formula (6):

P′0 = D8 ⊕ D9 ⊕ P0    (6)
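A byte-level sketch of this incremental update; the chunk contents are arbitrary test bytes, and the final assertion checks that the shortcut equals a full recomputation.

```python
# New parity = old parity XOR newly written chunks, so D0 and D1 are never re-read.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"\x11\x22\x33\x44", b"\xa0\xb0\xc0\xd0"
p0 = xor_bytes(d0, d1)                          # initialise parity: P0 = D0 xor D1

d8, d9 = b"\x01\x02\x03\x04", b"\x05\x06\x07\x08"
p0_new = xor_bytes(xor_bytes(d8, d9), p0)       # P0' = D8 xor D9 xor P0

# sanity check: identical to recomputing over every written chunk
assert p0_new == xor_bytes(xor_bytes(d0, d1), xor_bytes(d8, d9))
```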

Fig. 4. Calculate new parity data

The DIP algorithm with its pre-read strategy not only avoids waking up inactive disks but also effectively reduces the write penalty caused by Read-Modify-Write.

2.3 Disk Scheduling

The purpose of disk scheduling is to let the multilevel-grouped RAID adapt to the dynamic requirements of CDP systems and save energy. Because the mapping between a logical block address (LBA) and a physical block address (PBA) is unknown before the data chunk is written, a mapping table between them must be created and updated as data chunks are written.


Given a logical address blkno, the first logical block address of its group Grp_p is calculated as formula (7):

$$LBA(Grp_p) = \begin{cases} 0, & p = 0 \\ N \cdot \sum_{j=0}^{p-1} S_j, & p > 0 \end{cases} \qquad (7)$$

The physical location of the data chunk with logical address blkno is calculated as formula (8):

$$f(blkno) = D\big(f_{stripe_{i,v}}(blkno),\ f_{D_j}(blkno)\big) \qquad (8)$$

The physical location of stripe_i is calculated as formula (9):

$$f_{stripe_i}(blkno) = \left\lfloor \frac{blkno - LBA(Grp_p)}{m \cdot S_p} \right\rfloor \qquad (9)$$

The physical location of the sub-stripe v within stripe_i is calculated as formula (10):

$$f_{stripe_{i,v}}(blkno) = \lfloor (blkno - S_p \cdot G \cdot m) \rfloor \bmod S_p \qquad (10)$$

The disk number of the data chunk is calculated as formula (11):

$$f_{D_j}(blkno) = \Big[(blkno - LBA(Grp_p)) \bmod S_p + \sum_{k=0}^{p-1} S_k\Big] \qquad (11)$$

MS-RAID schedules the disks according to the current system requirements. To locate the data address quickly when a disk group is started, a write address pointer P_LBA records the last chunk written in each group, which reduces the addressing delay; when a new group is started up, new data is written to the location following this pointer.
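A simplified, self-consistent sketch of the mapping in formulas (7)–(11), using the 6-disk, two-level layout of Fig. 2; the group sizes, the number of sub-stripes per stripe (m = 4), and the exact factorization of the group base address are assumptions, and parity-column skipping is omitted for clarity.

```python
S = [2, 3]          # data disks in group 0 and group 1
m = 4               # sub-stripes (rows) per stripe within a group
N_STRIPES = 6

def group_base(p):                                   # first logical block of group p, cf. (7)
    return 0 if p == 0 else N_STRIPES * m * sum(S[:p])

def map_block(blkno, p):
    local = blkno - group_base(p)
    stripe = local // (m * S[p])                     # cf. (9)
    row = (local % (m * S[p])) // S[p]               # sub-stripe inside the stripe, cf. (10)
    disk = local % S[p] + sum(S[:p])                 # global data-disk index, cf. (11)
    return stripe, row, disk

print(map_block(5, 0))                    # D5 in Fig. 2 -> stripe 0, row 2, disk 1
print(map_block(group_base(1) + 1, 1))    # D9 in Fig. 2 -> stripe 0, row 0, disk 3
```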

3 Experiment and Analysis

In this section, performance and energy-saving tests of MS-RAID are carried out on a CDP-based video monitoring data storage system.

3.1 Performance Testing

The MS-RAID chunk size can be adjusted to the specific requirements of the storage system. The monitoring environment needs at least 4.6 TB of data storage space plus an additional 10% for the file system. To meet this requirement, a two-level MS-RAID5 consisting of six 1 TB disks is configured on the Linux 3.1 kernel; group G0 consists of two disks and group G1 of three disks.


IOMeter is a powerful I/O test tool. Through its Dynamo load generator on the Linux side, IOMeter issues write requests of 2 KB–4096 KB at 40%, 60%, 80%, and 100% sequentiality to MS-RAID and S-RAID. The performance results under the different load requirements are shown in Fig. 5. When the data blocks are small, the performance of MS-RAID differs little from that of S-RAID; when the block size exceeds 128 KB, the write performance of MS-RAID improves significantly. The reasons are: (1) when the data block is smaller than 128 KB, MS-RAID enables only the low-level group G0 while G1 stands by, so its parallelism equals that of S-RAID5 and there is no obvious gap in write performance; (2) when the data block is larger than 128 KB, group G0 can no longer meet the performance requirement, so MS-RAID switches to the high-level group G1, which increases the parallelism within the group and the stripe size compared with S-RAID5. More data is written per stripe, parity data is written less frequently, and the write performance improves significantly.

Fig. 5. Performance comparison of 40%, 60%, 80% and 100% sequential write

As shown in Fig. 6, when the requests are smaller than 128 KB the response-time gap between the four schemes is not obvious: the disks are still being spun up and the disk-group pre-read is being performed. When the write requests reach 256 KB, MS-RAID has already switched on the G1 disk group, so it offers higher parallelism and lower response time than S-RAID.

Fig. 6. Comparison of write response time

The performance experiment shows that the multilevel groups of MS-RAID can allocate storage space for write requests dynamically, adjusting the number of parallel disks and opening the appropriate disk group according to the changing CDP load. Compared with S-RAID, MS-RAID opens the low-level disk group under low performance demand, reducing disk parallelism and energy consumption, and opens a high-level disk group under higher performance demand, increasing disk parallelism and guaranteeing the required performance. By always opening the appropriate disk group, MS-RAID meets the performance requirements of the CDP system and effectively balances the contradiction between energy consumption and performance.

3.2 Energy Consumption Test

In the energy consumption test, the total energy consumption of the disk array is calculated as formula (12):

$$W = \sum_{i=0}^{n} V_i \times I_i \qquad (12)$$

where W denotes the total energy consumption of the disk array, and V_i and I_i denote the real-time voltage and current of disk i. To strengthen the comparison between different groupings, an S-RAID5 with 7 disks is set up in the same environment, consisting of two groups of data disks (3 disks per group) and a parity disk.
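A small numeric sketch of formula (12); the per-disk readings and the sampling interval are illustrative assumptions (the text leaves the interval implicit), and summing the sampled values over the 24-hour window gives curves like those in Fig. 7.

```python
# Sum of per-disk V*I gives the instantaneous array power; multiply by the
# sampling interval to accumulate energy.
def total_energy_j(samples, interval_s=1.0):
    """samples: list of (volts, amps) pairs, one per disk, for one sampling instant."""
    power_w = sum(v * i for v, i in samples)      # W = sum(V_i * I_i)
    return power_w * interval_s

readings = [(5.0, 0.45), (5.0, 0.02), (12.0, 0.30)]   # made-up per-disk readings
print(f"{total_energy_j(readings):.2f} J over one second")
```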


To avoid the impact of the system cache on the experimental data, the energy consumption of MS-RAID and S-RAID is monitored continuously for 24 h after the system has already run for one day. The results of the energy consumption test are shown in Fig. 7.

Fig. 7. Energy consumption comparison of three schemes (power consumption in W over 24 h)

At the beginning of the test, the energy consumption of MS-RAID5 and of S-RAID5 with 2 disks per group differs little, because only 20 cameras are running at first and the load is small; MS-RAID5 opens only the low-level group G0 while still guaranteeing the required performance. The S-RAID5 with three disks per group keeps one more disk running than MS-RAID5, which leads to excess performance and greater energy consumption. At this stage, the average power of MS-RAID is 9.3 W versus 12.3 W for the 3-disk-group S-RAID, a saving of 24.4%. In the second stage of the experiment, 32 cameras work at the same time, so the load and the performance requirements increase. MS-RAID5 opens the higher-performance group G1, which increases energy consumption while providing the needed performance; S-RAID5 with 2 disks per group keeps its energy consumption low but cannot meet the higher performance requirement. The performance and energy consumption tests show that MS-RAID not only reduces energy consumption when the CDP system is idle but also ensures system performance when it is busy, balancing the contradiction between energy consumption and performance.

4 Conclusion

In this paper, a multilevel grouping strategy is proposed for CDP systems, and the corresponding data layout, MS-RAID, is designed and implemented. The DIP algorithm is used to optimize write performance, and disk energy-saving scheduling gives the storage system higher energy efficiency. A multilevel division strategy is adopted in the array: different groups contain different numbers of disks, corresponding to the different performance requirements of the CDP system. The experiments show that the 6-disk two-level MS-RAID5 improves performance by 15.9%, 29.0%, 31.4%, and 33.6% over the 2-disk-group S-RAID5 when the workload is 40%, 60%, 80%, and 100% sequential, and saves 34.6% of the energy consumed by the 3-disk-group S-RAID5. This proves that MS-RAID is a more energy-efficient data layout.

Acknowledgements. The work was supported by the Natural Science Foundation of China (No. 61876019), the Natural Science Foundation of Hebei Province (Grant No. F2016202145), the Youth Foundation of the Education Commission of Hebei Province (Grant No. QN2014192), and the Science and Technology Planning Project of Hebei Province of China (Grant No. 15210325).

References 1. Yu, X., Tan, Y., Zhang, C., Liang, C., Khaled, A., Zheng, J., Zhang, Q.: A high-performance hierarchical snapshot scheme for hybrid storage systems. Chin. J. Electron. 27(1), 76–85 (2018) 2. Yan, F., Tan, Y., Zhang, Q., Wu, F., Cheng, Z., Zheng, J.: An effective RAID data layout for object-based de-duplication backup system. Chin. J. Electron. 25(5), 832–840 (2016) 3. Dong, Y., Liu, J., Yang, J., et al.: HS-RAID 2: optimizing small write performance in HSRAID. J. Electr. Comput. Eng. 2016, Article no. 7341735, 8 pages (2016) 4. Patterson, D.: A case for redundant arrays of inexpensive disks. In: Proceedings of ACM SIGMOD Conference (1988) 5. Gurumurthi, S., Sivasubramaniam, A., Kandemir, M., et al.: DRPM: dynamic speed control for power management in server class disks. In: Proceedings of the 30th Annual International Symposium on Computer Architecture, San Diego, pp. 169–179. IEEE (2003) 6. Sun, Z., Zhang, Q., Li, Y., Tan, Y.A., et al.: DPPDL: a dynamic partial-parallel data layout for green video surveillance storage. IEEE Trans. Circuits Syst. Video Technol. PP(99), 1 (2016) 7. Xu, L., Cipar, J., Krevat, E., et al.: SpringFS: bridging agility and performance in elastic distributed storage. In: Proceedings of Usenix Conference on File and Storage Technologies, pp. 243–255. USENIX Association (2014) 8. Basmadjian, M., Hermann, M.D., Lent, R., Giovanni, G.: Cloud computing and its interest in saving energy: the use case of aprivate cloud. J. Cloud Comput. Adv. Syst. Appl. 1(5), 1– 11 (2012) 9. Eric, S., Michael, L., Jon, S., et al.: Computational solutions to large-scale data management and analysis. Nat. Rev. Genet. 11, 647–657 (2010) 10. Li, X., Tan, Y., Sun, Z.: Semi-RAID: a reliable energy-aware RAID data layout for sequential data access. In: Proceedings of IEEE, Symposium on MASS Storage Systems and Technologies, pp. 1–11. IEEE Computer Society (2011) 11. Liu, J., Zhang, J., Li, Y., et al.: Hybrid S-RAID: an energy-efficient data layout for sequential data storage. J. Comput. Res. Dev. 50(1), 37–48 (2013). (in Chinese) 12. Liu, J., Tan, Y., Xue, J., et al.: Writing optimization strategy in S-RAID based on sequential data characteristics. Chin. J. Comput. 37(3), 721–734 (2014). (in Chinese)

Incentivizing Multimedia Data Acquisition for Machine Learning System

Yiren Gu1, Hang Shen1,2(✉), Guangwei Bai1, Tianjing Wang1, Hai Tong1, and Yujia Hu1

1 College of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China
[email protected]
2 Department of Electrical and Computer Engineering, University of Waterloo, Waterloo N2L3G1, Canada

Abstract. To address restrictions on data collection, incentivized multimedia data acquisition for machine learning systems is proposed. This paper presents an effective QoI (Quality-of-Information)-aware incentive mechanism for multimedia crowdsensing, with the objective of promoting the growth of an initial training model. First, an incentive model is constructed in the form of a reverse auction that maximizes social welfare while meeting requirements on quality, timeliness, correlation, and coverage. Then, we discuss how to achieve the optimal social welfare in the presence of an NP-hard winner determination problem. Lastly, a practical incentive mechanism for solving the auction problem is designed and shown to be truthful, individually rational, and computationally efficient. Extensive simulation results demonstrate that the proposed incentive mechanism produces close-to-optimal social welfare and yields high-QoI datasets. In particular, a significant improvement in machine learning model growth is achieved with lower complexity.

Keywords: Multimedia crowdsensing · Incentive mechanism · Machine learning · QoI · Auction

1 Introduction

There are two common ways to promote the growth of an initial machine learning model in a short time: optimizing the algorithm or improving the quality of the dataset. The former (e.g., MobileNets [5, 10]) optimizes the model framework by improving the algorithm, while the latter provides a large amount of data for continuous training and learning. An immature machine learning model trained on a large amount of data can often beat a well-designed, sophisticated model trained on only a small amount of data, for example in Automatic Speech Recognition [6] and Image Classification [7]. However, as machine learning technology matures, datasets face strict QoI requirements (including quality, timeliness, correlation, and coverage). Training with large-scale datasets takes a long time, so a machine learning system wants to receive, before model training, high-quality datasets that satisfy a certain coverage requirement within the prescribed time and areas.

Multimedia Crowdsensing (MC) [1, 2], the crowdsourcing of multimedia data, has huge potential to enable new machine-learning-assisted multimedia applications with tremendous benefits in a variety of fields, including Google Views [8, 9]. Anyone with a Google account can log into Google Views and share the panoramic street views they photograph, which can serve the training of machine learning systems. What makes it special is that the Google Views homepage recommends some of the street views users contribute, and clicking on an account shows that user's street-view album. Compared with traditional data collection methods, MC makes large-scale participatory sensing viable quickly and with little infrastructure cost by using personal mobile devices as sensor nodes, which can provide massive datasets for machine learning model training.

In general, the cost to a mobile user of participating in MC involves resource consumption and privacy leakage. Without sufficient incentives, mobile users may be reluctant to participate, and large amounts of high-QoI [4] data cannot be obtained for model training. It is therefore necessary to design an effective incentive mechanism that encourages people to participate. Recent research has focused on game-theoretic incentive mechanisms for MC systems [17–19] and uses the user's bidding price as an important metric for rewards. However, most existing mechanisms fail to incorporate QoI requirements, which depend on the specific application: [12] considers the timeliness and effort of the collected datasets, while [11] defines QoI as the effective coverage and gathering time of tasks. These indicators have not all been considered together in existing incentive mechanisms. Moreover, little work has combined mechanism design with machine learning scenarios, even though the requirements on training data keep rising as the machine learning field matures.

In this paper, an effective incentive mechanism for a multimedia-crowdsensing-enabled machine learning system is proposed, focusing on obtaining massive high-QoI training datasets to enhance the growth of the training model. The main contributions include:

1. To guarantee the utilities of both the machine learning system and the participating users, an incentive model based on a reverse auction is presented, which maximizes social welfare subject to the quality requirements of tasks, the timeliness of joining tasks, and the correlation and coverage of the collected pictures.
2. How to achieve the optimal social welfare is discussed in the presence of an NP-hard winner determination problem. A practical incentive mechanism for solving the auction problem is then designed and shown to be truthful, individually rational, and computationally efficient.
3. Extensive simulation results show that the proposed mechanism achieves noticeable superiority and produces close-to-optimal solutions, and that the datasets provided by our mechanism accelerate the growth of the machine learning model.

The rest of this paper is organized as follows. The motivation of this work is discussed in Sect. 2. Section 3 describes the system model of our mechanism. The optimal solution and auction algorithm are given in Sect. 4. Finally, we present simulation results and performance analysis in Sect. 5 before concluding in Sect. 6.


2 Motivation

To obtain a mature image classification model, we constructed a 5-5 layer CNN framework and trained it on the real Belgian traffic dataset [3]. The training results show that the classification accuracy approaches 100% as the number of iterations increases and the loss value indeed falls towards 0 (as illustrated in Fig. 1), which indicates that the model becomes relatively mature in the later stage. However, no matter how the model framework is optimized, there is little room left to improve the classification accuracy in this later period. The training classification accuracy for different batch sizes is shown in Fig. 2.
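A minimal stand-in (not the authors' exact 5-5 layer network) for this kind of CNN classifier, written with tf.keras; the input size, layer widths, and optimizer are assumptions, while the 62 output classes match the Belgian traffic-sign dataset.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(62, activation="softmax"),   # 62 traffic-sign classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=64, epochs=20, validation_data=...)
```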

2 Motivation To obtain a mature Image Classification Model, a 5-5 layer CNN framework is constructed by us. It is trained with the real Belgian traffic datasets [3]. The training results show classification accuracy of model training approaches to 100% with the number of iteration increasing and the loss value of model indeed is falling to 0 (as illustrated in Fig. 1), which illustrate the model we constructed is becoming relatively mature in later stage. However, no matter what the model framework is optimized, there is not much room for improvement of the classification accuracy at the later period. The results of the accuracy of training classification with different batch sizes are shown in Fig. 2.

Fig. 1. Variation diagram of loss function

Fig. 2. Training accuracy curve of different samples

In view of this problem, final classification is analyzed by visualization of matplotlib tool. Note that, if the color of forecast and actual category is green, which illustrates classification results are correct; if red, wrong. Part of classification results is shown in Fig. 3. Visual results reveal that misclassified images are almost low-quality, which indicates the datasets quality is also a key factor slowing down the growth of model.

Fig. 3. Comparison of predictions and actual categories under matplotlib visualization

Incentivizing Multimedia Data Acquisition

145

3 System Model A multimedia crowdsensing system, as shown in Fig. 4, is composed of a machine learning system, a multimedia sensing platform and many smartphone users. For model training needs, the machine learning system announces picture collection tasks f directly, which is expected to be accomplished in a period of time T1  T2 . Considering the diversity of training data, machine learning system needs users to collect M types images, a picture type represents a subtask, denoted as f ¼ fs1 ;    ; sM g. ① announcing tasks

Machine learning system

issuing tasks

② submitting sensing plans

returning results

③ determining winners Multimedia sensing platform

④ uploading sensing pictures ⑤ making payments

Mobile Users

Fig. 4. Interaction model of platform and users in MC

The sensing platform processes requests from the machine learning systems and helps him recruit mobile users. The user set is denoted as U ¼ fu1 ;    ; uN g, where N is the number of users. The interaction between sensing platform and mobile users can be formulated as a reverse auction mechanism design problem, which is described as follows: 1. The platform issues tasks to recruit quality-guaranteed users to participate. 2. Each user submits his sensing plan, which is denoted as a tuple bidi consisting of the set of tasks he wants to execute gi f and his bidding bi for these tasks. 3. The platform uses an incentive mechanism to select the winners and calculates payments, represented as ~ p ¼ fp1 ;    ; pN g. 4. Winners perform sensing tasks and submit results to the platform. 5. The platform checks the results and makes payments for winners. At last, all pictures are sent to machine learning system for model training. 3.1

3.1 Auction Framework

As motivated by Sect. 2, low-quality datasets hinder the growth of an initial model, so we consider recruiting a quality crowd to undertake the image collection tasks. QoI acts as an important index that is integrated into our incentive mechanism. It is calculated by the sensing platform, depends on the sensing application and can be defined according to various factors. For example, in [15] QoI refers to the quality of the uploaded photos, since high-quality photos help the platform better identify visible problems with medical devices; in [14] QoI refers to the users' estimation accuracy of air quality. The QoI used in this paper is defined as follows.


Definition 1 (QoI of image datasets). The QoI of sensing, denoted by $q = \{q_1, \ldots, q_N\}$, is the clarity of the pictures. It often depends on the joining time $t_i$ and on the task achievements of the users (such as the correlation $\alpha_i$ and the coverage of the shooting targets $\beta_i$), which are complementary factors in the process of executing tasks. The earlier the joining time $t_i$ of a user, the more time he has to prepare for collecting images; with sufficient time, the pictures are often more relevant to the platform's requirements and the coverage of the shooting is also wider. If these constraints change, the image quality also changes. We assume that the platform maintains a historical record of the users' QoI profile $q$, which is used as input for winner and payment determination. Each subtask $s_j$ needs to be completed with a minimum quality $Q_j$; we denote by $Q = \{Q_1, \ldots, Q_M\}$ the profile of quality requirements of the subtasks.

Definition 2 (A User's Utility). The payoff $w_i$ of any $u_i \in U$ is defined as the difference between the payment $p_i$ and the cost $c_i$, which satisfies:

$$ w_i = \begin{cases} p_i - c_i, & x_i = 1,\\ 0, & x_i = 0. \end{cases} \qquad (1) $$

Here, $x_i$ equals 1 if $u_i$ wins and 0 otherwise:

$$ x_i = \begin{cases} 1, & \text{if } u_i \text{ is chosen as a winner},\\ 0, & \text{otherwise}. \end{cases} \qquad (2) $$

If $u_i$ is a winner of our auction, he is paid $p_i$ for executing the corresponding set of sensing tasks; otherwise, he is not allocated any sensing task and receives zero payment.

Definition 3 (Platform's Profit). The profit of the sensing platform is given as follows:

$$ \phi = V(S) - \sum_{i \in S} p_i \qquad (3) $$

where the value function $V(S)$ represents the sum of the values $v_i$ contributed by the winner set $S$:

$$ V(S) = \sum_{i \in S} v_i = \sum_{i \in S} k\, q_i\, |g_i| \qquad (4) $$

In Eq. (4), $k$ is a coefficient that transforms the image QoI into a monetary reward, and $|g_i|$ is the number of categories collected by $u_i$. The value function $V(S)$ is monotonic in $q$: for any $\vec{q} = \{q_1, \ldots, q_N\}$ and $\vec{q}' = \{q'_1, \ldots, q'_N\}$, we have $V_{\vec{q}}(S) \le V_{\vec{q}'}(S)$ if $q_i \le q'_i$ holds for all $u_i \in U$. Since the smartphones are owned by different users, who are selfish but rational, mobile users will not participate in MC without a sufficient incentive. To guarantee the utilities of both the sensing platform and the participating users, the goal of this paper is similar to that of the traditional VCG mechanism [20, 21],


aimed at designing an efficient incentive mechanism that maximizes the social welfare. It is formally described in Definition 4.

Definition 4 (Social Welfare). The social welfare of the whole MC system is the sum of the users' payoffs and the sensing platform's profit:

$$ \gamma = \phi + \sum_{i \in U} w_i = V(S) - \sum_{i \in S} c_i \qquad (5) $$
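The quantities of Definitions 2-4 can be written down directly. The sketch below assumes the coefficient k, the bids and the winner set as plain Python inputs and simply mirrors Eqs. (1)-(5); it is an illustration, not the paper's implementation.

```python
# A sketch of the utility, value, profit and welfare quantities (Eqs. 1-5).
from typing import Dict, Set

def value(S: Set[int], quality: Dict[int, float],
          tasks: Dict[int, Set[int]], k: float) -> float:
    """V(S) = sum over winners of k * q_i * |g_i|  (Eq. 4)."""
    return sum(k * quality[i] * len(tasks[i]) for i in S)

def user_utility(i: int, S: Set[int], payment: Dict[int, float],
                 cost: Dict[int, float]) -> float:
    """w_i = p_i - c_i if u_i wins, 0 otherwise  (Eq. 1)."""
    return payment[i] - cost[i] if i in S else 0.0

def platform_profit(S, quality, tasks, payment, k) -> float:
    """phi = V(S) - total payments  (Eq. 3)."""
    return value(S, quality, tasks, k) - sum(payment[i] for i in S)

def social_welfare(S, quality, tasks, cost, k) -> float:
    """gamma = V(S) - sum of the winners' true costs  (Eq. 5)."""
    return value(S, quality, tasks, k) - sum(cost[i] for i in S)
```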

3.2 Desirable Properties

Specifically, a user participating in sensing tasks incurs a cost $c_i$, and his maximum executable task set is $g'_i$; both are private and known only to the user himself. As a result, $c_i$ and $g'_i$ may differ from the declared $b_i$ and $g_i$, respectively. This section describes three desirable properties for our auction mechanism.

• Truthfulness: An auction mechanism is truthful if and only if, for every bidder $u_i \in U$, bidding his true value $(g'_i, c_i)$ is a dominant strategy.
• Individual Rationality: An auction mechanism is individually rational if, for any bidder $u_i$, the payoff is nonnegative when he bids his true value $(g'_i, c_i)$.
• Computational Efficiency: An auction mechanism is computationally efficient if the outcome can be computed in polynomial time.

Truthfulness is the most difficult of the three properties to achieve. The bid is two-dimensional: it contains the declared cost $b_i$ and the task set $g_i$ of bidder $u_i$. Therefore, Myerson's theorem [13] about the properties of one-parameter truthful mechanisms cannot be applied directly. To design a truthful auction mechanism with two dimensions, the following definitions are introduced:

Definition 5 (b-Monotonicity). If bidder $u_i$ wins by bidding $(g_i, b_i)$, then he also wins by bidding $(g_i, b'_i)$ for any $b'_i \le b_i$.

Definition 6 (g-Monotonicity). If bidder $u_i$ wins by bidding $(g_i, b_i)$, then he also wins by bidding $(g'_i, b_i)$ for all $g'_i \supseteq g_i$.

Definition 7 (Critical Payment). The payment $p_i$ for a winning bidder $u_i$ is set to the critical value $d_i$ such that bidder $u_i$ wins if $b_i \le d_i$ and loses if $b_i > d_i$.

Lemma 1. A mechanism is truthful if it satisfies b-monotonicity, g-monotonicity and the critical payment property.

Proof: A truthful bidder receives a positive utility, which can easily be verified. If $u_i$ loses with an untruthful sensing plan $(g_i, b_i)$, or declares $g_i \not\subseteq g'_i$, his utility will be negative. As a result, only the case in which $(g_i, b_i)$ wins and $g_i \subseteq g'_i$ needs to be considered.


First, from the g-monotonicity property, a user bidding $(g'_i, b_i)$ also wins. Suppose that the payment for the bid $(g_i, b_i)$ is $p$ and that for the bid $(g'_i, b_i)$ is $p'$. Any bid $(g'_i, b'_i)$ with $b'_i > p'$ loses, because $p'$ is the critical payment for the task set $g'_i$; similarly, bidding $(g_i, b'_i)$ also loses, by monotonicity. Therefore, the critical payment for $(g_i, b_i)$ is at most that for $(g'_i, b_i)$, which means $p \le p'$; in other words, the user does not increase his utility by bidding $(g_i, b_i)$ instead of $(g'_i, b_i)$. Next, consider the true bid $(g'_i, c_i)$, whose payment, by the critical payment property, is the same as that of the bid $(g'_i, b_i)$, i.e. $p_i$. If the bid $(g'_i, c_i)$ loses, then we have $c_i > p' \ge b_i$, so compared with $(g'_i, c_i)$, bidding $(g'_i, b_i)$ also does not increase his utility. ☐

4 QoI-Aware Incentive Mechanism (QoI-RA)

In this section, a QoI-aware mechanism based on a reverse auction (QoI-RA) is presented. First, we discuss how to achieve approximately maximal social welfare. Then, we design two algorithms for this purpose. Finally, we present a practical QoI-RA that satisfies the three desirable properties.

4.1 Optimal Solution of QoI-RA Auction

The goal of QoI-RA is to maximize the social welfare given in Definition 4 while achieving computational efficiency, individual rationality and truthfulness. Winner selection (QRA-WS) and pricing determination (QRA-PD) can be decoupled into two separate problems. Solving the maximization problem itself, referred to as the QRA-WS problem, is already challenging because QRA-WS is NP-hard (proved in Theorem 1), let alone combining it with the other three properties.

QRA-WS Problem: Given the information of a task set $f$ and a user set $U$, the goal of the QRA-WS problem is to find a subset $S \subseteq U$. It can be formulated as the following integer linear program:

$$ \max \sum_{i=1}^{N} w_i x_i \qquad (6) $$

subject to:

$$ \sum_{i:\, s_j \in g_i,\, u_i \in U} q_i x_i \ge Q_j, \qquad \forall s_j \in f \qquad (7) $$

$$ x_i \in \{0, 1\}, \qquad \forall u_i \in U \qquad (8) $$

$$ \alpha_i(t) \ge \hat{\alpha}, \qquad T_1 \le t \le T_2 \qquad (9) $$

$$ \beta_i(t, r) \ge \hat{\beta}, \qquad r \le l \qquad (10) $$
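For small instances, the QRA-WS program of Eqs. (6)-(8) can be handed to an off-the-shelf ILP solver. The sketch below uses the PuLP library and assumes that users violating constraints (9)-(10) have already been filtered out; the solver choice is an illustrative assumption, not part of the mechanism.

```python
# A sketch of the QRA-WS integer program (Eqs. 6-8) using PuLP.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

def solve_qra_ws(users, tasks, w, q, g, Q):
    """users/tasks: id lists; w[i]: marginal welfare; q[i]: QoI;
    g[i]: set of task ids bid by user i; Q[j]: quality requirement of task j.
    Users failing the timeliness/correlation/coverage constraints (9)-(10)
    are assumed to have been removed beforehand."""
    prob = LpProblem("QRA_WS", LpMaximize)
    x = {i: LpVariable(f"x_{i}", cat=LpBinary) for i in users}
    prob += lpSum(w[i] * x[i] for i in users)                      # objective (6)
    for j in tasks:                                                # constraint (7)
        prob += lpSum(q[i] * x[i] for i in users if j in g[i]) >= Q[j]
    prob.solve()
    winners = {i for i in users if x[i].value() == 1}
    return winners, value(prob.objective)
```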


Using (4) and (5), we get:

$$ \gamma = V(S) - \sum_{u_i \in S} c_i = \sum_{u_i \in U} (k\, q_i\, |g_i| - b_i)\, x_i \qquad (11) $$

Let $\vec{w} = \{w_1, \ldots, w_N\}$ denote the marginal social welfare profile of all users based on the users' bids, where

$$ w_i = k\, q_i\, |g_i| - b_i \qquad (12) $$

Hence, maximizing the social welfare amounts to maximizing the total marginal social welfare of the users:

$$ \gamma = \sum_{u_i \in U} w_i x_i \qquad (13) $$

4.2 Constraints of Tasks

Definition 8 (Task's Quality). We use $Q_j(S)$, given by Eq. (14), to denote the total quality with which the winners accomplish task $s_j \in f$. Satisfying a subtask's quality requirement is therefore equivalent to guaranteeing that every task is executed by users providing a sufficient amount of quality in total:

$$ Q_j(S) = \sum_{i \in S:\, s_j \in g_i} q_i \qquad (14) $$

The platform stipulates that the total quality of the images collected by the users must satisfy the requirement of each subtask; constraint (7) expresses each subtask's quality requirement.

Definition 9 (Correlation). The correlation between the image types $x$ uploaded by users and the image types $y$ required by tasks can be denoted by the function $\alpha(\cdot)$:

$$ \alpha(t - d) = \frac{1}{2}\left[\,2 - \mathrm{sgn}(t - d)\, f(d - t) + \mathrm{sgn}(d - t)\,\right] m_{xy} \qquad (15) $$

where the relevance of the image types collected by users is evaluated from two aspects: objectivity and subjectivity. The former is determined by a function satisfying the following properties: (1) it is monotonically non-increasing in the difference between $t$ and $d$, where $t$ is the joining time of the user and $d$ is the deadline of the sensing task; (2) it returns a value in $[0, 1]$: if a user joins the task earlier than the deadline it always equals 1, otherwise it decreases monotonically from 1 to 0; $f(t)$ is a sigmoid function. The subjective part $m_{xy}$ is decided by the platform's grade of the image correlation and ranges over $[0, 1]$. The platform requires the users to meet a certain relevance when collecting images, corresponding to constraint (9).
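A possible reading of this correlation constraint is sketched below, assuming the standard logistic function for f. Since Eq. (15) is only partially specified above, the exact form used by the authors may differ; the threshold value 0.6 is the one listed later in Table 1.

```python
# A sketch of the correlation constraint of Definition 9 (assumed reading of Eq. 15).
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sgn(x: float) -> int:
    return (x > 0) - (x < 0)

def correlation(t: float, d: float, m_xy: float) -> float:
    """Objective timeliness term combined with the platform's subjective grade
    m_xy in [0, 1], following the shape of Eq. (15) as read here."""
    objective = 0.5 * (2 - sgn(t - d) * sigmoid(d - t) + sgn(d - t))
    return objective * m_xy

def meets_correlation(t, d, m_xy, alpha_hat=0.6) -> bool:
    # Constraint (9): alpha_i(t) >= alpha_hat within the task window.
    return correlation(t, d, m_xy) >= alpha_hat
```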


Definition 10 (Coverage). The image coverage $\beta(\cdot)$ is represented by:

$$ \beta(r, t) = \begin{cases} \left(2 - \dfrac{2}{1 + e^{-r}}\right)\dfrac{d - t}{d - s}, & r > 0,\ s \le t \le d\\[2mm] \dfrac{d - t}{d - s}, & r = 0,\ s \le t \le d\\[1mm] 0, & \text{otherwise} \end{cases} \qquad (16) $$

where $\beta(\cdot)$ is decided by two parts. The former is a monotonically decreasing function of $r$, where $r$ is the distance between the shooting location and the target object: when $r$ is large, the coverage of the target is low. In addition, the joining times of the users differ, so they take different numbers of photos. It is assumed that once a user has joined a task he keeps executing it and spends the same time on each single picture; as a result, the number of pictures is proportional to the time the user spends in the task, which is captured by the latter part. The longer a user participates, the more pictures he can collect from different angles, given enough time and suitable conditions, so the coverage of the target becomes wider. For its model training needs, the sensing platform requires the image coverage to meet a specified standard, corresponding to constraint (10).

The optimal QoI-RA problem is then solved as follows: (1) Winner Selection: find a subset $S \subseteq U$ by solving the QRA-WS problem; (2) Payment Determination: determined on the basis of the QRA-WS problem; if $x_i = 0$ then $p_i = 0$.

Theorem 1. The QRA-WS problem is NP-hard.

Proof: We prove the NP-hardness of the QRA-WS problem by a polynomial-time reduction from the minimum weighted set cover (MWSC) problem, which is NP-hard [23]. The MWSC problem is defined as follows: a universe set, denoted by $E = \{s_1, \ldots, s_M\}$, consists of $M$ elements, and its subsets are collected in a set $f = \{g_1, \ldots, g_N\}$. Every set $g_i \in f$ has a corresponding non-negative weight $w(g_i)$. The MWSC problem is to find the minimum-weight subset of $f$ whose union is $E$. Next, we construct an instance of the QRA-WS problem from an instance of the MWSC problem in polynomial time. First, we transform $g_i$ into $g'_i$ such that for every element in $g_i$ there exist $h_i \in \mathbb{Z}^+$ copies of the same element in $g'_i$, and we require that every element $s_j \in E$ is covered at least $H_j \in \mathbb{Z}^+$ times. After this reduction, we obtain an instance of the QRA-WS problem in which the users' quality profile is $\vec{q} = \{h_1, \ldots, h_N\}$, the users' bidding task profile is $\vec{g} = \{g_1, \ldots, g_N\}$, the users' marginal social welfare profile is $\vec{w} = \{w_1, \ldots, w_N\}$, the users' duration-of-service profile is $\vec{t} = \{t_1, \ldots, t_N\}$, the users' location-of-service profile is $\vec{r} = \{r_1, \ldots, r_N\}$, and the tasks' quality requirement profile is $Q = \{H_1, \ldots, H_M\}$. Moreover, the QRA-WS problem represents a richer family of problems in which the quality $q_i$ of any user $u_i$ and the quality requirement $Q_j$ of any task $j$ can take any value in $\mathbb{R}^+$, and the marginal social welfare can take any value in $\mathbb{R}$. So every instance of the MWSC problem is polynomial-time reducible to an instance of the QRA-WS problem, and the QRA-WS problem is NP-hard. ☐

4.3 Mechanism Design

Because of the NP-hard nature of the QRA-WS problem, it is difficult to find a solution that maximizes the social welfare in polynomial time. Meanwhile, the traditional VCG auction mechanism [20, 21] cannot be applied directly because it requires the social welfare to be exactly maximized. A natural step is therefore to design a computationally efficient mechanism with close-to-optimal social welfare. On the basis of the proposed optimal solution, we use the users' bids $\{(g_1, b_1), \ldots, (g_N, b_N)\}$ to calculate the marginal social welfare profile $\vec{w}$ of the users, which acts as the input of the QRA-WS problem. First, the platform excludes the users who do not meet the timeliness, relevance and coverage requirements of the task (lines 2–4). Next, the platform puts the remaining users whose marginal social welfare is non-negative into the winner set (lines 5–6). The platform then obtains the user set $X$ of users with negative marginal social welfare by removing the current winners from the user set (line 7), and calculates the tasks' remaining quality requirement profile $Q'$ by subtracting the quality provided by the currently selected winners (lines 8–9). The main loop is executed until every task's quality requirement is satisfied (lines 10–15). In the main loop, the minimum marginal social welfare effectiveness is used to select the third batch of users; it is defined as follows:

$$ \xi = \frac{|w_i|}{\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}} \qquad (17) $$

where $\xi$ is defined as the ratio between the absolute value $|w_i|$ of the marginal social welfare of $u_i$ and his effective quality contribution $\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}$. The user with the minimum $\xi$ among the remaining users in $X$ is included into $S$. After that, the platform updates the set $S$ (lines 10–13) and the residual quality profile $\vec{Q}$ of the subtasks (lines 14–15). Then, the platform pays the winner set $S$ computed in Algorithm 1; if a user is not a winner, his payoff is zero.

Algorithm 2 describes the pricing mechanism, which takes the winner set $S$ as input and outputs the payment profile $\vec{p}$. First, $\vec{p}$ is initialized as a zero vector (line 1). Then, as in Algorithm 1, the platform excludes the users who do not meet the timeliness, relevance and coverage requirements of the task, obtaining the set $N^+$ (lines 2–4). Next, the platform puts all users with non-negative marginal social welfare into $X^+$ (lines 5–6). The main loop (lines 7–14) calculates the platform's payment to every winner. For every winner $u_i \in S$, the winner determination of Algorithm 1 is executed with all users except $u_i$ until the quality requirement of every task in $g_i$ is satisfied (lines 7–8). Then, the platform obtains the current winner set $S'$ (line 9) and computes the payment according to the following two cases (lines 10–14):

Case 1. A winner $u_i$ with $w_i \ge 0$ (lines 10–11). In this case, the user's critical payment is the bidding price $b'_i$ that satisfies $w' = k\, q_i\, |g_i| - b'_i = 0$, that is:

$$ p_i = k\, q_i\, |g_i| \qquad (18) $$


Case 2. For a winner $u_i$ belonging to case 2 (lines 13–14), we go through every $u_k \in S' \setminus U^+$ and calculate the maximum bidding price $b'_i$ at which user $u_i$ would still be able to replace $u_k$ as a winner, i.e., $b'_i$ satisfies Eq. (19):

$$ \frac{|w_k|}{\sum_{j:\, s_j \in g_k} \min(Q'_j, q_k)} = \frac{b'_i - k\, q_i\, |g_i|}{\sum_{j:\, s_j \in g_i} \min(Q'_j, q_i)} \qquad (19) $$


This can also be expressed as:

$$ b'_i = k\, q_i\, |g_i| - w_k \cdot \frac{\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}}{\sum_{j:\, s_j \in g_k} \min\{Q'_j, q_k\}} \qquad (20) $$

At last, the maximum value among all the $b'_i$ discussed above is used as the payment for $u_i$.
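Since Algorithms 1 and 2 are referenced above only by their line numbers, the following simplified sketch follows the textual description: greedy winner selection driven by the ratio of Eq. (17), then critical payments per Eqs. (18)-(20). The helper names are assumptions, the handling of the residual quality profile in the pricing step is simplified, and the paper's pseudocode remains the reference.

```python
# Simplified sketch of QoI-RA winner selection (Algorithm 1) and pricing (Algorithm 2).

def marginal_welfare(i, k, q, g, b):
    return k * q[i] * len(g[i]) - b[i]                 # Eq. (12)

def effective_quality(i, g, q, Q_res):
    return sum(min(Q_res[j], q[i]) for j in g[i])      # denominator of Eq. (17)

def select_winners(users, k, q, g, b, Q, feasible):
    """feasible(i) checks timeliness/correlation/coverage (constraints 9-10)."""
    cand = [i for i in users if feasible(i)]
    S = {i for i in cand if marginal_welfare(i, k, q, g, b) >= 0}
    X = [i for i in cand if i not in S]
    Q_res = {j: max(0.0, Q[j] - sum(q[i] for i in S if j in g[i])) for j in Q}
    while any(r > 0 for r in Q_res.values()) and X:
        # pick the user with minimum welfare-effectiveness ratio xi (Eq. 17)
        i = min(X, key=lambda u: abs(marginal_welfare(u, k, q, g, b))
                                 / max(effective_quality(u, g, q, Q_res), 1e-9))
        X.remove(i)
        S.add(i)
        for j in g[i]:
            Q_res[j] = max(0.0, Q_res[j] - q[i])
    return S

def payments(users, S, k, q, g, b, Q, feasible):
    p = {i: 0.0 for i in users}
    for i in S:
        others = [u for u in users if u != i]
        S_prime = select_winners(others, k, q, g, b, Q, feasible)
        if marginal_welfare(i, k, q, g, b) >= 0:
            p[i] = k * q[i] * len(g[i])                # Case 1, Eq. (18)
        else:
            Q_res = dict(Q)                            # residual requirements, simplified
            best = 0.0
            for u in S_prime:                          # winners u_i could replace
                if marginal_welfare(u, k, q, g, b) < 0:
                    ratio = (effective_quality(i, g, q, Q_res)
                             / max(effective_quality(u, g, q, Q_res), 1e-9))
                    cand = k * q[i] * len(g[i]) - marginal_welfare(u, k, q, g, b) * ratio
                    best = max(best, cand)             # Case 2, Eq. (20)
            p[i] = best
    return p
```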

4.4 Proof of Properties

In this section, we show that the QoI-RA auction is truthful, individually rational and computationally efficient.

Theorem 2. The QoI-RA auction is truthful.

Proof: Suppose $u_i$ wins by bidding $(g_i, b_i)$, and consider any other bid $(g'_i, b'_i)$ with $b'_i < b_i$ or $g_i \subseteq g'_i$. Two cases are analyzed. (1) $w_i \ge 0$. When $u_i$ makes the new bid $(g'_i, b'_i)$, $w'_i = k\, q_i\, |g'_i| - b'_i > k\, q_i\, |g_i| - b_i \ge 0$. (2) $w_i \le 0$. The new marginal social welfare of $u_i$ is not affected by the previous bidding; as in case 1, we obtain $w'_i \ge w_i$. As a result, $u_i$ also wins with the new bid $(g'_i, b'_i)$ in Algorithm 1, so QRA-WS satisfies both bidding-task and price monotonicity. Furthermore, it is easy to verify that the QRA-PD algorithm uses the supremum of the bidding prices $b'_i$ such that the bid $(g_i, b'_i)$ still wins. Hence, from Lemma 1, we conclude that the QoI-RA auction is truthful. ☐

Theorem 3. The QoI-RA auction is individually rational.

Proof: We consider two possible cases. First, the payoff of a mobile user $u_i \in U \setminus S$ is 0, since $u_i$ is not a winner according to Algorithm 2. Second, $u_i$ is a winner. We have proved in Theorem 2 that users bid truthfully in the QoI-RA auction, so each user bids his true cost $c_i$. Since QoI-RA preserves the critical payment property shown in Lemma 1, every winner is paid the supremum of his winning bidding prices. Then we have $p_i \ge c_i$ for every winner, i.e., $w_i = p_i - c_i \ge 0$. Therefore, the utility of every user $u_i$ is always non-negative, i.e. $w_i \ge 0$. This completes the proof. ☐

Theorem 4. The computational complexity of the QoI-RA auction is $O(N^3 M)$.

Proof: The QoI-RA auction consists of the two algorithms QRA-WS and QRA-PD. The former first goes through all users to select those who meet the requirements of timeliness, correlation and coverage, which needs $N$ iterations. Its computational complexity is dominated by the main loop, which terminates after at most $N$ iterations in the worst case; in every iteration, it also goes through every task $s_j \in f$, i.e., the inner loop runs $M$ times. Hence, the computational complexity of Algorithm 1 is $O(N^2 M)$. Similarly, Algorithm 2 first needs $N$ iterations, and then it chooses the users whose marginal social welfare is non-negative, which iterates $N$ times in the worst case. The third for-loop executes Algorithm 1 for each user $u_i \in S$. So the computational


complexity of Algorithm 2 is $O(N^3 M)$. Therefore, the overall computational complexity of the QoI-RA auction is $O(N^3 M)$. ☐
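The properties can also be spot-checked numerically on random instances using the illustrative helpers sketched in Sect. 4.3 above (select_winners, payments); these names are hypothetical and the check below is only a sanity test under assumed parameters, not a proof.

```python
# Sanity check of individual rationality on a random instance (illustrative only).
import random

def spot_check(n_users=10, n_tasks=5, k=0.1, seed=0):
    rng = random.Random(seed)
    users = list(range(n_users))
    tasks = list(range(n_tasks))
    q = {i: rng.uniform(1, 2) for i in users}
    g = {i: set(rng.sample(tasks, rng.randint(1, n_tasks))) for i in users}
    c = {i: rng.uniform(2, 4) for i in users}          # true costs
    b = dict(c)                                        # truthful bids
    Q = {j: 3.0 for j in tasks}
    feasible = lambda i: True                          # constraints (9)-(10) assumed met
    S = select_winners(users, k, q, g, b, Q, feasible)
    p = payments(users, S, k, q, g, b, Q, feasible)
    ir_ok = all(p[i] >= c[i] - 1e-9 for i in S)        # p_i >= c_i for every winner?
    return S, p, ir_ok
```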

5 Performance Evaluation

In this section, we present and discuss simulation results on a real dataset to justify the effectiveness of the proposed mechanism.

5.1 Simulation Settings

All the evaluation results are based on the real BelgianTS dataset [3]. The dataset consists of two parts: training data (4575 images) and testing data (2520 images). Each part contains 62 subdirectories, and each subdirectory from 0 to 61 represents a category/label. Each category contains a different number of traffic-sign images, which are stored in the .ppm format. In order to compare easily with the previous experiments mentioned in Sect. 2, all the training images of BelgianTS are imported into our simulation environment and each picture is labeled by us. The basic parameter settings are detailed in Table 1.

Table 1. Parameter setting

Parameters:  k     q_i      Q_j       N           M    |g_i|     c_i     l   T1  T2   t_i      α̂    r_i     β̂
Values:      0.1   [1, 2]   [10, 15]  [100, 200]  62   [20, 30]  [2, 4]  5   0   30   [0, 40]  0.6  [0, 7]  0.6
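Importing and labeling the BelgianTS training images as described above can be done, for example, as follows; the zero-padded directory names, the Pillow-based reading and the 32x32 resizing are assumptions made for illustration.

```python
# A sketch of loading and labeling the BelgianTS training set (assumed layout).
import os
import numpy as np
from PIL import Image

def load_belgiants(root: str, size=(32, 32)):
    images, labels = [], []
    for label in range(62):                       # one subdirectory per category
        class_dir = os.path.join(root, f"{label:05d}")
        if not os.path.isdir(class_dir):
            continue
        for name in os.listdir(class_dir):
            if name.lower().endswith(".ppm"):
                img = Image.open(os.path.join(class_dir, name)).convert("RGB").resize(size)
                images.append(np.asarray(img, dtype=np.float32) / 255.0)
                labels.append(label)
    return np.stack(images), np.array(labels)
```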

For comparison, we choose two well-designed incentive mechanisms. The first baseline is a revised version of the greedy auction with the constraints defined in Sect. 4, which is truthful and individually rational. Its winner determination first selects the users with $w_i \ge 0$ who meet constraints (9) and (10) as winners; differently from our mechanism, it then selects, among the remaining users, the ones with the largest marginal social welfare until the QoI requirements of the tasks are met. Its pricing mechanism pays each winner his supremum bidding price. The second baseline is a QoI-aware modification of the traditional VCG auction [20, 21] (QoI-VCG), which consists of winner determination (VCG-WD) and pricing. The concept of QoI and our constraints are integrated into the VCG-WD problem, which is solved optimally, and the pricing mechanism of [20, 21] is used to pay the winners.

5.2 Simulation Results

Experiment 1 compares our mechanism with the two well-designed baseline mechanisms in terms of social welfare. The parameters are given in Table 1. To evaluate the impact of the number of users on the social welfare, we set the number of tasks to 62 and vary the


number of users from 100 to 200 with a step of 20. It can be clearly seen in Fig. 5 that the social welfare of the three mechanisms keeps increasing as the number of users grows. The social welfare of the QoI-VCG auction equals the optimal solution of the QRA-WS problem, and it can be concluded that the social welfare of the QoI-RA auction is close to optimal and far better than that of the baseline QoI-Greedy auction.

In Experiment 2, we vary one of the constraints: the image coverage threshold is changed among 0.5, 0.6 and 0.7 with the other parameters fixed. As shown in Fig. 6, the lower the coverage threshold, the greater the social welfare. This is because, when the restrictions on these indexes are set more loosely, more mobile users have a chance to participate, which helps the platform accomplish its tasks and thus raises the social benefit. However, a certain QoI limitation should still be set for machine learning model training, and it should not be set too low.

Experiment 3 looks at the running time, comparing the QoI-RA auction with the QoI-VCG auction; the parameters are the same as in Experiment 1 and the results are presented in Table 2. The QoI-RA auction executes in significantly less time than the QoI-VCG auction. This is because the QoI-VCG auction computes the exact social welfare maximum; as the number of users increases, its execution time becomes so long that it is infeasible in practice. In contrast, the QoI-RA auction approaches the optimal social welfare while keeping a low execution time. In a word, the QoI-RA auction is much more computationally efficient than the QoI-VCG auction.

In Experiment 4, two datasets with the same size but different qualities are used to train an initial CNN model. One group is a part of the original dataset; the other group is selected by the QoI-RA mechanism. We then observe the effect of data quality on model training by comparing the two training runs. In Fig. 7, when the pictures collected by QoI-RA are input into the CNN, the model

Fig. 5. Impact of number of users

Fig. 6. Impact of coverage

Table 2. Comparison of execution time

N     100    150    200     250     300     350     400     450     500
VCG   6.325  8.361  10.574  12.016  65.293  32.786  95.475  60.251  2056
QRA   0.119  0.128  0.132   0.157   0.224   0.195   0.219   0.226   0.230


Fig. 7. Comparison of training accuracy

Fig. 8. Comparison of loss value

accuracy increases faster. This is because the QoI-RA dataset is subject to the quality constraints: it selects the high-quality images from the whole dataset. In Fig. 8, the loss value of the model trained with the QoI-RA dataset falls to 0 more quickly, which indicates that the model trained with high-quality data indeed has a better learning ability. The growth speed of an initial model can thus be accelerated to some degree by improving the image quality. Finally, the prepared testing dataset of BelgianTS (2520 images) is used to test the accuracy of the trained models: the final classification accuracy reaches 95.7862% and the time spent on classification is much shorter than before. This shows that our mechanism is helpful for obtaining high-quality data, with which the growth speed of the model can be accelerated (Table 3).

Table 3. Testing results

DataSet             Test accuracy   Testing time (s)
Original dataSet    90.4365%        1527.42
QoI-RA dataSet      95.7862%        1238.04

6 Conclusion

In this work, a QoI-aware incentive mechanism (QoI-RA) has been proposed to provide high-quality datasets for model training; it maximizes the social welfare subject to the quality requirement of each subtask, the timeliness of joining the task, and the correlation and coverage of the targets. Through extensive simulations, we show that the proposed mechanism produces noticeably close-to-optimal social welfare. Dataset acquisition through our mechanism is helpful for machine learning model growth at a lower complexity. We believe that our method can lay a foundation for the design of incentive mechanisms for multimedia crowdsensing with QoI constraints in machine learning systems.


Acknowledgements. The authors gratefully acknowledge the support and financial assistance provided by the National Natural Science Foundation of China under Grants No. 61502230, 61501224 and 61073197, the Natural Science Foundation of Jiangsu Province under Grant No. BK20150960, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 15KJB520015, and the Nanjing Municipal Science and Technology Plan Project under Grant No. 201608009.

References

1. Guo, B., Han, Q., Chen, H., et al.: The emergence of visual crowdsensing: challenges and opportunities. IEEE Commun. Surv. Tutor. PP(99), 1 (2017)
2. Li, Y., Jeong, Y.S., Shin, B.S., et al.: Crowdsensing multimedia data: security and privacy issues. IEEE Multimed. 24(4), 58–66 (2017)
3. https://btsd.ethz.ch/shareddata/
4. Restuccia, F., Ghosh, N., Bhattacharjee, S., et al.: Quality of information in mobile crowdsensing: survey and research challenges. ACM Trans. Sens. Netw. 13(4), 34 (2017)
5. Howard, A.G., Zhu, M., Chen, B., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
6. Hsu, W.N., Glass, J.: Extracting domain invariant features by unsupervised learning for robust automatic speech recognition. arXiv preprint arXiv:1803.02551 (2018)
7. Leroux, S., Molchanov, P., Simoens, P., et al.: IamNN: iterative and adaptive mobile neural network for efficient image classification. arXiv preprint arXiv:1804.10123 (2018)
8. Hara, K., Sun, J., Moore, R., et al.: Tohme: detecting curb ramps in Google Street View using crowdsourcing, computer vision, and machine learning. In: Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, pp. 189–204 (2014)
9. Anguelov, D., Dulong, C., Filip, D., et al.: Google Street View: capturing the world at street level. Computer 43(6), 32–38 (2010)
10. Sun, F., Huang, G.B., Wu, Q.M.J., et al.: Efficient and rapid machine learning algorithms for big data and dynamic varying systems. IEEE Trans. Syst. Man Cybern. Syst. 47(10), 2625–2626 (2017)
11. Man, H.C., Hou, F., Huang, J.: Delay-sensitive mobile crowdsensing: algorithm design and economics. IEEE Trans. Mob. Comput. PP(99), 1 (2018)
12. Xu, Y., Zhou, Y., Mao, Y., et al.: Can early joining participants contribute more? Timeliness-sensitive incentivization for crowdsensing (2017)
13. Myerson, R.B.: Optimal auction design. Math. Oper. Res. 6(1), 58–73 (1981)
14. Cheng, Y., Li, X., Li, Z., et al.: AirCloud: a cloud-based air-quality monitoring system for everyone (2014)
15. http://www.fda.gov/MedicalDevices/Safety/ReportaProblem/ucm385880.htm
16. Krontiris, I., Albers, A.: Monetary incentives in participatory sensing using multi-attributive auctions. Parallel Algorithms Appl. 27(4), 317–336 (2012)
17. Duan, L., Kubo, T., Sugiyama, K., et al.: Incentive mechanisms for smartphone collaboration in data acquisition and distributed computing. In: Proceedings of IEEE INFOCOM, pp. 1701–1709 (2012)
18. Faltings, B., Li, J.J., Jurca, R.: Incentive mechanisms for community sensing. IEEE Trans. Comput. 63(1), 115–128 (2014)
19. Yang, D., Xue, G., Fang, X., et al.: Incentive mechanisms for crowdsensing: crowdsourcing with smartphones. IEEE/ACM Trans. Netw. 24(3), 1732–1744 (2016)
20. Clarke, E.H.: Multipart pricing of public goods. Public Choice 11(1), 17–33 (1971)


21. Groves, T.: Incentives in teams. Econometrica 41(4), 617–631 (1973)
22. Feng, Z., Zhu, Y., Zhang, Q., et al.: TRAC: truthful auction for location-aware collaborative sensing in mobile crowdsourcing. In: Proceedings of IEEE INFOCOM, pp. 1231–1239 (2014)
23. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. Resonance 1(9), 14–24 (2009)

Toward Performance Prediction for Multi-BSP Programs in ML

Victor Allombert¹, Frédéric Gava²(✉), and Julien Tesson²

¹ Université d'Orléans, LIFO, Orléans, France
² Université Paris-Est Créteil, LACL, Créteil, France
[email protected]

Abstract. bsml and multi-ml are functional parallel programming languages "à la ml" based respectively on the bsp and multi-bsp bridging models. multi-bsp extends bsp to take into account hierarchical architectures. For both models, it is possible to predict the performances of algorithms thanks to embedded cost models. To do so, we propose formal operational semantics with cost annotations for the two aforementioned languages. This work has been done in an incremental manner: first we recall the cost semantics of the core-ml language, and then we adapt it to bsml and to multi-ml. It is then possible to evaluate the cost of a program following the annotated semantics. Finally, we compare the theoretical approach with the current implementation on a code example.

1 1.1

· bsp · bsml multi-bsp · Cost

Introduction Context

The bulk synchronous parallelism (bsp) bridging model [16] was designed for flat parallel architectures. A bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. But modern high performance computing (hpc) architectures are now hierarchical and have multiple layers of parallelism, communication between distant nodes cannot be as fast as among the cores of a given processor. We now consider the multi-bsp model [17], an extension of bsp. multi-ml [1,2] is a multi-bsp extension of bsml [8], a functional approach for programming bsp algorithms in ml, bsml being itself an extension of ocaml, a ml language (https://ocaml.org/). To be compliant with a bridging model eases the way of writing codes that ensures efficiency and portability from one architecture to another and also avoid deadlocks and non-determinism. The multi-bsp bridging model offers a high level of abstraction and takes into account real communications and synchronisation costs on hierarchical architectures. Thanks to the cost model embedded in c Springer Nature Switzerland AG 2018  J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 159–174, 2018. https://doi.org/10.1007/978-3-030-05057-3_12

160

V. Allombert et al.

the (multi-)bsp model, it is possible obtain the cost of a given algorithm. Using the (multi-)bsp parameters of an architecture allows to predict the execution time of a given code. That can useful for resource bound analysis and find performance bugs thus to provide development-time feedback to hpc programmers. We chose ocaml (with our own distributed extensions) as the source language “` a la ml” for several reasons. For one, ocaml is a widely used language for functional programming which is quite efficient in practice (sophisticated compiler and automatic memory management). Moreover, we wanted to demonstrate that it is possible to define a practical cost semantics for high-level hpc languages; imperative programming is closer to standard assembly codes which already have their cost analysis such as wcet [15]. Even if functional programming is currently not the norm for hpc, it is more and more common that main stream languages (such as java) add functional features. Studying these features in ml, without having to manage others features (such as java’s objects), is a classical manner to get them for other languages. Cost prediction is important for the design of efficient algorithms and is also important in domains where programs are executed with time constraints (such as in physical engines such as aeroplanes etc.). In the future, even such domains would benefit of many-cores architectures (at most). Cost prediction of hpc programs is thus an important issue to ensure the safety of such systems. 1.2

Example of the Methodology: The Sequential Case

An important first step to study cost prediction of programs is to define the cost of the construction of the language itself, that is define an operational big-step semantics that assign a parametric cost to a well-formed expression. Having a compositionnal cost semantics is also an important issue in order to get modular and incremental programming: from a software engineering point of view, it makes senses that the cost of a subprogram does not depend (too much) on the context, for example, the cost of an array sorting method should depend only on the size of the input and not when it is called. The main hypothesis is that the resource consumption of a program is a linear combination of the number of executions of each construct in the program1 . The semantics models this idea by parameterizing the cost with unknown coefficients that correspond to each ml construct: the number of executions of each of these constructs constitutes the majority of the execution time of most ml codes [10]. Taking the case of the core-ml language. It relies on a minimal set of ml constructions. This set is sufficient enough to express all the behaviour that are used in ml programming. Thus, features such as records, modules, pattern matching, sum types are excluded. The grammar is: e ::= cst Constants | let x = e in e Binding | op Operators | fun x → e Function | x Variables | rec f x → e Recursive function | (e e) Application | if etheneelsee Conditional 1

But their combination need not be linear, as for algorithms with polynomial or exponential complexities.


In this grammar, x and f range over an infinite set of identifiers. We also find the typical ml-like constructs, such as let for bindings and fun and rec for, respectively, functions and recursive functions. As expected, application is denoted (e e). For the sake of readability, we take the liberty of using the familiar infix notation for binary operators, as well as the usual precedence and associativity rules; when the context is clear, parentheses can be omitted. op stands for the standard operators, such as common computations on integers, and cst stands for constants such as integers, booleans, etc. An expression is evaluated into a value v, defined as: v ::= op | cst | (fun x → e)[E] | (rec f x → e)[E], with E ::= {x1 → v1, ..., xn → vn}. Values contain constants and closures (a closure is a value which stores both a function and its environment). An environment E is interpreted as a partial mapping with finite domain from identifiers to values; the extension of E by v in x is written E ⊕ {x → v}. An inference rule is written with a premise P above a judgement E ⊢ e ⇓ v ▷ C, meaning that under the premise P and in environment E, the expression e is evaluated to the value v at cost C. The cost (time and memory) consumed by each construct is averaged out to be a constant. Hence, the execution time of a program is $C = \sum_{c \in \mathcal{C}} n_c \times T_c$, where $\mathcal{C}$ represents the set of constructs, $n_c$ is the count of each construct during the whole program execution, and $T_c$ is the execution time of the respective construct. Estimating the overall execution time of a program (in seconds) from the semantics then consists in estimating each $T_c$ (in µs) using micro-benchmarking and substituting them into the extracted cost C. The inference rules for core-ml are defined in Fig. 1 and work as follows. The Csts and Ops rules do not generate any additional cost; indeed, we assume that they are static values which are freely accessible. Vars accesses a value bound in memory using the lookup operator (which returns the corresponding bound value). As this operator accesses a value stored in memory, its cost should be proportional to the path through the different cache memories; however, we chose to use a constant Tvar in order to simplify the rules. The Closure rule mainly models the way values are enclosed inside a function closure. It is done using the select operator which, given an environment E and a function (code), returns the minimal environment needed to evaluate such code. We assume that the cost of building such an environment is proportional to the number of free variables (F, defined by trivial induction on expressions) of e; this is an approximation which can be refined by taking more ocaml mechanisms into account. Recursive functions are built in the same way. The app, let and if rules are straightforward: we simply propagate the cost produced by each expression. Note the modification of the environment in the application rule to evaluate the code of the closure. Also, each operator gets a cost, noted c3 in the rule, and we write op v for the newly built value. The "s" on the rules

This assumption does not truly hold on most relevant platforms (e.g. because of the garbage collector and cache misses) but is sufficient for our study; we leave more subtle analyses to future work and focus here on parallelism.


Fig. 1. The cost semantics of the sequential core-ml language.

that are unused here but will be necessary for the bsp's supersteps. It is also straightforward to show that ⊕ is commutative.

1.3

Outlines

In this article we introduce the formal cost semantics of first the bsml language (Sect. 2) and then we extend it to multi-ml (Sect. 3). For both languages, we first present the model of execution, then the cost model and we give the semantics annotated with costs for core languages that describes the syntax of the aforementioned languages. Finally, we compare the predicted execution times with the actual one on a small example (Sect. 4).

2 BSP Programming in ML and Cost Semantics

2.1 The BSP Bridging Model

In the bsp model [16], a computer is a set of p uniform pairs of processor-memory with a communication network. A bsp program is executed as a sequence of supersteps (Fig. 2), each one divided into three successive disjoint phases: (1) each processor only uses its local data to perform sequential computations and to request data transfers to other nodes; (2) the network delivers the requested

Fig. 2. A bsp superstep.


data; (3) a global synchronisation barrier occurs, making the transferred data available for the next superstep. A bsp architecture can be easily mapped onto any general-purpose parallel architecture, and thanks to the bsp cost model it is possible to accurately estimate the execution time of a bsp program from the bsp parameters. The performance of a bsp computer is characterised by four parameters: the local processing speed r; the number of processors p; the time L required for a barrier; and the time g for collectively delivering a 1-relation. g and L can be expressed in floating-point operations (flops) and r in flops per second. To accurately estimate the execution time of a bsp program, these 4 parameters can be easily benchmarked [3]. A 1-relation is a collective exchange where every processor receives/sends at most one word; the network can deliver an h-relation in time g × h. The execution time (cost) of a superstep s is the sum of the maximal local processing time, the data delivery time and the global synchronisation time. It is expressed by the following formula: $\mathrm{Cost}(s) = \max_{0 \le i < p} w_i^s + h_s \times g + L$, where $w_i^s$ is the local computation time of processor $i$ and $h_s$ the maximum number of words exchanged during superstep $s$.


The BSML Language

bsml [7] uses a small set of primitives and is currently implemented as a library (http://traclifo.univ-orleans.fr/bsml/) for the ml programming language ocaml. An important feature of bsml is its confluent semantics: whatever the order of execution of the processors is, the final value will be the same. Confluence is convenient for debugging since it allows to get an interactive loop (toplevel). That also simplifies programming since the parallelisation can be done incrementally from an ocaml program. A bsml program is built as a ml one but using a specific data structure called parallel vector. Its ml type is ’a par. A vector expresses that each of the p processors embeds a value of any type ’a. Figure 3 resumes the bsml primitives. Informally, they work as follows: let  e be the vector holding e everywhere (on each processor), the   indicates that we enter into the scope of a vector. Within a vector, the syntax $x$ can be used to read the vector x and get the local value it contains. The ids can be accessed with the predefined vector pid. When a value is referenced within the scope of a parallel vector, its locality is l (local); otherwise, the locality is b (bsp).

Fig. 3. Summary of the bsml primitives.

164

V. Allombert et al.

The proj primitive is the only way to extract local values from a vector. Given a vector, it returns a function such that applied to the pid of a processor, returns the value of the vector at this processor. proj performs communication to make local results available globally and ends the current superstep. The put primitive is another communication primitive. It allows any local value to be transferred to any other processor. It is also synchronous, and ends the current superstep. The parameter of put is a vector that, at each processor, holds a function returning the data to be sent to processor j when applied to j. The result of put is another vector of functions: at a processor j the function, when applied to i, yields the value received from processor i by processor j. 2.3

Cost Semantics

Extension. To obtain core-bsml, we extends the expressions of core-ml with parallel primitives as follow: e ::= · · · | replicate (fun − → e) | (proj e) | (put e) | (apply e e). The distinction made between the syntactic sugar (the   and $ notations), used when programming bsml algorithms, and the core parallel primitives (replicate and apply), available in the semantics only, simplifies the semantics. Indeed, the syntactic sugar eases the way of programming but it is not suitable for the semantics as it introduces implicit assumptions. Thus, we must transform and abstract the syntactic sugar using the core parallel primitives. The transformation applied to switch from the syntactic sugar to the core parallel primitives is straightforward and produce and equivalent expression. The parallel vector scope, denoted  e , is transformed using the replicate core primitive. Thus,  e is simply transformed into replicate(fun − → e). The $ syntax is transformed using the apply primitive. The transformation is simple and does not require a complicated expression analysis. To do so, we build a vector of functions that takes, as argument, the dollar’s annotated value. Using the apply primitive, we can apply this vector of functions on the vector of values. For example, the expression  (e $x$)  is transformed into apply(replicate(fun − x → ex))x. Values are also extended with parallel vectors: v ::= · · · | < v1 , . . . , vp . In the following, to simplify the notations, we indices processors from 1 to p (and not from 0 to p − 1 as common in hpc). We make also the hypothesis that there exists a special vector named pid=< 1, · · · , p > (the ids of the processors). The main modification is about the costs. During a superstep, the asynchronous costs are counting independently and it is only during the barrier that the maximal of the costs (computation and communication) are to be taken into account. But a same superstep can be in two different parts of an expression (for example let x= 1+1 in ((proj  $x$+1 ) 2) where the begin of the first superstep is in the first part of the let , the next just before the call of the proj and the second superstep when apply the result of the proj on the constant 2). For this reason, we extends the costs with vector of costs < c1 , . . . , cp >s where each component i describe the current local cost ci of processor i during the superstep s. This s is modify only by the rules of synchronous primitives. Nevertheless, we add the three following equivalences:

Toward Performance Prediction for Multi-BSP Programs in ML

165

1. < c1 , . . . , cp >s ⊕ < c1 , . . . , cp >s ≡< c1 ⊕ c1 , . . . , cp ⊕ cp >s , if ci and ci does not contains vectors 2. < Top ⊕ c1 , . . . , Top ⊕ cp >s ≡ Top ⊕ < c1 , . . . , cp >s , whatever Top 3. 0 ≡< 0, . . . , 0 >s , whatever s These rules aims to keep using the previous rule of the sequential constructions of the languages (let, fun, etc.). Lemma 1. The costs with parallel vector of costs form a commutative and associative group id where 0 is the neutral element inside or outside cost vectors and where < 0, . . . , 0 >s is the neutral element outside vectors only. Adding Rules. We must now extend our inference rules in order to take into account the bsp primitives. These rules are given in Fig. 4. They work as follow.

Fig. 4. The cost semantics of the core-bsml language.

The Rpl rule is for building asynchronously a new parallel vector. The expression e is evaluated for each component, in parallel, making a new vector of cost for the current superstep s. The valid function is used to forbid nested vectors and is fully defined in [1]. A type system has been designed to not be forced to do this check dynamically. Then a construct is linearly add. The Apply rule works similarly but for two expressions which thus add two different costs (not necessary vectors and for possibly different supersteps) and we finally built the vector by computing its components in parallel (on each processor) making the linear add of a new costs vector. The Proj rule adds a barrier (L) and thus finishes the superstep (updating s). From the exchanged computing values, a h-relation is added: g and L are thus special constructs. The put cost is quite dense because of the number of communications between all the processors which are done during the evaluation of the primitive. But the rule is close the proj one. For sake of conciseness, we do not show it. The way the data sizes are computed by simple induction on the values (Hrelation): it is rather naive but sufficient to an upper born.  To get the overall execution time E s e ⇓ v s c then it is max(c) ⊕ L where the function max first apply the three previous equivalences in order to aggregate (merge) the cost vectors of the same superstep until not merging is

166

V. Allombert et al.

possible. Finally, when the cost (time and memory) consumed by each construct is statically known in µs then max(< c1 , . . . , cp >s ) = ci if ∀j = i, cj ≤ ci . Lemma 2. max is idempotent that is ∀c max(max(c)) = max(c). For example, let x= in ((proj ) 2) beginning with whatever environment E at any superstep s, for a two processors bsp machine, the cost semantics indicates that the adding cost of such expression is: < T+ , T+ >s ⊕Trpl ⊕ Tapp ⊕ < Tvar ⊕ T+ , Tvar ⊕ T+ >s ⊕1 × g ⊕ L ⊕ Tapp (2 vectors constructions both with an addition; a synchronous primitive; and a final application). That is to say, in any context, the expression adds T+ during the asynchronous phase of the current superstep s, finishes it and begins a new superstep. On it own, the cost of such an expression can be simplify into 2×T+ ⊕ g ⊕ L.

3 3.1

Multi-BSP Programming in ML and Costs Semantics The Multi-BSP Bridging Model

multi-bsp is a bridging model [17] which is adapted to hierarchical architectures, mainly clusters of multi-cores. It is an extension of the bsp bridging model. The structure and abstraction brought by multi-bsp allows to have portable programs with scalable performance predictions, without dealing with low-level details of the architectures. This model brings a tree-based view of nested components (sub-machines) of hierarchical architectures where the lowest stages (leaves) are processors and every other stage (nodes) contains memory. Every component can execute code but they have to synchronise in favour of data exchange. Thus, multi-bsp does not allow subgroup synchronisation of any group of processors: at a stage i there is only a synchronisation of the subcomponents, a synchronisation of each of the computational units that manage the stage i − 1. So, a node executes some code on its nested components (aka “children”), then waits for results, does the communication and synchronises the sub-machine. A multi-bsp algorithm is thus composed by several supersteps, each step is synchronised for each sub-machine. An instance of multi-bsp is defined by d, the fixed depth of the (balanced and homogeneous) tree architecture, and by 4 parameters for each stage i of the tree: (pi , gi , Li , mi ): pi is the number of sub-components inside the i − 1 stage; gi is the bandwidth between stages i and i − 1: the ratio of the number of operations to the number of words that can be transmitted in a second; Li is the synchronisation cost of all sub-components of a component of i − 1 stage; mi is the amount of memory available at stage i for each component of this stage. Thanks to those parameters, the cost of a multi-bsp algorithm can be computed as the sum of the costs of the supersteps of the root node, where the cost of each of these supersteps is the maximal cost of the supersteps of the sub-components (plus communication and synchronisation); And so on. Let Cji be the communication cost of a superstep j at stage i: Cji = hj ×gi+Li where hj the maximum size of the exchanged messages at superstep j, gi the

Toward Performance Prediction for Multi-BSP Programs in ML

167

communication bandwidth with stage i and Li the synchronisation cost. We can express the cost T of a multi-bsp algorithms as following: T =

d−1 i −1  N

(

i=0

wji + Cji )

j=0

where d is the depth of the architecture, Ni is the number of supersteps at stage i, wji is the maximum computational cost of the superstep j within stage i. It is to notice that the bsp and multi-bsp cost models both are a linear combination of costs for the asynchronous computations and costs of communications (separated by barriers). 3.2

The Multi-ML Language

multi-ml [1,2] (https://git.lacl.fr/vallombert/Multi-ML) is based on the idea of executing bsml-like codes on every stage of a multi-bsp architecture. This approach facilitates incremental development from bsml codes to multi-ml ones. multi-ml follows the multi-bsp approach where the hierarchical architecture is composed by nodes and leaves. On nodes, it is possible to build parallel vectors, as in bsml. This parallel data structure aims to manage values that are stored on the sub-nodes: at stage i, the code let v= e evaluates the expression e on each i − 1 stages. Inside a vector, we note #x# to copy the value x stored at stage i to the memory i − 1. The (mkpar f) primitive is an alternative way to build a vector using a function f . Typed ( int → α) → α par, it aims to execute the given function to each processor identifiers (from 0 to pi − 1) of a node locally on it; and then, distribute the results down to its sub-nodes. The main difference with the  e notation is that (mkpar f) aims to reduce costs when the communication costs of e is high and the execution cost of f and its result is low. As in bsml, we also found the proj , put primitives and the syntax $x$, all of them with the same semantics. We also introduce the concept of multi-function to recursively go through a multi-bsp architecture. A multi-function is a particular recursive function, defined by the keyword let multi, which is composed by two codes: the node and the leaf codes. The recursion is initiated by calling the multi-function (recursively) inside the scope of a parallel vector, that is to say, on the sub-nodes. The evaluation of a multi-function starts (and ends) on the root node. The following code shows how a multi-function is defined. After the definition of the multi-function mf on line 1 where [args] symbolises a set of arguments, we define the node code (from line 2 to 6). The recursive call of the multi-function is done on line 5, within the scope of a parallel vector. The node code ends with a value v, which is available as a result of the recursive call from the upper node. The leaf code, from lines 7 to 9 consists of sequential computations.

168

V. Allombert et al.

We also propose another parallel data structure called tree. A tree is a distributed structure where a value is stored in every nodes and leaves memories. A tree can be built using a multi-tree-function, with the let multi tree keyword and can be handled by several primitives of the language. We do not detail this construction here. Similarly to bsml and its b and l localities, in multi-ml we introduce m when a value refers to the multi-bsp locality and s on leaves (sequential). 3.3

Cost Semantics

Extension. To obtain core-multi-ml, we extends core-bsml with multifunctions as follow: e ::= · · · | (down x) | multi f x → e † e. The multi-function definition is written with the keyword multi. It takes one arguments and two expressions separated by the † symbol; the first argument stands for the node code and the second is for leaf code. The down primitive aims to transfer a value to all the sub-nodes. The transformation from the # syntax into the down primitive is obvious and work as other syntactic sugars of bsml. For example, the expression > is transformed into apply(replicate(fun − x → ex))(downx). As the # annotated value is given as argument of the vector of functions, there are no redundant copies. The expression > is transformed into a code that copy x to the sub-nodes, only once. Parallel vectors of values (and costs) now also depend of their deep level n in the multi-bsp architecture. Closures of multi-functions are also added. Thus we have v ::= · · · | < v1 , . . . , vpn > | (multi f x → e † e)[E]. Adding Rules. We must now extend our inference rules in order to take into account the multi-functions and the nested bsml codes. These rules are given in Fig. 5. They work as follow.

Fig. 5. The cost semantics of the core-multi-ml language.

These new rules need some updates of the previous rules. First, the ⇓ is parameterized by the different levels of execution of multi-ml and the stage n (beginning from 1). bsml rules has to be trivially updated with this stage in order to build the right size vectors.

Toward Performance Prediction for Multi-BSP Programs in ML

169

As a node is a particular component where it is possible to express bsp parallelism, we must consider the synchronous costs generated by bsp computations. Those rules, at a stage n, are used to recurse trough the multi-bsp architecture using the multi-function. Therefore, the max function now first merge the vectors of the same (sub)superstep and finally we use this following equivalence (for each superstep s): max(n1 × T1 ⊕ · · · ⊕ nt × Tm ⊕ < c1 , . . . , cpn >s ) ≡ max(n1T1 ⊕ · · · ⊕ nt ×Tt , maxi=1..pn (ci )) that is we take the maximum between the computation of the node parent with the max of its own children. The MultiCall rule is for calling the multi-function at the level m. The counter of superstep is initiated to 0 as the stage to 1. The code of the node begins (level b). This rule terminates with a whole and synchronous broadcasting of the final value v where g = g1 + g2 ... + gd (as well for L); This is due to the model of execution of multi-ml where the code outside multi-function is run by all the processors in order to manage the whole execution and thus the value must be known by all the processors. The maximum function allow to get the right cost of all child. The rule is possible only if v is valid (as in bsml). Our type system forbids expressions that have not this property [1] and we can assume that all the evaluated expressions are correct. The MultiLeaf goes to leaf level. The number of supersteps still the same when going throw the leaf level (only sequential codes are allow). The MultiNode is for going throw the hierarchical architecture (inside a vector) from one node to another one (the child). Thus the stage is incremented. A final synchronisation is used to finally wait all the child before terminating the node code (the recursive call of the multi-function). This allow to take the maximum of computation of the sub supersteps as wanted in the multi-bsp cost model. In multi-ml, the building of a vector is an asynchronous operation with a emission of a signal of creation from the node processor to the subnodes (or leaves). It is thus no longer possible using the second equivalence of the ⊕ which only becomes commutative between two Ln (barrier) at a stage n. It is to notice that the Lookup function need also to check the variable at the right memory. Indeed, a variable define in at the stage n is no available on another stages. To do this, one must adding indices in the environment E. More details are available in [1]. Here, only the MultiNode and MultiLeaf rules can be evaluated. The costs of the multi-function recursive call taking place on both the node and the leaf is simple. We just add the evaluation cost of e1 and e2 , plus the multifunction call cost, resulting in the recursive call. The MultiNode rule adds the Ci costs which result from the potential asynchronous computations done on the node. Thus, we collect all the costs engendered by multi-function recursion. As expected, this mechanism is not necessary on the MultiLeaf rule, as there is no parallel computation at this level.

4 Experiments

Thanks to the cost model embedded in the multi-bsp model, it is possible to estimate the evaluation cost of a multi-ml program. According to the multi-bsp parameters standing for a machine specification, it is then possible to predict the execution time of a program. To verify that the cost estimation retrieved from the multi-bsp cost formulae is valid, we compare the computation time of a simple algorithm to its predicted computation cost. To do so, we analyse a matrix vector product algorithm based on the map/reduce skeleton. Using the multi-bsp parameters of the targeted architecture, we are able to predict the computation time for various inputs. Our example has been written in a functional style using tail-recursive functions, but thanks to the ocaml compiler these functions are transformed into an efficient imperative version.

4.1 Algorithm Description

We consider a simple algorithm to compute the product of a matrix and a vector. Given a matrix M of dimension n × m, where n stands for the number of lines and m for the number of columns, and a vector V of dimension n (number of lines), the computation is the following: M × V = x, such that x = (x0 , ..., xn ), where x is composed of m lines and xi = Σ_{j=0}^{n} Mij × Vj . Now, to propose a parallel version of this matrix vector product, we choose to use the map/reduce skeleton [6]. Using map/reduce algorithms is an easy way to propose parallel algorithms using simple associative and commutative operators. A map/reduce algorithm works as follows: (1) the data are distributed among the processing units; (2) the map operator is applied on each piece of data; (3) the reduce operator is used to combine the results; (4) the final result is thus obtained. To implement the matrix vector multiplication we define: a map operator which computes the product of a matrix and a vector; and a reduce operator which takes i sub-matrices of size (n, m) and assembles them into a (i × n, m) matrix. The bsp cost of the bsp algorithm is: Q(i) × Tmap ⊕ Q(i) × g ⊕ Q(i) × Tred ⊕ L, where Q(i) stands for the total amount of data stored at processor i. The multi-bsp cost of the multi-bsp algorithm is: S(0) × Tmap ⊕ ⊕_{i=1}^{d} ((S(i − 1) × g_{i−1} ⊕ L_{i−1}) ⊕ S(i) × Tred), where Tmap (resp. Tred) is the time of the mapping (resp. reducing) and S(i) stands for the total amount of data stored at level i; for example, we have N × M/2/2 elements on each leaf of a dual-core with two threads per core. We assume the following sizes (quantity of memory) of values, such as SizeOf(float) = 64 Bytes and SizeOf(float array) = n × SizeOf(float) if the array contains n elements. We omit small overheads and alternative costs relative to each level for the sake of simplification. Furthermore, the cost of serialisation of the data is taken into account in the g parameter.
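As an illustration, the following OCaml sketch (our own helper, under the assumption that ⊕ over these purely sequential and synchronised terms amounts to summing the corresponding times) evaluates the multi-bsp cost formula above from arrays of per-level parameters:

(* Hypothetical cost evaluator for the Multi-BSP formula above.
   s.(i): amount of data stored at level i; g.(i), l.(i): Multi-BSP
   communication and synchronisation parameters of level i; the depth d
   is derived from the length of s. *)
let multi_bsp_cost ~s ~g ~l ~t_map ~t_red =
  let d = Array.length s - 1 in
  let cost = ref (s.(0) *. t_map) in
  for i = 1 to d do
    cost := !cost +. (s.(i - 1) *. g.(i - 1)) +. l.(i - 1) +. (s.(i) *. t_red)
  done;
  !cost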

4.2 Algorithms Implementation

The bsml codes for mapping/reducing and their descriptions are available in [7,8]. In the context of multi-bsp functional programming, we must now write the map/reduce matrix vector product algorithm using the multi-ml language. As the multi-ml language uses a tree-based hierarchical way of executing code, the map/reduce algorithms are almost embedded in the syntax of the language. Indeed, the map phase consists in mapping a function toward the leaves of the multi-bsp architecture, while the reduce phase is basically the combination of the results toward the root node. In the map/reduce implementation, we assume that the values were previously distributed such that each leaf already contains its sub-matrices and the nodes are empty. Thus, the distribution is handled by a tree data structure of matrices. As in our implementation a matrix is represented by a one-dimensional array, the input data is typed α array tree. The map multi-function is written in Fig. 6 (left). As expected, we recursively call the multi-function map toward the leaves. When reached, the leaves apply the map operator f on their data stored in tda (the tree distributed array of sub-matrices). Then, we build a tree which contains the results on the leaves.

Fig. 6. Codes of the multi-ml mapping (left) and reducing (right).

After reaching the leaves through the recursive calls, the reduce multi-function simply retrieves the sub-results of its sub-nodes from rc. It transforms the parallel data structure into a local array using to_array and applies the reduce operator to each sub-matrix. Finally, the resulting matrix is used to propagate the result to the root node (Fig. 6, right).
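Since Fig. 6 itself is not reproduced here, the following sequential OCaml sketch (our own types and names, not the actual multi-ml code of Fig. 6) illustrates the shape of the two phases on the tree of distributed sub-matrices:

(* A tree of distributed values: only the leaves carry data, mirroring the
   assumed initial distribution where only leaves hold sub-matrices. *)
type 'a tree = Leaf of 'a | Node of 'a tree list

(* Map phase: apply the map operator f toward the leaves, keeping the shape. *)
let rec map_tree f = function
  | Leaf a -> Leaf (f a)
  | Node children -> Node (List.map (map_tree f) children)

(* Reduce phase: combine the sub-results of each node with the reduce
   operator, propagating the combined value toward the root. *)
let rec reduce_tree reduce = function
  | Leaf a -> a
  | Node children ->
    (match List.map (reduce_tree reduce) children with
     | [] -> invalid_arg "reduce_tree: empty node"
     | r :: rs -> List.fold_left reduce r rs)

For the matrix vector product, f would be the local product of a sub-matrix with the vector, and reduce the assembly of two partial result matrices.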

4.3 Performance Predictions

Benchmarks were performed on the following architecture: mirev2, 8 nodes, each with 2 quad-cores (amd 2376 at 2.3 GHz), 16 GB of memory per node and a 1 Gbit/s network. Based on the computation and communication costs of each phase, it is possible to compute the cost of the proposed algorithm. To do so, we use the multi-bsp parameters, which can be estimated using the probe method [3]. We use the following parameters: g0 = ∞, g1 = 6, g2 = 3 and g0 = 1100, g1 = 1800, g2 = 0 and L0 = 149000, L1 = 1100, L2 = 1800, L3 = 0. For bsp we get g = 1500 and L = 21000. Thanks to a micro-benchmarking library [13] for ocaml, we have estimated the execution time of the main operators which are used in the map operator: multiplication, reading a value from an array, etc. The timings for each operator are available in Table 1.

Table 1. Operator timings in μs.

TDef      = 2.921    TVar       = 0.619    TFunApp    = 1.505
TLet      = 1.312    TGet       = 1.324    TSet       = 1.778
TClo      = 0.167    TFloatAdd  = 0.881    TFloatMult = 1.317
TBoolAnd  = 0.184    TIntEq     = 0.284

In Table 1, Tmult, Tadd, Tset and Tget stand respectively for multiplication, addition, assignment and reading in an array. We have neglected the time to build the closures (and apply them) for both the multi-functions and the recursive functions, since most of the computation comes from mapping and reducing. Thus, we have Tmap = 3×TGet ⊕ TSet ⊕ 2×TFloatMult ⊕ 3×TFloatAdd ⊕ 2×TBoolAnd ⊕ 2×TIntEq ⊕ 10×TVar and Tred = TGet ⊕ TSet ⊕ 5×TVar ⊕ TIntAdd ⊕ TIntEq. As the costs of such atomic operations are prone to significant variations because of compilation optimisations, loop structures and cache mechanisms, we assume that these costs are “a good approximation” of the average computation time needed by these operations. A more precise approach can be found in [10]. The performance prediction compared to the execution time of the matrix vector multiplication can be found in Fig. 7. We perform the tests for both bsml and multi-ml. We do not use all the cores, since our current multi-ml implementation needs specific processes to handle nodes (which is not the case for bsml), and we want to be fair in the cost analysis. Note that this is a rather small example and bsml is sometimes more efficient than multi-ml. A comparison between the two languages on bigger examples is available in [1]. The tests have been done for 2 nodes (left) and then for 8 nodes (right). We can observe that the performance prediction is coherent with the execution time of the algorithm (and its polynomial complexity). The curve slopes are similar, even if not very accurate. This is mainly due to the fact that our sequential cost estimation is not fine enough. For example, because this is a toy example, we do not use the cache possibilities of the multi-bsp model, and thus multi-ml suffers from some cache misses that are not currently predicted. The garbage collector of ocaml can also disturb the prediction.
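Reading ⊕ on these purely sequential costs as a plain sum of times (an assumption on our part), the estimates can be reproduced directly from the Table 1 timings, for example:

(* Operator timings from Table 1, in microseconds. *)
let t_get = 1.324 and t_set = 1.778 and t_var = 0.619
let t_float_mult = 1.317 and t_float_add = 0.881
let t_bool_and = 0.184 and t_int_eq = 0.284

(* T_map as given above. *)
let t_map =
  3. *. t_get +. t_set +. 2. *. t_float_mult +. 3. *. t_float_add
  +. 2. *. t_bool_and +. 2. *. t_int_eq +. 10. *. t_var

(* T_red as given above; T_IntAdd is not listed in Table 1, so it is kept
   as a parameter here. *)
let t_red ~t_int_add =
  t_get +. t_set +. 5. *. t_var +. t_int_add +. t_int_eq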

5 Related Work

Close to bsp, the logp [5] models are, most of the time, used to study network capabilities and low-level libraries such as mpi. Extensions of bsp, such as [14], were proposed to allow sub-synchronisations. Hierarchical approaches were also proposed in [4]. Parallel algorithmic skeletons are often used to propose a cost prediction based on a structured approach, as in [9]. In [12], the shape analysis techniques developed in the fish programming language are used to propose a language with an accurate, portable cost model. Resource Aware ml (raml) [10] allows one to automatically and statically compute resource-use bounds for ocaml programs. A version for parallel (multithreading) and sequential composition was proposed.


Those models seem not adapted to our approach, as they do not provide both simplicity and accuracy for hierarchical architectures with a structured execution scheme.

Fig. 7. Performance prediction compared to execution time for bsml and multi-ml, for 2 nodes (left) and 8 nodes (right).

6 Conclusion

Overview of the Work. In this article we propose a formal semantics with cost annotations allowing cost prediction of multi-bsp algorithms. We propose a set of rules adapted to a (core) sequential and purely functional version of ml. Then, we extend this semantics to allow bsp, and then multi-bsp, codes. Thanks to this incremental approach, we obtain a restrained set of rules allowing cost prediction of multi-bsp algorithms. To expose the usability of the cost model embedded in the semantics, we compare the performance prediction and actual benchmarks on several parallel architectures. As our approach is simplified, considers abstract bsp and multi-bsp parameters, and is based on the estimated execution time of atomic operations, it may suffer from accuracy issues. We show that our cost estimation is close to the execution time of a simple map/reduce algorithm applied to a matrix-vector multiplication.

Future Work. An interesting use of this cost semantics is to propose an analysis able to statically infer the cost of a given algorithm. Such an approach is available for programming imperative bsp algorithms [11] and could be extended to functional multi-bsp programming using an approach similar to the one proposed in [10]: it would be possible to give the cost of a program at compile time.


References
1. Allombert, V.: Functional Abstraction for Programming Multi-Level Architectures: Formalisation and Implementation. Ph.D. thesis, UPEC (2017)
2. Allombert, V., Gava, F., Tesson, J.: Multi-ML: programming multi-BSP algorithms in ML. J. Parallel Prog. 45(2), 20 (2017)
3. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004)
4. Cha, H., Lee, D.: H-BSP: a hierarchical BSP computation model. J. Supercomput. 18(2), 179–200 (2001)
5. Culler, D., et al.: LogP: towards a realistic model of parallel computation. In: Principles and Practice of Parallel Programming, pp. 1–12. ACM (1993)
6. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
7. Gava, F.: BSP functional programming: examples of a cost based methodology. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008. LNCS, vol. 5101, pp. 375–385. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69384-0_43
8. Gesbert, L., Gava, F., Loulergue, F., Dabrowski, F.: Bulk synchronous parallel ML with exceptions. Future Gener. Comput. Syst. 26(3), 486–490 (2010)
9. Hayashi, Y., Cole, M.: Static performance prediction of skeletal parallel programs. Parallel Algorithms Appl. 17(1), 59–84 (2002)
10. Hoffmann, J., Das, A., Weng, S.C.: Towards automatic resource bound analysis for OCaml. In: Principles of Programming Languages, POPL 2017. ACM (2017)
11. Jakobsson, A.: Automatic cost analysis for imperative BSP programs. Int. J. Parallel Prog. (2018)
12. Jay, C.: Costing parallel programs as a function of shapes. Sci. Comput. Prog. 37(1), 207–224 (2000)
13. Roshan, J., et al.: Core bench: micro-benchmarking library for OCaml (2014)
14. de la Torre, P., Kruskal, C.P.: Submachine locality in the bulk synchronous setting. In: Bougé, L., Fraigniaud, P., Mignotte, A., Robert, Y. (eds.) Euro-Par 1996. LNCS, vol. 1124, pp. 352–358. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0024723
15. Abella, J., et al.: WCET analysis methods: pitfalls and challenges on their trustworthiness. In: IEEE Symposium on Industrial Embedded Systems, pp. 39–48 (2015)
16. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
17. Valiant, L.G.: A bridging model for multi-core computing. J. Comput. Syst. Sci. 77(1), 154–166 (2011)

Exploiting the Table of Energy and Power Leverages

Issam Raïs1, Laurent Lefèvre1(B), Anne-Cécile Orgerie3, and Anne Benoit1,2

1 Laboratoire LIP, École Normale Supérieure de Lyon & Inria, Lyon, France
{issam.rais,laurent.lefevre,anne.benoit}@inria.fr
2 Georgia Institute of Technology, Atlanta, GA, USA
3 Univ. Rennes, Inria, CNRS, IRISA, Rennes, France
[email protected]

Abstract. Large scale distributed systems and supercomputers consume huge amounts of energy. To address this issue, a large set of hardware and software capabilities and techniques (leverages) exist to modify power and energy consumption in large scale systems. Discovering, benchmarking and efficiently exploiting such leverages remains a real challenge for most of the users. In this paper, we define leverages and the table of leverages, and we propose algorithms and predicates that ease the reading of the table of leverages and extract knowledge from it.

1 Introduction

Data centers worldwide consumed around 194 terawatt hours (TWh) of electricity in 2014, or about 1% of total demand [2]. This worrying consumption has direct financial and environmental consequences for data center managers, like Cloud providers and supercomputer operators. Several techniques have been developed in order to lower the electrical consumption of data centers. These techniques, which we call leverages, can improve the energy efficiency of data centers at different levels: hardware, middleware, and application. Hardware leverages include Dynamic Voltage and Frequency Scaling (DVFS) [11] and shutdown techniques [10]. At the middleware level, energy-efficient resource allocation policies for job managers are examples of leverages [7]. Finally, leverages at the application level include green programming [1]. While many of these leverages have been independently studied in the literature, few works consider the utilization of several leverages at the same time, and no more than two leverages. Yet, the utilization of a given leverage can impact both the utilization and the efficiency of another leverage. The variety of leverages adds to the data center's complexity, in terms of size and hardware heterogeneity, and makes energy efficiency hard to reach for users who have access to multiple leverages. In this work, we aim at extending the current state of the art, which studies the influence of at most one or two leverages at the same time, thus ignoring the impacts incurred by the utilization of more leverages. Thus, we propose a generic definition, combination and knowledge extraction of multiple leverages in order to fully explore their combined impacts. We propose a first approach toward a completely automated process to characterize the leverages available on a data center node. The key idea of our contribution consists in providing hints to users about the most suitable solution for their application, from a defined score table with a value for each leverage combination and each studied metric. From these tables, knowledge can be derived about leverage combinations and the effects they incur on each other. The table of leverages is thus a tool to help a user, a developer or an administrator to choose which leverage or leverage combination best suits their objectives (here with a focus on energy or power metrics); the contribution of this paper consists in the algorithms proposed to extract knowledge about the interaction of leverages and their influence on a given metric. The remainder of this paper is structured as follows. Section 2 formalizes the concept of leverages, and illustrates this formalism on the leverages under consideration in this paper. Section 3 defines and explains how to build the table of leverages. Section 4 presents the experimental setup and a first full example of a table of leverages. Section 5 then shows how to exploit the raw data of the table of leverages and extract useful knowledge. Finally, Sect. 6 concludes this work and gives perspectives.¹

2 Leverage Definition

In this section, we first propose a formalization of a leverage. Second, we apply this formalism to the leverages that we selected for this paper.

Definition 1. A leverage L is composed of S = {s0 , s1 , . . . , sn }, the set of available valid states of L, and sc , the current state of L.

Thus, an energy or power leverage is a leverage that has a high impact on the energy or power consumption of a device through its various states or through the modification of its current state. Switching from one state to another can have a cost in terms of time and energy. Yet, in the current work, we focus on studying the impacts of leverage combinations over a single intensive application phase [4], and thus we do not study the switching costs between states. In this paper, we consider multiple leverages available on current hardware, namely multi-threading, computation precision and vectorization. These leverages belong to different categories: application level for the computation precision and vectorization techniques, and middleware level for multi-threading. These leverages are described hereafter.
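As an illustration of Definition 1, here is a minimal OCaml sketch (our own encoding, not from the paper) of a leverage and of the three leverages studied here, keeping only the extreme thread counts that are actually explored:

(* Definition 1: a leverage is a set of valid states plus a current state. *)
type 'a leverage = { states : 'a list; current : 'a }

(* States of the three studied leverages. *)
type precision = Int | Float | Double
type vectorization = NoVect | SSE3 | AVX2

let nb_threads : int leverage = { states = [1; 32]; current = 1 }
let prec : precision leverage = { states = [Int; Float; Double]; current = Double }
let vect : vectorization leverage = { states = [NoVect; SSE3; AVX2]; current = NoVect }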

¹ This work is supported by the ELCI project, a French FSN project that associates academic and industrial partners to design and provide a software environment for very high performance computing. Experiments were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities (https://www.grid5000.fr).


Multi-thread Leverage. The first studied leverage is a middleware-level leverage that permits the usage of multiple cores during computation. OpenMP [5], a well-known application programming interface abstraction for multi-threading, can be used to exploit this intra-node parallelism of multi-cores. It consists of a set of directives that modify the behavior of the executed code, where a master thread forks a specific number of slave threads that run concurrently. This multi-thread leverage increases the CPU utilization of the node. Consequently, because of the non-power proportionality of current hardware architectures [10], this leverage can improve the energy efficiency of the node. In the rest of the paper, the multi-thread leverage is denoted by nbThreads, with the set of states {1, . . . , nmax }, where 1 means that one OpenMP thread is used, and nmax corresponds to the maximum number of threads that can be launched simultaneously on the node. In this work, only the extreme states, 1 and nmax, are explored.

Computation Precision Leverage. The second leverage belongs to the application level and exploits the various computation precision options available on actual hardware (i.e., int, float, double). Such a leverage alters the precision of the results computed by the application, but lower precision translates into shorter data representations and so, less computation and less energy consumption. At the application level, the user can specify a desired Quality-of-Service that can be expressed as accessible computation precision states. This precision leverage is denoted by Precision, and the set of states is {int, float, double}, corresponding to the data format used by the application. For each of these states, a different code version is provided.

Vectorization Leverage. Finally, the last studied leverage concerns the application level. Current CPUs allow the usage of vectorization capabilities to exploit intra-core parallelism. On Intel architectures, it started with the MMX instructions in Pentium P5 architectures in 1997 [9]. It was then extended to SSE [6]. SSE was then extended to SSE2, SSE3, SSSE3 and finally SSE4. AVX [8] then introduced new instructions, followed by AVX2 and finally AVX512, available in the XeonPhi architecture. In this paper, we focus on SSE3 and AVX2, which are representative of the SSE and AVX families. These instruction sets permit single instruction on multiple data (SIMD) operations at the application level. This vectorization leverage is denoted by Vectorization. The set of states is {none, SSE3, AVX2}, where none means that no vectorization is used. For each of these states, a different code version is provided using the specific intrinsics and adequate compilation flags for each version.

The proposed leverage formalism described above is used in the rest of the paper to easily describe the state of each considered leverage and the possible combinations of leverages. The three leverages studied here are chosen to be representative examples of available leverages on modern architectures that are frequently used in HPC applications. The methodology proposed in this paper is designed to be applied to any number and any type of leverages.

3 The Table of Leverages

We describe the table of leverages, which relies on metrics and benchmarks to characterize the performance and energy impact of each leverage combination on a given node. For each metric and each benchmark, a score is attributed to a given leverage combination. The table is then used to extract knowledge about each leverage and to evaluate the impacts of leverage combinations, in order to help the users to utilize their computing infrastructure in a more energy-efficient way.

Metrics. Leverages may influence the quality of service or performance of an application. For instance, shutdown techniques may induce latency in waking up the required nodes. Consequently, for these leverages, users need to determine their acceptable trade-off between energy-related metrics and performance metrics. The table of leverages relies on three different metrics that represent both energy and performance constraints. These metrics are measured for a given period of time corresponding to the time spent during benchmark execution. The first two metrics are energy- and power-related. To define them, we introduce the following notations: T = {t0 , . . . , tN } is the set of time stamps of the energy consumption measurements of a given run; t0 and tN represent the starting and ending timestamps, respectively (measurements being one second apart); pj , j ∈ [0, N ], represents the power consumption (in Watts) of the considered node at timestamp tj .

Metric 1: The average power consumption of an executable is denoted avrgWatt, and it is defined as avrgWatt = Σ_{j∈[0,N]} pj / (N + 1).

Metric 2: The energy consumption of an executable is denoted Joules. It represents the energy consumption of the complete node between t0 and tN . It is defined as Joules = Σ_{j∈[0,N−1]} (tj+1 − tj ) × pj .

Metric 3: The last metric concerns the performance of the run, and is expressed as the execution time, denoted Time. It includes the whole execution time of an executable, including initialization.

Benchmarks. A benchmark corresponds to a self-contained application that is representative of typical applications or portions of applications. The benchmark is compiled before the run and, once launched, the metrics previously defined are collected during its execution. Here, for the sake of clarity, we evaluate only one benchmark for a set of embedded leverages. We chose to focus on a well-known CPU-intensive code: the line per line matrix multiplication (LpL MM) of dense random large square matrices (8192 as dimension size). The same algorithm is implemented for the various leverage combinations. The considered leverages are multi-thread, computation precision and vectorization. For the last two leverages, a different state means a different version of the code, here generated by hand using dedicated intrinsics and compilation flags (-O3 -msse3 -mavx2). We deactivated the auto-vectorization of the compiler (-fno-tree-vectorize) to keep control over the chosen intrinsics and because auto-generation of vectorizable code is not one of the leverages studied in this paper.
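For concreteness, a small OCaml sketch (our own helper functions) computing Metrics 1 and 2 from per-second samples (tj, pj) could look as follows:

(* samples.(j) = (t_j, p_j): timestamp in seconds, power in Watts. *)

(* Metric 1: avrgWatt = sum of p_j over j in [0,N], divided by N+1. *)
let avrg_watt samples =
  Array.fold_left (fun acc (_, p) -> acc +. p) 0. samples
  /. float_of_int (Array.length samples)

(* Metric 2: Joules = sum over j in [0,N-1] of (t_{j+1} - t_j) * p_j. *)
let joules samples =
  let e = ref 0. in
  for j = 0 to Array.length samples - 2 do
    let (tj, pj) = samples.(j) in
    let (tj1, _) = samples.(j + 1) in
    e := !e +. (tj1 -. tj) *. pj
  done;
  !e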


Formalization of the Table of Leverages. Here, we describe how to compute the score associated to each metric for each leverage. Let X, Y, Z be the sets of available states of three leverages χ, ψ, ω (corresponding to S, the set of states for a given leverage L, from Definition 1): X = {x0 , . . . , xnx }, Y = {y0 , . . . , yny }, and Z = {z0 , . . . , znz }. Let g1 , . . . , gm be the measured metric functions, as for instance avrgWatt, Joules, and Time. For all u (1 ≤ u ≤ m), gu (xi , yj , zk ) is the value of metric gu for the states xi , yj , zk of the leverages χ, ψ, ω. In the table of leverages, each line corresponds to a combination of states for each leverage and the columns correspond to the measured metrics. We normalize each value on the minimum value for each metric. These normalized values constitute the scores indicated in the table of leverages. Let h1 , . . . , hm be the normalized versions of g1 , . . . , gm . So, we have, for 1 ≤ u ≤ m,

hu (xi , yj , zk ) = gu (xi , yj , zk ) / min_{xi′ ∈X, yj′ ∈Y, zk′ ∈Z} gu (xi′ , yj′ , zk′ ),

with hu (xi , yj , zk ) being the value in the table of leverages in the column of metric u and on the line for the states xi , yj , zk of the leverages χ, ψ, ω, respectively. For the application-level leverages, here Precision and Vectorization, the chosen benchmarks correspond to the different combinations of application leverage states. The nbThreads leverage changes its state through an environment variable. When all states are covered, the table of leverages is complete for the considered benchmark. Reducing the creation time of such a table is not the focus of this paper.
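A minimal OCaml sketch of this normalisation (our own helper, assuming one list of raw gu values per metric column) is:

(* Normalise a metric column: divide every measured value by the minimum
   over all leverage-state combinations, yielding the scores h_u. *)
let normalise = function
  | [] -> []
  | (x :: xs) as column ->
    let min_v = List.fold_left min x xs in
    List.map (fun v -> v /. min_v) column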

4 Building and Analyzing the Table of Leverages

In this section, we present the table of leverages built on a node from our experimental testbed, Grid'5000 [3]. Grid'5000 deploys clusters linked with dedicated high performance networks in several cities in France. As our focus is on energy- and performance-related metrics, we used the Lyon site, where the energy consumption of every computing node is monitored through a dedicated wattmeter, exposing one power measurement per second with a 0.125 Watts accuracy. The Nova cluster from Lyon is used in the following. This cluster contains Dell PowerEdge R430 servers with 2 CPUs E5-2620 v4 of 8 cores each, 32 GB of memory and 2 HDD disks of 300 GB each. We applied our previous methodology for the three chosen leverages to the CPU-intensive benchmark. This allows us to explore all possible states of the chosen leverages, and thus to build a complete table of leverages. The table has the following format: the first three columns present the states of the nbThreads, Precision, and Vectorization leverages respectively, while the next three columns show the normalized results of the three metrics avrgWatt, Joules, and Time, respectively, for every combination of leverages. As can be seen in Table 1 (first six columns), a line represents the results of all gathered metrics for the execution of a representative load for a chosen combination of leverages. The results are normalized as explained before. The table of leverages thus gathers the knowledge of a Nova node for a given workload, run for multiple combined states of leverages.


Table 1. Normalized table of leverage states and ranked impact for the line per line matrix multiplication (LpL MM) benchmark on a Nova node.

Leverage states                         Table of leverages        Ranked impact
nbThreads (T)  Prec. (P)  Vector. (V)   avrgWatt  Joules  Time    avrgWatt  Joules  Time
1              int        none          1.05      65.09   61.89   P,T,V     P,T,V   P,T,V
1              int        SSE3          1.06      28.26   26.56   P,V,T     V,P,T   V,P,T
1              int        AVX2          1.06      29.32   27.67   P,V,T     V,P,T   V,P,T
1              float      none          1.05      72.97   69.67   P,V,T     P,T,V   P,T,V
1              float      SSE3          1.06      33.8    31.89   V,P,T     V,P,T   V,P,T
1              float      AVX2          1.05      36.8    34.89   P,V,T     V,P,T   V,P,T
1              double     none          1.06      81.59   76.89   P,T,V     P,T,V   P,T,V
1              double     SSE3          1.07      58.52   54.89   V,P,T     V,P,T   V,P,T
1              double     AVX2          1.06      57.72   54.22   P,V,T     V,P,T   V,P,T
32             int        none          1.43      13.48   9.44    P,T,V     T,P,V   T,P,V
32             int        SSE3          1.4       4.68    3.33    P,V,T     T,V,P   T,V,P
32             int        AVX2          1.0       1.0     1.0     P,V,T     T,V,P   T,V,P
32             float      none          1.45      7.4     5.11    P,T,V     T,P,V   T,P,V
32             float      SSE3          1.41      3.76    2.67    V,P,T     T,P,V   T,P,V
32             float      AVX2          1.56      3.11    2.0     P,V,T     T,V,P   T,V,P
32             double     none          1.53      8.34    5.44    P,T,V     T,P,V   T,P,V
32             double     SSE3          1.53      8.52    5.56    V,T,P     T,P,V   T,P,V
32             double     AVX2          1.54      7.0     4.56    P,T,V     T,V,P   T,V,P

Explanation of the Table: At first sight, a lot of unexpected results are detected in the table of leverages (Table 1), like the combinations with int being better than float and double when 1 and none are the chosen states for the nbThreads and Vectorization leverages, with this trend being reversed with nbThreads=32. For the set of combinations with 1 as the chosen state for the nbThreads leverage, it is logical to see that int is quicker than float and then double, from a cache usage perspective. Indeed, more data can be brought into the cache to compute without the need to fetch new data, compared to the float or double representations that need more space for the same number of elements. As for the SSE and AVX combinations, we obtain tremendous gains compared to none, as vectorization uses the vectorial capabilities of the core. Using a leverage usually comes with a cost, and this statement is also true for the Vectorization leverage. An operation on vectors has a cost, even if it is low; for instance, it is known that loading and saving vectors has a non-null cost. With only one active thread, the current architecture (Broadwell here) allows turbo boost, a technology that permits reaching a much higher frequency than the available ones (here it can reach 3.0 GHz, when the average frequency is 2.1 GHz). Also, when the OS detects too much load on a core, it context-switches the running process and runs it on another core. Hence, the kernel saves the state (stack, registers) of the current process and loads it on another core, implying a storing and loading cost for the given process. This phenomenon can happen several times per second. Thus, saving and loading states can create a lot of cache misses, which can be dramatic with vectorization, where loading and saving vectors is not free. As AVX has longer vectors, its operations on vectors can be longer than SSE's. Thus, it starts to be beneficial only when comparing double combinations for the Vectorization leverage. When 32 threads are used, data is more likely to be shared between the caches of the various used cores. Without the previous cache struggles on one core, and because it is also well known that floating-point operations (float and double here) are well optimized on current architectures and perform better than integer ones, {32, float, none} and {32, double, none} perform better than {32, int, none}. With all threads sharing data on separate caches, SSE and AVX outperform the none configuration, with AVX always outperforming SSE for a fixed combination. Due to this data repartition between caches implied by the chosen configuration of the nbThreads leverage, there is enough computation to overcome the cost of larger vector operations, here AVX, for all combinations. Note that the best combination for all metrics used here is always {32, int, AVX2}. This is the best combination to choose only if we have no constraints about leverage choices. It is expected to see variations, as leverages highly modulate the usage of the nodes, for example in terms of cache usage intensity, core usage, or availability of specific leverages (as seen with turbo boost with one thread). The metric results of combinations of leverages are thus complicated to fully understand without a detailed knowledge of the architecture, the underlying leverages and their influences in a given context. We propose predicates that help a user underline such interesting points from the table of leverages. For example, this table could help a user to choose a combination taking into account a fixed leverage state, or to answer the following question: is there a leverage or a leverage state that is always better for a given metric?

5 Exploiting the Table of Leverages

In this section, we describe the main contribution of this paper: a methodology to exploit the table of leverages and to extract useful knowledge, such as the influence and impact of one or multiple leverages on a given metric or set of metrics. We propose two focuses for extracting a score for each leverage. The first one corresponds to the actual table: it normalizes the results of a given metric for every explored configuration. The second one computes a ratio of contribution for each leverage in order to expose the most relevant leverage (the one with the largest contribution to the considered metric). We define four exploitation predicates that ease the analysis of the table and answer questions. We illustrate these predicates and the answers to these questions on the selected table (Table 1). These questions target a single metric, hu .

Question 1: Is a selected combination of leverage states the best one for metric hu ? If a given combination is always the best, it means it should always be applied, if possible, if one wants to optimize hu . Consider a combination of states xa , yb , zc of leverages χ, ψ, ω for metric hu . We need to check whether for all i ∈ [0, . . . , nx ]\{a}, j ∈ [0, . . . , ny ]\{b}, and k ∈ [0, . . . , nz ]\{c}, we have hu (xa , yb , zc ) ≤ hu (xi , yj , zk ). On Nova nodes and for the three leverages (Table 1), the best combination for all three studied metrics is {32, int, AVX2}.

Question 2: When I fix a state, do I always improve metric hu ? Consider state xa of leverage χ. We want to check whether for all i ∈ [0, . . . , nx ]\{a}, for all l, j ∈ [0, . . . , ny ], and for all m, k ∈ [0, . . . , nz ], we have hu (xa , yl , zm ) ≤ hu (xi , yj , zk ). On the example of Table 1, for the Joules and Time metrics, only the nmax (here, 32) state of the nbThreads leverage satisfies this predicate, meaning that using this state is always beneficial. No specific results can be obtained with this question for the avrgWatt metric, meaning that no leverage state is always better for this metric when used.
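As an illustration, a small OCaml sketch (our own encoding of the table as a list of rows, not the paper's implementation) of the Question 2 predicate could be:

(* A row of the table: the three leverage states and the h_u score for the
   metric under study. *)
type row = { t : int; p : string; v : string; h_u : float }

(* Question 2: does fixing the nbThreads state [t0] always improve h_u,
   i.e. is every row with t = t0 at least as good as every row with t <> t0? *)
let always_improves_t t0 rows =
  let fixed, others = List.partition (fun r -> r.t = t0) rows in
  List.for_all (fun r -> List.for_all (fun r' -> r.h_u <= r'.h_u) others) fixed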
Question 3: If some states are fixed for a subset of leverages, is a given state for the remaining leverages the best choice to optimize hu ? Consider that the states of leverages ψ, ω are fixed to yb , zc . We are asking whether state xa of leverage χ is the best choice for metric hu . Therefore, we need to check whether for all i ∈ [0, . . . , nx ]\{a}, we have hu (xa , yb , zc ) ≤ hu (xi , yb , zc ). This tells us for instance that for the fixed combination {32, SSE3}, the best state for the Precision leverage is float when considering the Joules or Time metric (Table 1). However, when focusing on avrgWatt as the studied metric, for the {32, SSE3} fixed combination, the best state for the Precision leverage is int. If only state zc of leverage ω is fixed, and we consider states xa and yb of leverages χ and ψ respectively, we check whether for all i ∈ [0, . . . , nx ] and for all j ∈ [0, . . . , ny ], we have hu (xa , yb , zc ) ≤ hu (xi , yj , zc ). Concerning the Joules metric (Table 1), for the fixed state float of the Precision leverage, the best combination for the nbThreads and Vectorization leverages is {32, AVX2}. However, for the avrgWatt metric, fixing again the state float of the Precision leverage, the best combination is now {32, SSE3}. Applying this predicate allows us to extract some unexpected results. Concerning the Joules and Time metrics, for the Precision and Vectorization leverages, no state emerges as the best one: it highly depends on the chosen states of the other leverages. One could for instance expect int to always be the best state, but when comparing {32, double, none} with {32, int, none}, we see that the double combination is more effective than the int combination. Similar conclusions can be drawn when the Vectorization leverage is used. AVX2 has larger vectors than SSE3, thus we would expect it to always be more efficient. However, when the nbThreads state is equal to 1, {1, float, SSE3} is more effective than {1, float, AVX2}, leading to a different best choice when combined with the nmax state (here, 32), where {32, float, AVX2} is more effective than {32, float, SSE3}. Note that the latter combination emerges as the best one when SSE3 is fixed. Concerning the avrgWatt metric, we also get unexpected knowledge. In opposition to the Joules and Time metrics, no state emerges as the best one for any of the studied leverages. As AVX2 has larger vectors than SSE3, we would expect it to always stress the CPU more, thus always having higher values for this metric. It is the case with the {32, float} and {32, double} combinations, but it is not observed with the other combinations. When nbThreads=1, int is always the best choice to minimize this metric, whatever the chosen states of the Precision and Vectorization leverages. Moreover, when Vectorization and nbThreads are set to any of the studied states, int is also always the best choice to minimize the avrgWatt metric.

Question 4: Given a combination for all the leverages, how can we rank the states in terms of contribution for metric hu ? To answer this question, we consider a set of states xa , yb , zc of leverages χ, ψ, ω. Then, for each state w ∈ {xa , yb , zc }, we compute the contribution score mc(w) for this state on metric hu as follows. For state xa of leverage χ, mc(xa ) = hu (xa , yb , zc ) / max_{i∈[0,...,nx ]} hu (xi , yb , zc ).

We define similarly the contribution of the states of the other leverages ψ and ω. Then, we rank the contribution scores mc(xa ), mc(yb ), mc(zc ) in ascending order to answer the question. Table 1 (last three columns) presents the scoring related to the table of leverages. For the best combination {32, int, AVX2}, the ranking for the Joules metric goes as follows: “T,V,P”, i.e., “nbThreads, Vectorization, Precision”, meaning that the chosen state for T is the most contributing state in this combination, followed by the V and then the P states. Thus, for this combination, the Precision leverage with the int state has the lowest contribution. This ranking points out unexpected results for the Joules metric. We notice a switch between two positions of a given leverage for the fixed combination of other leverage states {32, double}. In fact, when comparing the scoring of {32, double, SSE3} with {32, double, AVX2}, we get respectively “T,P,V” and “T,V,P”. In the first case, double and SSE3 have the same worst possible score, 1.0, meaning that each is the worst state of its leverage for this combination. In the second case, AVX2 scores better than SSE3 and is thus ranked above double. When nbThreads=1, we note that combinations including the SSE3 and AVX2 states always have the Vectorization leverage state as the most contributing one, which leads to the conclusion that it is always better to use the SSE3 and AVX2 states of the Vectorization leverage. For the {32, float, SSE3} combination, we get the scoring “T,P,V”. float gets a better score, and thus a better position, than SSE3 because it is the best leverage state for the {32, SSE3} combination, leading to the conclusion that choosing float instead of the other Precision leverage states contributes more than choosing SSE3 instead of the other Vectorization leverage states for this combination. For the avrgWatt metric, the scoring underlines the fact that when choosing int as the state of the Precision leverage, and for a fixed state of the Vectorization leverage, the sorting is always the same. In fact, {32, int, none}, {32, int, SSE3} and {32, int, AVX2} get the exact same sorting of contributions as {1, int, none}, {1, int, SSE3} and {1, int, AVX2}, respectively. Moreover, int is always the most contributing leverage state, which shows that int is always a good choice to improve this metric. This scoring also underlines the fact that, in order to minimize the avrgWatt metric, a user should rather focus on the P and V leverages, as T is never the most contributing one. This scoring highlights results that would have been difficult to notice just by looking at the table. It allows a user to quantify how much a leverage state used in a combination contributes to the overall performance for a given metric.
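The contribution score and its ranking can be sketched in OCaml as follows (again our own helper functions, not the paper's implementation):

(* mc for one leverage: h_u of the selected combination divided by the
   maximum h_u obtained when only this leverage's state varies, the states
   of the other leverages being fixed. *)
let contribution ~selected ~varying =
  selected /. List.fold_left max neg_infinity varying

(* Rank the per-leverage scores in ascending order (most contributing
   first), as in the last three columns of Table 1. *)
let rank scores =
  List.sort (fun (_, a) (_, b) -> compare a b) scores

For instance, rank [("T", mc_t); ("P", mc_p); ("V", mc_v)] (with hypothetical score values mc_t, mc_p, mc_v) yields the “T,V,P”-style orderings shown in the table.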

6 Conclusion

Energy efficiency is a growing concern. In the context of HPC and data centers, where the size of infrastructures grows drastically, energy consumption has to be taken into account as a high expense. There is a wide range of techniques, which we formally define as leverages, that permit modulating the computing capabilities and/or the energy/power used by a device. We propose a generic solution to extract fine-grain knowledge and hints from the table of leverages, thanks to the defined predicates. Our solution underlines new knowledge about leverages alone and about combinations of leverages. Thus, it allows us to extract the influences of leverages on each other and knowledge understandable by the user. Such knowledge could be extracted from a table built on a CPU-intensive workload. For example, our solution underlines the fact that if Precision is set to the double state, it is always better to use it with the AVX2 state of the Vectorization leverage to minimize the Joules metric. Also, when Vectorization is fixed to the SSE3 state, our solution tells us that float is the best state to minimize the Joules metric. We also underline the fact that some unexpected behavior can be seen when combining leverages: for example, changing float or int to double for Precision, while keeping the SSE3 state activated for Vectorization, turns out to be counterproductive for the Joules metric. The first short-term future work is the parallelization of the creation of the table of leverages, in order to improve the time needed to build it. Then, we plan to apply this methodology to other, non-CPU-intensive phases, such as IO-, HDD- and RAM-intensive phases, with appropriate leverages for every phase. Finally, a future working direction would be to extend this methodology to leverages with costly state transitions, as for instance shutdown policies. Also, we would like to investigate how to reduce the completion time for building such a table. In fact, the time to solution could be greatly reduced, for example by predicting, using learning or prediction techniques, which runs are not needed to know the values of the relevant metrics.

References
1. Acar, H., Alptekin, G.I., Gelas, J.-P., Ghodous, P.: Towards a green and sustainable software. In: Concurrent Engineering, pp. 471–480 (2015)
2. International Energy Agency: Digitalization & Energy. White paper (2017)
3. Balouek, D., et al.: Adding virtualization capabilities to the Grid'5000 testbed. In: Ivanov, I.I., van Sinderen, M., Leymann, F., Shan, T. (eds.) CLOSER 2012. CCIS, vol. 367, pp. 3–20. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-04519-1_1
4. Chetsa, G.L.T., et al.: A user friendly phase detection methodology for HPC systems' analysis. In: IEEE International Conference on Green Computing and Communications, and IEEE Cyber, Physical and Social Computing (2013)
5. Dagum, L., Menon, R.: OpenMP: an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5, 46–55 (1998)
6. Gallas, B., Verma, V.: Embedded Pentium (R) processor system design for Windows CE. In: Wescon/98, pp. 114–123. IEEE (1998)
7. Georgiou, Y., Glesser, D., Rzadca, K., Trystram, D.: A scheduler-level incentive mechanism for energy efficiency in HPC. In: CCGrid, pp. 617–626 (2015)
8. Lomont, C.: Introduction to Intel advanced vector extensions. Intel White Paper, pp. 1–21 (2011)
9. Peleg, A., Weiser, U.: MMX technology extension to the Intel architecture. IEEE Micro 16(4), 42–50 (1996)
10. Raïs, I., Orgerie, A.-C., Quinson, M.: Impact of shutdown techniques for energy-efficient cloud data centers. In: Carretero, J., Garcia-Blas, J., Ko, R.K.L., Mueller, P., Nakano, K. (eds.) ICA3PP 2016. LNCS, vol. 10048, pp. 203–210. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49583-5_15
11. Suleiman, D., Ibrahim, M., Hamarash, I.: Dynamic voltage frequency scaling (DVFS) for microprocessors power and energy reduction. In: International Conference on Electrical and Electronics Engineering (2005)

A Semantic Web Based Intelligent IoT Model

Chao Qu, Ming Tao(B), Jie Zhang, Xiaoyu Hong, and Ruifen Yuan

School of Computer Science and Network Security, Dongguan University of Technology, Dongguan 523808, China
{quc,zhangjie,hongxy,yuanrf}@dgut.edu.cn, [email protected]

Abstract. Different from a sensor network, the devices in the intelligent Internet of Things (IoT) should be able to organize and coordinate spontaneously to accomplish specific tasks. By taking advantage of various intelligent technologies, we propose an intelligent IoT model based on the Semantic Web. The framework consists of a top ontology, an entity link layer, a semantic label layer, a service register center, a transaction construction layer, and a transaction execution control layer. For the sake of constructing and executing transactions automatically in the intelligent IoT, entity functions are represented by Semantic Web Services. Additionally, the framework also acts as a manager during the execution of a transaction and provides effective management and control of the entities. We demonstrate the effectiveness and superiority of the proposed model with a case study of a comprehensive rescue service for transportation accidents.

Keywords: Intelligent IoT · Semantic Web · IoT framework

1 Introduction

The ultimate purpose of the IoT is to realize the smart interconnection between objects, and many applications have already been deployed [1]. The logic expression ability, knowledge discovery and reasoning capabilities of the Semantic Web match the needs of the further development of the Internet of Things. The Semantic Web has therefore become an important technology for promoting the Internet of Things. In the past few decades, semantic techniques have provided means for the description, information sharing, and integration of heterogeneous objects. Moreover, artificial intelligence and knowledge engineering are combined in the field of the Semantic Web to represent and process data and knowledge.

2 Related Work

The corresponding research results on the Intelligent IoT include the Smart-M3 system, which aims to merge the Semantic Web and IoT domains and provides a semantic publishing and subscription software architecture [2]. Tao et al. [3] used a semantic ontology to manage devices in a smart home. Wu et al. [4] proposed a unified knowledge framework to improve the Semantic Web of Things (SWoT) for interoperability between IoT applications in specific areas. Jahan et al. [5] discussed a ubiquitous knowledge base, a CoAP-based framework and an intelligent gateway for the SWoT framework. Gyrard et al. [6] proposed SEG3.0 as a joint, unified approach to semantic interoperability and applied it to smart city systems. Poslad et al. [7] proposed a new IoT EWS system framework, a semantic information organization model, which aims to explore the scalability of the IoT [8]. Singh et al. [9] proposed a new IoT architecture model using a Semantic Fusion Model (SFM), which uses an intelligent semantic framework to encapsulate information gathered from sensor networks. The most representative one is the Semantic Web of Things model proposed by Floriano [10].

3 Problem Statement

The purpose of the Intelligent IoT is to realize the direct correlation between the information space and the physical world and to achieve complete intelligent interaction. Since the specific process is completed without the involvement of human activities, the problems are more complex and diverse. The intelligent interaction between human society, the information space and the physical world (Fig. 1) is different from the IoT and also different from the Semantic Web. The complexity increases geometrically and the system faces more unprecedented problems and challenges. The problems include the following four aspects. (1) The existing frameworks of the Intelligent IoT mainly focus on information processing and do not substantially involve the driving and control of entities; in essence, the existing frameworks still work in the information field, and many issues about device connection are not addressed. (2) The existing frameworks of the Intelligent IoT do not completely separate from manual control to achieve true intelligence. The ultimate purpose of the Intelligent IoT is to hand over all information processing to the machines or entities; people only act as perceivers of the end result and do not need to participate in the query, composition and processing. The existing frameworks cannot provide solutions for such intelligent development. (3) The existing frameworks of the Intelligent IoT cannot provide support for complex process construction. Although the existing frameworks can settle the problems of service composition, they do not provide support for service composition coupled with dynamic entity information. (4) The existing frameworks of the Intelligent IoT cannot manage and control the execution of complex processes. Physical-world entities need an effective scheduling mechanism, and a mechanism to resolve errors is also needed; in some conditions, the system must resolve errors immediately and eliminate their expected impact.

Fig. 1. Interactions in IoT, SW and SWoT.

4 Semantic Web Based Intelligent IoT Model

To settle the problems in Sect. 3, we propose a Semantic Web based intelligent IoT model (ISWM). In this section, we explain it in detail.

4.1 Framework

Based on previous research, we propose a more reasonable intelligent IoT model, as shown in Fig. 2.

Fig. 2. Framework of Semantic Web based intelligent IoT model.


Top Ontology. The top ontology is used to represent the concepts and relationships involved in the IoT. It is the basis of the framework and provides logical reasoning for the whole architecture [11].

Entity Link Layer. The entity link layer implements functions such as information transfer and drive control for the various entities, and is an intermediate layer between the logic part and the entities. This layer should not only discover newly added physical devices, convert the various device communication protocols and drive the devices, but also acquire the devices' status information in real time to provide the basis for the establishment and execution of transactions.

Semantic Annotation Layer. The semantic annotation layer implements the semantic annotation of the original data in order to provide semantic support for the upper applications. This layer also packages the semantic information of entities into Semantic Web Services. It is mainly divided into three functional modules. The semantic tag database is a set of semantic representation tags for existing entities, similar to a DTD in XML. The semantic annotation module labels the functions and features of the underlying entities with the normalization tags, providing a basis for selection during transaction construction. The function of the service package module is to encapsulate semantically labeled entity functions into Web services.

Service Registry Center. In the semantic annotation layer, the entity information is converted into a machine-readable semantic format. This information includes two parts: static attributes, such as state and external environment description, and functional attributes, which are packaged as Web Services. Using an approach similar to that of Web services, this information is stored in a registry or on a cloud platform and can be updated using periodic queries or transaction triggers. The service registry can use the existing technologies of SOA to store and manage the services provided by the IoT entities. The service registry center can adopt a centralized management mode or a distributed management mode. Its working principle and implementation technology can also directly draw on UDDI.

Transaction Construction Layer. This layer regards the Web Services as dynamically configurable resources for management and scheduling. The main function of this layer is to build the service chain that meets the user requirements. There are two steps in transaction construction: semantic decomposition of the user requirements, and service discovery and composition. This layer includes the following four functional modules. The context analysis module analyzes the semantic information of the requirements delivered by the user interface, through ontology reasoning and a priori rules, and then makes a definitive judgment on the requirement according to the corresponding context. The requirement decomposition module uses the ontology and its inference rules to decompose the request into sequential calls of several entities, which are translated into Web services. The function of the service query module is to find the Web services that satisfy the conditions in the registration center, according to the result of the requirement decomposition.


The function of the service composition module is to organize the queried services and build a transaction that meets the requirements of the user.

Transaction Execution Control Layer. The main function of this layer is to manage and control the execution of IoT transactions, including: state information, i.e., the definition of the entity states in the transaction; state awareness of the subsequent entities during execution; dynamic replacement of an entity or termination of the transaction when an error occurs; and cooperative scheduling or process consolidation between cooperating IoT transactions. This layer includes the following four functional modules. The transaction status set is a predefined rule set, which defines the set of states, environment requirements, etc., that should be fulfilled during the execution of the IoT transaction. The transaction status set must be determined in advance, by using the ontology reasoning rules based on the semantic information, while constructing the IoT transaction. The entity status query module is directly associated with the device state module in the entity link layer, and obtains the state information of the entities in real time. When the error control module encounters, during the execution of the IoT transaction, a failure of a service that represents an entity's function, it determines whether or not the pre-driver result is retained and how the successor sequence is handled, according to the semantic environment. The scheduling module controls the concurrent, cooperative or mutually exclusive IoT transactions, and schedules non-concurrent entities according to the actual environment.

4.2 Working Mechanism

The workflow of ISWM includes the following three aspects.

Entity Functions are Registered as Web Services. First, the entities connected to the IoT are captured by the device discovery module in the entity link layer. Second, the entities are matched with their drivers, which are configured by the device driver module, and the entity information is then passed to the semantic annotation layer for semantic encapsulation. After that, the semantic annotation module represents the entities according to the metadata in the semantic tag database, and the service package module expresses the entity functions as Web Services. Finally, the services are registered in the service registry center and published by the service publisher module. The process is shown in Fig. 3.

Construct an IoT Transaction to Meet the User's Requirement. The user requirement is provided to the system in natural language and passed to the transaction construction layer. The requirement is analyzed by the context analysis module and translated into machine-readable semantic information. The semantic information is decomposed into a combination of simple requirements by the demand decomposition module; the format of these simple requirements is the same as that of Web Services. The service query module finds and matches the entity services in the service registry center. The service combination module organizes them to construct the transaction and establishes the transaction status set. The process is shown in Fig. 4.



Fig. 3. The process of entities registered as web services.

Fig. 4. IoT transaction construction process.

Control the Execution of the IoT Transaction. After the IoT transaction has been constructed, its services are called in turn, according to their logical sequence, under the control of the transaction execution control layer. During service invocation, the entity status query module detects the state of the entity that provides the service in real time, through the device state module in the entity link layer, and updates the transaction status set. At the same time, the semantic parameters in the requirements are transferred as driver parameters to the corresponding entities by the information convert module. Throughout the execution, the error control module and the scheduling module in the transaction execution control layer are responsible for handling and managing errors. The process is shown in Fig. 5.


Fig. 5. IoT transaction execution process.

5 Comparison and Use Cases

5.1 Comparison of Models

The ISWM proposed in this paper is compared with the SWoT model [9] and the active service model for IoT (IASM) [10]. The common attributes are as follows: they all need a knowledge base (KB) as support, and in IASM and ISWM, Web Services are used as the objects of discovery and composition, with service composition used for managing information. The differences between our model and the previous architectures are as follows: the ISWM framework adds an entity link layer to the aforementioned structure, which is used to communicate with and control the underlying entities; in ISWM, information is transmitted not only upward but also downward; a service registry center is added for the unified and standardized management of entity services; and, in particular, entities can be scheduled to resolve unexpected problems when abnormal conditions occur. The differences are listed in Table 1.

Table 1. Comparison of SWoT, IASM and ISWM

       KB support   Service discovery   Entity composition   Entity status feedback   Entity control and schedule
SWoT   Yes          No                  No                   No                       No
IASM   Yes          Yes                 Yes                  No                       No
ISWM   Yes          Yes                 Yes                  Yes                      Yes


Table 2. The implementation of transport rescue process in different environments

                          Manual operation                            SWoT                                         ISWM
Information collection    Rescuers                                    Sensor network                               Sensor network
Accident determination    Traffic police                              KB                                           KB and ontology
Decision making           Emergency rescue department                 Decision Support Systems                     Decision Support Systems
Organization              Emergency rescue department and rescuers    Emergency rescue department and rescuers     Transaction execution and control system and rescuers
Information publication   Emergency rescue department                 Emergency rescue department or KB            Emergency rescue department

Table 3. Executioner and method in rescue process in different environments

                      SWoT executioner              SWoT method                                ISWM executioner                                                            ISWM method
Information report    Emergency rescue department   By phone                                   Transaction construction system                                             Message trigger
Wrecker dispatch      Emergency rescue department   Query and scheduling manually              Transaction execution control system                                        Policy scheduling, message triggering
Ambulance dispatch    Emergency rescue department   Query and scheduling manually              Transaction execution control system                                        Policy scheduling, message triggering
Fire truck dispatch   Emergency rescue department   Query and scheduling manually              Transaction execution control system                                        Policy scheduling, message triggering
Traffic control       Traffic police                Command by traffic police                  Transaction execution control system                                        IoT entities such as traffic lights and indicators, traffic police if needed
Material dispatch     Emergency rescue department   Preparation and transportation manually    Transaction construction system and transaction execution control system    Intelligent storage and intelligent logistics system scheduling

5.2 Case Study of Traffic Rescue

The Intelligent Transportation System (ITS) uses IoT technology to equip the road network for real-time monitoring and precise management. Its most important function is to detect and deal with traffic accidents in time. The model proposed in this paper is compared with manual operation and the SWoT model in the implementation of the comprehensive rescue process for traffic accidents, as shown in Table 2. We can see that in the comprehensive rescue process for traffic accidents, the proposed model is as effective as manual operation and the SWoT structure. The biggest difference between our model and the other two structures lies in the executioners in the integrated rescue procedure, as shown in Table 3. It can be seen from Tables 2 and 3 that in the comprehensive rescue process, the SWoT model can use the sensor network and knowledge system to discover accidents, assess them, and formulate rescue strategies, but it is ineffective for organizing the subsequent rescue work.

6 Conclusion

In order to achieve the objective of intelligence, this paper proposed an intelligent IoT model based on the Semantic Web, and described the framework and working mechanism of the model. The framework uses the ontology as the basis of logical reasoning and is divided into several parts: the entity link layer, the semantic annotation layer, the service registry center, the transaction construction layer, and the transaction execution control layer. Semantic technology is used to describe each IoT entity as a dynamic Web Service. In the model, the technologies of service discovery and service composition are used to build IoT transactions that meet users' requirements and to control the transaction processes. Thanks to the addition of physical feedback, entity control, and scheduling, the advantages of our model are shown in the use case of traffic accident rescue. In other work we study the security of the model [12].

Acknowledgment. This work was supported in part by the Natural Science Foundation of Guangdong Province, China (Grant No. 2018A030313014); the Guangdong University Scientific Innovation Project (Grant No. 2017KTSCX178); the outstanding young teacher training program of the Education Department of Guangdong Province (Grant No. YQ2015158); the Guangdong Provincial Science & Technology Plan Projects (Grant Nos. 2016A010101035 & 2016A010101034); and the National Natural Science Fund, China (Grant Nos. 61300198 & 61772233).

References

1. Tao, M., Zuo, J., Liu, Z., Castiglione, A., Palmieri, F.: Multi-layer cloud architectural model and ontology-based security service framework for IoT-based smart homes. Futur. Gener. Comput. Syst. 78, 1040–1051 (2016)
2. D'elia, A., Viola, F., Roffia, L., Azzoni, P., Cinotti, T.S.: Enabling interoperability in the Internet of Things: an OSGi semantic information broker implementation. Int. J. Semant. Web Inf. Syst. 13(1), 147–167 (2017)
3. Tao, M., Ota, K., Dong, M.: Ontology-based data semantic management and application in IoT- and cloud-enabled smart homes. Futur. Gener. Comput. Syst. 76, 528–539 (2016)
4. Wu, Z., Xu, Y., Zhang, C., Yang, Y., Ji, Y.: Towards semantic web of things: from manual to semi-automatic semantic annotation on web of things. In: Wang, Y., Yu, G., Zhang, Y., Han, Z., Wang, G. (eds.) BigCom 2016. LNCS, vol. 9784, pp. 295–308. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42553-5_25
5. Jahan, F., Fruitwala, P., Vyas, T.: Towards the next generation of web of things: a survey on semantic web of things' framework. In: Satapathy, S.C.C., Das, S. (eds.) Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Volume 1. SIST, vol. 50, pp. 31–39. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30933-0_4
6. Gyrard, A., Serrano, M.: Connected smart cities: interoperability with SEG 3.0 for the Internet of Things. In: IEEE 30th International Conference on Advanced Information Networking and Applications Workshops, pp. 796–802 (2016)
7. Poslad, S., Middleton, S.E., Chaves, F., et al.: A semantic IoT early warning system for natural environment crisis management. IEEE Trans. Emerg. Top. Comput. 3(2), 246–257 (2015)
8. Sun, Y., Jara, A.J.: An extensible and active semantic model of information organizing for the Internet of Things. Pers. Ubiquitous Comput. 18(8), 1821–1833 (2014)
9. Singh, D., Tripathi, G., Jara, A.J., et al.: A survey of Internet-of-Things: future vision, architecture, challenges and services. In: 2014 IEEE World Forum on Internet of Things, pp. 287–292 (2014)
10. Scioscia, F., Ruta, M.: Building a semantic Web of things: issues and perspectives in information compression. In: Proceedings of the 2009 IEEE International Conference on Semantic Computing (ICSC), pp. 589–594 (2009)
11. Qu, C., Liu, F., Tao, M.: Ontologies for the transactions on IoT. Int. J. Distrib. Sens. Netw. 11, 1–12 (2015)
12. Qu, C., Tao, M., Zhang, J., Hong, X.Y., Yuan, R.F.: Blockchain based credibility verification method for IoT entities. Secur. Commun. Netw. 2018, 1–11 (2018)

Accelerating CNNs Using Optimized Scheduling Strategy

Rui Xu¹, Sheng Ma², Wenwu Li¹, and Yang Guo¹

¹ College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China
² The State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, Hunan, China
[email protected]

Abstract. Convolutional neural networks (CNNs) have a wide range of applications in image and video recognition, recommender systems and natural language processing. But CNNs are computationally intensive, and their computational cost is hard to accept. In order to speed up the calculations, people focus on optimizing the convolutions, which account for most of CNNs' operations. Consequently, many algorithms have been proposed to accelerate the operation of convolution layers. However, each algorithm has its advantages and disadvantages, and there is no single algorithm that can handle all situations. In this paper, we examine the performance of various algorithms in the GPU environment. By building a customized CNN model, we fully explore the impact of the network structure on the performance of the algorithms, including inference/training speed and memory consumption. In addition to the algorithms, we also focus on how their implementations in the GPU environment affect their performance. Finally, we summarize the characteristics of each algorithm and design a strategy that assigns the appropriate implementation to different convolutional layers in CNNs. With our strategy, we can make AlexNet run 1.2x to 2.8x faster than with other strategies in the GPU environment. This work is important for understanding these algorithms and may provide insights for further optimizations of the architecture of GPUs and accelerators.

Keywords: Artificial intelligence · Convolutional neural networks · Scheduling strategy · GPU framework

1 Introduction

Since deep learning [1] was proposed, it has rapidly become a hot topic. In particular, deep neural networks (DNNs) have made significant progress in image classification, target recognition, speech recognition, language translation, etc. [2]. In some cases, the accuracy of a neural network even exceeds that of human identification [3].

This work is supported by the National Natural Science Foundation of China (No. 61672526) and Research Project of NUDT (ZK17-03-06).


A series of successful and mature network models have also been proposed, including convolutional neural networks (CNNs), long short-term memory (LSTM) and recurrent neural networks (RNNs) [2], etc. In this paper, we focus on CNNs, which play a key role in image and video recognition, recommender systems and natural language processing [3].

However, the training/inference time of CNNs is long and sometimes unbearable. Due to the complexity of convolution operations, CNNs bring a huge workload to the device. Meanwhile, CNNs are becoming more and more complicated, since the number of convolutional layers is continually increasing. These changes can improve accuracy, but result in a huge increase in training and inference time. There are many ways to address this problem, one of which is to use an acceleration algorithm. At present, there are three popular convolution acceleration algorithms: matrix multiplication (GEMM) [5], the Fast Fourier Transform (FFT) [6], and Winograd's minimal filtering algorithm (the Winograd algorithm) [7]. GEMM converts convolution operations into more efficient matrix operations, while the FFT and Winograd algorithms reduce the computational complexity of CNNs.

People usually implement CNNs in the GPU environment, because GPUs use many-core architectures and have massively parallel processing power [2]. Moreover, NVIDIA has developed a deep learning library called cuDNN, a GPU-accelerated library of primitives for deep neural networks [8]. It provides highly tuned implementations of convolution to accelerate the execution of CNNs in the GPU environment. Therefore, more and more users choose GPUs to speed up the execution of CNNs, and cuDNN is being used more and more widely.

Currently, users can choose an appropriate framework to build a CNN model, but they rarely understand the implementations or algorithms of convolution used by these frameworks. Few studies have really shown the differences between these convolution algorithms. In this paper, we present detailed comparisons of the characteristics of these algorithms. We choose GPUs as the main hardware platform for the convolution operation and compare the seven most popular implementations of convolution. We choose a customized CNN model as the workload to obtain the performance characteristics of these implementations. The customized CNN model was built and trained using the same framework, Caffe [9]. Our work shows that each implementation has pros and cons, and no algorithm performs best in all situations. The actual performance of these algorithms or implementations depends heavily on the configuration of the convolutional layer. Moreover, for the same configuration, the GPU implementation will further affect the performance of these algorithms. This work is important for understanding these algorithms and may provide insights for further optimizations of the architecture of GPUs and accelerators.

Based on the characteristics of the algorithms, we provide optimization techniques to implement efficient CNN models using the algorithm libraries. We design an Optimized Algorithm Scheduling Strategy, which assigns the appropriate convolution algorithm to each convolutional layer of a CNN. We also designed an experiment to verify the superiority of our strategy. We compare our design with several existing


solutions, such as Caffe+CUDA and Caffe+cuDNN. Our experimental results show that our strategy can increase the execution speed by up to 2.8x compared to Caffe+CUDA and 1.2x compared to Caffe+cuDNN.

2 Background and Related Work

2.1 Convolutional Neural Networks

A convolutional neural network (CNN) is a feedforward neural network [2]. It has excellent performance for large-scale image processing and identification. In a CNN, multiple convolutional layers are connected; such a structure allows the CNN to abstract the features of the image as much as possible. The main purpose of the convolution layers is to extract the features in the image. They use well-trained filters that are highly responsive to specific patterns. However, the types of features extracted by different convolutional layers are not the same. In AlexNet, the first convolution layer is used to detect low-order features such as edges, corners, curves, etc. As the number of convolutional layers increases, the features detected by the filters become more complex [10].


Fig. 1. The description of the part of AlexNet, which is from Layer 2 to Layer 3, by using the 4D-tensor.

In a CNN, feature data is stored as tensors. In the traditional method, feature images or maps are processed in two dimensions, <H, W>, where H represents the height of the images and W represents the width of the images. But since there are many images that need to be processed in the same layer, we can treat the feature map data as a four-dimensional tensor, <N, C, H, W>, where N and C denote the number of images in a batch and the number of channels, respectively. In this way, we can easily describe a CNN's network structure (see Fig. 1). Similarly, we can also use a 4D-tensor to describe the kernels, <K, C, R, S>, where K represents the number of kernels, R represents the height of the kernel, and S represents the width of the kernel.
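As a small illustration (not taken from the paper), the two 4D layouts can be written down with NumPy; the sizes below follow the default parameters of the customized network used later in Sect. 3.

import numpy as np

# Feature maps as an <N, C, H, W> tensor: a batch of 64 images, 64 channels, 56x56 pixels.
feature_maps = np.zeros((64, 64, 56, 56), dtype=np.float32)

# Kernels as a <K, C, R, S> tensor: 128 filters over 64 input channels with a 5x5 window.
kernels = np.zeros((128, 64, 5, 5), dtype=np.float32)

N, C, H, W = feature_maps.shape
K, _, R, S = kernels.shape
print(N, C, H, W, K, R, S)   # 64 64 56 56 128 5 5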

2.2 Convolution Algorithms

Convolution is the key operation of CNNs. How to carry out these operations efficiently has become a hot research topic. Many algorithms have been proposed and most of them have different implementations in GPU environment.


The formulation of the convolution operation is given in Eq. (1) [5], where N is the mini-batch size, C the number of channels, K the number of filters, R and S the filter size, P and Q the output size, U the stride size, F the filters of the CNN, and D the input maps. The traditional method of calculating convolution is based directly on Eq. (1); we call it the direct convolution algorithm. It performs the multiplications between elements and accumulates their results according to Eq. (1) [11]. It is the most straightforward way to perform convolution. Cuda-convnet2 [11] is a widely used direct convolution library.

O[n, k, p, q] = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} F[k, c, r, s] \cdot D[n, c, p \cdot U + r, q \cdot U + s],   (1)

with n \in [0, N), k \in [0, K), p \in [0, P), q \in [0, Q).
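As a sketch of the direct algorithm of Eq. (1) (illustrative NumPy code, not the cuda-convnet2 implementation):

import numpy as np

def direct_conv(D, F, U=1):
    """Direct convolution following Eq. (1): D is <N, C, H, W>, F is <K, C, R, S>."""
    N, C, H, W = D.shape
    K, _, R, S = F.shape
    P = (H - R) // U + 1          # output height
    Q = (W - S) // U + 1          # output width
    O = np.zeros((N, K, P, Q), dtype=D.dtype)
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    # accumulate F[k,c,r,s] * D[n,c,p*U+r,q*U+s] over c, r, s
                    O[n, k, p, q] = np.sum(F[k] * D[n, :, p*U:p*U+R, q*U:q*U+S])
    return O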

Another algorithm is matrix multiplication (GEMM). It transforms the convolution operation into a matrix multiplication operation, as shown in Fig. 2 [5]. Because matrix multiplication has efficient computational libraries in the GPU environment, this simpler approach gains considerable efficiency.

Fig. 2. Transforming the matrix convolution into a matrix multiplication [2]. This process produces redundant data, which is marked in red in the figure.

One of the implementations of GEMM in the GPU environment is called explicit GEMM. This implementation directly calculates the convolution according to the GEMM algorithm flow. But it has the disadvantage that there is redundant data in the input matrix, which takes up extra memory space. Therefore, implicit GEMM [12] was proposed. It divides these matrices into small pieces and uses an index to guide the calculation. Small amounts of data can be loaded into the on-chip memory directly, without taking up extra GPU memory. But this method requires additional calculation of the index and sufficient bandwidth.
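The explicit GEMM idea can be sketched in NumPy as below (assuming a stride of 1 for brevity; this is an illustration of the im2col-style unrolling in Fig. 2, not cuDNN's code). The unrolled matrix cols contains the redundant copies of the input that the text mentions.

import numpy as np

def conv_as_gemm(D, F):
    """Explicit GEMM for stride 1: unroll the input, then one matrix multiplication."""
    N, C, H, W = D.shape
    K, _, R, S = F.shape
    P, Q = H - R + 1, W - S + 1
    # One column of C*R*S values per output position (redundant data).
    cols = np.empty((N, C * R * S, P * Q), dtype=D.dtype)
    for p in range(P):
        for q in range(Q):
            cols[:, :, p * Q + q] = D[:, :, p:p+R, q:q+S].reshape(N, -1)
    W_mat = F.reshape(K, -1)          # K x (C*R*S) filter matrix
    O = W_mat @ cols                  # batched matrix multiplication
    return O.reshape(N, K, P, Q)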

Fig. 3. Use FFT to calculate convolution. In the frequency domain space, it can be seen that the small-size filter becomes the same size as the input image, which takes up extra space.


Another implementation of GEMM is implicit-precomp GEMM. It is based on implicit GEMM, but unlike implicit GEMM it does not require index calculation during the convolution operation: it obtains the index in advance by calculating the parameters of the CNN structure and the block size. This can further speed up the calculation; however, it takes up some memory space to store the index.

In order to further speed up the operation, the Fast Fourier Transform (FFT) is also implemented. It transforms the input and filter data into the frequency-domain space and computes their product there [6]. Then, the result is transformed back into the time-domain space to get the final convolution result (see Fig. 3). FFT speeds up computation by reducing the computational complexity of convolution. The number of multiplications of convolution is O(P² · R²) in the direct algorithm, whereas the FFT algorithm can reduce the number to O(P² · log P) [2]. The disadvantage of the FFT algorithm is that it needs to store a large amount of intermediate data; the FFT transformation also expands the filter to the size of the input maps [5]. For these reasons, the algorithm takes up significant memory space, especially for small kernels and large inputs. To solve these problems, FFT-tiling was proposed, which is another implementation of the FFT algorithm on the GPU. Similar to implicit GEMM, it divides the input maps into small tiles. It uses block transmission and calculation to reduce memory usage and to hide the latency of the transmission [12].

Another acceleration algorithm is the Winograd algorithm. It transforms multiplication operations into addition operations to reduce the computational cost [7]. By using this algorithm, we can reduce the number of multiplications from O(P² · R²) to O((P + R − 1)²) in the convolution operation. However, the disadvantage of this algorithm is its lack of flexibility. When the filter size changes in a CNN, the parameter matrices used for the transformation have to be changed. In addition, the process also generates intermediate data that needs to be stored [12].
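A minimal single-channel, stride-1 NumPy sketch of the FFT approach is shown below; it is illustrative only (not cuDNN's FFT path). Note how the R x S kernel is zero-padded to roughly the input size, which is the extra memory cost discussed above; the kernel is flipped so that the frequency-domain product yields the cross-correlation used in CNNs.

import numpy as np

def fft_conv2d_single(x, w):
    """One-channel FFT 'convolution' (the valid P x Q window), illustrative only."""
    H, W = x.shape
    R, S = w.shape
    fh, fw = H + R - 1, W + S - 1                 # full linear-convolution size
    Xf = np.fft.rfft2(x, s=(fh, fw))              # input in the frequency domain
    Wf = np.fft.rfft2(w[::-1, ::-1], s=(fh, fw))  # flipped, zero-padded kernel
    full = np.fft.irfft2(Xf * Wf, s=(fh, fw))     # element-wise product, then inverse FFT
    return full[R - 1:H, S - 1:W]                 # keep the valid (H-R+1) x (W-S+1) part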

2.3 Related Work

Since convolutional neural networks were introduced to the public, few studies have focused on the comparison between the convolution algorithms. At present, the best way to evaluate the convolution algorithms is to refer to the experimental data provided by several algorithm developers. Mathieu et al. (2013) show the performance of the FFT algorithm [13, 19]. Chetlur et al. (2014) compare the implicit GEMM, explicit GEMM and direct algorithms in their work [5]. Lavin et al. (2015) show the advantages of the Winograd algorithm compared to the GEMM and FFT algorithms [14]. However, through these years of development, the implementations of the algorithms in the GPU environment have become diversified. For example, the GEMM algorithm has three implementations. Although these implementations execute the same algorithm, their performance is completely different, so it is necessary to conduct a comprehensive evaluation of these implementations. It should also be noted that there are many studies comparing the performance of different DNN frameworks, such as [20, 21]. But our work aims to show the characteristics of different convolution algorithms within the same framework. When the user selects


the appropriate framework, we give our optimization suggestions for reference. Meanwhile, based on our experimental results, we design an Optimized Algorithm Scheduling Strategy. Through this strategy we can improve the computational efficiency of CNNs.

3 Experimental Methodology

We conduct two experiments in our work. In the first experiment, we compare the characteristics of the different implementations: we measure their execution time and memory usage. In order to identify the performance-limiting factors for each algorithm, we select a customized convolutional neural network as the workload, because it is representative and flexible enough to simulate many conditions. The default structure parameters of the custom network are as follows: N = 64 (mini-batch), C = 64 (channels), H = 56 (input size), R = 5 (kernel size), K = 128 (filter number), U = 1 (stride size). The choice of these parameters refers to GoogLeNet [4, 8]. After that, we adjust the network parameters (N, H, R, K, U) with a variable-controlling approach, changing one of them while keeping the others constant. In this way, we can observe how performance changes with that parameter.

Table 1. System configuration

CPU                Intel Core i7-6700k (4.00 GHz)
GPU                NVIDIA GeForce GTX 1080
Main memory        8 GB
GPU memory         8 GB
Operating system   Ubuntu 16.04 LTS
Framework          Caffe v1.0.0
Libraries          CUDA 8.0; cuDNN 6.0; cuda-convnet2

In the second experiment, we measure the execution speed of AlexNet in the GPU environment using different algorithm scheduling strategies. Based on our previous experiments, we suggest possible optimization techniques to improve the speed of CNNs. We also design an optimized algorithm scheduling strategy, which assigns the best-suited implementation to each of AlexNet's convolutional layers. We verify our strategy by comparing it with the Caffe+CUDA and Caffe+cuDNN strategies. Our experiments are performed on the system described in Table 1, which also lists the versions of the deep learning framework and libraries used. We use Caffe to build our CNN models. Since cuDNN does not support direct convolution, we implement the direct convolution algorithm with cuda-convnet2.


4 Comparison of Algorithms

The characteristics of an algorithm are reflected by its execution efficiency under different conditions. In this section, we characterize the seven implementations of the convolution algorithms in the GPU environment (implicit GEMM, implicit-precomp GEMM, explicit GEMM, FFT, FFT-tiling, Winograd, and direct convolution). We measure the runtime and memory usage to compare the performance of the seven implementations with respect to different input image sizes, kernel sizes and stride sizes. In this way, we show the influence of the network structure on the performance of the implementations. For convenience, we refer to implicit GEMM as GEMM1, implicit-precomp GEMM as GEMM2, explicit GEMM as GEMM3, the traditional FFT as FFT1, and FFT-tiling as FFT2.


Fig. 4. The impact of the input size on performance

4.1 Input Size

Figure 4 shows the performance of all algorithms with different input sizes. Beyond a small input size (about 20), the performance advantage of FFT2 becomes more and more obvious. The Winograd algorithm has a similar runtime to FFT2, but it experiences an out-of-memory error when the input size reaches 160. The runtime of FFT1 fluctuates when the input size is around 64. The reason is that, for different input sizes, FFT1 calls different functions or libraries to calculate the Fourier transform, and one of the thresholds is 64; as a result, FFT1 gives the worst performance in our experiment when the input size is 80. GEMM1 and GEMM2 still consume the least memory. GEMM3, FFT1 and the Winograd algorithm experience out-of-memory errors when the input size is equal to 100, 140 and 160, respectively. Interestingly, the memory usage of FFT2 is less than that of the direct algorithm when the input size is greater than 80. The main reason is that these two algorithms use different acceleration libraries.


Fig. 5. The impact of the kernel size on performance

4.2 Kernel Size

Figure 5 shows the performance of all algorithms with different kernel sizes. Note that the Winograd algorithm only supports 3 × 3 and 5 × 5 kernels, so its runtime is reported with two dots in Fig. 5. In addition, the direct algorithm cannot support all the given filter numbers in our experiment. For a small kernel size (kernel size < 5), GEMM2 is faster than FFT2, and the Winograd algorithm has a similar runtime to GEMM2. But when the kernel size is greater than 5, FFT2 gives the best performance, and FFT1 is a bit slower than FFT2. Moreover, their runtime tends towards a constant value when the kernel size is smaller than 32. The reason is that the FFT algorithm needs to perform a Fourier transform on the kernel, and the kernel is padded to the same size as the input; so the kernel size has essentially no effect on the calculation of the FFT algorithm. Since FFT2 divides the input into 32 × 32 tiles [12], it experiences an error when the kernel size is equal to 32. Interestingly, for the GEMM algorithm, the runtime trend of GEMM1 and GEMM2 is arched. By calculating the number of multiplications of GEMM (O(P² · R²), with R = (H − P)/S + 1), a quadratic function is obtained, which matches the trend of the GEMM runtime in Fig. 5. In terms of memory usage, GEMM3 has the highest consumption, and it even experiences an out-of-memory error when the kernel size is equal to 9. The memory usage of the other algorithms, however, is not affected by the kernel size, so their memory consumption is essentially unchanged.

Fig. 6. The impact of the stride size on performance

4.3 Stride Size

Figure 6 shows the performance of all algorithms with different stride sizes. Only the GEMM algorithm passes all tests, because the FFT and Winograd algorithms only support a stride size of 1, and the direct algorithm has an upper bound on the stride size. When the stride size is greater than 1, GEMM2 gives the best performance. It can be seen that the runtime and memory consumption curves of GEMM are hyperbolic: the stride size affects the amount of data that needs to be processed, and the larger the stride, the less data there is.

In conclusion, FFT2 is the fastest implementation for training a CNN model with large input sizes (larger than 20 in our experiment) and large kernel sizes (larger than 5 in our experiment), due to its low arithmetic complexity and its block operation. The Winograd algorithm has similar performance, but considering the memory usage, we prefer FFT2. FFT1 is a bit slower than FFT2 when computing convolution with a large input size, but for a small input size (smaller than 20), FFT2 is slower than FFT1. For small kernel sizes and input sizes, the Winograd algorithm and GEMM2 are a good choice. But GEMM2 is more flexible than the Winograd algorithm or FFT2, because the Winograd algorithm only supports 3 × 3 and 5 × 5 kernels, and FFT2 and the Winograd algorithm only support a stride size of 1. Moreover, GEMM2 always occupies the least memory space, so it is well suited to cases where memory is limited.

5 Optimized Scheduling Strategy

As mentioned earlier, for the same neural network structure, different implementations of the convolution algorithms often have different performance. So the diversity of a CNN's layer sizes and the different performance of the implementations demand an efficient scheduling strategy that assigns the appropriate implementation to each convolutional layer. In this way, we can optimize both power efficiency and performance.

By analyzing our experimental data, we propose an optimized algorithm scheduling strategy. The strategy selects the algorithm according to the structure parameters of the current convolutional layer of the neural network. For each layer, the strategy reads the model parameter file to obtain the input map data structure <N, C, H, W>, the weight data structure <K, C, R, S>, and the stride size U. After that, the strategy examines the parameters U, H, and R in turn. The basic flow is as follows (see Fig. 7): when U is greater than 1, our strategy assigns implicit-precomp GEMM as the convolution implementation for the current layer; according to our experimental results, this implementation works best in this case. But if U equals 1, we examine the value of H. If H is greater than 16, our strategy assigns FFT-tiling as the implementation for the current layer; according to our analysis of FFT-tiling's characteristics, this implementation performs better than the others when H > 16. If H is less than or equal to 16, our experiments show that FFT-tiling is not the best choice, and our strategy re-selects an implementation according to the value of R.


Fig. 7. The workflow and pseudocode of Optimized Algorithm Scheduling Strategy
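The selection flow of Fig. 7 can be condensed into a few lines of Python. The sketch below is illustrative rather than the authors' code; the kernel-size branch follows the rule described in the next paragraph, and the example layer tuples are typical AlexNet-style (stride, input size, kernel size) values rather than measured configurations.

def choose_implementation(U, H, R):
    """Select a convolution implementation for one layer (sketch of the flow in Fig. 7).

    U: stride size, H: input size, R: kernel size.
    """
    if U > 1:
        return "implicit-precomp GEMM"   # only GEMM handles strides larger than 1 well
    if H > 16:
        return "FFT-tiling"              # large inputs favour the tiled FFT
    if R in (3, 5):
        return "Winograd"                # the only kernel sizes Winograd supports here
    return "FFT"

# Example: AlexNet-like layers (stride, input size, kernel size).
for layer in [(4, 224, 11), (1, 27, 5), (1, 13, 3)]:
    print(layer, "->", choose_implementation(*layer))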

When R is equal to 3 or 5, our strategy assigns the Winograd algorithm as the implementation for the current layer. If R is not equal to 3 or 5, according to our experimental results, the FFT implementation is the best choice. With this strategy, we can obtain the implementation with the best performance for each convolutional layer in the GPU environment. By reducing the execution time, we also reduce energy consumption. The workflow also shows that our strategy involves only the structural parameters of the network and does not depend on the operational data or the execution process of the CNN. This allows us to execute our strategy in advance, so that the operation of the CNN is not affected.

In order to verify our strategy, we compare it with the two scheduling solutions of Caffe+CUDA and Caffe+cuDNN. The Caffe+CUDA solution relies on the GEMM algorithm and uses the CUDA library to accelerate the convolution operation in the GPU environment; there is no algorithm scheduling in this solution. The Caffe+cuDNN solution uses multiple algorithms to accelerate CNNs. It uses cudnnGetConvolutionAlgorithm(), which serves as a heuristic for finding a suitable algorithm for cuDNN convolution given the layer specifications. In our experiments, we use these three strategies to accelerate the AlexNet network and measure their execution time respectively. The experimental results are shown in Fig. 8.

From the experimental data, it can be seen that Caffe+CUDA is the slowest, because it only uses the GEMM algorithm, which is inefficient for executing AlexNet with its different layer structures. In contrast to Caffe+CUDA, Caffe+cuDNN has a variety of convolution algorithms; it chooses an appropriate algorithm or implementation to accelerate each convolutional layer of AlexNet, and in this way further improves the computational efficiency of CNNs. In our experiments, Caffe+cuDNN increases the speed by 2.3x compared with Caffe+CUDA. According to the data structure parameters of the neural network, our solution assigns the most suitable convolution algorithm to each convolutional layer of AlexNet, so as to achieve the maximum acceleration effect.


Table 2. Convolutional algorithms arranged by different strategies for each convolutional layer of AlexNet in GPU environment.

AlexNet        conv1        conv2        conv3        conv4        conv5
config         Stride = 4   Stride = 1   Stride = 1   Stride = 1   Stride = 1
Caffe+CUDA     GEMM         GEMM         GEMM         GEMM         GEMM
Caffe+cuDNN    GEMMᵃ        GEMMᵃ        Winograd     Winograd     Winograd
Our strategy   GEMMᵃ        FFT-tiling   Winogradᵇ    Winograd     Winograd

GEMMᵃ: implicit-precomp GEMM. Winogradᵇ: another implementation of Winograd.

Our strategy increases the speed by 2.8x compared with Caffe+CUDA; meanwhile, it is 1.2x faster than Caffe+cuDNN. In order to further explore the differences between these strategies, we record the algorithms that each of them chooses for each layer of AlexNet (Table 2). In conv1, conv4 and conv5, our strategy and Caffe+cuDNN choose the same implementation to speed up the convolution operations. But in conv2, Caffe+cuDNN chooses implicit-precomp GEMM, while our strategy chooses FFT-tiling; in comparison, our strategy, which speeds up the convolution operations of this layer by 40% over Caffe+cuDNN, is more efficient. Similarly, in conv3, our strategy chooses another implementation of the Winograd algorithm, making it 10% faster than Caffe+cuDNN.


Fig. 8. The execution time of AlexNet using different strategies in GPU environment

Our experiments show that our strategy is better than Caffe+CUDA or Caffe+cuDNN. It should be noted that our strategy only needs to read the structural parameters of the current layer of the CNN, and has nothing to do with the data actually participating in the calculation. In this way, we can apply our strategy in advance to arrange the calculation for each convolutional layer in the CNN, so the scheduling itself does not affect the actual execution of the CNN.


6 Conclusion

The convolutional neural network has become a hot topic in current research. Our work compares the performance of popular convolution algorithms in the GPU environment under the same framework. Based on our experiments, we find that choosing convolution algorithms carefully can make a CNN model execute its convolution layers faster. For this reason, we propose an optimized algorithm scheduling strategy, which can assign the best implementation to each convolutional layer. This strategy is simple and does not affect the implementation of the CNN. Experiments show that using our strategy can speed up the execution of a CNN model by 1.2x to 2.8x.

References

1. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
2. Sze, V., et al.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(2), 84 (2012)
4. Simard, P., Lecun, Y., Denker, J.S.: Efficient pattern recognition using a new transformation distance. In: Advances in Neural Information Processing Systems (NIPS 1992), pp. 50–58 (1992)
5. Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. Computer Science (2014)
6. Mathieu, M., Henaff, M., Lecun, Y.: Fast training of convolutional networks through FFTs. Eprint Arxiv (2013)
7. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks, pp. 4013–4021. Computer Science (2015)
8. Cheng, J., Grossman, M., Mckercher, T.: Professional CUDA C Programming. Wiley, New York (2014)
9. Jia, Y., et al.: Caffe: Convolutional Architecture for Fast Feature Embedding, pp. 675–678 (2014)
10. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
11. Krizhevsky, A.: cuda-convnet2 (2014). https://github.com/akrizhevsky/cuda-convnet2/
12. NVIDIA: CUDNN User Guide (2017). https://developer.nvidia.com
13. Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. Statistics (2015)
14. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a Matlab-like environment for machine learning. In: BigLearn, NIPS Workshop (2011)
15. Szegedy, C., et al.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition, pp. 1–9. IEEE (2015)
16. Lecun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (2014)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Computer Science (2014)
18. He, K., et al.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)
19. Vasilache, N., Johnson, J., Mathieu, M., et al.: Fast convolutional nets with FBFFT: a GPU performance evaluation (2014)
20. Li, X., et al.: Performance analysis of GPU-based convolutional neural networks. In: International Conference on Parallel Processing, pp. 67–76. IEEE (2016)
21. Kim, H., et al.: Performance analysis of CNN frameworks for GPUs. In: IEEE International Symposium on Performance Analysis of Systems and Software, pp. 55–64. IEEE (2017)

Data Analysis of Blended Learning in Python Programming

Qian Chu¹,², Xiaomei Yu¹,², Yuli Jiang¹,², and Hong Wang¹,²

¹ Institute of Information and Engineer, Shandong Normal University, Jinan, China
[email protected]
² Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan 250014, Shandong, China

Abstract. The rapid emergence of blended learning has sparked a great deal of research interest in the field of educational data mining. We apply the novel educational form of blended learning in the undergraduate curriculum of python programming. Using a questionnaire administered before the course to capture the basic information of the undergraduate students, we design educational resources and activities for online studying and face-to-face teaching. Since the learning process of each student is captured continuously, we make weekly teaching and learning evaluations to improve the current teaching methods and thus arouse students' interest in continuous learning. By analyzing the data and mining the knowledge obtained in the process of blended learning, some beneficial results are gained that promote the quality of blended learning in the undergraduate curriculum of python programming, and benefit the undergraduate students as well as higher education in the long run.

Keywords: Blended learning · Education · Python · Data analysis

1 Introduction

Blended learning is an effective teaching approach in which learning occurs both online and face-to-face, with the purpose of capturing the educational strengths of both the internet and the classroom. This novel teaching form has opened up a new era of education by moving from the so-called "forced-feeding" method of teaching to a blended learning form that combines online video learning with the flipped classroom. On the one hand, offloading lecture time to video makes it possible for teachers to spend more time interacting with students in class. On the other hand, the flipped classroom enhances oversight and encourages the students to take part in class activities. Therefore, we introduce blended learning into the undergraduate curriculum of python programming.

Supported by the National Nature Science Foundation of China (No. 61672329, No. 61773246) and Shandong Normal University's Educational Project for Blended Learning (No. 2016KG79, No. 2016JG54).


In fact, most first-year students in the university are required to take a semester course on computer science, a large portion of which is based on their interests and background in computer science. The purpose of the course is twofold: (1) to introduce the python programming language to freshmen and show them common data analysis and machine learning libraries in python as well as their uses; and (2) to teach them the basics of programming and introduce the engineering problem-solving methodology. Considering that the students enrolled come from diverse majors and show sharp differences in their motivation and study attitudes, we carry out a simple questionnaire before the course to capture the basic information of the undergraduate students. Then we design educational resources and activities for online studying and face-to-face teaching, as well as for selective and personalized study. Since the learning process of each student is captured continuously, we make weekly teaching and learning evaluations to improve the current teaching methods and thus arouse students' interest in continuous learning. Finally, data preprocessing, such as cleaning and standardization, is performed on the data collected in the process of blended learning, with the purpose of promoting the quality of blended learning in the undergraduate curriculum of python programming.

The main contributions of this paper are outlined as follows:
– The questionnaires before and after the course are designed to capture the basic information of the undergraduate students and to further improve the blended learning in the following weeks.
– Creative work, including personalized teaching videos and educational resources, is produced to benefit the whole blended learning process.
– By analyzing the data and mining the knowledge obtained in the process of blended learning, some beneficial results are gained that promote the quality of blended learning in the next semester.

The remainder of this paper is organized as follows. In Sect. 2, we review related work on blended learning. After addressing some relevant problems, we present our blended learning process in Sect. 3. Based on analyzing data and mining knowledge in blended learning, Sect. 4 describes the methods of teaching and learning evaluation. Finally, some conclusions are outlined and future work is presented in Sect. 5.

2 Related Work

Blended learning is a learning approach that retains the values of traditional education while incorporating advanced educational technologies. In 2008, Garrison and Vaughan introduced three cases of blended learning: in a small politics class, blended learning enables students to gain more meaningful experience, while in a large chemistry class, blended learning is used to increase teacher-student interaction and improve problem-solving skills [1]. Graham et al. put forward a framework for the adoption and implementation of blended learning based on an investigation of six American universities. The institutions are divided into three categories:


awareness or exploration, adoption and early implementation, and mature implementation and growth [2]. In general, blended teaching can be understood in this way: the students acquire knowledge not only in the classroom but also in online courses in their extracurricular time, so they have the freedom to control the progress of learning themselves. With the two study phases connected as a whole, the students' individual learning needs are met. In 2003, Professor He first proposed the concept of blended learning in China. Combining the advantages of traditional learning and e-learning, blended learning further emphasizes the leading guidance of teachers as well as the dominant position of students' studying abilities. The theory of blended learning has caused a revolution in the education sector [3]. In 2006, Huang et al. addressed the use of appropriate technology and means to impart knowledge to students, so as to optimize the effect of learning [4]. Professor Li Kedong believes that blended learning can be applied on the appropriate teaching platform and media, so as to benefit the effect of learning [5]. The theory of blended learning has been continuously improving and developing.

3 Practice of Blended Learning in Python Programming

In order to follow the idea of "Internet + education" and improve the quality of teaching, we choose the elective course "python programming" at our college as a practical object, and then introduce blended learning to explore the effect of teaching in practical courses.

3.1 Analysis Before Class

Before implementing the form of blended learning, we design a questionnaire with the aim of fully understanding the students' basic information, such as learning ability, cognition, and so on. In total, 98 copies are issued and all of them are returned, so the effective rate is 100%.

Basic Information About Students. The subjects are four-year undergraduate students from different grades and schools, and the course of python programming is organized as a mixed class. The students' cognitive styles, computer knowledge and learning habits differ; that is to say, the basic information of the students should be conducive not only to teaching them according to their aptitude, but also to a reasonable teaching plan for the content. The students in the class come from 12 different colleges in this university, including the school of Information Science and Engineering, the school of Music, the school of Physics and Electronic Sciences, etc. The analysis of the basic information of the students is shown in Table 1. As the proportion of boys and girls is close to 1:1, the teachers have to take the differences between boys and girls into account. In fact, the boys have some advantages in practical abilities, while the girls have more advantages in careful thinking, which makes it possible for the teachers to strengthen


unity and cooperation among the students, so as to promote complementary advantages. Based on the actual situation, the curriculum designed for python programming is as follows: the content of the course emphasizes basic theoretical knowledge and important technical practice; the face-to-face teaching adopts case-driven teaching and group collaboration methods; and the online teaching content adopts a task learning list driven by a combination of a variety of learning forms, including online videos, knowledge point testing, projects and so on.

Table 1. Basic information about students.

Grade       Gender   School
                     Computer   Management   Others
Freshman    Male     13         3            12
            Female   2          15           27
Sophomore   Male     13         0            3
            Female   0          0            9

Students' Level of Basic Knowledge. There are differences in the students' basic computer knowledge. For example, for the question "Have you ever studied a programming course?", the proportion of those who have studied programming to those who have never studied programming is close to 1:1, which means that nearly half of the students have no knowledge of programming or only a very weak foundation. Continuing with the question "What do you think of your ability to program now?", the results are shown in Fig. 1; from level 1 to level 5, the ability decreases. About 50% of the students place themselves at the weakest level, and the proportion of students with poor programming ability is about a further 33%, so nearly 90% of the students have poor programming skills. They may consider this course too hard to understand; therefore, the teachers may explain to them that the purpose of the course is to teach basic applications, so as to enhance their self-confidence in studying such a course.

Students' Attitude to Blended Learning. Blended learning differs from traditional teaching, so prior to the class, teachers should be aware of the students' attitudes towards blended learning. For the question "Which teaching method do you prefer?", the results show that more than half of the students tend towards the blended learning model. What is more, for the question "Do you support learning the course in a blended learning fashion?", 82% of the students support the method of blended learning and only a few students hesitate or hold an opposing attitude, indicating that most students have a positive attitude towards mixed-type courses, which is important for the process of blended teaching.


Fig. 1. The feedback students’ ability for blended learning

From the above it can be seen that there are large differences in the students' basic programming abilities, but the students have a strong motivation to learn and most of them are willing to try this new approach in the curriculum.

3.2 Resources and Activities Design

Selecting an appropriate learning platform helps teachers control the entire teaching process better. According to the studies of Xu [6], the online learning platform for blended learning is increasingly being transformed from a formal platform into an informal platform, which is more personalized according to the truly customized teaching situation. For the course of python programming, we select Superstar online as our blended learning platform.

Teaching resource design is essential to enrich the learning resources on the online platform. Teachers add modules and upload learning resources to the Superstar learning platform, which meets their individual teaching needs, and micro-lesson videos are recorded. Compared with the existing open classes or excellent courses on the internet, 8- to 10-minute high-efficiency micro-classes help to make a deep impression on students. Moreover, homemade micro-classes can satisfy the needs of the students more accurately, and the students themselves can flexibly adjust the learning content. The online python programming resources include a general introduction to python, basic python operations, python data structures, python data reading and data analysis, project practice, etc. Each week's course is equipped with an autonomous learning task sheet to give guidance in learning. Other modules, such as the homework module and the Q&A module, are set up to meet our needs.

In the course of blended learning, information technology is regarded as a means of teaching, and the teachers are not replaced by technical means [7]. In the teaching activities, the learning activities for students mainly include collaborative learning, self-learning, physical learning, practical learning and so on. In addition, interaction among students and teachers is also included. Such a learning style is an indispensable part, though it is more informal, as online forums play an important role in the necessary communication. Moreover, the teachers obtain a large amount of information from the forums, for example the students'


understanding of and opinions about the mixed course. In this study, the forum provides online communication between teachers and students, as well as among students. The teachers improve the management and resources on the platform, monitoring and regulating the learning activities in class. After each class, the teachers design a micro-questionnaire to track the students' learning and to provide suggestions for the next class. The micro-questionnaire contains two or three questions as a standard, in order to capture the learning effect and possible problems in teaching.

3.3 Blended Learning Process

With the development of the Internet and educational technologies, blended learning is better able to meet the study needs of students. Before the class, the teachers place the necessary recorded videos on the teaching platform, and the students can control the learning process themselves; when difficult points are met, they can watch the videos repeatedly. In the face-to-face teaching, the teachers focus in the classroom on the key learning points. For problems that arise in practice, the students communicate with their classmates or ask the teachers during class time. In this way, the students' hands-on abilities in practice are strengthened. Compared with traditional teaching, the students' passive acceptance of the rigid knowledge in the book is changed, so that practical skills and abilities are achieved.

For each session of python programming, the students study autonomously on the teaching platform, referring to the autonomous learning quiz, watching instructional videos, etc. For questionable points of knowledge, they discuss with classmates or teachers in the discussion forum. Generally speaking, a group consists of five or six students, and the members of the group work together; moreover, different groups communicate with each other. The form of group learning enables everyone to participate in the learning activities, which greatly stimulates the students' enthusiasm for learning. The teachers guide the students to focus on memorizing the knowledge points and resolve the students' doubts, using a task-driven approach that links theory and practice by having the students "do". As a result, a project or some small exercises are selected to encourage the students to demonstrate their learning results in the classroom, and classroom time is left for them to demonstrate their experiments and experience in using python if necessary.

4 The Analysis of Evaluation Results

In modern education, the public pays more attention to the performance of students in practice and prefers multiple forms of evaluation. Therefore, the goal of teaching has changed to "promoting the all-round development of students".

4.1 Subjective Teaching Effectiveness Survey

After the course, questionnaires and interviews are used to gain insight into students' thinking about blended learning. From the students' evaluation of


the course, it can be seen that improvement of students' learning ability, as well as outstanding scores in student competitions, has been achieved. Moreover, blended learning brings advantages in learning habits for both teachers and students. With this novel form of blended learning, the cooperative ability of students is improved: about 89% of the students think that their abilities have improved. The teachers then communicated with the remaining students to find out possible reasons affecting their improvement; these factors are the key points for improving blended learning. As to whether or not they support blended learning, 49% of the students strongly support it and 47% support it, while only 4% oppose it. In the questionnaire survey before the blended course, by comparison, 35% of the students strongly supported it, 45% supported it, and 20% opposed it. It can be seen that the overall number of students who support it rose markedly, while the number who oppose it decreased remarkably. For the question "With blended learning, what do you benefit from most in the course?", 40% of the students gained knowledge about Python and improved their programming ability, 22% increased their ability to collaborate, 24% increased their ability to discover and solve problems, and 14% became more interested in learning. Altogether, these factors promote the improvement of the students' overall abilities. For the question "Do the blended learning activities boost your learning in the Python class?", 53% of the students consider it very helpful and 38% consider it helpful, which shows that blended teaching helps students learn better. The results are shown in Fig. 2.

Fig. 2. The analysis of evaluation results

Can you finish the learning tasks easily and in time? For this question, about 30% of the students delay completing the tasks or cannot complete the learning tasks on time, which indicates that the learning efficiency of these students is not high. On the main factors affecting students' completion of the learning tasks, 31% of the students think the tasks are inconvenient to complete, 17% think it inconvenient to use the computer network, and 30% think that they lack good learning habits. In order to enhance the students' sense of responsibility in the course, the teachers should adjust the


learning tasks to a reasonable level and make a more humanized arrangement to ensure that the students have enough time to complete the tasks. In fact, about 54% of the students believe that blended learning puts forward higher requirements for both teaching and learning, which requires cooperation between teachers and students to coordinate learning activities and improve their adaptability to blended learning.

4.2 Objective Teaching Effect Survey

After a semester of blended learning activities, the teaching platform recorded the students' learning activities in detail and produced data such as the number of visits, students' scores, video viewing durations, the number of discussions, chapter tests, and so on. With the data obtained, processing steps such as preprocessing and standardization are applied to extract valuable information about blended learning. Regarding access time, 73.44% of the students visited the learning platform between 16:00 and 24:00. Thus, most of the students studied after completing their daytime courses and used free time in the evening. They visit the learning platform in the dormitory or study room to review or preview the course, which means the students' online learning is achieved mainly on computers. Therefore, the next improvement is to extend blended learning to the more flexible fashion of mobile terminals.

5 Summary

In this study, a semester of teaching with the novel form of blended learning is introduced and improvements to blended learning are made, which enrich the teaching methods and achieve better-than-expected teaching effects in the undergraduate curriculum of Python programming. However, the research and implementation of blended learning are not yet mature, and there is still room for improvement in our following study.

References
1. Garrison, D.R., Vaughan, N.D.: Blended Learning in Higher Education: Framework, Principles, and Guidelines. Wiley, New York (2008)
2. Graham, C.R., Woodfield, W., Harrison, J.B.: A framework for institutional adoption and implementation of blended learning in higher education. Internet High. Educ. 18, 4–14 (2013)
3. Kukang, H.: From blending learning to see the new development of educational technology theory (Part Two). China Electrochem. Educ. 3, 5–10 (2004)
4. Ronghuai, H.: The Theory and Practice of Blended Learning, pp. 33–35. Higher Education Press, Beijing (2006)
5. Li, K., Zhao, J.: The principle and application of hybrid learning. Electr. Educ. Res. 2(7), 1–6 (2004)


6. Meidan, X.: Study and Design of Mixed Learning Based on WeChat Public Platform. Normal University, Nanjing (2016)
7. Cheng, G., Chau, J.: Exploring the relationships between learning styles, online participation, learning achievement and course satisfaction: an empirical study of a blended learning course. Br. J. Educ. Technol. 47(2), 257–278 (2016)

APs Deployment Optimization for Indoor Fingerprint Positioning with Adaptive Particle Swarm Algorithm

Jianhui Zhao1, Jun Li1, Haojun Ai1,2, and Bo Cai1(✉)

1 School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
[email protected]
2 Collaborative Innovation Center of Geospatial Technology, Wuhan 430079, Hubei, China

Abstract. Indoor positioning services provide people with great convenience, but their efficiency is affected by the spatial deployment of access points (APs). We propose an adaptive particle swarm (APS) algorithm and apply it to APs deployment optimization for fingerprint based indoor positioning. In our method, solutions of APs placement are taken as individuals of one population. The particle swarm method is improved with an adaptive technique to ensure population diversity while avoiding a large number of inferior particles. After the evolutions, the optimal result is obtained, corresponding to the best solution of APs deployment. The algorithm works well for both single-objective and multi-objective optimizations. Experiments with deployments of 107 iBeacons have been carried out in an underground parking lot. Compared with existing APs placement methods, our APS algorithm obtains the least indoor positioning error with a fixed number of APs, and receives the best integrated evaluation considering both positioning error and APs cost with an unfixed number of APs. The proposed algorithm can easily be extended to other kinds of indoor spaces and different types of signal sources.

Keywords: Indoor positioning · APs deployment · Optimization algorithm · Adaptive particle swarm

1 Introduction

Positioning technology can be used in outdoor and indoor environments. Outdoor positioning technology includes GPS, Galileo, the Beidou navigation satellite system, etc. However, the satellite signal attenuates seriously when it penetrates buildings, and the complex indoor environment causes further signal attenuation, so it is impossible to achieve indoor positioning with satellite signals. Currently, there is an increasing demand for indoor positioning, e.g., locating a person in a building, on a floor, or even in a room; finding certain goods in a warehouse; locating a wallet or keys in an office; and so on. Indoor positioning can greatly facilitate people's work and life, and it has received increasing attention from users and researchers [1, 2].

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 218–228, 2018. https://doi.org/10.1007/978-3-030-05057-3_17


The indoor positioning system may use different types of signal sources [3], such as WiFi, Bluetooth, UWB, LED, Ultrasonic, RFID, Infrared, ZigBee, etc. According to the positioning algorithm, indoor positioning mainly includes cell-ID positioning, triangle positioning, multilateral positioning and fingerprint positioning methods [4]. Cell-ID positioning and triangle positioning methods cannot guarantee the accuracy of positioning. Multilateral positioning method has theoretically high positioning accuracy, but it is difficult to obtain time and angle parameters from the ordinary equipment. Fingerprint based positioning collects signal characteristic parameters of different positions in space, and establishes the fingerprint database, then the target position is determined by comparing the received signals with signal characteristic parameters in database. Therefore, fingerprinting method is usually used for indoor positioning. Ma et al. [5] presented an indoor positioning system based on intelligent fingerprint assisted method to calculate the reliable reference point data and develop a learning mechanism through wireless network connection. Xia et al. [6] developed a method of processing the off-line data and an improved KNN positioning method to improve the positioning precision based on fingerprint. Raspopoulos [7] studied the use of deterministic channel modeling through 3D Ray Tracing for constructing the device independent radio maps for WiFi RSSI-based fingerprinting indoor positioning system, which is applicable to different devices. Deploying the signal sources (access points, APs) properly in advance is very important for indoor positioning technology. There are two types of APs deployment approaches: non-optimal and optimization methods, while the non-optimal method mainly refers to uniform deployment. Traditional APs deployment usually adopts the uniform method, i.e., signal sources are evenly distributed in one space. But there may be less accurate positioning result with too few APs, or unnecessary waste with too many APs. In order to achieve both precise indoor positioning and low cost, APs deployment should be optimized. Maximum and Minimum Coverage (MMC) method proposed by Dhillon and Chakrabarty [8] uses the polynomial-time algorithms to determine the number of sensors and their placement to help address the coverage optimization under constraints of imprecise detections and terrain properties. Based on Cramer-Rao Lower Bound (CRLB) and Simulated Annealing (SA), Zhou et al. [9] presented an optimization method for APs placement, which focuses on the error bound analysis of indoor WiFi fingerprint based positioning for intelligent APs placement using Fisher Information Matrix (FIM) to characterize the relationship between positioning errors and signal distributions. Besides, there are other kinds of complex optimization methods [10, 11], e.g., genetic algorithm, artificial immune algorithm, particle swarm optimization, etc., and they can also be used for APs deployment optimization of indoor positioning. Particle Swarm Optimization (PSO) has been widely utilized in the fields of neural network training, function optimization and fuzzy system control. For optimization, PSO has good search depth, but its search breadth is insufficient [12]. Therefore, an adaptive particle swarm algorithm, APS, is proposed to optimize APs deployment for indoor fingerprint positioning. 
The APS can improve the breadth searching ability of traditional PSO, and can generate better global optimization. Compared with existing


optimization algorithms, our proposed method can obtain more optimal indoor APs placement, including both single-objective (e.g., indoor positioning error) and multiobjective (e.g., positioning error and APs cost) evolutions.

2 Our Algorithm for Indoor APs Deployment

In our work, the fingerprint based positioning method is used for indoor location, and positioning error and APs cost are mainly considered in defining the objective function. The adaptive PSO, APS, is implemented for spatial optimization of APs.

2.1 Objective Function and Fingerprint Positioning

To test the efficiency of different optimization algorithms for APs deployment, APs are initially placed on the reference points in the indoor space. Then some of the APs are chosen as one possible solution of deployment optimization, and their locations are estimated with the fingerprint positioning method. The differences between reference points and estimated points are calculated and taken as the positioning error for the selected APs. Considering the positioning error, or both the positioning error and the cost of APs, the objective function value can be computed and then used to evaluate the spatial deployment of APs. With the help of the optimization algorithm, the optimal APs deployment is obtained after iterations of searching. Given the indoor space and the APs parameters, the optimization algorithm can provide installation suggestions for signal sources. Suppose there are n APs in the indoor place, and m of them are selected as one deployment. The chosen APs are evaluated as follows.

1. Place all the n APs evenly in the building for indoor positioning, i.e., their row spacing and column spacing are both k meters, and the coordinates of all the APs are recorded as reference points in a database.
2. For the selected m APs, their coordinates, i.e., the corresponding m reference points, are retrieved from the database.
3. Signal values from the APs are received at each reference point and recorded by a mobile phone as fingerprints in a 2D array, i.e., the data in the jth column and the ith row is the signal value of the jth AP collected at the ith reference point.
4. Coordinates are estimated for each of the selected m APs and recorded as their estimated points.
5. As one solution of APs spatial deployment, the selected m APs are evaluated by the unified objective function:

OF = a \cdot APE + b \cdot COA \quad (1)

APE = \frac{1}{m}\sum_{i=1}^{m}\sqrt{(x_{ei}-x_{ri})^2 + (y_{ei}-y_{ri})^2} \quad (2)


where OF is the objective function, APE is the averaged positioning error, COA is the cost of APs, a and b are their weighting parameters, (x_{ri}, y_{ri}) are the coordinates of the ith reference point, and (x_{ei}, y_{ei}) are the coordinates of the ith estimated point. Obviously, a smaller objective function value means a better APs placement, i.e., a deployment with less positioning error and lower APs cost. The average positioning error is obtained through fingerprint positioning, and the method for obtaining the estimated coordinates of the ith reference point is as follows.

1. For each of the m APs, compare its fingerprint with every fingerprint already recorded in the database, i.e., calculate the Euclidean distance between the two fingerprints.
2. Find the 3 fingerprints with the smallest Euclidean distances w1, w2 and w3 in the database, then obtain the coordinates of the 3 corresponding reference points (x1, y1), (x2, y2), (x3, y3).
3. The estimated coordinates of the ith reference point are computed as:

x_{ei} = \frac{x_1/w_1 + x_2/w_2 + x_3/w_3}{1/w_1 + 1/w_2 + 1/w_3} \quad (3)

y_{ei} = \frac{y_1/w_1 + y_2/w_2 + y_3/w_3}{1/w_1 + 1/w_2 + 1/w_3} \quad (4)

where the coordinates of the ith estimated point are calculated from the 3 reference points with the most similar fingerprints, and the more similar reference point has the larger effect, since the reciprocal of the Euclidean distance is used as the weighting parameter. For the application of indoor positioning, the position of the mobile phone is estimated in the same way. In our work, the mobile phone is assumed to be placed at each AP location one by one. Thus the reference points and their estimated points can be used to define the positioning error of the related deployment.
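As a concrete reading of Eqs. (1)–(4), the following Python sketch evaluates one candidate deployment; the names db_fps (the stored n x n fingerprint matrix), query_fps (an independently measured set of test fingerprints) and ref_coords (reference-point coordinates) are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def estimate_point(query_fp, db_fps, ref_coords):
    """Weighted 3-NN estimate of Eqs. (3)-(4): the three most similar stored
    fingerprints vote with reciprocal-distance weights."""
    d = np.linalg.norm(db_fps - query_fp, axis=1)   # distance to every stored fingerprint
    nearest = np.argsort(d)[:3]
    w = 1.0 / np.maximum(d[nearest], 1e-9)          # guard against a zero distance
    return (ref_coords[nearest] * w[:, None]).sum(axis=0) / w.sum()

def objective(selected, db_fps, query_fps, ref_coords, a=1.0, b=0.0):
    """Unified objective of Eqs. (1)-(2), OF = a*APE + b*COA, for one candidate
    set of AP indices; fingerprints are restricted to the chosen APs' columns."""
    cols = np.asarray(selected)
    errors = [np.linalg.norm(estimate_point(query_fps[i][cols],
                                            db_fps[:, cols], ref_coords)
                             - ref_coords[i])
              for i in selected]
    return a * float(np.mean(errors)) + b * len(selected)   # COA = number of APs
```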

2.2 Implementation of Adaptive PSO

The PSO algorithm easily falls into a local optimum, so we improve the optimization algorithm to increase its breadth searching ability. The improved PSO is named the Adaptive Particle Swarm (APS) algorithm, which can ensure population diversity and avoid the introduction of a large number of inferior particles. The basic idea of APS is: (1) set up a threshold; (2) take a particle as an excellent particle if its objective function value is no more than the threshold, otherwise take it as an inferior particle; (3) increase the threshold adaptively if the number of excellent particles is too small, or reduce the threshold adaptively if it is too large. The proposed APS is applied to the problem of indoor positioning APs placement optimization, and the procedure is as follows.


1. Initialize the population of particles and calculate the objective function value of each particle; the specific process is described in Pseudocode 1.
2. Initialize the history optimal value "pbest" for each particle, and initialize the population optimal value "gbest" over all particles, i.e., "gbest" is the least of all "pbest" values in the same generation.
3. Adaptively adjust the particles to obtain a better optimization result; the specific process is described in Pseudocode 2.
4. Update the individual history optimal value "pbest" with the least objective function value for each particle, and the population optimal value "gbest" with the least objective function value over all particles.
5. Take "gbest" as the result if the maximum number of iterations is reached or if "gbest" satisfies the requirement, and take the particle corresponding to "gbest" as the optimal solution of APs deployment; otherwise, go to Step (3).

For each particle, sizepop is the population size, num[i] is the number of APs in the ith particle, and fitness[i] is the objective function value of the ith particle. [Xmin, Xmax] is the x coordinate range of indoor space, [Ymin, Ymax] is the y coordinate range of indoor space, and [Vmin, Vmax] is the velocity range of particles. Based on the coordinates, the corresponding AP is determined, thus our method can deal with randomly labeled APs in indoor space.


The threshold Tgood is predefined to decide one particle as an excellent or inferior particle. The number of excellent particles is goodnum, while Nmax is the maximum number of excellent particles and Nmin is the minimum number of excellent particles. Values v1/v2 are defined to decrease/increase the threshold Tgood, and we set v1 > v2 to avoid the large number of inferior particles.
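Pseudocodes 1 and 2 are given as figures in the original paper and are not reproduced here; the minimal sketch below only illustrates the adaptive threshold rule described above, with parameter names and defaults chosen for illustration.

```python
def adjust_threshold(fitness_values, t_good, n_min=5, n_max=20, v1=0.05, v2=0.03):
    """Adaptive threshold rule of APS: particles with objective value <= t_good
    count as excellent; shrink the threshold by v1 when there are too many of
    them, grow it by v2 when there are too few (v1 > v2 limits inferior particles)."""
    good_num = sum(f <= t_good for f in fitness_values)
    if good_num > n_max:
        t_good -= v1      # too many excellent particles: tighten the threshold
    elif good_num < n_min:
        t_good += v2      # too few excellent particles: relax the threshold
    return t_good
```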

3 Experimental Results and Analysis

Based on experiments of APs deployment for fingerprint based indoor positioning, the efficiency of our APS algorithm is analyzed. Our approach is also compared with existing optimization methods, including the Maximum and Minimum Coverage (MMC) based method, the Cramer-Rao Lower Bound (CRLB) based method, the Genetic Algorithm (GA) and the Artificial Immune Algorithm (AIA).

3.1 Testing Environment

In our work, all the algorithms are tested in the positioning space of one underground parking lot, as shown in Fig. 1. Taking iBeacons as APs, there are 107 signal sources.


The APs are evenly deployed in the indoor parking lot, with row spacing and column spacing about 4.5 m.


Fig. 1. Positioning space of underground parking lot with 107 iBeacons

The fingerprint database is acquired as follows: take the 107 AP locations as reference points; at each reference point, use an Android mobile phone oriented in the same direction to collect the received signal strength from each AP; obtain 600 sets of data from each reference point within 1 min, with one collection every 100 ms; store the collected data of each reference point as one XML file; calculate the average of the 600 sets of data and take it as the fingerprint of the reference point. Our experiments are executed on a computer with an Intel(R) Core i5-6400 processor, 2.70 GHz CPU, 8 GB memory, an NVIDIA GeForce GTX 1050 Ti graphics card, the Win7 64-bit operating system, and VS2015.
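As a small illustration of this database construction step, the repeated RSSI collections at one reference point can be averaged into a single fingerprint vector; the array shape in the comment is an assumption matching the 600 x 107 collection described above.

```python
import numpy as np

def build_fingerprint(rssi_sets):
    """Average repeated RSSI collections (assumed shape: 600 sets x 107 APs)
    into one fingerprint vector for a reference point."""
    return np.asarray(rssi_sets, dtype=float).mean(axis=0)
```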

3.2 Parameters of Each Algorithm

Based on the positioning space of underground parking lot, the parameters of our APS algorithm are set as: Xmin is 4832, Xmax is 4863, Ymin is 5353, Ymax is 5449, Vmin is −30 (speed in the negative direction), Vmax is 30 (speed in the positive direction), the initial Tgood is 7.1, Nmin is 5, Nmax is 20, v1 is 0.05, v2 is 0.03, c1 and c2 are 2.0, smin is 0, smax is 10, rxinit is 15.0, rxfinal is 5.0, ryinit is 45.0, ryfinal is 5.0. The parameters of MMC based algorithm are the same as Reference [8]. The parameters of CRLB based algorithm are the same as Reference [9]. The parameters of GA algorithm are set as: the occurrence probability of the cross operations is 0.5, the occurrence probability of the mutation operations is 0.2. The parameters of AIA algorithm are set as: the cross mutation rate is 0.85, the mutation rate of a single gene is 0.65, the parameter of diversity evaluation is 0.95. For all the above optimization algorithms, the population size is 50, the maximum number of iterations is 80, the maximum number of APs is 107. When number of APs is fixed, the objective function only considers the average positioning error, which is:

OF = 1.0 \cdot APE + 0.0 \cdot COA \quad (5)

When the number of signal sources is unfixed, the objective function considers both the average positioning error and the cost of APs, i.e., the number of APs, since they are all iBeacons in our experiments:

OF = 1.0 \cdot APE + 0.075 \cdot COA \quad (6)

3.3 Performances with Fixed APs Number

When the number of APs is fixed, all algorithms are compared considering only the average indoor positioning error for the deployment optimization. There are 88 tests performed for each algorithm, corresponding to numbers of APs from 20 to 107. As illustrated in Fig. 2, our APS algorithm obtains the minimum location error with the same number of APs as the other methods, and thus gives the best deployment of APs. From the experimental results, the performances of all related algorithms in ascending order are: CRLB, MMC, GA, AIA, APS.

Fig. 2. Positioning errors from MMC, CRLB, GA, AIA, APS

3.4 Performances with Unfixed APs Number

When the number of APs is unfixed, all algorithms are compared considering both the average positioning error and the cost of APs (i.e., the number of iBeacons) for deployment optimization. There are 10 tests for each algorithm, and the results are shown in Fig. 3. Obviously, the APS algorithm obtains the minimum objective function value, i.e., the best integrated evaluation of positioning error and APs cost, and thus gives the best deployment of iBeacons.


Fig. 3. Integrated evaluations of positioning error and APs cost from MMC, CRLB, GA, AIA, APS

The minimum, average and maximum integrated objective function values over the 10 tests are listed in Table 1 for all the methods. It can be found that the proposed APS algorithm gives the best minimum, average and maximum integrated objective function values. Based on the average integrated evaluation, the performances of all algorithms in ascending order are: MMC, GA, CRLB, AIA, APS.

Table 1. Minimum, average and maximum integrated evaluations from MMC, CRLB, GA, AIA, APS

Algorithm | Minimum integrated evaluation | Average integrated evaluation | Maximum integrated evaluation
MMC  | 6.52489 | 6.52489 | 6.52489
CRLB | 6.15706 | 6.35707 | 6.63497
GA   | 6.25092 | 6.38989 | 6.50733
AIA  | 5.88511 | 6.09539 | 6.38688
APS  | 5.73439 | 5.85493 | 5.95164

According to the above experiments in the underground parking lot, our APS algorithm has been shown to be the best optimization method of APs spatial deployment for fingerprint based indoor positioning. From the tests with fixed and unfixed numbers of APs, the proposed algorithm generates the best optimization result, for both single-objective (indoor positioning error) evolution and multi-objective (positioning error and APs cost) evolution.


4 Conclusion

Indoor positioning technology is becoming more and more important, since it can bring much convenience to people. With the development of various signal sources, many indoor positioning approaches have been designed. Among existing algorithms, fingerprint positioning is usually used with an established fingerprint database. The efficiency of indoor positioning is affected by the spatial deployment of access points, which should be considered before APs are installed. There are already some algorithms for APs deployment, such as uniform placement, linear programming and nonlinear optimization, and some complex optimization methods have also been used, such as GA, AIA, etc. To help overcome the disadvantages of PSO, such as its insufficient searching breadth, we propose a new algorithm, APS. The breadth searching ability of PSO is improved with an adaptive method, which can maintain population diversity and avoid too many inferior particles. The APS method has better depth and breadth searching abilities, and works well for APs deployment with single or multiple objectives. Based on a series of experiments with 107 iBeacons in an underground parking lot, our algorithm is tested and compared with the other optimization methods. It has been shown that the proposed APS achieves the best APs deployment with the least indoor positioning error, or the least integrated evaluation considering both positioning error and APs cost. All the algorithms are tested with fingerprint indoor positioning in an underground parking lot, taking iBeacons as signal sources. In the future, APS will be tested in more positioning spaces with more types of APs. The optimization algorithm will consider the effects and constraints of different indoor environments, and the complementary advantages of various kinds of APs. Our ultimate aim is to provide a widely applicable optimization method for APs spatial deployment, and to help implement precise indoor positioning in complex spaces with multiple types of APs.

Acknowledgments. This work was supported by the National Key Research and Development Program of China (Project No. 2016YFB0502201).

References
1. Li, C.C., Su, J., Chu, T.H., Liu, J.W.S.: Building/environment data/information enabled location specificity and indoor positioning. IEEE Internet Things J. 4, 2116–2128 (2017)
2. Zou, H., Wang, H., Xie, L., Jia, Q.S.: An RFID indoor positioning system by using weighted path loss and extreme learning machine. In: IEEE International Conference on Cyber-physical Systems, Taipei, Taiwan, pp. 66–71 (2013)
3. Khalajmehrabadi, A., Gatsis, N., Akopian, D.: Modern WLAN fingerprinting indoor positioning methods and deployment challenges. IEEE Commun. Surv. Tutor. 19, 1974–2002 (2017)
4. Chen, K., Wang, C., Yin, Z., Jiang, H., Tan, G.: Slide: towards fast and accurate mobile fingerprinting for wi-fi indoor positioning systems. IEEE Sens. J. 18, 1213–1223 (2018)
5. Ma, Y.W., Chen, J.L., Liao, J.J., Tang, C.L.: Intelligent fingerprint-assisted for indoor positioning system. In: IEEE International Workshop on Electromagnetics, vol. 85, pp. 108–109 (2014)


6. Xia, M., Chen, J., Song, C., Li, N., Chen, K.: The indoor positioning algorithm research based on improved location fingerprinting. In: 27th Chinese Control and Decision Conference, Qingdao, China, pp. 5736–5739 (2015)
7. Raspopoulos, M.: Multidevice map-constrained fingerprint-based indoor positioning using 3-D ray tracing. IEEE Trans. Instrum. Meas. 67, 466–476 (2018)
8. Dhillon, S.S., Chakrabarty, K.: Sensor placement for effective coverage and surveillance in distributed sensor networks. In: Wireless Communications and Networking, WCNC, vol. 3, pp. 1609–1614 (2003)
9. Zhou, M., Qiu, F., Xu, K., Tian, Z., Wu, H.: Error bound analysis of indoor wi-fi location fingerprint based positioning for intelligent access point optimization via fisher information. Comput. Commun. 86, 57–74 (2016)
10. Du, X., Yang, K.: A map-assisted wifi AP placement algorithm enabling mobile device's indoor positioning. IEEE Syst. J. 11, 1467–1475 (2017)
11. Chen, X., Zou, S.: Improved wi-fi indoor positioning based on particle swarm optimization. IEEE Sens. J. 17, 7143–7148 (2017)
12. Cai, Y., Guan, W., Wu, Y., Xie, C., Chen, Y., Fang, L.: Indoor high precision three-dimensional positioning system based on visible light communication using particle swarm optimization. IEEE Photonics J. 9, 1–20 (2017)

Deployment Optimization of Indoor Positioning Signal Sources with Fireworks Algorithm

Jianhui Zhao1, Shiqi Wen1, Haojun Ai1,2, and Bo Cai1(✉)

1 School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
[email protected]
2 Collaborative Innovation Center of Geospatial Technology, Wuhan 430079, Hubei, China

Abstract. The spatial deployment of signal sources affects the performance of indoor positioning systems and has therefore received more attention in recent years. This paper presents FWA, a method based on the fireworks algorithm, to provide the optimal deployment solution. Taking fine chromosomes as fireworks, the explosion factors are set, including the number of explosion sparks and the radius of all explosion sparks. The supplemented individuals are produced from explosion and random generation, which helps increase the diversity of the population and guarantees the quality of individuals. After crossover and mutation, the population evolves to the next generation. The optimal result of the evolutions corresponds to a deployment solution, i.e., a certain number of signal sources with their locations. The FWA algorithm has been shown to have good convergence ability by a series of experiments, with an iBeacons based indoor positioning system in an underground parking lot and the fingerprint based indoor location method. Compared with commonly used optimization algorithms, FWA has the best searching ability in single-objective and multi-objective cases, and it obtains the best optimization result considering only positioning error, or both positioning error and the cost of iBeacons. Therefore, the proposed FWA provides optimal deployment of signal sources for indoor positioning systems.

Keywords: Spatial deploying · Fireworks method · Indoor position · Fingerprint

1 Introduction

Positioning technology can be divided into outdoor positioning and indoor positioning. GPS is the most famous outdoor positioning system, which implements localization by transmitting signals, receiving signal intensity and calculating distances. Due to the irregularity of building structures and the complexity of indoor materials in certain complicated interiors such as shopping malls, there are different influences on the attenuation of satellite signal intensity. Therefore, people are trying to install sensors such as Wi-Fi, Bluetooth and LED for indoor positioning. Multiple signal sources, or even different types of them, have been used for indoor locating. Jung and Han presented a WRMs calibration system that automates the initial construction and maintenance of Wi-Fi maps for crowdsourcing based indoor

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 229–238, 2018. https://doi.org/10.1007/978-3-030-05057-3_18


positioning; it uses crowdsourced fingerprints collected from numerous smartphones and incorporates an unsupervised learning algorithm [1]. Chen et al. used both a commodity flashlight and a smartphone to achieve linear positioning, which allows automatic mapping from received signal strength to the position on a line, serving as a building block for fingerprinting in general environments [2]. Popoola and Sinanović designed a low-complexity indoor positioning system, whose accuracy is improved using the overlap between LED beams, while collision handling algorithms are designed for LED packets in the overlap region [3]. Zheng et al. proposed an optical indoor positioning scheme using a single LED as beacon and a camera as receiver, where the jointly measured angle of arrival and received light strength are utilized as a fingerprint to determine the position of the receiver [4]. There are other kinds of sensors used as signal sources for indoor positioning, such as ultrasonic [5], ZigBee [6], radio maps [7], etc. Based on the properties of different kinds of sensors, they can also be utilized together to combine their advantages. Zou et al. implemented an indoor localization and tracking system using smartphone built-in Inertial Measurement Unit (IMU) sensors, WiFi received signal strength measurements and opportunistic iBeacon corrections based on a particle filter [8]. In case the number of signal sources is fixed, optimal spatial deployment can improve the positioning accuracy effectively. Besides, the optimization technology can help reduce the number of signal sources while maintaining the level of positioning accuracy. How to balance multiple factors such as precision, cost and so on is the main problem of deployment optimization. The initially used spatial deployment is the non-optimization method, i.e., uniform coverage of signal sources, whose core technique is to divide a space evenly. There are many ways of space dividing [9], e.g., triangulation, trilateration, hyperbolic localization, etc. Uniform coverage is simple and works well for indoor environments with a regular layout and few obstacles. However, most indoor environments are irregular and complex, and thus are not suitable for uniform coverage. Maximum and minimum coverage [10] is an optimization method, which uses polynomial-time algorithms to determine the number of sensors and their placement to address the coverage optimization under the constraints of imprecise detection and terrain properties. Compared with non-optimization methods, this scheme can achieve a relatively reasonable deployment of signal sources in complicated indoor spaces. Based on the Cramer-Rao Lower Bound (CRLB) and Simulated Annealing (SA), Zhou et al. designed a method for APs placement, which focuses on the error bound analysis of indoor WiFi fingerprint based positioning for intelligent APs placement optimization by using the Fisher Information Matrix (FIM) to characterize the relationship between positioning errors and signal distributions [11]. There are complex optimization methods, e.g., Particle Swarm Optimization (PSO), Artificial Immune Optimization (AIO), Genetic Algorithm Optimization (GAO), etc., and some of them have been adopted in spatial location optimization. Chen and Zou presented a Wi-Fi indoor positioning method using an improved unscented Kalman filter, and PSO is used to reduce the ranging error and improve the positioning accuracy [12]. Chen et al. predicted the next location of a mobile object based on AIO, taking into account the characteristics of short moving time and an elusive moving tendency [13]. Eldeeb et al. gave a GAO based framework to solve the APs placement


problem, which finds an APs setup with unique fingerprints at each signal test point while maximizing the diversity among these fingerprints [14]. The Fireworks Algorithm (FWA) simulates the explosion process of fireworks, so it can increase the diversity of fireworks while maintaining their quality. To date, there are only a few FWA based references [15] for spatial optimization, and no report on indoor positioning applications. The advantages of FWA make it applicable to the spatial deployment of indoor signal sources, so an FWA based algorithm is proposed for indoor positioning in this paper.

2 The iBeacons Based Indoor Positioning System

2.1 The iBeacons Based Testing Environment

The testing environment is an underground parking lot, an indoor space of 2,800 m2. As shown in Fig. 1(a), each dot represents an iBeacon in the space, and the red dots represent an example deployment with a certain number of signal sources at given locations. The target of our work is to find an optimal deployment with fewer iBeacons and better locating accuracy. As shown in Fig. 1(b), the installed iBeacons are labeled with red circles. There are 107 iBeacon signal sources, uniformly arranged in the space. For each iBeacon, the distances to its adjacent signal sources are about 4.5 m. The iBeacons are taken as reference points, and they are used to locate any position in the underground parking lot.

Fig. 1. The iBeacons based testing environment, (a) the layout of indoor positioning space, (b) the installed iBeacons (Color figure online)


2.2 Fingerprint Based Indoor Location Method

In our work, the fingerprint based positioning approach is adopted, which includes fingerprint database establishment and fingerprint matching. For each reference point, the Received Signal Strength Indicator (RSSI) from every signal source is collected to set up the fingerprint database. During fingerprint matching, the RSSIs of one observing point (any position to be located) are compared with the fingerprints in the database, and the location of the observing point is computed from the most similar reference points.

(1) Fingerprint database establishment. The fingerprint database for n reference points consists of n records, and each record consists of n RSSIs from all signal sources. Thus, there is an n × n matrix in the database, and each record is a fingerprint. If no signal strength can be received, the RSSI is set to zero for the related row and column of the matrix. In our experiments, a mobile phone with Android 5.5 is used to collect RSSIs from all iBeacons. For each reference point, the collection time is 1 min with an acquisition frequency of once per 100 ms, so a total of 600 sets of 107 RSSIs are obtained. Then the average values over the 600 sets are calculated, and the averaged 107 RSSIs are taken as the fingerprint of the reference point. The fingerprints of all reference points are stored in an XML file, which is the fingerprint database.

(2) Fingerprint matching. To locate an observing point, a fingerprint is collected with n RSSIs from all signal sources. Then the fingerprint is compared with all records in the established fingerprint database. The difference between two fingerprints is defined as the Euclidean distance between the related two n-dimensional vectors, so the most similar fingerprint is the one with the least Euclidean distance. In our experiments, 3 similar fingerprints are found for every observing point, corresponding to the 3 reference points with the smallest Euclidean distances. Suppose the 3 Euclidean distances are w1, w2, w3, and the coordinates of the 3 reference points are (x1, y1), (x2, y2), (x3, y3) respectively. Then the coordinates of the observing point are estimated by:

x = \frac{x_1/w_1 + x_2/w_2 + x_3/w_3}{1/w_1 + 1/w_2 + 1/w_3} \quad (1)

y = \frac{y_1/w_1 + y_2/w_2 + y_3/w_3}{1/w_1 + 1/w_2 + 1/w_3} \quad (2)

The coordinates (x, y) obtained from the above formulas are regarded as the measured location of the observing point. The positioning error of the observing point is evaluated by the Euclidean distance between its measured location and its true location; obviously, a longer distance means a larger error. After measuring all observing points, the averaged positioning error can be computed over them. Since the coordinates of all reference points are known in our experiments, the reference points are directly used as observing points. That is, the mobile phone is placed at the location of each reference point, and then its position is measured.


(3) Fitness function. To evaluate the positioning results, a fitness function is defined, which may consider only the positioning precision or multiple factors simultaneously. In our system, two factors are mainly considered, i.e., positioning error and the cost of signal sources. Because only iBeacons are employed, the cost of signal sources is the number of iBeacons used for indoor positioning. As the number of signal sources increases, the overall positioning error decreases while the cost of the system increases. How to combine these two factors to achieve a relatively optimal result is the problem to be solved in our system. To represent and evaluate the combination, we adopt the following fitness function:

FFV = a \cdot PE + b \cdot NS \quad (3)

where FFV is the fitness function value, PE is the overall positioning error, NS is the number of signal sources, and a and b are weighting parameters.

3 Fireworks Algorithm for Indoor Positioning

The fireworks algorithm (FWA) is used for deployment optimization of signal sources, which, to our knowledge, is the first application of FWA in indoor positioning. The optimization procedure is shown in Fig. 2, and its main steps are described as follows.

(1) Initialization of fireworks. In the FWA method, there are many fireworks, and each firework consists of some sparks. One firework represents a randomly generated spatial deployment of signal sources, while each spark of a firework represents a signal source. The initialization procedure of FWA is the same as that of the Genetic Algorithm, i.e., a firework corresponds to a chromosome and a spark corresponds to a gene.

(2) Selection of fine fireworks. For each firework, its fitness function value is calculated. In the evolution procedure of FWA, fine fireworks should increase generation after generation to obtain more optimal results. Therefore, a constant threshold is set for the fitness function value to ensure the convergence of FWA. The values of all fireworks are compared with the fitness function threshold, and the fireworks with values less than the threshold are selected as fine fireworks. The fine fireworks are then ordered by their fitness function values from small to large.

(3) Setting of explosion factors. For every fine firework, its explosion factors include the number of explosion sparks and the radius of all explosion sparks. The number of explosion sparks of fine firework x_i is computed by:


Fig. 2. Flowchart of FWA for indoor positioning

s_i = m \cdot \frac{y_{max} - f(x_i) + a}{\sum_{i=1}^{n}\left(y_{max} - f(x_i)\right) + a} \quad (4)

where m represents the total number of sparks, f(x_i) represents the fitness function value of firework x_i, y_{max} represents the maximum fitness function value of all fireworks of the current generation, and the constant a is used to keep the denominator from becoming zero. The radius of all explosion sparks of fine firework x_i is computed as:

A_i = A \cdot \frac{f(x_i) - y_{min} + a}{\sum_{i=1}^{n}\left(f(x_i) - y_{min}\right) + a} \quad (5)

where A represents the maximum explosion radius set in advance, y_{min} represents the minimum fitness function value of all fireworks of the current generation, and the other parameters are the same as in Eq. (4). When a firework explodes, new sparks, whose number is given by Eq. (4), are randomly selected within the radius given by Eq. (5), and the new sparks should be different from the old ones.
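For illustration, Eqs. (4) and (5) can be computed for a set of fine fireworks as in the sketch below; the function name, arguments and the small constant default are our own assumptions.

```python
import numpy as np

def explosion_factors(fitness, m_total_sparks, max_radius, a=1e-6):
    """Explosion factors of Eqs. (4)-(5): fireworks with lower (better) fitness
    get more explosion sparks and a smaller explosion radius."""
    f = np.asarray(fitness, dtype=float)
    y_max, y_min = f.max(), f.min()
    s = m_total_sparks * (y_max - f + a) / ((y_max - f).sum() + a)  # Eq. (4)
    radii = max_radius * (f - y_min + a) / ((f - y_min).sum() + a)  # Eq. (5)
    return np.rint(s).astype(int), radii
```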


(4) Supplement of fireworks. Except for the mf fine fireworks, the other mi fireworks of the current generation are discarded because their fitness function values are too large. To keep the population size m unchanged, fireworks need to be supplemented. There are two cases: mi > mf and mi < mf. When mi > mf, the fine fireworks explode using the explosion factors to generate mf new fireworks, and then the remaining (mi − mf) fireworks are randomly generated in the same way as in the initialization procedure. When mi < mf (> 4.5), which is obviously different from other stages. Moreover, the SS stage can also be better differentiated from other sleep stages based on the entropy value. However, the difference between the S1 stage and the REM stage is less obvious and requires further study.


3.4 Discussion

This study used a psychophysics method to measure the threshold range of sleep stages from single-channel EEG signals by calculating the fuzzy entropy value. On the one hand, we studied the influence of the MFE scale factor on the sleep stage thresholds. The experimental results show that the fuzzy entropy changes with the scale factor; when the scale factor s = 2, the MFE obtains its maximum value and the threshold resolution of the sleep stages is improved (Fig. 5).

Fig. 5. Influence of different scale factors on fuzzy entropy
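The MFE computation itself is defined earlier in the paper and is not shown in this excerpt; as a rough, non-authoritative sketch, coarse-graining at scale s followed by a common fuzzy entropy variant can be written as below, with the embedding dimension m and tolerance r chosen as illustrative defaults.

```python
import numpy as np

def coarse_grain(x, s):
    """Average non-overlapping windows of length s (the MFE scale factor)."""
    n = len(x) // s
    return x[:n * s].reshape(n, s).mean(axis=1)

def fuzzy_entropy(x, m=2, r=0.2):
    """Fuzzy entropy of a 1-D series: vector similarity is graded by an
    exponential fuzzy membership instead of a hard threshold."""
    x = np.asarray(x, dtype=float)
    tol = r * x.std()
    def phi(dim):
        emb = np.array([x[i:i + dim] for i in range(len(x) - dim)])
        emb -= emb.mean(axis=1, keepdims=True)                      # remove local baseline
        d = np.abs(emb[:, None, :] - emb[None, :, :]).max(axis=2)   # Chebyshev distance
        sim = np.exp(-(d ** 2) / tol)                               # fuzzy membership degree
        n = len(emb)
        return (sim.sum() - n) / (n * (n - 1))                      # exclude self-matches
    return np.log(phi(m)) - np.log(phi(m + 1))

def multiscale_fuzzy_entropy(x, scales=(1, 2, 3, 4, 5)):
    """MFE curve: fuzzy entropy of the coarse-grained series at each scale."""
    return [fuzzy_entropy(coarse_grain(np.asarray(x, float), s)) for s in scales]
```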

On the other hand, we studied the gender differences in sleep stage thresholds by comparing the fuzzy entropy thresholds of 10 data samples (5 males and 5 females). The T-test shown in Table 1 (Sig. (2) ≤ 0.05) gives a statistically more convincing explanation of the gender differences in sleep stage thresholds.

Table 1. T-test results of different genders (mean-equation T-test; the last two columns give the confidence interval at q = 95%)

Stage | t     | f | Sig. (2) | Mean difference | Standard error | Subthreshold | Threshold
W     | 4.619 | 8 | 0.002    | 0.923           | 0.199          | 0.462        | 1.384
S1    | 2.983 | 8 | 0.018    | 0.669           | 0.224          | 0.151        | 1.186
S2    | 3.833 | 8 | 0.016    | 0.539           | 0.140          | 0.214        | 0.863
SS    | 3.645 | 8 | 0.007    | 0.592           | 0.162          | 0.217        | 0.966
REM   | 4.559 | 8 | 0.004    | 0.761           | 0.166          | 0.376        | 1.146

The experimental results show that there are significant differences in the entropy thresholds of sleep stages between genders, and the sleep threshold of females is significantly higher than that of males. This may be related to the active areas of the brain in males and females [9], and explanations can be found in psychophysiology. Gur et al. [10] used fMRI to find that, per unit volume of the brain, females have more


gray matter than males, while males have more white matter than females. In addition, one of the reasons for the threshold difference may be the social differences between males and females. Of course, this needs further research.

4 Conclusion

In this paper, the CEEMDAN and MFE methods were used to study the threshold of sleep stages based on single-channel EEG signals. First, the adaptive EMD decomposition of the EEG data is performed to obtain new high-precision EEG data. Then, the MFE of each new data segment is calculated and used as the feature of the sleep stage threshold, which provides a reference for the study of automatic sleep stage classification. Finally, the influence of the fuzzy entropy scale factor and of different gender samples on the sleep stage threshold was studied. The experimental results showed that the sleep threshold of females was significantly higher than that of males. We will continue this research in the future. On the one hand, our experimental sample size is too small to be universal and representative. On the other hand, the results show that the sleep stage thresholds of the S1 and REM stages cannot be accurately measured using fuzzy entropy. This requires further study to better understand the meaning of sleep.

Acknowledgments. National Natural Science Foundation of China (61373149) and the Taishan Scholars Program of Shandong Province, China.

References
1. Chen, X.: Automatic sleep staging based on EEG. Nanjing University of Posts and Telecommunications (2014)
2. Loomis, W.E., Shull, C.A., Snedecor, G.W.: Methods in Plant Physiology: A Laboratory Manual and Research Handbook. McGraw-Hill, New York City (1937)
3. Shao, X., Hu, B., Zheng, X.: A study on automatic sleep stage classification based on clustering algorithm. In: Zeng, Y., et al. (eds.) BI 2017. LNCS (LNAI), vol. 10654, pp. 139–148. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70772-3_13
4. Tang, Q.: Automatic sleep staging based on EEG signals. Guangdong University of Technology (2016)
5. Cheng, J.: Sleep stage analysis based on EEG signals. Beijing Institute of Technology (2015)
6. Hassan, A.R., Bhuiyan, M.I.H.: Computer-aided sleep staging using complete ensemble empirical mode decomposition with adaptive noise and bootstrap aggregating. Biomed. Signal Process. Control 24, 1–10 (2016)
7. Tiantian, L., Yong, L.: Measurement of thresholds in facial expressions and their age and gender differences. Psychol. Behav. Res. 13(6), 771–777 (2015)
8. Jinde, Z., Minjun, C., Junsheng, C., et al.: Multi-scale fuzzy entropy and its application in fault diagnosis of rolling bearings. J. Vib. Eng. 27(1), 145–151 (2014)


9. Lee, T.M., Liu, H.L., Hoosain, R., et al.: Gender differences in neural correlates of recognition of happy and sad faces in humans assessed by functional magnetic resonance imaging. Neurosci. Lett. 333(1), 13–16 (2002)
10. Gur, R.C., Gunningdixon, F., Bilker, W.B., et al.: Sex differences in temporo-limbic and frontal brain volumes of healthy adults. Cereb. Cortex 12(9), 998–1003 (2002)

Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels

Jing Liu1, Kun Han1, Wenhua Wu1, Shu Wang2, and Xiao Yu3(✉)

1 School of Information and Communication, National University of Defense Technology, Xi'an 710106, China
2 Institute of Systems Engineering, Academy of Military Sciences, Beijing 100039, China
3 School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, Shandong, China
[email protected]

Abstract. The maximum likelihood (ML) estimation algorithm for timing deviation and carrier frequency offset in orthogonal frequency division multiplexing (OFDM) systems is studied, and the ML algorithm is extended to the fast fading multipath wireless channel environment using a multi-symbol joint estimation technique. Because this method is based on the autocorrelation of cyclic prefixes (CP) in OFDM blocks and requires no training data, the spectral efficiency and throughput of the system are improved. Meanwhile, for the extreme cases of signal-to-noise ratio, two algorithms are derived which are suboptimal but have lower computational complexity and better adaptability to the channel. Simulation results indicate that this scheme can effectively improve the estimation performance of symbol timing deviation and carrier frequency offset in fast fading multipath channels.

Keywords: OFDM · ML estimation · Synchronization · Multipath fading

1 Introduction

OFDM has the advantages of higher spectral efficiency and of eliminating interference within cells, and has recently received great attention; it is widely applied to digital audio and video broadcasting systems, indoor broadband wireless systems, etc. [1, 2]. Because subcarriers are separated from one another by their orthogonality, symbol timing offset and carrier frequency offset have strong effects on system performance, such as FFT window offset and inter-carrier interference. Existing synchronization algorithms that insert pilot symbols and training sequences are simple techniques often used in time-varying multipath systems, but they decrease spectral efficiency and throughput. To improve system performance, blind synchronization techniques based on slowly varying channel models have been widely studied; they need a large number of OFDM data blocks and exploit the cyclic prefix [3, 4] or cyclostationarity [5–7]. This paper proposes a blind synchronization algorithm based on ML which does not need pilot symbols and is suitable for fast-fading multipath OFDM systems.

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 249–256, 2018. https://doi.org/10.1007/978-3-030-05057-3_20

2 Signal Model

2.1 OFDM System Model

The baseband model of the OFDM system is shown in Fig. 1 [8, 9]. Assume that there are N subcarriers and that the transmitted symbols are defined as X(0), X(1), …, X(N − 1). After the fast Fourier transform stage, the spectral signals are transformed into the time-domain signals x(0), x(1), …, x(N − 1). A cyclic prefix of length D, copied from the last D samples of the time-domain data block, is added to the front of each data block, i.e., x(k) = x(k + N), k ∈ [−D, −1] [10, 11]. The time-domain discrete signal to transmit can be expressed as [12–14]:

x(k) = x(k + N), \quad k \in [-D, -1] \quad (1)

where σs² denotes the transmitted energy per symbol, X(n) denotes a signal with mean 0 and variance 1, and x(k), k ∈ [0, N − 1], denotes a signal with mean 0 and variance σs².

Fig. 1. OFDM system model

The signal received from the fast-fading multipath channel can be written as:

y(k) = \sum_{l=0}^{L} h(k, l)\, x(k - l), \quad (2)

where L + 1 denotes the channel length, which is less than the length of the cyclic prefix D, and h(k, l), l = 0, …, L, is the impulse response of the channel. The correlation of h(k, l) can be expressed as:

E\{h(k_1, l_1)\, h^*(k_2, l_2)\} = \gamma J_0\!\left(2\pi f_D T\, |k_1 - k_2| / N\right) e^{-l/D}\Big|_{l=l_1=l_2}, \quad (3)

where γ is a normalization constant, J0(·) is the first-kind Bessel function, fD denotes the maximum Doppler shift, and T is the useful symbol duration. The received signal can be expressed as:

r(k) = y(k - \theta)\, e^{j2\pi k\varepsilon/N} + w(k), \quad (4)

where θ denotes the discrete delay during signal transmission, ε is the normalized carrier frequency offset, and w(k) is AWGN with variance σw².

2.2 Signal Correlation

The cyclic prefix of a transmitted OFDM symbol is a copy of its last D data samples. We define the set I as the cyclic prefix positions of the received data and I* as the positions of the original data copied into the cyclic prefix, so the data in I correspond to those in I*. For k ∈ I we have

E\{r(k)\, r^*(k+m)\} =
\begin{cases}
\gamma \sum_{l=0}^{L} J_0(0)\, e^{-l/D}\, \sigma_s^2 + \sigma_w^2 = \beta_0 \sigma_s^2 + \sigma_w^2, & m = 0,\\
\gamma \sum_{l=0}^{L} J_0(2\pi f_D T)\, e^{-l/D}\, \sigma_s^2\, e^{j2\pi\varepsilon} = \beta_1 e^{j2\pi\varepsilon} \sigma_s^2, & m = N,\\
0, & \text{otherwise,}
\end{cases} \quad (5)

where \beta_0 = \gamma \sum_{l=0}^{L} J_0(0)\, e^{-l/D} and \beta_1 = \gamma \sum_{l=0}^{L} J_0(2\pi f_D T)\, e^{-l/D}. For k \notin I and m \neq 0, E\{r(k)\, r^*(k+m)\} = 0. The SNR is defined as \beta_0 \sigma_s^2 / \sigma_w^2.

3 Optimal Estimated Value Based on ML

In the last section, the transmitted symbols and the channel noise were assumed to be complex Gaussian. For k ∈ I, the joint probability density function (pdf) of the received signal can be expressed as:

f(r(k), r(k+N) \mid \theta, \varepsilon) = \frac{\exp\!\left(-\dfrac{|r(k)|^2 + |r(k+N)|^2 - 2\rho\,\mathrm{Re}\{e^{j2\pi\varepsilon} r(k)\, r^*(k+N)\}}{(1-\rho^2)\left(\beta_0\sigma_s^2 + \sigma_w^2\right)}\right)}{\pi^2 (1-\rho^2)\left(\beta_0\sigma_s^2 + \sigma_w^2\right)^2} \quad (6)

where the weighting coefficient ρ is defined as:

\rho = \frac{|E\{r(k)\, r^*(k+N)\}|}{\sqrt{E\{|r(k)|^2\}\, E\{|r(k+N)|^2\}}} = \frac{\beta_1 \sigma_s^2}{\beta_0 \sigma_s^2 + \sigma_w^2} \quad (7)

For all k, the pdf of the received signal can be written as:

f(r(k) \mid \theta, \varepsilon) = \frac{\exp\!\left(-\dfrac{|r(k)|^2}{\beta_0\sigma_s^2 + \sigma_w^2}\right)}{\pi\left(\beta_0\sigma_s^2 + \sigma_w^2\right)}. \quad (8)

By using the vector form of the received signal, the log-likelihood function can be written as:

\log f(\mathbf{r} \mid \theta, \varepsilon) = \log\!\left(\prod_{k\in I} f(r(k), r(k+N) \mid \theta, \varepsilon) \prod_{k\notin I\cup I^*} f(r(k) \mid \theta, \varepsilon)\right) = \log\!\left(\prod_{k\in I} \frac{f(r(k), r(k+N) \mid \theta, \varepsilon)}{f(r(k) \mid \theta, \varepsilon)\, f(r(k+N) \mid \theta, \varepsilon)} \prod_{k} f(r(k) \mid \theta, \varepsilon)\right), \quad (9)

where f(·) denotes the marginal pdf of the corresponding random variables. From (8), we can find that f(r(k) | θ, ε) does not depend on θ or ε. Assuming that the number of received data blocks is M, we have

I = \{\theta, \ldots, \theta+D-1,\ \theta+K, \ldots, \theta+K+D-1,\ \ldots,\ \theta+(M-1)K, \ldots, \theta+(M-1)K+D-1\}, \quad (10)

log f (r|𝜃, 𝜀 ) = C1 + C2

∑ ∑ [ { } Re ej2𝜋𝜀 r(k + iK)r∗ (k + iK + N) i=0

k=𝜃

𝜃+D−1

)] −𝜌 ∑ ( |r(k + iK)|2 + |r(k + iK + N)|2 , 2 k=𝜃

(11)

where 𝜃+D−1

C1 =



log(1 − 𝜌2 ),

k=𝜃

2𝜌 . C2 = 2 (1 − 𝜌 )(𝛽0 𝜎s2 + 𝜎w2 )

(12)

By transforming (11), we have

𝜌 log f (r|𝜃, 𝜀) = ||T1 (𝜃)|| cos(2𝜋 + ∠T1 (𝜃)) − T2 (𝜃), 2 Where ∠ denotes the plural phase.

(13)

Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels

253

M−1 𝜃+D−1

T1 (𝜃) =

∑ ∑ i=0

∑ ∑ ( i=0

(14)

) |r(k + iK)|2 + |r(k + iK + N)|2 .

(15)

k=𝜃

M−1 𝜃+D−1

T2 (𝜃) =

r(k + iK)r∗ (k + iK + N),

k=𝜃

T1 (𝜃) can be seen as self-correction of signal and T2 (𝜃) is accumulating energy func‐ tion [15]. The ML estimated value of 𝜃 and that of 𝜀 can be computed by maximizing (13). We can compute these values in two steps as following:

\max_{\theta, \varepsilon} \log f(\mathbf{r} \mid \theta, \varepsilon) = \max_{\theta} \max_{\varepsilon} \log f(\mathbf{r} \mid \theta, \varepsilon) = \max_{\theta} \log f(\mathbf{r} \mid \theta, \varepsilon_{ML}(\theta)).    (16)

where \varepsilon lies in [0, 2\pi]. The ML estimate of \varepsilon is:

\varepsilon_{ML}(\theta) = \frac{1}{2\pi} \angle T_1(\theta).    (17)

By substituting (17) into (13), the ML estimate of \theta is

\theta_{ML} = \arg\max_{\theta} \left\{ |T_1(\theta)| - \frac{\rho}{2} T_2(\theta) \right\}.    (18)

From (17) and (18), we can see that the estimation variance depends on the number of OFDM data blocks M, the length of the cyclic prefix D, and the weighting coefficient \rho. Meanwhile, the statistic T_1(\theta) determines the performance of the estimation algorithm: when \theta_{ML} = \theta, |T_1(\theta)| reaches its largest value. From (18), the weighting coefficient \rho is computed according to the current state of the channel. When the SNR is very large, \beta_0 \sigma_s^2 \gg \sigma_w^2. Substituting this into (7) gives \rho \to 1, and substituting it into (18) gives the value of \theta obtained by the MMSE-like algorithm:

\theta_{MMSE} = \arg\max_{\theta} \left\{ |T_1(\theta)| - \frac{1}{2} T_2(\theta) \right\}.    (19)

When the SNR is very low, \beta_0 \sigma_s^2 \ll \sigma_w^2. Substituting this into (7) gives \rho \to 0, and substituting it into (18) gives the value of \theta obtained by the MC-like algorithm:

\theta_{MC} = \arg\max_{\theta} |T_1(\theta)|.    (20)

From (19) and (20), we can see that the computational complexity of these two algorithms is very low and that they both adapt well to the channel. For different values of SNR, we can obtain the estimate of \varepsilon as given in (17). Note that the performance of the \varepsilon estimate depends on the estimate of \theta.
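The following Python sketch illustrates the estimators of Eqs. (14)–(20): it scans candidate timing offsets, accumulates T_1 and T_2 over the M cyclic prefixes, and returns the ML, MMSE-like and MC-like timing estimates together with the CFO estimate of Eq. (17). The function name is ours, numpy is an assumed dependency, and in practice \rho would be computed from the channel statistics via Eq. (7) rather than passed in directly.

```python
import numpy as np

def blind_cp_estimate(r, N, D, M, rho):
    """Blind timing/CFO estimation from M cyclic prefixes of an OFDM signal r."""
    K = N + D
    n_candidates = len(r) - (M - 1) * K - D - N + 1   # offsets that keep all indices in range
    T1 = np.zeros(n_candidates, dtype=complex)
    T2 = np.zeros(n_candidates)
    for theta in range(n_candidates):
        idx = (theta + np.arange(M)[:, None] * K + np.arange(D)[None, :]).ravel()
        T1[theta] = np.sum(r[idx] * np.conj(r[idx + N]))                  # Eq. (14)
        T2[theta] = np.sum(np.abs(r[idx]) ** 2 + np.abs(r[idx + N]) ** 2)  # Eq. (15)

    theta_ml = int(np.argmax(np.abs(T1) - (rho / 2) * T2))   # Eq. (18)
    theta_mmse = int(np.argmax(np.abs(T1) - 0.5 * T2))       # Eq. (19), rho -> 1
    theta_mc = int(np.argmax(np.abs(T1)))                    # Eq. (20), rho -> 0
    eps_ml = np.angle(T1[theta_ml]) / (2 * np.pi)            # Eq. (17), up to sign convention
    return theta_ml, theta_mmse, theta_mc, eps_ml

# Hypothetical usage with a received vector r produced as in the earlier simulation sketch:
# theta_ml, theta_mmse, theta_mc, eps_ml = blind_cp_estimate(r, N=128, D=32, M=20, rho=0.9)
```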

4 Results and Analysis

The performance of the proposed ML-based algorithm is evaluated by the Monte Carlo method in this section. The parameters are set as in [16]: 20 symbols are transmitted in the OFDM system, N = 128, T = 224 µs, L = 20, M = 20, D/N = 1/4, and f_D = 1 kHz. The gain of each channel tap follows the same Gaussian distribution, and the taps are mutually independent. The timing deviation is θ = 50 and the frequency offset is ε = 0.1. The variance curves of the timing deviation estimate versus SNR and multipath length for the traditional maximum mean square error algorithm, the MC algorithm, and the proposed ML-based algorithm are shown in Figs. 2 and 3. From Fig. 2 we can see that the performance of all three algorithms improves as the SNR increases. From Fig. 3 we can see that when the multipath length increases, the estimation performance of all three algorithms degrades with the decreasing channel quality. Among the three algorithms, the MC algorithm has the worst performance but the lowest computational complexity.

Fig. 2. Timing deviation estimation VS SNR

Fig. 3. Timing deviation estimation VS multipath length


The variance curves of the frequency offset estimate versus SNR and multipath length for the traditional maximum mean square error algorithm, the MC algorithm, and the proposed ML-based algorithm are shown in Figs. 4 and 5. From Fig. 4 we can see that the performance of all three algorithms improves as the SNR increases, and the three algorithms show the same adaptability to SNR. From Fig. 5 we can see that the performance of all three algorithms degrades as the multipath length increases, and the performance of MC is the worst. From Figs. 2, 3, 4 and 5, we find that the performance of the presented algorithm can meet the requirements of real systems under certain conditions. Compared with other blind estimation algorithms, these three ML-based algorithms achieve much better performance.

Fig. 4. Frequency offset estimation VS SNR

Fig. 5. Frequency offset estimation VS multipath length

5 Conclusions

An estimation algorithm based on ML is proposed to resolve the problem of blind synchronization over fast-fading multipath channels. Two suboptimal estimation algorithms with low computational complexity are derived for different SNR regimes. The presented algorithms require no training data and can therefore increase spectral efficiency and throughput. Simulation results indicate that the presented algorithms improve the performance of estimating the symbol timing deviation and the carrier offset.

References 1. Lin, T.C., Phoong, S.M., New, A.: Cyclic-prefix based algorithm for blind CFO estimation in OFDM systems. IEEE Trans. Wireless Commun. 15(6), 3995–4008 (2016) 2. Fang, C., Gong, X., Huang, M.: On sequential blind channel estimation for time-varying OFDM system. In: IEEE International Conference on Ubiquitous Wireless Broadband, pp. 1–4 (2016) 3. Prakash, D., Pillai, S.S., Jayaprakash, A., Reddy, G.R.: A new blind carrier frequency offset estimation scheme for OFDM systems. In: International Conference on Communication & Signal Processing, pp. 1096–1100 (2016) 4. Lin, T.C., Pan, Y.C., Tai, W.J., Phoong, S.M.: An improved ESPRIT-based blind CFO estimation for OFDM in the presence of I/Q imbalance. Signal Process. Adv. Wireless Commun. 395(6), 639–643 (2013) 5. Sun, Z., Liu, R., Wang, W.: Joint time-frequency domain cyclostationarity-based approach to blind estimation of OFDM transmission parameters. Eurasip J. Wireless Commun. Network. 2013(1), 1–8 (2013) 6. Zhang, W., Gao, F., Yao, B.: Blind CFO estimation for multiuser OFDM uplink with large number of receive antennas. In: IEEE International Conference on Acoustics, vol. 64 (9), pp. 2255–2268 (2016) 7. Lim, J.: Joint estimation of CFO and channel in OFDM systems with blind noise statistics. IETE Tech. Rev., 1–13 (2016) 8. Liu, M., Li, B., Yang, Q., Tang, N.: Blind joint estimation for OFDM time-frequency parameters. Circuits Syst. Signal Process. 32(6), 2999–3012 (2013) 9. Liu, M., Li, B.: Bandwidth blind estimation for OFDM. In: IEEE International Conference on Digital Signal Processing, pp. 181–184 (2017) 10. Li, X., Hu, J., Wei, H., Yu, F., Wang, G.: Blind carrier and sampling frequency offsets estimation in OFDM system. In: Wireless Communications & Networking Conference, pp. 1–6 (2017) 11. Saci, A., Al-Dweik, A., Shami, A., Iraqi, Y.: One-shot blind channel estimation for OFDM systems over frequency-selective fading channels. IEEE Trans. Commun. 65(12), 5445–5458 (2017) 12. Jayaprakash, A., Reddy, G.R.: Robust blind carrier frequency offset estimation algorithm for OFDM systems. Wireless Pers. Commun. 94(3), 1–15 (2017) 13. Tian, J., Zhou, T., Xu, T., Hu, H., Li, M.: Blind estimation of channel order and SNR for OFDM systems. IEEE Access PP(99), 1 (2018) 14. Wang, Y.C., Phoong, S.M.: Blind estimation of symbol timing offset in OFDM systems. In: IEEE International Workshop on Signal Processing Advances in Wireless Communications, pp. 1–5 (2017) 15. Ramadhan, M., Bouzidi, D.A., Iyad, D.: A low complexity joint semi-blind estimation of CFO and channel for OFDM systems. In: International Conference on Electrical Engineeringboumerdes, pp. 1–6 (2017) 16. Lin, T.C., Phoong, S.M.: MSE-optimized CP-based CFO estimation in OFDM systems over multipath channels. In: Asia-pacific Signal & Information Processing Association Summit & Conference, pp. 818–822 (2018)

Facial Shape and Expression Transfer via Non-rigid Image Deformation

Huabing Zhou1, Shiqiang Ren1, Yong Zhou2(B), Yuyu Kuang1, Yanduo Zhang1, Wei Zhang1, Tao Lu1, Hanwen Chen1, and Deng Chen1

1 Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China
2 Yangtze University College of Technology and Engineering, Jingzhou 434100, China
[email protected]

Abstract. In this paper, we present a novel approach for transferring the shape and expression of a face in an image to those of another, regardless of variances between the two faces in illumination, color, texture, resolution and even some mild occlusion. We first use a face alignment algorithm to locate accurate facial landmark points for both the original face and the target face, then align them with a global similarity transformation to eliminate their inconsistency in pose, size and position. Finally, we use our non-rigid image deformation method to deform the original face by fitting a map function for each of its pixel points according to the two sets of facial landmark points. By combining a face alignment algorithm and a non-rigid image deformation method, our method can be fully automatic, or semi-automatic for conveniently tuning a better result. Experimental results show that our method produces realistic, natural and artifact-free facial shape and expression transfer. We also discuss the limitations and potential of the proposed method.

Keywords: Non-rigid image deformation · Face editing · Expression transfer

1 Introduction

Image deformation, which refers to deforming objects into desired shapes or poses, has long been an active research area in image processing. Specifically, face deformation aims at deforming faces to obtain new face images with an expected shape or expression. It has a number of useful applications ranging from face image beautification and medical imaging to facial animation in the entertainment industry. However, face deformation remains challenging because the human face has an extremely complex geometric form and movement mechanism, as well as subtle variations in color and texture.
The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China under Grant Nos. 41501505 and 61502354 and the Scientific Research Project of the Education Department of Hubei Province under Grant No. Q20181508.


Lots of works have tried to tackle this problem in different ways. In image blending based methods [6,9,18], to transfer the expression from one face to another, the target face with the expected expression is cut and pasted onto the original face, followed by seamless blending [5]. These methods can create quite realistic expression transfer when the two faces have similar color and texture, but they may change the identity of the original face and are not robust to inconsistencies in illumination, color, texture, resolution or occlusion. Morph-based approaches [3,19] synthesize new expressions that lie between two different facial expressions through interpolation; one limitation of these methods is that they are not capable of transferring facial expression between different people.


Fig. 1. Three examples of face shape and expression transfer with our face deformation method. The left column shows original faces, namely the faces to be deformed; the middle column shows the results of transferring facial expression and shape from the target faces to the original faces with our method; the right column shows the target faces, which contain the expected facial expression and shape.

Image deformation methods are one of the common ways of dealing with these troubles. Shen et al. [22] achieve face smilization by using an image deformation method to deform a normal face into a smiling one. Deformation methods view face deformation as a mapping from the original face to the deformed face, and solving the mapping function relies only on face contour information, which means deformation methods can achieve facial shape and expression transfer between different people regardless of variances in illumination, color, texture, resolution and even occlusion [25,27].


Fig. 2. Framework of our method. First, landmark points are extracted from the original and target faces; then the landmark points of the target face are aligned to those of the original face with a global similarity transformation; finally, the original face is deformed according to the original face landmark points and the aligned target face landmark points with our non-rigid image deformation method.

Many image deformation methods have been proposed to satisfy requirements such as intuitive user interaction and realistic deformation results. Among them, the methods that avoid unnatural local scaling and shearing are of special interest. To produce such deformations, Schaefer [21] proposed deformation based on Moving Least Squares (MLS) [8] using linear functions such as rigid transformations. The use of MLS and rigid transformations makes the deformation as-rigid-as-possible [1]. However, the deformation methods mentioned above are modeled for the deformation of general objects; the special geometrical structure features of the face are not taken into account. These geometrical structure features can provide intrinsic structure information of the original face, which is beneficial to face deformation estimation. Therefore, we need to develop a non-rigid model. To address these issues and produce more realistic and natural face deformation, we propose a new algorithm based on MLS and a non-rigid transformation [10–13,15] modeled by specifying it in a reproducing kernel Hilbert space (RKHS) [2,14]. Furthermore, taking the special geometrical structure features of the face into account, we introduce a local neighborhood structure constraint into our model as a regularization term. Benefiting from the combination of these factors, our algorithm can avoid superfluous global or local deformation and lead to more natural and realistic face deformation. Specifying the transformation in an RKHS leads to a simple closed-form solution which is computationally efficient. In general, image deformation methods are typically controlled with a set of distinct handles, including points [4], lines [3], and even polygon grids [17], which are usually chosen by users manually. In our face deformation method, we use the facial landmark points as handles. Benefiting from the impressive progress in realtime face detection and facial landmark alignment in recent years [7,20,23,24],


we use a face alignment algorithm [7] to locate accurate facial landmarks rather than choosing them manually, which makes it possible to achieve automatic facial expression and shape transfer by using one face image to drive the deformation of another face image. As shown in Fig. 1, there are three examples of facial expression transfer using our face deformation method, in which the expression and shape of the target faces are transferred to the original faces. Our contributions in this paper include the following two aspects. First, we propose a novel non-rigid model with local neighborhood structure regularization to deal with face deformation, which can capture the intrinsic geometry of the input face image and hence helps to produce realistic deformation. Second, combining the face alignment algorithm, we present a fast and automatic approach for facial expression and shape transfer with our face deformation method.

2 Facial Expression and Shape Transfer

The framework of our method is shown in Fig. 2. To achieve automatic facial expression and shape transfer between two faces, we first utilize a cascade regression tree based face alignment algorithm [7] to extract 68 accurate face landmark points from the two face images. This algorithm is quite robust to face images of sculptures, sketches, comics and paintings. Then, to eliminate the inconsistencies between the two faces in position, size and rotation, we align the target landmarks to the original landmarks with a similarity transformation, which can be solved with an ordinary Procrustes analysis. After the transformation, the target face landmarks have a similar size, pose and location to the source face, and at the same time they retain the contour details, which are of crucial importance for the face deformation. Finally, we transfer facial expression and shape from the target face to the original face by deforming the original face to the shape of the target face with our non-rigid face deformation method.
Let X = \{x_i\}_{i=1}^{n} be the original face landmark points and Y = \{y_i\}_{i=1}^{n} be the aligned target face landmark points, where x_i and y_i are column vectors representing the coordinates of the i-th landmark point, and n is the number of points in the two sets. The deformation can be viewed as a map f from the original image to the deformed image; each pixel point p in the original image has a unique map function f_p:

f_p(p) = p + g_p(p),    (1)

where g_p is the displacement function, which is solved through the interpolation of X and Y, and f_p(p) is the coordinate in the deformed image to which p is mapped. More details of the deformation will be discussed in the following sections. We can see that all the color and texture features of the deformed image are mapped from the original image and only landmark features come from the target face image; this leads to the fact that our method can avoid artifacts caused by inconsistency of the two faces in illumination, color, texture and resolution.
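The alignment step described above (an ordinary Procrustes analysis) can be sketched as follows. This is a minimal numpy implementation of a standard similarity alignment, not the authors' code; the function name is ours, and the landmark arrays are assumed to be provided, e.g. by dlib's shape predictor.

```python
import numpy as np

def similarity_align(Y, X):
    """Align target landmarks Y (n x 2) to original landmarks X (n x 2) with a global
    similarity transform (scale, rotation, translation), i.e. ordinary Procrustes
    analysis, so that pose, size and position agree before the deformation step."""
    muX, muY = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - muX, Y - muY
    U, S, Vt = np.linalg.svd(Yc.T @ Xc)        # 2x2 cross-covariance of centred points
    d = np.ones(2)
    if np.linalg.det(U @ Vt) < 0:              # guard against reflections
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt                    # rotation, applied as Yc @ R
    s = (S * d).sum() / (Yc ** 2).sum()        # least-squares scale
    t = muX - s * muY @ R                      # translation
    return s * Y @ R + t                       # aligned target landmarks

# X and Y would be 68 x 2 arrays of landmark coordinates obtained from a face
# alignment library; they are assumed to be given here.
```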

3 Local Feature Guided Non-rigid Image Deformation

In this section, we describe the details of our non-rigid face deformation algorithm. As mentioned above, the deformation is built according to two sets of distinct points. Let X be a set of control points (original face landmarks), and Y be the corresponding target points (aligned target face landmarks). We view the deformation as a function f that maps the points in the original image to those in the deformed image, and formulate the function estimation as a vector-field interpolation that should satisfy the following three properties [21]: (i) Interpolation: the points \{x_i\}_{i=1}^{n} should map directly to \{y_i\}_{i=1}^{n} under deformation; (ii) Smoothness: f should produce smooth deformations; (iii) Identity: f should be the identity function if the deformed handles \{y_i\}_{i=1}^{n} are the same as \{x_i\}_{i=1}^{n} (i.e., \forall i, x_i = y_i \Rightarrow f(x) = x with x being an arbitrary point in the image). These properties are very similar to those used in scattered data interpolation. Thus, we construct a non-rigid deformation function f satisfying these three properties with a closed-form solution.

3.1 Problem Formulation

The mathematical formulation of the deformation problem is based on Moving Least Squares (MLS) [21]. For each point p in the image, MLS is used to solve for a rigid-body transformation f_p(x) that minimizes a weighted least squares error functional:

\sum_{i=1}^{n} w_i(p)\, \|f_p(x_i) - y_i\|^2,    (2)

where w_i(p) is a non-negative weight function defined as

w_i(p) = \|p - x_i\|^{-2\alpha},    (3)

where \alpha controls the weight of each control point and \|\cdot\| denotes the Euclidean distance. The global deformation function f is obtained from a set of local functions, and is defined as f(p) = f_p(p), which is continuously differentiable. As the traditional MLS method models the deformation with a rigid transformation for general objects, to specialize it to face deformation, we generalize the formulation to the non-rigid case and take the special geometrical structure features of the face into account. To generalize this formulation to the non-rigid case, we first replace the deformation function in the MLS method with a non-rigid one. As mentioned in Eq. (1), we model the non-rigid displacement function g_p(p) by requiring it to lie within a specific functional space, namely a reproducing kernel Hilbert space (RKHS). We define an RKHS by a positive definite matrix-valued kernel \Gamma: \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^{2 \times 2} [16], and here we choose a diagonal decomposable kernel:

\Gamma(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \beta^2}\, I,    (4)

with β determining the width of the range of interaction between points and I is an identity matrix.
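The two building blocks defined so far, the MLS weights of Eq. (3) and the scalar part of the kernel of Eq. (4), are simple enough to write down directly. The sketch below is our own illustrative numpy code (the small eps guard is an added assumption to avoid division by zero at the control points).

```python
import numpy as np

def mls_weights(p, X, alpha=2.0, eps=1e-8):
    """Per-pixel MLS weights of Eq. (3): w_i(p) = ||p - x_i||^(-2*alpha)."""
    d2 = np.sum((X - p) ** 2, axis=1)
    return 1.0 / np.maximum(d2, eps) ** alpha

def gaussian_gram(X, beta=2.0):
    """Gram matrix of the scalar part of the kernel of Eq. (4):
    Gamma_ij = exp(-||x_i - x_j||^2 / beta^2); the identity factor is implicit
    because both coordinates are deformed with the same scalar kernel."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / beta ** 2)
```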


The optimal displacement function g_p(p) then takes the form:

g_p(x) = \sum_{i=1}^{n} \Gamma(x, x_i)\, c_i,    (5)

where the coefficient c_i is a 2 \times 1 vector (to be determined). To take advantage of the geometrical structure features of the face, we introduce local neighborhood structure regularization, since the local structures among neighboring feature points are very strong and stable. This is particularly beneficial to non-rigid facial movement. Therefore, we preserve the local neighborhood structure with a local geometrical constraint during deformation. In our deformation problem, we hope that the local structures in Y can be preserved after the displacement of X, which can be achieved by the following three steps [26]. First, search the k nearest neighbors for each point in X, and enforce the weight M_{ij} = 0 if x_j does not belong to the set of neighbors of x_i, where M is an n \times n neighboring weight matrix with M_{ij} summarizing the contribution of x_j to the reconstruction of x_i. Second, minimize the reconstruction errors measured by the cost function:

E(M) = \sum_{i=1}^{n} \left\| x_i - \sum_{j=1}^{n} M_{ij}\, x_j \right\|^2    (6)

under the constraint that the rows of the weight matrix sum to one: \sum_{j=1}^{n} M_{ij} = 1. The optimal neighboring weights M_{ij} can be obtained by solving a least squares problem. Third, the local geometry of each control point after the transformation f is preserved by minimizing the cost function:

\sum_{i=1}^{n} w_i(p) \left\| x_i + g_p(x_i) - \sum_{j=1}^{n} M_{ij}\left(x_j + g_p(x_j)\right) \right\|^2.    (7)

Combining the moving least squares error term in Eq. (2) and the local regularization term in Eq. (7), the optimal displacement function g_p can be solved by minimizing:

\sum_{i=1}^{n} w_i(p)\, \|x_i + g_p(x_i) - y_i\|^2 + \eta \sum_{i=1}^{n} w_i(p) \left\| x_i + g_p(x_i) - \sum_{j=1}^{n} M_{ij}\left(x_j + g_p(x_j)\right) \right\|^2,    (8)

where the positive real number \eta controls the tradeoff between the two terms. With a closed-form solution for the coefficient set C, we define our deformation function as the initial position plus the displacement function:

f(p) = p + (\Gamma_p C)^T,    (9)


where the kernel vector \Gamma_p = (\Gamma(p, x_1), \ldots, \Gamma(p, x_n)) has size 1 \times n and the coefficient matrix C = (c_1, c_2, \ldots, c_n)^T has size n \times 2. Note that this deformation function f is smooth, and as p approaches x_i, w_i(p) approaches infinity, so the function interpolates, i.e., f(x_i) = y_i. Moreover, if \forall i, x_i = y_i, then g_p(p) \equiv 0; therefore, f is the identity transformation, i.e., f(p) = p.

3.2 Closed-Form Solution

By substituting Eq. (5) into Eq. (8), it can be rewritten in the following matrix form:

E(C) = \left\| W^{1/2}(X + \Gamma C - Y) \right\|_F^2 + \eta \left\| W^{1/2}\big(X + \Gamma C - M(X + \Gamma C)\big) \right\|_F^2,    (10)

where the kernel matrix \Gamma \in \mathbb{R}^{n \times n} is called the Gram matrix with \Gamma_{ij} = e^{-\|x_i - x_j\|^2/\beta^2}, the weight matrix W is a diagonal matrix with the i-th entry determined by Eq. (3), X and Y are the control points and target points respectively, in which the i-th rows represent x_i and y_i, C is the coefficient matrix of size n \times 2, and \|\cdot\|_F denotes the Frobenius norm. Equation (10) is quadratic in C. Taking its derivative with respect to C and setting it to zero, we obtain a closed-form solution:

C = (I + \eta Q W^{-1})^{-1} \Gamma^{-1} Y - \Gamma^{-1} X,    (11)

where I is the identity matrix and Q = (I - M)^T W (I - M). With this closed-form solution for C, we can write a simple expression for the deformation function:

f(p) = p + \left[\Gamma_p \left[(I + \eta Q W^{-1})^{-1} \Gamma^{-1} Y - \Gamma^{-1} X\right]\right]^T,    (12)

where \Gamma_p is a row vector with the i-th entry \Gamma_{p,i} = e^{-\|p - x_i\|^2/\beta^2}. To deform a new face image more efficiently, we approximate the original face image with a grid, apply the deformation function (12) to each vertex, and then use bilinear interpolation in each quad of the grid. We summarize our approach in Algorithm 1.

3.3 Computational Complexity and Fast Implementation

According to the solution Eq. (12), the computational complexity is mainly determined by the time complexity of solving the weight matrix M, the Gram matrix Γ, and the inversion of a matrix of size n × n. To search the k nearest neighbors for each point in X, the time complexity is close to O((k + n) log n) by using a kd-tree [20]. According to Eq. (6), the time complexity of obtaining the weight matrix M is O(k^3 n), because each row of M can be solved separately with O(k^3) time complexity. Since the Gram matrix is of size n × n, the time complexity of computing Γ is O(n^2). The weight matrix M and the Gram matrix Γ are shared by the deformation functions of all points, so they only need to be computed once.


Algorithm 1. The Proposed Algorithm

Input: original and target face, kernel Γ, parameters k, α, β, η
Output: deformed face
1. Extract face landmark points and get the correspondences {x_i, y_i}_{i=1}^{n};
2. Construct the Gram matrix Γ based on {x_i}_{i=1}^{n};
3. Search the k nearest neighbors for each point in X;
4. Compute M by minimizing the cost function (6);
5. Approximate the original face image with a grid;
6. repeat
7.   Choose a vertex p on the grid;
8.   Compute the weights W by Eq. (3);
9.   Compute the vector Γ_p;
10.  Compute f at vertex p by using Eq. (12);
11. until all the vertices are computed;
12. The deformed face is generated by a bilinear interpolation of {f(p)}.

In contrast, the inversion of an n × n matrix in the solution (12) is different for each point in the image, so the total complexity of our method is O(k^3 n + n^2 + n^3 l). Since k ≪ n ≪ l, it can be written as O(n^3 l), where l is the number of vertices in the grid used to approximate the image. Moreover, users in general create the deformations by manipulating the target point set, and the control points are fixed. Therefore, much of Eq. (12) can be precomputed. In particular, we can rewrite Eq. (12) in the form:

f(p) = S + (V Y)^T,    (13)

where V = \Gamma_p (I + \eta Q W^{-1})^{-1} \Gamma^{-1} and S = p - (\Gamma_p \Gamma^{-1} X)^T can be precomputed, leading to a fast implementation. In this case, the time complexity of our algorithm is reduced to O(nl).
Parameter Setting: There are mainly four parameters in our method: k, α, β and η. Parameter k controls the number of nearest neighbors for the local neighborhood structure regularization. Parameter α controls the weight of each control point. Parameters β and η affect the amount of the local structure constraint: β determines how wide the range of interaction between points is, and η determines the tradeoff between the MLS error term and the local structure regularization term. We finally set k = 15, α = 2, β = 2 and η = 10 according to the parameter tuning experiments.
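The sketch below illustrates the whole per-point computation with numpy. It is not the authors' implementation: it builds the locally-linear reconstruction weights of Eq. (6), then minimizes the quadratic energy of Eq. (10) directly with a linear solve for each query point (rather than using the precomputed fast form of Eq. (13)); the function names, the small regularizers and the parameter defaults (taken from the parameter-setting paragraph) are assumptions.

```python
import numpy as np

def neighborhood_weights(X, k=15, reg=1e-3):
    """Locally-linear reconstruction weights of Eq. (6): each landmark is expressed as an
    affine combination of its k nearest neighbours; rows of M sum to one."""
    n = len(X)
    M = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nbr = np.argsort(d)[1:k + 1]                       # skip the point itself
        Z = X[nbr] - X[i]                                  # centred neighbours
        G = Z @ Z.T
        G += reg * (np.trace(G) + 1e-12) * np.eye(len(nbr))  # small conditioning term
        w = np.linalg.solve(G, np.ones(len(nbr)))
        M[i, nbr] = w / w.sum()
    return M

def deform_points(P, X, Y, k=15, alpha=2.0, beta=2.0, eta=10.0):
    """For each query point p (e.g. a grid vertex), minimize the energy of Eq. (10)
    for the coefficients C and return f(p) = p + Gamma_p C, as in Eq. (12)."""
    n = len(X)
    Gamma = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, -1) / beta ** 2)
    A = np.eye(n) - neighborhood_weights(X, k)
    out = np.empty_like(P, dtype=float)
    for j, p in enumerate(P):
        w = 1.0 / np.maximum(np.sum((X - p) ** 2, axis=1), 1e-8) ** alpha   # Eq. (3)
        W = np.diag(w)
        Q = A.T @ W @ A
        # Minimizing Eq. (10): (W + eta*Q)(X + Gamma C) = W Y, then recover C.
        Xd = np.linalg.solve(W + eta * Q, W @ Y)
        C = np.linalg.solve(Gamma, Xd - X)
        gamma_p = np.exp(-np.sum((X - p) ** 2, axis=1) / beta ** 2)          # Gamma_p
        out[j] = p + gamma_p @ C                                             # Eq. (12)
    return out
```

In practice only the grid vertices would be passed as P, and the deformed image would then be filled in by bilinear interpolation inside each quad, as described in Algorithm 1.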

4 Experiment

In this section, we test our method on different types of face images. We use the dlib library to implement the cascade regression tree based face alignment algorithm [7] and extract the landmark points for both the original and target face images. To demonstrate our method, we conduct the experiments on a self-organized dataset which includes various kinds and styles of face images. More exactly, the


dataset includes face images of men, women and children, and the styles range from images of natural faces to sculptures, sketches, comics and paintings. The face images vary in factors such as illumination, color, texture, resolution and occlusion. Here, we present some representative types of face deformation. In Fig. 3, we show 4 representative facial expression and shape transfer results obtained with our method. To evaluate the performance of our method, we also report the results of MLS [21] and the blending based face swap (face blending) [9] method as comparisons. In the figure, the first row presents the original faces and the fifth row presents the target faces, while the second, third and fourth rows are the corresponding facial expression and shape transfer (from target faces to original faces) results of our method, MLS and the face swap method.


Fig. 3. Face expression and shape transfer results of our method, MLS [21] and the face blending method [9]. The first row: original faces; the second row: results of our method; the third row: results of MLS; the fourth row: results of the face blending method; the fifth row: target faces.


The first column shows facial expression and shape transfer between two head sculpture images. We can see that both deformation methods (our method and MLS) and the face swap method have their advantages and produce natural and smooth results. The result of the face swap method shows a more significant facial expression and shape transfer since more contour detail of the target face is transferred, but at the same time it tends to change the identity of the resulting face; our method retains most of the contour detail of the original face and achieves the transfer with a smooth deformation; the MLS method performs similarly to our method but slightly worse in some details, e.g. unnatural curving in the jaw and mouth. In the second column, we consider transferring facial expression and shape from an image of a natural face to a face painting. We can see that the result of the face swap method exhibits obvious blur; the blur is caused by its blending operation, which aims at eliminating the inconsistencies in color, texture and resolution, while our method is not affected by these inconsistencies and produces a natural, smooth and clear deformation result, and for the MLS method, the unexpected zigzag again appears on the lips. To further explore the performance of the three methods, we consider facial expression and shape transfer between two face images with more significant differences in expression and in factors such as illumination, color, texture and occlusion. As shown in the third and fourth columns, both MLS and the face swap method degenerate. From the third column of the figure, we can see that all three methods can transfer facial expression and shape to a large degree, but the face swap method yields relatively poor results due to the obvious artifacts caused by the inconsistency of the two face images in color and by the occlusion of the glasses in the target face. For the result of MLS, there are some imperfections such as unnatural curving in the jaw and defects in the right brow, while the result of our method is smooth and natural. From the fourth column of the figure, we can see that there are lots of flaws in the result of the face swap method due to the inconsistency of the two faces in color and texture; in the result of MLS, there is unnatural curving in the jaw, mouth and brow; moreover, fold-over, another unexpected property of MLS, appears between the left eye and brow. Our method is not troubled by the above problems, which shows that it can achieve more natural and smooth face deformation and is quite robust to the inconsistencies between the original face and the target face. In the previous examples, we deform the entire face, namely transferring facial shape and expression at the same time; however, our proposed method can also deform only part of the face according to one's needs. Figure 4 shows two examples of transferring the external face shape and the internal facial expression separately by choosing different landmark points. The first column presents original faces and the fourth column presents target faces; the second column shows the results of internal expression transfer and the third column shows the results of external shape transfer. Figure 5 shows two examples of generating facial animation with our method by using different faces to drive the deformation of a static face image. The images on the left are the original face images, those on the top are the target face images and those on the bottom are the deformed face images. The results show that the deformation is smooth, natural and artifact-free despite the significant variances between the original face image and the target face images.


Fig. 4. Two examples of transferring the external face shape and the internal facial expression separately. The first column: original faces; the second column: results of internal expression transfer; the third column: results of external shape transfer; the fourth column: target faces.

Fig. 5. Two examples of facial animation generated by our method. Left: original faces; top: target faces; bottom: facial animation.

5 Discussion and Future Work

As mentioned above, our method relies completely on the original and target face landmark points, in both their accuracy and their number. As a result, inaccurate face landmark localization severely affects our results, and too few landmark points cause the loss of detailed contour information of the target face, which further leads to an insignificant deformation effect. Another limitation of our method is the deformation of the mouth and eyes; e.g., if we try to deform a mouth or an eye from a closed state to an open one, there will be a hole, since there is no information about these regions in the original image. We plan to use a generative model to solve this problem in future work. A potential extension of our non-rigid deformation method is that it is also applicable to the deformation of 3D point cloud objects. We will try to generalize our method to the 3D case in future work.

6 Conclusion

In this paper, we present a novel approach that transfers the shape and expression of a face in an image to those of another, regardless of variances between the two faces in factors such as illumination, color, texture, resolution and even some mild occlusion. By combining a face alignment algorithm and a non-rigid image deformation method, our method can be fully automatic, or semi-automatic for conveniently tuning a better result. The final results are realistic, natural and artifact-free.

References 1. Alexa, M., Cohen-Or, D., Levin, D.: As-rigid-as-possible shape interpolation. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 157–164 (2000) 2. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337– 404 (1950) 3. Beier, T., Neely, S.: Feature-based image metamorphosis. ACM SIGGRAPH Comput. Graph. 26(2), 35–42 (1992) 4. Bookstein, F.L.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. 11(6), 567–585 (2002) 5. Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH, pp. 313–318 (2003) 6. Garrido, P., Valgaerts, L., Rehmsen, O., Thormaehlen, T., Perez, P., Theobalt, C.: Automatic face reenactment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4217–4224 (2014) 7. Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Computer Vision and Pattern Recognition. pp. 1867–1874 (2014) 8. Levin, D.: The approximation power of moving least-squares. Math. Comput. 67(224), 1517–1531 (1998) 9. Liu, L., Liu, L., Nie, X., Feng, J., Yan, S., Yan, S.: A live face swapper. In: ACM on Multimedia Conference, pp. 691–692 (2016)

Facial Shape and Expression Transfer via Non-rigid Image Deformation

269

10. Ma, J., Zhao, J., Tian, J., Bai, X., Tu, Z.: Regularized vector field learning with sparse approximation for mismatch removal. Pattern Recognit. 46(12), 3519–3532 (2013) 11. Ma, J., Zhao, J., Tian, J., Yuille, A.L., Tu, Z.: Robust point matching via vector field consensus. IEEE Trans. Image Process. 23(4), 1706–1721 (2014) 12. Ma, J., Zhao, J., Guo, H., Jiang, J., Zhou, H., Gao, Y.: Locality preserving matching. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4492–4498. AAAI Press (2017) 13. Ma, J., Zhao, J., Jiang, J., Zhou, H.: Non-rigid point set registration with robust transformation estimation under manifold regularization. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4218–4224 (2017) 14. Ma, J., Zhao, J., Tian, J.: Nonrigid image deformation using moving regularized least squares. IEEE Signal Process. Lett. 20(10), 988–991 (2013) 15. Ma, J., Zhao, J., Tian, J., Tu, Z., Yuille, A.L.: Robust estimation of nonrigid transformation for point set registration. In: Proceedings IEEE Conference Computer Vision Pattern Recognition, pp. 2147–2154 (2013) 16. Ma, J., Zhao, J., Tian, J., Yuille, A.L., Tu, Z.: Robust point matching via vector field consensus. IEEE Trans. Image Process. 23(4), 1706–1721 (2014) 17. Maccracken, R., Joy, K.I.: Free-form deformations with lattices of arbitrary topology. In: Conference on Computer Graphics and Interactive Techniques, pp. 181–188 (1996) 18. Min, F., Sang, N., Wang, Z.: Automatic face replacement in video based on 2D morphable model. In: International Conference on Pattern Recognition, pp. 2250– 2253 (2010) 19. Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., Salesin, D.H.: Synthesizing realistic facial expressions from photographs (1998) 20. Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at 3000 fps via regressing local binary features. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1685–1692 (2014) 21. Schaefer, S., Mcphail, T., Warren, J.: Image deformation using moving least squares. ACM Trans. Graph. 25(3), 533–540 (2006) 22. Shen, S., Yamasaki, T., Aizawa, K., Sugahara, T.: Data-driven geometric face image smilization featuring moving least square based deformation. In: IEEE Third International Conference on Multimedia Big Data, pp. 220–225 (2017) 23. Xiao, S., Yan, S., Kassim, A.A.: Facial landmark detection via progressive initialization. In: IEEE International Conference on Computer Vision Workshop, pp. 986– 993 (2015) 24. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multitask learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 94–108. Springer, Cham (2014). https://doi.org/10.1007/9783-319-10599-4 7 25. Zhou, H., Kuang, Y., Yu, Z., Ren, S., Dai, A., Zhang, Y., Lu, T., Ma, J.: Non-rigid image deformation algorithm based on MRLS-TPS. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 2269–2273. IEEE (2017) 26. Zhou, H., Ma, J., Yang, C., Sun, S., Liu, R., Zhao, J.: Nonrigid feature matching for remote sensing images via probabilistic inference with global and local regularizations. IEEE Geosci. Remote Sens. Lett. 13(3), 374–378 (2016) 27. Zhou, H., Ma, J., Zhang, Y., Yu, Z., Ren, S., Chen, D.: Feature guided non-rigid image/surface deformation via moving least squares with manifold regularization. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 1063–1068. IEEE (2017)

P-Schedule: Erasure Coding Schedule Strategy in Big Data Storage System

Chao Yin, Haitao Lv(✉), Tongfang Li, Yan Liu, Xiaoping Qu, and Sihao Yuan

Jiujiang University, Jiujiang 332005, China
[email protected]

Abstract. Erasure coding technology is one of the key technologies in big data storage systems. A well-designed erasure code can not only improve the reliability of a big data storage system, but also greatly improve its performance. Most existing big data storage systems use a replica strategy, which provides good availability and real-time access but causes a great deal of data redundancy and waste of storage space. A large part of the data stored in a storage system exists in the form of cold data. In this paper, we target the cold data in big data storage systems, which does not have high requirements on availability and real-time access. We propose a scheme that supports both the replica strategy and the coding strategy, and we design the node scheduling and data addressing schemes. We select the Liberation code, which performs well for write operations, and develop the P-Schedule scheme to optimize the decoding speed. Through this series of designs, we can effectively improve the disk utilization and write speed for cold data in the big data system. The test results show that the sequential write performance of erasure coding is better than that of the replica strategy; the larger the data block is, the better the performance is.

Keywords: Big data · Erasure coding · Liberation · P-Schedule

1 Introduction

Since the birth of the Internet, data has grown in an explosive way [1], especially in recent years with the development of mobile terminals, networking and cloud computing, which make data grow faster and faster. The "China Mobile Phone Market Research Report in 2016–2017" shows that Internet users in China had reached 668 million by June 2016, and mobile phone users numbered 593 million. According to the results report released by Tencent Inc. in 2017, WeChat had 549 million monthly active users. The more data there is, the more important it becomes; the popularity of deep learning [2] and big data [3,4] in recent years illustrates this very well. Since traditional centralized storage cannot handle such a large amount of data, big data storage has come into being in this context [5]. Combining the advantages of storage systems and network systems, big data storage can provide better reliability, security and scalability. In order to guarantee the reliability of the data, a big data system needs to use some kind of backup scheme to avoid data loss when a node is damaged [6]. Existing big data systems usually adopt a replica scheme [7], which causes a great waste of storage space.


Traditional RAID [8] has used erasure coding for storage; in particular, RAID6 uses erasure coding to keep data redundancy at a very low level for the same degree of backup. There are many erasure codes similar to RAID6, such as RDP [9,10], EVENODD [11], STAR [12] and so on. However, because of the complexity, the data availability issues, and the large amount of network bandwidth occupied by erasure codes in the encoding and decoding process, large-scale erasure coding schemes have not been introduced into big data systems. We propose an erasure coding scheme to relieve the storage pressure and address data availability in cold data storage. This scheme can greatly reduce data redundancy. Considering that cold data usually occupies a large proportion of the system, this scheme focuses on optimizing the sequential write operations of cold data and greatly improves the write performance. The contributions of this paper are described as follows:
1. After comparing and analyzing the advantages and disadvantages of the replica strategy and the erasure code strategy in big data storage systems, we propose to solve the problem that cold data causes a large waste of storage space by applying erasure codes.
2. We propose to improve the cold data storage capacity of big data storage systems by using the Liberation code scheme, and we successfully design and implement erasure codes in a fault-tolerant big data storage system.
3. Detailed comparative experiments verify the feasibility and practicability of erasure codes for storing cold data in a big data storage system, showing that they can greatly reduce data redundancy without affecting the availability of data.
The rest of this paper is organized as follows. Section 2 presents related works. Section 3 introduces the theory of the P-Schedule algorithm and Sect. 4 introduces the implementation. The experimental results and evaluation are described in Sect. 5. Section 6 concludes the paper.

2 Related Works

2.1 Coding Based on Matrix

Matrices are the key component of erasure coding. Both array codes and RS codes need to be encoded and decoded in matrix form. Coding based on a matrix is essentially a dot product of matrices and vectors. Suppose there are k data blocks and m check blocks, and each block contains w bits. The matrix contains k + m rows and k columns. The elements of the matrix are integers in the finite field GF(2^w) [13]. This matrix is called the distribution matrix (DM for short). Different codes have different coding matrices. The distribution matrix is multiplied with the vector containing the data blocks to obtain a vector containing the data blocks and the check blocks. When encoding, the dot product of the erasure matrix and the data vector is computed, and the check blocks are obtained.


We know that each data block corresponds to a row of the distribution matrix. When decoding, we only need to find the rows of the distribution matrix corresponding to k undamaged blocks to form a decoding matrix. This matrix is inverted and multiplied by the vector composed of the undamaged blocks, and the damaged data blocks can then be computed and repaired.

2.2 Coding Based on Bit-Matrix

If we expand the (k + m) × k distribution matrix by a factor of w in both the row and column directions over the finite field GF(2^w), we obtain a matrix with w(k + m) rows and wk columns. We call this matrix the BDM (Binary Distribution Matrix), and the mw × kw matrix at the bottom is called the CDM (Coding Distribution Matrix). We extend the vector mentioned in the previous section to wk elements. Since each element of the matrix and the vector is a single bit, we can use XOR operations instead of dot product operations [15]: the data bits corresponding to the 1s in a BDM row are XORed together. By replacing the original dot product operations with XOR operations, the speed of encoding and decoding can be greatly improved. Moreover, the number of XOR operations is directly related to the number of 1s in the BDM, so we can estimate the performance of the encoding from the number of 1s in the BDM.
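Because every BDM entry is a single bit, encoding with a bit matrix is just a matrix-vector product over GF(2). The toy sketch below is our own illustration (the small matrix and bit values are made up, not an actual Liberation BDM), assuming numpy is available.

```python
import numpy as np

def bitmatrix_encode(bdm, data_bits):
    """Encode with a binary distribution matrix: each output bit is the XOR of the
    data bits selected by the 1s in the corresponding BDM row.  bdm has shape
    (w*(k+m), w*k) and data_bits has length w*k; entries are 0/1."""
    return (bdm @ data_bits) % 2            # XOR is addition modulo 2

# Toy example (k = 2, m = 1, w = 2): the top 4x4 block is the identity (data copied
# through), the bottom 2x4 block is the CDM that produces the two parity bits.
bdm = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1]])
data = np.array([1, 0, 1, 1])
print(bitmatrix_encode(bdm, data))          # data bits followed by the parity bits
```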

3 P-Schedule Scheme

3.1 The Principle of P-Schedule Scheme

Figure 1 shows a Bit-matrix encoding procedure for k = 3 and w = 5. Let’s analyze the calculation steps in the coding process of the matrix. The most direct way is to do 5 dot product operations, and then we convert these dot product operations into XOR operations.

Fig. 1. The encoding process based on Bit-matrix.


However, the bit-matrix is a sparse matrix. Compared with performing the encoding operations directly, it is more efficient to preprocess them. We use five-tuples to represent the encoding process, as in Eq. (1):

⟨op, sd, sb, dd, db⟩    (1)

Here op represents the operation type: 0 is a copy operation and 1 is an XOR operation. sd is the device number of the source data and sb is the bit number of the source data. dd and db represent the device number and bit number of the destination data, respectively. For convenience, we number the devices from 0 to k + m − 1: when i < k, device i is the data device D_i; when i ≥ k, it is the check device C_{i−k}. The parity computation of the bit-matrix in Fig. 2 can be expressed as a schedule, as shown in Table 1. It can be seen that the schedule algorithm can effectively reduce the number of XOR operations. Whenever we encode or decode in a bit-matrix coding system, we should convert the operations into a schedule to improve the efficiency of encoding and decoding.
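A schedule of such five-tuples can be executed with a trivial interpreter over per-device bit arrays, as in the sketch below. This is our own illustrative Python code: the example schedule simply computes a row parity and is not the Liberation schedule of Table 1.

```python
def run_schedule(schedule, devices):
    """Execute a list of <op, sd, sb, dd, db> tuples over per-device bit arrays:
    op 0 copies bit sb of device sd into bit db of device dd, op 1 XORs it in."""
    for op, sd, sb, dd, db in schedule:
        if op == 0:
            devices[dd][db] = devices[sd][sb]
        else:
            devices[dd][db] ^= devices[sd][sb]
    return devices

# Toy usage: 3 data devices (0-2) with w = 2 bits each, one parity device (3).
devices = [[1, 0], [0, 1], [1, 1], [0, 0]]
schedule = [(0, 0, 0, 3, 0), (1, 1, 0, 3, 0), (1, 2, 0, 3, 0),   # parity bit 0
            (0, 0, 1, 3, 1), (1, 1, 1, 3, 1), (1, 2, 1, 3, 1)]   # parity bit 1
print(run_schedule(schedule, devices)[3])
```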

Fig. 2. An example of bit-matrix encoding for k = 3 and w = 5.

Table 1. Schedule for bit matrix operation.

Schedule                                                   Dot product
<0,0,0,3,0>, <1,1,1,3,0>, <1,2,2,3,0>                      C0,0 = d0,0 ⊕ d1,1 ⊕ d2,2
<0,0,1,3,1>, <1,1,2,3,1>, <1,2,3,3,1>                      C0,1 = d0,1 ⊕ d1,2 ⊕ d2,3
<0,0,2,3,2>, <1,1,2,3,2>, <1,1,3,3,2>, <1,2,4,3,2>         C0,2 = d0,2 ⊕ d1,2 ⊕ d1,3 ⊕ d2,4
<0,0,3,3,3>, <1,1,4,3,3>, <1,2,0,3,3>                      C0,3 = d0,3 ⊕ d1,4 ⊕ d2,0
<0,0,4,3,4>, <1,1,0,3,4>, <1,2,0,3,4>, <1,2,1,3,4>         C0,4 = d0,4 ⊕ d1,0 ⊕ d2,0 ⊕ d2,1

3.2 Encoding

In the coding system, there are k data devices and m check devices, each of which has a word length of w bits. Usually, a w(k + m) × wk matrix over the finite field GF(2^w) is used as the erasure matrix. We select several representative erasure


codes to compare the performance of their erasure matrices, such as Liberation, EVENODD, RDP and Cauchy Reed-Solomon. For any given w and k, the number of 1s in the parity part of the Liberation code is kw + k − 1, and the number in the identity matrix at the head of the BDM is kw, so the total number of 1s in the erasure matrix is 2kw + k − 1. We know that if a row of the erasure matrix contains x ones, computing the corresponding output bit requires x − 1 XOR operations. Therefore, to obtain one parity bit, the number of XOR operations required by the Liberation code is given by Eq. (2):

\frac{2kw + k - 1 - 2w}{2w} = k - 1 + \frac{k-1}{2w}.    (2)

The optimal value of Eq. (2) is k − 1. To obtain one check bit, the number of XOR operations in the EVENODD code is given by Eq. (3):

\frac{kw + (w-1)(k-1) + kw - 2w}{2w} = \frac{3}{2}(k-1) - \frac{k-1}{2w}.    (3)

As we can see, the number of XOR operations of the EVENODD code is almost one and a half times that of the Liberation code. In addition, we can use the proportion of 1s in the parity matrix to compare the various codes: the proportion of 1s for the Liberation code is 16%, while that of the RDP code is 28% and that of the CRS code is 21%. From the above comparison, we can see that the Liberation code has the fewest 1s in its erasure matrix, which means that the encoding can be completed with fewer XOR operations in the actual coding process. Therefore, the coding scheme used in this paper is based on the Liberation code.
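The per-parity-bit counts of Eqs. (2) and (3) are easy to evaluate for concrete parameters; the tiny helper below does exactly that (the function name and the sample parameters are our own choices).

```python
def xors_per_parity_bit(k, w):
    """Per-parity-bit XOR counts following Eqs. (2) and (3)."""
    liberation = (2 * k * w + k - 1 - 2 * w) / (2 * w)
    evenodd = (k * w + (w - 1) * (k - 1) + k * w - 2 * w) / (2 * w)
    return liberation, evenodd

print(xors_per_parity_bit(k=6, w=7))   # -> (5.357..., 7.142...)
```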

3.3 Decoding

Suppose that the data on data nodes D0 and D1 is missing, with k = 5 and w = 5. In order to recover the missing data, we form the decoding matrix from the surviving rows and invert it; the first ten rows of this inverse express the data of D0 and D1 in terms of the remaining data nodes. The number of 1s in these ten rows is 134, so the number of XOR operations is 134 − 10 = 124. Now, let us check the number of 1s in the rows used to calculate d0,0 and d1,0, respectively: the former contains 16 ones while the latter contains 14. This means that 28 XOR operations are required through the above matrix operations. However, there are thirteen columns in which both of these two rows contain a 1. If we calculate d1,0 first, it only needs 13 XOR operations, and we can then calculate d0,0 from d1,0:

d_{0,0} = d_{1,0} \oplus d_{2,0} \oplus d_{3,0} \oplus d_{4,0} \oplus p_0.    (4)

Equation (4) requires only 4 XOR operations. The number of XOR operations is thus reduced from 28 to 17 in this way.
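The saving comes purely from the recovery order: rebuilding one lost bit first and then reusing it in a short parity relation for the other. The small sketch below reproduces the 28 → 17 count from the text; the symbolic labels standing in for the matrix-row terms are placeholders of ours, not the actual Liberation rows.

```python
def xor_cost(recovery_rows):
    """Total XORs if each lost bit is rebuilt from its own equation:
    a row with t source terms costs t - 1 XORs."""
    return sum(len(row) - 1 for row in recovery_rows)

# Example from the text: rebuilding d0,0 and d1,0 directly costs (16-1) + (14-1) = 28
# XORs, but once d1,0 is known, d0,0 follows from the short row-parity relation of
# Eq. (4) with only 4 XORs.
direct = [set(f"a{i}" for i in range(16)),        # stand-in for the 16-term d0,0 row
          set(f"b{i}" for i in range(14))]        # stand-in for the 14-term d1,0 row
reuse = [set(f"b{i}" for i in range(14)),         # recover d1,0 first (13 XORs)
         {"d1,0", "d2,0", "d3,0", "d4,0", "p0"}]  # then d0,0 via Eq. (4) (4 XORs)

print(xor_cost(direct), "->", xor_cost(reuse))    # 28 -> 17
```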

4 Implementation

4.1 Architecture

Our system is based on Linux, and its original backup strategy is replication. In this paper, we add erasure coding on top of the multiple-replica mechanism. The system framework is shown in Fig. 3.

Fig. 3. The architecture of the system.

The cluster management module manages the system nodes and maintains the membership among them. For example, when a system node fails or a new node is added to the system, it informs the upper layer and manages the operations needed to keep the membership information consistent among nodes. When an application initiates a read/write request, the local gateway receives the request and locates the server node where the data block resides through the consistent hash algorithm. If it is on another server node, the request is forwarded to that node. If it is on the local node, the requested data is handled either in the form of replication or in the form of erasure coding. The node management module receives the request from the gateway and performs the read/write operation according to the request type.


4.2 Node Schedule

In order to avoid bottlenecks, our system adopts a symmetric, decentralized ring architecture. A consistent hash algorithm is used to locate files on storage nodes. There is no super node or metadata server, and all nodes have equal status. The virtual nodes are evenly distributed on the ring after hashing each node's IP address and port number. When we select a storage group, we hash the user name of the client and select the first virtual node and the next N nodes according to the hash value. If a candidate node and an already selected node are on the same physical node, the candidate is given up and the next one is chosen; the selection continues until N nodes are chosen. As Fig. 4 shows, suppose we choose three nodes to form a storage group. After hashing the user name of the client, the first node obtained is A-1. We continue along the ring to select the next two nodes, B-1 and A-2. We find that node A-2 and node A-1 are on the same physical node, so we give up A-2 and continue traversing to find C-1. At this point, A-1, B-1, and C-1 are located on three different physical nodes and form a storage group.


Fig. 4. The relationship among data blocks, nodes, and hash space.

This scheme avoids the possibility that multiple members of a storage group are lost together when a single physical node fails, a situation in which an appropriate storage group could otherwise not be found.
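The selection procedure can be sketched as follows. This is our own illustrative Python code, not the system's implementation: SHA-1 stands in for whatever hash function the system uses, and the node names and group size are arbitrary.

```python
import hashlib
from bisect import bisect_left

def ring_hash(key):
    """Place a key on the hash ring; SHA-1 is an arbitrary stand-in hash here."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def build_ring(physical_nodes, vnodes_per_node=3):
    """Each physical node (e.g. 'A') contributes several virtual nodes 'A-1', 'A-2', ..."""
    return sorted((ring_hash(f"{p}-{i}"), p, f"{p}-{i}")
                  for p in physical_nodes for i in range(1, vnodes_per_node + 1))

def select_storage_group(ring, user, n=3):
    """Walk clockwise from hash(user), taking virtual nodes but skipping any whose
    physical node is already in the group, until n distinct physical nodes are chosen."""
    start = bisect_left([h for h, _, _ in ring], ring_hash(user))
    group, used = [], set()
    for step in range(len(ring)):
        _, phys, vname = ring[(start + step) % len(ring)]
        if phys not in used:
            used.add(phys)
            group.append(vname)
            if len(group) == n:
                break
    return group

ring = build_ring(["A", "B", "C", "D"])
print(select_storage_group(ring, "alice", n=3))   # three virtual nodes on distinct physical nodes
```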

5 Evaluation

5.1 Experimental Setup

The test environment consists of three hardware servers, on which a big data storage cluster is built with virtual machines for testing. The hardware configuration of each server is shown in Table 2. The most commonly used indicator for measuring the read and write performance of a storage system is IOPS, and all of the tests in this paper use it as the test standard. In addition, we use fixed gradient values for the size of


data blocks: 4 K, 64 K, 512 K and 1024 K, respectively. In the read and write performance tests, we test the big data storage system under different backup modes: triple replication and different erasure codes. In order to ensure the same fault tolerance as replication, we set the ratio of k to m to 4:2 in the erasure code mode.

Table 2. The parameters of the test server

Name         Parameter
CPU          Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz × 2
Memory       Qimonda 1333 MHz 4 GB × 2
System Disk  WDC WD1003FBYX-0 1 TB 7200 rpm
Data Disk    ST1000DM003-9YN1 1 TB × 4 7200 rpm
SSD          Seagate 600 SSD 120G MLC

Parameter Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz  2 Qimonda 1333 MHz 4 GB  2 WDC WD1003FBYX-0 1 TB 7200 rpm ST1000DM003-9YN1 1 TB  4 7200 rpm Seagate 600 SSD 120G MLC

Read Performance Tests

Read operations can be divided into sequential reads and random reads according to the location of each read. A sequential read begins at a certain location and reads forward continuously until an end position. A random read randomly selects a location, reads a small amount of data, and then jumps to another random location to continue reading. In theory, the speed of sequential reading is much better than that of random reading, especially in systems using disks as the storage medium.

Fig. 5. IOPS in reading.

In Fig. 5, the horizontal axis represents the size of the data block, and the vertical axis represents the sum of the IOPS values of the nodes. As we can see, IOPS decreases overall as the data block grows, because the time to read each block increases with the block size, so the overall IOPS of the system decreases. In addition, for the same block size, the IOPS of sequential reading under the erasure code strategy is lower than that of the replication mode, and the smaller the data block, the greater the IOPS gap between the two modes. When the block size is 4 K, the value in the erasure code mode is about 25% lower than that in the replica mode; when the block size is 1024 K, it is only about 10.4% lower.

5.3 Write Performance Tests

Figure 6 shows the IOPS of sequential writes. We can see that, as an overall trend, the IOPS value becomes smaller as the data block becomes larger. However, unlike read operations, the IOPS under the erasure code strategy is larger than that under the replica strategy for write operations. Moreover, as the data block becomes larger, the IOPS under the erasure code strategy becomes almost twice that of the replica mode. This is because, for the same data write request, the amount of data written under the replica strategy is much higher than that under the erasure code strategy. For each 1 K write request, the replica strategy writes 3 K of data and consumes 2 K of network bandwidth, while the erasure code strategy only writes 1.5 K of data and consumes 1.25 K of network bandwidth.

Fig. 6. IOPS in writing.

6 Conclusion

Erasure coding is an effective way to reduce data redundancy: it can achieve the same fault tolerance as the replica strategy with far lower data redundancy. However, a big data system using an erasure code strategy lacks some data availability, which conflicts with the real-time access required by users. Considering the massive waste of storage space caused by cold data and its availability requirements, this paper presents a special schedule strategy for erasure-coded storage of cold data. We propose to improve the cold data storage capacity of big data storage systems by using the Liberation code. The experiments show that erasure codes can greatly reduce data redundancy without affecting the availability of data. Decoding in an erasure code system consumes a large amount of network bandwidth, which is also a factor restricting the use of erasure codes in big data storage systems. Although regenerating codes have been proposed to solve the network bandwidth problem to some extent, this is achieved by sacrificing storage efficiency. How to optimize the decoding bandwidth without sacrificing storage efficiency is also a future research direction.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 61662038), the Science and Technology Project of the Jiangxi Provincial Department of Education (No. GJJ151081), the Visiting Scholar Funds of the China Scholarship Council, and the Jiangxi Association for Science and Technology.



Answer Aggregation of Crowdsourcing Employing an Improved EM-Based Approach

Ran Zhang, Lei Liu, Lizhen Cui, Wei He, and Hui Li

Shandong University, Jinan, Shandong, China
[email protected], [email protected]

Abstract. Crowdsourcing platforms are frequently employed to collect answers from numerous participants on the Internet, e.g., Amazon Mechanical Turk. Different participants may give different answers to the same question, which leads to unexpected aggregated answers. The accuracy of aggregated answers depends on answer quality, and answer quality varies with the skill level of participants. In crowdsourcing, participants are referred to as workers. Existing studies usually characterize worker quality by worker skill alone. However, the personality features of individual persons, e.g., worker emotion and worker intent, may have a significant impact on the quality of their answers. Therefore, aggregating answers without taking the personality characteristics of persons into account may lead to unexpected results. To fill this gap, this paper employs an improved EM-based approach for answer aggregation that is based on the answer data of workers and considers personality characteristics. The approach not only aggregates answers but also simultaneously estimates the skill level of each worker, worker emotion, worker intent, and the difficulty of the task. Last but not least, the approach is verified on the real-world dataset Affect Text and on simulated datasets.

Keywords: Crowdsourcing · Worker skill · Task difficulty · Worker quality · Personality characteristics · EM-based approach · Answer aggregation

1 Introduction

Crowdsourcing is a distributed problem-solving approach that aids computers in completing tasks that computers cannot solve on their own [1]. There are many crowdsourcing platforms, e.g., Amazon Mechanical Turk, Crowd Flower, www.zbj.com, and www.weichaishi.com. A crowdsourcing platform publishes tasks, e.g., sentiment labeling tasks [17], from requesters and collects answers from workers. Hundreds of workers on such platforms can accept tasks and send back the corresponding answers. Based on the collected answers, aggregated answers can be obtained through an aggregation algorithm. The accuracy of the aggregated answers depends on answer quality, which varies due to differences in the skill level, intent, and emotion of workers. Existing works usually study the influence of skill level and worker intent on answer quality and ignore the personality characteristics of persons, e.g., worker emotion. Therefore, in order to obtain the aggregated answers, this paper takes worker skill, worker emotion, and worker intent into consideration. Besides, the difficulty of the task is also taken into account.


Personality characteristics of persons are important for modeling worker quality. Several factors have an impact on workers' answers. Each task has its own difficulty level, which affects the judgment of workers. Due to their different characteristics, workers differ in worker skill and worker intent. In addition, emotion, as a personality characteristic of persons, also affects the accuracy of workers' answers [11]. Some researchers [8] find that workers in positive emotion are more productive than workers in negative emotion. Although many studies have considered worker quality, studies that take worker emotion into account are hard to find. In this paper, in order to aggregate answers from workers, worker emotion is taken into consideration.

The improved answer aggregation approach in this paper is based on the EM algorithm. The method considers the influence of worker intent, worker emotion, worker skill, and task difficulty on workers' answers, and the aggregated answers are obtained accordingly. The EM-based method in this paper is called the four-parameter EM approach. In this model, workers with different emotions and different intents are formulated based on workers' behavior. This paper defines three types of workers: workers who are non-malicious and in positive emotion, workers who are non-malicious but in negative emotion, and workers who are malicious and answer only for money. A generalized Expectation-Maximization algorithm [9, 10] is used to perform parameter estimation. The improved EM-based approach is divided into two steps, which are performed iteratively until the parameter set converges: (1) Expectation step: use the existing estimates of worker skill, worker emotion, worker intent, and task difficulty in the parametric probabilistic model to calculate the expectation of the aggregated answers; (2) Maximization step: find the worker skill, worker intent, worker emotion, and task difficulty that maximize the expected log-likelihood. When the parameters converge, the probability distribution of the aggregated answers is also constant, so the aggregated answers of the tasks can be obtained, as well as the worker skill, worker intent, worker emotion, and task difficulty at that time. Experiments show that this method is effective.

The contributions of this paper are as follows:
• This paper aggregates answers considering personality characteristics of persons, e.g., worker emotion, which existing works hardly take into account and which is challenging to evaluate. To this end, a method for evaluating personality characteristics of persons is provided, which considers the influence of worker emotion;
• This paper develops an improved EM algorithm with four parameters: worker emotion, worker intent, worker skill, and task difficulty. The method combines these four parameters to evaluate worker quality and meanwhile obtains the aggregated answers.

The rest of the paper is organized as follows. Section 2 reviews related work on answer aggregation and quality control. Section 3 describes the improved EM-based method with four parameters in detail. Section 4 presents the validation of the improved method on a simulated dataset and a real-world dataset, a sub-dataset of Affect Text. Section 5 concludes this paper.

2 Related Work

Collecting answers from numerous workers through crowdsourcing platforms has been widely adopted. To obtain aggregated answers, an answer aggregation approach is often used. Majority voting [5, 6] is a simple approach to obtain the aggregated answer. However, majority voting assumes that all workers have the same accuracy. In fact, workers have different individual accuracy, since they differ in characteristics such as worker skill, worker intent, and worker emotion.

To fill this gap, methods considering the different characteristics of workers have emerged. In [3], Cao et al. utilize weighted majority voting to aggregate workers' answers, which considers the accuracy of workers based on the history of their answers. In [7], Demartini et al. propose a probabilistic model based on a factor graph that considers workers' answers and the accuracy of answers. Both [3] and [7] use historical answers of workers to analyze their individual accuracy. In [13], Koulougli et al. propose an accumulative weighted voting method that takes into account both uncertainty and skill levels. In [12], Sun et al. propose a probabilistic model for the quantitative crowdsourcing problem that considers changes in worker ability so as to achieve better quality control. Both [13] and [12] consider worker skills in analyzing the accuracy of workers' answers.

In addition, worker intention also affects workers' answers. For example, spammers in a crowdsourcing system may choose answers randomly for the financial reward. To distinguish spammers from non-spammers, Kurve et al. [2] add worker intention as a parameter to an EM-based algorithm for identifying malicious workers. Kurve assumes that malicious workers choose wrong answers when they know the correct one or otherwise choose answers randomly. In [14], Moayedikia et al. propose a reliability estimation algorithm that relies on a Gaussian process model and a bee colony algorithm to distinguish spammers from non-spammers. Both [2] and [14] consider worker intention in analyzing workers' answers.

Beyond the factors mentioned above, other factors also affect the accuracy of workers' answers. In [15], Wu et al. propose a novel ground truth inference algorithm that is based on the EM algorithm and aggregates answers; it considers the reliability of each worker and the difficulty of each instance. The algorithm GLAD [16] takes the difficulty of instances into consideration and adopts a logistic regression model for inference. Both [15] and [16] consider the influence of task difficulty on the answer accuracy of workers. In [11], Yu et al. leverage the relationship between worker emotion and productivity to schedule workers' work time for high answer quality; Yu considers the personality characteristics of workers. Some researchers [8] find that workers in positive emotion are more productive than workers in negative emotion. In this paper, factors such as worker skill, task difficulty, worker intent, and worker emotion are all taken into account to aggregate answers.

3 Crowdsourcing EM-Based Approach with Four Parameters

3.1 Four Parameters in the Improved EM-Based Approach

The four parameters of the algorithm are introduced here: worker skill, task difficulty, worker intent, and worker emotion. Continuous parameters are used to describe worker skill and task difficulty: kj ∈ (−∞, +∞) represents the skill of worker j, and di ∈ (−∞, +∞) represents the difficulty level of task i. Binary parameters are used to denote worker emotion and worker intent: (mj, wj) ∈ {(1, 1), (0, 1), (1, 0), (0, 0)}, where mj indicates the emotion of worker j and wj indicates the intent of worker j. As can be seen in Table 1, worker types are defined by the combination of worker emotion and worker intent. (1, 1) denotes workers who are in positive emotion and have non-malicious intent; they are called PN workers. (0, 1) denotes workers who are in negative emotion but have non-malicious intent; they are called NN workers. (1, 0) and (0, 0) indicate that the worker is malicious; whether the worker emotion is positive or negative, they are called MM workers. PN workers tend to answer questions correctly at the best of their ability. NN workers tend to answer as accurately as possible, but negative emotion reduces their accuracy. The influence factor q is used to indicate the extent of the effect of workers' emotion. MM workers tend to answer at random, regardless of their emotion.

Table 1. Types of workers

                        Worker intent
Worker emotion          Non-malicious: 1    Malicious: 0
Positive: 1             PN                  MM
Negative: 0             NN                  MM

3.2 Expectation Step of the Improved EM-Based Approach

Suppose a group of workers answer Tn non-probe tasks. Each worker answers at least one task and each task receives at least one answer from workers. A probe task is a task with known ground truth. Let Mi denote the number of workers answering task i. There are Tp probe tasks that are also published to workers, but workers do not know that these are probe tasks; therefore, workers with low accuracy cannot escape detection by the system. Take {1, 2, 3, …, Tp} as the index set of probe tasks and {Tp + 1, Tp + 2, Tp + 3, …, Tp + Tn} as the index set of non-probe tasks. The answer for task i is chosen from the set of options Oi ≡ {1, 2, 3, …, Ci}. Let zi ∈ Oi be the ground truth answer to task i, and let rij ∈ Oi be the answer for task i from worker j. Referring to the stochastic generative model for generating workers' answers based on worker behavior [2], the parameter set of the four-parameter model is defined as Ω = {{(wj, mj, kj, qj) ∀j}, {di ∀i}}.

In order to define the probability model of the ground truth, this paper gives the probability distribution of workers' answers based on the answering behavior of workers of different types. Based on the difference between the worker skill and the difficulty of the task, the sigmoid function 1/(1 + e^{−(kj − di)}) is used to model the probability that a worker answers a task correctly. qj ∈ (0, 1) indicates the probability that negative emotion affects the accuracy of worker j.

PN Workers
The probability mass function φ is defined as (1). It expresses the probability that the answer of the worker is rij when the emotion of the worker is positive and the intent is non-malicious. In this scenario, the probability of answering correctly is decided only by the difference between the worker skill and the task difficulty: the greater the value of kj − di, the greater the probability of answering correctly, and the closer φ tends to 1.

φ(rij = l | Ωij, mj, wj = (1, 1), zi) =
    { 1/(1 + e^{−(kj − di)}) + (1/Ci) · e^{−(kj − di)}/(1 + e^{−(kj − di)}),   for l = zi
    { (1/Ci) · e^{−(kj − di)}/(1 + e^{−(kj − di)}),                            otherwise        (1)
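The following minimal Python sketch evaluates the PN-worker probability mass function in Eq. (1); the function name and argument layout are illustrative rather than taken from the paper.

```python
import math

def phi_pn(l, z_i, k_j, d_i, c_i):
    """Probability that a PN worker gives answer l to a task with truth z_i (Eq. 1)."""
    sigma = 1.0 / (1.0 + math.exp(-(k_j - d_i)))   # chance the worker knows the answer
    guess = (1.0 / c_i) * (1.0 - sigma)            # uniform guess over the c_i options
    return sigma + guess if l == z_i else guess

# A skilled worker (k_j = 3) on an easy task (d_i = 0) answers correctly with high probability.
print(phi_pn(l=1, z_i=1, k_j=3.0, d_i=0.0, c_i=4))   # about 0.96
```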

NN Workers
The probability mass function φ is defined as (2).

φ(rij = l | Ωij, mj, wj = (0, 1), zi) =
    { 1/(1 + e^{−(kj − di)}) + (1/Ci) · e^{−(kj − di)}/(1 + e^{−(kj − di)}),   for l = zi
    { qj/(1 + e^{−(kj − di)}) + (1/Ci) · e^{−(kj − di)}/(1 + e^{−(kj − di)}),  otherwise        (2)

When a worker is in negative emotion but has non-malicious intent, his ability to answer correctly is partly reduced. The symbol q is used to denote this influence probability.

MM Workers
When a worker has malicious intent, no matter whether he is in positive or negative emotion, he tends to choose the answer randomly. The probability mass function φ is defined as:


φ(rij = l | Ωij, wj = 0, zi) =
    { qj/(1 + e^{−(kj − di)}) + (1/Ci) · e^{−(kj − di)}/(1 + e^{−(kj − di)}),                    for l = zi
    { (1/(Ci − 1)) · 1/(1 + e^{−(kj − di)}) + (1/Ci) · e^{−(kj − di)}/(1 + e^{−(kj − di)}),      otherwise        (3)

Based on the generative model above and Bayes' rule, the posterior probability mass function is given in (4), which can also be called the probability distribution of the ground truth of non-probe tasks:

Pi(Zi = c | X, Ω^t) = ∏_{j=1}^{Mi} φ(rij | Ωij^t, zi = c) / ∑_{l=1}^{Ci} ∏_{j=1}^{Mi} φ(rij | Ωij^t, zi = l),
    ∀c, ∀i ∈ {Tp + 1, …, Tp + Tn},        (4)

where Ω^t is the current parameter set, c ∈ {1, 2, 3, …, Ci}, and X denotes the observed data containing workers' answers to every task and the ground truth answers of the probe tasks. This paper treats the ground truth of the non-probe tasks (Zn) as latent variables in our EM model.

3.3 Maximization Step of the Improved EM Algorithm

There are observed data and unobserved data here: the observed data are denoted as X above, and the unobserved data are the ground truths of the non-probe tasks Zn. X and (X, Zn) are called the incomplete data and the complete data, respectively. To estimate the parameter set Ω, the incomplete-data log-likelihood should be calculated. The incomplete log-likelihood is obtained by iteratively maximizing the expectation of the complete log-likelihood Q(Ω|Ω^t). Based on the generation model above and P(Zi = c | X, Ω^t) obtained in the expectation step, the expected complete-data log-likelihood Q(Ω|Ω^t) can be written as (5).


Q(Ω|Ω^t) = E[log LC | X, Zn, Ω^t]
 ∝ ∑_{i=1}^{Tp} ∑_{j=1}^{Mi} log φ(rij | Ωij^t, zi) + ∑_{i=Tp+1}^{Tp+Tn} ∑_{j=1}^{Mi} ∑_{c=1}^{Ci} P(zi = c) log φ(rij | Ωij^t, zi = c)
 = ∑_{i=1}^{Tp} ∑_{j: rij = zi} { wj [ mj log φ(rij | Ωij^t, mj, wj = (1,1), zi = rij) + (1 − mj) log φ(rij | Ωij^t, mj, wj = (0,1), zi = rij) ] + (1 − wj) log φ(rij | Ωij^t, mj, wj ∈ {(0,0),(1,0)}, zi = rij) }
 + ∑_{i=1}^{Tp} ∑_{j: rij ≠ zi} { wj [ mj log φ(rij | Ωij^t, mj, wj = (1,1), zi ≠ rij) + (1 − mj) log φ(rij | Ωij^t, mj, wj = (0,1), zi ≠ rij) ] + (1 − wj) log φ(rij | Ωij^t, mj, wj ∈ {(0,0),(1,0)}, zi ≠ rij) }
 + ∑_{i=Tp+1}^{Tp+Tn} ∑_{c=1}^{Ci} ∑_{j: rij = c} P(zi = c) { wj [ mj log φ(rij | Ωij^t, mj, wj = (1,1), zi = c) + (1 − mj) log φ(rij | Ωij^t, mj, wj = (0,1), zi = c) ] + (1 − wj) log φ(rij | Ωij^t, mj, wj ∈ {(0,0),(1,0)}, zi = c) }
 + ∑_{i=Tp+1}^{Tp+Tn} ∑_{c=1}^{Ci} ∑_{j: rij ≠ c} P(zi = c) { wj [ mj log φ(rij | Ωij^t, mj, wj = (1,1), zi ≠ c) + (1 − mj) log φ(rij | Ωij^t, mj, wj = (0,1), zi ≠ c) ] + (1 − wj) log φ(rij | Ωij^t, mj, wj ∈ {(0,0),(1,0)}, zi ≠ c) }.        (5)

Since kj ∀j and di ∀i are continuous parameters, (mj, wj) ∀j are discrete parameters, and there are 3^M (three types of workers) crowd configurations, it is infeasible to find a closed-form solution for Ω^{t+1} = arg max_Ω Q(Ω|Ω^t). Therefore, the maximization step of this improved EM algorithm is divided into two sub-steps, continuous parameter calculation and discrete parameter calculation, which are maximized iteratively.

Discrete Parameters Calculation

In this sub-step, a closed-form solution for (mj, wj) ∀j is found with kj ∀j, qj ∀j, and di ∀i fixed. (m̃j, w̃j) denotes the result for worker emotion and worker intent in this discrete parameter calculation sub-step, and Ω̃ denotes the result of the previous continuous parameter calculation sub-step. A genetic algorithm is used here.

Continuous Parameters Calculation

The expectation of the complete log-likelihood E[log LC \ {(mj, wj)} | Xj, {(mj, wj)}] is calculated in this sub-step to find the kj ∀j, qj ∀j, and di ∀i that maximize it, with (mj, wj) ∀j fixed. The values of (mj, wj) ∀j come from the previous discrete parameter calculation sub-step. Gradient ascent is used to find a local maximum for kj ∀j, qj ∀j, and di ∀i, and it is performed until the change in likelihood between two gradient steps falls


below a certain threshold. The two sub-steps are performed iteratively until the parameters in Ω converge, and the results of the maximization step are then stored in Ω^{t+1}. The expectation step and the maximization step are performed iteratively until the change in the expected likelihood between two iterations falls below a certain threshold. The EM algorithm with four parameters is guaranteed to find a local solution.
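The control flow of this alternation can be summarized by the schematic Python sketch below; e_step and m_step are hypothetical placeholders standing in for Eq. (4) and the two maximization sub-steps, so this shows only the loop structure, not the paper's estimator.

```python
# Schematic control flow of the four-parameter EM loop described above.
# e_step and m_step are placeholders; their signatures are assumptions,
# not the paper's implementation.
def run_em(answers, params, e_step, m_step, tol=1e-4, max_iter=100):
    prev_ll = float("-inf")
    posterior = None
    for _ in range(max_iter):
        posterior, log_lik = e_step(answers, params)   # P(z_i = c | X, params)
        params = m_step(answers, posterior, params)    # discrete + continuous sub-steps
        if abs(log_lik - prev_ll) < tol:               # stop when the likelihood stalls
            break
        prev_ll = log_lik
    return params, posterior
```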

4 Validation

Validation on Simulated Dataset
Simulated data generated by the above generation model are used. First, a group of 10 workers is generated with kj ∼ N(1, 1000); 10% of the workers have malicious intent, and 10% of the workers with non-malicious intent are in negative emotion. This paper only considers the effect of negative emotion on non-malicious workers. The tasks are generated with di ∼ N(20, 500). The emotion influence factor qj ∈ (0, 1) follows a normal distribution. The ground truth for each task is chosen randomly from {1, 2, 3}. To make the results easier to observe, the estimates of worker skill and task difficulty are each fitted by the least squares method. Based on the above probability distributions of worker skill and task difficulty, comparison values of the worker skill and task difficulty are generated separately; they are the actual values used to generate the simulated data. The Four-parameter algorithm and the Three-parameter algorithm (the EM-based algorithm of Kurve) are used to obtain the estimated values of worker skill and task difficulty. Figure 1 shows the comparison of the estimated values of worker skill with the actual values, and Fig. 2 shows the comparison of the estimated values of task difficulty with the actual values. The solid line in each figure denotes the values estimated by the Four-parameter algorithm, the dotted line denotes the values estimated by the Three-parameter algorithm, and the remaining line denotes the actual values. Figures 1 and 2 show highly consistent trends between the estimated values of worker skill and task difficulty and the actual values.
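A minimal sketch of this simulation setup is given below. The number of tasks and the uniform draw of qj are assumptions made for illustration; the paper draws qj from a normal distribution restricted to (0, 1) and does not state whether the second parameter of N(·,·) is a variance or a standard deviation, so a variance is assumed here.

```python
import random

# Sketch of the simulated-data setup described above: 10 workers with
# k_j ~ N(1, 1000) and tasks with d_i ~ N(20, 500).  The task count (100)
# and the uniform q_j draw are illustrative assumptions.
random.seed(0)
NUM_WORKERS, NUM_TASKS, OPTIONS = 10, 100, [1, 2, 3]

workers = []
for j in range(NUM_WORKERS):
    skill = random.gauss(1, 1000 ** 0.5)          # variance 1000 assumed
    malicious = random.random() < 0.10            # 10% malicious workers
    negative = (not malicious) and random.random() < 0.10
    q = random.random()                           # emotion influence factor in (0, 1)
    workers.append({"skill": skill, "malicious": malicious,
                    "negative": negative, "q": q})

tasks = [{"difficulty": random.gauss(20, 500 ** 0.5),
          "truth": random.choice(OPTIONS)} for _ in range(NUM_TASKS)]
```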

Fig. 1. The comparison of estimated values of worker skill with actual values


Fig. 2. The comparison of estimated values of task difficulty with actual values

To better explain the two figures, two indicators are used: NSE and RMSE. NSE (Nash-Sutcliffe efficiency coefficient) measures how well the values estimated by the Four-parameter algorithm fit the actual values; as long as the NSE value is between 0 and 1, the fitting effect of the model is credible. For worker skill, the NSE value of the Four-parameter algorithm is 0.49; for task difficulty, its NSE is 0.58. This indicates that the Four-parameter algorithm is credible for estimating task difficulty and worker skill, so credible estimates of worker skill and task difficulty can be obtained through the Four-parameter algorithm. As shown in Table 2, the RMSE (root-mean-square error) of the Four-parameter algorithm for worker skill is 7.04 while that of the Three-parameter algorithm is 11.33, which indicates that the worker skill estimated by the Four-parameter algorithm fits the actual values well. For task difficulty, the RMSE of the Four-parameter algorithm is 10.80 and that of the Three-parameter algorithm is 11.40, which likewise indicates that the task difficulty estimated by the Four-parameter algorithm fits the actual values well.

Table 2. RMSE of estimated values and actual values

Parameters                   Worker skill (Fig. 1)   Task difficulty (Fig. 2)
Four-parameter algorithm     7.04                    10.80
Three-parameter algorithm    11.33                   11.40
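For reference, the two indicators can be computed as in the short Python sketch below; these are the standard definitions of NSE and RMSE and are not necessarily the exact code used in the experiments.

```python
import math

def nse(estimated, actual):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the actual values."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((e - a) ** 2 for e, a in zip(estimated, actual))
    var = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / var

def rmse(estimated, actual):
    """Root-mean-square error between estimates and actual values."""
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual))

print(nse([1.1, 1.9, 3.2], [1, 2, 3]))    # close to 1 for a good fit
print(rmse([1.1, 1.9, 3.2], [1, 2, 3]))   # small for a good fit
```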

Accuracy measures how well the algorithm estimates the aggregated answers. The accuracy of the Four-parameter algorithm is 0.86, which is as good as that of the Three-parameter algorithm, indicating that the Four-parameter algorithm can obtain accurate aggregated answers. The experimental results show that the Four-parameter algorithm is not worse than the Three-parameter algorithm at estimating worker skill and task difficulty, as well as at aggregating answers, in the context of this paper.

Validation on Real-World Dataset
Affect Text Dataset. The Affective Text dataset was collected as a sentiment labeling task proposed by Strapparava et al. [17]. They employ workers to rate the title of a piece of


news for a few types of emotions and to give a comprehensive score (Valence) indicating the overall emotion of the news. Snow et al. [18] select a set of 100 samples from the SemEval set and obtain 1000 scores for each emotion and for Valence. For each emotion, workers provide a score in the range [0, 100]; for Valence, workers provide a score in the range [−100, 100]. This paper maps the Valence score to two, three, four, and five classes respectively, obtaining four sub-datasets: the Two-classes, Three-classes, Four-classes, and Five-classes datasets. Figure 3 shows the comparison between the developed method and ZenCrowd [7] and KOS [4], as well as the baseline Majority Voting (MV). The dataset used is the Valence sub-dataset of the Affect Text dataset which, as mentioned, is divided into four sub-datasets based on the number of classes. In Fig. 3, the Four-parameter algorithm performs well on three, four, and five classes. The experimental comparisons show that the Four-parameter algorithm performs well on multi-category sentiment labeling tasks.

Fig. 3. The comparison of accuracy of the aggregated answers between MV, ZenCrowd, KOS and OurAlgo.

5 Conclusions

Answer aggregation considering worker quality is a useful tool for aggregating workers' answers. The aggregated answers depend on answer quality, which varies with worker skill, worker intent, and worker emotion. In this paper, an improved answer aggregation method based on the EM algorithm is used to obtain the aggregated answers. The improved EM-based method not only obtains aggregated answers considering worker quality but also simultaneously estimates the worker skill, worker emotion, worker intent, and the difficulty of each task. Taking into account these factors that affect workers' answers allows a more accurate analysis of the probability that workers answer correctly, and thus aggregated answers that reflect worker quality. Verification is performed on a simulated dataset and a real-world dataset. Compared with other methods, the improved method is efficient on multi-category sentiment labeling tasks and obtains more accurate results in that scenario.


Acknowledgment. This work is partially supported by National Key R&D Program No. 2017YFB1400100, SDNFSC No. ZR2018MF014.

References
1. Feng, J.H., Li, G.L., Feng, J.H.: A survey on crowdsourcing. Chin. J. Comput. 38(9), 1713–1726 (2015)
2. Kurve, A., Miller, D., Kesidis, G.: Multicategory crowdsourcing accounting for variable task difficulty, worker skill, and worker intention. IEEE Trans. Knowl. Data Eng. 27(3), 794–809 (2014)
3. Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask? Proc. VLDB Endow. 5(11), 1495–1506 (2012)
4. Karger, D.R., Oh, S., Shah, D.: Iterative learning for reliable crowdsourcing systems (2011)
5. Lee, J., Cho, H., Park, J.W., Cha, Y.R., Hwang, S.W., Nie, Z., Wen, J.R.: Hybrid entity clustering using crowds and data. VLDB J. 22(5), 711–726 (2013)
6. Park, H., Garcia-Molina, H., Pang, R., Polyzotis, N., Parameswaran, A., Widom, J.: Deco: a system for declarative crowdsourcing. Proc. VLDB Endow. 5(12), 1990–1993 (2012)
7. Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: International Conference on World Wide Web, pp. 469–478. ACM (2012)
8. Oswald, A., Proto, E., Sgroi, D.: Happiness and productivity. Soc. Sci. Electron. Publ. 33(4), 789–822 (2008)
9. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood estimation from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
10. Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., et al.: Learning from crowds. J. Mach. Learn. Res. 11(2), 1297–1322 (2010)
11. Yu, H., Shen, Z.J., Fauvel, S., Cui, L.Z.: Efficient scheduling in crowdsourcing based on workers' emotion. In: IEEE International Conference on Agents, pp. 121–126. IEEE Computer Society (2017)
12. Sun, H., Hu, K., Fang, Y., Song, Y.: Adaptive result inference for collecting quantitative data with crowdsourcing. IEEE Internet Things J. 4(5), 1389–1398 (2017)
13. Koulougli, D., Hadjali, A., Rassoul, I.: Leveraging human factors to enhance query answering in crowdsourcing systems. In: IEEE Tenth International Conference on Research Challenges in Information Science, pp. 1–6. IEEE (2016)
14. Moayedikia, A., Ong, K.L., Boo, Y.L., Yeoh, W.: Bee colony based worker reliability estimation algorithm in microtask crowdsourcing. In: IEEE International Conference on Machine Learning and Applications, pp. 713–717. IEEE (2017)
15. Wu, M., Li, Q., Zhang, J., Cui, S., Li, D., Qi, Y.: A robust inference algorithm for crowd sourced categorization. In: International Conference on Intelligent Systems and Knowledge Engineering, pp. 1–6 (2017)
16. Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., Movellan, J.: Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: International Conference on Neural Information Processing Systems, vol. 46, pp. 2035–2043. Curran Associates Inc. (2009)
17. Strapparava, C., Mihalcea, R.: SemEval-2007 task 14: affective text. In: International Workshop on Semantic Evaluations, pp. 70–74. Association for Computational Linguistics (2007)
18. Snow, R., O'Connor, B., Jurafsky, D., Ng, A.Y.: Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In: Conference on Empirical Methods in Natural Language Processing (2008)

Internet of Things and Cloud Computing

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data Using Apache Spark in Cloud

Cheng Yang1, Weidong Bao1, Xiaomin Zhu1,2(B), Ji Wang1, and Wenhua Xiao1,3

1 National University of Defense Technology, Changsha, China
[email protected]
2 State Key Laboratory of High Performance Computing, Changsha, China
3 Academy of Military Sciences, Beijing, China

Abstract. In the field of signal processing, the Fast Fourier Transform (FFT) is a widely used algorithm to transform signal data from the time domain to the frequency domain. Unfortunately, with the exponential growth of data, traditional methods cannot meet the demand for large-scale computation on these big data because of three main challenges of large-scale FFT, i.e., big data size, real-time data processing, and high utilization of compute resources. To satisfy these requirements, an optimized FFT algorithm in the Cloud is badly needed. In this paper, we introduce a new method to conduct FFT in the Cloud with the following contributions: first, we design a parallel FFT algorithm for large-scale signal data in the Cloud; second, we propose a MapReduce-based mechanism to distribute data to compute nodes using a big data processing framework; third, an optimal method of distributing compute resources is implemented to accelerate the algorithm by avoiding redundant data exchange between compute nodes. The algorithm is designed in the MapReduce computation framework and contains three steps: data preprocessing, local data transform, and parallel data transform to integrate the processing results. The parallel FFT is implemented in a 16-node Cloud to process real signal data. The experimental results reveal an obvious improvement in the algorithm speed: our parallel FFT is approximately five times faster than FFT in Matlab when the data size reaches 10 GB.

Keywords: Fast Fourier Transform · Cloud computing · Apache Spark · Parallel algorithm

1 Introduction

Target detection usually employs traditional methods such as radar detection to detect aerial targets [14,28]. However, these methods are not available when the signal from aircraft is weak. Fortunately, utilizing spatial electric signals from satellites to detect targets is a feasible developing approach to


detect aerial targets [10,13]. Since aircraft reflect the signal from satellites, the ground receiving station receives two different signals: the pure signal sent directly by the satellites and the signal reflected by aircraft. By analyzing and comparing the pure signal and the reflected signal, position information about aerial targets can be obtained. It should be noted that in the comparison process, a huge quantity of data (3 TB) needs to be processed per hour in real time, which requires the back-end data processing systems to be able to perform computations on large-scale signal data in time. Specifically, in signal comparison, numerous data need to be processed, which generates tremendous intermediate data at the same time.

In this process, the Fourier transform plays a significant and indispensable role [18]. The Fourier transform decomposes a function of time into frequencies [15,26]. The Discrete Fourier Transform, as one algorithm in the series of Fourier transforms, is widely used to detect the features of received signals, from which the target's information can be obtained. However, the Discrete Fourier Transform involves a great amount of calculation, which results in low efficiency. The Fast Fourier Transform (FFT) algorithm, proposed by Cooley and Tukey [6], simplifies and accelerates the Discrete Fourier Transform effectively, successfully reducing its complexity from N*N to N*logN. Although the Fast Fourier Transform is more efficient than the Discrete Fourier Transform, when the data scale becomes giant, this conventional algorithm cannot solve the signal processing problem effectively. The FFT algorithm is not only used for signal processing but is also applied to many other fields, i.e., image processing [20,22], spectral analysis [23], data compression [12], and so on. Improving the efficiency of the FFT algorithm on big data can therefore benefit many research fields.

Due to the importance of processing such big-scale data, a wide variety of approaches have been designed to optimize the performance of signal processing [31]. Among these methods, parallel Fast Fourier Transform is a unique approach since it enables the algorithm to run on multiple machines. Furthermore, with the fast improvement of Cloud computing technology [1,2], the idea of parallel FFT can be implemented in big data processing frameworks. To the best of our knowledge, there is little work on conducting parallel FFT with big data processing frameworks in the Cloud. We use Apache Spark to optimize the real-time FFT job. Apache Spark is an efficient parallel big data processing framework [21,30]. It is derived from the conventional Cloud computing framework MapReduce, which repeatedly reads and writes data from an external stable storage system [32]. Nonetheless, when an application needs to frequently reuse the intermediate data, MapReduce becomes inefficient. Apache Spark presents a new strategy to avoid such futile read and write operations on disks: it introduces the Resilient Distributed Dataset (RDD), a unique distributed memory abstraction that enables data to be stored in memory. In this way, the speed of cyclic computation work is greatly improved. There is a close correlation between the Fast Fourier Transform and Apache Spark. Iteration and parallelization are two main properties of FFT, which makes Apache Spark suitable for FFT. First, as FFT intensively generates and reuses


the intermediate data, immense read and write operations are unavoidable in the conventional method. To solve this problem, Apache Spark stores intermediate data in memory so that it performs such iterative computation efficiently [29]. Second, inside each step of FFT, the Discrete Fourier Transform computes on the data separately, which makes it feasible to parallelize FFT on Apache Spark.

Simply using Apache Spark to implement a parallel FFT algorithm is not sufficient for big data processing; the underlying computing system also needs to be suitable. In order to improve the utilization of resources, a strategy to optimally allocate compute resources to each node is proposed. We design two resource allocation strategies for parallel FFT. The equally-split strategy provides a simple method to make full use of compute resources. The optimized-split strategy is designed to improve efficiency by reducing the data exchange between compute nodes, which further improves resource utilization.

The major contributions of this paper are as follows:
– A MapReduce-based mechanism to efficiently distribute signal data to compute nodes. The MapReduce process contains three steps: data preprocessing (map data to compute nodes), local data transform, and parallel data transform to integrate the processing results (collect results).
– A parallel approach to implement the Fast Fourier Transform based on Apache Spark. The parallel approach to FFT provides an effective method to utilize more compute resources.
– Optimized strategies to allocate compute resources in the Cloud for high-speed parallel FFT. During the process of parallel FFT, there are many redundant data exchanges between compute nodes; our allocation strategies provide methods to reduce them.

The remainder of the paper is organized as follows. The next section reviews related work in the literature. Section 3 formally describes the system model of the computation Cloud we designed. This is followed by Sect. 4, the framework of the parallel Fast Fourier Transform algorithm. The allocation strategy for computation resources is given in Sect. 5. Section 6 presents the performance evaluation of the algorithm. Section 7 concludes the paper with a summary and future work.

2 Related Work

Since Cooley and Tukey [6] first introduced the Fast Fourier Transform, FFT has had a substantial influence on the area of signal processing. The FFT algorithm provides an efficient method for Fourier analysis to produce spectrograms. Unfortunately, with the exponential growth of data, the original FFT algorithm gradually could not meet the computation demand, so many approaches have been proposed to improve the speed of FFT. Interest arose both in finding efficient implementations of FFT and in improving the algorithm itself.

On the one hand, a variety of studies focused on faster algorithms to improve the inner computation process of FFT. Preuss [19] proposed a radix-2 Fast Fourier Transform algorithm that reduces the number of multiplications to two-thirds of the effort required by most radix-2 algorithms. Frigo et al. [8] proposed an FFT program that tunes the computation automatically for any particular hardware and performs significantly better than other software. Mullin [16] employed monolithic array analysis as a way to remove the constraints imposed on performance by a machine's underlying hardware to accelerate the FFT algorithm. In our study, we choose to parallelize the radix-2 Fast Fourier Transform algorithm, which is widely used in most signal-processing areas.

On the other hand, many approaches were proposed to parallelize the computation of the FFT algorithm. Githens et al. proposed a framework called Parallel Element Processing Ensemble to conduct signal processing [9]. Based on this framework, Bergland introduced a parallel implementation of the Fast Fourier Transform that segments the algorithm into groups of identical parallel operations [4]. Wold devised a method to implement parallel FFT in VLSI [27]. Since Google introduced Hadoop [3], many efficient efforts have been proposed to process data on the Cloud computing architecture. Hassen et al. [11] distributed FFT feature extraction techniques using the MapReduce programming model in a Cloud computing environment. Vincke et al. [24] summarized several parallel software design patterns for calculating the Fast Fourier Transform, such as MapReduce, Divide-and-Conquer, Fork/Join, and so on.

Besides FFT, a variety of research has been conducted to improve the performance of Cloud computing. Dean et al. [7] introduced the MapReduce programming model to separate large-scale data into partitions and parallelize the computation across large-scale Clouds. Wang et al. [25] proposed a system to combine long-running VM services with typical batch workloads like MapReduce.

Table 1. Summary of the main notation used throughout the paper

Symbol   Description
N        Total number of signal points in input data
p        Number of compute nodes
Dkl      The lth data set in stage k
Ck       The kth compute node in a Cloud
xk       Input data array in a Fourier Transform
Ek       Even number part of input data array
Ok       Odd number part of input data array
Xk       Result data array in a Fourier Transform
e        Natural base
Tk       Time of the kth stage in the algorithm
Ttotal   Total time of the algorithm
n        Number of cores in the Cloud
m        Cache size (GB) of the Cloud


Fig. 1. The process of parallel FFT in Cloud

Palanisamy [17] proposed a MapReduce resource allocation system aimed at enhancing the performance of MapReduce jobs in the Cloud. Zaharia et al. [29] designed Resilient Distributed Datasets and Apache Spark, which use cache memory to conduct computations. Zadeh et al. [5] proposed feasible matrix computation methods on Apache Spark. Recently, Apache Spark has been used in many application fields such as machine learning and graph computation. To the best of our knowledge, there is little work on conducting parallel FFT with Spark in the Cloud.

3 System Model

In this section, we introduce the strategies, algorithms, and terminologies used in this paper. For reference, we summarize the main notation used in this paper in Table 1.

3.1 Data Processing Model

We consider the data processing model as follows. To conduct FFT in parallel, the data need to be processed in the MapReduce model. Consider signal data with N points, where each point holds 16 bits of data. The data are divided into p data sets D0,0, D0,1, ..., D0,p−1, which are mapped to a compute Cloud formed by p compute nodes C1, C2, ..., Cp. The data sets are processed by the butterfly algorithm from stage 0 to stage log2(N/p). In the lth stage, each even-numbered data set Dl,2k combines with the odd-numbered data set Dl,2k+1 and executes a Fourier Transform to produce the result Dl+1,k. Finally, the last two data sets combine into the final result D_{log2(N/p)−1,0}. Inside a compute node, the data are stored in an array Xk and separated into an even-numbered part Ek and an odd-numbered part Ok. In order to test the effectiveness, we let Tk denote the time consumed in the kth stage and Ttotal = Σk Tk the total time of the algorithm. Figure 1 shows the process of the parallel FFT based on Apache Spark in the Cloud environment.

3.2 Compute Resource Allocation Model for Parallel FFT Job in Cloud

Consider a compute Cloud with n CPU cores and m GB of cache. We propose two strategies to split these compute resources. In the equally-split strategy, the compute Cloud consists of p compute nodes and an extra master node that manages them. Since the compute resources are equally distributed, each compute node has n/p cores and m/p GB of cache. All of the compute nodes participate in the data processing from beginning to end. In the optimized-split strategy, the compute resources are distributed into compute nodes of different sizes. In order to execute different stages of FFT, the compute Cloud is separated into several sections s1, s2, ..., sn, and different compute sections conduct different stages of FFT. Inside each section si, the resource is equally divided among ri compute nodes Cik. The size of the compute nodes varies across sections. To increase the efficiency of data processing, we search for the optimal proportion of each section by defining the size θk of each section and the size ωk of each node in a section. The data processing methods differ between the two strategies, as discussed in later sections.

4 Framework of Parallel FFT Algorithms

In this section, we discuss the framework of the parallel FFT in this paper.

4.1 Parallel FFT Algorithms in a Distributed Compute Cloud

The overall methodology of the parallel Fast Fourier Transform algorithm is to break down the input data and distribute the small data sets to the compute nodes in a Cloud. Each compute node then executes the FFT algorithm independently and collaboratively, and finally the results are collected by the master node of the Cloud. During the whole process, Apache Spark takes the role of distributing the input data and collecting the results. This big data processing framework provides an efficient approach to storing and computing data in the form of RDDs. Each RDD is mapped to a compute node to conduct the FFT computation. The compute nodes are organized in two ways according to the different resource allocation strategies, which are discussed in Sect. 5. After the FFT computation, the results are collected by the master node.
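A minimal PySpark-style sketch of this distribution step is shown below. It assumes a local Spark installation, uses NumPy's fft as a stand-in for the local N/p-point butterfly stage, and omits both the bit-reverse preprocessing and the later cross-partition combination stages.

```python
# Sketch of the distribution step on Spark: partition the signal into p
# RDD slices and run a local FFT inside each partition.  NumPy's fft is
# only a stand-in for the paper's local butterfly stage.
import numpy as np
from pyspark import SparkContext

def local_fft(partition):
    block = np.array(list(partition))
    yield np.fft.fft(block)          # N/p-point transform on local data

if __name__ == "__main__":
    sc = SparkContext("local[4]", "parallel-fft-sketch")
    signal = np.random.rand(1 << 12)
    rdd = sc.parallelize(signal.tolist(), numSlices=4)
    partial = rdd.mapPartitions(local_fft).collect()   # one array per compute node
    sc.stop()
```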

4.2 Fast Fourier Transform Algorithm

The Fast Fourier Transform is a widely used numerical algorithm in the signal processing field. It re-expresses the Discrete Fourier Transform of an arbitrary composite size N = N1N2 in terms of N1 smaller DFTs of size N2,


Fig. 2. Data Preprocessing: Bit reverse

recursively, to reduce the computation time to O(N log N) for highly composite N. The Discrete Fourier Transform is expressed as follows:

X_k = Σ_{n=1}^{N−1} x_n e^{−i2πk n/N},        (1)

where xn is a time signal array with period N and k = 0, ..., N − 1. Among all Fourier Transform algorithms, the radix-2 Cooley-Tukey algorithm is the most popular FFT algorithm. Using the idea of divide-and-conquer, the time of the DFT is largely shortened. FFT separates the vector into even- and odd-numbered parts and reduces the length from N to N/2, and it recursively executes this separation to obtain smaller data sets. After the small data sets are generated, FFT combines them and calculates the results:

X_k = Σ_{m=0}^{N/2−1} x_{2m} e^{−i2πk(2m)/N} + Σ_{m=0}^{N/2−1} x_{2m+1} e^{−i2πk(2m+1)/N}.        (2)

The formula above consists of two summations: the left summation contains the even-numbered part of the original formula and the right contains the odd-numbered part. By defining a twiddle factor W_N^k = e^{−i2πk/N}, the former formula implies:

X_k = Σ_{m=0}^{N/2−1} x_{2m} W_N^{2km} + W_N^k Σ_{m=0}^{N/2−1} x_{2m+1} W_N^{2km}.        (3)

Further, it can be found that the twiddle factor W_N^{2km} = W_{N/2}^{km}, so the equation can be simplified to:

X_k = Σ_{m=0}^{N/2−1} x_{2m} W_{N/2}^{km} + W_N^k Σ_{m=0}^{N/2−1} x_{2m+1} W_{N/2}^{km}.        (4)

Let Ek be the even part of the vector and Ok be the odd part. The N/2-point DFT outputs can be written as:

E_k = Σ_{m=0}^{N/2−1} x_{2m} W_{N/2}^{km},        O_k = Σ_{m=0}^{N/2−1} x_{2m+1} W_{N/2}^{km}.        (5)


Fig. 3. 8 points Butterfly Diagram

Consequently, the complete DFT can be expressed as:

X_k = { E_k + W_N^k O_k,                          0 ≤ k < N/2
      { E_{k−N/2} − W_N^{k−N/2} O_{k−N/2},        N/2 ≤ k < N        (6)

By using the divide-and-conquer concept, FFT reduces the complexity of the algorithm from O(N^2) to O(N log2 N). Rather than computing on the complete data, it is easier to compute a number of smaller data sets; as a result, the number of Fourier Transform calculations that need to be executed decreases dramatically. Before the decomposition, the initial data need to be rearranged in bit-reverse order, as shown in Fig. 2. Next, the rearranged data are combined so that the DFT can be calculated. The DFT process and the combine process are represented in a so-called "butterfly diagram", as illustrated in Fig. 3. In the first stage, a pair of data sets forms the input of the first DFT calculation; the output data sets of the first stage then become the input of the DFT calculation in the second stage. Since this process is repeated over and over, data sets combine and become larger ones. As each new data set is formed from two smaller data sets, the number of calculation stages is log2 N.
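The recursion behind Eq. (6) and the butterfly diagram can be written compactly as the following Python sketch (a plain recursive radix-2 Cooley-Tukey transform on a single machine, not the paper's distributed implementation):

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])            # E_k: DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])             # O_k: DFT of the odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]    # twiddle factor W_N^k * O_k
        result[k] = even[k] + w                           # X_k       = E_k + W_N^k O_k
        result[k + n // 2] = even[k] - w                  # X_{k+N/2} = E_k - W_N^k O_k
    return result

print(fft_radix2([1, 2, 3, 4]))   # matches a direct DFT of the same input
```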

4.3 Parallel FFT Algorithm

Although the FFT algorithm effectively decreases the amount of calculation in the DFT, when the data size becomes immensely large and data processing faces real-time demands, FFT on a single compute device cannot fulfill the requirements in practice. Fortunately, we can parallelize the algorithm in a compute Cloud to further accelerate the computing process. Our parallel Fast Fourier Transform algorithm consists of three steps: data preprocessing, individual butterfly computation, and collaborative butterfly computation.


The first step is preprocessing the data for later computation. In this step, data are rearranged in bit-reverse order and divided into p blocks so that N/p data items can be separately stored into RDDs (Resilient Distributed Datasets, the data structure in Spark), where N is the number of sampling points in the signal data and p is the number of compute nodes. The second step is to execute the butterfly computation within each compute node. In the first log2(N/p) data processing stages, no data exchange between compute nodes is required, so after the data are rearranged in bit-reverse order and stored in each compute node, the N/p-point FFT is performed to obtain the partial result separately. However, in the remaining log2 p stages of FFT, data exchange is necessary because the data length is larger than N/p. In the last step, the compute nodes cooperate to calculate the result.

Data Preprocessing. As shown in Fig. 2, before the calculation is performed, the data need to be rearranged in bit-reverse sequence. The bit-reverse algorithm is shown in Algorithm 1. This job is finished in the master node of the Cloud. Then, the reordered data are sequentially separated into p data sets, which are stored in Resilient Distributed Datasets in Apache Spark and sent to the compute nodes to complete the rest of the calculation. A small Python version of this reordering is sketched after Algorithm 1.

Algorithm 1. Preprocessing
Require: a = (a0, a1, ..., an−1)
Ensure: b = (b0, b1, ..., bn−1)
  b = bitReverse(a)
  for i ← 0 to p − 1 do
    for k ← 0 to N/p − 1 do
      P[i].c[k] ← b[i ∗ N/p + k]
    end for
  end for
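The sketch below is a minimal Python version of the bit-reverse reordering used in Algorithm 1; the subsequent split into p blocks is omitted, and the function name is illustrative.

```python
def bit_reverse_order(a):
    """Reorder a power-of-two-length array into bit-reversed index order."""
    n = len(a)
    bits = n.bit_length() - 1
    # Element i of the output comes from the index whose binary digits are reversed.
    return [a[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

print(bit_reverse_order(list(range(8))))   # [0, 4, 2, 6, 1, 5, 3, 7], as in Fig. 2
```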

Local FFT Inside Each Compute Node. Once the N/p data items are received by each compute node, an N/p-point FFT is executed on these data sets. Since no data exchange between compute nodes is needed, each compute node performs the original FFT on its local data, as shown in Algorithm 2.

FFT with Data Exchange. After the first log2(N/p) local FFT stages, the data sets need to be combined to complete the remaining calculations, so data exchange is required. The computation is performed from the log2(N/p)-th stage to the (log2 N − 1)-th stage, where the compute nodes need communication. The algorithm is shown in Algorithm 3.


Algorithm 2. Local N/p-point FFT on each compute node
Require: c = (c0, c1, ..., cN/p−1)
Ensure: c = (c0, c1, ..., cN/p−1)
  for i ← 0 to p − 1 do
    for k ← 0 to N/p − 1 do
      P[i].c[k] ← b[i ∗ N/p + k]
      if ((i ∗ N/p + k) mod l = (i ∗ N/p + k) mod 2l) then
        c[k] = c[k] + c[k + l] ∗ z^m
        c[k + l] = c[k] − c[k + l] ∗ z^m
      end if
    end for
  end for

Algorithm 3. FFT with data exchange
Require: c = (c0, c1, ..., cN/p−1)
Ensure: c = (c0, c1, ..., cN/p−1)
  j = log2(p) + 1
  for e ← 0 to log2(p) − 1 do
    t = 2^e, l = 2^(e + log2(N/p)), q = n/2l, z = w^q
    j = j − 1, v = 2^j
    for i ← 0 to p − 1 do
      if (i mod t = i mod 2t) then
        Receive a data block from the (i + p/v)th compute node and store it into c[N/v] .. c[N/v + N/p − 1]
        for k ← 0 to N/p − 1 do
          m = (i ∗ N/p + k) mod l
          c[k] = c[k] + c[k + N/v] ∗ z^m
          c[k + N/v] = c[k] − c[k + N/v] ∗ z^m
        end for
        Send the transformed data in c[N/v] .. c[N/v + N/p − 1] to the (i + p/v)th compute node
      else
        Send the data of this compute node to the (i − p/v)th compute node
        After the transformation, receive data from the (i − p/v)th compute node and store them into c
      end if
    end for
  end for

Algorithm 3. FFT with data exchange Require: c = (c0 , c1 , ..., cN/p−1 ) Ensure: c = (c0 , c1 , ..., cN/p−1 ) j = log2 (p) + 1 ; for e ← 0 to log2 (p) − 1 do t = 2e, l = 2(e+log2 (N/p)) , q = n/2l, z = wq ; j = j − 1, v = 2j ; for i ← 0 to p − 1 do if (i mod t = i mod 2t) then Receive data block from (i + p/v)th compute node and store into c[N/v] − c[N/v + N/p − 1] ; for k ← 0 to N/p − 1 do m = (i ∗ N/p + k) mod l ; c[k] = c[k] + c[k + N/v] ∗ z m ; c[k + N/v] = c[k] − c[k + N/v] ∗ z m ; end forSend transformed data in c[N/v] − c[N/v + N/p − 1] to the (i + p/v)th compute node ; else Send the data of this compute node to the (i − p/v)th compute node ; After the transformation, Receive data from the (i − p/v)th compute node and store them into c. ; end if end for end for

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data

303

Fig. 4. Data exchange in a 16-compute-node Cloud

As shown in Algorithm 3, N/p-pair butterfly computation is performed in one compute node while the other paired compute node just sends whole N/p data to the corresponding compute node and waits until transformed data return. For communication overhead, 2N/p data are exchanged at every stage where two N/p data transfers are needed for sending and receiving on an idle compute node.

5

Compute Resource Allocation Strategy

As discussed above, the parallel FFT performs butterfly calculation on a compute Cloud. The input data are mapped to each compute node and the results are sent to the master node at last. During this process, the speed of computing is determined by the performance of the compute Cloud. In our experiments to process large amount of signal data, we found that the data exchange results in a great amount of I/O time between compute nodes when data size becomes large, as shown in Fig. 4. This is because in the last log2 p − log2 (n/p) stages, the sizes of data to be calculated is larger than the size of local data in each compute node. As a result, the performance of parallel FFT algorithm is limited to a low level. In order to increase the speed and make full use of the compute resources in a Cloud, we propose two strategies, i.e., equally-split strategy and optimized-split strategy to allocate them. In a compute Cloud, the compute resource is fixed in common sense. We assume that the compute resources are fully used by the compute Cloud because

304

C. Yang et al.

in this way the compute process can be more efficient. We designed two strategies to allocate the limited compute resources. One strategy is to equally split the total resources and the competence of each compute node is equal while another strategy is to allocate unequal resource to compute nodes. These two methods have their own pros and cons which will be discussed in later sections. In common sense, computation ability is decided by the number of CPU cores, the size of cache, and so on. Since Fast Fourier Transform mainly uses CPU to process data, we consider the number of CPU cores in each compute node as the main factor to decide the computation ability. In a certain compute Cloud, because the amount of CPU cores is fixed, when the number of CPU cores in each compute node increases, the number of compute nodes decreases. 5.1

Equally-Split Strategy

We assume that the total compute resources are limited to n CPU cores and m GB of cache. The equally-split strategy splits these resources into p pieces (p should be 2^t, where t is an integer). Each compute node has n/p cores and m/p GB of cache. The input data are also equally split into p pieces and distributed to the compute nodes. After each compute node completes the calculation on its local data, data exchange between compute nodes is required to finish the remaining computing work. Assume the input data size is N. There are log2(l) steps in a whole butterfly computing process, and between every two steps data need to be exchanged once; hence, the total size of the data to be exchanged is N ∗ (log2(l) − 1).
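A trivial Python sketch of the equally-split allocation is given below; the dictionary layout is an illustrative assumption, not an API of the system.

```python
def equally_split(total_cores, total_cache_gb, p):
    """Split the Cloud's resources evenly over p compute nodes (p = 2^t)."""
    assert p > 0 and p & (p - 1) == 0, "p is expected to be a power of two"
    return [{"cores": total_cores // p, "cache_gb": total_cache_gb // p}
            for _ in range(p)]

print(equally_split(48, 96, 16))   # every node gets 3 cores and 6 GB of cache
```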

Optimized-Split Strategy

The core idea of optimized-split strategy is to make data flow as stream in the Cloud. This method can avoid data exchange. Although equally-split strategy is a simple way to conduct parallel FFT Algorithm, unfortunately, too many data exchanges result in low speed. To better use the compute resources, we design optimized-split strategy to redistribute the compute resources. Like in equally-split strategy, we also set the total compute resources in the Cloud as n CPU cores and m GB caches. In our experiments, we found that the CPU and cache both have important impact on the FFT algorithm altogether. Therefore, we bind 1 core and 2 GB together as a computing unit. Every compute node can have 1 or n (n is an integer) computing units only. In order to execute different stages of FFT, the compute Cloud is separated to sections s1 , s2 , ..., sn . Different compute sections conduct different stages of FFT. Inside each section si , the resource is equally divided into compute nodes Cik to complete parallel calculations. Because the workload in each stage varies, we set different size θk for nodes and different size ωk for sections.


For example, suppose there are 48 CPU cores and 96 GB of cache in a compute Cloud. These resources can be divided into 3 sections s1, s2, s3. The first section s1 has 16 compute nodes C11, C12, ..., C116, each of which has 1 core and 2 GB of cache. The second section s2 has 4 compute nodes C21, C22, ..., C24, each of which has 4 cores and 8 GB of cache. The third section s3 has 1 compute node C31, which has 16 cores and 32 GB of cache. When data come to this Cloud, they are divided into 16 parts D11, D12, ..., D116 and sent to the compute nodes C11, C12, ..., C116 in the first section. These compute nodes conduct FFT on their local data and send the results to the compute nodes C21, C22, ..., C24 in the second section, which execute the later stages of FFT on their local data and send their results to C31. C31 completes the remaining computations and obtains the final result. It should be mentioned that the data come as a stream. Hence, the data stream constantly flows through the Cloud from section 1 to section n, so there is no idle resource in our system. The main goal of this distribution method is to find an optimal way to balance the proportions of the sections. A compute node's size determines its performance: more CPU cores and a larger cache mean faster computation. When a former stage of FFT is too slow, the later stages cannot proceed and the next section stays idle. When some compute nodes take too many resources, they may have to wait for the former computations.
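The sectioned layout in the example above can be expressed compactly in code. The sketch below is ours, not the authors' implementation: it only assumes a fixed pool of cores and cache bound into 1-core/2-GB units, and builds the three pipeline sections of the example (16×1, 4×4 and 1×16 units).

from dataclasses import dataclass
from typing import List

@dataclass
class ComputeNode:
    cores: int      # CPU cores assigned to this node
    cache_gb: int   # cache (GB) assigned to this node

def build_sections(units_per_node: List[int], nodes_per_section: List[int]) -> List[List[ComputeNode]]:
    """Split a pool of computing units (1 core + 2 GB cache each) into
    pipeline sections. Section k contains nodes_per_section[k] nodes,
    and every node in that section owns units_per_node[k] units."""
    sections = []
    for units, count in zip(units_per_node, nodes_per_section):
        sections.append([ComputeNode(cores=units, cache_gb=2 * units) for _ in range(count)])
    return sections

# The 48-core / 96 GB example: 16 nodes of 1 unit, 4 nodes of 4 units, 1 node of 16 units.
sections = build_sections(units_per_node=[1, 4, 16], nodes_per_section=[16, 4, 1])
total_cores = sum(node.cores for section in sections for node in section)
assert total_cores == 48  # all 48 cores are used, none left idle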

6 Experiment Results

In this section, we present experimental results to illustrate the theoretical improvements above. We evaluate ASFFT (our parallel FFT algorithm implemented in Apache Spark) and compare it with the MFFT algorithm (FFT in Matlab). The data used in the experiments are signal data from satellites. Since the satellites constantly send data to the data center, the data arrive as a stream; every 64 MB of arriving data forms a data block. Hence, a requirement of the system is to finish the processing job before the next data block arrives, otherwise data accumulate and the computation is delayed. The compute Cloud we used has 48 CPU cores and 96 GB of cache. Apache Spark is installed in the virtual machines to send data and execute the computation, and the different distribution strategies are implemented on this Cloud. In Figs. 5 and 6, we show the comparison between our parallel FFT in Apache Spark and the FFT in Matlab, using 10 MB and 2 MB data units. The results show that when the data scale is small (the left-side columns in Figs. 5 and 6), MFFT takes less time than ASFFT. The reason is that Apache Spark is designed for big-data computation: when the data scale is small, the initialization of the Spark engine takes a large portion of the total time. As the data scale rises, the initialization of Spark takes a smaller portion, so the parallel FFT performs better.

Fig. 5. 10 MB data unit comparison (time in ms vs. data volume, MFFT vs. ASFFT)

Fig. 6. 2 MB data unit comparison (time in ms vs. data volume, MFFT vs. ASFFT)

However, as the data scale increases, the time spent in computation increases drastically. By comparison, although ASFFT spends more time than MFFT when the data scale is small, ASFFT shows its advantage when the data scale is large. Comparing Figs. 5 and 6, we can observe that ASFFT shows a more obvious advantage when the data unit is smaller (2 MB), because a smaller data unit makes the FFT easier to carry out. From Figs. 7 and 8, we can see that the parallelization of FFT effectively reduces the algorithm time. With more CPU cores, the speed of the algorithm increases. When there is only 1 CPU core in the Cloud, the FFT is not parallelized and the speed is low. With 2 cores, the running time drops to nearly a half. As more and more cores are added, the additional speedup becomes less and less pronounced.

Fig. 7. Parallel effectiveness comparison of 10 MB data (time in ms vs. number of partitions, for 1–5 cores)

Fig. 8. Parallel effectiveness comparison of 2 MB data (time in ms vs. number of partitions, for 1–5 cores)

In addition, the partition number also affects the algorithm speed. Too many partitions cause low efficiency, because more data partitions mean more RDDs formed in Spark, and dividing the original data into more partitions takes redundant time. Therefore, choosing a suitably small number of data partitions can improve efficiency significantly. Figure 9 shows the comparison between the two split strategies. The experiment was conducted in a Cloud with 16 CPU cores and 32 GB of cache. In the equally-split strategy, there are 8 workers, each with 2 CPU cores and 4 GB of cache. In the optimized-split strategy, there are 4 small workers with 2 CPU cores and 4 GB of cache each, and 1 large worker with 8 CPU cores and 16 GB of cache. When the data size is small, the equally-split strategy performs better than the optimized-split

Fig. 9. Comparison between the split strategies (time in ms vs. data size, equally-split vs. optimized-split)

strategy. Nonetheless, when the data size becomes larger, the optimized-split strategy shows its advantage.

7 Conclusion and Future Work

We have presented a parallel Fast Fourier Transform algorithm in the Cloud. Using the big data framework Apache Spark, this algorithm stores intermediate data in cache, which decreases the time spent in FFT. A three-step parallel FFT method is proposed, which enables the FFT to be computed concurrently on different compute nodes. The existing parallel FFT algorithm suffers from too much data exchange between compute nodes, which results in low efficiency. We propose a new strategy to reallocate the computation resources: by splitting the CPU cores and cache among compute nodes in an optimized way, data exchange is reduced. We have validated our algorithm through comparisons and an implementation in a Cloud. To improve the performance of the parallel FFT algorithm, much further work can be done. In this paper, we propose some strategies to allocate the computation resources; they can be further developed by considering more attributes of the computation resources. We also noticed that some studies investigate the performance of the FFT algorithm on GPU clusters, which could be another direction of our future work.

Acknowledgements. The authors would like to thank the anonymous referees for their helpful comments, from which the preparation of this version of the paper has benefited. Thanks to Johann Sebastian Bach for his inspiring music accompanying the authors in completing this research. This work was supported in part by the National Natural Science Foundation of China under Grant 61572511, Grant 91648204, and Grant 61872378, in part by the Scientific Research Project of National University of Defense Technology under Grant ZK16-03-57, in part by the China Postdoctoral Science Foundation under Grant 2016M602960 and Grant 2017T100796, and in part by the Science Fund for Distinguished Young Scholars in Hunan Province under Grant 2018JJ1032. Xiaomin Zhu is the corresponding author.


Task Offloading in Edge-Clouds with Budget Constraint Lei He1 , Hongli Xu1(B) , Haibo Wang1 , Liusheng Huang1 , and Jingyi Ma2 1

Department of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei, China {hl1994,wanghaib}@mail.ustc.edu.cn, {xuhongli,lshuang}@ustc.edu.cn 2 TianPing College of SuZhou University of Science and Technology, SuZhou 215011, Jiangsu, China [email protected]

Abstract. Edge computing is an emerging computing model that extends the cloud and its services to the edge of the network. In edge-cloud computing, a set of servers is deployed near the mobile devices so that these devices can offload tasks to the servers with low latency. Most existing works focus on offloading tasks under the premise that the edge servers own sufficient resources, while ignoring the budget constraint of the user. If this is not taken into account, the existing offloading schemes may cause the user to overspend, which is unacceptable. Thus, in this paper, we investigate the task offloading problem in edge-cloud computing, aiming to minimize the task duration when the tasks are generated by a user with a constrained budget and the edge servers are equipped with limited computation and storage resources. The problem we formulate is NP-hard. To solve it, we propose a heuristic strategy. The simulation results show that the proposed scheme can improve the success ratio and reduce the task duration compared to random and greedy offloading schemes.

Keywords: Edge computing · Task offloading · Budget constraint

1 Introduction

Mobile devices are commonly used in people's everyday lives. It is predicted that by 2020 the total number of devices will reach 75 billion, while the volume of mobile traffic will exceed 24.3 exabytes/month [1]. Furthermore, mobile devices will become more and more intelligent, while the applications running on them become increasingly resource-hungry. These applications include wearable virtual reality (VR) streaming [2], augmented reality (AR) [3] and vehicular systems [4], etc. However, the gap between the required resources and those available in mobile devices widens. To bridge this gap, mobile applications can offload their computation-intensive tasks to remote clouds [5]. However, an evident weakness of public-cloud-based mobile cloud computing is that mobile users may experience long latency for data exchange with the public cloud through the wide area


network. Long latency hurts the interactive response, since humans are acutely sensitive to delay and jitter. Moreover, it is very difficult to reduce the latency in the wide area network. To deal with the long latency of remote clouds, edge computing [6,7] has been proposed, which extends cloud computing by placing a number of small-scale servers at the edge of the network. In this way, users can offload their tasks to edge servers and receive computing results with low network latency. However, compared to cloud computing, the scale of edge servers is rather small. The execution of tasks is restricted by both the resource and the computation capacity, and any particular edge server might not be able to support large-scale computing tasks. Therefore, the tension between resource-hungry tasks and resource-constrained edge servers poses a significant challenge for future mobile platform development. In recent years, researchers have been paying more attention to the performance of the edge-cloud network, especially to task offloading problems. In summary, those works focus on two main aspects of task offloading. (i) Minimizing the task duration (e.g., [8,9]). When mobile devices create tasks, those works first decide whether the tasks should be offloaded to the edge-cloud or not, and then choose between remote and nearby edge servers, based on the amount of computation resources required. (ii) To save network energy cost, energy-efficient resource allocation schemes are studied for mobile edge computing in [10,11]. However, on closer inspection, the effectiveness of an offloading strategy faces the following challenges: (i) The limited resources of edge servers, including computation and storage resources. In the big data era, many tasks are created to train a general model from big data; those tasks need appropriate edge servers to generate models and store data sets. How to manage these computing resources is therefore a challenging problem. (ii) The budget constraint of the user. As an extension of cloud computing, edge servers charge for their services, and the price of a service depends on the resources requested by the tasks. The existing works may not be applicable here, because they may lead the user to overspend. For example, when a user plays a VR game on a mobile device, the user needs to pay for the game. So when the game application decides to offload tasks to edge-clouds, the cost of the offloading should not exceed the amount of payment, while the user still desires low task delay and a high success ratio of task offloading. Thus, how to conduct effective task offloading under the budget constraint is challenging. In this context, to obtain low latency of task offloading, we should consider how to match the resources desired by users to the limited resources of edge servers. In this paper, we study the offloading problem in the edge-cloud network. To be specific, the edge servers are equipped with limited computation and storage resources, while the user who decides to offload tasks has a constrained budget. Furthermore, the task offloading problem we formulate is NP-hard. To solve the problem, we propose a budget constraint task offloading scheme (BCTO). Our proposed task offloading strategy aims to minimize the task duration. In detail, our task offloading scheme includes two


parts: (i) Computing the cost of the computation. When a task is offloaded to an edge server, the edge server gives the cost of computing this task based on the computation and storage resources the task requires. (ii) According to the cost of the computation task on every edge server and the user budget, our scheme calculates the cost effectiveness and decides which server should be allocated to the task. In this paper, we assume the task duration is the execution time of the tasks on the edge servers, and the budget is set by the user who decides to offload the tasks. When a user creates many tasks, we assume these tasks are independent, which means that for any two tasks the result of one task has no impact on the other, so the tasks can be executed concurrently. When the user decides to offload tasks to edge servers, the user acts as a buyer with a constrained budget, while the computation and storage resources are regarded as the commodities. To ensure that the cost of computing and storing the tasks does not exceed the budget, the BCTO scheme chooses appropriate edge servers for the offloaded tasks. The main contributions of this paper are summarized as follows: 1. We propose a price model of the edge servers, which measures the price of computation tasks based on the computation and storage resources required by the tasks. By using this model, the cost of each task on each edge server can be obtained, thus enabling the optimal task offloading. 2. We present an efficient budget constraint task offloading scheme. Based on the budget, the cost of every task and the execution time on the edge servers, the scheme chooses an appropriate edge server for every task. Our scheme can not only improve the success ratio of offloading but also reduce the task duration. 3. We conduct extensive experiments to evaluate the performance of the BCTO scheme. The experimental results validate that our proposed algorithm can improve the success ratio by 5%∼10% and reduce the task duration by at least 30% compared to random and greedy offloading schemes. The rest of this paper is organized as follows. In Sect. 2, we present the related works. In Sect. 3, the system model is described. In Sect. 4, we give the problem formulation. We propose the efficient task offloading algorithm in Sect. 5. Our simulation results and discussions are given in Sect. 6. Finally, Sect. 7 concludes this paper.

2 Related Works

2.1 Mobile Edge Computing

At present, mobile devices are becoming more and more powerful and intelligent. However, the development of mobile devices does not keep up with the resource demands of applications, so it is difficult to handle all application tasks directly on mobile devices. Mobile cloud computing has been proposed as a solution, and offloading heavy computation tasks to remote cloud data centers has been studied


for over a decade. CloneCloud [12] was proposed to use cloned virtual machine images in the cloud for mobile job offloading. Follow Me Cloud [13] was proposed for offloading computation-intensive tasks to the cloud for processing. COMET [14] migrates application threads between the mobile device and the cloud by using a distributed shared memory model. However, since the locations of cloud servers are far away from mobile devices, offloading tasks to the cloud may incur a long delay. To overcome this challenge, mobile edge computing was proposed to provide nearby rich computing resources to mobile users [15], and there have been quite a lot of studies on the resource allocation problem. The works [16,17] offload the tasks to the nearest edge servers, which is easy to apply but may lead to severe competition for the limited resources of an edge server. To solve this problem, a hierarchical architecture has been proposed [9], which divides the edge-clouds into different levels according to the distance to the edge and presents a heuristic algorithm to minimize the task duration. Most existing works consider a single edge server in task offloading. The work in [18] showed that the cooperation of edge clouds can reduce not only the processing delay of user tasks but also the energy consumption. In a word, mobile edge computing can improve the quality of service and energy efficiency by optimizing task offloading and resource allocation policies. The authors of [19] pointed out that it is much better to process tasks across multiple edge-clouds jointly than at the edge-clouds in isolation. Some of the works above assumed that task releases follow some known stochastic distribution, so in [8] an online algorithm without any assumption on the task release distribution was proposed.

2.2 Task Offloading with Limited Budget

There are many works on task scheduling with a budget constraint in grid or cloud environments. The authors of [20] developed the scheduling approaches LOSS and GAIN, which adjust a schedule generated by a time-optimized heuristic and a cost-optimized heuristic, respectively, to meet users' budget constraints; however, this strategy must be supported by another scheduling algorithm. BaTS [21], a budget- and time-constrained scheduler, can schedule large bags of independent tasks onto multiple Clouds. Zhu et al. [22] proposed a dynamic scheduling approach for tasks with a fixed time limit and a resource budget constraint. Reference [23] focuses on using genetic algorithms to solve scheduling problems considering the budget and deadline of the entire network. More recently, HCOC [24] discusses workflow execution in a cloud context: it reduces the monetary cost while achieving the established desired execution time by deciding which resources should be leased from the public cloud or used in the private cloud. Byun et al. [25] provided PBTS (Partitioned Balanced Time Scheduling), which estimates the minimum number of computing hosts required to execute a workflow within a user-specified finish time; however, the computing hosts are all assumed to have the same monetary cost per time unit. For large graph processing in the Cloud, Li et al. [26] designed a cost-conscious scheduling algorithm (CCSH), which is an extension of HEFT.


In this paper, we construct analytical models to quantify the execution performance of independent tasks in edge computing, and we incorporate the price model into the cost calculation.

3 System Model

3.1 Network Model

We consider an edge computing scenario with M heterogeneous edge servers, M = {s1, s2, ..., sM}, each of which is equipped with limited computation and storage resources. For each edge server si, we assume its resource status is the 2-tuple (Ric, Rim), where Ric and Rim are the computation resource and the storage resource owned by si. The computation resource is described in terms of CPU cycles, while the storage resource is quantified in GB. There is a set T = {t1, t2, ..., tN} of indivisible tasks, which are offloaded by a user with the constrained budget B. We adopt a widely used task model (see [7,27,28]) to describe task tj = (aj, cj), where aj stands for the computation amount of the task, i.e., the total CPU cycles needed to compute the task, and cj stands for the size of the computation task, i.e., the amount of data contents (e.g., the data input and associated processing code) to be delivered toward the edge servers. In our model, a mobile device dispatches a task to an edge server immediately after its release. We do not allow the servers to migrate a task to other servers after the offloading, to avoid migration overhead, and we assume a server can execute at most one task at a time preemptively.
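For concreteness, the system model above can be written down as simple data structures. The Python sketch below is our own illustration, not part of the paper; field names are chosen to mirror the notation (Ric, Rim, aj, cj).

from dataclasses import dataclass
from typing import List

@dataclass
class EdgeServer:
    cpu_cycles: float   # R_i^c: computation resource (CPU cycles per second)
    storage_gb: float   # R_i^m: storage resource (GB)

@dataclass
class Task:
    cycles_needed: float  # a_j: total CPU cycles needed to compute the task
    data_size_gb: float   # c_j: data to be delivered to the edge server (GB)

# A toy instance: M heterogeneous servers and N indivisible tasks under budget B.
servers: List[EdgeServer] = [EdgeServer(30e9, 8.0), EdgeServer(45e9, 4.0)]
tasks: List[Task] = [Task(2e9, 2.0), Task(3e9, 1.0)]
budget_B = 50000.0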

3.2 Price Model

Since an edge server is equipped with limited computation and storage resources, the price of those resources determines the cost of executing offloaded tasks on that server. It is more reasonable to value the resources according to the edge servers' performance when utilizing these resources. Let PiC(q) denote the price of computation for q units of CPU cycles per second on edge server si, and PiS(e) the price of e units of storage size on edge server si. For the computation price, we adopt a nonlinear model:

PiC(x) / PiC(y) ≤ x / y,   i ∈ M, x, y ∈ {1, 2, ..., Ric},   (1)

where PiC(x) and PiC(y) denote the price of x and y units of CPU cycles on edge server si, and Ric denotes the computation resource limit of edge server si. For the storage price, we define a function that is linear in the storage size:

PiS(x) / PiS(y) = x / y,   i ∈ M, x, y ∈ {1, 2, ..., Rim},   (2)

where PiS(x) and PiS(y) denote the price of x and y units of storage size on edge server si, and Rim denotes the storage resource limit of edge server si.
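The price model can be illustrated with a small sketch. The concrete price functions below are our own assumptions (a quadratic computation price as one possible nonlinear choice, and a strictly linear storage price); the paper only constrains the general shape of these functions, not their exact form.

from dataclasses import dataclass

@dataclass
class PriceModel:
    """Illustrative price model for one edge server s_i.

    comp_unit_price: price of one unit of CPU cycles per second
    stor_unit_price: price of one GB of storage
    The quadratic exponent for computation is an assumption of this sketch.
    """
    comp_unit_price: float
    stor_unit_price: float
    comp_exponent: float = 2.0  # >1 makes the computation price nonlinear

    def computation_price(self, q: float) -> float:
        # P_i^C(q): nonlinear in the number of CPU-cycle units q
        return self.comp_unit_price * (q ** self.comp_exponent)

    def storage_price(self, e: float) -> float:
        # P_i^S(e): linear in the storage size e, so P_i^S(x)/P_i^S(y) = x/y
        return self.stor_unit_price * e

    def task_cost(self, a_j: float, c_j: float) -> float:
        # cost of a task needing a_j cycle units and c_j GB:
        # computation price plus storage price (p_ij in Sect. 4)
        return self.computation_price(a_j) + self.storage_price(c_j)

model = PriceModel(comp_unit_price=1.0, stor_unit_price=5.0)
print(model.task_cost(a_j=2.0, c_j=2.0))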

3.3 Task Offloading Model

In this section, we introduce the computation task offloading model in detail. As described above, task tj can be described as tj = (aj, cj). Considering the differences in the computation resources of the edge servers, we denote the computation resource (CPU cycles per second) of edge server si as Ric. According to the network model, the task duration on edge server si is the time during which the task is executed on that edge server. Therefore, the task duration of task tj on edge server si can be obtained as follows:

tij = aj / Ric,   i ∈ M, j ∈ N.   (3)

Similar to the study [29], we ignore the transmission delay for edge servers to receive data from or send data to the user. This is because the edge servers are deployed very close to the mobile devices, so the processing time of tasks on edge servers is the dominant part compared to their transmission time.

4 Problem Formulation

In this paper, given the limited computation and storage resources of the edge servers, we consider the following problem: how to select appropriate edge servers for the tasks so as to achieve the minimum task duration under the user's budget constraint. Define the matching matrix as X = {xij}M×N, where xij is the indicator revealing whether edge server si serves task tj. If task tj is offloaded to edge server si, then xij = 1; otherwise xij = 0. The matching matrix must satisfy the following constraint:

Σ_{i=1}^{M} xij ≤ 1,   j ∈ N,   (4)

which ensures that one task can only be served by at most one edge server. If task tj is allowed to be served by edge server si, then the cost of computing task tj on edge server si is:

pij = PiC(aj) + PiS(cj),   i ∈ M, j ∈ N.   (5)

For each edge server si, the total cost for the tasks executed on the server is:

pi = Σ_{j=1}^{N} xij pij,   i ∈ M.   (6)

When task tj is offloaded to edge server si, the execution time of tj on si is tij, as described in the task offloading model. Thus, the overall time of task execution on edge server si can be expressed as follows:

Ti = Σ_{j=1}^{N} xij tij,   i ∈ M.   (7)

According to the analysis above, the problem we need to solve can be formulated as follows:

min max_{i ∈ M} Ti   (8)

subject to:

Σ_{i=1}^{M} pi ≤ B,   (9)

Σ_{i=1}^{M} xij cj ≤ Rim,   i ∈ M, j ∈ N,   (10)

Σ_{i=1}^{M} xij ≤ 1,   j ∈ N,   (11)

xij ∈ {0, 1},   i ∈ M, j ∈ N.   (12)

Table 1. Notation Table

Parameter   Definition
M           Set of edge servers
T           Set of computation tasks
si          Edge server
tj          Computation task
Ric         Total number of CPU cycles owned by edge server si
Rim         Total storage size owned by edge server si
aj          The CPU cycles needed by task tj
cj          The storage size needed by task tj
PiC(q)      The price of q units of CPU cycles per second on edge server si
PiS(e)      The price of e units of storage size on edge server si
tij         The task duration of task tj executed on edge server si
xij         The indicator revealing whether edge server si serves task tj
Ti          The overall time of edge server si
Texe        The minimum execution time of all the tasks
prij        The price ratio
psij        The price effectiveness
pall        The cost of all offloaded tasks
ps          The set of price effectiveness values of all tasks on all edge servers


The objective function (8) is to minimize the maximum execution time of tasks over the edge servers. The first constraint (9) indicates that, for all the tasks executed on edge servers, the cost of computation and storage should not exceed the budget B. The second constraint (10) states that, for any edge server si, the storage resource requested by the tasks executed on this edge server is no more than the edge server's storage resource. The third constraint (11) means that one task can be served by at most one edge server. The last condition (12) indicates whether a task tj is served by edge server si or not. The problem we formulate is an NP-hard problem [30]; therefore, we focus on the design of a heuristic approach to this optimization problem (Table 1).
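To make the formulation concrete, the following sketch (ours, not the authors') evaluates the objective (8) and checks the constraints for a given assignment matrix X. We read constraint (10) as a per-server storage capacity check, following the textual explanation above; the input arrays p_ij, t_ij, c_j and R_i^m are assumed to be precomputed.

from typing import Sequence

def evaluate_assignment(x: Sequence[Sequence[int]],
                        cost: Sequence[Sequence[float]],
                        duration: Sequence[Sequence[float]],
                        storage_need: Sequence[float],
                        storage_cap: Sequence[float],
                        budget: float) -> float:
    """Check constraints (9)-(11) for a 0/1 assignment matrix x[i][j] and
    return the objective (8), i.e. the maximum per-server execution time.

    cost[i][j] = p_ij, duration[i][j] = t_ij,
    storage_need[j] = c_j, storage_cap[i] = R_i^m.
    """
    m, n = len(x), len(x[0])
    # (11): each task is served by at most one server
    assert all(sum(x[i][j] for i in range(m)) <= 1 for j in range(n))
    # (9): total cost of all offloaded tasks within the budget B
    total_cost = sum(x[i][j] * cost[i][j] for i in range(m) for j in range(n))
    assert total_cost <= budget
    # (10), read per server: storage required on each server within its capacity
    for i in range(m):
        assert sum(x[i][j] * storage_need[j] for j in range(n)) <= storage_cap[i]
    # (8): the objective is the maximum overall time T_i over all servers
    return max(sum(x[i][j] * duration[i][j] for j in range(n)) for i in range(m))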

5 Task Offloading Scheme in Edge Computing

Our work targets computation-intensive tasks in the edge-cloud, where the data transfer time is assumed negligible since: (i) the time for data transfers in most computation-intensive tasks constitutes less than 10% of the overall task execution time [21]; (ii) the edge servers are deployed very close to the mobile devices.

Algorithm 1. BCTO(M, T, B)
Input: A set of edge servers M equipped with computation and storage resources. A set of tasks T with required computation and storage resources. A fixed budget B.
Output: The minimum execution time Texe of the tasks.
1: for all si in M do
2:   Set the overall time on edge server si: Ti = 0.
3:   for all tj in T do
4:     if cj ≤ Rim then
5:       Calculate the execution time tij and the price pij according to Equations (3) and (5).
6:       if pij > B then
7:         Exit the program with an error.
8:       end if
9:       Calculate the price ratio prij = pij / B.
10:      Calculate the price effectiveness psij = prij × pij / tij.
11:    else
12:      Set tij = 0, pij = 0 and prij = 0.
13:    end if
14:  end for
15: end for
16: Texe = Offload(ps, T, B).
17: return Texe.

We propose the BCTO algorithm as shown in Algorithm 1. Algorithm 1 first estimates the cost and execution time of every task, based on the resources required by every task and the computation and storage resources owned by the edge servers. If the cost of any task is greater than the budget, the offloading is aborted. Algorithm 1 then gives the price ratio of the task on every edge server, and from the price ratio it derives the price effectiveness on each edge server. The price effectiveness computed in Algorithm 1 is part of the offloading strategy in Algorithm 2. Based on the price effectiveness on each edge server, Algorithm 2 sorts these values. After sorting the price effectiveness of a task on all servers, we drop some of the edge servers with much lower price effectiveness; the servers from the start index to the end index are kept for offloading, and in this paper we drop two edge servers with lower price effectiveness. After the drop-out, the task is offloaded to the edge server that minimizes the increase of the overall time of the edge servers. Obviously, the cost of all the computation tasks will not exceed the budget in Algorithm 2. The time complexity of Algorithm 1 is O(M × N), where M is the number of edge servers and N is the number of computation tasks.

Algorithm 2. Offload(ps, T, B)
Input: The set ps of price effectiveness values of all tasks on all edge servers. The set of tasks T with the prices pij and times tij. A fixed budget B.
Output: The minimum execution time Texe of those tasks.
1: Set the cost of all offloaded tasks pall = 0.
2: Set M as the number of edge servers, start = 1, and end = M − 2.
3: Set the overall time on every edge server Ti = 0.
4: for all tj in T with pij ≠ 0 and tij ≠ 0 do
5:   if pall ≤ B then
6:     Sort the psij over all i in ascending order.
7:     for i from start to end do
8:       Offload task tj to the edge server si with si = arg min_{si ∈ M} (Ti + tij).
9:     end for
10:    Ti = Ti + tij, pall = pall + pij.
11:   end if
12: end for
13: Set Texe = arg max_{i ∈ M} Ti.
14: return Texe.
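As an informal illustration of Algorithms 1 and 2, the following Python sketch offloads each task to the candidate server that least increases the current load, after discarding two candidates according to their price effectiveness. It is our reading of the pseudocode, not the authors' code: the pseudocode and the surrounding text differ on which end of the sorted list is dropped, and this sketch follows the textual description (dropping the lowest-ranked candidates). The cost and duration inputs are assumed to be precomputed as p_ij and t_ij.

from typing import List, Optional

def bcto_offload(cost: List[List[float]], duration: List[List[float]],
                 budget: float, drop: int = 2) -> float:
    """Budget-constrained task offloading heuristic (a sketch of BCTO).

    cost[i][j]     = p_ij: price of running task j on server i (0 if infeasible)
    duration[i][j] = t_ij: execution time of task j on server i (0 if infeasible)
    Returns the resulting maximum per-server execution time (T_exe).
    """
    m, n = len(cost), len(cost[0])
    load = [0.0] * m          # T_i: accumulated execution time per server
    spent = 0.0               # p_all: total cost of offloaded tasks

    for j in range(n):
        # candidate servers that can host task j at all
        cands = [i for i in range(m) if cost[i][j] > 0 and duration[i][j] > 0]
        if not cands:
            continue
        # price effectiveness ps_ij = (p_ij / B) * (p_ij / t_ij);
        # drop the 'drop' candidates ranked lowest by this value
        cands.sort(key=lambda i: (cost[i][j] / budget) * (cost[i][j] / duration[i][j]))
        kept = cands[drop:] if len(cands) > drop else cands
        # offload to the kept server whose overall time increases the least
        best: Optional[int] = min(kept, key=lambda i: load[i] + duration[i][j])
        if spent + cost[best][j] <= budget:      # never exceed the budget
            load[best] += duration[best][j]
            spent += cost[best][j]

    return max(load)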

6 Simulation and Performance Evaluation

In this section, a simulation experiment is provided concerning task offloading for edge computing. The experiment is divided into three parts: (i) We implement the task offloading algorithm and evaluate its offloading performance in


comparison with two other task offloading schemes, in terms of the budget, the number of edge servers and the number of tasks. (ii) We study the impact of the computation amount on the performance of task offloading in comparison with the random and greedy offloading schemes. (iii) We investigate the impact of the task data size on the performance of task offloading in comparison with the random and greedy offloading schemes.

6.1 Simulation Settings

For task tj, we assume the required resources, namely the CPU cycles aj and the data size cj, are generated by a probability distribution. Similar to the work [11], we set the computation resources owned by the edge servers to range from 20 to 50 GHz, while the storage resources owned by the edge servers range from 1 GB to 16 GB.
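A possible way to generate such a simulation instance is sketched below; the use of NumPy and the exact normal-distribution standard deviation are our assumptions on top of the settings described in the text.

import numpy as np

rng = np.random.default_rng(0)

def generate_instance(num_servers: int = 20, num_tasks: int = 10000):
    """Draw edge-server capacities and task demands following the settings
    described in Sect. 6.1/6.2: server CPUs uniform in 20-50 GHz, storage
    uniform in 1-16 GB; task cycles and data sizes normally distributed
    around 2 GHz and 2 GB (the standard deviations are our choice)."""
    server_cpu_ghz = rng.uniform(20, 50, size=num_servers)
    server_storage_gb = rng.uniform(1, 16, size=num_servers)
    task_cycles_ghz = np.clip(rng.normal(2.0, 0.5, size=num_tasks), 0.1, None)
    task_data_gb = np.clip(rng.normal(2.0, 0.5, size=num_tasks), 0.1, None)
    return server_cpu_ghz, server_storage_gb, task_cycles_ghz, task_data_gb

servers_cpu, servers_storage, tasks_cycles, tasks_data = generate_instance()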

6.2 Comparison to Other Methods

We compare our algorithm with two other task offloading strategies: a random offloading scheme and a greedy offloading scheme. 1. Random offloading scheme: the computation tasks are offloaded to edge servers for processing at random. We first set up a random generator that produces an M-tuple whose values range from 0 to 1, are generated with equal probability and sum to 1, where M is the number of edge servers. Then we take the index of the maximum value in the tuple and offload the computation task to the edge server with that index. 2. Greedy offloading scheme: the greedy scheme offloads the tasks to the most powerful edge server to obtain the minimum task duration. Most of the works on edge servers (e.g., [16,31]) adopted the greedy strategy as the task offloading policy. For the three methods above, the offloading performance we evaluate refers to the task duration and the success ratio of task offloading. The task duration in our work refers to the task execution time on the edge server, while the success ratio is the ratio of the number of successfully offloaded tasks to the total number of tasks. The CPU cycles of each task are generated by a normal distribution with a mean of 2 GHz, and the data size of each task is generated by a normal distribution with a mean of 2 GB. When we evaluate the impact of the user budget, we set the number of tasks to 10000 and the number of edge servers to 20. Figure 1 shows the impact of the user budget. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 6% and reduces the task duration by about 30%. Compared with the greedy offloading scheme, our proposed scheme improves the success ratio by about 30% and reduces the task duration by 45%.
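The two baselines can be stated in a few lines of code. This sketch is our own rendering of the description above; taking "most powerful" to mean the server with the most CPU cycles per second is an assumption.

import numpy as np

rng = np.random.default_rng(0)

def random_offload(num_servers: int) -> int:
    """Pick a server via the M-tuple construction described above: draw M
    equal-probability values that sum to 1 and take the index of the maximum."""
    weights = rng.random(num_servers)
    weights /= weights.sum()
    return int(np.argmax(weights))

def greedy_offload(server_cpu_ghz: np.ndarray) -> int:
    """Always offload to the most powerful server (here: most CPU cycles/s)."""
    return int(np.argmax(server_cpu_ghz))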

Fig. 1. Impact of user budget (left: success ratio (%) vs. budget; right: task duration (s) vs. budget; BCTO, Random, Greedy)

Figure 2 shows the impact of the number of edge servers; we set the number of tasks to 10000 and the user budget to 60000. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 8% and reduces the task duration by about 30%. Compared with the greedy offloading scheme, our proposed scheme improves the success ratio by about 35% and reduces the task duration by 45%.

Fig. 2. Impact of edge server number (left: success ratio (%) vs. number of edge servers; right: task duration (s) vs. number of edge servers; BCTO, Random, Greedy)

Figure 3 shows the impact of the number of tasks; we set the user budget to 100000 and the number of edge servers to 20. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 5% and reduces the task duration by about 35%. Compared with the greedy offloading scheme, our proposed scheme improves the success ratio by about 20% and reduces the task duration by 40%.

6.3 Impact of Computation Amount of Task Offloading

In this section, we consider the impact of the computation amount on task offloading performance. The data size of each task follows a normal distribution with a mean value

Fig. 3. Impact of task number (left: success ratio (%) vs. number of tasks; right: task duration (s) vs. number of tasks; BCTO, Random, Greedy)

of 2 GB. The user budget is set to 70000, the number of tasks is 10000 and the number of edge servers is 20. For the computation amount, three kinds of distribution are utilized, i.e., the uniform, normal and Pareto distributions. The first panels of Figs. 4 and 5 show the case where the computation amount follows the uniform distribution. Compared with the random offloading scheme, our proposed BCTO scheme improves the success ratio by 5% and reduces the task duration by about 30%. Compared with the greedy offloading scheme, our scheme improves the success ratio by 25%∼30% and reduces the task duration by 40%∼50%.

Fig. 4. Impact of computation amount: success ratio under uniform distribution, normal distribution and Pareto distribution (success ratio (%) vs. average computations per task (GHz); BCTO, Random, Greedy)

The second panels of Figs. 4 and 5 show the case where the computation amount follows the normal distribution. Our scheme improves the success ratio by 5% and reduces the task duration by 35% compared with the random offloading scheme; compared with the greedy offloading scheme, it improves the success ratio by 25% and reduces the task duration by 45%. The third panels of Figs. 4 and 5 show the case where the computation amount follows the Pareto distribution. Compared with the random offloading scheme, our scheme improves the success ratio by 5% and reduces the task duration by more than

Fig. 5. Impact of computation amount: time duration under uniform distribution, normal distribution and Pareto distribution (task duration (s) vs. average computations per task (GHz); Greedy, Random, BCTO)

30%. Compared with the greedy offloading scheme, our scheme improves the success ratio by 20%∼25% and reduces the task duration by 40%.

6.4 Impact of Data Size of Task Offloading

In this section, we consider the impact of the data size on task offloading performance. The computation amount follows a normal distribution with a mean value of 2 GHz, the user budget is 40000, the number of tasks is 10000, and the number of edge servers is 20. For the data size, three kinds of distribution are utilized, i.e., the normal, uniform and Pareto distributions. As shown in Figs. 6 and 7, our proposed offloading scheme exhibits a higher success ratio of task offloading and a shorter task duration than the random and greedy offloading schemes. The first panels of Figs. 6 and 7 show the case where the data size follows the uniform distribution. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 5% and reduces the task duration by 30% on average. Compared with the greedy offloading scheme, our scheme improves the success ratio by 30% and reduces the task duration by 40% on average.

Fig. 6. Impact of data size: success ratio under uniform distribution, normal distribution and Pareto distribution (success ratio (%) vs. average data size (GB); BCTO, Random, Greedy)

Fig. 7. Impact of data size: time duration under uniform distribution, normal distribution and Pareto distribution (task duration (s) vs. average data size (GB); Greedy, Random, BCTO)

The second panels of Figs. 6 and 7 show the case where the data size follows the normal distribution. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 5%∼10% and reduces the task duration by 30% on average; compared with the greedy offloading scheme, our scheme improves the success ratio by 25% and reduces the task duration by 45% on average. The third panels of Figs. 6 and 7 show the case where the data size follows the Pareto distribution. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 5%∼10% and reduces the task duration by 35% on average; compared with the greedy offloading scheme, our scheme improves the success ratio by 25% and reduces the task duration by 45% on average.

7 Conclusion

In this paper, we first formulate a budget-constrained task offloading problem for delay minimization in edge computing environments, where the edge servers are equipped with limited computation and storage resources. We then propose a heuristic algorithm to solve the formulated problem. Simulation results have shown that our proposed scheme achieves a higher success ratio of task offloading and a shorter task duration compared to the random and greedy computation offloading schemes. It would be of future interest to consider task offloading in more complicated deployments with user mobility.

Acknowledgement. This paper is supported by the NSFC under Grant No. 61472383, U1709217, and 61472385, and the Natural Science Foundation of Jiangsu Province in China under No. BK20161257.


References 1. Networking, V.: Cisco visual networking index: Global mobile data traffic forecast update, 2014-2019 white paper 2. Chen, Z., et al.: An empirical study of latency in an emerging class of edge computing applications for wearable cognitive assistance. In: SEC, p. 14 (2017) 3. Hu, Y.C., Patel, M., Sabella, D., Sprecher, N., Young, V.: Mobile edge computing–a key technology towards 5G. ETSI White Pap. 11(11), 1–16 (2015) 4. Truong, N.B., Lee, G.M., Ghamri-Doudane, Y.: Software defined networking-based vehicular adhoc network with fog computing. In: IFIP/IEEE International Symposium on Integrated Network Management (IM), pp. 1202–1207 (2015) 5. Barbera, M.V., Kosta, S., Mei, A., Stefa, J.: To offload or not to offload? the bandwidth and energy costs of mobile cloud computing. In: Proceedings IEEE INFOCOM, pp. 1285–1293, April 2013 6. Taleb, T., Samdanis, K., Mada, B., Flinck, H., Dutta, S., Sabella, D.: On multiaccess edge computing: a survey of the emerging 5G network edge cloud architecture and orchestration. IEEE Commun. Surv. Tutor. 19(3), 1657–1681 (2017) 7. Zhang, S., Zhang, N., Zhou, S., Gong, J., Niu, Z., Shen, X.: Energy-aware traffic offloading for green heterogeneous networks. IEEE J. Sel. Areas Commun. 34(5), 1116–1129 (2016) 8. Tan, H., Han, Z., Li, X.Y., Lau, F.C.M.: Online job dispatching and scheduling in edge-clouds. In: IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1–9, May 2017 9. Tong, L., Li, Y., Gao, W.: A hierarchical edge cloud architecture for mobile computing. In: IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pp. 1–9, April 2016 10. You, C., Huang, K., Chae, H., Kim, B.H.: Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Trans. Wirel. Commun. 16(3), 1397– 1411 (2017) 11. Chen, M., Hao, Y.: Task offloading for mobile edge computing in software defined ultra-dense network. IEEE J. Sel. Areas Commun. 36(3), 587–597 (2018) 12. Chun, B.G., Ihm, S., Maniatis, P., Naik, M., Patti, A.: Clonecloud: elastic execution between mobile device and cloud. In: Proceedings of the Sixth Conference on Computer systems, pp. 301–314. ACM (2011) 13. Claffy, K.C., Polyzos, G.C., Braun, H.W.: Application of sampling methodologies to network traffic characterization. In: ACM SIGCOMM Computer Communication Review, vol. 23, pp. 194–203. ACM (1993) 14. Gordon, M.S., Jamshidi, D.A., Mahlke, S.A., Mao, Z.M., Chen, X.: Comet: code offload by migrating execution transparently. OSDI 12, 93–106 (2012) 15. Taleb, T., Dutta, S., Ksentini, A., Iqbal, M., Flinck, H.: Mobile edge computing potential in making cities smarter. IEEE Commun. Mag. 55(3), 38–43 (2017) 16. Jia, M., Cao, J., Liang, W.: Optimal cloudlet placement and user to cloudlet allocation in wireless metropolitan area networks. IEEE Trans. Cloud Comput. 5(4), 725–737 (2017) 17. Urgaonkar, R., Wang, S., He, T., Zafer, M., Chan, K., Leung, K.K.: Dynamic service migration and workload scheduling in edge-clouds. Perform. Eval. 91, 205– 228 (2015) 18. Xiao, Y., Krunz, M.: Qoe and power efficiency tradeoff for fog computing networks with fog node cooperation. In: IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1–9, May 2017


19. Tran, T.X., Pompili, D.: Joint task offloading and resource allocation for multiserver mobile-edge computing networks (2017). arXiv preprint arXiv:1705.00704 20. Sakellariou, R., Zhao, H., Tsiakkouri, E., Dikaiakos, M.D.: Scheduling workflows with budget constraints. Integrated Research in GRID Computing, pp. 189–202. Springer, Boston (2007). https://doi.org/10.1007/978-0-387-47658-2 14 21. Oprescu, A.M., Kielmann, T.: Bag-of-tasks scheduling under budget constraints. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp. 351–359, November 2010 22. Zhu, Q., Agrawal, G.: Resource provisioning with budget constraints for adaptive applications in cloud environments. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, pp. 304– 307. ACM, New York (2010) 23. Gharooni-fard, G., Moein-darbari, F., Deldari, H., Morvaridi, A.: Scheduling of scientific workflows using a chaos-genetic algorithm. Procedia Comput. Sci. 1(1), 1445–1454 (2010) 24. Bittencourt, L.F., Madeira, E.R.M.: Hcoc: a cost optimization algorithm for workflow scheduling in hybrid clouds. J. Internet Serv. Appl. 2(3), 207–227 (2011) 25. Byun, E.K., Kee, Y.S., Kim, J.S., Maeng, S.: Cost optimized provisioning of elastic resources for application workflows. Futur. Gener. Comput. Syst. 27(8), 1011–1026 (2011) 26. Li, J., Su, S., Cheng, X., Huang, Q., Zhang, Z.: Cost-conscious scheduling for large graph processing in the cloud. In: IEEE International Conference on High Performance Computing and Communications, pp. 808–813, September 2011 27. Chen, X., Jiao, L., Li, W., Fu, X.: Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Trans. Netw. 24(5), 2795–2808 (2016) 28. Mao, Y., You, C., Zhang, J., Huang, K., Letaief, K.B.: A survey on mobile edge computing: the communication perspective. IEEE Commun. Surv. Tutor. 19(4), 2322–2358 (2017) 29. Sun, Y., Zhou, S., Xu, J.: EMM: Energy-aware mobility management for mobile edge computing in ultra dense networks. IEEE J. Sel. Areas Commun. 35(11), 2637–2646 (2017) 30. Wu, C.Q., Lin, X., Yu, D., Xu, W., Li, L.: End-to-end delay minimization for scientific workflows in clouds under budget constraint. IEEE Trans. Cloud Comput. 3(2), 169–181 (2015) 31. Tawalbeh, L.A., Jararweh, Y., Ababneh, F., Dosari, F.: Large scale cloudlets deployment for efficient mobile cloud computing. JNW 10, 70–76 (2015)

Motion Trajectory Sequence-Based Map Matching Assisted Indoor Autonomous Mobile Robot Positioning Wenping Yu1 , Jianzhong Zhang1(B) , Jingdong Xu2 , and Yuwei Xu1 1

College of Cyberspace Security, Nankai University, Tianjin, China [email protected], {zhangjz,xuyw}@nankai.edu.cn 2 College of Computer Science, Nankai University, Tianjin, China [email protected]

Abstract. Position information is one of the basic elements for the context awareness of autonomous mobile robots. This paper studies positioning algorithms for autonomous mobile robots suitable for search and rescue in dark building corridors and underground mine tunnels when an emergency occurs, and proposes a novel map-matching-aided positioning algorithm based on a Hidden Markov Model. The algorithm does not rely on a camera; it only uses the inertial sensors installed in the mobile robot and the indoor map to realize the fusion of dead reckoning and map matching. It first detects the position-related motion postures during the motion process, and then divides the motion trajectory into a sub-trajectory sequence. By matching the sub-trajectory sequence with the indoor map, the proposed algorithm achieves tracking and positioning of the mobile robot. To verify the effectiveness of the proposed algorithm, this paper adopts a four-wheel differentially driven robot to conduct an experimental analysis in an actual indoor scenario. The experimental results show that, compared with the traditional dead reckoning technology, this algorithm can distinctly reduce the average positioning error of the mobile robot, and it is robust to heading angle noise within a certain error range.

Keywords: Mobile robot · Indoor positioning · Hidden Markov Model · Posture pattern detection

1 Introduction

With the advancement of artificial intelligence, network and sensor technologies, the research and application of autonomous mobile robots have made remarkable progress in recent years. Indoor autonomous mobile robots are increasingly integrated into people's daily lives [1]. Autonomous mobile robots can be used extensively not only in modern intelligent warehouses, home services and many other settings, but also in corridors of complex buildings, subway tunnels


and underground mines when accidents occur. Therefore, the research of indoor autonomous mobile robot technology has gradually become a hot topic, and many domestic research institutes such as Tsinghua University, Harbin Institute of Technology, Nankai University and South China University of Technology are committed to the research and development of indoor autonomous mobile robots [2–6]. The autonomous positioning of a mobile robot is the process in which the robot autonomously determines its position in the working environment, and it is one of the most basic problems in improving the autonomous capabilities of the mobile robot. For outdoor positioning, the Global Positioning System (GPS) has become a widely used positioning technology for mobile robots. For indoor positioning, however, due to the blocking and interference of GPS signals by the external walls of buildings and the complex indoor electromagnetic environment, there is no universal solution to the positioning problem of indoor mobile robots [7,8]. Currently, researchers have proposed a variety of positioning methods for indoor autonomous mobile robots, including navigation beacon-based positioning [9], computer vision-based positioning [10,11], dead reckoning [12], map matching positioning [13,14] and simultaneous localization and mapping (SLAM) [15,16]. Positioning techniques based on navigation beacons rely on a series of deployed feature signals to provide stable and accurate location information, but require high deployment and maintenance costs. Dead reckoning uses inertial sensors or encoders to provide relatively accurate positions over short distances, but suffers from cumulative error that gradually increases with the travelled distance, and the robot's starting point needs to be known in advance. Map matching positioning uses known indoor maps to construct topological maps, feature maps and other abstract maps, and then the position of the mobile robot is obtained by matching the robot's motion trajectory with the indoor maps; by its very principle, the real-time performance of map matching is relatively poor. SLAM has unique advantages in unknown environments and can provide indoor floor plans or 3D maps while providing positioning [17]. However, it requires mobile robots to be equipped with more complex sensor devices, such as infrared, ultrasonic radar and RGB-D vision systems, and therefore has a higher implementation cost. Corridors of buildings, subway station tunnels and underground mines often have complex passageways, similar to “mazes”. In the event of an accident such as a fire, the power supply may be damaged, the communication infrastructure may become unusable, and smoke and dust may deprive the interior of lighting. All these situations pose challenges for the positioning of indoor autonomous mobile robots. Due to limitations of the working environment or deployment conditions, it is difficult to establish visual or wireless navigation beacons in advance, so positioning technology based on navigation beacons is not suitable; the influence of high temperature and smoke on the indoor environment makes it difficult for cameras to provide image information, so visual positioning technology fails; and the timeliness of SLAM technology cannot meet the urgent need for time


requirements in the above scenarios. In response to these problems, this paper introduces a hidden Markov model (HMM) based map matching algorithm that does not rely on a camera, but only uses the inertial sensors (accelerometer, gyroscope, and magnetometer) installed in autonomous mobile robots together with known indoor maps to effectively track and position mobile robots.

2 Robot Motion Model and Positioning Method

In the field of indoor autonomous mobile robot positioning, dead reckoning technology and map matching technology have a good complementarity. This paper proposes a map matching-assisted positioning method based on the motion trajectory sequence of the mobile robot to realize the fusion of the above two technologies. The positioning algorithm uses stairs and corridor corners in the indoor environment as virtual landmarks. When the mobile robot passes through these landmarks, the inertial sensor data shows a specific pattern. Therefore, in this paper, the above landmarks are called posture-related positions. When the robot's movement distance is short, the dead reckoning technology can give the real-time position of the robot. When the robot's movement distance is long, the robot's motion trajectory can be divided into multiple sub-trajectories according to the landmarks, and consecutive sub-trajectories form a sub-trajectory sequence. With the help of the HMM, the above sub-trajectory sequence can be matched to the corresponding road in the indoor map, and then the position estimation of the mobile robot is given. Further, when the robot's motion trajectory is long enough, the absolute position of the mobile robot can still be estimated even without knowing the robot's starting point.

2.1 Robot Motion Model and Its Dead Reckoning Algorithm

In this paper, a four-wheel differentially driven mobile robot is used to study the positioning problem of autonomous mobile robots in an indoor environment. The driving motor is a direct-current (DC) motor. The two driving motors on one side are connected in reverse parallel and are controlled by the L298N motor driving module, and the mobile robot adopts the Raspberry Pi B.V1.2 as the main controller. According to the driving mode of the mobile robot, the motion models of the two wheels on each side of the wheeled robot are the same. Therefore, the motion model of the mobile robot can be simplified to a left and right two-wheel differential driving mode. Figure 1(a) shows the simplified motion model of the mobile robot, where (x, y) is the position coordinate of the mobile robot in the global coordinate system and Θ is the angle between the heading direction of the mobile robot and the true north direction. The autonomous mobile robot used in this paper has a built-in digital compass, three-axis accelerometer and gyroscope. The digital compass gives the initial attitude of the mobile robot. The accelerometer and the gyroscope measure the movement acceleration and the rotation angular velocity of the mobile robot.


Fig. 1. Simplified motion model and its dead reckoning principle for four-wheel differentially driven robot. (a) Motion model and self coordinate system. (b) Dead reckoning in the global coordinate system.

The distance and heading direction change of the mobile robot can be obtained by integration, and then we can derive the latest position and posture of the mobile robot. In order to determine the position and posture of the mobile robot in the plane, we establish the global coordinate system OXY. Assuming that the starting point (x0, y0) is the origin of the coordinates and the starting attitude is the positive direction of the X-axis, the position and posture of the mobile robot at time k can be expressed by the vector (vk, θk, xk, yk)^T, where vk denotes the instantaneous velocity of the mobile robot, θk denotes the heading direction of the mobile robot, and xk, yk denote the coordinates of the mobile robot in the global coordinate system, as shown in Fig. 1(b). When the update cycle of the sensor data is very small, such as 5 ms in this paper, the trajectory of the mobile robot within one cycle can be approximated by a straight line, and the position of the mobile robot at time k can be recursively obtained by Eq. 1.

\[ \begin{pmatrix} v_k \\ \theta_k \\ x_k \\ y_k \end{pmatrix} = \begin{pmatrix} v_{k-1} \\ \theta_{k-1} \\ x_{k-1} \\ y_{k-1} \end{pmatrix} + \begin{pmatrix} 0.5(a_{k-1}+a_k)\Delta t \\ 0.5(\omega_{k-1}+\omega_k)\Delta t \\ d_k \cos\theta_{k-1} \\ d_k \sin\theta_{k-1} \end{pmatrix}, \quad k \ge 1 \qquad (1) \]

where Δt is the time interval from time k−1 to time k (if the sensor has a fixed data update period, Δt also represents this update period), ak indicates the instantaneous acceleration along the heading direction of the mobile robot at time k, which can be measured by the Y-axis component of the accelerometer, ωk indicates the angular velocity of the heading direction at time k, which can be measured by the Z-axis of the gyroscope, and dk indicates the movement distance of the mobile robot from time k−1 to time k, which can be obtained from the following equation:

\[ d_k = \frac{v_{k-1}+v_k}{2}\,\Delta t \qquad (2) \]
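The recursion of Eqs. (1)–(2) amounts to a few lines of code. The following Python sketch only illustrates the trapezoidal integration described above; the function and variable names are our own and are not taken from the paper's implementation.

import math

def dead_reckoning_step(v_prev, theta_prev, x_prev, y_prev,
                        a_prev, a_cur, w_prev, w_cur, dt):
    """One update of Eqs. (1)-(2): trapezoidal integration of the
    accelerometer (a) and gyroscope (w) readings over one cycle dt."""
    v_cur = v_prev + 0.5 * (a_prev + a_cur) * dt        # new speed
    theta_cur = theta_prev + 0.5 * (w_prev + w_cur) * dt  # new heading
    d = 0.5 * (v_prev + v_cur) * dt                     # Eq. (2): distance in this cycle
    x_cur = x_prev + d * math.cos(theta_prev)           # Eq. (1): position update
    y_cur = y_prev + d * math.sin(theta_prev)
    return v_cur, theta_cur, x_cur, y_cur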

2.2 Architecture Overview

The overall architecture of the mobile robot positioning algorithm presented in this paper is shown in Fig. 2. The sensor data and the indoor floor plan are used as the input of the positioning algorithm. The sensor data is collected by the inertial sensors of the mobile robot, and the indoor floor plan is obtained by manual input or by an indoor electronic map construction algorithm such as SLAM. The indoor floor plan abstraction module translates the indoor floor plan into a directed graph, and the dead reckoning module and the motion posture detection module use the sensor data to give the relative displacement and the position-related postures of the mobile robot, respectively.

The goal of the map matching module is to match the motion trajectory of the mobile robot with a sequence of nodes in the directed graph, and then estimate the real-time position of the mobile robot. First, candidate road segments are selected according to the heading direction estimation and the connection of road segments. Secondly, the algorithm updates the related parameters in the hidden Markov model according to the latest candidate road segments. Finally, it estimates the possibilities of all the alternative roads through the Viterbi decoder. When the proposed algorithm is in the convergence stage, the most possible alternative road is the optimal estimation. The final output of the algorithm proposed in this paper is the real-time position and heading direction of the mobile robot.


Fig. 2. System architecture of mobile robot positioning algorithm.

2.3 Indoor Floor Plan Abstraction

The posture-related positions divide the indoor roads into road segments. Taking each road segment as a node and the posture change pattern from one road segment to another as a directed edge, the indoor floor plan can be abstracted as a directed graph. Figure 3 shows an example of an indoor floor plan and its corresponding directed graph. In this paper, a node is represented by the tuple


(id, x1, y1, x2, y2, ϕ1, ϕ2), where xi, yi, i = 1, 2 represent the coordinates of the two endpoints of the road segment, and ϕ1, ϕ2 represent the heading direction of the mobile robot when it moves along the road segment and reaches the corresponding endpoint. A tuple (id1, id2, x, y, change of motion attitude (MA)) represents a directed edge between nodes, where id1 denotes the identity of the starting node, id2 denotes the identity of the end node, and x, y and MA represent the coordinates of the position and the change of motion attitude from the starting node to the end node, respectively.
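For concreteness, the node and edge tuples defined above can be held in simple data structures. The following Python sketch is illustrative only; the class and field names are our own, and the MA field is shown as a plain string.

from dataclasses import dataclass

@dataclass
class SegmentNode:
    """A road segment: the node (id, x1, y1, x2, y2, phi1, phi2) of the directed graph."""
    id: int
    x1: float
    y1: float
    x2: float
    y2: float
    phi1: float   # heading when reaching endpoint 1 along the segment
    phi2: float   # heading when reaching endpoint 2 along the segment

@dataclass
class PostureEdge:
    """A directed edge (id1, id2, x, y, MA) between two road segments."""
    id1: int      # starting segment
    id2: int      # end segment
    x: float      # coordinates of the posture-related position
    y: float
    ma: str       # change of motion attitude, e.g. "LEFT_TURN" or "U_TURN"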


Fig. 3. Indoor floor plan example and its corresponding directed graph.

2.4 Motion Posture Detection

This section presents a decision tree model for the motion posture detection of indoor mobile robots. This paper focuses on the positioning of mobile robots in the indoor 2D plane. Therefore, our decision tree considers only the detection of the relevant postures of a mobile robot on the plane, including stationary, going straight, left/right turns and U-turns. The horizontal movement posture of the indoor autonomous mobile robot is distinguished by different patterns of the horizontal component of the accelerometer and the vertical component of the gyroscope, where the horizontal component of the accelerometer is the data of the Y-axis in the local coordinate system of the mobile robot and the vertical component of the gyroscope is the data of the Z-axis in the robot's local coordinate system. Furthermore, by additionally extracting the vertical component of the accelerometer, this method can easily be extended to three-dimensional indoor positioning scenarios.

Figure 4 shows a decision tree for the motion posture detection of an indoor autonomous mobile robot. The decision tree uses the signal characteristics of the built-in accelerometer and gyroscope to identify different motion posture patterns of the mobile robot. Considering that the linear velocity of the mobile robot is obtained from the integration of the horizontal acceleration component over time, the instantaneous velocity has an accumulated error at any given moment. Therefore, the top layer of the decision tree uses the variance of the horizontal acceleration component to separate the stationary and going straight states; the second level of the decision tree uses the rotation rate measured on the Z-axis of the


Fig. 4. Decision tree for motion posture detection.

gyroscope to separate turns from going straight; finally, the third level of the decision tree uses the rotation angle to separate the U-turn from the left or right turn.
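A minimal sketch of the decision tree of Fig. 4 is given below. The threshold values are placeholders (the paper does not report them), and the window statistics are assumed to be computed beforehand from the Y-axis accelerometer and Z-axis gyroscope data.

def classify_posture(acc_var, rot_rate, rot_angle,
                     var_th=0.05, rate_th=0.2, angle_th=2.0):
    """Decision tree of Fig. 4 (threshold values are illustrative placeholders).
    acc_var:   variance of the horizontal (Y-axis) acceleration over a window
    rot_rate:  Z-axis gyroscope rotation rate (rad/s)
    rot_angle: accumulated rotation angle over the window (rad)"""
    if acc_var < var_th:            # level 1: low acceleration variance
        return "stationary"
    if abs(rot_rate) < rate_th:     # level 2: low rotation rate -> straight motion
        return "go straight"
    if abs(rot_angle) < angle_th:   # level 3: moderate accumulated angle
        return "right/left turn"
    return "U-turn"                 # large accumulated angle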

3 HMM Based Map Matching Algorithm

With the help of the motion posture detection based on inertial sensor data, the motion trajectory of the mobile robot can be divided into sub-trajectory segments by position-related postures, such as left or right turns and U-turns, and these sub-trajectory segments form a sub-trajectory sequence in the time dimension. This section gives a detailed description of the Hidden Markov Model used for matching the sub-trajectory sequence to the indoor abstract graph.

3.1 Hidden Markov Model

A Hidden Markov Model is a time-series probability model that describes the state of a process using discrete random variables. A basic HMM can be represented as λ = (S, V, A, B, π), where:

(1) S = {s1, s2, s3, . . . , sN} is the set of possible hidden states and N = |S|. In our case, each state represents an indoor road segment, that is, a node of the directed graph. Therefore, a state s is represented by the tuple in the form of (id, x1, y1, ϕ1, x2, y2, ϕ2), where id is the identification of the road segment and x1, y1, ϕ1, x2, y2, ϕ2 are the attributes of the corresponding node of the directed graph. It should be noted that if the mobile robot can reach another road segment by going straight from one road segment, these two road segments can be merged into a new road segment, i.e., a new hidden state. For example, the road segments s4 and s6 can be combined into a new road segment as shown in Fig. 5.


(2) V = {v1, v2, v3, . . . , vM} is the set of observations from the model and M = |V|. In our case, an observable state represents the relative movement distance and heading direction measured by the motion sensors installed in the mobile robot and is represented in terms of (dist, ϕ).
(3) A = {aij} is the state transition probability distribution, where aij = p{qt+1 = sj | qt = si}, i, j ≤ N, and qt denotes the state at time t. In other words, aij indicates the possibility of moving from one road segment to an adjacent road segment.
(4) B = {bi(k)} is the observation probability distribution in state i, where bi(k) = p{zt = vk | qt = si}, 1 ≤ i ≤ N, 1 ≤ k ≤ M, and zt, qt are the observation and state at time t, respectively. In other words, bi(k) indicates the possibility of a certain distance and heading direction being measured by the inertial sensors after the mobile robot has passed a road segment.
(5) π = {πi} is the initial state distribution, where πi = p{q1 = si}. A compact data-structure view of these five components is sketched below.
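The sketch below is illustrative only; the types and names are ours, and the two probability functions are assumed to implement Eqs. (3) and (4) defined in the following subsections.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Observation = Tuple[float, float]   # (dist, phi): displacement and heading of one sub-trajectory

@dataclass
class TrajectoryHMM:
    """Compact container for lambda = (S, V, A, B, pi)."""
    states: List[int]                                # S: ids of the road-segment nodes
    trans_prob: Callable[[int, int, str], float]     # A: p(s_j | s_i, RobMA), cf. Eq. (3)
    obs_prob: Callable[[int, Observation], float]    # B: p((dist, phi) | s_j), cf. Eq. (4)
    init_prob: Dict[int, float] = field(default_factory=dict)  # pi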

3.2 Transition Probability Distribution (A)

The transition probability distribution refers to the possibility of moving from a hidden state to the next hidden state. In this paper, it also means the possibility of moving from one road segment to an adjacent road segment. The adjacent road segments are divided by posture-related positions, and each posture-related position has a corresponding motion posture. The higher the degree of matching between the mobile robot's motion posture and the position-related posture, the greater the probability that the mobile robot moves from one road segment to another through this posture-related position, and vice versa. Therefore, we use the degree of matching between the motion posture of the mobile robot and the position-related posture to represent the transition probabilities between adjacent road segments. Let eij denote the edge of the directed graph from si to sj; the corresponding position-related posture can be represented by eij.MA according to the definition in Sect. 2.3. Given the motion posture of the mobile robot RobMA(t) at time t, the probability of moving from si to sj is given by Eq. 3, where p(RobMA(t) | eij.MA) can be obtained from the motion posture confusion matrix in Sect. 2.4.

\[ p(s_{j,t}\mid s_{i,t-1}) = p(s_j\mid s_i, Rob_{MA}(t)) = p(Rob_{MA}(t)\mid e_{ij}.MA) \qquad (3) \]

3.3 Observation Probability Distribution (B)

In this paper, an observable state consists of the relative displacement of the mobile robot and the heading direction, and the two are independent of each other. Therefore, the observation probability distribution can be defined as:

\[ P(v_{k,t}\mid s_{j,t}) = P(\varphi(t)\mid s_{j,t}) \cdot P(dist(t)\mid s_{j,t}) \qquad (4) \]

where P(ϕ(t) | sj,t) represents the observation probability of the mobile robot's heading direction at time t, and P(dist(t) | sj,t) denotes the observation probability determined by the relative displacement of the mobile robot.


The higher the degree of matching between the heading direction of the mobile robot and the road segment, the greater the possibility that the mobile robot is located in that road segment. In the indoor environment, the error of the heading direction of the mobile robot comes not only from the accumulation error, but also from the interference of various metal materials in the buildings. In general, the error of the heading direction is relatively large and it is difficult to model this error accurately. Therefore, in this paper, Eq. 5 is used to model the heading direction in the observable state.

\[ P(\varphi(t)\mid s_{j,t}) = P\{\varphi(t)\mid s_j.\varphi_i,\ i=1,2\} = \begin{cases} 1, & \text{if } |\varphi(t)-s_j.\varphi_i| < H_{TH},\ i=1,2 \\ 0, & \text{otherwise} \end{cases} \qquad (5) \]

where HTH is a constant threshold used to determine whether the heading direction of the mobile robot matches the direction of the road segment or not. In order to avoid excluding the correct road segment due to a large heading direction error, HTH is set to 59° in this paper.

The relative displacement error of the mobile robot mainly comes from the accumulative error caused by the acceleration error in the dead reckoning process. Here we assume that the relative displacement of the mobile robot obeys a Gaussian distribution. On the one hand, intuitively, the closer the relative displacement of the mobile robot is to the length of the road segment, the more likely the mobile robot is located in that road segment; on the other hand, all road segments whose lengths are much larger than the relative displacement of the mobile robot should have the same possibility. Combining the above two situations, this paper uses Eq. 6 to model the relative displacement in the observable state.

\[ P(dist(t)\mid s_{j,t}) = P\{dist(t)\mid s_j.dist\} = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_d}\, e^{-4.5}, & dist(t)+3\sigma_d \le s_j.dist \\[4pt] \dfrac{1}{\sqrt{2\pi}\,\sigma_d}\, e^{-\frac{(dist(t)-s_j.dist)^2}{2\sigma_d^2}}, & \text{otherwise} \end{cases} \qquad (6) \]

where sj.dist is the length of the road segment, which can be derived from the two endpoints of the road segment sj, and σd is the standard deviation of the relative displacement of the mobile robot at time t. In order to estimate the value of σd, this paper first tests the change of the accelerometer value Δa when the mobile robot is stationary, and estimates the standard deviation of the acceleration σa based on the median absolute deviation (MAD) of the test data [18]. It can be inferred that there is a quadratic relationship between σd and σa according to the principle of dead reckoning described in Sect. 2.1.

\[ \sigma_a = 1.4826 \times \mathrm{median}(|\Delta a|) \qquad (7) \]
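One possible reading of Eqs. (5)–(7) in code is sketched below. Two assumptions are worth flagging: the condition "i = 1, 2" in Eq. (5) is interpreted as matching either direction of the segment, and angle wrap-around is ignored for brevity. The function names are ours.

import math
import statistics

H_TH = math.radians(59.0)   # heading threshold from the paper

def heading_prob(phi, seg_phi1, seg_phi2, th=H_TH):
    # Eq. (5): 1 if the measured heading matches either direction of the segment.
    return 1.0 if min(abs(phi - seg_phi1), abs(phi - seg_phi2)) < th else 0.0

def distance_prob(dist, seg_len, sigma_d):
    # Eq. (6): Gaussian likelihood, flattened for segments much longer than dist.
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_d)
    if dist + 3.0 * sigma_d <= seg_len:
        return norm * math.exp(-4.5)
    return norm * math.exp(-((dist - seg_len) ** 2) / (2.0 * sigma_d ** 2))

def sigma_from_mad(delta_a):
    # Eq. (7): robust estimate of the acceleration noise from static samples.
    return 1.4826 * statistics.median(abs(x) for x in delta_a)

def observation_prob(phi, dist, seg_phi1, seg_phi2, seg_len, sigma_d):
    # Eq. (4): independent heading and displacement terms.
    return heading_prob(phi, seg_phi1, seg_phi2) * distance_prob(dist, seg_len, sigma_d)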

3.4 Initial State Distribution

If the starting point of the mobile robot is already known, then the road segment where the starting point is located is the initial state and its probability is set to 1. If the starting point of the mobile robot is unknown, all candidate road segments can be selected by Eq. 5 based on the initial heading direction information of the mobile robot, and the initial probability distribution is a uniform distribution over the candidate road segments.

3.5 Optimal Motion Trajectory Estimation

Based on the above-defined Hidden Markov Model, this paper uses the Viterbi algorithm to determine the optimal estimation of the moving trajectory of a mobile robot. For a given observable state sequence (z1, z2, ..., zk), the goal of the Viterbi algorithm is to find the most possible hidden state sequence (q1, q2, ..., qk). Figure 5 briefly illustrates the decoding process of the Viterbi algorithm.


Fig. 5. Illustration of the proposed HMM model Viterbi decoding.

The Viterbi decoder is implemented by the dynamic programming method. First, a Viterbi variable is defined to represent the maximum probability that the Hidden Markov Model reaches the state si along a path at time t:

\[ \delta_t(i) = \max_{q_1,\ldots,q_{t-1}} P\{q_1, q_2, \cdots, q_t = s_i, z_1, z_2, \cdots, z_t \mid \lambda\} \qquad (8) \]

At time t + 1, the maximum probability of reaching the hidden state sj can be recursively derived from the Viterbi variable at time t by the following equation.

\[ \delta_{t+1}(j) = \Big[\max_i \big(\delta_t(i)\cdot P\{q_{t+1}=s_j\mid q_t=s_i\}\big)\Big]\cdot P\{z_{t+1}\mid q_{t+1}=s_j\}, \quad 1 \le t < k \qquad (9) \]

By recording the backward pointers, at time k, the most likely hidden state sequence, that is, the optimal estimation of the motion trajectory of the mobile robot, can be obtained by the path backtracking method.
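A generic Viterbi decoder implementing Eqs. (8)–(9) is sketched below. It works in log-space for numerical stability; trans_prob is assumed to take the time index so that it can look up the detected motion posture RobMA(t) of Eq. (3), and obs_prob follows Eq. (4). The function names and signatures are ours.

import math

def safe_log(p):
    # guard against zero probabilities
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(states, init_prob, trans_prob, obs_prob, observations):
    """Eqs. (8)-(9): most likely sequence of road segments (hidden states) for a
    sequence of (dist, phi) observations, with backward pointers for backtracking."""
    delta = {s: safe_log(init_prob.get(s, 0.0)) + safe_log(obs_prob(s, observations[0]))
             for s in states}
    backptr = []
    for t in range(1, len(observations)):
        new_delta, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: delta[i] + safe_log(trans_prob(i, j, t)))
            new_delta[j] = (delta[best_i] + safe_log(trans_prob(best_i, j, t))
                            + safe_log(obs_prob(j, observations[t])))
            ptr[j] = best_i
        delta, backptr = new_delta, backptr + [ptr]
    # path backtracking from the best final state
    state = max(states, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))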

4 Evaluation

We use the wheeled mobile robot described in Sect. 2 to complete the experimental analysis. The experimental environment is the fifth floor of a teaching hall on our campus. The experimental area is divided into east and west parts; the east part is approximately 84.85 × 66.8 m², the west part is approximately 68.7 × 106.75 m², and the connecting corridor between the two parts is 46.25 m long and 2.4 m wide. The overall layout is shown in Fig. 6. In order to record the real position of the robot during the movement, this paper divides the experimental area into squares of 0.8 × 0.8 m² and marks each small area. During the experiments, another mobile robot with a camera moves in parallel with the robot to record its real-time positions. In the experimental area, the robot moves along the two planned trajectories denoted as T1 and T2 in Fig. 6. The length of T1 is 181.9 m, including 3 posture-related positions; the whole trajectory is divided into 4 sub-trajectories. The length of T2 is 180.5 m, including 2 posture-related positions; the whole trajectory is divided into 3 sub-trajectories. The mobile robot repeats each trajectory 9 times.


Fig. 6. Indoor floor plan of experimental environment and mobile robot trajectories.

4.1 Influence of Heading Direction Errors

When the starting point is unknown, the convergence performance of the algorithm is closely related to the detection results of position-related postures. In general, the map matching algorithm based on posture detection is likely to converge only after at least two consecutive position-related postures have been correctly detected. For position-related posture detection in a 2D floor plan, such as left and right turns, the error of the heading direction has a greater impact


on the detection results. Therefore, this section first analyzes the influence of the heading direction error on the convergence performance of the map matching algorithm. Here, we first define the precision of position-related posture detection by Eq. 10.

\[ \mathit{Precision} = \frac{\text{Number of Correctly Detected Consecutive Two Postures}}{\text{Total Number of Consecutive Two Postures}} \qquad (10) \]

Supposing that the heading direction estimation error obeys a Gaussian distribution with zero mean, a Gaussian random value is added to the raw heading direction data to simulate different degrees of error. Figure 7 shows the variation of the precision of position-related posture detection under different values of the standard deviation of the heading direction error. It can be seen from Fig. 7 that the precision of position-related posture detection is stable under certain heading direction error conditions, but when the standard deviation of the heading direction error reaches a certain level (40° for T1 and 30° for T2), the precision drops rapidly.

Fig. 7. Heading errors on the precision of mobile robot posture detection.

4.2 Convergence Speed Analysis Without Knowing the Starting Point

If the starting point is unknown, after the mobile robot moves a certain distance, the algorithm can still converge and finally estimate the real-time position of the mobile robot. The distance traveled before the algorithm converges represents the convergence performance of the positioning algorithm. In order to evaluate the convergence performance of the proposed map matching algorithm, we compare it with the semMatch algorithm proposed in [18], because semMatch has certain similarities with the algorithm presented in this paper. The hidden Markov model is also used to implement map matching in semMatch; however, the details of the HMM model are slightly different.


Fig. 8. Distance traveled before convergence for each trajectory.

Using the same decision tree model to detect the position-related postures in the trajectory of the mobile robot, Fig. 8 shows the convergence performance of the two algorithms. On the one hand, for T1, both algorithms reach the convergence state after passing through two posture-related positions. However, the algorithm proposed in this paper needs to observe the subsequent road segment after detecting the corresponding posture, so its convergence performance is slightly worse: when it reaches the convergence state, the mobile robot has moved 1.9 m more. On the other hand, for T2, the algorithm proposed in this paper reaches the convergence state shortly after the correct detection of the first position-related posture, but semMatch does not converge due to the symmetry of the indoor road network, denoted by ∞ in Fig. 8. The main reason is the difference in the HMM model definition of the two map matching algorithms. The hidden state of the map matching algorithm proposed in this paper is a straight road segment in the indoor road network. For T2, the mobile robot first passes a long enough road segment, and the proposed algorithm combines this observable state with the subsequent detection of the first position-related posture to achieve convergence.

4.3 Online Positioning Performance with Knowing the Starting Point

If the starting point is already known, the proposed algorithm does not need to go through the motion trajectory matching stage before converging. After convergence, the algorithm can track the moving trajectory of the mobile robot in real time. We use the Euclidean distance between the real position of the mobile robot and the position estimate given by the algorithm to analyze the real-time positioning performance. Figure 9 shows the variation of the positioning error of the mobile robot with increasing distance on both the T1 and T2 trajectories. When the motion trajectory does not include posture-related positions, the map matching assisted positioning technology proposed in this paper is equivalent to the traditional dead reckoning technology; but after detecting the posture-related positions, the known coordinates of the posture-related positions can be used to calibrate the real-time position estimation of the robot. The real-time


Fig. 9. Online positioning errors for each trajectory.

positioning results of the mobile robot on T1 and T2 shown in Fig. 9 verify this trend. For T1, the average positioning error decreases from 4.0 m to 2.49 m, while for T2, the average positioning error decreases from 6.58 m to 3.39 m. From the experimental results, it can be deduced that in an actual environment, as the density of posture-related positions increases, the improvement in the positioning performance of the proposed algorithm becomes more obvious.

5 Conclusion

In order to solve the difficult problem of positioning autonomous mobile robots in dark complex building corridors, subway tunnels, or underground mines after a sudden accident such as a fire, this paper proposes an indoor autonomous mobile robot tracking and positioning algorithm based on a novel hidden Markov model. In the structured indoor environment, this method uses the detection of position-related postures to match the motion trajectory of the mobile robot to the abstraction of the indoor floor plan. Compared with the traditional dead reckoning technology, the proposed algorithm can significantly reduce the influence of cumulative errors on the positioning accuracy, and is robust to heading direction and acceleration noise within a certain error range. This algorithm does not rely on cameras, and uses only motion sensors installed in autonomous mobile robots and a known indoor floor plan to achieve fusion positioning of dead reckoning and map matching techniques, even when the starting point is unknown. This algorithm has the characteristics of simple deployment, low manufacturing cost and easy operation.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (No. 61702288), the Natural Science Foundation of Tianjin in China (No. 16JCQNJC00700) and the Fundamental Research Funds for the Central Universities.


References
1. Garcia, E., Jimenez, M.A., De Santos, P.G., Armada, M.: The evolution of robotics research. Robot. Autom. Mag. IEEE 14(1), 90–103 (2007)
2. Wu, J., Li, T.M., Tang, X.Q.: Robust trajectory tracking control of a planar parallel mechanism. J. Tsinghua Univ. 5, 642–646 (2005)
3. Wu, J., Wang, D., Wang, L.: A control strategy of a two degrees-of-freedom heavy duty parallel manipulator. J. Dyn. Syst. Meas. Contr. 137(6), 061007 (2015)
4. Yang, J., Yang, J., Cai, Z.: An efficient approach to pose tracking based on odometric error modelling for mobile robots. Robotica 33(6), 1231–1249 (2015)
5. Yuan, X., Wang, D., Yan, Y.: Self-positioning of robot based on dead reckoning and ultrasonic data fusion. J. Naval Univ. Eng. 21(5), 67–72 (2009). (in Chinese)
6. Yu, N., Wang, S., Xu, C.: RGB-D based autonomous exploration and mapping of a mobile robot in unknown indoor environment. Robot 39(6), 860–871 (2017). (in Chinese)
7. Bachrach, A., De Winter, A., He, R., Hemann, G.: Range - robust autonomous navigation in GPS-denied environments. In: IEEE International Conference on Robotics and Automation, pp. 1096–1097. IEEE (2011)
8. Bao, H., Wong, W.C.: An indoor dead-reckoning algorithm with map matching. In: 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC), pp. 1534–1539. IEEE (2013)
9. Tang, H., Chen, W., Wang, J.: Artificial landmark distribution based on multi-ary m-sequence. Robot 36(1), 29–35 (2014). (in Chinese)
10. Lu, Y., Song, D.: Visual navigation using heterogeneous landmarks and unsupervised geometric constraints. IEEE Trans. Robot. 31(3), 736–749 (2015)
11. Gao, X., Zhang, T.: Unsupervised learning to detect loops using deep neural networks for visual SLAM system. Auton. Robots 41(1), 1–18 (2017)
12. Kim, J.H., Lee, J.C.: Dead-reckoning scheme for wheeled mobile robots moving on curved surfaces. J. Intell. Robotic Syst. 79(2), 211–220 (2015)
13. Grisetti, G., Stachniss, C., Burgard, W.: Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans. Robot. 23(1), 34–46 (2007)
14. Cheng, H., Chen, H., Liu, Y.: Topological indoor localization and navigation for autonomous mobile robot. IEEE Trans. Autom. Sci. Eng. 12(2), 729–738 (2015)
15. de la Puente, P., Rodríguez-Losada, D.: Feature based graph-SLAM in structured environments. Auton. Robots 37(3), 243–260 (2014)
16. Havangi, R., Taghirad, H.D., Nekoui, M.A., Teshnehlab, M.: A square root unscented FastSLAM with improved proposal distribution and resampling. IEEE Trans. Ind. Electron. 61(5), 2334–2345 (2014)
17. Richter, C., Vega-Brown, W., Roy, N.: Bayesian learning for safe high-speed navigation in unknown environments. In: Bicchi, A., Burgard, W. (eds.) Robotics Research. SPAR, vol. 3, pp. 325–341. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-60916-4_19
18. Aly, H., Youssef, M.: semMatch: road semantics-based accurate map matching for challenging positioning data. In: The 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, p. 5. ACM (2015)

Towards the Independent Spanning Trees in the Line Graphs of Interconnection Networks

Baolei Cheng1,2,3, Jianxi Fan1,2(B), Xiaoyan Li1, Guijuan Wang1, Jingya Zhou1, and Yuejuan Han1

1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
{chengbaolei,jxfan,jy zhou,hyj}@suda.edu.cn, {xyli,20164027004}@stu.suda.edu.cn
2 Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 21000, Jiangsu, China
3 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China

Abstract. Node/edge-independent spanning trees (ISTs) have attracted a lot of attention in the past twenty years. Many results such as edge-disjoint Hamilton cycles, traceability, number of spanning trees, structural properties, topological indices, etc., have been obtained on line graphs, and researchers have applied the line graphs of some interconnection networks to data center networks, such as SWCube, BCDC, etc. However, the node/edge conjecture is still open for n-node-connected interconnection networks with n ≥ 5. So far, results have been obtained on a lot of special interconnection networks, but few results are reported on their line graphs. In this paper, we consider the problem of constructing node-ISTs in the line graph G of an interconnection network G′. We first give the construction of node-ISTs in G based on the edge-ISTs in G′. Then, an algorithm to construct node-ISTs in G based on the edge-ISTs in G′ is presented. At the end, simulation experiments on the line graphs of hypercubes show that the maximal height of the constructed node-ISTs on the line graph of the n-dimensional hypercube is n + 1 for n ≥ 3.

Keywords: Independent spanning trees · Internally disjoint paths · Line graph · Interconnection network

1 Introduction

Node/edge-independent spanning trees (ISTs) can be used in reliable communication protocols [2,20], one-to-all broadcasting [29], multi-node broadcasting [4], reliable broadcasting, and secure message distribution [3]. Therefore, the problem to construct multiple node/edge-ISTs for a given interconnection network is becoming an important issue.


We focus on the two well-known conjectures on the existence of ISTs in any interconnection network [20,33] as follows:

Conjecture 1. Given an n-node-connected interconnection network G with n ≥ 1, there exist n node-ISTs rooted at an arbitrary node in G.

Conjecture 2. Given an n-edge-connected interconnection network G with n ≥ 1, there exist n edge-ISTs rooted at an arbitrary node in G.

Khuller and Schieber gave a proof that if any n-node-connected interconnection network has n node-ISTs, then any n-edge-connected interconnection network has n edge-ISTs [21]. However, Gopalan and Ramasubramanian found a counterexample to disprove Khuller and Schieber's results [12]. Thus, whether the node conjecture implies the edge conjecture, or vice versa, is still an open problem. For any interconnection network with n ≤ 4, Conjectures 1 and 2 were solved in [9,10,13,16,20,33]. For n ≥ 5, Conjectures 1 and 2 have been solved for some restricted classes of networks, such as planar networks [17], product networks [26], hypercubes [30,32], locally twisted cubes [25], crossed cubes [5–7], Möbius cubes [8], even networks [22], odd networks [23], Gaussian networks [18], etc.

The line graph has received much attention from researchers in recent years. Results have been reported on edge-disjoint Hamilton cycles [24], traceability [28], number of spanning trees [11], structural properties [14], topological indices [27], treewidth [15], clique-perfectness [1], etc. Line graphs have applications in some data center networks that deploy servers on the edges of the original interconnection networks, such as SWCube [19], BCDC [31], etc. However, few results have been reported on the topic of independent spanning trees on line graphs.

In this paper, we first adopt the definition of the line graph G of an n-edge-connected interconnection network G′. We mainly obtain the following results:

1. If there are n edge-ISTs rooted at an arbitrary node in G′, then there are n node-ISTs rooted at an arbitrary node in G.
2. An algorithm to construct n node-ISTs rooted at an arbitrary node in G based on the n edge-ISTs rooted at an arbitrary node in G′ is presented.
3. Some simulation results on the line graphs of hypercubes based on Java and JUNG technology are shown.

Finally, we point out that the algorithm proposed in this paper can be used to construct node-independent spanning trees on the SWCube and BCDC data center networks.

2 Preliminaries

2.1 Graph Terminology and Notation

An interconnection network can be abstracted as a graph G(V (G), E(G)), where V (G) denotes the node set and E(G) denotes the edge set. In this paper, graphs


and networks are used interchangeably. We can also use decimal numbers to denote the nodes in G. Two x, y-paths P and Q starting at x and ending at y are edge-disjoint if E(P) ∩ E(Q) = ∅. Two x, y-paths P and Q are internally node-disjoint if they are edge-disjoint and V(P) ∩ V(Q) = {x, y}. Two spanning trees T1 and T2, rooted at the same node u in G, are edge-independent if the u, v-path in T1 and the u, v-path in T2 are edge-disjoint for each v ∈ V(G)\{u}. Two spanning trees T1 and T2 rooted at u in network G are node-independent if the u, v-path in T1 and the u, v-path in T2 are internally node-disjoint for each v ∈ V(G)\{u}. Clearly, if two trees T1 and T2 are node-independent spanning trees, then they are also edge-independent spanning trees. We use path(u, v, T) to denote the u, v-path in a tree T rooted at node u. A set of spanning trees rooted at the same node in G are edge-independent (resp., node-independent) if they are pairwise edge-independent (resp., node-independent). We also use node-ISTs (resp., edge-ISTs) for short to represent node-independent spanning trees (resp., edge-independent spanning trees).

2.2 A Class of Networks—Line Graphs

Given a network G′, its line graph G is a graph such that each vertex of G represents an edge of G′ and two vertices of G are adjacent if and only if their corresponding edges share a common endpoint (i.e., are incident) in G′. Now we provide Transformation 1 to demonstrate the construction of a line graph based on an existing network.

Transformation 1. Given a network G′, we construct the line graph G by the following steps:
(1) For every edge starting from node x and ending at node y in E(G′), add a node [x, y] to network G, which is referred to as an edge-node.
(2) For every two adjacent edges (x, y) and (y, z) in G′, connect [x, y] with [y, z] in G.

Figure 1 shows a network G′ and its line graph G. Network G is derived from network G′, where the number of edges in G′ equals the number of nodes in G. Enlightened by Conjectures 1 and 2, the following interesting problem is naturally proposed.

Problem 1. Given n edge-ISTs in an n-edge-connected network G′, can we construct n node-ISTs in the line graph of G′?

In the following section, we try to answer this question by providing a general algorithm for any n-edge-connected network and its line graph.
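A direct, unoptimized implementation of Transformation 1 can be written as follows; the function name and the representation (an adjacency map keyed by sorted edge tuples) are our own choices and only serve as an illustration.

from itertools import combinations

def line_graph(edges):
    """Transformation 1: build the line graph of G' from its edge list.
    Each edge (x, y) of G' becomes the edge-node [min(x,y), max(x,y)];
    two edge-nodes are joined when the original edges share an endpoint."""
    nodes = [tuple(sorted(e)) for e in edges]
    adj = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if set(a) & set(b):          # edges of G' sharing a common endpoint
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Example: a 4-cycle 0-1-2-3-0 yields a 4-cycle on its four edge-nodes.
print(line_graph([(0, 1), (1, 2), (2, 3), (3, 0)]))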


Fig. 1. A network G′ and its line graph G.

3 Node-Independent Spanning Trees in Line Graphs

In this section, we first propose an algorithm, called NodeIST, to construct n node-ISTs in the line graph G based on the n edge-ISTs in G′. Then, we prove that the n trees obtained by Algorithm NodeIST based on Transformation 1 are n node-ISTs.

3.1 Construction Algorithm of Node-Independent Spanning Trees for Line Graphs

We now present an algorithm, called NodeIST, to construct n node-ISTs T1, T2, . . . , Tn rooted at node [u, v] in the line graph G of G′, based on the n edge-ISTs T1′, T2′, . . . , Tn′ rooted at u in the n-edge-connected network G′ and an edge (u, v). Since (u, v) and (v, u) are the same edge in G′, we let [u, v] and [v, u] denote the same node in G. For simplicity, we always let an edge start at a smaller node and end at a bigger node in the examples shown in Fig. 2.

In Algorithm NodeIST, Step 1 is called to initialize the trees T1, T2, . . . , Tn. By Step 2, the edge starting at the root node [u, v] in each tree is determined and the edges derived from T1′, T2′, . . . , Tn′ are determined. After executing Step 3, each tree contains all the edges of G′ (as edge-nodes of G).

Algorithm NodeIST
Input: n edge-independent spanning trees T1′, T2′, ..., Tn′ rooted at u in the n-edge-connected network G′; v is an arbitrary adjacent node of u in G′, where v > u;
Output: n node-independent spanning trees T1, T2, ..., Tn rooted at node [u, v] in the line graph of G′, denoted as G;
Begin
Step 1:
1: V(Ti) = V(G) and E(Ti) = ∅ for i = 1 to n.


Step 2:
2: for i = 1 to n do in parallel
3:   Suppose that there exists u(i) such that (u, u(i)) ∈ E(Ti′).
4:   if u(i) ≠ v then
5:     E(Ti) = E(Ti) ∪ {([u, v], [u, u(i)])}.
6:   end if
7:   if an edge (x, y) is adjacent to another edge (w, z) in Ti′ then
8:     E(Ti) = E(Ti) ∪ {([x, y], [w, z])}.
9:   end if
Step 3:
10: for any edge (x, y) ∈ E(G′)\(E(Ti′) ∪ {(u, v)}) with x < y do
11:   Suppose that there exists an edge (x, y(i)) ∈ E(Ti′).
12:   E(Ti) = E(Ti) ∪ {([x, y], [x, y(i)])}.
13: end for
end
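The following Python sketch is one possible interpretation of Algorithm NodeIST, not a verified re-implementation. In particular, "adjacent" in line 7 of Step 2 is read as "consecutive along a root-to-leaf path of Ti′", so that every edge-node receives exactly one parent, and the edge (x, y(i)) chosen in Step 3 is taken to be the edge between x and its parent in Ti′, which matches the path structure used in the proof of Theorem 1. Node u is assumed to have a single child in each Ti′, as Step 2 implies.

def node_ist(edge_ists, u, v, all_edges):
    """Interpretive sketch of Algorithm NodeIST (see the caveats above).
    edge_ists : list of n edge-ISTs of G', each as a {child: parent} map rooted at u
    all_edges : iterable of edges (x, y) of G'
    Returns n parent maps over edge-nodes, each rooted at the edge-node (u, v)."""
    root = tuple(sorted((u, v)))
    trees = []
    for parent in edge_ists:
        u_child = next(c for c, p in parent.items() if p == u)   # u^(i) of Step 2
        tree_edges = {tuple(sorted((c, p))) for c, p in parent.items()}
        t = {}                                                   # edge-node -> parent edge-node
        # Step 2: tree edges of Ti' are chained along the root-to-leaf paths
        for c, p in parent.items():
            en = tuple(sorted((c, p)))
            if en == root:
                continue                                         # u^(i) = v: [u, v] is the root itself
            t[en] = root if p == u else tuple(sorted((p, parent[p])))
        # Step 3: each non-tree edge (x, y), x < y, hangs off the Ti'-edge at its smaller endpoint x
        for e in all_edges:
            x, y = sorted(e)
            if (x, y) == root or (x, y) in tree_edges:
                continue
            anchor = u_child if x == u else parent[x]
            t[(x, y)] = tuple(sorted((x, anchor)))
        trees.append(t)
    return trees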

Example 1. Take the network G′ and its line graph G in Fig. 1 for example. The three trees in Fig. 2(a) are not edge-ISTs in G′, because the 0, 7-path in the second tree and the 0, 7-path in the third tree have the common edge (2, 6). In Fig. 2(b), the three trees are edge-ISTs rooted at node 0 in G′, which are isomorphic to each other. Suppose that the three trees in Fig. 2(b) from left to right are T1′, T2′, and T3′. We use the three trees and node 1 as the input of Algorithm NodeIST. After the first step, we obtain the trees T1, T2, and T3 shown in Fig. 2(c), the edge sets of which are empty and the node sets of which contain all the edge-nodes of G; after the second step, the three trees are shown in Fig. 2(d); lastly, the constructed node-ISTs are demonstrated in Fig. 2(e). Note that each node in Fig. 2(b) is denoted by one decimal value, while each node in Fig. 2(c), (d), and (e) is denoted by two decimal values. Now, T1, T2, and T3 are three node-ISTs rooted at [0, 1] in G.

3.2 Correctness of Node-Independent Spanning Trees Obtained by Algorithm NodeIST

By Algorithm NodeIST, every edge of G′ is contained in Ti (as an edge-node) for i = 1, 2, . . . , n. Thus, we have the following lemma.

Lemma 1. Ti obtained by Algorithm NodeIST is a spanning tree in G for any integer i with 1 ≤ i ≤ n.

Proof. By Algorithm NodeIST, Ti contains all the nodes in V(G) and it is easy to verify that Ti is a tree for any integer i with 1 ≤ i ≤ n. Thus, the proof is completed.


Fig. 2. (a) Wrong edge-ISTs. (b) Correct edge-ISTs. (c) Trees obtained by Step 1 of Algorithm NodeIST. (d) Trees obtained by Step 2 of Algorithm NodeIST. (e) Node-ISTs.

Suppose that T is a tree rooted at node [u, v] and [x, y] is an arbitrary node in the set V(G)\{[u, v]}. We use path([u, v], [x, y], T) to denote the node set of the path starting at [u, v] and ending at [x, y] in T. By the definition of independent spanning trees, we present the following lemma to redefine node-independence.

Lemma 2. Let Ti and Tj be two different spanning trees rooted at node [u, v] in G, where 1 ≤ i < j ≤ n. Ti and Tj are node-independent if and only if, for every node [x, y] in G with [x, y] ≠ [u, v],


V(path([u, v], [x, y], Ti)) ∩ V(path([u, v], [x, y], Tj)) = {[u, v], [x, y]} and V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}.

Now we prove that the n trees obtained by Algorithm NodeIST are n node-independent spanning trees.

Theorem 1. T1, T2, . . . , Tn obtained by Algorithm NodeIST are n node-independent spanning trees rooted at node [u, v] in G.

Proof. By Lemma 1, Tl obtained by Algorithm NodeIST is a spanning tree in G for any integer l with 1 ≤ l ≤ n. Let [u, v] be the root node of each tree. We only need to prove that for any vertex [x, y] ∈ V(G)\{[u, v]} with x < y, V(path([u, v], [x, y], Ti)) ∩ V(path([u, v], [x, y], Tj)) = {[u, v], [x, y]} and V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}. For any 1 ≤ i < j ≤ n and any edge (x, y) ∈ E(G′)\{(u, v)}, we have the following cases:

Case 1. (x, y) ∈ E(Ti′) and (x, y) ∈ E(Tj′). Then, path(u, x, Ti′) and path(u, x, Tj′) are edge-disjoint (similarly, path(u, y, Ti′) and path(u, y, Tj′) are edge-disjoint) by the hypothesis of Algorithm NodeIST. Thus, we can verify that E(path(u, x, Ti′)) ∩ E(path(u, x, Tj′)) = ∅. By Algorithm NodeIST, all the edges in path(u, x, Ti′) and path(u, x, Tj′) are transformed into nodes and every two connected edges are transformed into two adjacent nodes. Since E(path(u, x, Ti′)) ∩ E(path(u, x, Tj′)) = ∅, we have V(path([u, v], [x, y], Ti)) ∩ V(path([u, v], [x, y], Tj)) = {[u, v], [x, y]}. Let the node adjacent to node u in Ti′ be w and the node adjacent to node u in Tj′ be z. Since Ti′ and Tj′ are edge-independent, w ≠ z. We have the following subcases.

Case 1.1. w = v and z ≠ v. It is clear that {w, z, x, y, u, v} ⊃ {x, y, u, v}. By Algorithm NodeIST, [u, z] ∈ V(path([u, v], [x, y], Tj)), which implies that V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}.

Case 1.2. w ≠ v and z = v. The proof is similar to Case 1.1.

Case 1.3. w ≠ v and z ≠ v. It is clear that {w, z, x, y, u, v} ⊃ {x, y, u, v}. By Algorithm NodeIST, [u, w] ∈ V(path([u, v], [x, y], Ti)) and [u, z] ∈ V(path([u, v], [x, y], Tj)), which implies that V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}.

Case 2. (x, y) ∈ E(Ti′) and (x, y) ∉ E(Tj′). By Algorithm NodeIST, if the node adjacent to node u in Ti′ is v, we can verify that V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from edges in path(u, x, Ti′) plus the set {[x, y]}. Otherwise, V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from edges in path(u, x, Ti′) plus the set {[u, v], [x, y]}. The following proof is similar to Case 1.

Case 3. (x, y) ∉ E(Ti′) and (x, y) ∈ E(Tj′). The proof is similar to Case 2.

Case 4. (x, y) ∉ E(Ti′) and (x, y) ∉ E(Tj′). By Algorithm NodeIST, if the node adjacent to node u in Ti′ is v, we can verify that V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from edges in path(u, x, Ti′) plus the set {[x, y]}. Otherwise, V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from edges in path(u, x, Ti′) plus the set {[u, v], [x, y]}.


If the node adjacent to node u in Tj′ is v, we can verify that V(path([u, v], [x, y], Tj)) equals the set of edge-nodes transformed from edges in path(u, x, Tj′) plus the set {[x, y]}. Otherwise, V(path([u, v], [x, y], Tj)) equals the set of edge-nodes transformed from edges in path(u, x, Tj′) plus the set {[u, v], [x, y]}. Since Ti′ and Tj′ are edge-independent, the nodes adjacent to node u in Ti′ and Tj′ are different. The following proof is similar to Case 1.

By Lemma 2, Ti and Tj are independent. As a result, the theorem holds.

Based on the n edge-independent spanning trees T1′, T2′, . . . , Tn′ rooted at u in the n-edge-connected network G′, where v is an arbitrary adjacent node of u in G′ with v > u, the n node-independent spanning trees T1, T2, . . . , Tn rooted at node [u, v] in G are constructed in parallel. Thus we have the following theorem.

Theorem 2. The set of node-independent spanning trees T1, T2, . . . , Tn obtained by Algorithm NodeIST can be obtained in O(N) time, where N is the number of nodes in G (or the number of edges in G′).

Based on the above discussion, we further present the following observations.

Observation 1. Algorithm NodeIST can be improved to obtain optimized node-ISTs. For example, in Fig. 2(e), if we let the node [1, 5] be adjacent to node [5, 7] in the third tree, then we can obtain another set of optimized node-ISTs with lower height.

Observation 2. Given n node-independent spanning trees in an n-node-connected network G′, we can also construct n node-independent spanning trees in the line graph of G′ based on Algorithm NodeIST.

Observation 3. It is also interesting to study another similar algorithm in the reverse direction based on Algorithm NodeIST.

4 Simulation of Node-ISTs on the Line Graphs of Hypercubes

As well-known interconnection networks, hypercubes have received much attention from researchers. In this section, we simulate the construction of node-ISTs on hypercubes based on Java and JUNG technology. The n-dimensional hypercube Qn is a graph consisting of 2^n nodes and n·2^(n−1) edges. Each node in Qn is represented by a binary string of length n, and any two nodes in Qn are adjacent whenever their corresponding strings differ in exactly one place. For example, Fig. 3 shows the four node-ISTs rooted at 0 in Q4 constructed by the algorithm in [30], the maximal height of which is 5. To simulate duplicate nodes with JUNG technology, which does not admit the same node twice in one canvas, the prefixes A, B, C, D are only used to distinguish nodes; for example, A0, B0, C0, D0 denote the same node 0. Since the hypercube is node-symmetric, the line graph of the hypercube is also node-symmetric. If the nodes in the 4-dimensional hypercube are 0, 1, . . . , 15, then there


Fig. 3. 4 edge-ISTs rooted at 0 on 4-dimensional hypercube.

are 32 edge-nodes in the line graph of the 4-dimensional hypercube. For simplicity, we use the numbers 1, 2, . . . , 32 to denote the edge-nodes; the corresponding relation is shown in Table 1, which will be used in the simulation program to show the node-ISTs. Similarly, the prefixes a, b, c, d are only used by the program to distinguish nodes; for example, a1, b1, c1, d1 denote the same node 1.

Table 1. Corresponding relations between numbers and edge-nodes.

1→[0, 1]     2→[0, 2]     3→[0, 4]     4→[0, 8]
5→[2, 3]     6→[1, 3]     7→[1, 5]     8→[3, 11]
9→[1, 9]     10→[2, 6]    11→[3, 7]    12→[6, 7]
13→[4, 6]    14→[6, 14]   15→[7, 15]   16→[5, 7]
17→[4, 5]    18→[2, 10]   19→[4, 12]   20→[5, 13]
21→[8, 12]   22→[12, 14]  23→[12, 13]  24→[14, 15]
25→[13, 15]  26→[10, 14]  27→[11, 15]  28→[9, 13]
29→[8, 10]   30→[10, 11]  31→[9, 11]   32→[8, 9]

351

The node-ISTs rooted at 1 (corresponding to the edge-node [0, 1] in the line graph of 4-dimensional hypercube) in the line graph of 4-dimensional hypercube based on Algorithm NodeIST are shown in Fig. 4, the height of the four trees are 4, 5, 5, 5, respectively. Take the internally node-disjoint paths between 1 and 25, the 4 paths are as follows:

Fig. 4. The node-ISTs on the line graph of 4-dimensional hypercube.

1→ 7→ 20→ 25 1→ 2→ 10→ 14→ 24→ 25 1→ 3→ 19→ 23→ 25 1→ 4→ 32→ 28→ 25 The paths denoted in edge-nodes are as follows: [0, 1]→ [1, 5]→ [5, 13]→ [13, 15] [0, 1]→ [0, 2]→ [2, 6]→ [6, 14]→ [14, 15]→ [13, 15] [0, 1]→ [0, 4]→ [4, 12]→ [12, 13]→ [13, 15] [0, 1]→ [0, 8]→ [8, 9]→ [9, 13]→ [13, 15] It is easy to verify that the paths between the edge-node [0, 1] and any other edge-node are also internally node-disjoint. The radial mode of the node-ISTs rooted at 1 are shown in Fig. 5. Here, the number of nodes deployed in the layers from the inside to the outside are 4, 9, 24, 39, 40, 12, respectively. Simulation results show that the maximal height of the node-ISTs rooted at any node in the line graph of n-dimensional hypercube is n+1. We have the following observation. Observation 4. The height of ISTs T1 and Ti in the line graph of n-dimensional hypercube are n and n + 1 for i = 2, 3, . . . , n, respectively, where n ≥ 3.

352

B. Cheng et al.

Fig. 5. The radial mode of node-ISTs.

Observing that all the height of the n optimal node-ISTs rooted at any node in n-dimensional hypercube Qn is n + 1 [30] and L(Qn ) contains more nodes than Qn for n ≥ 3, the set of node-ISTs rooted at any node in L(Qn ) have advantages in the height with respect to the number of nodes. If we abstract the interconnection network of severs in SWCube and BCDC, we obtain the line graph of generalized hypercube and crossed cube, respectively. Thus, we only need to construct independent spanning trees in the two networks. Let the input be the set of independent spanning trees from [26] and [31], we can use Algorithm NodeIST to construct independent spanning trees in the line graph of generalized hypercube and crossed cube, respectively.

5

Conclusions

In this paper, we have proved that if there are n edge-independent spanning trees rooted at an arbitrary node in the n-edge-connected network G , then there are n node-independent spanning trees rooted at an arbitrary node in the line graph of G . An algorithm to construct node-ISTs in G based on the node/edgeISTs in G is also presented. Some simulations of independent spanning trees on

ISTs in the Line Graphs of Interconnection Networks

353

the line graphs of hypercubes were presented and we also pointed out that the algorithm proposed in this paper can be used to construct independent spanning trees on SWCube and BCDC data center networks. It is still interesting to prove that either the node conjecture implies the edge conjecture, or vice versa. Acknowledgment. This work is supported by National Natural Science Foundation of China (No. 61572337, No. 61502328, and No. 61602333), China Postdoctoral Science Foundation Funded Project (No. 2015M581858), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 18KJA520009), the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1501089B and No. 1701173B), Opening Foundation of Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks (No. WSNLBKF201701), and Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX17 2005 and No. KYCX18 2510).

References 1. Bonomom, F., Dur´ an, G., Safe, M.D., Wagler, A.K.: Clique-perfectness of complements of line graphs. Discret. Appl. Math. 186(1), 19–44 (2015) 2. Bao, F., Funyu, Y., Hamada, Y., Igarashi, Y.: Reliable broadcasting and secure distributing in channel networks. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E81–A, 796–806 (1998) ¨ 3. Bao, F., Igarashi, Y., Ohring, S.R.: Reliable broadcasting in product networks. Discret. Appl. Math. 83(1–3), 3–20 (1998) 4. Chen, Y.-S., Chiang, C.-Y., Chen, C.-Y.: Multi-node broadcasting in all-ported 3-D wormhole-routed torus using an aggregation-then-distribution strategy. J. Syst. Arch. 50(9), 575–589 (2004) 5. Cheng, B., Fan, J., Jia, X., Zhang, S.: Independent spanning trees in crossed cubes. Inf. Sci. 233(1), 276–289 (2013) 6. Cheng, B., Fan, J., Jia, X., Wang, J.: Dimension-adjacent trees and parallel construction of independent spanning trees on crossed cubes. J. Parallel Distrib. Comput. 73, 641–652 (2013) 7. Cheng, B., Fan, J., Lyu, Q., Zhou, J., Liu, Z.: Constructing independent spanning trees with height n on the n-dimensional crossed cube. Futur. Gener. Comput. Syst. 87, 404–415 (2018) 8. Cheng, B., Fan, J., Jia, X., Jia, J.: Parallel construction of independent spanning trees and an application in diagnosis on M¨ obius cubes. J. Supercomput. 65(3), 1279–1301 (2013) 9. Cheriyan, J., Maheshwari, S.N.: Finding nonseparating induced cycles and independent spanning trees in 3-connected graphs. J. Algorithms 9(4), 507–537 (1988) 10. Curran, S., Lee, O., Yu, X.: Finding four independent trees. SIAM J. Comput. 35(5), 1023–1058 (2006) 11. Dong, F., Yan, W.: Expression for the number of spanning trees of line graphs of arbitrary connected graphs. J. Graph Theory 85(1), 74–93 (2017) 12. Gopalan, A., Ramasubramanian, S.: A counterexample for the proof of implication conjecture on independent spanning trees. Inf. Process. Lett. 113(14–16), 522–526 (2013) 13. Gopalan, A., Ramasubramanian, S.: On constructing three edge independent spanning trees. SIAM J. Comput. (2011, submitted)

354

B. Cheng et al.

14. Hasunuma, T.: Structural properties of subdivided-line graphs. J. Discret. Algorithms 31, 69–86 (2015) 15. Harvey, D.J., Wood, D.R.: Treewidth of the line graph of a complete graph. J. Graph Theory 79(1), 48–54 (2015) 16. Hoyer, A., Thomas, R.: Four edge-independent spanning tree. SIAM J. Discret. Math. 32(1), 233–248 (2018) 17. Huck, A.: Independent trees in planar graphs. Graphs Comb. 15(1), 29–77 (1999) 18. Hussain, Z., AlBdaiwi, B., Cerny, A.: Node-independent spanning trees in Gaussian networks. J. Parallel Distrib. Comput. 109, 324–332 (2017) 19. Li, D., Wu, J.: On data center network architectures for interconnecting dual-port servers. IEEE Trans. Comput. 64(11), 3210–3222 (2015) 20. Itai, A., Rodeh, M.: The multi-tree approach to reliability in distributed networks. Inf. Comput. 79(1), 43–59 (1988) 21. Khuller, S., Schieber, B.: On independent spanning trees. Inf. Process. Lett. 42(6), 321–323 (1992) 22. Kim, J.-S., Lee, H.-O., Cheng, E., Lipt´ ak, L.: Independent spanning trees on even networks. Inf. Sci. 181(13), 2892–2905 (2011) 23. Kim, J.-S., Lee, H.-O., Cheng, E., Lipt´ ak, L.: Optimal independent spanning trees on odd graphs. J. Supercomput. 56(2), 212–225 (2011) 24. Li, H., He, W., Yang, W., Bai, Y.: A note on edge-disjoint Hamilton cycles in line graphs. Graphs Comb. 32, 741–744 (2016) 25. Liu, Y.-J., Chou, W.Y., Lan, J.K., Chen, C.: Constructing independent spanning trees for locally twisted cubes. Theor. Comput. Sci. 412(22), 2237–2252 (2011) 26. Obokata, K., Iwasaki, Y., Bao, F., Igarashi, Y.: Independent spanning trees of product graphs and their construction. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E79–A(11), 1894–1903 (1996) 27. Su, G., Xu, L.: Topological indices of the line graph of subdivision graphs and their Schur-bounds. Appl. Math. Comput. 253, 395–401 (2015) 28. Tian, T., Xiong, L.: Traceability on 2-connected line graphs. Appl. Math. Comput. 321, 1339–1351 (2018) 29. Tseng, Y.-C., Wang, S.-Y., Ho, C.-W.: Efficient broadcasting in wormhole-routed multicomputers: a network-partitioning approach. IEEE Trans. Parallel Distrib. Syst. 10(1), 44–61 (1999) 30. Tang, S.-M., Wang, Y.-L., Leu, Y.-H.: Optimal independent spanning trees on hypercubes. J. Inf. Sci. Eng. 20(1), 143–155 (2004) 31. Wang, X., Fan, J., Lin, C.-K., Zhou, J., Liu, Z.: BCDC: a high-performance, servercentric data center network. J. Comput. Sci. Technol. 33(2), 400–416 (2018) 32. Yang, J.-S., Tang, S.-M., Chang, J.-M., Wang, Y.-L.: Parallel construction of optimal independent spanning trees on hypercubes. Parallel Comput. 33(1), 73–79 (2007) 33. Zehavi, A., Itai, A.: Three tree-paths. J. Graph Theory 13(2), 175–188 (1989)

POEM: Pricing Longer for Edge Computing in the Device Cloud

Qiankun Yu, Jigang Wu, and Long Chen

Guangdong University of Technology, Guangzhou 510006, China
[email protected], [email protected], [email protected]

Abstract. Multiple access mobile edge computing has been proposed as a promising technology that brings computation services close to end users by making good use of edge cloud servers. In mobile device clouds (MDC), idle end devices may act as edge servers and offer computation services to busy end devices. Most existing auction-based incentive mechanisms in MDC focus on a single auction round and ignore the time correlation between rounds. Although a single-round auction can be repeated, users then have to bid higher to obtain resources in the cascading rounds, so their budgets run out too early to participate in later auctions, leading to auction failures and a loss of overall benefit. In this paper, we formulate the computation offloading problem as a social welfare optimization problem under given budgets of mobile devices, and price over the long term. The problem is a multiple-choice multi-dimensional 0-1 knapsack problem, which is NP-hard. We propose a long-term auction framework named MAFL that runs a single-round resource auction in each round. Extensive simulation results show that the proposed mechanism outperforms the single-round auction by about 55.6% in revenue on average.

Keywords: Edge computing · Computation offloading · Multiple rounds · Mobile device cloud · Long-term · Auction

1 Introduction

In the past few years, despite the increasing capabilities of mobile devices, including smart phones, Internet of Things (IoT) devices, and wearable devices, the resource requirements of mobile applications often exceed what a single device can provide [1–4]. Mobile cloud computing has therefore been proposed to offload tasks to a remote cloud for execution [5–9], though it may introduce longer delays that hurt the user experience, and long-distance communication consumes more energy. In recent work, multiple access mobile edge computing has been proposed as a promising technology that brings computation services close to end users by making good use of edge cloud servers. There are three types of architecture used in edge computing [10]: edge server, coordinator device, and device cloud.
This paper uses the third architecture. Computation offloading [11,12] can be performed in Mobile Device Clouds (MDC) [13–16], which use the idle resources of nearby mobile devices to execute tasks. However, mobile devices that provide idle resources also incur extra cost, which should be compensated monetarily. To encourage more devices to share their idle resources, several prior works have been done in MDC. Miluzzo et al. [17] proposed an incentive scheme in MDC, but the scheme ignored the resource requirements of tasks. Song et al. [18] designed a non-competitive pricing mechanism with a bill backlog threshold: if a device exceeds the threshold, it can reduce its bill backlog by providing services for others; otherwise it cannot get the service. However, they do not consider whether the device has sufficient resources to serve others. Wang et al. [19] proposed a Stackelberg game approach for cooperative application execution in mobile cloud computing, but they do not consider that mobile devices are heterogeneous: different devices have different processing power and energy consumption levels, so the payments for tasks should differ. In recent studies, auctions have been widely used as one of the most popular incentive schemes in many areas, such as virtual machine allocation [20,21] and wireless spectrum allocation [22,23]. The celebrated VCG mechanism [24] is a well-known type of auction; it is essentially the only one that simultaneously guarantees truthfulness and absolute economic efficiency. Li et al. [25] proposed an online spectrum auction framework that could also be used for MDC resource allocation, but buyers' budget constraints are not considered. Jin et al. [26] designed an incentive-compatible auction mechanism for cloudlet resource sharing in mobile cloud computing; however, it uses a one-to-one match and assumes homogeneous resource requirements. In this work, a seller can serve multiple buyers and the resource requirements of buyers are heterogeneous. Wang et al. [27] designed an efficient auction mechanism for the task assignment problem in MDC, but it assumes that every buyer must be allocated resources. In this work, resources are limited and we cannot ensure that every buyer receives resources. Existing auction mechanisms in MDC focus only on a single-round auction [26,27]. In many cases, multiple rounds of auctions are needed. Although an existing single-round auction can be run repeatedly, the users' budget constraints must then be considered. The budget is the total amount of money a buyer can pay, and it plays a key role in designing periodical auctions. Long-term and budget constraints have been considered in crowdsourcing [28], where tasks are homogeneous and a task can be allocated to multiple workers; however, the authors of [28] assumed unlimited resources at the workers. In a resource-limited MDC, idle mobile devices are unlikely to meet the needs of all users at the same time, so that scheme cannot be used directly in MDC, which motivates us to design a long-term, multi-round auction with budget constraints. To design effective schemes, the following challenges must be properly handled: (1) How to prevent a user's budget from running out prematurely across multiple auction rounds?
(2) How to efficiently allocate resources for the different bids of different devices? (3) How to attract more sellers to participate in the MDC? To address these challenges, in this paper we consider Pricing lOnger for Edge coMputing in the device cloud (POEM) and design a long-term, multi-round auction. Its main features are as follows: (1) mobile tasks are indivisible and the amounts of resources (CPU, memory, battery, etc.) required by different tasks differ; (2) neither the amount of resources requested by each user nor the amount provided by nearby mobile devices is fixed across rounds; (3) a winning user's bid is reduced in the next round according to its remaining budget. The main contributions of this paper are as follows:

• Considering the time correlation of resource allocation, we formulate the task offloading problem as an integer linear program and design an MDC Auction Framework for Long-term (MAFL), in which the genuine bids of the next round are adjusted according to the results of the previous round.
• We design a Single Round Mobile Resources Auction (SRMRA) algorithm for comparison with MAFL, and we demonstrate its properties by proofs and extensive experiments.
• We conduct extensive simulations to evaluate our mechanism. MAFL outperforms the single-round auction SRMRA: by about 12.2% in revenue when the number of users is 40 and 80 auction rounds are performed, and by about 55.6% in revenue on average as the number of users varies from 10 to 80 with 80 auction rounds.

The rest of the paper is organized as follows. Section 2 describes the system model and problem formulation. The auction mechanism for single-round MDC resource allocation is designed in Sect. 3. Section 4 proposes a long-term auction framework for the MDC. Section 5 presents the simulation results. Finally, we conclude the paper in Sect. 6.

2 Problem Definition

2.1 System Model

We assume that the total time of the whole auction period is T (T is a long time) [21], and divide T into multiple time slots. One round of auction is performed in each time slot l ∈ L, where L = {1, 2, 3, · · · , L} and L is the total number of auction rounds. There are U users in the MDC; each user u ∈ U needs some resources (CPU, memory, battery, etc.) to perform its indivisible task, where U = {1, 2, 3, · · · , U}. There are M sellers in the MDC; each seller m ∈ M can share its resources with others, where M = {1, 2, 3, · · · , M}. Let r_u^{(l)} be the amount of resources requested by user u and R_m^{(l)} be the amount of resources provided by seller m in the l-th round. The U users are the bidders in the auctions; each user submits its valuation V_u^{(l)} = {v_{u,1}^{(l)}, v_{u,2}^{(l)}, · · · , v_{u,M}^{(l)}} in round l, where v_{u,m}^{(l)} ∈ V_u^{(l)} denotes the valuation of buyer u for seller m. Moreover, as each user is budget constrained, we use B_u to denote user u's total budget over all rounds. Sellers cannot provide unlimited resources either, so we use W_m to denote the total amount of resources provided by seller m over all rounds. Specifically, the resource allocation is determined by Y_u^{(l)} = {y_{u,1}^{(l)}, y_{u,2}^{(l)}, · · · , y_{u,M}^{(l)}}, where y_{u,m}^{(l)} ∈ {0, 1} is a binary indicator whose value is 1 if user u's task is performed on seller m in round l and 0 otherwise. We list the basic notations used in this paper in Table 1.

Table 1. Basic notations

Notation        Description
T               The total time of the whole auction period
L               The total number of auction rounds
U, M            The total number of users and sellers
L               The set {1, 2, 3, · · · , L}
U, M            The sets of users and sellers
r_u^{(l)}       The amount of resources requested by user u in the l-th round
R_m^{(l)}       The amount of resources provided by seller m in the l-th round
V_u^{(l)}       The set of u's valuations in the l-th round
Y_u^{(l)}       The set of u's indicators in the l-th round
v_{u,m}^{(l)}   The user u's valuation for seller m in the l-th round
y_{u,m}^{(l)}   Whether user u wins the resources provided by seller m in the l-th round or not
B_u             The user u's total budget
W_m             The total number of resources provided by seller m

2.2 Problem Formulation

The objective of the MDC (mobile device cloud) resource allocation problem is to maximize the users' accepted bids. Over the whole auction period, the higher the total price of the accepted bids, the more compensation the providing devices receive, and the more people will be attracted to share the idle resources of their devices. We formalize our objective as follows:

OPT-1  obj: max Σ_{l∈L} Σ_{u∈U} Σ_{m∈M} v_{u,m}^{(l)} y_{u,m}^{(l)}    (1)

subject to:

Σ_{m∈M} y_{u,m}^{(l)} ≤ 1,  ∀u ∈ U, ∀l ∈ L    (1-1)

Σ_{u∈U} r_u^{(l)} y_{u,m}^{(l)} ≤ R_m^{(l)},  ∀m ∈ M, ∀l ∈ L    (1-2)

Σ_{m∈M} Σ_{l∈L} v_{u,m}^{(l)} y_{u,m}^{(l)} ≤ B_u,  ∀u ∈ U    (1-3)

Σ_{l∈L} Σ_{u∈U} r_u^{(l)} y_{u,m}^{(l)} ≤ W_m,  ∀m ∈ M    (1-4)

y_{u,m}^{(l)} ∈ {0, 1},  ∀u ∈ U, ∀m ∈ M, ∀l ∈ L    (1-5)

Constraint (1-1) means that a user's task can be performed on at most one device. Constraint (1-2) ensures that the resources a device can provide in each round are limited, so the allocation cannot exceed what the device offers. Constraint (1-3) makes sure that a user's accepted bids cannot exceed its budget over the whole auction period T. Constraint (1-4) indicates that the amount of resources provided by each seller over the whole auction period T is limited.

Theorem 1. The social welfare optimization problem (OPT-1) is NP-hard.

Proof. The multiple-choice multi-dimensional knapsack problem is NP-hard [29]. In OPT-1, the amount of resources that each seller can provide in each auction round corresponds to the capacity of a knapsack, the resource requirement of each user corresponds to the weight of an item, and each user is either allocated resources or not. Hence OPT-1 is a special case of the multiple-choice multi-dimensional 0-1 knapsack problem, which is NP-hard.

We ignore the indicator variable constraint (1-5) temporarily, and introduce dual variable vectors α, β, η and χ. We then obtain the dual problem of OPT-1:

OPT-2  obj: min Σ_{u∈U} B_u α_u + Σ_{u∈U} Σ_{l∈L} β_u^{(l)} + Σ_{m∈M} Σ_{l∈L} R_m^{(l)} η_m^{(l)} + Σ_{m∈M} W_m χ_m    (2)

subject to:

v_{u,m}^{(l)} α_u + β_u^{(l)} + r_u^{(l)} η_m^{(l)} + r_u^{(l)} χ_m ≥ v_{u,m}^{(l)},  ∀u ∈ U, ∀m ∈ M, ∀l ∈ L    (2-1)

α_u, β_u^{(l)}, η_m^{(l)}, χ_m ∈ [0, 1],  ∀u ∈ U, ∀m ∈ M, ∀l ∈ L    (2-2)
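Although the paper tackles OPT-1 with the SRMRA/MAFL mechanisms rather than an exact solver, it may help to see the formulation written out for a generic ILP solver. The sketch below is only an illustration: PuLP, the variable names and the toy instance data are our assumptions, not part of the paper.

# Hypothetical sketch: OPT-1 written as an integer linear program with PuLP.
# The instance data (U, M, L, bids, budgets, capacities) below is made up.
import pulp

U, M, L = 3, 2, 2                                              # users, sellers, rounds
v = {(u, m, l): 10 + u + m + l for u in range(U) for m in range(M) for l in range(L)}  # v_{u,m}^{(l)}
r = {(u, l): 2 + u for u in range(U) for l in range(L)}        # resource demand r_u^{(l)}
R = {(m, l): 6 for m in range(M) for l in range(L)}            # per-round capacity R_m^{(l)}
B = {u: 30 for u in range(U)}                                  # budgets B_u
W = {m: 10 for m in range(M)}                                  # total capacity W_m

prob = pulp.LpProblem("OPT1", pulp.LpMaximize)
y = {k: pulp.LpVariable(f"y_{k[0]}_{k[1]}_{k[2]}", cat="Binary") for k in v}

prob += pulp.lpSum(v[k] * y[k] for k in v)                     # objective (1)
for u in range(U):
    for l in range(L):                                         # (1-1): at most one seller per user per round
        prob += pulp.lpSum(y[(u, m, l)] for m in range(M)) <= 1
for m in range(M):
    for l in range(L):                                         # (1-2): per-round seller capacity
        prob += pulp.lpSum(r[(u, l)] * y[(u, m, l)] for u in range(U)) <= R[(m, l)]
for u in range(U):                                             # (1-3): budget over the whole period
    prob += pulp.lpSum(v[(u, m, l)] * y[(u, m, l)] for m in range(M) for l in range(L)) <= B[u]
for m in range(M):                                             # (1-4): total resources a seller can contribute
    prob += pulp.lpSum(r[(u, l)] * y[(u, m, l)] for u in range(U) for l in range(L)) <= W[m]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("welfare:", pulp.value(prob.objective))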


During the whole auction period T we do not know the information of every round in advance, i.e., the users' resource demands and corresponding bids, or the amount of resources provided by sellers; the auction is carried out round after round over time. We therefore consider only the current bids and resources in each round. To prevent users from running out of budget too early, we adjust each user's bid according to its remaining budget in each round. To this end we introduce an auxiliary variable α_u^{(l)} for each user u ∈ U, where α_u^{(l)} ∈ [0, 1], and let v̄_{u,m}^{(l)} = v_{u,m}^{(l)} (1 − α_u^{(l−1)}) denote the virtual valuation. Now, we give the following formulation:

OPT-3  obj: max Σ_{u∈U} Σ_{m∈M} v̄_{u,m}^{(l)} y_{u,m}^{(l)}    (3)

subject to:

Σ_{m∈M} y_{u,m}^{(l)} ≤ 1,  ∀u ∈ U    (3-1)

Σ_{u∈U} r_u^{(l)} y_{u,m}^{(l)} ≤ R_m^{(l)},  ∀m ∈ M    (3-2)

y_{u,m}^{(l)} ∈ {0, 1},  ∀u ∈ U, ∀m ∈ M    (3-3)

We ignore the indicator variable constraint (3-3) temporarily, and adopt the same dual variables as in the dual of OPT-1. We then obtain the dual problem of OPT-3:

OPT-4  obj: min Σ_{u∈U} β_u^{(l)} + Σ_{m∈M} R_m^{(l)} η_m^{(l)}    (4)

subject to:

β_u^{(l)} + r_u^{(l)} η_m^{(l)} ≥ v̄_{u,m}^{(l)},  ∀u ∈ U, ∀m ∈ M    (4-1)

β_u^{(l)}, η_m^{(l)} ∈ [0, 1],  ∀u ∈ U, ∀m ∈ M    (4-2)

3 Single Round Resources Auction Design in MDC

In this section, we present the design of the Single Round Mobile Resources Auction (SRMRA) and prove that SRMRA is truthful and individually rational. The detailed description of SRMRA in round l is shown in Algorithm 1. The auctioneer collects the users' resource demand information and the resource information shared by the sellers. The U users are the bidders; each submits a bid containing M valuations, one per seller, in round l. Users or sellers may join or leave during the auction period; in that case the corresponding entries in the bids default to 0. We use Q to denote the set of winners. We choose the user u with the largest bid density v_{u,m}^{(l)}/r_u^{(l)}, i.e., the algorithm selects users according to the bid and the amount of requested resources, always preferring a high bid on few resources. However, the resources provided by a seller are limited: an allocation cannot exceed the amount of resources shared by the seller in round l (line 13), and the amount of resources provided by a seller over the whole auction period is also limited (lines 3 and 14). Then, we use the VCG price mechanism. Let p_u^{(l)} denote the price finally paid by user u in round l. Let S_{−u}^{(l)} denote the social welfare achieved when winner u is excluded from the auction, and S_u^{(l)} the social welfare achieved by the other bidders when u participates, in round l. The payment of winner u is p_u^{(l)} = S_{−u}^{(l)} − S_u^{(l)}.

Algorithm 1. (SRMRA): Single Round Mobile Resources Auction
1: for m = 1, 2, 3, · · · , M do
2:   Collect the amount of shared resources from seller m.
3:   if W_m < R_m^{(l)} then
4:     R_m^{(l)} = W_m;
5:   end if
6: end for
7: for u = 1, 2, 3, · · · , U do
8:   Collect bid V_u^{(l)} = {v_{u,1}^{(l)}, v_{u,2}^{(l)}, · · · , v_{u,M}^{(l)}} and the requested resource quantity r_u^{(l)} from user u.
9: end for
10: Q = ∅;
11: for all u ∉ Q do
12:   {u, m} = arg max { v_{u,m}^{(l)}/r_u^{(l)} }, u ∈ U, m ∈ M;
13:   if r_u^{(l)} ≤ R_m^{(l)} then
14:     Q = Q ∪ {u}; R_m^{(l)} = R_m^{(l)} − r_u^{(l)}; W_m = W_m − r_u^{(l)}; (update W_m)
15:   end if
16: end for
17: for all u ∈ Q do
18:   Execute lines 10 to 16 of the algorithm again with user u excluded.
19:   p_u^{(l)} = S_{−u}^{(l)} − S_u^{(l)};
20: end for
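A minimal Python sketch of Algorithm 1 is given below, assuming in-memory dictionaries for bids, demands and capacities; the data structures and the tie-breaking are our own choices, and the payment follows our reading of p_u^{(l)} = S_{−u}^{(l)} − S_u^{(l)}, not the paper's code.

# Hypothetical sketch of SRMRA: greedy allocation by bid density v_{u,m}/r_u,
# followed by a VCG-style payment p_u = S_{-u} - S_u.
def allocate(bids, demand, capacity):
    """bids[u][m] is user u's bid to seller m, demand[u] is r_u,
    capacity[m] is the seller's remaining resources."""
    cap = dict(capacity)
    winners = {}                                   # u -> (m, bid)
    pairs = sorted(((b / demand[u], u, m, b)
                    for u, row in bids.items() for m, b in row.items() if b > 0),
                   reverse=True)                   # highest bid density first
    for _, u, m, b in pairs:
        if u not in winners and demand[u] <= cap[m]:
            winners[u] = (m, b)
            cap[m] -= demand[u]
    return winners

def srmra(bids, demand, capacity):
    winners = allocate(bids, demand, capacity)
    welfare = sum(b for _, b in winners.values())
    payments = {}
    for u in winners:
        others = {x: row for x, row in bids.items() if x != u}
        s_minus_u = sum(b for _, b in allocate(others, demand, capacity).values())
        s_u = welfare - winners[u][1]              # welfare of the others when u participates
        payments[u] = max(0, s_minus_u - s_u)      # p_u = S_{-u} - S_u
    return winners, payments

# toy round: 3 users, 2 sellers
bids = {1: {"A": 8, "B": 6}, 2: {"A": 5}, 3: {"B": 9}}
demand = {1: 2, 2: 3, 3: 2}
capacity = {"A": 4, "B": 2}
print(srmra(bids, demand, capacity))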

Theorem 2. SRMRA is a truthful auction mechanism.

Proof. If the allocation algorithm is monotone and exact and the payment scheme charges each winner its critical value, then the mechanism is truthful [30]. From line 12 of SRMRA, it is clear that a user can increase its chance of winning by increasing its bid; therefore, the winner determination algorithm of SRMRA is monotone. Moreover, a winning bidder u pays the minimum amount it has to bid to obtain resources, i.e., its critical value. This is done by finding the losing bidder who would win if u did not participate in the auction: user u's bid density has to be at least equal to the bid density of that losing bidder in order to win its resources. Therefore, user u's critical valuation is v_{u,m}^{(l)}/r_u^{(l)}, which is exactly the payment calculated by SRMRA. Thus, we conclude that SRMRA is a truthful mechanism.

Theorem 3. SRMRA is individually rational.

Proof. p_u^{(l)} is the critical value for winner u by the analysis in Theorem 2, so p_u^{(l)} ≤ v̄_{u,m}^{(l)}. Since v̄_{u,m}^{(l)} ≤ v_{u,m}^{(l)}, we conclude that p_u^{(l)} ≤ v_{u,m}^{(l)}. Hence SRMRA is individually rational.

4 MDC Auction Framework Design for Long-Term

In this section, we propose the MDC Auction Framework for Long-term (MAFL), which runs a Single Round Mobile Resources Auction (SRMRA) in each round, and we give a theoretical analysis of the approximation ratio of MAFL. Although an existing single-round auction can be run repeatedly, users then have to bid higher to obtain resources in the cascading rounds. On the one hand, some users keep bidding high prices, so other users cannot get resources; on the other hand, if a large portion of buyers run out of their budgets quickly, fewer users participate in the competition. As a result, the total revenue may drop significantly. Our main idea is therefore to design a long-term auction framework in which the budget constraint is handled carefully. In MAFL (Algorithm 2), we introduce an auxiliary variable α_u^{(l)} ∈ [0, 1] for each user u ∈ U. Its initial value is 0, and it increases as the user's remaining budget decreases. Then, in each round l, we use v̄_{u,m}^{(l)} = v_{u,m}^{(l)} (1 − α_u^{(l−1)}) as the virtual valuation for user u. After executing SRMRA, let Q be the set of winning users and adjust α_u^{(l)} for each user u ∈ Q. The detailed process is displayed in Algorithm 2.

Theorem 4. When algorithm MAFL terminates, constraint (1-1) of formulation (1) is satisfied, and each user exceeds its budget by at most a factor of 1 + ϕ, i.e., Σ_{m∈M} Σ_{l∈L} v_{u,m}^{(l)} y_{u,m}^{(l)} ≤ B_u (1 + ϕ), ∀u ∈ U, where ϕ = max_{u∈U, m∈M, l∈L} v_{u,m}^{(l)}/B_u.

Proof. α_u^{(l)} is the auxiliary variable introduced above. When user u gets the requested resources, α_u^{(l)} increases; α_u^{(l)} is the fraction of the budget that has been used at the end of round l in MAFL.


Algorithm 2. (MAFL): MDC Auction Framework for Long-term
1: α_u^{(0)} = 0, ∀u ∈ U;
2: for l = 1, 2, 3, · · · , L do
3:   v̄_{u,m}^{(l)} = v_{u,m}^{(l)} (1 − α_u^{(l−1)});
4:   Execute SRMRA; let Q be the set of winning users.
5:   for all u ∈ Q do
6:     if α_u^{(l−1)} + v_{u,m}^{(l)}/B_u < 1 then
7:       α_u^{(l)} = α_u^{(l−1)} + v_{u,m}^{(l)}/B_u;
8:     else
9:       α_u^{(l)} = 1;
10:    end if
11:  end for
12:  for all u ∉ Q do
13:    α_u^{(l)} = α_u^{(l−1)};
14:  end for
15: end for
16: α_u = α_u^{(L)}, ∀u ∈ U;

m∈M l∈L

=





(l) (l) vu,m yu,m

m∈M 1≤l0  B, if t = 1 ykt = yk(t−1) − fk(t−1)j , else 2 ≤ t ≤ 5

(8)

where K represents the total number of MCVs, and B and M are the battery capacity and the number of charging guns of each MCV, respectively. We assume that one MCV can meet the charging demand of 10 EVs and has 3 charging guns. We further assume that f_ktj is the amount of demand that can be satisfied this time by MCV k dispatched from its current position i to the target parking lot j; if f_ktj = 0, the MCV will not be scheduled. y_kt is the remaining power of MCV k at the start of time period t; by default, one MCV can charge no more than 3 EVs at the same time. E′_tj represents the current demand at grid point j in time period t, and the initial demand at j is E_tj. As shown in Eq. (9), after MCV k is scheduled to this point during this time period, the satisfied demand is f_ktj, and the distribution of UD in the current time period is reset accordingly:

E′_tj = E_tj − f_ktj    (9)

GSD and Scheduling Distance (GSDD): This strategy considers both the UD in each parking lot and the scheduling distance of the dispatch, because the movement of MCVs also consumes part of their power. For this strategy, the schedule for t = 1 is the same as that of GSD and satisfies Eq. (8). Starting from the second time period, the objective function is as follows:

max Σ_{k=1}^{K} Σ_{j=1}^{n} (f_ktj − μ_kt × rate)
s.t.:  f_ktj = min[M, y_kt, E′_tj] > 0,  2 ≤ t ≤ 5,
       0 < rate < 1,
       μ_kt ≥ 0    (10)

where μ_kt represents the distance traveled by MCV k in time period t, and rate is the weight of the distance in each dispatch. When rate = 0, this strategy is equivalent to the previous scheduling strategy.

Global Optimization Strategy Based on Demand and Scheduling Distance (GOSDD): Unlike the second scheduling strategy, this strategy considers the possible scheduling decisions over all time periods and then finds the optimal solution. We define the penalty cost as the sum of the weighted dispatch distance and the UD of all the dispatches in each time period. The specific objective function is as follows:

min Σ_{j=1}^{n} δ_tj + Σ_{k=1}^{K} μ_kt × rate
s.t.:  Σ_{k=1}^{K} f_ktj + δ_tj = E_tj,
       δ_tj ≥ 0,  1 ≤ t ≤ 5,
       0 ≤ f_ktj ≤ y_kt and f_ktj ≤ M × x_ktj,
       x_ktj ∈ {0, 1} and Σ_{j=1}^{n} x_ktj = 1    (11)

where Σ_{k=1}^{K} f_ktj and δ_tj represent the amounts of demand at grid point j during time period t that can and cannot be satisfied by all the dispatched MCVs, respectively. When x_ktj = 1, MCV k is scheduled to parking lot j during time period t.
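To make the per-slot decision concrete, the following sketch implements a GSDD-style greedy dispatch for one time slot: each MCV is sent to the grid point that maximizes f_ktj − rate × distance, with f_ktj computed as in Eq. (8) and the demand updated as in Eq. (9). The data layout, the Euclidean distance and all numbers are assumptions made for illustration, not the paper's implementation.

# Hypothetical sketch of a GSDD-style dispatch for one time slot.
import math

M_GUNS = 3           # charging guns per MCV (M in the paper)
RATE = 0.1           # weight of the scheduling distance

def dispatch_slot(mcvs, demand):
    """mcvs: {k: {"pos": (x, y), "energy": y_k}}; demand: {j: {"pos": (x, y), "ud": E_j}}.
    Returns {k: (j, f_ktj)} and mutates energy/demand, mimicking Eqs. (8) and (9)."""
    plan = {}
    for k, mcv in mcvs.items():
        best, best_score = None, 0.0
        for j, spot in demand.items():
            f = min(M_GUNS, mcv["energy"], spot["ud"])
            if f <= 0:
                continue
            dist = math.dist(mcv["pos"], spot["pos"])
            score = f - RATE * dist               # per-pair objective of Eq. (10)
            if score > best_score:
                best, best_score, best_f = j, score, f
        if best is not None:
            plan[k] = (best, best_f)
            mcvs[k]["energy"] -= best_f           # y_kt update, Eq. (8)
            demand[best]["ud"] -= best_f          # E'_tj update, Eq. (9)
    return plan

mcvs = {1: {"pos": (0, 0), "energy": 10}, 2: {"pos": (5, 5), "energy": 4}}
demand = {"lot_a": {"pos": (1, 1), "ud": 5}, "lot_b": {"pos": (6, 6), "ud": 2}}
print(dispatch_slot(mcvs, demand))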

5 Evaluation

Figure 11 shows the impact of different distance weights on the total number of services, where the ordinate is the total number of services in a day. The curve for rate = 0.05 is similar to the one for rate = 0, because the radius of the Sixth Ring Road is about 25 km. For rate = 0.1, the total number of services in a day reaches its upper limit when K = 150. Figure 12 shows the distribution of service times over the time periods for different numbers of MCVs when rate = 0.1. Driven by the demand, for all values of K, the number of services is lowest at t = 1 and reaches its maximum at t = 3. For K = 50, the least service occurs at t = 5, because most MCVs have reached their maximum number of services before the 5th time period. In addition, the distribution for K = 150 is similar to that for K = 200, which again shows that for rate = 0.1, 150 is already the upper limit of K, beyond which extra MCVs bring no additional improvement in service times. Figure 13 shows the utilization of different numbers of MCVs and the number of EVs they served in one day when rate = 0.1. In this figure, the blue and red curves represent the utilization rate of MCVs and the number of services, respectively. From the blue curve, we can see that when the number of MCVs is less than 75, the utilization of MCVs exceeds 95%; this is because the amount of UD far exceeds the service the MCVs can provide, so most MCVs have already run out of electricity before the end of the day. The number of unserved

Fig. 11. Scheduling distance weight

Fig. 12. MCVs number analysis

Fig. 13. Usage rate and service times

Fig. 14. Number of unmet EVs


Fig. 15. Analysis of scheduling results (rate = 0.1, K = 25)

Fig. 16. Analysis of scheduling results (rate = 0.1, K = 50)

Fig. 17. Analysis of scheduling results (rate = 0.1, K = 75)

EVs in each time period before and after deploying MCVs when rate = 0.1 is shown in Fig. 14. The blue and orange histograms indicate the number of unserved EVs in each time period for K = 0 and K = 150, respectively. We compare the results of the three scheduling strategies in Figs. 15, 16 and 17. Figure 15a compares the number of services in each time period. Before the start of the 5th time period, the demand satisfied by GSD is slightly higher than that of the other two strategies. At t = 5, the number of services under all three strategies drops sharply, because most MCVs run out of power before the start of the 5th time period. Comparing Figs. 16a and 17a shows that as K increases, the number of services in each time period also increases, and the change is most significant at t = 5.


Figure 15b shows the scheduling distance of the MCVs from one time period to the next. The scheduling distance of GSD is significantly greater than that of the other two strategies. The total scheduling distance from t = 4 to t = 5 is about 0, which again shows that most MCVs have exhausted their service power by the end of t = 4. Similarly, comparing Figs. 16b and 17b, we conclude that as K increases, the dispatch distance in each time period also increases; GSD shows the largest increase, while the changes in GSDD and GOSDD are small. Figure 15c compares the total penalty costs of the different scheduling strategies in each time period, that is, the weighted sum of the unmet demand and the scheduling distance. Comparing with Figs. 16c and 17c, we see that as K increases, the gap between GSD and the other two strategies also increases. Moreover, no matter how large K is, the total penalty in each time period stays about the same: when K is relatively small, the dispatching distance is small, but the demand that can be satisfied is also limited; when K is larger, the total dispatching distance increases, but so does the number of services in each time period. Of course, as K increases, the average penalty per MCV becomes smaller and smaller. Comparing Figs. 15, 16 and 17, we find that GSD is clearly worse than the other two scheduling strategies when K is moderate. The overall scheduling effect of GSDD and GOSDD is similar, but the computational complexity of GOSDD is significantly higher because it has to find the optimal solution over all possible schedules. As K increases, the running time of GOSDD grows exponentially, which is not suitable for real-time dispatch, whereas GSDD has much lower complexity and is more suitable for real-time application.

6 Conclusion

MCVs have good mobility and scalability. They can be used not only to alleviate the pressure on charging stations and reduce users' waiting time, but also to provide guidance for the construction of future charging stations. For example, a place to which MCVs are often dispatched may be more suitable for building a new charging station, and the size of the station can be roughly estimated from the service times of the MCVs. From the comparison of the three scheduling strategies, the scheduling result of GSDD is similar to that of GOSDD; taking the computational complexity of the algorithms into account, GSDD is more applicable to actual scheduling. In future work, we will consider unexpected events, such as the breakdown of EVs caused by the depletion of electricity, in order to achieve multi-functional scheduling. On the other hand, we will also consider the impact of different driving paths (especially with traffic jams) on the scheduling of MCVs.

Acknowledgment. This work was partially funded by NSFC-61472384. We are particularly grateful for the cooperation and support from echarge.


SoProtector: Securing Native C/C++ Libraries for Mobile Applications

Ning Zhang¹, Guangquan Xu¹, Guozhu Meng², and Xi Zheng³

¹ Tianjin Key Laboratory of Advanced Networking (TANK), School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
[email protected]
² Nanyang Technological University, Singapore, Singapore
³ Department of Computing, Macquarie University, Sydney, Australia

Abstract. Java code is easy to decompile, and third-party SO files are frequently used by developers to improve development efficiency. Therefore, more and more core functions of Android applications are implemented in the native layer. However, there is neither comprehensive security research nor automated security analysis tooling for the Android native layer, especially for third-party SO files that are dynamically loaded by applications. To address this problem, we propose SoProtector, a novel and effective system that defends against privacy leaks by analyzing the data flow between two levels: the application layer and the native layer. In addition, SoProtector includes a real-time monitor to detect malicious functions in binary code. Our evaluation on 3400 applications demonstrates that SoProtector can detect more sources, sinks and taint propagation paths than most static analysis tools, and that it detects and effectively blocks more than 82% of applications that dynamically load malicious third-party SO files, with low performance overhead.

Keywords: Mobile security · Mobile privacy · Native C/C++ libraries · Android

1 Introduction

At present, privacy disclosure is still a serious problem in smartphone applications. Here are a few examples: (1) Facebook leaked the phone number from a mobile device before the user logged into the application [1]; (2) Angry Birds collected user data, which was found to be used by the NSA to profile users [2]; (3) out of 25,976 Android applications, 969 applications leaked location data and 347 recorded audio without the user's permission [3]. Along with privacy concerns there are security concerns as well. Malware constitutes the main medium for security attacks against mobile devices. It has recently been reported [4] that almost 60 percent of existing malware sends stealthy premium-rate SMS messages. Google Play, the official market for Android apps, has also hosted applications that were found to be malicious [5]. In the past few years, malware has increasingly relied on root exploits; some of the well-known malware families include DroidKungFu [6], GingerMaster [7] and DroidDream [8]. These exploits allow for the

escalation of privileges, which bypasses the security measures of the Android operating system and gives the malware unlimited access to the device, so that a payload can be downloaded and run to harvest user information. Most importantly, a new family of malware (Godless [9]) that uses a root exploit stored in a native library emerged in 2016. The exploit binary contains a series of vulnerabilities, including the Towelroot exploit (CVE-2014-3153) and the PingPong exploit (CVE-2015-3636). This alarming trend of malware using native library code is an important motivation for us to create a detection system that identifies malware containing such exploits [9]. For the detection of privacy leaks (although some detection frameworks exist [19–22]), the most important method is taint analysis, including static taint analysis and dynamic taint analysis. The main dynamic taint analysis tools are TaintDroid [10] and AppFence [11]; typical static taint analysis tools include FlowDroid [12] and AndroidLeaks [3]. However, static taint analysis tools cannot effectively handle the Android dynamic loading and reflection mechanisms, while dynamic taint analysis tools cannot generate the data flow graph of the C/C++ programs (which produce the SO files) on the native layer. Unfortunately, reference [13] points out that from 2010 to 2014, the proportion of malicious Android applications using dynamic loading and reflection increased from 43.87% to 78%, and that of non-malicious applications from 55% to 93%. The large number of applications using dynamic loading techniques makes it increasingly difficult for current taint analysis tools to effectively detect privacy leaks in Android applications [14]. From a security point of view, SO files (see Fig. 1) are binary code files. Reference [15] measured that 1,161 insecure code snippets posted on Stack Overflow were copied and pasted into 1,305,820 Android applications available on Google Play, demonstrating that insecure code snippets proliferate in the Android ecosystem and that the reuse rate of insecure code (Java and C/C++) is high. Inspired by this situation, given two binary functions, such as a known malicious function and an unknown function, we can detect whether they are similar [16]; this problem is known as binary code similarity detection. References [17, 18] used a graph matching algorithm to check whether the control flow graph representations of two functions are similar. Genius [19] learns high-level feature representations from the control flow graphs and encodes (i.e., embeds) the graphs into embeddings (i.e., high-dimensional numerical vectors). However, graph matching algorithms are slow, requiring super-linear runtime in the graph size, so such approaches are inevitably inefficient. In recent years, deep learning [20] has been applied to many application domains, including binary analysis [21], and has shown stronger results than other approaches. Reference [22] proposed a deep neural network-based approach to generate embeddings of binary functions for similarity detection. On the other hand, references [27, 28] presented a method to extract important byte sequences in malware samples by applying a convolutional neural network (CNN) to images converted from binary data. However, the related work listed above does not target the ARM platform.


Fig. 1. The SO libraries in Android APK

Contributions. In summary, this paper makes the following contributions.

• Based on FlowDroid, we developed an effective taint analysis tool for the data interaction between native C/C++ libraries and the Java API framework by modifying the Android source code, a problem not yet solved by traditional static and dynamic taint analysis methods. We also verified its effectiveness through experiments.
• For reversible SO files, we designed an automated tool that analyzes combined characteristics of the assembly code to detect whether they are malicious, and we evaluated its performance and effectiveness experimentally. For non-reversible SO files, we developed a new method that builds texture maps and combines image processing with machine learning to detect malicious variants.
• For third-party SO files invoked through the dynamic loading mechanism, we are the first to propose a real-time monitoring platform that uploads the SO files for online examination and monitors changes to them. By modifying the Android source code and combining it with dynamic taint analysis tools, the platform monitors the third-party SO files loaded by the tested apps in real time. The validity of the method is verified through experiments.
• We have created a malicious native program dataset, including the Android source programs and the malicious binary SO files.

Roadmap. The rest of this paper is organized as follows. Section 2 gives a motivating example and introduces background knowledge about Android dynamic loading. Section 3 gives an overview of SoProtector and illustrates its key techniques. Section 4 describes the approach step by step. Section 5 describes the experiments and gives the evaluation. Section 6 discusses the limitations of SoProtector and concludes this work.

2 Problem Statement

In this section, we investigate challenges in static analysis to analyze the SO files. We also give some background knowledge about some important mechanisms in the Android platform.

2.1 Background

Dynamic loading means that an Android application implements certain functions by loading executable files that are not present locally and can be replaced at run time; the Android NDK uses dynamic loading, for example loading SO libraries and calling their functions through JNI. SO libraries are generally compiled from C/C++ and run in the native layer. Because they are much more efficient than code running in the virtual machine layer, SO libraries are often chosen instead of native Java code for work with high performance requirements (such as T9 search or Bitmap decoding). In addition, since an SO library is compiled from C/C++ and decompiles only into assembly code (which is sometimes hard to understand), SO libraries can also be used for other purposes. For instance, the new malware family Godless uses a root exploit whose code is stored in a native library [9]. In general, SO libraries are packaged inside the app, but they can also be loaded from external storage.

2.2 Challenges

SO libraries are binary files consisting of raw bytes: if protection measures such as packing are applied, we may be unable to recover their assembly code. There is also no automated tool for this task, so the data interaction between the native and Java layers has to be analyzed manually, which is inefficient. In addition, dynamically loading third-party SO libraries is becoming more and more popular in application development, and third-party SO libraries do not need to be packaged directly in the APK (see Fig. 2).


Fig. 2. Third-party SO libraries can be updated from the Internet

While the program is running, the required SO files are loaded into a designated executable private directory. Because the SO files actually in use are not inside the application package, static analysis cannot effectively analyze the data flows that involve the native layer. More importantly, since the SO files can be updated at any time after the program runs and the user is never asked to reinstall the APK, if a malicious APK replaces a benign SO file with a malicious one after passing the installation-time security check, the change will not be monitored or handled by security protection software.


3 System Overview

In this section, we give an overview of the SoProtector framework, which consists of SoDetection and SoPlatform, and describe the key techniques applied in the framework. Figure 3 shows the overall architecture of SoProtector. To facilitate the following description, we define the following terms: (1) Source method: the method called from the native layer, denoted by Sf, as shown in Listing 1 (e.g., the JNITransmit method). (2) Source file: the C/C++ file where the source method lies, denoted by Sw. (3) Target method: the Java-layer method that is called from the native-layer function, denoted by Tf, as shown in Listing 3. (4) Target class: the class in which the target method lies, denoted by Tc. (5) JNI interaction method: the method invoked by the caller to implement the reflection mechanism, denoted by Jh (e.g., the GetMethodID and GetStaticMethodID methods). SoProtector consists of SoDetection and SoPlatform; SoDetection mainly consists of two parts, the dynamic execution module and the static analysis module, which we abbreviate as SoDetection-x and SoDetection-y, respectively.

Fig. 3. System overview of the SoProtector


Information Extraction. SoDetection first runs SoDetection-x on the computer, and the tested application is installed on the Android device. The Android system on the device is built from Android source code that has been modified and recompiled, so that SoDetection-x can record relevant information, such as dynamic loading events, in the log output. We describe the changes made to the Android system source code in the next section. After installing the application, SoDetection-x runs it and reads the system log output cyclically to obtain information about the application's dynamic loading and reflection invocations. When it captures dynamic loading behavior, SoDetection-x sends a download command via adb to the Android device to download the SO and .dex files to the local computer. When it captures reflection-calling behavior, SoDetection-x extracts from the log the corresponding source method Sf, JNI interaction method Jh and target method Tf, and stores them as a triplet in a file on the local computer called the SJT repository. The .dex files and the SJT repository are used for the subsequent static analysis, and the downloaded SO files are used in SoPlatform's work.

Data Analysis. When the dynamic analysis module finishes, SoDetection runs SoDetection-y, which is an improved version of FlowDroid. We add the .dex files and the SJT repository to SoDetection-y's static taint analysis process. SoDetection-y first loads the required Java class files of the APK and the .dex files into memory, and then translates them into Soot's three-address intermediate language, Jimple [20]. According to the SJT repository, SoDetection-y translates between reflection methods and source methods so that it can construct the correct function call graph.

Malicious Detection by SoPlatform. SoPlatform first calculates the hash value of the SO file and stores it. The malicious code image, the OpCode n-gram and the system calls are used as features. We use a DNN classifier, a decision tree and a random forest as the machine learning algorithms for classification, to judge whether an SO file is malicious by setting a threshold (the threshold is determined from a large number of test sets); if it is malicious, we also determine which malicious native family it belongs to. The training is accelerated with xgboost and pypy. Note that, to improve efficiency, if an app's SO file is replaced with a malicious one during an update, its hash value changes and the file is re-analyzed; if the hash value does not change, no analysis is performed.
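A possible shape of the SoDetection-x extraction loop is sketched below. Only the adb logcat and adb pull commands are standard; the log tags, record formats, paths and the uid value are placeholders that stand in for the output of the modified framework.

# Hypothetical sketch of the SoDetection-x loop: poll the device log, pull dynamically
# loaded files with adb, and collect (Sf, Jh, Tf) triplets into the SJT repository.
import subprocess, time

SJT_REPO = []                                     # list of (Sf, Jh, Tf) triplets

def adb(*args):
    return subprocess.run(["adb", *args], capture_output=True, text=True).stdout

def poll_once(app_uid, out_dir="downloads"):
    log = adb("logcat", "-d")                     # dump the current log buffer
    for line in log.splitlines():
        if f"uid={app_uid}" not in line:
            continue                              # keep only records of the tested app
        if "SO_LOADED path=" in line:             # hooked Runtime/mLibPaths record (assumed format)
            remote = line.split("SO_LOADED path=")[1].split()[0]
            adb("pull", remote, out_dir)          # download the .so / .dex for later analysis
        elif "SJT " in line:                      # hooked GetMethodID / invoke record (assumed format)
            parts = line.split("SJT ")[1].split(",")
            if len(parts) >= 3:
                SJT_REPO.append(tuple(p.strip() for p in parts[:3]))

if __name__ == "__main__":
    while True:                                   # read the log output cyclically
        poll_once(app_uid=10123)
        time.sleep(2)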

4 Implementation

In this section, we explain the details of the four stages of SoProtector's implementation.

4.1 Stage 1: Pre-processing

We modified the Android source code (version 4.1.2). The main changes include: (a) storing method parameters so that complete information is available in the method stack; (b) hooking the Runtime class and recording mLibPaths, so that when an external SO file is loaded, its name and address are recorded for SoProtector's next steps; (c) hooking the GetMethodID and GetStaticMethodID methods; (d) hooking some invoke methods; (e) hooking pNewEntry to obtain process information.

4.2 Stage 2: Disposal by SoDetection-X

SoDetection-x reads the phone's log information and filters out the log records of the tested app by its uid. When it reads a record of a dynamically loaded .dex file, the file is downloaded to a designated folder on the local computer, according to the file information recorded in the log, to be analyzed by SoDetection-y. When it captures the output of GetMethodID or GetStaticMethodID, or the reflection-call information output by some invoke methods, the corresponding source method Sf, JNI interaction method Jh and target method Tf are extracted and stored as a triplet in the SJT repository, to provide reflection-call information for SoDetection-y's taint analysis. Figure 4 illustrates how this information is extracted. Because the GetMethodID and GetStaticMethodID methods are called by upper-layer functions, there are two ways to capture and resolve these calls:


Fig. 4. The way to get the library


For SO files that can be reversed into ARM assembly and C code, we use four tools: GCC, addr2line, and the open-source tools Pvtrace and Dot (see Fig. 5). We instrument the native functions in which Tf lies using GCC's instrumentation functions, which generate a trace file named trace.txt. After using addr2line to translate the function addresses into function names, we obtain the function call graph with Dot; this graph can be turned into a native-layer data flow diagram and added to the FlowDroid graph. By analyzing the calling relationships, we can find the Sf and Jh that match Tf.
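The address-to-name step of this pipeline could be scripted roughly as follows; the per-line "caller callee" format of trace.txt is an assumption, while addr2line -f -C -e is the usual GNU binutils invocation.

# Hypothetical sketch: resolve the addresses recorded in trace.txt with addr2line and
# emit caller->callee edges for the native-layer call graph.
import subprocess

def resolve(so_path, addr):
    out = subprocess.run(["addr2line", "-f", "-C", "-e", so_path, addr],
                         capture_output=True, text=True).stdout
    return out.splitlines()[0] if out else addr        # first output line is the function name

def native_call_edges(so_path, trace_path="trace.txt"):
    edges = set()
    with open(trace_path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2:                         # assumed: "<caller_addr> <callee_addr>"
                caller, callee = (resolve(so_path, a) for a in parts[:2])
                edges.add((caller, callee))
    return edges

# The edge set can be dumped in DOT syntax and merged into the main FlowDroid
# call graph as a supplementary native-layer subgraph.
if __name__ == "__main__":
    for a, b in native_call_edges("libnative-lib.so"):
        print(f'"{a}" -> "{b}";')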


Fig. 5. The way to deal with the reversible SO files

For SO files that are irreversible or difficult to reverse, we trigger the native function with IDA Pro. We inspect the different segments of the SO file mapped into memory in the process table, locate the code segment with execute permission and find its memory base address, and from it obtain a series of DCB data. The header strings of the DCB data contain Sf and the Jh that matches it.

4.3 Stage 3: Disposal by SoDetection-Y

SoProtector: Securing Native C/C++ Libraries

425

Fig. 6. Algorithm flow chart of processing SJT library

containing the data interaction of the application layer and the native layer. The algorithm flow chart of processing SJT library is shown in Fig. 6. For a SJT mapping, SoDetection-y firstly determines the method is getxxxid or invoke method. Then SoDetection-y converts the target reflection parameter Sp to the target parameter Tp; Then, it determines whether the target reflection object is empty; if it is empty, it will define and assign the target object. For the getxxxid method, it mainly judges initialization parameters from source data as shown in Fig. 6. The taint track between native C/C++ libraries and Java API framework is shown in Fig. 7. Our tool can distinguish the flow of private data between the application and native layers.

426

N. Zhang et al. Sw … String deviceID=hook.get DeviceId(); JNITransmit(deviceI D); ...

Sf

Tc Tf

public void sendFaker1(String number,String content) … SmsManager sms = SmsManager.getDefault();s ms.sendTextMessage(numbe r,null,content,null,null);} ...

Jh

Function(LoadLibrary and Load)can help us find Jh

..._JNITransmit(JNIEnv*env,j object thiz,jstring string1) {… jstring string2 = (*env)>NewStringUTF(env, "11111"); jmethodID sendFaker1 = (*env)->GetMethodID(env, class, "sendFaker1", "..."); (*env)->CallVoidMethod(env, object, sendFaker1, string2,string1); …}

Function(FindClass and GetmethodId) can help us find Tc and Tf

In Native layer

The trajectory track of the privacy data

Fig. 7. The taint track between native C/C++ libraries and Java API framework.

4.4

Stage 4: Processing by SoPlatform

SoPlatform is deployed to a remote server to detect the SO files and upload them to the server. If the SO file is uploaded for the first time, the server firstly calculates and stores the hash value of the SO file. The malicious code image, the OpCode n-gram, and the system call are used as the features. We used the DNN classifier, the Decision Tree and the Random Forest as the machine learning algorithms for classification (to judge this SO files is or not the malicious files by the setting threshold and if it was the malicious files we need to know which malicious native family it belongs to). Notes: In order to improve the efficiency: if the app was replaced with a malicious SO file during the updating time, the hash value of it would be changed and reanalyzed. If the hash value did not change, the analysis would not be performed. SO file is a binary file and the difficulty is how to get the malicious behavior characteristics of the malicious code. We can use heuristic scanning with unknown binary code detection. Heuristics is a static detection method which do not actually run the binary file with the highest efficiency (see the experimental part). Notes: SO files will have different segment information mapped to memory (including the data segment and the code segment, the process table will see a number of SO sub-paragraph), we need to find the code segment with executive SO segment. Next, we describe the three major features selected in detail: • Feature 1: Presenting a binary file as a gray scale image, using the texture features in the image to determine the maliciousness of the binary For a binary file, each byte ranges between 00 * FF, just corresponding to gray scale 0 * 255 (0 is black, 255 is white). Converting a binary file into a matrix (each byte in the matrix element corresponds to the size of the matrix, which can be adjusted according to the actual situation), the matrix can be easily converted into a gray scale. Specific implementation (by python):

SoProtector: Securing Native C/C++ Libraries

427

(1) We used hexlify function to transform a binary file into a hexadecimal string; (2) By byte segmentation, we used reshape function to create the rectangle according to the width set; (3) we used fromarray function to convert this rectangle into an image. The same family of malicious code images in the texture exists a certain similarity and different malicious code family is different. Using the GIST feature technology of computer vision, Using the GIST feature technology of computer vision, a fivedimensional perception dimension (vector) is used to describe the image. That is, an image is input and the corresponding GIST descriptor is output. After getting these vectors, classification training of machine learning algorithm can be done. • Feature 2: Opcode Sequence Frequency of Appearance The code of the SO file is reversely obtained by using the ARM instruction set. The opcode sequence is obtained by using the python. The sequence is processed according to the length of the sub-sequence as n (n is 1, 2, 3), and then we calculate the TF result of each opcode sequence. The vector S = (D1, D2,… Dn) consisting of the opcode sequence frequency is obtained. We combined two values above to the weighted vector V = (wtf1, wtf2,… wtfn). Also we calculated the vector V1 for the malicious SO files to be tested and the vector Vm+1 cosine similarity of m different kinds of malicious samples respectively. • Feature 3: Sequences of System API calls With the use of IDAPro, each file is disassembled into the assembly language and a gdl file that contains the assembly code will be generated. Since IDAPro can disassemble binary files into a basic block of assembly code, the gdl file will capture this valuable information. The system call (see Fig. 8) will be recorded into an output format of text file as an input to feed into a machine learning algorithm. The output files are used to model the behavior of the binary or native code that are in both malware and benign application. Then we will use machine learning algorithm such as random forest tree classifier for classification and detection of malware.

Fig. 8. Statistics of system calls for the AnserverBot malware


5 Empirical Study
In this section, we first present our empirical settings and then our evaluation results.
5.1 Empirical Settings

Dataset: We crawled 3000 apps from the Wandoujia Store [24] (covering its 15 pre-classified categories) and 400 apps with a native layer from VirusShare [25] (with malware spanning from 2013 to 2017), Genome (apps divided into 49 malware families), Contagio Mobile [26] (we tested SoProtector against 13 malware families from the Contagio database) and reference [29]. For each category in the Wandoujia store, we downloaded the top 200 apps. Apart from some connection errors that occurred during crawling, we collected 3400 apps in total as our dataset. This dataset is used for both model training and the evaluation of SoProtector. Note: the malware native families include (1) ADRD, (2) AnserverBot, (3) BaseBridge, (4) Geinimi, (5) Asroot, (6) BeanBot, (7) Bgserv, (8) DroidKungFu1, (9) DroidKungFu2, (10) DroidKungFu3, and (11) DroidKungFu4.
Environment: The main hardware devices for the experiment were a Samsung S6 mobile phone (4-core processor) and an ASUS computer (8-core CPU with 8 GB of memory). We built the modified Android system into a ROM and flashed it onto the Samsung phone; all APKs tested by SoProtector ran on this Android system. This computer was used to run the main program of SoPlatform.
5.2 Overall Analysis and Performance

From Table 1 (note: the numbers in the column "Content of Privacy" represent the following types of private data that can be leaked: 1-call records, 2-geolocation information, 3-message records, 4-contacts, 5-mobile phone identification, 6-Baidu accounts, 7-WiFi information, 8-Bluetooth information, 9-base station information, 10-browser information. The marks in the column "Ways of Privacy Leakage" represent the following ways through which private data can be leaked: 1-network, 2-short message, 3-log, 4-file. The top 5 are benign apps and the last 5 are malware apps; details of these packages are publicly available on our laboratory website [30]), we can see that SoDetection detects more sources, sinks and smudges than FlowDroid. The reason is that it can effectively deal with the dynamic loading and reflection mechanisms. From Table 2 we can see that non-malicious applications may not use the dynamic loading mechanism, because they do not load third-party SO files, whereas all malicious applications whose malicious code is in SO files use the dynamic loading mechanism. During our experiment, the disposal phase for each application took 2 s; the pre-processing time for applications is included in the disposal phase. The static analysis of the applications in the dataset is processed in 23 concurrent threads. Since SoProtector mainly targets customized system vendors or security analysts, we consider such overhead quite acceptable.


Table 1. Effectiveness analysis

APK ID | FlowDroid Sink number | FlowDroid Taint propagation path number | SoDetection Sink number | SoDetection Taint propagation path number | Content of privacy | Ways of privacy leakage
1  | 35 | 0  | 41 | 4  | {1,2}     | {1}
2  | 24 | 3  | 30 | 7  | {3,5}     | {1,2}
3  | 72 | 12 | 74 | 19 | {6,10}    | {1}
4  | 44 | 21 | 44 | 26 | {1,2}     | {1,3,4}
5  | 86 | 0  | 86 | 0  | {7,10}    | {1}
6  | 14 | 5  | 17 | 7  | {3,6}     | {1,3}
7  | 29 | 14 | 30 | 15 | {2,3,4,5} | {1,2}
8  | 0  | 0  | 1  | 1  | {1,3}     | {1,2,4}
9  | 14 | 5  | 16 | 6  | {4,9}     | {1,2,3}
10 | 11 | 4  | 17 | 9  | {1,7}     | {1,3,4}

Table 2. Mechanism analysis

Type | Total | With dynamic loading mechanism (App number / Proportion) | With invoke mechanism (App number / Proportion)
Non-malicious Apps | 3000 | 1637 / 0.5456 | 1892 / 0.6306
Malicious Apps     | 400  | 400 / 1       | 135 / 0.3375

5.3 Precision

Based on the total number of TPs, FPs (non-malware apps mistaken for malware apps) and FNs (malware apps mistaken for non-malware apps), namely (2972, 375, 53), we compute the precision and recall of SoProtector; based on the total number of TPs, FPs and FNs (973, 68, 103), we compute the precision and recall of SoDetection, as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Overall, SoProtector precisely identified most of the apps, with 88.79% precision and 98.25% recall.
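As a quick arithmetic check of the reported figures, the counts above reproduce the stated precision and recall (a hypothetical helper, not part of SoProtector itself):

```python
# Hypothetical check: recompute precision/recall from the TP/FP/FN counts above.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# SoProtector: TP=2972, FP=375, FN=53
p, r = precision_recall(2972, 375, 53)
print(f"precision={p:.4f}, recall={r:.4f}")  # ~0.888 and ~0.9825, i.e. 88.79%/98.25%
```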


An important class of tested malware is the DroidKungFu family, a type of malware whose core code is in the native layer and which became popular in recent years, especially in China; SoProtector successfully identified its malicious samples in the dataset. In particular, DuanXinGongJiQi is a recent malware (an SMS Trojan) known to be able to evade most anti-virus tools in China. In these cases, SoProtector detected the misbehavior of outgoing SMS messages, typical of SMS Trojans. These results show that the SoProtector approach is a valid and effective alternative to static taint analysis approaches, being more accurate against malware whose core functions are in the native layer, including apps that are not detected by FlowDroid. In addition, due to the randomness in the random forest training process in Sect. 4.4, the results are not identical across runs; but in general, the accuracy of the combination of the two methods is much higher than that of each one alone, and the basic accuracy can reach more than 72%.

6 Discussion
In this section, we discuss the general applicability of SoProtector, as well as its limitations and future work. SoDetection can output the complete path from a taint source in the application layer or the native layer to the sink point, but its weakness lies in the analysis of API implementations deep inside the Linux kernel, which affects the detection effect to a certain extent; this will be addressed in future research. SoPlatform needs dedicated servers and incurs network transmission costs, which affect its efficiency to some extent.
Acknowledgement. This work has been partially sponsored by the National Key R&D Program of China (No. 2017YFE0111900), the National Science Foundation of China (No. 61572355, U1736115), the Tianjin Research Program of Application Foundation and Advanced Technology (No. 15JCYBJC15700), and the Fundamental Research of Xinjiang Corps (No. 2016AC015).

References 1. Symantec index. http://www.symantec.com/connect/blogs/norton-mobile-insight-discoversfacebook-privacyleak 2. Ball index. http://www.theguardian.com/world/2014/jan/27/nsa-gchqsmartphone-app-angrybirds-personal-data 3. Gibler, C., Crussell, J., Erickson, J., Chen, H.: AndroidLeaks: automatically detecting potential privacy leaks in android applications on a large scale. In: Katzenbeisser, S., Weippl, E., Camp, L.Jean, Volkamer, M., Reiter, M., Zhang, X. (eds.) Trust 2012. LNCS, vol. 7344, pp. 291–307. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30921-2_17 4. Kaspersky index. http://usa.kaspersky.com/about-us/press-center/pressreleases 5. Symantec index. http://www..com/connect/blogs/yet-another-bunchmalicious-apps-foundgoogle-play 6. News index. https://www.csc2.ncsu.edu/faculty/xjiang4/DroidKungFu2/


7. GingerMaster index. https://www.csc2.ncsu.edu/faculty/xjiang4 8. News index. https://blog.lookout.com/blog/2011/03/02/android-malware-droiddream-howit-works/. Accessed 4 Mar 2017 9. Liu, Z.: Verifiable searchable encryption with aggregate keys for data sharing system. Future Gener. Comput. Syst. 78, 778–788 (2018) 10. Enck, W.: TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst., 2–32 (2014) 11. Hornyack, P.: These aren’t the droids you are looking for: retrofitting Android to protect data from imperious applications. In: Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 639–652 (2011) 12. Arzt, S.: Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. ACM SIGPLAN Not. 49, 259–269 (2014) 13. Chen, X.: N-Mobishare: new privacy-perserving location-sharing system for mobile online social networks. Int. J. Comput. Math. 93, 384–400 (2018) 14. Li, T.: CDFS: a cryptographic data publishing system. J. Comput. Syst. Sci., 80–91 (2018) 15. Fischer, F.: Stack overflow considered harmful? the impact of copy & paste on android application security. In: IEEE Symposium on Security and Privacy (SP), pp. 121–136 (2017) 16. Xu, D.: Cryptographic function detection in obfuscated binaries via bit-precise symbolic loop mapping. In: IEEE Symposium on Security and Privacy (SP), pp. 921–937 (2017) 17. Eschweiler, S.: Efficient cross-architecture identification of bugs in binary code. In: The Network and Distributed System Security Symposium (2016) 18. Pewny, J.: Cross-architecture bug search in binary executables. In: IEEE Symposium on Security and Privacy, pp. 709–724 (2015) 19. Feng, Q.: Scalable graph-based bug search for firmware images. In: ACM SIGSAC Conference on Computer and Communications Security, pp. 480–491 (2016) 20. Geoffrey, H.: Deep learning. Nature 521, 436–444 (2015) 21. Richard, S.: Recognizing functions in binaries with neural networks. In: USENIX Security, pp. 611–626 (2015) 22. Xiao, J.: Neural network-based graph embedding for cross-platform binary code similarity detection. In: ACM Conference on Computer and Communications Security, pp. 435–446 (2017) 23. Wang, H.: A secure, usable, and transparent middleware for permission managers on Android. In: IEEE Transactions on Dependable and Secure Computing, pp. 350–362 (2017) 24. Wandoujia Store Index. http://www.wandoujia.com/apps 25. VirusShare Index. https://virusshare.com 26. Krupp, B.: SPE: security and privacy enhancement framework for mobile devices. IEEE Trans. Dependable Sec. Comput. 14, 433–446 (2017) 27. Saracino, A.: MADAM: effective and efficient behavior-based android malware detection and prevention. IEEE Trans. Dependable Sec. Comput. 15, 83–97 (2018) 28. Tongxin, L.: Unleashing the walking dead: understanding cross-app remote infections on mobile WebViews. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 829–844 (2017) 29. Paranthaman, R.: Malware collection and analysis. In: 2017 IEEE International Conference on Information Reuse and Integration, pp. 26–31 (2017) 30. Files Websites index. http://cs.tju.edu.cn/csweb/cyxz

CloudPT: Performance Testing for Identifying and Detecting Bottlenecks in IaaS
Ameen Alkasem(✉), Hongwei Liu, and Decheng Zuo

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China [email protected], {liuhw,zdc}@hit.edu.cn

Abstract. This work addresses performance testing for monitoring mass quantities of large-dataset measurements in infrastructure-as-a-Service (IaaS). Physical resources are not virtualized in sharing dynamic clouds; thus, shared resources compete for access to system resources. This competition introduces significant new challenges when assessing the performance of IaaS. A bottleneck may occur if one system resource is critical to IaaS; this may shut down the system and services, which would reduce the workflow performance by a large margin. To protect against bottlenecks, we propose CloudPT, a performance test management framework for IaaS. CloudPT has many advantages: (I) high-efficiency detection; (II) a unified end-to-end feedback loop to collaborate with cloud-ecosystem management; and (III) a troubleshooting performance test. This paper shows that CloudPT efficiently identifies and detects bottlenecks with a minimal false-positive rate ( 3), then the error class is "Serious", while the node fault state is "yes" (fault/abnormal).

Fig. 10. The proposed AFBD algorithm for detecting fault/anomaly bottleneck behavior

4.3.2 Diagnosis Engine Stage
In this stage, we proposed and implemented algorithms that combine the test data with the training dataset model to generate an intermediate table and classify the job. This allowed us to concurrently compute the probability of every component belonging to each of the three classes, using programs written in Java and Scala; the algorithm is displayed in Appendix A.1, Fig. 16. Our simple dataset outlines a probability model based on the predicted component usage state. Here, the measurement level varies widely between 0% and 100%. We observed the percentage component utilization at discrete times t1, …, tn, and used CPU utilization bands of {0–25%, 26–75%, 76–100%} as thresholds [27]. The new dataset holds the outcomes of the model classifier obtained by combining the test and training datasets, and it is built with the proposed algorithm, as shown in Fig. 17 in Appendix A.1. In the process, we initially defined our task as categorizing the three states using the new model. When one component is faulty, the system cannot work. We employed 0, 1, and 2 to represent the system and component situations: 0 represented good status (standard working conditions), 1 signified a minor error, and


Fig. 11. A classification of a probability dataset with final result fault states (no or yes)

2 symbolized a serious error. For instance, the CPU, memory, and network represent three simple modules, while the host server state represents the system state (Fig. 12). The results comprise only three classes (normal, minor, and serious) for the component state and two (yes, no) for the system fault state. To simplify the problem, we chose the same number of normal, minor, and serious measurements for all the models (see Fig. 13(A)–(B)) [27].
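A minimal sketch of the three-class labelling described above, assuming the 0–25%, 26–75% and 76–100% utilization bands map to the normal, minor and serious classes respectively (the mapping and the function name are our assumptions):

```python
# Hedged sketch: map a component utilization sample to the 0/1/2 class labels
# used in the text (0 = normal, 1 = minor, 2 = serious). The band-to-class
# mapping is an assumption; the paper only lists the threshold bands [27].
def classify_utilization(percent: float) -> int:
    if percent <= 25:
        return 0  # normal (standard working conditions)
    elif percent <= 75:
        return 1  # minor error
    else:
        return 2  # serious error

# Example: label a series of CPU samples observed at times t1, ..., tn
samples = [12.0, 54.3, 91.7]
labels = [classify_utilization(s) for s in samples]  # -> [0, 1, 2]
```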


Fig. 12. An NBC Model predicting the components utilization for a host VM system

Fig. 13. Results of the algorithms proposed

5 Experimental and Evaluation Results

5.1 OpenStack
OpenStack [28] is open source software that can control massive amounts of storage, computing, and network resources in a datacenter. Typically, it is managed through a dashboard or the OpenStack API. We introduced 40 irregularities into the OpenStack online service of the host server [29], which resulted in faults/anomalies in global resource consumption (see Fig. 14(A)–(B)). These 40 irregularities represent the extreme failure source issues one can identify within online services [21].


Additionally, in our experiments, we ran each application independently on Hadoop and Spark for 24 h. We ran their respective benchmarks on 1.2 GB to 12 GB datasets; the system throughput of the different model implementations is shown in Fig. 15. We observed patterns of CPU utilization in which the testbed displayed the projected performance for the host server, confirming our hypotheses. We gathered the metrics of the VMs and the host using a fault troubleshooting method for 4 s, and throughout this period we injected glitches into the testbed system.

Fig. 14. CPU utilization parameters using Ganglia monitoring metrics

Fig. 15. System throughput of experiments using Hadoop and Spark

5.2 Evaluation and Experiment Results
The statistical measures recall, precision, and accuracy (exactness) were used to assess whether the fault diagnosis using Apache Spark and NBC was effective for the massive dataset problem. As displayed in Table 2, we utilized four statistical measures [26, 30] to evaluate CloudPT's effectiveness in identifying and eradicating bottlenecks. A successful anomaly/fault bottleneck recognition was defined as the program diagnosing the irregularity correctly, using the fault type identification (type, size, location) and according to the affected host VM and metrics. CloudPT is the first end-to-end performance testing management framework that can troubleshoot, analyze, classify and suggest repair actions for virtualized cloud-based fault bottlenecks and anomalies.


Table 2. Four statistical measures

Precision = successful detections / total alarms
Recall = successful detections / total anomalies
Accuracy = 2 × precision × recall / (precision + recall)
False-alarm rate (FAR) = 1 − precision
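A minimal sketch of the four measures in Table 2, assuming they are computed from raw counts of successful detections, total alarms and total injected anomalies (the function and variable names are ours):

```python
# Hedged sketch of the four statistical measures defined in Table 2.
def evaluation_measures(successful, total_alarms, total_anomalies):
    precision = successful / total_alarms
    recall = successful / total_anomalies
    accuracy = 2 * precision * recall / (precision + recall)  # F1-style score
    far = 1 - precision  # false-alarm rate
    return precision, recall, accuracy, far
```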

Overall, the end-to-end feedback loop of CloudPT diagnosed performance bottlenecks in 20 s. The results showed an 86% improvement in the Accuracy (F1) score compared with the theoretical method, with a standard false alarm rate of

Static thresholds > 90%: 25, 16, 0.40, 0.64, 0.48, 0.36
CloudPD (problem definition in sharing dynamic clouds): 44, 32, 0.80, 0.72, 0.76, 0.28

5.3 Performance Testing Overheads
CloudPT uses non-virtualized cloud assets to test for performance bottlenecks and behavior anomalies. We quantified the overhead of CloudPT in terms of CPU, memory and network utilization. For our experiments, we created bottlenecks and considered the failure of a host VM to start up. We used the virtualized resources averaged across VMs over the 24-hour experiment duration; this is represented in Table 3. It is evident that CloudPT imposes minimal overhead on the system. Hence, our experimental study confirms CloudPT's accuracy and frequency in detecting bottlenecks and anomaly faults, in addition to having a low cloud system overhead.

6 Conclusion

We proposed an Apache Spark-based bottleneck troubleshooting performance framework for IaaS, called CloudPT. The proposed framework includes three troubleshooting construction measures: (I) data collection, (II) analysis and classification engine implementation, and (III) decision engine implementation. The objectives of CloudPT


are to monitor collections, develop analyses, and classify the attributes of measurements, as opposed to individual metric thresholds, by extending fault detection into troubleshooting. In general, the framework focuses on monitoring the shared virtualized resource measurements to address the problems that lead to failure bottlenecks. More specifically, CloudPT troubleshoots all apparent bottlenecks or anomalies by using pre-computed fault notifications. Through this framework, we also measured and modelled CPU utilization, memory usage, and network overhead. Simultaneously, CloudPT allows recovery to occur in an automated model that is integrated into cloud management services. We conducted a comprehensive assessment of CloudPT on two representative cloud workloads: Hadoop and Spark. We also conducted a host VM startup failure case study. The outcomes of all the experiments demonstrate that CloudPT attains significant accuracy with a low occurrence of false alarms; in short, it efficiently identifies and eradicates bottlenecks and behavior anomalies. One area of future work will mainly cover the development of additional features for CloudPT, such as recovery and self-healing.
Acknowledgments. We are thankful to the anonymous reviewers for their valuable feedback and comments that improved the quality of the manuscript.


Appendix A
A.1 Proposed Algorithms

Fig. 16. Algorithm for training, filtering and streaming dataset based on Hadoop and Spark


Fig. 17. Algorithm for combining the testing and training datasets classification and evaluation results

References 1. Malli, S.S., Soundararajan, V., Venkataraman, B.: Real Time Big Data Analytics to Derive Actionable Intelligence in Enterprise Applications, Internet of Things and Big Data Analytics Toward Next-Generation Intelligence, pp. 99–121. Springer, Cham (2018) 2. Gregg, B. Systems Performance: Enterprise and The Cloud. Pearson Education, New Jersey 3. Alsheikh, M.A., Niyato, D., Lin, S., Tan, H.P., Han, Z.: Mobile big data analytics using deep learning and apache spark. IEEE Network 30(3), 22–29 (2016) 4. Performance-testing (2017). http://www.softwaretestinghelp.com/what-is-performancetesting-load-testing-stress-testing/ 5. Zhang, Q., Cheng, L., Boutaba, R.: Cloud computing: state-of-the-art and research challenges. J. Internet Serv. Appl. 1(1), 7–18 (2010)


6. Alkasem, A., Liu, H., Decheng, Z., et al.: AFDI: A Virtualization-based Accelerated Fault Diagnosis Innovation for High Availability Computing, arXiv preprint arXiv:1507.08036 (2015) 7. High CPU utilization but low load average (2017). https://serverfault.com/questions/667078/ high-cpu-utilization-but-low-load-average/667089 8. Alkasem, A., Liu, H., Zuo, D.: Utility cloud: a novel approach for diagnosis and self-healing based on the uncertainty in anomalous metrics. In: Proceedings of the 2017 International Conference on Management Engineering, Software Engineering and Service Sciences, pp. 99–107. ACM (2017) 9. Zhai, Y., Xu, W.: March. efficient bottleneck detection in stream process system using fuzzy logic model. In: Euromicro International Conference on Parallel, Distributed and Networkbased Processing (PDP), pp. 438–445. IEEE (2017) 10. Castro Fernandez, R., Migliavacca, M., Kalyvianaki, E., Pietzuch, P.: Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM (2013) 11. Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., et al.: Anomaly-based network intrusion detection: techniques, systems and challenges. Comput. Secur. 28(1), 18–28 (2009) 12. Massie, M., et al.: Monitoring with Ganglia: Tracking Dynamic Host and Application Metrics at Scale. O’Reilly Media, Inc., Massachusetts (2012) 13. Barth, W.N.: System and Network Monitoring. No Starch Press, San Francisco (2008) 14. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016) 15. Sharma, B., Praveen, A., Chita, R.D.: Problem determination and diagnosis in shared dynamic clouds. In: 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE (2013) 16. Cherkasova, L., Ozonat, K., Mi, N., Symons, J., Smirni, E.: Automated anomaly detection and performance modeling of enterprise applications. ACM Trans. Comput. Syst. (TOCS) 27(3), 1–32 (2009) 17. Kumar, A., Shankar, R., Choudhary, A., Thakur, L.S.: A big data MapReduce framework for fault diagnosis in cloud-based manufacturing. Int. J. Prod. Res. 54(23), 7060–7073 (2016) 18. Li, J., Qiu, M., Ming, Z., Quan, G., Qin, X., Gu, Z.: Online optimization for scheduling preemptable tasks on IaaS cloud systems. J. Parallel Distrib. Comput. 72(5), 666–677 (2012) 19. Alkasem, A., Liu, H., Shafiq, M., Zuo, D.: A new theoretical approach: a model construct for fault troubleshooting in cloud computing. Mobile Inf. Syst. 2017, 16 (2017). https://doi.org/ 10.1155/2017/9038634. Article ID 9038634 20. SivaSelvan, N., Haider, M.Y., Selvan, N.S., Hegde, G.: Design and Development of Performance Management System (2016) 21. Wang, C., Talwar, V., Schwan, K., Ranganathan, P.: Online detection of utility cloud anomalies using metric distributions. In: Network Operations and Management Symposium (NOMS). IEEE (2010) 22. Bertino, Elisa, Catania, Barbara: Integrating XML and databases. IEEE Internet Comput. 5(4), 84–88 (2001) 23. Barham, P., Boris, D., Keir, F., Steven, H., et al.: Xen and the art of virtualization. In: ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 164–177. ACM (2003) 24. Riddle, A.R., Soon, M.C.: A survey on the security of hypervisors in cloud computing. In: 2015 IEEE 35th International Conference on Distributed Computing Systems Workshops (ICDCSW), pp. 100–104. IEEE (2015) 25. 
Gelman, A., John, B.C., Hal, S.S., Donald, B.R.: Bayesian Data Analysis, vol. 2. Chapman & Hall/CRC, Boca Raton (2014)


26. Doane, D.P., Lori, E.S.: Applied Statistics in Business and Economics. Irwin, New York (2005) 27. Alkasem, A., Liu, H., Zuo, D., Algarash, B.: Cloud computing: a model construct of realtime monitoring for big dataset analytics using apache spark. J. Phys: Conf. Ser. 933(1), 012018 (2018) 28. Jackson, K.: OpenStack Cloud Computing Cookbook. Packt Publishing Ltd, Birmingham (2012) 29. Kumar, V., Karsten, S.S., Yuan, C., Akhil, S.: A state-space approach to SLA based management. In: Network Operations and Management Symposium NOMS 2008 IEEE, pp. 192–199. IEEE (2008) 30. Alkasem, A., Liu, H.: A survey of fault-tolerance in cloud computing: concepts and practice. Res. J. Appl. Sci. Eng. Technol. 11(12), 1365–1377 (2015)

Smart Grid Power Trading Based on Consortium Blockchain in Internet of Things
Dong Zheng1,2(B), Kaixin Deng1, Yinghui Zhang1,2(B), Jiangfan Zhao1, Xiaokun Zheng3, and Xinwei Ma1

1 National Engineering Laboratory for Wireless Security, Xi'an University of Posts and Telecommunications, Xi'an 710121, People's Republic of China
[email protected], [email protected], [email protected], [email protected], [email protected]
2 Westone Cryptologic Research Center, Beijing 100070, China
3 School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, People's Republic of China
[email protected]

Abstract. Internet of Things (IoT) technologies have attracted enormous attention from academia and industry, and one of the most representative applications is the smart grid. Most smart grid system models have to rely on trusted third-parties, but there are no trusted third-parties in practice. Blockchain technologies show a lot of advantages in IoT due to their unique characteristics. In this paper, to enable reliability, efficiency, flexibility and security in smart grid trading, we combine blockchain technologies, proof of stake consensus mechanisms and cryptography tools to build a novel smart grid power trading system. Our security analysis shows that the proposed system can protect users' data privacy.

Keywords: Smart grid · Blockchain · Smart contracts · Internet of Things · Energy market

1 Introduction

In future smart grid designs, users can use renewable energy such as solar energy and wind energy, converting it into storable electricity to reduce power companies' dependence on fossil fuels [9]. Users can complete trading with companies or other users through gateways [20]. Privacy disclosure can easily occur when this information is not encrypted. In recent years, many technologies have been used to protect Internet of Things (IoT) security [6,8,12,15]. Electric power companies acting as trusted third-parties are vulnerable to attacks, and there still exist some security issues [1,12]. In view of these security threats, it is urgent to design a safe and reliable decentralized system to ensure that the interests of users and companies are not violated.
Blockchain is defined as a distributed database that records transactions of value using a cryptographic signature that is inherently resistant to modification [11]. It allows a distributed peer-to-peer network where non-trusting members can interact with each other without a trusted intermediary [2]. The famous WannaCry ransomware used bitcoin as its payment currency [3], which made more and more people aware of the uniqueness of the blockchain. With the development of cloud computing [16,19] and wireless network technologies [13,14], blockchain technologies have been used for outsourcing services such as payment [18] and keyword search [17] in cloud computing and for intelligent control in the energy market [4,10], and consortium blockchain has high potential to establish a decentralized electricity trading system at moderate cost [7]. However, most existing schemes adopt proof of work (PoW) consensus mechanisms or private chains, where PoW is wasteful and private chains are not decentralized in essence. In future smart grid systems, constructing a decentralized peer-to-peer network with blockchain technology can bring the system more security and flexibility. This paper presents a smart grid power trading system. The main contributions of this paper are two-fold.
Supported by National Key R&D Program of China (No. 2017YFB0802000), National Natural Science Foundation of China (No. 61772418, 61472472, 61402366), Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JZ6001, 2015JQ6236). Yinghui Zhang is supported by New Star Team of Xi'an University of Posts and Telecommunications (No. 2016-02).

2 2.1

Smart Grid Power Trading System System Architecture

Figure 1 presents an overview of the model, which uses blockchain as the protocol layer, rely on the Ethereum to run the smart contract, and uses the PoS consensus mechanism, completes power trading with the help of market. The system has these following levels:

Smart Grid Power Trading Based on Consortium Blockchain in IoT

455

Fig. 1. System model

(1) User layer. Users register in the system through smart meters [5] with true identity, system returns his public key and certificate, the certificate can be used to uniquely identify the user node through binding registration information of the user. Then users generate their public key and private key by their own. Users can price electrical energy through interactive devices, set the expected sales price and electricity waiting for smart contract processing. (2) Authorized nodes layer. The system establishes authorized nodes based on the users’ geographical distribution for the user to participate in the system. A consensus reached between authorized nodes according to user needs. When the user’s transaction requirements are satisfied, the smart contract is automatically executed. (3) Power company layer. In system, power companies play the role of power balance energy storage. Performing big data analysis and forecasting according to the regional electricity situation, and complete the estimation of the current regional peak time, sending request information through contract during low electricity period for low-cost electricity purchase. Power companies are also important power carriers to transmit power over long distances. (4) PoS consensus mechanism. We use PoS mechanism instead of PoW, it is based on the energy converted from renewable energy and the time it is stored for interest release and block generation. If a region is rich in electricity resources, due to PoS mechanism will have interest compensation for emptying more power areas, the authorized node will charge a part of the commission by agreement, then distribute most of the interest to users. Users will be more willing to sell the electricity at a lower price than the market price.

456

D. Zheng et al.

2.2

System Details

System Initialization. The smart meter first needs to register through the authorized node to participate in the system and become the legal node in the system. When new user U seriu involves in the system, obtains the certificate Certui and node public key P Ku used to encrypt the sensing data from the authorized nodes N odeu and generate the user’s own public and private keys {P Kiu , SKiu }. u represents the uth community and i represents the ith user in u. The new user’s smart meter will download the current system’s block data storage location index table from the authorized node’s records, after which the synchronization list may be obtained from the nearby smart meters through the P2P network. The process is expressed as follows: N odeu → U seriu : {P Ku , Certui }

(1)

Authorized Data Upload. The user or power company’s smart meter senses the electrical energy converted from renewable energy from the energy storage unit to further collect sensory data dataui ; The user sets the price for selling or purchasing electricity on the interactive device according to the current market price pui , then set up to sell or buy electricity xui , the smart meter packages the data and encrypts the signature, which is passed to the authorized node. Upload data using pseudonyms and digital signatures to ensure the integrity and authenticity of data. The process is expressed as follows: U seriu → N odeu : dataipx = EncP Ku (dataui ||Certui ||Sigiu ||timestamp) Among them:

dataui = EncP Kiu {data||pui ||xui }

(2) (3)

Authorized Node Validation. When the N odeu receives the data, it uses SKiu to decrypt dataipx for verifying user’s identity. If the information is valid, it can be saved in the data record pool for the next processing; If the information is not secure or invalid, the data is ignored. PoS Consensus Mechanism Operation. The PoS consensus mechanism has a unique concept: coindays, the currency multiplied by the number of holding days. The authorized node is responsible for running smart contract and the generation of blocks. Total electricity used to generate coindays for block generation, as long as the node holding electricity, no matter how many can be dug to data blocks, without using any mineral pools it will not cause computing power concentration. At the same time, it reduces the resource consumption because of the use of coindays to generate blocks instead of computing power. If a new PoS block is discovered by the authorized node, it will clear the coindays to gain pay to compensate users and the authorized node.

Smart Grid Power Trading Based on Consortium Blockchain in IoT

457

If a authorized node N odeu consumes coindays and generated a new block data block, the node integrates the data sets dataut received from other nodes during the integration, and attach the signature Sigu and the hash value of the new data blocks, broadcast to other authorized nodes. The process is expressed as follows: dataut = {dataupx1 ||dataupx2 || · · · ||dataupxn ||timestamp}

(4)

N odeu → All N ode : (dataut ||data hash||Certu ||Sigu ||timestamp)

(5)

Among them:

data hash = Hash(dataut ||timestamp)

(6)

Reply. After the other authorized nodes receive the node’s broadcast, they verify the legitimacy and correctness of the data block through the block hash and digital signature, and broadcast the audit result to other authorized node N odel with their signatures. After the N odeu receives and summarizes all the audit results, it signs and sends a reply to the master node. N odel → N odeu : reply = EncP Ku (result sets||Certl ||Sigl ||timestamp) (7) Among them: result sets = {result1 ||result2 || · · · ||resultl }

(8)

Writing in the Chain. The authorized node N odeu can decrypt the replies with its own private key SKu after receiving the reply from other authorized nodes. If the audit results of other nodes pass, then the N odeu will put the audit result into the data block and write it into the main chain, so as to get the system reward. Contract Operation. There is a virtual trading market in our model. Authorized node packaged power sales and quotes are broadcast on the entire network, smart contracts will be automatically executed according to the needs of users and power companies by running scripts, such as searching electricity prices from low to high, according to the user’s estimated price for intelligent sales, buy electricity at a lower price; power companies can buy electricity in the low valley period and sell electricity in the peak period according to the smart contract. Power Transfer. When the smart contract is completed, the smart meter will conduct electricity dispatching according to the data broadcasted by the authorized nodes, and obtain or pay the corresponding digital currency.

3

Security Analysis

The system proposed in this paper utilizes asymmetric encryption technology and has good resistance to traditional security attacks. Through the

458

D. Zheng et al.

cryptographic authentication mechanism, the attacker cannot crack the encrypted information within the effective time; by adding a time stamp to the data information, attackers cannot launch the replay attack; by using the digital signature technology in the data information, it can prevent attackers from forging fake data or tampering with data. In blockchain security, our system does not require a reliable third-party. The data is backed up at each authorized node. A small number of nodes that are attacked cannot affect the collapse of the entire system. System uses pseudonyms to ensure the privacy of the user’s personal information so that nodes cannot obtain the true identity. This article uses smart contracts to share data, restricts data access rights, and makes transactions transparent. The PoS mechanism will submit its consumed coindays to each block in order to increase the score of the block. The block with the highest depletion coindays will be selected as the main chain. In PoW mechanism, if someone has more than 50% of the computing power, he can mine the block faster than others, so he actually has the absolute right to the block, such as undo the payment transactions. Our design reduces the worries of the 51% attack, because in the PoS consensus mechanism, the 51% attack need to be controlled in a large number of coins, the cost may be higher than the 51% computing power, so it increases the cost of the attack.

4

Conclusion

With the rapid development of smart grid systems, centralized data storage methods are increasingly difficult to deal with attacks. Turning data centered storage into distributed storage is the future trend. This paper proposes a smart grid trading system based on the consortium blockchain, relies on smart contracts and PoS consensus mechanism, enables users and operators who maintain the nodes to form a win-win situation. In order to further improve the security of the system, we consider improving the consensus algorithm based on the PoS consensus to ensure that the verifier of the highest-value deposit in each block can operate the blockchain in the best profit model.

References 1. Aitzhan, N.Z., Svetinovic, D.: Security and privacy in decentralized energy trading through multi-signatures, blockchain and anonymous messaging streams. IEEE Trans. Dependable Sec. Comput. (2016). https://doi.org/10.1109/TDSC.2016. 2616861 2. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016) 3. Crowe, J.: Wannacry ransomware statistics: the numbers behind the outbreak. https://blog.barkly.com/wannacry-ransomeware-statistics-2017/ 4. Etemad, R.H., Lahouti, F.: Resilient decentralized consensus-based state estimation for smart grid in presence of false data. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3466–3470. IEEE (2016)

Smart Grid Power Trading Based on Consortium Blockchain in IoT

459

5. Han, Q., Zhang, Y., Chen, X., Li, H., Quan, J.: Efficient and robust identity-based handoff authentication in wireless networks. In: Xu, L., Bertino, E., Mu, Y. (eds.) NSS 2012. LNCS, vol. 7645, pp. 180–191. Springer, Heidelberg (2012). https://doi. org/10.1007/978-3-642-34601-9 14 6. Li, J., Zhang, Y., Chen, X., Xiang, Y.: Secure attribute-based data sharing for resource-limited users in cloud computing. Comput. Secur. 72, 1–12 (2018) 7. Li, Z., Kang, J., Yu, R., Ye, D., Deng, Q., Zhang, Y.: Consortium blockchain for secure energy trading in industrial internet of things. IEEE Trans. Ind. Inform. (2017). https://doi.org/10.1109/TII.2017.2786307 8. Liu, Y., Zhang, Y., Ling, J., Liu, Z.: Secure and fine-grained access control on e-healthcare records in mobile cloud computing. Future Gener. Comput. Syst. 78, 1020–1026 (2018) 9. Mahmoud, M.M., Saputro, N., Akula, P.K., Akkaya, K.: Privacy-preserving power injection over a hybrid AMI/LTE smart grid network. IEEE Internet Things J. 4(4), 870–880 (2017) 10. Mannaro, K., Pinna, A., Marchesi, M.: Crypto-trading: Blockchain-oriented energy market. In: AEIT International Annual Conference, pp. 1–5. IEEE (2017) 11. Mylrea, M., Gourisetti, S.N.G.: Blockchain for smart grid resilience: exchanging distributed energy at speed, scale and security. In: Resilience Week (RWS), pp. 18–23 (2017) 12. Zhang, Y., Zheng, D., Deng, R.H.: Security and privacy in smart health: efficient policy-hiding attribute-based access control. IEEE Internet Things J. 5(3), 2130– 2145 (2018) 13. Zhang, Y., Chen, X., Li, H., Cao, J.: Identity-based construction for secure and efficient handoff authentication schemes in wireless networks. Secur. Commun. Netw. 5(10), 1121–1130 (2012) 14. Zhang, Y., Chen, X., Li, J., Li, H.: Generic construction for secure and efficient handoff authentication schemes in EAP-based wireless networks. Comput. Netw. 75, 192–211 (2014) 15. Zhang, Y., Chen, X., Li, J., Li, H., Li, F.: FDR-ABE: attribute-based encryption with flexible and direct revocation. In: International Conference on Intelligent Networking and Collaborative Systems (INCoS), pp. 38–45. IEEE (2013) 16. Zhang, Y., Chen, X., Li, J., Wong, D.S., Li, H.: Anonymous attribute-based encryption supporting efficient decryption test. In: Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pp. 511–516. ACM (2013) 17. Zhang, Y., Deng, R.H., Jiangang, S., Kan, Y., Dong, Z.: TKSE: trustworthy keyword search over encrypted data with two-side verifiability via blockchain. IEEE Access 6, 31077–31087 (2018) 18. Zhang, Y., Deng, R.H., Ximeng, L., Dong, Z.: Blockchain based efficient and robust fair payment for outsourcing services in cloud computing. Inf. Sci. 462, 262–277 (2018) 19. Zhang, Y., Li, J., Chen, X., Li, H.: Anonymous attribute-based proxy re-encryption for access control in cloud computing. Secur. Commun. Netw. 9(14), 2397–2411 (2016) 20. Zhang, Y., Zhao, J., Zheng, D.: Efficient and privacy-aware power injection over AMI and smart grid slice in future 5G networks. Mob. Inf. Syst. 2017, 1–11 (2017)

Energy-Efficient Offloading in Mobile Edge Computing with Edge-Cloud Collaboration Xin Long , Jigang Wu(B) , and Long Chen School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China [email protected], [email protected], [email protected]

Abstract. Multiple access mobile edge computing is an emerging technique to bring computation resources close to end mobile users. By deploying edge servers at WiFi access points or cellular base stations, the computation capabilities of mobile users can be extended. Existing works mostly assume the remote cloud server can be viewed as a special edge server or the edge servers are willing to cooperate, which is not practical. In this work, we propose an edge-cloud cooperative architecture where edge servers can rent for the remote cloud servers to expedite the computation of tasks from mobile users. With this architecture, the computation offloading problem is modeled as a mixed integer programming with delay constraints, which is NP-hard. The objective is to minimize the total energy consumption of mobile devices. We propose a greedy algorithm with approximation radio of (1 + ε) as well as a simulated annealing algorithm to effectively solve the problem. Extensive simulation results demonstrate that, the proposed greedy algorithm can achieve the same application completing time budget performance of the Brute Force optional algorithm with only 31% extra energy cost.

Keywords: Mobile edge computing Remote cloud · Task dependency

1

· Cooperate · Greedy algorithm

Introduction

The recent tremendous growth of various wireless devices and diverse applications has brought the challenge in wireless systems. Since the proliferation of smart mobile devices and wearable sensors, mobile traffic and computation tasks have increased dramatically. Therefore, cloud computing [2] as well as 5G communication [5,9] has been proposed to deal with this challenge in the big data era. Despite the potential in data storage and analysis, cloud computing cannot fulfill the growing application requirements such as low latency and context awareness. Multiple-access mobile Edge Computing (MEC) [13] that serves as a complement for cloud computing can potentially overcome the weakness of c Springer Nature Switzerland AG 2018  J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 460–475, 2018. https://doi.org/10.1007/978-3-030-05057-3_35

Energy-Efficient Offloading in Mobile Edge Computing

461

mobile cloud computing by offloading computation intensive tasks at the edge of wireless networks. Task allocation and computation resource assignment are crucial to MEC, especially in the presence of an application with a large number of delay sensensitive subtasks. For example, on-line gaming for recreation or face recognition for security purposes. Those tasks should be handled in time taking the finite bandwidth and limited computation resources into consideration. The offloading problem that taking into consideration the above factors jointly are usually mixed integer programming problems which are non-convex and NP-hard [7,8]. Among the task allocation and resource assignment schemes, energy optimization is one of the key factors that affect the performance of the computation resource limited mobile devices. That’s because the energy consumption of mobile devices would exponentially grow when there are multiple complex tasks on the devices. Earlier works on energy optimization for MEC, such as [3,17], assumed unlimited energy supply of edge servers. Bi et al. [3] addressed the computation rate maximization problem in wireless powered MEC networks. Mobile devices can harvest energy from the cellular base station that with an MEC server. The original problem was non-convex and a decoupled optimization with coordinate descent method was proposed to solve the proposed problem. Lyu et al. in [17] studied the total energy consumption of multiple devices with latency constraints. The problem was modeled as a mixed-integer programming, followed by a dynamic programming algorithm based on Bellman equation. More recent researches [4,14] have been focused on delay minimization with energy or budget constraints of edge servers. Chen et al. [4] carried out with a novel multi-cell MEC architecture where edge devices such as base stations can cooperate with remote server on task execution. Considering the ON/OFF nature of edge servers, they used Lyapunov optimization technique to obtain optimal decisions on task offloading. Considering task dependency, Kao et al. [14] presented Hermes, aiming at minimizing total execution time of tasks with user budget constraints (Table 1). Based on the literature reviews, task dependency was not properly investigated by [3,4,17], which is important for real deployment. Although task dependency was used in the model by [14], authors in [14] merely neglected the influence of remote cloud servers. Moreover, all the above works assume the remote cloud server can be viewed as a special edge server or the edge servers are willing Table 1. Comparison between existing works and this work. Existing works

[3]

[4]

Hermes [14] [17]

This work

Task dependency

No

No

Yes

No

Yes

Edge-cloud collaboration

No

No

No

No

Yes

Energy constraint of users No

Yes

Yes

No

Yes

Server utility constraint

No

No

No

No

Yes

Objective

Computation rate Delay Delay

Energy Energy

462

X. Long et al.

to cooperate. In real scenarios, the remote cloud server has higher computation capability than the edge server and the transmission delay between edge cloud and remote server cannot be neglected when designing proper offloading schemes. Take face recognition as an example. The feature extraction tasks for face images obtained by individual mobile devices can be offloaded to edge servers while the machine learning and face recognization, i.e., image matching tasks can be executed on the remote cloud servers. Therefore, with edge-cloud cooperation, the target faces can be detected with certain bounded delay for distributed mobile devices. In this work, we investigate computation offloading decision and resource allocation problem with given delay requirements of mobile applications. The objective is to minimize sum energy consumption of mobile devices. Different from above works, we take edge-cloud cooperation into account, which being new challenges for the energy optimization problem. Since there are heterogeneous network resources, it is necessary to determine which the computation tasks should be done at remote clouds, processed at edge servers or local mobile devices. From the perspective of edge and remote cloud servers, their service for mobile devices should be compensated for the cost of execution and their profits should be guaranteed. Since the tasks of one application is delay bounded, how to handle edge-cloud cooperation with user budget constraints should be carefully designed. The main contributions of this paper can be summarized as follows: – A novel edge-cloud cooperation architecture is proposed in wireless heterogeneous network with edge servers deployed at small-cell base stations and remote cloud servers connected to the macro-cell base station. The edge server can hire remote edge servers to process some of the tasks originated from mobile devices. – The offloading problem is modeled as a mixed integer non-linear programming, which is NP-hard. We then propose a greedy algorithm as well as a simulated annealing algorithm to effectively solve the problem. – To provide incentive for edge servers, we propose a pricing scheme with virtual currency from mobile users to edge servers and remote cloud servers for the dedication of servers serving mobile users. The remainder paper is organized as follows. System model and computation model are presented in Sect. 2. Section 3 presents the problem formulation. The proposed algorithms is described in Sect. 4. Section 5 presents the performance evaluation. Section 6 concludes this paper with future remarks.

2

System Model and Computation Model

This section firstly describes the system model and formulates the offloading problem for energy saving with local computing, edge computing and the collaboration between edge and cloud servers.

Energy-Efficient Offloading in Mobile Edge Computing

463

Fig. 1. System architecture

2.1

System Model

As shown in Fig. 1, each edge server is located at the access point (AP) [6] which is also being attached by multiple mobile devices. The edge server is deployed at the AP and is linked to the remote cloud via high speed fiber links. Let U be the set of mobile devices, We assume that there are M mobile devices. Therefore, we have U = {u1 , u2 , u3 , · · · , uM }, where M ≥ 1. Meanwhile, there is a set Tm subtasks on the m-th mobile device, which cloud be denoted as Tm = {τm,1 , τm,2 , τm,3 , · · · , τm,N }, where N ≥ 0. Next, we will introduce the communication and computation models for mobile devices, edge servers and remote cloud in detail. 2.2

Communication Model

m f ∈ {0, 1}, Xm,n ∈ Transmission between Mobile Devices and Edge. Let Xm,n c {0, 1} and Xm,n ∈ {0, 1} each represents the computation offloading policy made m = 1 denotes that the subtask n by the m-th mobile device. Particularly, Xm,n f = 1 denotes that the subtask on mobile device m is executed locally while Xm,n c = 1 denotes n of mobile device m is executed on the edge server. Similarly, Xm,n that the subtask n on mobile device m is executed on the remote cloud. We can compute the uplink data rate for wireless transmission between mobile device and edge server as [5]:   m Pm,n Gm,n  , (1) Rm,n = W log2 1 + 2 mG σm + i=m,j=m Pi,j i,j m is the transmission power of mobile device m to upload the subwhere Pm,n task n to the edge server via AP, Gm,n is the channel gain between the mth mobile device and the corresponding AP when transmitting subtask n. 2 Gm,n = (dis−η m,p ) |hm,p | where dism,p denotes the Euclidean distance between

464

X. Long et al.

mobile device and edge server, hm,p is the corresponding Rayleigh fading channel coefficient that obeys the distribution of N (0, 1) [20]. The surrounding noise 2 [20]. power at the receiver, i.e. the AP, is σm It should be noted that, for the benefit of presentation, the downlink transmission rate is represented by the corresponding uplink rate. In the following expressions, we also utilize the expression of uplink transmission delay to represent the downlink transmission delay. That’s because the downlink transmission rate is usually a few times larger than the uplink transmission rate due to the channel allocation result of network operator. With this change, we can reduce the complexity of delay and energy cost expressions, which will be described in detail in following paragraphs (Table 2). Table 2. Basic notations Notation

Descriptions

M N Dm,n Wm,n Rm,n ttm,n trm,n

Number of mobile devices Number of subtasks Data size of subtask n on mobile device m Workload of subtask n on mobile device m Uplink data rate for subtask n of mobile device m Time spent when sending subtask n of device m to edge server Time spent when sending subtask n of device m from edge server to remote cloud Energy cost during transmission between mobile device and edge server for subtask n of device m Energy cost during transmission between edge and cloud for subtask n of mobile device m The delay when executing subtask n locally Energy consumption when executing subtask n of device m Completing time of subtask n on mobile device m that executed locally Energy cost during the completing time of subtask n on device with local computing Budget or allowed delay threshold for subtasks on device m Total energy cost for all subtasks of device m Total time consumed for all subtasks of mobile device m Profit of the edge server Offloading policy for subtask n of device m on local computing Offloading policy for subtask n of device m on edge computing Offloading policy for subtask n of device m on remote execution

t Em,n r Em,n

tlm,n l Em,n l T Fm,n l EFm,n Budgetm Em T Fm Upf l Xm,n f Xm,n c Xm,n

Energy-Efficient Offloading in Mobile Edge Computing

465

The transmission delay of subtask n between mobile device m and the corresponding edge server thus can be [11] ttm,n =

Dm,n , Rm,n

(2)

where ttm,n represents the time spent on sending the subtask n on mobile device m to the edge server, while Dm,n is the data size of the subtask n of device m. Based on the above equations, we can obtain the energy consumption when transmitting subtask n of mobile device m to the edge server as t m t Em,n = Pm,n tm,n ,

(3)

tx where Pm,n is the power of mobile device m when sending subtask n.

Transmission between Edge and Cloud. Due to fact that the edge server links the remote cloud via wired connection, the delay of data transmission from edge server to the cloud thus is Dm,n , (4) trm,n = ω where trm,n denotes the transmission delay for subtask n of mobile device m from edge server to the cloud. ω denotes the upstream bandwidth. Given the transmission delaybetween edge and remote cloud trm,n and the transmission r can be expressed as power P0 , Em,n r Em,n = P0 trm,n ,

(5)

r where Em,n is the energy consumed when sending the subtask n of mobile device m from edge to the cloud.

2.3

Computation Model

l be the CPU clock speed of mobile device Computation on Local Device. Let fm m and Wm,n be the workload of subtask n of mobile device m, if the subtask n on mobile device m is executed locally, then the subtask’s execution time is

tlm,n =

Wm,n . l fm

(6)

Given the computation time tlm,n , the energy consumed for subtask n of mobile device m for local computing is 2

l l = kWm,n fm . Em,n

By default, k is set as 10−11 following [12].

(7)

466

X. Long et al.

Computation on Edge. Let f f be the CPU frequency of edge server, if the subtask n of mobile device m is executed on the edge server, the computation time of the edge server can be tfm,n =

Wm,n , ff

and the energy cost of edge server can be expressed as:    σ f Em,n = αf f f + βf tfm,n .

(8)

(9)

According to [19], α_f and β_f are positive constants which can be obtained by offline power fitting, and σ ranges from 2.5 to 3. If subtask n of mobile device m is executed on the cloud, the computation delay and energy cost of the remote cloud are as follows:

t^c_{m,n} = W_{m,n} / f^c,    (10)

and

E^c_{m,n} = (α_c (f^c)^σ + β_c) t^c_{m,n}.    (11)
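The per-subtask computation costs in (6)–(11) can be evaluated as below. This is a minimal sketch; the default constants shown are placeholders, not the parameters used in the evaluation section.

# Hedged sketch of the computation model (6)-(11); constants are placeholders.
def local_cost(W, f_l, k=1e-11):
    t = W / f_l                                        # Eq. (6)
    return t, k * W * f_l ** 2                         # Eq. (7)

def edge_cost(W, f_f, alpha_f, beta_f, sigma=3.0):
    t = W / f_f                                        # Eq. (8)
    return t, (alpha_f * f_f ** sigma + beta_f) * t    # Eq. (9)

def cloud_cost(W, f_c, alpha_c, beta_c, sigma=3.0):
    t = W / f_c                                        # Eq. (10)
    return t, (alpha_c * f_c ** sigma + beta_c) * t    # Eq. (11)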

2.4 Dependency Constraints

Definition 1. Subtask's completion time: subtask n of mobile device m can only start when all its predecessor subtasks have been completed. The completion time of the nth subtask of mobile device m consists of two parts: the time spent to obtain the results of all its predecessor subtasks and the time spent on its own computation.

Definition 2. Energy cost to accomplish one subtask: it also consists of two parts: the energy spent on getting the results of the predecessor subtasks and the energy spent on its own execution.

Based on the above definitions, if subtask n of mobile device m is assigned to be executed locally, its completion time can be expressed as

TF^l_{m,n} = max_{k∈pre(n)} [ X^f_{m,k} t^t_{m,n} + X^c_{m,k} (t^t_{m,n} + t^r_{m,n}) ] + t^l_{m,n},    (12)

and the energy cost for local completion is

EF^l_{m,n} = Σ_{k∈pre(n)} [ X^f_{m,k} E^t_{m,n} + X^c_{m,k} (E^t_{m,n} + E^r_{m,n}) ] + E^l_{m,n}.    (13)

In (12) and (13), X^f_{m,k} X^c_{m,k} = 0. The notation pre(n) in (12) means all the predecessor subtasks of the nth subtask. In (12), the term X^f_{m,k} t^t_{m,n} is the delay to obtain the predecessor subtask's result of the nth subtask if the predecessor subtask of n is executed on the edge server. Similarly, the term X^c_{m,k} (t^t_{m,n} + t^r_{m,n})


is the delay to obtain the result if the predecessor subtask of n is accomplished on the cloud server. If subtask n of mobile device m is assigned to be executed on the edge server, the completion time of subtask n can be defined as

TF^f_{m,n} = max_{k∈pre(n)} [ X^m_{m,k} t^t_{m,n} + X^c_{m,k} t^r_{m,n} ] + t^f_{m,n},    (14)

where X^m_{m,k} is the predecessor subtask's assignment indicator on the mobile device: X^m_{m,k} = 1 means the kth subtask is computed on the local mobile device, and X^m_{m,k} = 0 otherwise. The term X^m_{m,k} t^t_{m,n} is the delay to transmit the result of the predecessor task from the mobile device to the edge server, while X^c_{m,k} t^r_{m,n} is the delay to send the prior result from the remote cloud to the edge server. Let EF^f_{m,n} be the energy cost for subtask n of device m executed on the edge server; similarly to (13), it can be defined as

EF^f_{m,n} = Σ_{k∈pre(n)} [ X^m_{m,k} E^t_{m,n} + X^c_{m,k} E^r_{m,n} ] + E^f_{m,n}.    (15)

Similarly to (12) and (14), if subtask n of mobile device m is assigned to be executed in the remote cloud, its completion time can be expressed as

TF^c_{m,n} = max_{k∈pre(n)} [ X^m_{m,k} (t^t_{m,n} + t^r_{m,n}) + X^f_{m,k} t^r_{m,n} ] + t^c_{m,n},    (16)

and the corresponding energy cost to complete the subtask on the remote cloud, EF^c_{m,n}, is

EF^c_{m,n} = Σ_{k∈pre(n)} [ X^m_{m,k} (E^t_{m,n} + E^r_{m,n}) + X^f_{m,k} E^r_{m,n} ] + E^c_{m,n}.    (17)

2.5

Utility Constraints

Next, we derive the utility constraint of the edge server and the time budget for the completion time. The utility of the edge server is

U^f_p = Σ_{m=1}^{M} Σ_{n=0}^{N} ( P^f X^f_{m,n} − E^r_{m,n} X^c_{m,n} ),    (18)

where U^f_p is the utility of the edge server and P^f is the service price of the edge server.

3

Problem Formulation

In this section, we present the problem formulation with the constraint of time budget and the utility constraint U^f_p. Firstly, the completion time of all tasks on mobile device m can be defined as

TF_m = Σ_{n=0}^{N} ( X^m_{m,n} TF^l_{m,n} + X^f_{m,n} TF^f_{m,n} + X^c_{m,n} TF^c_{m,n} ),    (19)


where TF^l_{m,n} is the task completion time of subtask n if it is executed locally, TF^f_{m,n} is the task completion time of subtask n if it is executed on the edge server, and TF^c_{m,n} is the task completion time of subtask n if it is executed on the remote cloud. The total energy consumption of one application, denoted as E_m, is

E_m = Σ_{n=0}^{N} ( EF^l_{m,n} X^m_{m,n} + EF^f_{m,n} X^f_{m,n} + EF^c_{m,n} X^c_{m,n} ),    (20)

where EF^l_{m,n} is the energy consumption of subtask n if it is executed on the mobile device, EF^f_{m,n} is the energy cost of subtask n if it is executed on the edge server, and EF^c_{m,n} is the energy cost of subtask n if it is executed on the remote cloud. In this work, the goal is to minimize the total energy consumption of tasks while meeting the completion time constraint. Meanwhile, the utility of the edge server U^f_p is guaranteed. The energy consumption minimization problem can thus be defined as:

OPT-1  obj: min E_m
C1: U^f_p > 0,
C2: TF_m < Budget_m,
C3: X^m_{m,n} ∈ {0, 1}, X^f_{m,n} ∈ {0, 1}, X^c_{m,n} ∈ {0, 1}, n ∈ [0, N], m ∈ [1, M],
C4: X^m_{m,n} + X^f_{m,n} + X^c_{m,n} = 1, n ∈ [0, N], m ∈ [1, M].

where constraint C1 is the utility constraint which guarantees the positive utility of the edge server, C2 is the task completion time budget, i.e., the delay constraint, C3 lists the binary constraints, and C4 is the unique-assignment constraint, which means that one subtask can only be executed at one place.

Theorem 1. The total task completion energy minimization problem for computation offloading in this study is NP-hard.

Proof. We transform the original problem depicted in OPT-1 and consider a special case in which the mobile device, edge server, and remote cloud server have the same configurations, which results in the same energy costs and execution times when executing tasks. Regarding each subtask as an item with a value and a weight, the value corresponds to the execution time while the weight corresponds to the energy cost. We then ignore the task dependency constraints between subtasks as well as constraint C1; C2 can then be viewed as the knapsack's value constraint. Therefore, the relaxed version of OPT-1 becomes a knapsack problem [15], which is NP-hard. Therefore, the original problem OPT-1 is also NP-hard, which concludes the proof.
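The following sketch shows how a candidate offloading policy could be checked against OPT-1. It assumes the per-subtask completion times and energies have already been evaluated via (12)–(17); the function and argument names are hypothetical, not part of the paper.

# Hedged sketch: evaluating a candidate policy against OPT-1.
# EF[n] and TF[n] map 'local'/'edge'/'cloud' to the per-subtask costs from (12)-(17);
# E_r[n] is the edge-to-cloud energy of subtask n and P_f the edge service price.
def evaluate_policy(X, EF, TF, E_r, P_f, budget):
    N = len(X)                       # X[n] in {'local','edge','cloud'} encodes C3/C4
    E_m = sum(EF[n][X[n]] for n in range(N))                    # Eq. (20)
    TF_m = sum(TF[n][X[n]] for n in range(N))                   # Eq. (19)
    U_pf = sum(P_f for n in range(N) if X[n] == 'edge') \
         - sum(E_r[n] for n in range(N) if X[n] == 'cloud')     # Eq. (18)
    feasible = (U_pf > 0) and (TF_m < budget)                   # C1 and C2
    return E_m, TF_m, U_pf, feasible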

4 Algorithms

4.1 Gain Method

Based on the above models and analysis, we first design a greedy method named Gain to minimize the energy consumption of mobile device m when executing its tasks. To acquire the minimum energy cost of all subtasks of an application on mobile device m, the minimum energy cost of subtask n is selected among EF^l_{m,n}, EF^f_{m,n}, and EF^c_{m,n}. This sub-procedure is shown in Lines 1 to 11 of Algorithm 1. Then, we iteratively adjust the initial offloading policy to satisfy the constraint on U^f_p and the completion time budget Budget_m. If the offloading policy does not satisfy the constraint on U^f_p, too many subtasks are executed on the remote cloud for the edge server to profit from serving mobile users. To satisfy the constraint on U^f_p, we must offload some subtasks from the remote cloud to the mobile device or to the edge server; the algorithm then chooses which subtask to offload. To obtain the minimum energy cost, we take the change in energy cost as the criterion to set the priority: the smaller the change in energy cost, the higher the priority. To meet the completion time budget, we compute the change in completion time and the change in energy cost for each offloading choice, and choose the offloading strategy that decreases the completion time while guaranteeing the minimum change in energy cost. Due to the utility constraint U^f_p, the offloading site for each subtask must be chosen carefully. If subtask n is assigned to the mobile device, the offloading choice must be from the mobile device to the edge server. If subtask n is assigned to the edge server, the offloading choice must be from the edge server to the mobile device. If subtask n is assigned to the remote cloud, the offloading choice can be either from the remote cloud to the edge server or from the remote cloud to the mobile device. The details of the Gain algorithm are depicted in Algorithm 1.

Theorem 2. The time complexity of the Gain algorithm is O(N).

Proof. In Algorithm 1, the time complexity of the subprocess from Line 1 to 12 is O(N), and the time complexity of the subprocess from Line 14 to 31 is O(N) because the number of adjustments will not exceed N. Hence the time complexity of the Gain algorithm is O(N).

Theorem 3. The approximation ratio of the Gain algorithm is (1 + ε).

Proof. Due to limited space, the proof is omitted.


Algorithm 1. Gain method for mobile device m
Input: tasks: a sequence of N subtasks of mobile device m and their execution order; W: the workload sizes of the subtasks; D: the data sizes of the subtasks; Budget_m: the completion time budget for the subtasks; pre: a 2-D array storing each subtask's predecessors;
Output: X^m: the policy of subtasks executed locally on the mobile device; X^f: the policy of subtasks executed on the edge server; X^c: the policy of subtasks executed on the remote cloud;
1: for n in tasks do
2:   compute EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n} by Equations (13), (15), (17)
3:   if min(EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n}) = EF^l_{m,n} then
4:     X^m_{m,n} ← 1, X^f_{m,n} ← 0, X^c_{m,n} ← 0
5:   end if
6:   if min(EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n}) = EF^f_{m,n} then
7:     X^m_{m,n} ← 0, X^f_{m,n} ← 1, X^c_{m,n} ← 0
8:   end if
9:   if min(EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n}) = EF^c_{m,n} then
10:    X^m_{m,n} ← 0, X^f_{m,n} ← 0, X^c_{m,n} ← 1
11:  end if
12: end for
13: compute U^f_p and TF_m
14: while U^f_p ≤ 0 or TF_m ≥ Budget_m do
15:   if U^f_p ≤ 0 then
16:     choose the subtask that brings about the minimum change in energy consumption when offloaded from the remote cloud to the edge server or from the remote cloud to the mobile device
17:   end if
18:   if TF_m ≥ Budget_m then
19:     for n = 0 → N do
20:       if X^m_{m,n} = 1 then
21:         compute the change in energy cost when offloading the subtask from the mobile device to the edge server
22:       end if
23:       if X^f_{m,n} = 1 then
24:         compute the change in energy cost when offloading the subtask from the edge server to the mobile device
25:       end if
26:       if X^c_{m,n} = 1 then
27:         compute the change in energy cost when offloading the subtask from the remote cloud to the mobile device or from the remote cloud to the edge server
28:       end if
29:       choose the offloading policy with the minimum change in energy cost that decreases the completion time
30:     end for
31:   end if
32: end while
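As an illustration, the initialisation phase of Algorithm 1 (Lines 1–12) can be sketched in Python as follows; the data structure EF below is hypothetical and the values are placeholders.

# Hedged sketch of Lines 1-12 of Algorithm 1: each subtask initially goes
# to the execution site with the smallest completion energy.
def gain_initialise(EF):
    """EF[n] maps 'local'/'edge'/'cloud' to EF^l, EF^f, EF^c of subtask n."""
    policy = {}
    for n, costs in enumerate(EF):
        policy[n] = min(costs, key=costs.get)   # cheapest site for subtask n
    return policy

# example with hypothetical per-subtask energies (kJ)
EF = [{'local': 0.9, 'edge': 0.4, 'cloud': 0.6},
      {'local': 0.2, 'edge': 0.5, 'cloud': 0.7}]
print(gain_initialise(EF))   # {0: 'edge', 1: 'local'}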

5 Performance Evaluation

5.1 Simulation Setup

To study the performance of the proposed algorithms, we implement them on a high-performance workstation with an Intel i7 processor running at 3.9 GHz and 8 GB of RAM. We use Python 3.6 [1] to simulate the offloading of subtasks and evaluate the algorithms in terms of running time, application completion time, and energy cost over 100 repeated trials. In order to simulate real-world tasks, we use a typical task graph as shown in Fig. 2, in which dependency constraints exist between subtasks and determine the execution order. Based on the task graph, one possible execution sequence for the subtasks is [0, 1, 2, 3, 4, 5, 6, 7].

Fig. 2. The task graph.

We set 8 subtasks in an application with evenly distributed workloads and evenly distributed data sizes. The noise power between the edge server and the mobile device is set as σ² = 1, the wireless uplink bandwidth is set as W = 2 Mbps, and the wireless downlink bandwidth is set as W = 10 Mbps [10]. The uplink bandwidth between the edge server and the remote cloud is W = 1024 Mbps and the corresponding downlink bandwidth is W = 8192 Mbps [10]. The CPU frequency of the mobile device is f^m = 5 × 10^6 Hz, while the CPU frequency of the edge server is f^f = 2 × 10^9 Hz [18]. The CPU frequency of the remote cloud is set as f^c = 4 × 10^9 Hz [18]. The system parameters are α_f = 0.1, β_f = 0.1, α_c = 0.2, β_c = 0.2 [18]. The communication chip power of the mobile device is 0.1 watt [16], the communication chip power of the edge server is 1 watt [16], and the communication chip power of the remote cloud is 3 watt [16].

5.2 Simulation Results

Figure 3 shows the comparison of Gain, Brute Force, and SA in terms of running time with different workload sizes. From Fig. 3, we observe that the running time of Brute Force ranges from 7.54 s to 7.68 s, while the running time of Gain is less than 0.02 s. This is because Brute Force exhaustively searches all solutions, and the solution space of the problem is N³, where N denotes the number of subtasks. From Fig. 3, we can also observe that the running times of the three algorithms show almost no fluctuation, which indicates the robustness of the algorithms. For example, for Brute Force, the maximum running time is 7.66 s and the minimum running time is 7.547 s, so the difference between the maximum running time and


the minimum running time is only 0.12 s. In Gain, the maximum running time is 0.0015 s and the minimum running time is 0.001 s.

Fig. 3. The comparisons of three algorithms’ executing time with different workload size.

Fig. 4. The energy cost of Gain and Brute Force with the change of workload size.

Figure 4 shows the comparison of Gain and Brute Force in terms of energy cost with different workload sizes. In Fig. 4, Brute Force always obtains the minimum energy cost compared with the other algorithm. From the comparison between Brute Force and Gain, we observe that Gain achieves the same completion-time-budget performance as the optimal result with only 31% extra energy cost on average. The energy cost of Gain approximates the optimal result, especially when the workload sizes are 87.5 M and 262.5 M. In Fig. 4, when the workload size grows from 43.75 M to 87.5 M, the energy cost increases by 0.06 kJ, but the energy cost falls by 0.04 kJ when the workload size grows from 87.5 M to 131.25 M due to the constraint of task dependency. From Fig. 4, the trend of the energy consumption curve of Gain is almost the same as that of Brute Force. Figure 5 shows the comparison of the application completion times of Gain and Brute Force. The completion time budget Budget can be represented as in (21), where W denotes the workload matrix and N denotes the number of subtasks of mobile device m:

Budget = 0.5 × Σ_{n=0}^{N} W_{m,n}.    (21)

From Fig. 5, we observe that the completion times of Gain and Brute Force are always lower than the completion time budget. Therefore, Gain and Brute Force always obtain efficient solutions which satisfy the completion budget. When the workload size increases from 43.7 M to 206.25 M, the completion time of Gain also increases from 1.05 s to 6.22 s, because the greater the workload is, the longer the completion time of Gain becomes. In Fig. 6, we can see that the completion time of Gain occupies 40% to 80% of the completion time budget, while the completion time of Brute Force, which is optimal, occupies 22% to 80% of the completion time budget.

Fig. 5. The comparison of application completion time of Gain, Brute Force, and Budget with different workload sizes.

Fig. 6. The comparison of application completion time as a percentage of Budget.

6 Conclusions

This paper has addressed novel computation offloading schemes with device, edge server, and remote cloud collaboration. We have formulated the offloading problem as an energy cost minimization problem under an application completion time budget and the edge server's profit constraint. The problem is NP-hard. We have designed the Gain algorithm to minimize the energy cost while respecting the constraints on completion time, utility, and task dependency. From extensive simulations, we obtain the following findings. Firstly, the implementation shows that in a three-tier structure comprising mobile device, edge server, and remote cloud, the edge server plays a very important role in reducing the energy consumption during task execution. Secondly, the proposed greedy algorithm can achieve the same application completion time performance as the Brute Force optimal algorithm with only 31% extra energy cost on average. In the future, we will devise online algorithms by modifying the initialization process of each algorithm and explore the energy cost minimization problem with a completion time constraint for each subtask.

Acknowledgment. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702115 and 61672171, Natural Science Foundation of Guangdong, China under Grant No. 2018B030311007, and Major R&D Project of Educational Commission of Guangdong under Grant No. 2016KZDXM052. This work was also supported by China Postdoctoral Science Foundation Fund under Grant No. 2017M622632. The corresponding author is Jigang Wu ([email protected]).


References 1. Aksimentiev, A., et al.: Python for scientific computing (2007) 2. Barbera, M.V., Kosta, S., Mei, A., Stefa, J.: To offload or not to offload? the bandwidth and energy costs of mobile cloud computing. In: 2013 Proceedings IEEE INFOCOM, pp. 1285–1293. IEEE (2013) 3. Bi, S., Zhang, Y.J.A.: Computation rate maximization for wireless powered mobileedge computing with binary computation offloading. IEEE Trans. Wirel. Commun. PP(99), 1–14 (2018). https://doi.org/10.1109/TWC.2018.2821664 4. Chen, L., Zhou, S., Xu, J.: Energy efficient mobile edge computing in dense cellular networks. In: 2017 IEEE International Conference on Communications (ICC), pp. 1–6. IEEE (2017) 5. Chen, L., Wu, J., Dai, H.N., Huang, X.: BRAINS: joint bandwidth-relay allocation in multi-homing cooperative D2D networks. IEEE Trans. Veh. Technol. 67, 5387– 5398 (2018). https://doi.org/10.1109/TVT.2018.2799970 6. Chen, L., Wu, J., Zhou, G., Ma, L.: QUICK: QoS-guaranteed efficient cloudlet placement in wireless metropolitan area networks. J. Supercomput. 74, 1–23 (2018). https://doi.org/10.1007/s11227-018-2412-8 7. Chen, M.H., Dong, M., Liang, B.: Joint offloading decision and resource allocation for mobile cloud with computing access point. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3516–3520 (2016) 8. Chen, M.H., Liang, B., Dong, M.: Joint offloading and resource allocation for computation and communication in mobile cloud with computing access point. In: INFOCOM 2017 IEEE Conference on Computer Communications, pp. 1–9. IEEE (2017) 9. Dhillon, H.S., Ganti, R.K., Baccelli, F., Andrews, J.G.: Modeling and analysis of K-Tier downlink heterogeneous cellular networks. IEEE J. Sel. Areas Commun. 30(3), 550–560 (2012) 10. Ding, L., Melodia, T., Batalama, S.N., Matyjas, J.D.: Distributed routing, relay selection, and spectrum allocation in cognitive and cooperative ad hoc networks. In: Sensor Mesh and Ad Hoc Communications and Networks, pp. 1–9 (2010) 11. Dinh, T.Q., Tang, J., La, Q.D., Quek, T.Q.S.: Offloading in mobile edge computing: task allocation and computational frequency scaling. IEEE Trans. Commun. 65(8), 3571–3584 (2017) 12. Guo, S., Xiao, B., Yang, Y., Yang, Y.: Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing. In: IEEE INFOCOM 2016 the IEEE International Conference on Computer Communications, pp. 1–9 (2016) 13. Hu, Y.C., Patel, M., Sabella, D., Sprecher, N., Young, V.: Mobile edge computing. A key technology towards 5G. ETSI White Paper 11(11), 1–16 (2015) 14. Kao, Y.H., Krishnamachari, B., Ra, M.R., Fan, B.: Hermes: Latency optimal task assignment for resource-constrained mobile computing. In: IEEE Conference on Computer Communications (ICC), pp. 1894–1902 (2015) 15. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack Problems. Springer, Heidelberg (2004) 16. Liu, P.J., Lo, Y.K., Chiu, H.J., Chen, Y.J.E.: Dual-current pump module for transient improvement of step-down DC-DC converters. IEEE Trans. Power Electr. 24(4), 985–990 (2009) 17. Lyu, X., Tian, H., Ni, W., Zhang, Y., Zhang, P., Liu, R.P.: Energy-efficient admission of delay-sensitive tasks for mobile edge computing. IEEE Trans. Commun. 66, 2603–2616 (2018). https://doi.org/10.1109/TCOMM.2018.2799937


18. Park, C.B., Park, B.S., Uhm, H.J., Choi, H., Kim, H.S.: IEEE 802.15.4 based service configuration mechanism for smartphone. IEEE Trans. Consum. Electr. 56(3), 2004–2010 (2010). https://doi.org/10.1109/TCE.2010.5606358 19. Rao, L., Liu, X., Ilic, M.D., Liu, J.: Distributed coordination of internet data centers under multiregional electricity markets. Proc. IEEE 100(1, SI), 269–282 (2012). https://doi.org/10.1109/JPROC.2011.2161236 20. Zhang, L., et al.: Primary channel gain estimation for spectrum sharing in cognitive radio networks. IEEE Trans. Commun. PP(99), 1 (2016)

Quantitatively Investigating Multihop Localization Errors in Regular 2-D Sensor Networks

Bing Jia, Baoqi Huang (B), Tao Zhou, and Wuyungerile Li

Inner Mongolia A.R. Key Laboratory of Wireless Networking and Mobile Computing, Hohhot 010021, China
College of Computer Science, Inner Mongolia University, Hohhot 010021, China
[email protected]

Abstract. In practice, a wireless sensor network normally includes a small portion of nodes with known locations, termed anchors, while the other nodes, whose locations are unknown and which are termed sensors, have to be localized through dedicated algorithms. Since not every sensor is directly neighboring with anchors, sensor locations are determined in a multi-hop localization manner, and therein the localization errors of sensors rise with their minimal hop count to anchors, a phenomenon termed error propagation. Grasping the rule of error propagation is critical to designing and developing both localization algorithms and various applications. In this paper, we focus on quantitatively measuring how the localization errors vary across different sensors. To do so, regular 2-dimensional wireless sensor networks are taken into consideration, and formulae with respect to different sensors and different anchor placements are obtained. Simulations are conducted to validate these formulae and analyze the characteristics of error propagation.

Keywords: Localization errors · Wireless sensor network · Multihop · Error propagation

1 Introduction

Supported by the National Natural Science Foundation of China under Grants 41761086, 41401519, 61461037, 61761035 and 61661041, the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No. 2016YFB0502102, the Natural Science Foundation of Inner Mongolia Autonomous Region of China under Grant 2017JQ09, and the Grassland Elite Project of the Inner Mongolia Autonomous Region under Grant CYYC5016.

In wireless sensor networks (WSNs), sensor locations are a key prerequisite for many applications and techniques, such as reporting the geographic origin of events, assisting in target tracking, and achieving geography-aware routing. Therefore, considerable effort has been invested in the development of localization systems [1–6]. Range-based sensor localization [7] is the problem of identifying the locations of sensor nodes, or simply sensors, given estimates of the


distances between them, known as range measurements. The basic range-based localization algorithms include trilateration which in 2-dimensional (2-D) space employs at least three range measurements from non-collinear nodes at known locations, termed anchors, to localize a sensor. Due to the existence of noises in range measurements, only location estimates as opposed to exact positions can be derived. If not every sensor can measure its distances to sufficient anchors, already localized sensors must be used as pseudo-anchors to help their neighboring sensors become localized; this process is called multihop sensor localization. As a result, localization errors of pseudo-anchors propagate into localization results of later localized sensors, a phenomenon which is called error propagation. In the literature, discussions in [8,9] have raised series concerns about error propagation. In particular, error propagation in regular 1-dimensional WSNs was examined by obtaining the closed-form Cram´er-Rao Lower Bound (CRLB) in [10,11], and some key conclusions were drawn on how fast the error is propagated and how anchor placement affects error propagation have been reported. Moreover, error propagation in 2-D random WSNs was studied through a semiquantitative approach in [9], in the sense that the influences of the sensor density and the hop count from a sensor to anchors have been investigated based on certain approximations. As such, it is still challenging to precisely measure how localization errors are propagated in 2-D WSNs. As a preliminary study, we shall investigate the phenomenon of error propagation through exact formulation of localization errors in regular 2-D WSNs. To be specific, we firstly discuss the error propagation in a specific bilateration WSN, in which nodes with odd labels and nodes with even labels are regularly deployed with equal spaces in two parallel and horizontal straight lines, and obtain the formulae for localization errors through a linearized Maximum Likelihood Estimator (MLE); then, we generalize this specific bilateration WSN to a new bilateration WSN by integrally moving one horizontal line towards one horizontal direction, and extend the error formulae as well. These formulae accurately describe how the localization errors in the considered bilateration WSNs increase with the corresponding hop count to anchors increasing, and also demonstrate that different anchor placement will result in dramatically different localization performance. Finally, we explore the error propagation of a regular 2-D sensor network consisting of multiple bilateration networks, and formulate localization errors through CRLB which essentially equals to perturbed Toeplitz block matrices [12]. Simulation results are presented to validate the formulae obtained in this paper and illustrate the characteristics of error propagation in regular 2-D WSNs. The remainder of this paper is organized as follows. Section 2 establishes the problem model of a specific bilateration network WSN (SBWSN) and give the solving process of multihop localization errors. Section 3 presents the solving process of multihop localization errors in general bilateration WSN (GBWSN) with both the two anchors and the four anchors. Section 4 discusses the rate of error propagation of a regular 2-D WSN and Sect. 5 concludes the paper.

2 The Specific Bilateration WSN

2.1 The SBWSN with Anchors on One Side

The SBWSN with anchors on one side is illustrated in Fig. 1. As can be seen, nodes with odd labels are located in the same horizontal straight line and so are nodes with even labels; nodes 1, 2 are in the same vertical straight line, so on and so forth; the left-most two nodes are anchors, and all the others are sensors, edges denote range measurement between two nodes. Define the following notations.

Fig. 1. A bilateration WSN with two anchors on one side.

– n is an even integer;
– nodes are labeled from 1 to n in order;
– the noise in range measurements is additive independent Gaussian with mean zero and standard deviation 1, denoted by e_i (0 < i < 2n + 1);
– the true location of node i is (x_i, y_i), and the uncertainty or error in node i's estimated location is U_i = [u_{x_i}, u_{y_i}]^T, where the superscript T denotes transposition; specifically, U_1 = U_2 = 0;
– the covariance matrix for the coordinates of nodes 2i+1 and 2i+2 (0 < i < n/2) is a 4 × 4 matrix, denoted by Q_i:

Q_i = Cov( [U_{2i+1}^T, U_{2i+2}^T]^T ).

At first, according to the structure of the SBWSN, we define

– two 2 × 2 matrices J_1 and J_2:

J_1 = \begin{pmatrix} 1 & 0 \\ \cos\alpha & \sin\alpha \end{pmatrix},  J_2 = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ 1 & 0 \end{pmatrix}

– two 2 × 4 matrices K_1 and K_2:

K_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & \cos\alpha & \sin\alpha \end{pmatrix},  K_2 = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}

Based on the linearized MLE adopted in [9], the localization errors of nodes 2i − 1 and 2i are

U_{2i-1} = (J_1^T J_1)^{-1} J_1^T \begin{pmatrix} e_{4i-3} \\ e_{4i-2} \end{pmatrix} + (J_1^T J_1)^{-1} J_1^T K_1 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix},    (1)

U_{2i} = (J_2^T J_2)^{-1} J_2^T \begin{pmatrix} e_{4i-1} \\ e_{4i} \end{pmatrix} + (J_2^T J_2)^{-1} J_2^T K_2 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix}.    (2)

Furthermore, they can be simplified as

U_{2i-1} = J_1^{-1} \begin{pmatrix} e_{4i-3} \\ e_{4i-2} \end{pmatrix} + J_1^{-1} K_1 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix},    (3)

U_{2i} = J_2^{-1} \begin{pmatrix} e_{4i-1} \\ e_{4i} \end{pmatrix} + J_2^{-1} K_2 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix}.    (4)


Then, the covariance matrix is

Q_i = \begin{pmatrix} (J_1^T J_1)^{-1} & 0 \\ 0 & (J_2^T J_2)^{-1} \end{pmatrix} + \begin{pmatrix} J_1^{-1} K_1 \\ J_2^{-1} K_2 \end{pmatrix} Q_{i-1} \begin{pmatrix} J_1^{-1} K_1 \\ J_2^{-1} K_2 \end{pmatrix}^T    (5)

    = A A^T + B Q_{i-1} B^T    (6)

    = \sum_{j=0}^{i-1} B^j A A^T (B^j)^T    (7)

    = Q_{i-1} + B^{i-1} A A^T (B^{i-1})^T,    (8)

where Q_0 = 0, B^0 is an identity matrix, and

– A = \begin{pmatrix} J_1^{-1} & 0 \\ 0 & J_2^{-1} \end{pmatrix}
– B = \begin{pmatrix} J_1^{-1} K_1 \\ J_2^{-1} K_2 \end{pmatrix}


Regarding the matrices A and B, we can obtain the following equations (i ≥ 0):

B^{2i+1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -(2i+1)\cot\alpha & 0 & (2i+1)\cot\alpha & 1 \\ 0 & 0 & 1 & 0 \\ -(2i+1)\cot\alpha & 1 & (2i+1)\cot\alpha & 0 \end{pmatrix}    (9)

B^{2i} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -2i\cot\alpha & 1 & 2i\cot\alpha & 0 \\ 0 & 0 & 1 & 0 \\ -2i\cot\alpha & 0 & 2i\cot\alpha & 1 \end{pmatrix}    (10)

B^{2i+1} A A^T (B^{2i+1})^T = \begin{pmatrix} 1 & -(2i+1)\cot\alpha & 0 & -2(i+1)\cot\alpha \\ -(2i+1)\cot\alpha & (5+12i+8i^2)\cot^2\alpha + \csc^2\alpha & 2(i+1)\cot\alpha & 4(i+1)(2i+1)\cot^2\alpha \\ 0 & 2(i+1)\cot\alpha & 1 & (2i+1)\cot\alpha \\ -2(i+1)\cot\alpha & 4(i+1)(2i+1)\cot^2\alpha & (2i+1)\cot\alpha & (5+12i+8i^2)\cot^2\alpha + \csc^2\alpha \end{pmatrix}    (11)

B^{2i} A A^T (B^{2i})^T = \begin{pmatrix} 1 & -(2i+1)\cot\alpha & 0 & -2i\cot\alpha \\ -(2i+1)\cot\alpha & (1+4i+8i^2)\cot^2\alpha + \csc^2\alpha & 2i\cot\alpha & 4i(2i+1)\cot^2\alpha \\ 0 & 2i\cot\alpha & 1 & (2i+1)\cot\alpha \\ -2i\cot\alpha & 4i(2i+1)\cot^2\alpha & (2i+1)\cot\alpha & (1+4i+8i^2)\cot^2\alpha + \csc^2\alpha \end{pmatrix}    (12)

Since we are interested in the diagonal entries of Q_i, we only investigate the diagonal entries of the above resulting matrices. To distinguish the even and odd cases, we consider Q_{2i} and Q_{2i+1} respectively:

(Q_{2i})_{11} = (Q_{2i})_{33} = 2i,
(Q_{2i})_{22} = (Q_{2i})_{44} = ( (16/3) i^3 + (2/3) i ) \cot^2\alpha + 2i \csc^2\alpha,
(Q_{2i+1})_{11} = (Q_{2i+1})_{33} = 2i + 1,
(Q_{2i+1})_{22} = (Q_{2i+1})_{44} = ( (16/3) i^3 + 8 i^2 + (14/3) i + 1 ) \cot^2\alpha + (2i+1) \csc^2\alpha.

The formulae can be unified as

(Q_i)_{11} = (Q_i)_{33} = i,
(Q_i)_{22} = (Q_i)_{44} = ( (2/3) i^3 + (4/3) i ) \cot^2\alpha + i,

where 0 ≤ i < n/2. As such, the Mean Squared Error (MSE) can be formulated as

MSE(U_i) = ( (2/3) i^3 + (4/3) i ) \cot^2\alpha + 2i.    (13)
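The recursion (6) and the closed form (13) can be checked against each other numerically. The sketch below simply evaluates both sides for an arbitrary angle; it is an illustration under the definitions above, not code from the paper.

import numpy as np

# Hedged sketch: iterate the covariance recursion (6) and compare the MSE
# of the node at hop i with the closed form (13); alpha is arbitrary.
alpha = 3 * np.pi / 8
c, s = np.cos(alpha), np.sin(alpha)
J1 = np.array([[1.0, 0.0], [c, s]])
J2 = np.array([[c, -s], [1.0, 0.0]])
K1 = np.array([[1.0, 0, 0, 0], [0, 0, c, s]])
K2 = np.array([[c, -s, 0, 0], [0, 0, 1.0, 0]])

A = np.block([[np.linalg.inv(J1), np.zeros((2, 2))],
              [np.zeros((2, 2)), np.linalg.inv(J2)]])
B = np.vstack([np.linalg.inv(J1) @ K1, np.linalg.inv(J2) @ K2])

Q = np.zeros((4, 4))
cot2 = (c / s) ** 2
for i in range(1, 8):
    Q = A @ A.T + B @ Q @ B.T                              # Eq. (6)
    mse_recursion = Q[0, 0] + Q[1, 1]                      # node at hop i
    mse_closed = (2 * i**3 / 3 + 4 * i / 3) * cot2 + 2 * i # Eq. (13)
    print(i, round(mse_recursion, 3), round(mse_closed, 3))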


Evidently, the localization error measured by the MSE is propagated at the speed of Θ(i³), where i denotes the hop count from a sensor to the anchors in this regular scenario.

2.2 Placing Anchors on Both Sides of the SBWSN

Suppose that another pair of anchors are placed at the right-most side of the bilateration WSN as shown in Fig. 2. However, the localization procedure becomes complicated in comparison with the aforementioned SBWSN with the anchors only at the left-most side. It is straightforward that a centralized localization algorithm will be preferred because the information can be sufficiently used, but the centralized implementation suffers from communication overheads and time delay, especially in large-scale wireless sensor networks. Therefore, we adopt a simple approach by independently performing two localization procedures at first, each of which is initialized by the pair of anchors at one side of the bilateration network and then fuse the two location estimates at each sensor through the weighted average algorithm to produce the final location estimate.

Fig. 2. A SBWSN with four anchors.

Specifically, given node i with i being odd, its location estimate from the localization procedure initialized by the left-most anchors is Ui , and evidently, the location estimate by the right-most anchors equals to Un−i (Un+2−i provided i is even). Then, the final location estimate can be formulated as Vi = wi Ui + (1 − wi )Un−i

(14)

where wi is the weight. Then, the MSE is M SE(Vi ) = wi2 M SE(Ui ) + (1 − wi )2 M SE(Un−i )

(15)

It is noticeable that different weights will result in dramatically different error characteristics. In order to efficiently fuse two location estimates, the more


accurate a candidate location estimate is, the larger its weight should be. Therefore, we let w_i be MSE(U_{n−i}) / (MSE(U_i) + MSE(U_{n−i})), and then

MSE(V_i) = MSE(U_i) MSE(U_{n−i}) / (MSE(U_i) + MSE(U_{n−i})) = MSE(U_i) / (1 + MSE(U_i)/MSE(U_{n−i})).    (16)

Obviously, the MSE of the fused location estimate is smaller than those of both separate location estimates. Consequently, the speed of error propagation decreases as well.
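A minimal numerical check of the fusion rule (14)–(16), with hypothetical MSE values for the two separately obtained estimates:

# Hedged sketch of the fused MSE with the inverse-MSE weight of (16).
def fused_mse(mse_left, mse_right):
    w = mse_right / (mse_left + mse_right)            # weight of the left estimate
    return w**2 * mse_left + (1 - w)**2 * mse_right   # Eq. (15) with the chosen w

# equals mse_left * mse_right / (mse_left + mse_right), i.e. Eq. (16)
print(fused_mse(4.0, 12.0))   # 3.0, smaller than either individual MSE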

2.3 Simulation Results

We conduct simulations to validate the results on error propagation with respect to different hop counts using two anchors and four anchors in the bilateration WSN, with α being one of 5π/16, 3π/8, and π/4. The simulation results are plotted in Fig. 3. As can be seen, the angle α has a serious impact on error propagation in both cases, and the smaller the angle is, the faster the error is propagated. Moreover, placing anchors on both sides dramatically reduces localization errors in comparison with placing anchors on one side.

Fig. 3. The MSE with respect to different hops in the SBWSN.

3 A General Bilateration WSN

3.1 Placing Anchors on One Side of the GBWSN

More generally, we generalize the angles by appointing α and β in the bilateration WSN as a GBWSN, which is illustrated in Fig. 4. The left-most two nodes are anchors, and all others are sensors and edges denote range measurement between two nodes. In the network, nodes with odd labels are located in the same horizontal straight line and so are nodes with even labels; nodes 1, 2 are


Fig. 4. A GBWSN with two anchors.

not required to be in the same vertical straight line, and so forth, with different angles α and β. The localization errors can be formulated as

(Q_i)_{11} = (Q_i)_{33} = i,
(Q_i)_{22} + (Q_i)_{44} = ( (1/3) i^3 − (1/3) i )(\cot\alpha + \cot\beta)^2 + 2i(1 + \cot^2\alpha + \cot^2\beta).

3.2 Simulation Results

We conduct an experiment to analyze the localization error with respect to different hops using two anchors and four anchors in the general bilateration network, when α and β are set as (5π/16, π/4), (3π/8, π/4), and (3π/8, 5π/16). The result is shown in Fig. 5.

Fig. 5. The MSE with respect to different hops in the GBWSN.

4 A Regular 2-D WSN Consisting of Multiple Bilateration WSNs

4.1 The Problem Model

Supposing a regular 2-D network as illustrated in Fig. 6, mn nodes are placed at its mn corners, and the edge between any pair of nodes denotes a range measurement with independent additive Gaussian noise, N (0, σ 2 ).

Fig. 6. A grid.

4.2

The Rate of Error Propagation

The Fisher Information Matrix (FIM) of this sensor network, denoted J, is a 2mn × 2mn square matrix and can be formulated as

J = \frac{1}{\sigma^2} \begin{pmatrix} A' & B & & & 0 \\ B & A & B & & \\ & B & A & \ddots & \\ & & \ddots & \ddots & B \\ 0 & & & B & A' \end{pmatrix}


where A, A', and B are 2m × 2m matrices. A is a symmetric tridiagonal matrix whose odd-indexed (x-coordinate) diagonal entries are 1, 2, 2, . . . , 2, 1, whose even-indexed (y-coordinate) diagonal entries all equal 2, and whose only non-zero off-diagonal entries are −1, coupling the x-coordinates of horizontally adjacent nodes. A' has the same structure as A except that its y-coordinate diagonal entries all equal 1. B is the diagonal matrix diag(0, −1, 0, −1, . . . , 0, −1). Obviously, J is an n × n symmetric tridiagonal block matrix with block size 2m × 2m. Each block is associated with one row in the network. Moreover, A and A' are also symmetric tridiagonal block matrices with block size 2 × 2, while B is a diagonal matrix. In J, each node is relevant to two adjacent rows and columns. If we want some nodes to be anchors, we just need to eliminate the columns and rows associated with these nodes. For example, if all of the bottom and top rows are anchors, we obtain:




J' = \frac{1}{\sigma^2} \begin{pmatrix} A & B^T & & & 0 \\ B & A & B^T & & \\ & B & A & \ddots & \\ & & \ddots & \ddots & B^T \\ 0 & & & B & A \end{pmatrix}

4.3 Simulation Results

Since J' is a tridiagonal block Toeplitz matrix, a number of relevant results are available in the literature, e.g., [12,13]. To obtain the inverse of J', we first need to invert A, because A^{-1} is frequently used when inverting J'. Due to the simple structure of A, it is not hard to compute A^{-1}; however, computing the inverse of J' is a difficult problem. For simplicity, we let p = m − 2 and q = n − 2 and moreover let σ = 1. Figure 7 shows the inverse of J' (i.e., the CRLB) for the case of p = 10 and q = 10.
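For small networks the inverse can simply be computed numerically. The sketch below assembles a generic symmetric block-tridiagonal matrix from given blocks and reads the per-node CRLB from its inverse; the blocks passed in are placeholders and should be instantiated with the A and B defined above.

import numpy as np

# Hedged sketch: build a symmetric block-tridiagonal matrix such as J'
# and obtain the per-node CRLB (sum of x- and y-variances) from its inverse.
def block_tridiag(diag_blocks, off_block):
    n = len(diag_blocks)
    d = diag_blocks[0].shape[0]
    J = np.zeros((n * d, n * d))
    for i, Dblk in enumerate(diag_blocks):
        J[i*d:(i+1)*d, i*d:(i+1)*d] = Dblk
        if i + 1 < n:
            J[i*d:(i+1)*d, (i+1)*d:(i+2)*d] = off_block.T
            J[(i+1)*d:(i+2)*d, i*d:(i+1)*d] = off_block
    return J

def crlb_per_node(J, sigma2=1.0):
    cov = sigma2 * np.linalg.inv(J)
    return cov.diagonal().reshape(-1, 2).sum(axis=1)   # two coordinates per node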

Fig. 7. The CRLB for a 10 × 10 regular 2-D WSN.


As can be seen from Fig. 7(a), with respect to different column numbers, the CRLB shows the same tendency across different row numbers; similarly, in Fig. 7(b), with respect to different row numbers, the CRLB shows the same tendency across different column numbers. Therefore, a subgraph with n = 2 (or m = 2), i.e., a bilateration network, can already reflect the error propagation in a regular 2-D WSN.

5

Conclusion

Since sensor locations in a WSN are generally determined in a multi-hop localization manner, in this paper we investigate the error propagation problem in regular 2-D WSNs by quantitatively measuring how the localization errors vary across different sensors. Furthermore, formulae with respect to different angles, numbers of hops, and different anchor placements are obtained. Simulations are conducted to validate these formulae and analyze the characteristics of error propagation.

References 1. Albowicz, J., Chen, A., Zhang, L.: Recursive position estimation in sensor networks. In: 2001 Ninth International Conference on Network Protocols, pp. 35–41. IEEE (2001) 2. Huang, B., Yu, C., Anderson, B., Mao, G.: Estimating distances via connectivity in wireless sensor networks. Wirel. Commun. Mob. Comput. 14(5), 541–556 (2014) 3. Liu, Z., Luo, D., Li, J., Chen, X., Jia, C.: N-mobishare: new privacy-preserving location-sharing system for mobile online social networks. Int. J. Comput. Math. 93(2), 384–400 (2013) 4. Han, G., Jiang, J., Zhang, C., Duong, T.Q., Guizani, M., Karagiannidis, G.K.: A survey on mobile anchor node assisted localization in wireless sensor networks. IEEE Commun. Surv. Tutor. 18(3), 2220–2243 (2016) 5. Huang, Y., Li, B., Liang, S., Ma, H., Liu, Z.: Generalized format-preserving encryption for character data. J. Netw. 7(8), 1239–1244 (2016) 6. Liu, Z., Li, T., Li, P., Jia, C., Li, J.: Verifiable searchable encryption with aggregate keys for data sharing system. Futur. Gener. Comput. Syst. 78, 778–788 (2017) 7. Dil, B., Dulman, S., Havinga, P.: Range-based localization in mobile sensor networks. In: R¨ omer, K., Karl, H., Mattern, F. (eds.) EWSN 2006. LNCS, vol. 3868, pp. 164–179. Springer, Heidelberg (2006). https://doi.org/10.1007/11669463 14 8. Liu, J., Zhang, Y., Zhao, F.: Robust distributed node localization with error management. In: Proceedings of the 7th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 250–261. ACM (2006) 9. Huang, B., Yu, C., Anderson, B.: Understanding error propagation in multi-hop sensor network localization. IEEE Trans. Ind. Electron. 60(12), 5811–5819 (2013) 10. Huang, B., Yu, C., Anderson, B.D.: Error propagation in sensor network localization with regular topologies. In: Global Telecommunications Conference, GLOBECOM 2009, pp. 1–6. IEEE (2009)


11. Huang, B., Yu, C., Anderson, B.D.: Analyzing error propagation in range-based multihop sensor localization. In: Proceedings of the 48th IEEE Conference on Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference, CDC/CCC 2009, pp. 865–870. IEEE (2009) 12. Akaike, H.: Block toeplitz matrix inversion. SIAM J. Appl. Math. 24(2), 234–241 (1973) 13. Meurant, G.: A review on the inverse of symmetric tridiagonal and block tridiagonal matrices. SIAM J. Matrix Anal. Appl. 13(3), 707–728 (1992)

Optimizing WiFi AP Placement for Both Localization and Coverage

Yu Tian, Baoqi Huang (B), Bing Jia, and Long Zhao

Inner Mongolia A.R. Key Laboratory of Wireless Networking and Mobile Computing, Hohhot 010021, China
College of Computer Science, Inner Mongolia University, Hohhot 010021, China
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
[email protected]

Abstract. Nowadays, WiFi infrastructures and WiFi-enabled mobile devices have become ubiquitous in our daily lives, and are promising for providing both network services and indoor positioning and navigation services due to their simplicity and low cost. It is evident that AP placement is critical to both localization and network coverage, so it is helpful to find the optimal AP placement scheme in terms of both localization and coverage. This paper tackles this problem by leveraging the widely used Cramer-Rao lower bound (CRLB) and the heuristic genetic algorithm to develop an efficient AP placement optimization method. To be specific, the CRLB is used as the metric for localization and a multiple-degree criterion is defined as the metric for coverage, which is incorporated into the fitness function of the genetic algorithm. Furthermore, instead of using the ideal log distance path loss (LDPL) model, the more practical Motley-Keenan model is adopted to reflect the influences of obstacles which are widespread in indoor environments. Finally, extensive simulations are conducted, and comparisons between the proposed method and three other popular methods confirm the efficiency and effectiveness of the proposed method.

Keywords: WiFi AP · Localization · Coverage · Genetic algorithm · Cramer-Rao lower bound

1 Introduction

Supported by the National Natural Science Foundation of China under Grants 61461037, 41761086 and 61761035, the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No. 2016YFB0502102, the Natural Science Foundation of Inner Mongolia Autonomous Region of China under Grant 2017JQ09, and the Grassland Elite Project of the Inner Mongolia Autonomous Region under Grant CYYC5016.

Recently, with the development of wireless networks and the popularity of mobile intelligent devices, wireless indoor localization has attained much attention in


both academia and industries, with the result that various indoor localization techniques and other related research such as location-aware services and location privacy protection have been reported [1–5]. Therein, WiFi based localization is most promising due to its low-cost and existing infrastructures and ubiquitous client devices. In particular, due to its simplicity and tolerance to pervasive multipath effects in indoor environments, WiFi fingerprint-based localization has been widely studied [6,7]. WiFi fingerprint-based localization involves two steps, namely offline site survey and online localization. In the offline site survey, a radio map is constructed by collecting received signal strength (RSS) measurements from surrounding APs at different locations as fingerprints [8]. In the online localization, the RSS measurement collected in real time is compared with the fingerprints in the radio map to select one or several most probable fingerprints, such that the final location estimate can be returned [9,10]. It has been shown that the radio map plays a vital role in both localization accuracy and efficiency, so that many studies have been carried out to improve the construction of radio maps [11–13]. An important and effective approach is developed by optimizing AP placement to improve localization accuracy [14,15], and many efforts have been imposed on this area [16–21]. Most existing studies typically involve the following key aspects: (1) establish a proper objective function for judging the quality of an AP placement scheme; (2) choose a search algorithm for searching the optimal placement scheme from the candidates; (3) determine a signal propagation model for generating RSS measurements given the AP placement and an arbitrary location. As to the objective function, various localization performance criteria have been reported. In [16,17,20,21], the variation of RSS measurements induced by different AP placement schemes was used as the objective function; in [18], the geometric dilution of precision (GDOP) was utilized to evaluate localization performance; in [19], the total number of similar fingerprints over each pair of fingerprints in the radio map was employed. As to the search algorithm, since the problem of finding the optimal AP placement scheme is essentially NP-complete, different heuristic algorithms have been adopted to improve the time efficiency, but only suboptimal solution is obtained. For instance, the simulated annealing algorithm was used in [17,19], the genetic algorithm was used in [16,18,21], and the differential evolution algorithm was used in [20]. As to the wireless signal propagation model, the simply log distance path loss (LDPL) model was adopted in most studies [16,18,20,21], whereas the more practical Motley-Keenan model and the ray tracing propagation model were used in [17] and [19] respectively. Intuitively, the WiFi AP placement not only affects localization accuracy, but also determines the coverage of the WiFi network. Therefore, it is critical to tackle the problem of AP placement by taking into consideration both localization and coverage. But, most existing studies ignore the coverage problem, except that the simple one-degree coverage is guaranteed in [17,18]. In this paper, a novel method is proposed to optimize AP placement by satisfying the multiple degree of coverage. To be specific, the widely used mathematical tool for localization performance analysis, i.e. the Cramer-Rao lower bound


(CRLB), is employed to measure the resulting localization error given different AP placement schemes; in order to address the issue that the optimization problem is NP-complete, the heuristic genetic algorithm is leveraged by establishing the fitness function that incorporates both localization and coverage. Besides, to advance the practicability of the proposed method, the Motley-Keenan model is adopted to reflect the characteristics of wireless signal propagation in indoor environments as much as possible. Extensive simulations are conducted and show that the proposed method is advantageous in both time efficiency and localization accuracy, in comparison with two exhaustive methods based on the CRLB or the Fisher information matrix (FIM) and the method in [17].

2

The Metric for WiFi Localization

In this section, we shall introduce some preliminaries on evaluating WiFi localization performance. Firstly, a generic localization model is presented. After that, two critical issues in relation to the metric are addressed.

2.1 A Generic Localization Model

As is commonly assumed [22], the RSS measurements of the signals propagated from n APs to a receiver at a position x, denoted y = [y_1, y_2, · · · , y_n]^T, are independent and identically distributed random variables, namely

y ∼ N(m(x), σ^2 I_n),    (1)

where m(x) = [m_1(x), m_2(x), · · · , m_n(x)]^T is a vector function of the mean RSS measurements at the position x = [x_1, x_2]^T from the n APs, and I_n denotes the identity matrix of order n. The localization problem aims to infer the unknown position x given a sample of the RSS measurements y. Supposing that the RSS measurement model in (1) is known, the localization problem can be solved through, e.g., the maximum likelihood estimator (MLE). Define the gradients r_i = ∂m_i(x)/∂x = [r_{i1}, r_{i2}] and formulate r = [r_1^T, r_2^T, · · · , r_n^T]^T. Then, the likelihood function can be formulated as

L(y; x) = log p(y|x),    (2)

and the Fisher information matrix (FIM) is

F(x) = −E[ ∂^2 L(y; x) / ∂x^2 ] = (1/σ^2) r^T r.    (3)

The Cramer-Rao lower bound (CRLB), which equals the inverse of the FIM, can be formulated as

F^{-1}(x) = σ^2 ( Σ_{i=1}^{n} \begin{pmatrix} r_{i1}^2 & r_{i1} r_{i2} \\ r_{i1} r_{i2} & r_{i2}^2 \end{pmatrix} )^{-1},    (4)


and generally, its trace is utilized to denote the lower bound on the mean squared error (MSE) of any unbiased localization algorithm, namely

Tr(F^{-1}(x)) = 2σ^2 Σ_{i=1}^{n} ||r_i||^2 / Σ_{i=1}^{n} Σ_{j=1}^{n} ( ||r_i|| ||r_j|| \sin θ_{ij} )^2,    (5)

where θ_{ij} denotes the angle subtended by r_i and r_j (Fig. 1).

Fig. 1. The illustration of hij associated with ri and rj .

It is evident that the MSE based on the CRLB, or simply the CRLB, can be applied as a metric to evaluate the localization accuracy given different AP placements, such that it is acceptable to optimize AP placement in terms of localization performance through the CRLB. It is noticeable that the CRLB essentially involves two aspects of RSS measurements. One is the gradients of RSS measurements, which actually reflect the differences between RSS measurements at nearby locations and have been utilized in several existing studies for evaluating AP deployment [17]. The other one is the angles subtended by the gradients, which are usually ignored by these studies but will be considered in the proposed method.
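Given the gradients, the CRLB-based MSE bound is a one-line computation. The sketch below evaluates (3)–(4) at a single position; the gradient values are hypothetical placeholders.

import numpy as np

# Hedged sketch: CRLB-based MSE bound at one position from the RSS gradients.
def crlb_mse(gradients, sigma=2.0):
    r = np.asarray(gradients, dtype=float)     # shape (n, 2), one row per AP
    F = (r.T @ r) / sigma**2                   # Eq. (3)
    return np.trace(np.linalg.inv(F))          # trace of Eq. (4)

print(crlb_mse([[1.0, 0.2], [-0.3, 0.9], [0.5, -0.7]]))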

2.2 Gradient Approximations

In order to calculate the gradients ri with i = 1, · · · , n which are prerequisite for calculating the CRLB, one must have the formulation of m(x), which is hard and even impossible due to the following two aspects. Firstly, from the perspective of statistics, the accurate formulation of the mean functions associated with a set of unknown random variables can only be asymptotically approached by their sample means. Secondly, the RSS measurements available for further processing are just collected at a certain number of spatially discrete reference points, which cannot produce accurate gradients. As such, the conditions for precisely calculating the gradients cannot be satisfied. To address this issue, bivariate polynomial regression is adopted to generate an approximate vector function of m(x) by making use of the average RSS measurements around any given reference point. Specifically, given any reference point and an arbitrary AP, define a (second-order) bivariate polynomial function, e.g. f (x) = xT Ax + bT x + c. By using the average RSS measurements at nearby


reference points, f(x) can be fitted, and further, its first derivatives with respect to the considered reference point can be used to approximate the requested gradients at this reference point from the given AP.
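A minimal sketch of this regression step is given below; the reference-point coordinates and RSS values are hypothetical, and the polynomial form follows f(x) = x^T A x + b^T x + c as stated above.

import numpy as np

# Hedged sketch: fit a second-order bivariate polynomial to mean RSS values
# at nearby reference points and use its derivatives as the gradient estimate.
def fit_gradient(points, rss, x0):
    X = np.array([[x*x, x*y, y*y, x, y, 1.0] for x, y in points])
    coef, *_ = np.linalg.lstsq(X, np.asarray(rss, float), rcond=None)
    a11, a12, a22, b1, b2, _ = coef
    x, y = x0
    # df/dx and df/dy of the fitted polynomial, evaluated at x0
    return np.array([2*a11*x + a12*y + b1, a12*x + 2*a22*y + b2])

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]   # hypothetical points
rss = [-40, -42, -41, -44, -47, -46]                     # hypothetical mean RSS
print(fit_gradient(pts, rss, (1.0, 1.0)))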

2.3 The Motley-Keenan Propagation Model

Another key issue arising in evaluating the CRLB lies in collecting sufficient RSS measurements at a set of reference points to regress the vector function m(x), which is often infeasible for the problem of optimizing AP placement due to the fact that the optimization of AP placement should be conducted prior to deploying APs in practice. Therefore, the common approach adopted in the literature is to select an appropriate signal propagation model for predicting RSS measurements at any reference point. The most commonly adopted model is the LDPL model [23], but it cannot embody the attenuation caused by obstacles occurring between an AP and a receiver, like walls, furniture, etc. As such, the Motley-Keenan model [24] is used in this paper to characterize indoor propagation of WiFi signals, namely

PL(d) = PL(d_0) + 10 α log( d / d_0 ) + Σ_{i=1}^{m} k_i L_{w_i},    (6)

where PL(d) is the mean path loss at a distance of d from the AP, PL(d_0) is the mean path loss at a reference distance d_0 from the AP, α is the path loss exponent, k_i is the number of the i-th type of obstacle (there are m types in total) between the current pair of AP and receiver, and L_{w_i} is the penetration loss caused by the i-th type of obstacle. Table 1 lists typical attenuation values with respect to different types of materials.

Table 1. The attenuations caused by different materials.

No. | Materials      | Typical attenuation (dB)
1   | Brick Wall     | 10
2   | Concrete Wall  | 15
3   | Elevator Shaft | 10
4   | Door           | 2
5   | Window         | 1
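The model (6) together with the material losses of Table 1 can be evaluated as in the sketch below; PL(d_0), d_0, and α are placeholder values, not parameters reported in the paper.

import math

# Hedged sketch of the Motley-Keenan path loss (6) with the wall losses of Table 1.
WALL_LOSS_DB = {"brick": 10, "concrete": 15, "elevator": 10, "door": 2, "window": 1}

def motley_keenan(d, obstacles, pl0=40.0, d0=1.0, alpha=3.0):
    wall_term = sum(k * WALL_LOSS_DB[mat] for mat, k in obstacles.items())
    return pl0 + 10 * alpha * math.log10(d / d0) + wall_term

# path loss across 12 m with one concrete wall and two doors (illustrative)
print(motley_keenan(12.0, {"concrete": 1, "door": 2}))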

3 The Metric for WiFi Coverage

In this section, we shall introduce the metric for evaluating WiFi coverage. Similar to the approach adopted in evaluating localization performance, although the target deployment region is continuous, only a set of discrete reference points is selected for use.


A reference point satisfies c-degree coverage if and only if the receiver at this reference point is able to receive valid signals with RSS measurements above a threshold from at least c APs. Likewise, an AP placement scheme satisfies c-degree coverage if and only if all the reference points satisfy c-degree coverage. In order to measure the coverage of an AP placement scheme, denoted I, define the coverage ratio as the percentage of the reference points which satisfy c-degree coverage, namely

f_C(I, c) = ( Σ_{i=1}^{m} C(I, c, i) ) / m,    (7)

where m is the number of reference points, and

C(I, c, i) = 1 if the i-th reference point satisfies c-degree coverage, and 0 otherwise.    (8)
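The coverage ratio (7)–(8) is straightforward to compute from predicted RSS values; the sketch below uses a hypothetical RSS threshold and made-up measurements.

# Hedged sketch of the c-degree coverage ratio (7)-(8).
def coverage_ratio(rss_matrix, c, threshold=-85.0):
    covered = 0
    for rss_at_point in rss_matrix:                     # one row per reference point
        if sum(r >= threshold for r in rss_at_point) >= c:
            covered += 1                                # C(I, c, i) = 1
    return covered / len(rss_matrix)                    # Eq. (7)

# hypothetical RSS values (dBm) at three reference points from three APs
print(coverage_ratio([[-60, -70, -90], [-80, -88, -82], [-95, -91, -70]], c=2))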

4

The Proposed Algorithm

In this section, we shall introduce in detail the proposed genetic algorithm, which finds the optimal placement of the given n APs in terms of both localization accuracy and coverage.

4.1 Analyzing the Computational Complexity of Optimizing AP Placement

Strictly speaking, finding the optimal AP placement requires searching all possible candidate AP placements, but since the area for deploying APs is usually continuous, it would take infinite time to conduct the search. Alternatively, such problems are often simplified and thus approximately solved by searching over a set of finite and discrete points. Specifically, the area can be replaced by a lattice of points (i.e., reference points), such that the aim becomes to find the optimal AP placement over this lattice of points, namely that APs can only be placed on these reference points so as to maximize the localization and coverage performance in relation to these reference points. However, the performance of this alternative approach remarkably relies on the granularity of the lattice, in the sense that a small granularity benefits the performance but induces severe computational complexity. For instance, given p reference points and n APs, the search space is as large as C_p^n, which scales exponentially with p and n. Therefore, in order to find a nearly optimal AP placement as fast as possible, the genetic algorithm is adopted and will be elaborated in the following subsection.

4.2 GA-CRLB

The genetic algorithm is a heuristic search algorithm that evolves from the evolutionary rules of biology. It maintains excellent genes and promotes group evolution through selection, crossover, and mutation operations to obtain the global


optimal solution. In this paper, we develop GA-CRLB based on the genetic algorithm and the CRLB to search for the AP placement that maximizes localization accuracy and coverage simultaneously. The specific steps of the GA-CRLB algorithm are described as follows.

1. Initial population. Before initializing the population, it is necessary to determine the coding method. By using the AP coordinates as gene codes, a gene can be coded as G_i = [x_i, y_i], where x_i and y_i denote the coordinates of the i-th AP. Let P = (I_1, I_2, ..., I_k) denote the population, where I_i = (G_1, G_2, ..., G_n) denotes the i-th individual, n is the number of genes, and k is the number of individuals. As a result, k × n APs are randomly generated to form the initial population. It is noticeable that each individual corresponds to an AP placement scheme, and the coordinates of these APs are uniformly and randomly generated within the set of reference points.

2. Individual evaluation. Calculate the fitness of each individual in the population by using the following fitness function:

f_F(I_i) = 1 / f_L(I_i) if f_C(I_i, c) ≥ F_C, and 0 otherwise,    (9)

where I_i is the i-th individual in the population, F_C is the coverage ratio threshold, and f_L(I_i) is the average localization error over all reference points, i.e.,

f_L(I_i) = ( Σ_{j=1}^{p} Tr(F^{-1}(x_j)) ) / p,    (10)

where p is the number of reference points and x_j is the coordinate of the j-th reference point.

3. Selection operation. The selection operation aims to select some optimal individuals from the parent generation into the next generation based on the evaluation of individual fitness, or to generate new individuals in the next generation through the following crossover and mutation operations. The roulette selection model is adopted by evaluating the selection probability based on the fitness as follows:

p_i = f_F(I_i) / Σ_{j=1}^{k} f_F(I_j),    (11)

where pi is the selection probability of the i-th individual, i.e. Ii . Consequently, the individual with a high fitness value has a high probability of being selected. 4. Crossover operation. Crossover actually determines the global search ability by exchanging parts of genes associated with two individuals and creating new individuals. If a random number between 0 to 1 is less than the crossover probability pc , we randomly select n cross positions to exchange genes of two individuals.


5. Mutation operation. The mutation operation prevents the genetic algorithm from falling into a local optimum by changing gene values of individuals. If a random number between 0 and 1 is less than the mutation probability $p_m$, we randomly select genes in individuals and change the corresponding coordinates $x_i$ and $y_i$.

6. Termination. After the population is initialized, steps 2 to 5 are repeated to produce the next generation. If the iteration number exceeds the threshold $T$, GA-CRLB terminates and the individual with the maximum fitness in the current population is returned as the optimal solution.

In summary, GA-CRLB leverages the CRLB to comprehensively evaluate the influence of the APs on localization while satisfying a predefined coverage threshold and, more importantly, is able to quickly return near-optimal solutions in comparison with approaches that search the huge solution space exhaustively.
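For illustration, a minimal Python sketch of one GA-CRLB generation under the fitness rule of Eq. (9) and the roulette selection of Eq. (11) is given below; coverage_ratio, avg_crlb_error (standing in for Eq. (10)), crossover, and mutate are placeholder functions assumed to be supplied by the caller, and the sketch is not the paper's implementation.

```python
import random

def fitness(individual, coverage_ratio, avg_crlb_error, F_C=1.0):
    """Eq. (9): reciprocal of the average CRLB error if the coverage
    threshold is met, and zero otherwise."""
    if coverage_ratio(individual) >= F_C:
        return 1.0 / avg_crlb_error(individual)   # Eq. (10) supplies the error
    return 0.0

def roulette_select(population, fits):
    """Eq. (11): selection probability proportional to fitness."""
    total = sum(fits)
    r = random.uniform(0.0, total)
    acc = 0.0
    for ind, f in zip(population, fits):
        acc += f
        if acc >= r:
            return ind
    return population[-1]

def next_generation(population, coverage_ratio, avg_crlb_error,
                    crossover, mutate, pc=0.8, pm=0.15):
    """One generation: evaluate, select, cross over, and mutate."""
    fits = [fitness(ind, coverage_ratio, avg_crlb_error) for ind in population]
    children = []
    while len(children) < len(population):
        a = roulette_select(population, fits)
        b = roulette_select(population, fits)
        if random.random() < pc:
            a, b = crossover(a, b)            # exchange AP coordinates (genes)
        children.extend(mutate(ind, pm) for ind in (a, b))
    return children[:len(population)]
```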

5 Simulations

Extensive simulations are carried out in this section to validate the effectiveness and efficiency of the proposed GA-CRLB.

5.1 Tool Introduction

In order to conveniently validate the proposed algorithm, we develop a Java application tool, as shown in Fig. 2. It supports the following functions in relation to AP placement optimization:
– load and display floor plans in the shp format;

Fig. 2. The GUI of the simulation tool.


– simulate RSS based on different signal propagation models with configurable parameters, e.g., the LDPL model and the Motley-Keenan model;
– optimize AP placement with different methods, which will be introduced in the following subsection;
– evaluate the CRLB-based localization errors given different AP placement schemes.

5.2 Setup

Two scenarios of different sizes are considered in the simulations. The large scenario is the third floor of the College of Computer Science building at Inner Mongolia University, with a size of 81 m × 14 m, as shown in Fig. 3(a). The small scenario is part of the large scenario and has a size of 29 m × 14 m, as shown in Fig. 3(b).

Fig. 3. Two scenarios considered in the simulations.

In order to validate the proposed method, the coverage-FD method [17] and two exhaustive methods are implemented for comparison. Specifically, coverage-FD adopts the difference between the RSS measurements at nearby reference points as the optimality metric for localization and uses simulated annealing to search for the optimal AP placement; the exhaustive methods, exhaustion-CRLB and exhaustion-FIM, search the whole solution space with either the CRLB or the determinant of the FIM as the optimality metric for localization. Note that, for a fair comparison, the coverage-FD method does not work to find the minimum number of APs for localization, but directly searches for the optimal AP placement given a fixed number of APs.


During the simulations, besides the parameters listed in Table 2, the influences of several critical parameters, including the number of APs and the lattice granularity, are investigated. The simulation tool is run on a Lenovo PC with an Intel Core i7 CPU and 8 GB of RAM.

Table 2. The values of different parameters in the simulations.

No.  Name     Value     Usage
1    PT       20 dBm    AP transmission power
2    d0       1 m       Reference distance
3    PL(d0)   30        Path loss at the reference distance
4    α        3         Path loss exponent
5    σ        4 dBm     Standard deviation of RSS measurements in dBm
6    Rmin     −85 dBm   Threshold of RSS measurements for valid WiFi signals
7    c        3         Degree of coverage
8    FC       1         Threshold of coverage ratio
9    pc       0.8       Crossover probability
10   pm       0.15      Mutation probability
11   t        40        Number of iterations
12   k        30        Number of individuals

5.3 Evaluation of Execution Time

First, we compare the execution time of the four optimization methods with respect to different numbers of APs. Because the execution time of the two exhaustive methods becomes extremely long when the solution space is large, only the small scenario with a grid distance of 3.3 m is considered, which involves a total of 36 lattice points. The two exhaustive methods are executed once, whereas the two heuristic methods are executed 10 times and the average execution time is reported. The execution time is plotted on a log scale in Fig. 4 with the number of APs rising from 3 to 7. As can be seen, the execution time of the exhaustion-CRLB method increases from 10 s to 4.8 h, and that of the exhaustion-FIM method from 0.02 s to 2.65 min; in contrast, the execution time of the two heuristic methods increases slowly with the number of APs and stays below roughly 5 s regardless of the number of APs. It can therefore be concluded that the two heuristic methods are extremely time-efficient in comparison with the two exhaustive methods. Note that the proposed GA-CRLB is slower than the coverage-FD method because computing the CRLB involves matrix inversion and thus takes more time than computing the difference of RSS measurements used in coverage-FD.

Fig. 4. Execution time vs. the number of APs with respect to the four methods (y-axis: average simulation time in ms, log scale).

In addition, we compare the resulting localization errors of the optimal AP placement schemes returned by the four methods by evaluating the root mean squared error (RMSE) based on the CRLB, namely

$$ RMSE = \sqrt{f_L(I)} \qquad (12) $$

where $I$ is the optimal AP placement scheme obtained by one of the four methods, and $f_L(I)$ is the average localization error over all reference points computed using (10). The RMSE values of the four methods are plotted in Fig. 5 with respect to different numbers of APs. As can be seen, among the four methods, the exhaustion-CRLB method unsurprisingly performs best since it returns the theoretically optimal AP placement scheme, whereas the exhaustion-FIM method performs worst, which is attributable to the different optimality criterion adopted; moreover, the proposed GA-CRLB performs almost the same as the best exhaustion-CRLB method when the number of APs is between 3 and 5, and evidently outperforms the other two methods.

Fig. 5. RMSE vs. the number of APs for the four algorithms.


To sum up, the proposed GA-CRLB is not only time-efficient in comparison with the exhaustive methods, but also achieves performance close to that of the theoretically optimal method.

5.4 Evaluation of Different Factors


Next, the influences of different factors, including the grid distance and the number of APs, on the two heuristic methods are investigated. To do so, for every configuration, both the proposed GA-CRLB method and the coverage-FD method are run 10 times in the large and small scenarios, respectively, and box plots of the resulting RMSE values are displayed for analysis. With a grid distance of 1 m in both the large and small scenarios, Figs. 6 and 7 illustrate the box plots of the RMSE produced by the proposed GA-CRLB method and the coverage-FD method with respect to different numbers of APs. As can be seen, the RMSE values obtained by both methods in both scenarios decrease as the number of APs increases, and the proposed GA-CRLB method significantly outperforms the coverage-FD method in terms of both the median RMSE and the stability, which verifies the advantage of the proposed GA-CRLB method. In particular, when the number of APs is relatively large, the resulting RMSE values are quite concentrated, which is a key merit for practical applications.

Fig. 6. RMSE distribution under different numbers of APs in the large scenario: (a) GA-CRLB; (b) coverage-FD.

With 8 APs in the large scenario and 5 APs in the small scenario, Figs. 8 and 9 illustrate the box plots of the RMSE produced by the proposed GA-CRLB method and the coverage-FD method with the grid distance rising from 0.5 m to 4 m. Similar to the influence of the number of APs, the proposed GA-CRLB method achieves lower localization errors and higher stability than the coverage-FD method in all the circumstances considered. However, even though the RMSE decreases with increasing grid distance in most cases, this should be attributed to the fact that a large grid distance results in fewer reference points, which induces large errors in approximating the localization errors over the continuous region; that is to say, the significance of the simulations is limited when the grid distance is large.

Fig. 7. RMSE distribution under different numbers of APs in the small scenario: (a) GA-CRLB; (b) coverage-FD.

Fig. 8. RMSE distribution under different grid distances in the large scenario: (a) GA-CRLB; (b) coverage-FD.

Fig. 9. RMSE distribution under different grid distances in the small scenario: (a) GA-CRLB; (b) coverage-FD.


In summary, no matter how many APs are given and how large the grid distance is, the proposed GA-CRLB method displays superior performance to the coverage-FD method at the cost of a moderate amount of extra computation.

6 Conclusions

This paper studied the problem of optimizing AP placement for both localization and coverage. Specifically, the CRLB is used as the metric to evaluate the localization accuracy of AP placement schemes, and the genetic algorithm is applied to quickly search for the AP placement scheme that achieves the minimum average localization error while satisfying the predefined coverage requirement. Simulation results show that the proposed algorithm is not only time-efficient in comparison with the exhaustive methods, but also achieves performance close to the optimum that the exhaustive method can reach, and significantly outperforms another popular heuristic method based on fingerprint differences and simulated annealing.

References

1. Calderoni, L., Maio, D., Palmieri, P.: Location-aware mobile services for a smart city: design, implementation and deployment. J. Theor. Appl. Electron. Commer. Res. 7(3), 74–87 (2012)
2. Dawood, R., Yew, J., Jackson, S.J.: Location aware applications to support mobile food vendors in the developing world. In: Extended Abstracts on Human Factors in Computing Systems, CHI 2010, pp. 3385–3390 (2010)
3. Liu, Z., Luo, D., Li, J., Chen, X., Jia, C.: N-mobishare: new privacy-preserving location-sharing system for mobile online social networks. Int. J. Comput. Math. 93(2), 384–400 (2013)
4. Liu, Z., Li, T., Li, P., Jia, C., Li, J.: Verifiable searchable encryption with aggregate keys for data sharing system. Futur. Gener. Comput. Syst. 78, 778–788 (2017)
5. Li, M., Liu, Z., Li, J., Jia, C.: Format-preserving encryption for character data. J. Netw. 7, 1239–1244 (2012)
6. Zou, H., Huang, B., Lu, X., Jiang, H., Xie, L.: A robust indoor positioning system based on the procrustes analysis and weighted extreme learning machine. IEEE Trans. Wirel. Commun. 15(2), 1252–1266 (2016)
7. Zhou, M., Tang, Y., Nie, W., Xie, L., Yang, X.: Grassma: graph-based semi-supervised manifold alignment for indoor WLAN localization. IEEE Sens. J. 17(21), 7086–7095 (2017)
8. Zhao, H., Huang, B., Jia, B.: Applying kriging interpolation for WiFi fingerprinting based indoor positioning systems. In: 2016 IEEE Wireless Communications and Networking Conference, pp. 1–6, April 2016
9. Zou, H., Zhou, Y., Jiang, H., Huang, B., Xie, L., Spanos, C.: Adaptive localization in dynamic indoor environments by transfer kernel learning. In: 2017 IEEE Wireless Communications and Networking Conference, pp. 1–6, March 2017
10. Zhou, M., Tang, Y., Tian, Z., Geng, X.: Semi-supervised learning for indoor hybrid fingerprint database calibration with low effort. IEEE Access 5, 4388–4400 (2017)


11. Fang, S.H., Lin, T.N., Lin, P.C.: Location fingerprinting in a decorrelated space. IEEE Trans. Knowl. Data Eng. 20(5), 685–691 (2008)
12. Jia, B., Huang, B., Gao, H., Li, W.: On the dimension reduction of radio maps with a supervised approach. In: 2017 IEEE 42nd Conference on Local Computer Networks (LCN), pp. 199–202, October 2017
13. Jia, B., Huang, B., Gao, H., Li, W.: Dimension reduction in radio maps based on the supervised kernel principal component analysis. Soft Comput. 22, 1–7 (2018)
14. Baala, O., Zheng, Y., Caminada, A.: The impact of AP placement in WLAN-based indoor positioning system. In: 2009 Eighth International Conference on Networks, pp. 12–17, March 2009
15. Huang, B., Liu, M., Xu, Z., Jia, B.: On the performance analysis of WiFi based localization. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2018)
16. Alsmady, A., Awad, F.: Optimal Wi-Fi access point placement for RSSI-based indoor localization using genetic algorithm. In: 2017 8th International Conference on Information and Communication Systems (ICICS), pp. 287–291, April 2017
17. Chen, Q., Wang, B., Deng, X., Mo, Y., Yang, L.T.: Placement of access points for indoor wireless coverage and fingerprint-based localization. In: 2013 IEEE 10th International Conference on High Performance Computing and Communications, 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pp. 2253–2257, November 2013
18. Zirazi, S., Canalda, P., Mabed, H., Spies, F.: Wi-Fi access point placement within stand-alone, hybrid and combined wireless positioning systems. In: 2012 Fourth International Conference on Communications and Electronics (ICCE), pp. 279–284, August 2012
19. Sharma, C., Wong, Y.F., Soh, W.S., Wong, W.C.: Access point placement for fingerprint-based localization. In: 2010 IEEE International Conference on Communication Systems, pp. 238–243, November 2010
20. Zhao, Y., Zhou, H., Li, M.: Indoor access points location optimization using differential evolution. In: 2008 International Conference on Computer Science and Software Engineering, vol. 1, pp. 382–385, December 2008
21. He, Y., Meng, W., Ma, L., Deng, Z.: Rapid deployment of APs in WLAN indoor positioning system. In: 2011 6th International ICST Conference on Communications and Networking in China (CHINACOM), pp. 268–273, August 2011
22. Wen, Y., Tian, X., Wang, X., Lu, S.: Fundamental limits of RSS fingerprinting based indoor localization. In: 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 2479–2487. IEEE (2015)
23. Rappaport, T.: Wireless Communications: Principles and Practice, 2nd edn. Prentice Hall PTR, Upper Saddle River (2001)
24. Keenan, J., Motley, A.: Radio coverage in buildings. Br. Telecom Technol. J. 8, 19–24 (1990)

PLZMA: A Parallel Data Compression Method for Cloud Computing

Xin Wang1,3, Lin Gan1,3,5(B), Jingheng Xu1,3, Jinzhe Yang3,4, Maocai Xia1,3, Haohuan Fu2,3,5, Xiaomeng Huang2,3,5, and Guangwen Yang1,2,3,5

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
[email protected]
2 Ministry of Education Key Lab. for Earth System Modeling, and Department of Earth System Science, Tsinghua University, Beijing, China
3 National Supercomputing Center, Wuxi, China
4 Department of Computing, Imperial College London, London, UK
5 Lab. for Regional Oceanography and Numerical Modeling, Qingdao National Lab. for Marine Science and Technology, Qingdao, China

Abstract. Recent decades have seen the rapid development of cloud computing, which has brought a huge breakthrough in handling the data produced every second and everywhere. Meanwhile, data compression is becoming increasingly important, due to its great potential to benefit both network transportation and storage. Motivated by the urgent demand for a highly efficient compression method with balanced performance in both compression time and ratio, this paper presents PLZMA, a parallel design of LZMA. Process-level and thread-level parallelisms are implemented according to the LZMA algorithm, which greatly improve the compression time while ensuring a fair compression ratio. Experimental results on real-world applications show that PLZMA achieves more balanced performance than other well-known methods. The parallel design achieves a performance speedup of 8× over the serial baseline using 12 threads.

Keywords: Data compression · Parallel computing · LZMA

1 Introduction

In the past few decades, society, and the way people live, has been completely changed by revolutionary developments in technology. The world is entering a new era in which data containing useful information travels over various networks, exists everywhere, and can be accessed in no time. Among the different technical breakthroughs that contribute to this new era, cloud storage is undoubtedly one of the most essential and efficient approaches. Unlike traditional storage methods that mainly rely on local systems, cloud storage provides a more secure, simple, and convenient way of accommodating data, thanks to technical innovations in areas such as parallel computing, virtualization, and data center infrastructure.


In the meantime, the data surge continues and has reached an extraordinarily large scale. For example, the amount of data that Google needs to process every day already exceeds 20 PB [1], while the amount of image storage for Facebook already exceeds 540 TB, with 100 million images uploaded per week and 475,000 image requests to respond to per second. In return, the surge of data for processing and storage greatly challenges network systems. Scientists and researchers are eager to see more efficient approaches appear, so that data transfer over the network can be further accelerated.

Capable of decreasing both the bandwidth of data communication and the space required for data storage, data compression is undoubtedly one of the most efficient solutions. Currently, data compression has been widely applied in many key applications, including telecommunication [2] and multimedia technology [3], as well as in some high performance computing areas such as geophysics exploration [4,5]. Data compression plays an important role in providing high performance in data communication and accommodation. Particularly in the field of cloud storage, with the demand for storing larger amounts of data on the cloud, the popularity of virtualization and cloud technologies, and the establishment of worldwide ultra-large-scale data centers, traditional data storage technologies that focus on storage effectiveness, such as data deduplication and thin provisioning, are no longer enough to ensure the efficiency of frequent cloud services. Therefore, data compression becomes necessary to provide further contributions in cloud-related applications. Many compression algorithms, such as BWT [6], the Deflate algorithm [7], and LZ77/78 [8], have been applied in cloud computing systems.

On the other hand, even though data compression is able to decrease the demand for both network bandwidth and system storage in cloud storage, the extra modules for data compression and decompression within the cloud system bring overhead and might slow down the overall performance. Therefore, the performance of the selected data compression method also matters in a cloud storage system in order to achieve the best overall performance. However, most compression algorithms used in cloud storage are still largely serial and only guarantee a fair compression ratio, rather than also delivering the compression speed that should be considered. Up to now, parallel data compression methods with balanced performance in both compression ratio and speed are still rarely seen. For example, Pigz [9] is a parallel compression software based on the Deflate algorithm. It greatly improves the compression time but results in poor compression ratios because its parallelization breaks the integrity of the files.

As good performance in both compression ratio and speed is required, this work focuses on designing an efficient parallel compression method that is qualified for a highly efficient cloud storage system. Among the different data compression methods available, the popular LZMA algorithm is selected as the baseline, and it is carefully parallelized and optimized on a traditional multi-core parallel system.


The reason for selecting LZMA is that it achieves a good compression ratio when processing binary files while consuming only a small amount of memory. Therefore, parallelizing LZMA can achieve relatively good results and compensate for its long compression time. To summarize, this paper designs PLZMA, a parallel data compression method. The major compression modules of the LZMA algorithm are fully scaled onto parallel systems, with fine-grained tuning efforts such as task overlapping, to build a parallel framework that fits the parallel system well.

2 Related Work

2.1 Traditional Compression Algorithm

Prediction by partial matching (PPM) [10], which is considered to be one of the best lossless compression algorithms for text data, is a typical statistics-based compression algorithm. Based on context modeling, the PPM model uses the previous symbol patterns in the uncompressed text stream to predict the upcoming symbol. To make the prediction more effective, each context model used in the algorithm is updated based on statistical information. Compared with other compression algorithms, PPM can achieve a very high compression rate for text files. However, the compression speed of PPM is very limited, mainly due to the huge amount of computation.

The Deflate algorithm [7], which is widely used in the HTTP protocol [11], the PPP compression control protocol [12], the PNG and MNG file formats [13], the zlib compression library [14], etc., is a combination of the LZ77 algorithm and Huffman coding. It is a general-purpose compression algorithm whose memory consumption is independent of the size of the compressed file, so the Deflate algorithm is more suitable for servers with limited memory capacity. Building on the successful experience of Deflate, the Lempel-Ziv-Markov chain algorithm (LZMA) [15] is a further improvement of the LZ77 algorithm. The LZMA algorithm offers two longest-match search choices: a fast mode (finding patterns based on a hash array of index lists) and a normal mode (finding patterns based on a hash array of binary trees). The purpose of the LZMA algorithm is to increase the compression rate by using a large dictionary, while reducing both the compression time and the memory usage during decompression. Compared with other compression methods, LZMA is applicable when the user wants real-time decompression and the compression speed is not emphasized.

The Burrows-Wheeler transform (BWT) [16] is also a widely used compression algorithm, which rearranges a character string into runs of similar characters. Compared with other algorithms, the biggest advantage of BWT is that it can obtain a large compression throughput. In general, traditional compression algorithms can achieve very good performance in either compression ratio or compression throughput, but it is difficult to achieve good performance in both aspects at the same time.


In the meantime, some customized toolkits and libraries have also been developed to facilitate the use of data compression, such as Snappy [17], the LZMA SDK [18], QuickLZ [19], and LZO [20].

2.2 Parallel Compression Algorithm

Compared with traditional compression methods based on serial algorithms, parallel compression algorithms can effectively reduce the compression time and thereby achieve very high compression efficiency. A good example is the Pigz compression algorithm, a parallel version of the zlib compression library [9]. During the Pigz compression process, the input is decomposed into blocks of 128 KB, and the data in each block is compressed/decompressed in parallel. At the same time, check values are introduced to ensure correctness, and they are also calculated in parallel. The compressed data is written to the output sequentially, and the check values obtained by each processor are combined into a valid overall check value. The Pbzip compression method [21], which can be regarded as the parallel version of BWT, is another famous parallel compression method. As the separated blocks in the BWT method are independent of each other, their computation can run simultaneously, which is the basic principle of the Pbzip algorithm. However, although remarkable compression-speed improvements can be achieved by both of these methods, their compression ratio suffers from the overhead of implementing the parallelism. With the surge of data size in current applications such as cloud computing, novel compression techniques that achieve balanced performance in both compression speed and compression ratio are becoming an urgent demand.

3 Design Overview

3.1 Baseline Algorithm: LZMA

This work aims at a novel compression approach with good performance in both compression speed and compression ratio that is suitable for cloud computing systems, to provide better overall efficiency. The LZMA (Lempel-Ziv-Markov chain Algorithm) [15] is an improved version of the Deflate algorithm [7] and is considered one of the most popular compression methods. LZMA was developed by Igor Pavlov and used in his own 7-ZIP [22] product, which was designed for a high compression ratio, rapid decompression/compression, and low memory consumption during decompression. Users can choose between two modes for finding the longest matching string, which is a key step of the compression procedure: the fast search mode uses a hash array of index lists, while the general search mode uses a hash array of binary decision trees. Both the index list and the binary decision tree are parallelized in this work.


An important reason for selecting LZMA as the baseline compression method is that its very high compression ratio can absorb the overhead brought by parallelization: the compression speed can be greatly increased while the compression ratio remains good. Therefore, LZMA has great potential to achieve balanced performance in both merits.
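As a side note, the fast/normal match-finder distinction described above is exposed by standard LZMA implementations; the short Python sketch below (using the standard-library lzma module, not the PLZMA code, with an arbitrary file path) only illustrates the two configurations and the use of a large dictionary:

```python
import lzma

data = open("sample.bin", "rb").read()   # any test file; path is illustrative

# "Fast" mode: hash-chain (index-list) match finder.
fast = [{"id": lzma.FILTER_LZMA2, "dict_size": 64 * 1024 * 1024,
         "mode": lzma.MODE_FAST, "mf": lzma.MF_HC4}]
# "Normal" mode: binary-tree match finder, better ratio, slower.
normal = [{"id": lzma.FILTER_LZMA2, "dict_size": 64 * 1024 * 1024,
           "mode": lzma.MODE_NORMAL, "mf": lzma.MF_BT4}]

for name, filters in [("fast", fast), ("normal", normal)]:
    out = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
    print(name, len(data), "->", len(out))
```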

3.2 Overview of the Parallel Design

In this part we first give an overview of the PLZMA library proposed in this work. In general, PLZMA contains three steps, with two different levels of parallelism, as demonstrated in Fig. 1.

Fig. 1. Overview of the PLZMA method (data partition, two-level parallel compression with dictionary initialization, dictionary search/update and interval coding per partition, and parallel I/O for reads and writes).

Step 1: Data Partition. According to the number of computing processes available on the selected parallel computing platform, the original files are first divided into several data partitions. The principle is to keep the sizes of the different data partitions as close as possible, so as to ensure load balance after parallelization.

Step 2: Two-Level Parallel Data Compression. Each data partition can be compressed in parallel and independently. Compressing a data partition generally consists of two parts. The first is a dictionary-based compression process: the dictionary is initialized, then queried and updated based on the new data that is continuously read in. The second part is interval encoding. After the input data is compressed against the dictionary, it is converted into triples consisting of the matching position, the match length, and the next character.


In this part, interval coding is used to further compress the data. Step 2 is the essential procedure of PLZMA, and this work uses two levels of parallelism to achieve better performance. The first level (process level) is data parallelism at the upper level: each data partition is compressed by a computing process independently and simultaneously. The second level (thread level) lies in the dictionary compression part, where a finer-grained, thread-level parallelization is applied to further improve the performance.

Step 3: Data Combination. After each data segment has been compressed, the final result file is written in parallel according to the lengths of the resulting compressed data.

In our parallel design of PLZMA, Step 1 and Step 3 are implemented using a parallel I/O library (e.g., MPI-IO). The process-level compression in Step 2 is implemented directly with MPI. The essential part of this work is therefore the thread-level parallelism, which is explained in detail in the following section.
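To make the three steps concrete, here is a minimal sketch of the process-level layer using mpi4py and the standard lzma codec. It is an illustration only: PLZMA uses its own thread-parallel dictionary coder (Sect. 4) and parallel I/O rather than this gather-to-root scheme, and the file name is a placeholder.

```python
import lzma
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
path = "input.dat"                       # placeholder input file

# Step 1: split the file into near-equal byte ranges, one per process.
total = os.path.getsize(path)
chunk = (total + size - 1) // size
start = rank * chunk
length = min(chunk, max(0, total - start))

# Step 2 (process level): each rank compresses its own partition.
with open(path, "rb") as f:
    f.seek(start)
    block = lzma.compress(f.read(length))

# Step 3: gather the compressed blocks (with their lengths) and let rank 0
# write them out; a real implementation would use parallel I/O (e.g. MPI-IO).
blocks = comm.gather(block, root=0)
if rank == 0:
    with open(path + ".plz", "wb") as out:
        for b in blocks:
            out.write(len(b).to_bytes(8, "little"))
            out.write(b)
```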

4 Thread-Level Parallel Design

In this part we present a parallelized dictionary compression algorithm based on multi-threading, which can take full advantage of the hardware units of modern multi-core CPUs.

4.1 Algorithm and Workflow

The dictionary compression part of PLZMA is described in Algorithm 1. First, the basic data structures and hyper-parameters are set up, such as the data structure of the hash table and the size of the dictionary. Second, while there is still data to be compressed at the current position, the algorithm calculates the hash value of the first batch of characters at the current position. Third, if the hash value can be found in the hash table, the pairing succeeds, and the query and update tasks are performed as required; otherwise, the pairing fails and the algorithm jumps directly to line 8. When performing the update task (line 5), the current item is first inserted into the data structure of the hash table; note that this procedure varies with the data structure used for the hash table. Then a parallel matching search is performed on the possible matching positions in the hash table entry to obtain the longest match in the dictionary for the current position. From the above description, the basic idea is similar to the LZ77 algorithm: the data is matched and compression-encoded by maintaining a sliding window. Compared with previous dictionary compression algorithms, the biggest advance of PLZMA is the adoption of fast-search data structures (such as a hash table with corresponding structures like linked lists and binary trees) to accelerate the search procedure.


Algorithm 1. The Parallel Dictionary Compression Algorithm
1:  Dictionary compression initialization
2:  While (there is still data to be compressed at the current position)
3:    Calculate the hash value of the first batch of characters
4:    if (the hash value can be found in the hash table) {
5:      Update the value into the hash table
6:      Encode the value as (offset, len, c) according to the current position
7:    } else {
8:      Encode the value as (0, 0, c) according to the current position
9:    }
10: End While

In the following part of this section, we focus on the implementation of the update and query operations of the PLZMA algorithm.

4.2 The Search Entry of the Parallel Dictionary

The main compression method used in PLZMA is also based on the concept of a sliding window. As shown in Fig. 2, during compression we maintain a fixed-size dictionary window. The content of the window is the data that has already been compressed before the current position; behind the current position is the data still to be compressed. The basic principle is to look in the dictionary window for the longest string that matches the data at the current position (the green block in Fig. 2), and then convert the matching data into a triple (offset, len, c), where offset is the offset from the current position back to the matching position in the dictionary window, len is the length of the match, and c is the next character after the matched data. After a match is found, the current position moves past the matched data and the dictionary window slides forward accordingly. If no match can be found for the current data, a triple (0, 0, c) is output, where c is the first character of the data to be compressed, and the current position is shifted forward by one.

Fig. 2. Sliding window matching schematic (Color figure online)
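A brute-force version of this sliding-window encoding can be written in a few lines; the Python sketch below is only an illustration of the (offset, len, c) triples and deliberately omits the hash-table acceleration that PLZMA actually relies on:

```python
def lz77_encode(data: bytes, window: int = 32 * 1024, max_len: int = 255):
    """Greedy sliding-window encoder emitting (offset, length, next_char)
    triples; offset is the backwards distance to the match in the window."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):        # scan the dictionary window
            k = 0
            while (k < max_len and i + k < len(data) - 1
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        nxt = data[i + best_len]                      # literal after the match
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

# usage: lz77_encode(b"abracadabra") yields a short list of triples
# that a decoder can replay to reconstruct the input
```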

Most early dictionary compression algorithms only support small dictionaries (32 KB). In the PLZMA algorithm, depending on the size of the input file, we support dictionaries of several MB or even tens of MB.


By using a large-capacity dictionary, we can store more data patterns that have appeared in the file, so that more content can be matched and higher compression ratios can be achieved during compression. However, a large-capacity dictionary also brings another problem: time-consuming dictionary queries. To solve this problem, we use a hash table in the dictionary to reduce the query time. The hash function is calculated from the first few (generally one to three) characters of the data to be compressed to obtain the matching position in the dictionary. Due to the large dictionary capacity, the hash value of the first few characters may have multiple corresponding positions in the dictionary, and in that case it is unknown which position gives the longest match with the current data. Therefore, in order to achieve the best compression effect, we store multiple possible matching positions in each hash table entry. Compared with other dictionary compression algorithms, the hash table entries in PLZMA can be stored as binary trees, linked lists, or static arrays. Among these choices, the static-array storage form supports simultaneous access to different matching positions, thereby providing better support for parallel dictionary queries; thus, this data structure should be used whenever possible.

4.3 Dictionary Update and Maintenance

The introduction of the hash table can greatly speed up pattern matching, but it also brings a certain overhead to the updating and maintenance of the dictionary. While the window moves forward, the corresponding structures in the hash table need to be updated accordingly. As described above, in the PLZMA algorithm there are three possible data structures for storing a hash table entry: a linked list, a binary tree, or a static array.

1. Storing the hash table entry as a linked list. In this mode, each element of the hash table points to values stored in a linked list, and the starting node of the corresponding linked list is stored in the hash table. Since there is no limit to the length of the list itself, in order to avoid spending too much time searching for a specific match, PLZMA provides a settable parameter specifying the maximum search length in the linked list. At the same time, the latest matching data is added to the beginning of the linked list of the hash table entry, which ensures that matching always starts from the most recent position. As indicated above, PLZMA uses sliding-window technology; as the window slides forward, the hash table is updated accordingly. Since dynamically allocating memory to maintain the list would bring a huge memory-management overhead, PLZMA uses static arrays and a cyclic-buffer technique to avoid frequent memory allocation and release. The structure of this cyclic buffer is shown in Fig. 3.


Fig. 3. Basic structure of the cyclic buffer

As demonstrated in Fig. 3, the cyclic buffer is essentially an array one byte larger than the sliding window size (the window plus the current encoded byte), but it maintains a logical ring. When the window slides forward, we only need to change the values of start and end accordingly. Each byte in the cyclic buffer corresponds to a byte in the sliding window, as shown in Fig. 4.

Fig. 4. Cyclic buffer before coding

As shown in Fig. 4, when encoding the current byte, the index in the hash table is obtained by the hash function. With this index, the algorithm can get the absolute position of the previous byte with the same hash value. Through the current encoding byte position and the offset of that position, the corresponding position can be found in the son table (namely the cyclic buffer). Since the cyclic table acts as a linked list, all the matching positions in the current dictionary can be found accordingly. After coding, PLZMA makes the corresponding changes to the hash table and son table, as shown in Fig. 5. Comparing the two figures, we can see that the hash table and its son table slide forward with the window and keep changing. In this process, there is no memory reallocation or release operation, which greatly enhances the speed.


Fig. 5. Cyclic buffer after coding
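The head/son arrangement of Figs. 3–5 can be summarized compactly. The Python fragment below is a simplified illustration only (it hashes two bytes and glosses over the wrap-around bookkeeping of the real cyclic buffer):

```python
WINDOW = 1 << 15            # sliding-window size (illustrative)

def hash2(data, i):
    """Hash of the two bytes starting at position i."""
    return (data[i] << 8) | data[i + 1]

def insert_position(data, i, head, prev):
    """Make position i the most recent entry for its hash value; the previous
    entry stays reachable through the cyclic 'son' buffer prev."""
    h = hash2(data, i)
    prev[i % (WINDOW + 1)] = head.get(h, -1)
    head[h] = i

def candidate_positions(data, i, head, prev, max_chain=64):
    """Walk the chain of earlier positions sharing the current hash value."""
    cands, pos = [], head.get(hash2(data, i), -1)
    while pos >= 0 and i - pos <= WINDOW and len(cands) < max_chain:
        cands.append(pos)
        pos = prev[pos % (WINDOW + 1)]
    return cands

# usage: head, prev = {}, [-1] * (WINDOW + 1); call insert_position(data, i, ...)
# after encoding each position, so the window slides without any reallocation
```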

2. Storing the hash table entry as a binary tree. As indicated above, in some cases we can also employ a binary tree to store the hash table entry so as to achieve satisfactory search performance. Similar to the linked list, in the binary-tree mode the hash table entry points to a binary search tree which is initially empty. As data is read and encoded, trees are continuously created or grown, and each data unit has a corresponding node in the binary tree. For instance, if we hash every two bytes of a word, then every two consecutive bytes correspond to a node in the binary tree, and the word ‘good’ will be split into three nodes, ‘go’, ‘oo’, and ‘od’, which are assigned to the corresponding binary trees. Below we use a detailed example to illustrate the specific process of binary tree establishment and maintenance. Suppose we are going to compress a long string that contains five data items starting with ab at different locations; their contents and locations are (abm, abcd2, abcx, abcd1, aby) and (11, 24, 30, 57, 78), respectively.

Fig. 6. Binary tree establishment and maintenance

(1) Suppose that ab has a hash value of 62. When the first scan reaches the occurrence of ab at position 11, the binary tree is empty, so a binary tree is created at position 62 in the hash table. At this moment, the binary tree has only one root node, storing the value 11. Since no match was formed, the character a is output directly and the encoder starts processing the next string bm.

(2) When the compression algorithm reaches the next occurrence of ab at position 24, the hash value of ab again yields 62, and the search starts in the binary tree belonging to position 62 in the hash table. First, the encoder updates the new location 24 to be the root of the binary tree. Then, position 11 is attached as the right child of position 24, since the character c of the string abcd2 is smaller than the character m of the string abm.


(3) When the next ab is found at position 30, we first update position 30 to be the root of the binary tree, and place the previous position 24 as the left child of position 30 and position 11 as the right child of position 30, according to the character order.

(4) The next ab corresponds to position 57. This location is updated to be the root node. The characters following all previous ab-strings are less than the characters following this ab, so the original binary tree is placed as the right child of position 57.

(5) The last ab-string in this example is found at position 78. As before, this value is updated to be the root node of the binary tree. Since the characters following ab at all previous positions are smaller than the character y following ab at position 78, the previous binary tree is placed as the left child of the root node.

As demonstrated by the above example, the binary search tree used by PLZMA ensures that the most recent pattern is always at the root of the tree. At the same time, the internal nodes of the binary tree are arranged according to character order, so as to achieve an effective search during matching. Benefiting from these two features, the compression ratio can be further improved by using the binary tree as the data structure. At the same time, however, the update and maintenance of the binary tree are relatively complicated, which may increase the compression time as a result.

3. Storing the hash table entry as a static array. In addition to the linked list and binary tree, PLZMA also supports storing hash entries in a static array. Compared with the previous data structures, employing a static array can remarkably reduce the parallel dictionary query time; as a tradeoff, the compression ratio of this data structure is not as good. Fig. 7 demonstrates the basic idea of the static-array-based hash table. In order to maintain and update each hash table entry, we store two variables for the array corresponding to each hash table entry. The first variable is the number of valid positions in the array. At the beginning, this variable is initialized to 0, and it then increases as matching patterns are gradually added, up to the upper limit of the array size, after which it no longer changes.

Fig. 7. Static array


The other variable is the index of the latest matching position in the array. By using this variable, we can guarantee that the search starts at the nearest matching position every time a query is made. When the static array elements are used up, the most recent match is written into the hash table entry, replacing the match that has been stored the longest.

4.4 Parallel Dictionary Query

As indicated in previous sections, to achieve optimal compression performance, multi-level parallelism is adopted in the PLZMA algorithm. Besides the parallel data compression discussed earlier, the parallel dictionary query is also one of the most important parts of the algorithm design. As the data structure of the hash table entry changes, the implementation of the parallel dictionary query must also change to achieve optimal performance. (1) For a hash table entry using a linked list, each query can directly traverse the list and perform a matching search on the position corresponding to each list entry; thread-management techniques such as a thread pool can be adopted in this case. (2) For a hash table entry using a binary tree, implementing the parallel dictionary query is relatively difficult, because parallel search and update of the binary tree are not easy. In this case, we can allocate multiple threads to the task, using one thread to perform the longest-match search in the binary tree and another thread to update and maintain the binary tree. (3) For a hash table entry using a static array, parallel dictionary queries are more straightforward. Since the possible matching positions corresponding to the same hash value are stored sequentially in the array, we can launch multiple threads directly through OpenMP or pthreads to process different positions in parallel.
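For case (3), the query itself is embarrassingly parallel over the candidate positions in the static array. The Python sketch below only illustrates that structure with a thread pool; the actual implementation uses OpenMP/pthreads in C, and max_len is an arbitrary cap introduced here for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def match_len(data, pos, cur, max_len=273):
    """Length of the common prefix between the dictionary position and cur."""
    n = 0
    while n < max_len and cur + n < len(data) and data[pos + n] == data[cur + n]:
        n += 1
    return n

def longest_match(data, cur, candidates, workers=4):
    """candidates: the static array of positions stored in one hash entry.
    Each candidate is evaluated independently and the best one is kept."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        lengths = list(pool.map(lambda p: match_len(data, p, cur), candidates))
    if not lengths:
        return 0, 0                       # no match: emit (0, 0, c)
    best = max(range(len(lengths)), key=lengths.__getitem__)
    return candidates[best], lengths[best]
```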

5 Performance Results on Telecommunication Dataset

5.1 Selected Platform and Dataset

Table 1 shows the hardware configuration of the selected platform. As for the dataset for testing the performance of PLZMA, we choose different types of data, including 10 files in text format (ranging from 1.8 MB to 1 GB) and some other files in video, audio, and PPT formats.

5.2 Performance Result

For all test cases, we first validate the parallel program by comparing the original data with the decompressed data and confirming that they are identical; the program is thus validated. Figure 8 shows the performance speedups of PLZMA with different numbers of threads. As the number of threads increases, the performance improves; using 12 threads achieves a speedup of 8× over the serial LZMA.

Table 1. Hardware configuration

Hardware     Configuration
CPU          Intel(R) Xeon(R) E5645, 6 cores, 2.40 GHz
CPU number   12
Cache        1 MB
Memory       36 GB
Disk         4 TB
OS           Red Hat Enterprise 5.5
Compiler     Intel Compiler ICC 12.0
MPI          Intel MPI Library 4.0

Fig. 8. Speedups with different numbers of threads

Figure 9 shows the compression time of PLZMA and several other well-known compression methods, while Fig. 10 shows their compression ratios. From the two figures, we can see that PLZMA achieves a more balanced performance in both merits.

Fig. 9. Compression time compared with other parallel compression methods


Fig. 10. Compression ratio compared with other parallel compression methods

5.3 Analysis

Based on the experimental results, we can see that, compared with the other compression methods, our PLZMA method is more balanced in both time-to-solution and compression ratio. Pigz is based on zlib and performs worse in compression ratio; Pbzip2 is based on bzip2 and performs worse in compression time. Both Pxz and PLZMA are based on LZMA, but the former scales worse. Therefore, PLZMA outperforms the other methods and has already been applied in industry for compressing real data in cloud computing systems.

6 Conclusion

This work presents PLZMA, a parallel data compression method for high-performance solutions in cloud computing systems. Based on two different levels of parallelism, every fine-grained step of the original LZMA method is carefully tuned, yielding better performance in both compression time and ratio.

Acknowledgement. L. Gan, and J. Xu are supported by the National Natural Science Foundation of China (grant no. 61702297); and the China Postdoctoral Science Foundation (grant no. 2016M601031). H. Fu, and X. Wang are supported by the National Key Research & Development Plan of China (grant no. 2017YFA0604500), the National Natural Science Foundation of China (grant no. 91530323, 41661134014, 41504040 and 61361120098); and the Tsinghua University Initiative Scientific Research Program (grant no. 20131089356). G. Yang, and J. Yang are supported by the National Key Research & Development Plan of China (grant no. 2016YFA0602200). X. Huang is supported by a grant from the State’s Key Project of Research and Development Plan (2016YFB0201100) and the National Natural Science Foundation of China (41375102).


References

1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
2. Motley, C.F.: Telecommunication data compression apparatus and method, April 13 2004. US Patent 6,721,282
3. Yan, C., Zhang, Y., Dai, F., Li, L.: Highly parallel framework for HEVC motion estimation on many-core platform. In: Data Compression Conference (DCC), pp. 63–72. IEEE (2013)
4. Gan, L., Fu, H., Luk, W., Yang, C., Xue, W., Yang, G.: Solving mesoscale atmospheric dynamics using a reconfigurable dataflow architecture. IEEE Micro 37(4), 40–50 (2017)
5. Gan, L., Fu, H., Mencer, O., Luk, W., Yang, G.: Data flow computing in geoscience applications. Adv. Comput. 104, 125–158 (2017)
6. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm (1994)
7. Deutsch, P.L.: Deflate compressed data format specification version 1.3 (1996)
8. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
9. Gristwood, T., Fineran, P.C., Everson, L., Salmond, G.P.C.: PigZ, a TetR/AcrR family repressor, modulates secondary metabolism via the expression of a putative four-component resistance-nodulation-cell-division efflux pump, zrpadbc, in serratia sp. atcc 39006. Mol. Microbiol. 69(2), 418–435 (2008)
10. Adiego, J., Fuente, P.D.L.: Merging prediction by partial matching with structural contexts model, p. 522 (2004)
11. Berners-Lee, T., Fielding, R., Frystyk, H.: Hypertext transfer protocol-http/1.0. Technical report (1996)
12. Woods, J.: PPP deflate protocol (1996)
13. Boutell, T.: PNG (portable network graphics) specification version 1.0 (1997)
14. Deutsch, P., Gailly, J.-L.: Zlib compressed data format specification version 3.3. Technical report (1996)
15. Zhu, W., Xu, J., Ding, W., Shi, Y.: Adaptive LZMA-based coding for screen content. In: Picture Coding Symposium, pp. 373–376 (2013)
16. Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Elsevier Science Publishers Ltd. (2007)
17. Culler, M., Dunfield, N.M., Weeks, J.R.: Snappy, a computer program for studying the geometry and topology of 3-manifolds (2017)
18. Pavlov, I.: Lzma sdk (software development kit) (2007)
19. Reinhold, L.M.: Quicklz website
20. Oberhumer, M.F.X.J.: Lzo-a real-time data compression library (2008). http://www.oberhumer.com/opensource/lzo/
21. Varsaki, A., Afendra, A.S., Vartholomatos, G., Tegos, G., Drainas, C.: Production of ice nuclei from two recombinant zymomonas mobilis strains employing the inaZ gene of pseudomonas syringae. Biotechnol. Lett. 20(7), 647–651 (1998)
22. Lembayung, W.: Comparative analysis on the izarc compression process and 7-zip (2011)

A Caching-Based Parallel FP-Growth in Apache Spark

Zhicheng Cai(B), Xingyu Zhu, Yuehui Zheng, Duan Liu, and Lei Xu

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
[email protected]

Abstract. Association-rule-based recommendation is widespread in many big data applications, which need quick responses to improve the user experience. Spark is a widely used distributed computing platform that accelerates the processing of large-scale distributed data. Developing appropriate distributed algorithms for Spark is essential to decrease the processing time of distributed recommendation. The existing FP-Growth in Spark is a popular parallel recommendation method, but it only achieves its best performance when the memory of the machines can accommodate all intermediate Resilient Distributed Datasets (RDDs). However, the memory of many practical data centers is still not large enough for large data sets. Therefore, in this paper, a caching-based parallel FP-Growth is proposed, which consists of an integer-based sorting and an RDD-caching strategy to improve efficiency. Experimental results show that the proposal decreases the execution time by 32.37% on average compared with the existing parallel FP-Growth in Spark. Furthermore, the impacts of some important parameters on the performance of the proposal are analyzed through numerous realistic experiments in Spark.

Keywords: Spark · Parallel FP-Growth · Caching strategy

1 Introduction

The main objective of recommendation systems is to recommend appropriate items such as products, movies, and the like to consumers [15,19]. Association rule mining is one of the widespread methods for recommendation systems [3,13] and has been adopted in diverse fields. Mining frequent itemsets is essential for generating association rules from training data. There are many single-machine association-rule mining algorithms such as Apriori [2] and FPGrowth [8,27]. FPGrowth is a widely used frequent-itemset mining algorithm, which requires a tremendous amount of memory when a transaction consists of quite a few items. In some cases, the data size is so large that the FPTree cannot be accommodated in the memory of a single machine, which is likely to result in failures. Moreover, single-machine-based methods cannot fulfill the requirement of timely responses when they confront a large amount of data.


Parallel algorithms have been developed to reduce the memory usage and computation time of frequent-itemset mining algorithms. MapReduce, which has been widely used in many fields, is a much more flexible programming model compared with multiprocessing systems [6]. MapReduce-based parallel versions of the FP-Growth [25] and Apriori [10] algorithms have been proposed to accelerate the processing of distributed data. Although the MapReduce programming model provides a parallel computing environment, the intermediate results of MapReduce are stored on disk, which inevitably leads to many time-consuming I/O operations. Apache Spark [1,23] compensates for this defect by storing intermediate data in memory. Therefore, Spark versions of parallel FP-Growth (PFPGrowth) [7] and Apriori [12,16] have been developed. Existing works on frequent-itemset mining in Spark mainly focus on the parallelism itself, without considering the impact of caching strategies and the influence of Spark parameters on the performance.

PFPGrowth divides the data into different partitions, generating one separate FPTree for each partition. Frequent itemsets are then generated by mining each FPTree and collected to recommend items. The traditional parallel FP-Growth in Spark assumes that intermediate Resilient Distributed Datasets (RDDs) are not released from memory whenever they are needed. However, mining an FPTree is very memory consuming; therefore, subsequent tasks mining new FPTrees will occupy the memory used to store the intermediate RDDs generated during the mining of previous FPTrees, which leads to recomputing these intermediate RDDs. In this paper, a caching strategy is proposed to avoid recomputing critical intermediate RDDs. For example, the RDD of transactions of the training data and the RDD of frequent itemsets, which will be used multiple times, are cached in disk or memory (a minimal usage-level sketch is given at the end of this introduction). In the original PFPGrowth, single items are sorted according to their frequencies in String format and transformed to integer format after the sorting, which is memory consuming. Therefore, in this paper, the String-to-Integer transformation is adjusted to be done before the sorting, which reduces memory consumption significantly. Moreover, less attention has been paid to the influence of the number of data partitions on performance. Therefore, experiments have been done to evaluate the performance of processing the same amount of data under different numbers of data partitions. The main contributions of this paper are as follows:

i. A caching strategy is developed to cache appropriate RDDs, eliminating their recomputation and reducing the execution time.
ii. Extensive experiments were carried out to analyze the impact of dividing the same size of input data into different numbers of data partitions on the Spark performance, which is used to guide the selection of the number of data partitions.

The rest of the paper is organized as follows. Related works are presented in Sect. 2. Section 3 describes the problem, followed by the introduction in Sect. 4. Sections 5 and 6 present the proposed recommendation methods and experimental results, respectively. The paper is concluded in Sect. 7.
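The following minimal PySpark sketch illustrates the caching idea at the user level; the HDFS path, partition count, and support threshold are placeholders, and the integer-based sorting is a change inside the FP-Growth implementation itself rather than something visible in this code. The point is simply that RDDs reused across stages are persisted explicitly instead of being recomputed.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="CachingPFPGrowth")

# placeholder input path and partition count
lines = sc.textFile("hdfs:///data/transactions.txt", minPartitions=48)
transactions = lines.map(lambda line: line.strip().split(" "))

# the transaction RDD is scanned to count item frequencies and again when
# the conditional FP-trees are built, so cache it (spilling to disk when
# memory is insufficient) instead of recomputing it from the raw input
transactions.persist(StorageLevel.MEMORY_AND_DISK)

model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=48)

# frequent itemsets are reused when association rules and recommendations
# are generated, so they are cached as well
freq = model.freqItemsets()
freq.persist(StorageLevel.MEMORY_AND_DISK)
print(freq.count())
```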


2 Related Works

There are many recommendation algorithms, such as Apriori [2], FPGrowth [8] and other collaborative-filtering-based algorithms [14]. Traditional single-machine recommendation algorithms cannot process distributed data efficiently, so it is vital to develop distributed algorithms [5]. For high-performance computing systems such as multiprocessing systems, a parallel Apriori algorithm was investigated by Ye et al. [21] to increase the capacity of single-machine algorithms. A message-passing-interface based parallel FPGrowth algorithm was developed by Yu et al. [22], in which the data set is divided by evaluating the width and depth of the FPTrees to balance the size of the generated FPTrees.

The MapReduce programming model is more flexible and scalable than many other parallel and distributed computing platforms [26]. Therefore, different MapReduce-based recommendation algorithms have been developed in the literature. A MapReduce-based parallel version of the FP-Growth algorithm was proposed by Zhang et al. [25]. Xun et al. [20] investigated a parallel FP-Growth on MapReduce clusters which places similar transactions into the same partition to reduce excessive redundant conditional pattern bases. Lin et al. [10] developed an Apriori-based parallel algorithm on MapReduce, and a MapReduce-based collaborative filtering recommendation algorithm was proposed by Li et al. [9]. However, the MapReduce model stores intermediate results on disk, which is time-consuming and not suitable for multi-step or iterative computing.

Apache Spark is a newer distributed computing platform [1,23] that stores intermediate data in memory to avoid frequent disk I/O. Many recommendation algorithms have been converted to parallel versions and deployed on Spark. For instance, Qiu et al. [12] investigated a parallel Apriori on Spark which outperforms the MapReduce version by around 25 times. A hybrid parallel Apriori on Spark was proposed by Sethi et al. [16], which avoids scanning the complete dataset in each iteration. A distributed Apriori-like frequent-itemset mining algorithm on Spark was developed by Zhang et al. [26], which reduces the number of candidate sets by applying a matrix-based pruning approach. Winlaw et al. [19] proposed a method to accelerate the collaborative filtering optimization on Spark, consisting of an efficient line-search technique that requires only one pass over the distributed data. Gassama et al. [7] developed a parallel version of FPGrowth that scales flexibly and processes distributed data efficiently. Most of the existing work on parallel FPGrowth and Apriori focuses on converting single-machine algorithms to parallel ones. However, caching strategies are crucial to parallel algorithms in Spark, and they have been ignored by existing work.

Spark parameters exert a tremendous influence on algorithm performance and can be tuned by users according to the characteristics of specific applications. A case in point is that the execution time differs when the same amount of data is divided into different numbers of data partitions for the same algorithm [24]. Wang et al. [18] present a method to predict the performance of different applications on Spark by running a small fraction of the original data. Petridis et al. [11] investigate the impact of the most important tunable Spark parameters on application performance. A machine-learning-based method is proposed by Wang et al. [17] to set the Spark configuration automatically. However, the impact of dividing the same amount of input data into different numbers of partitions on Spark performance has not been investigated. Therefore, in this paper, caching strategies are developed to use limited memory appropriately and improve the performance of parallel FPGrowth on Spark. Moreover, the influence of the number of data partitions on the execution time and memory usage is analyzed through extensive experiments.

3 Problem Description

Let I = {i1, i2, ..., in} be the set of distinct items, where ik is the k-th item and n is the number of items. T = {t1, t2, ..., tm} represents the set of transactions, in which tw is the w-th transaction and a subset of I, and m is the number of transactions. A set of items is called an itemset. The ratio of the number of transactions in T containing an itemset X to m is referred to as the frequency (support) of X and is denoted support(X). An association rule is a pair of itemsets of the form X ⇒ Y, where X is called the antecedent and Y the consequent. X ⇒ Y means that the appearance of X usually leads to the appearance of Y. The confidence of an association rule is equal to the ratio support(X ∪ Y)/support(X).

To fulfill the requirement of quick response, it is vital to accelerate the existing parallel FP-Growth. The current parallel FP-Growth in Spark assumes that there is enough memory to support the computation, so it cannot adapt to clusters composed of computers with little memory. When the same RDD needs to be used multiple times, it is recomputed on the second use because it was released after the first use, which is time-consuming. Consequently, the objective of this paper is to develop appropriate caching strategies to accelerate the execution.
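As a small worked example with made-up transactions (not the paper's dataset), consider m = 4 transactions over the items {a, b, c}:

```latex
% Hypothetical transactions: T = { {a,b,c}, {a,b}, {a,c}, {b,c} }, m = 4
\begin{align*}
support(\{a,b\}) &= \frac{2}{4} = 0.5, \qquad support(\{a\}) = \frac{3}{4} = 0.75,\\
confidence(\{a\} \Rightarrow \{b\}) &= \frac{support(\{a\} \cup \{b\})}{support(\{a\})}
                                    = \frac{0.5}{0.75} \approx 0.67 .
\end{align*}
```

With, say, minSup = 0.5 and minConf = 0.6, the rule {a} ⇒ {b} would therefore be kept.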

4 Existing Parallel FP-Growth in Spark

The traditional FP-Growth is composed of two steps [27]. In the first step, the data is scanned once to calculate the frequency of each item, and items are sorted in descending order of frequency. The data is then scanned a second time to eliminate items whose frequencies are smaller than a threshold and to sort each transaction according to the frequencies. In the second step, an FP-Tree is constructed from the sorted transactions and mined recursively to generate the frequent itemsets. The existing parallel FP-Growth [1,7] in Spark is composed of the following steps:

S1: Single frequent item generation and sorting. As shown in Stages 0 and 1 of Fig. 1, the data is first divided into p partitions and stored in a distributed storage system such as HDFS. Each partition contains many transactions, each of which is split into an array of items and then flattened into (item, 1) pairs. Next, such pairs on different partitions are reduced by item to calculate the frequency of each item, and the results are stored as (item, frequency) pairs. Items with frequencies (support) lower than a threshold minSup are eliminated. Finally, items are sorted in descending order of frequency and the sorted results are stored in SortList. Meanwhile, a map from each original item to its rank in the SortList is generated.

S2: Conditional-pattern-base generation. According to a given partition function, items are divided into p groups, each of which has a unique group-id (p is the number of data partitions). For each data partition, each transaction is converted into multiple conditional pattern bases. The items of each transaction are scanned from the last to the first. Whenever an item belongs to a new group (a conditional pattern base for the current transaction has not yet been generated for this group), a new conditional pattern base consisting of the items from the first item to the current item is generated for this group. Finally, one conditional pattern base is generated for each group. The generated conditional pattern bases are stored in ConditionTransRDD[Map(group-id, conditional pattern bases)], as shown in Stage 2 of Fig. 1.

S3: Frequent-itemset based association rule generation. Conditional pattern bases are reduced by group-id into different data partitions, each of which consists of the bases with the same group-id. An FP-tree is first generated for each group and stored in FPTreeRDD[group-id, FPTree], as shown in Stage 3 of Fig. 1. Then, each FPTree is mined to generate frequent itemsets FrequentItemRDD[FreqItemset], where FreqItemset is a class that stores a frequent itemset and its support. Next, based on FrequentItemRDD, association rules AssociationRDD[(antecedent, (consequent, union-frequency))] are obtained by splitting each frequent itemset, where antecedent and consequent are itemsets and union-frequency is equal to support(antecedent ∪ consequent). To calculate the confidence of each association rule, support(antecedent) is needed. It has been proved that all antecedents are contained in the original FrequentItemRDD. Therefore, FrequentItemRDD is transformed into a new join-available PairFrequentItemRDD[(frequent-itemset, frequency = support(frequent-itemset))] by unfolding the attributes of the class FreqItemset. Details can be found in Stages 3 and 4 of Fig. 1.

S4: Computing the confidences of association rules. Stage 5 of Fig. 1 shows that PairFrequentItemRDD and AssociationRDD are first joined and reduced according to the keys frequent-itemset and antecedent. The reduced results are stored in AssociationConfRDD[(antecedent, (consequent, union-frequency), frequency)]. Next, the confidence degrees of the association rules are calculated by mapping AssociationConfRDD to AssociationConfDoubleRDD[(antecedent, consequent, conf)], in which conf is equal to union-frequency/frequency. If the conf of an association rule is smaller than minConf, the rule is eliminated.
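A minimal Spark (Scala) sketch of step S1, assuming space-separated transactions on HDFS; the path, partition number and support threshold are illustrative values, not taken from the paper:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleItemFrequency {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S1-FrequentItems"))
    val p = 6            // number of data partitions (assumed)
    val minSup = 1000L   // absolute support threshold (assumed)

    // Stage 0/1: split transactions, flatten to (item, 1), reduce by item
    val transactions = sc.textFile("hdfs:///data/transactions.txt", p)
    val itemFreq = transactions
      .flatMap(_.split(" ").toSet)      // distinct items of each transaction
      .map(item => (item, 1L))
      .reduceByKey(_ + _)
      .filter(_._2 >= minSup)           // eliminate infrequent items

    // SortList: items in descending order of frequency, plus an item -> rank map
    val sortList = itemFreq.collect().sortBy(-_._2).map(_._1)
    val rank = sortList.zipWithIndex.toMap
    println(rank.take(10))
    sc.stop()
  }
}
```

The (item, 1) pairs, the reduceByKey step and the descending sort mirror the word-count-style pattern described in S1.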

5 Proposed Caching Based Recommendation

For one thing, the transactions of the input data are usually stored in String format. Although the original parallel FP-Growth algorithm builds a map from items in String format to integer rank values in Stage 2, it is still very memory-consuming in Stages 0 and 1 to operate on String data directly. Therefore, a storage-type transformation is proposed first to preprocess the original data. For another, as shown in Stage 3 of Fig. 1, in the original algorithm FPTreeRDD is first constructed and used to generate FrequentItemRDD. Then, FrequentItemRDD is used to generate PairFrequentItemRDD. Next, AssociationRDD is generated from the same FrequentItemRDD, which forms Stage 4. If the tasks of Stage 3 are scheduled and executed first, all three previous RDDs are computed one by one. When the memory is not large enough, it cannot accommodate all the intermediate RDDs, including FrequentItemRDD, which is released to support the calculation of the following tasks. When the tasks of Stage 4 are scheduled later and want to use the intermediate FrequentItemRDD to generate AssociationRDD, FrequentItemRDD has already been removed from memory and needs to be recomputed. The recomputation of FrequentItemRDD involves recomputing many previous RDDs, even RDDs of previous Stages such as ConditionTransRDD, which is time-consuming.

RDDs in Spark can be cached (including persisted) in memory or on disk to reuse important RDDs and avoid unnecessary recomputation. However, caching consumes additional time, and it is not sensible to cache all RDDs. Therefore, appropriate caching strategies are essential for Spark applications. In this paper, a caching strategy is developed to accelerate the parallel FP-Growth algorithm. Finally, a set-matching based recommendation method is proposed to make recommendations from user data. Details of the proposal are as follows.

Fig. 1. RDD transformation process of existing parallel FP-growth

5.1 Integer Based Sorting

The main objective of Stages 0 and 1 of the parallel FP-Growth is to read the transactions, calculate the frequencies of single items and sort them. The original training data is stored in String format, which makes the subsequent computation memory-consuming. For example, when the original input data in HDFS is 1.5 GB, the total memory consumption of the split transactions in SplitTransactionRDD is about 13 GB. Although the input training data is in String format, each item is a number. Therefore, when TransactionRDD is mapped to SplitTransactionRDD by splitting each String into an Array of Strings, each String value is transformed into an integer value, i.e., each transaction is stored as an Array of integers. The new integer-based SplitTransactionRDD consumes only approximately 1 GB. In Spark, execution and storage share the total memory, so reducing the memory consumption of SplitTransactionRDD helps accelerate the later sorting.
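A short Spark (Scala) sketch of this conversion, reusing the `sc` and `p` from the earlier sketch; the file path is illustrative:

```scala
// Parse every transaction to Array[Int] right after reading, so that all later
// stages (sorting, grouping, shuffling) operate on compact integers, not Strings.
val transactionRDD = sc.textFile("hdfs:///data/train.txt", p)
val splitTransactionRDD = transactionRDD.map(_.trim.split(" ").map(_.toInt))
```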

5.2 Proposed Caching Strategy

As shown in Fig. 1, SplitTransactionRDD is used first to sort items in Stage 1 and then to generate conditional FPTrees in Stage 2. In Spark, when the memory is not large enough, intermediate RDDs are evicted to accommodate later RDDs if they are not cached. Consequently, if SplitTransactionRDD is not cached after it is first obtained, it is removed from memory in later steps, and when it is needed again in Stage 2 it has to be recomputed. In this paper, SplitTransactionRDD is cached when it is first generated (the cache operation here covers both the cache and persist operations in Spark) in one of two ways: Memory-and-Disk based caching or Disk based caching. The Memory-and-Disk strategy uses as much memory as possible to cache RDDs, while the Disk strategy only uses the disk.

In Fig. 1, FPTree construction and frequent-itemset mining are the two most time-consuming steps. The obtained frequent itemsets are stored in the intermediate FrequentItemRDD, which is then used twice, to generate PairFrequentItemRDD and AssociationRDD in Stages 3 and 4, respectively. Because the computation of PairFrequentItemRDD and AssociationRDD belongs to different Stages, they cannot be computed together. After FrequentItemRDD is used in Stage 3, it is removed from memory to make room for later computation when there is not enough memory to hold all intermediate RDDs. Therefore, in this paper, FrequentItemRDD is cached when it is first generated. Moreover, the number of frequent itemsets is counted, which triggers the computation of Stage 3 first in a separate Spark job. Finally, in Stage 4, association rules are generated from the cached FrequentItemRDD without recomputing it, which speeds up the execution. The RDD transformations of the caching-based parallel FPGrowth are shown in detail in Fig. 2.

Fig. 2. RDD transformation process of proposed caching-based parallel FP-growth
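A hedged Spark (Scala) sketch of the two caching choices, assuming the `splitTransactionRDD` and a `frequentItemRDD` produced by the earlier steps; `StorageLevel.MEMORY_AND_DISK` and `StorageLevel.DISK_ONLY` are the standard Spark storage levels, and `count()` is used only to force materialization:

```scala
import org.apache.spark.storage.StorageLevel

// Disk-based caching (CPFPGrowth-D): keep executor memory free for FPTree mining.
splitTransactionRDD.persist(StorageLevel.DISK_ONLY)
splitTransactionRDD.count()                       // materialize the cached RDD

// Memory-and-Disk caching (CPFPGrowth-MD) would instead use:
// splitTransactionRDD.persist(StorageLevel.MEMORY_AND_DISK)

// Cache the frequent itemsets before they are reused by Stages 3 and 4,
// and trigger Stage 3 in its own job so the cached result is available later.
// frequentItemRDD.persist(StorageLevel.DISK_ONLY)
// frequentItemRDD.count()
```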

5.3 Set-Matching-Based Recommendation

The existing parallel FP-Growth does not include a recommendation method. Therefore, a set-matching based recommendation method is developed in this paper. As shown in Fig. 3, the user data is first read from a distributed storage system such as HDFS. Each user transaction is split and transformed into an integer-valued array, which is stored in UserTransactionRDD[Array[item:Int]]. To quickly match user transactions with the antecedents of association rules, each user transaction is mapped to a set of integers, generating UserSetRDD[Set[item:Int]]. Then, if a user transaction is a subset of the antecedent of a rule in CombinedRulesRDD[(antecedent, List[(consequent, Confidence)])], a new recommendation result (user-transaction, List[(consequent, Confidence)]) is generated.
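A minimal Spark (Scala) sketch of the set-matching step, following the paper's subset test (user transaction as a subset of the rule antecedent); it assumes a `combinedRulesRDD` of type RDD[(Set[Int], List[(Set[Int], Double)])] produced by the earlier steps, and the names are illustrative rather than the authors' code:

```scala
// Collect the combined rules to the driver and broadcast them,
// assuming the rule list is small enough to fit in driver memory.
val ruleList = combinedRulesRDD.collect()
val rulesBc = sc.broadcast(ruleList)

val userSetRDD = sc.textFile("hdfs:///data/user.txt", p)
  .map(_.trim.split(" ").map(_.toInt).toSet)

val recommendResultRDD = userSetRDD.map { user =>
  val hits = rulesBc.value
    .filter { case (antecedent, _) => user.subsetOf(antecedent) }  // subset test from the paper
    .flatMap(_._2)                                                 // (consequent, confidence) pairs
  (user, hits)
}
```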


Fig. 3. Set-matching-based recommendation

5.4 Description of the Proposal

The caching-based parallel FPGrowth (CPFPGrowth) is described in Algorithm 1. First, in Steps 2 to 8, the frequencies of items are counted and items are sorted by frequency. Conditional pattern bases are then generated in Step 9, and an FPTree is constructed for each group in Step 10. The FPTrees are mined to generate frequent itemsets and association rules in Steps 11 to 14. Next, in Steps 15 to 19, the frequencies of itemsets and the association rules are joined to compute the confidence degrees of the association rules. Finally, recommendations are generated for the user data. The CPFPGrowth variants with the Memory-and-Disk and Disk based caching strategies are referred to as CPFPGrowth-MD and CPFPGrowth-D, respectively.

6 Performance Evaluation

Experiments are performed to compare the execution times of the caching-based parallel FPGrowth with the traditional PFPGrowth, whose source code is available on the public Spark website. The source code of the proposed caching-based parallel FPGrowth is available online [4]. Two data sets are used to evaluate the proposal: a training set TD and a test set UD. TD is composed of 169208 records (transactions) containing 526765 items in total. Each line of TD and UD is a transaction consisting of multiple items. To evaluate the effectiveness of our methods, a private Spark cluster is built on three physical machines, in which each node has two cores and 5 GB of memory. Meanwhile, a cluster with six nodes is established on virtual machines rented from public clouds, each configured with 4 cores and 62 GB of memory. Spark 2.3.0 and Hadoop 2.7.5 are installed on each node.


Algorithm 1. Caching-based Parallel FPGrowth Recommendation
Input: Training data TD, user data UD, and partition number p
1  begin
2    Read TransactionRDD = textFile(TD, p);
3    SplitTransactionRDD = TransactionRDD.Map(line ⇒ line.split(" ").map(_.toInt));
4    SplitTransactionRDD.Persist().Count();
5    ItempairRDD = SplitTransactionRDD.FlatMap(t ⇒ t.toSet).Map(v ⇒ (v, 1));
6    ReducedItemRDD = ItempairRDD.ReduceByKey(_ + _);
7    FrequentItemArray = ReducedItemRDD.Filter(_._2 ≥ minSup).Collect();
8    Sort FrequentItemArray in descending order of frequency and get the SortList of items;
9    ConditionTransRDD = Generate conditional pattern bases from SplitTransactionRDD;
10   FPTreeRDD = Aggregate ConditionTransRDD by group-id and construct an FPTree for each group separately;
11   FrequentItemRDD = Mine each FPTree to generate frequent itemsets;
12   PairFrequentItemRDD = Transform FrequentItemRDD into join-available pairs;
13   PairFrequentItemRDD.Count();
14   AssociationRDD = Generate association rules from FrequentItemRDD;
15   AssociationConfRDD = Join and reduce AssociationRDD and PairFrequentItemRDD;
16   AssociationConfDoubleRDD = Divide the union-frequency by the frequency to calculate the confidence of each rule;
17   AssociationConfDoubleRDD.Filter(_._3 ≥ minConf).Count();
18   CombinedRulesRDD = Combine rules with the same antecedent;
19   RuleList = CombinedRulesRDD.Collect();
20   Read UserDataRDD = textFile(UD, p);
21   UserTransactionRDD = UserDataRDD.Map(line ⇒ line.split(" ").map(_.toInt));
22   UserSetRDD = UserTransactionRDD.Map(_.toSet);
23   RecommendResultRDD = Recommend items for UserSetRDD by checking whether each user transaction is a subset of the antecedent of the association rules;
24   return RecommendResultRDD

6.1 Results Under Different minSups

The execution times of the different algorithms under different minSups are shown in Table 1, which shows that the proposed CPFPGrowth-MD and CPFPGrowth-D are faster than the traditional PFPGrowth. Moreover, as minSup decreases, the proposals save much more execution time. The reason is that more items with frequencies (support) higher than minSup are kept for later analysis as minSup decreases. For instance, Fig. 4 shows the size of the shuffle data in Stage 2 of Fig. 2, which demonstrates that the size of the shuffle data increases as minSup decreases. Furthermore, Table 1 also illustrates that CPFPGrowth-D has shorter execution times than CPFPGrowth-MD. The cause is that CPFPGrowth-MD uses more memory space to cache intermediate RDDs, leaving less memory for computation, which makes the Memory-and-Disk based strategy consume more time than the Disk based caching. Therefore, caching RDDs in memory to accelerate reuse and using memory for computation should be balanced according to the characteristics of the application. For example, for the CPFPGrowth algorithms, the construction and mining of FPTrees are very memory-consuming. If the memory cannot accommodate all intermediate RDDs, it is beneficial to use the Disk-based strategy to cache the RDDs that will be used multiple times.

Table 1. Average execution times under different minSups (Min)

Algorithm/minSup   0.3   0.25   0.20   0.15   0.10
PFPGrowth          1.2   1.3    4.6    15     49
CPFPGrowth-D       0.7   1.1    3.3    11     35
CPFPGrowth-MD      1.3   1.1    3.6    11     48

Fig. 4. Size of shuffle data generated by CPFPGrowth (MB)

6.2 Results Under Different Partition Numbers

For a given amount of input data, dividing the data into different numbers of partitions is crucial to performance. The computation times of the compared algorithms under different partition numbers p are shown in Fig. 5, which demonstrates that the proposed CPFPGrowth-MD and CPFPGrowth-D have shorter execution times than the original PFPGrowth. The average execution time is decreased by 32.37% by caching essential RDDs appropriately. The reason lies in the fact that essential intermediate RDDs are cached to avoid recomputing them when the memory is not large enough to hold all intermediate RDDs. Therefore, the proposed caching-based methods are more suitable for execution on data centers without enough memory to accommodate all RDDs.


Fig. 5. Execution times under different partition numbers (Min)

Fig. 6. Size of shuffle data under different partition numbers (MB)

Figure 5 also illustrates that the number of data partitions exerts an enormous influence on the execution time. The reason is that more conditional pattern bases are generated when the number of partitions increases. For each transaction, items are scanned from the last to the first; whenever an item belongs to a new partition, a new conditional pattern base consisting of the items from the first item to the current item is generated. Consequently, a larger partition number means more conditional pattern bases, which lead to more shuffle data, as shown in Fig. 6. Processing more data usually takes much longer if the tasks cannot be processed in parallel. Only when the number of data partitions is smaller than the number of executors can increasing the number of data partitions decrease the total execution time by executing tasks simultaneously. However, the test data center only has 6 executors, so increasing the number of data partitions beyond that only leads to longer execution times. Therefore, the number of data partitions should be as small as possible but larger than the number of executors. On the contrary, the number of data partitions cannot be too small either. A smaller number of partitions means there is a larger FPTree to process in each partition.


Fig. 7. Size of pattern bases per partition under different partition numbers (MB)

For instance, Fig. 7 shows the size of the conditional pattern bases per partition, which indicates that, for smaller partition numbers, more conditional pattern bases are used to construct the FPTree in each partition. Because constructing and mining FPTrees is very memory-consuming, a failure event occurs when the memory is not large enough to handle the processing of a large FPTree. Therefore, the number of data partitions should be determined by considering the size of the FPTrees, the size of the shuffle data and the number of executors together. Consequently, the number of partitions should be as small as possible but larger than the number of executors, and large enough that each FPTree fits into the memory of the cluster.
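As a rough illustration of this guideline (our own sketch, not from the paper), the partition number could be derived from the executor count and a per-partition memory budget for the conditional pattern bases; all numbers below are assumptions:

```scala
// Heuristic sketch: at least as many partitions as executors, but no more than
// needed to keep each group's conditional pattern bases within a memory budget.
val numExecutors = 6                  // assumed cluster size
val totalPatternBaseMB = 2400.0       // assumed total size of the pattern bases
val perPartitionBudgetMB = 400.0      // assumed memory budget per FPTree

val p = math.max(numExecutors,
                 math.ceil(totalPatternBaseMB / perPartitionBudgetMB).toInt)
println(s"chosen partition number p = $p")   // 6 in this example
```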

7 Conclusion

FPTree construction and mining in traditional parallel FP-Growth are very time-consuming, so the parallel FP-Growth in Spark only achieves its best performance in data centers with enough memory to accommodate all intermediate RDDs. However, many data centers do not have such large memory. In this paper, the traditional parallel FP-Growth is improved to fit such data centers by adding appropriate caching strategies. Experimental results show that the proposed caching-based PFPGrowth decreases the execution time by 32.37% on average. Moreover, the influences of the minimum support degree and the number of data partitions on the performance of the proposal are evaluated by extensive experiments, and a principle for determining the number of data partitions is derived from the experiments. Since the proposal fails when the memory cannot accommodate an FPTree, decreasing the memory consumption of FPTree construction is a promising direction for future work.

Acknowledgments. Zhicheng Cai is supported by the National Natural Science Foundation of China (Grant No. 61602243) and the Natural Science Foundation of Jiangsu Province (Grant No. BK20160846). Lei Xu is supported by the National Natural Science Foundation of China (No. 61671244). Duan Liu is supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province.


References 1. Spark: Lightning-fast unified analytics engine. http://spark.apache.org/. Accessed 14 June 2018 2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: International Conference on Very Large Data Bases, pp. 487–499 (1994) 3. Agrawal, R., Swami, A.: Mining association rules between sets of items in large databases. In: ACM SIGMOD International Conference on Management of Data, pp. 207–216 (1993) 4. Cai, Z., Zhu, X., Zheng, Y.: Source codes of the proposed cachingbased parallel FP-Growth. https://github.com/czcnjust/ElasticSim/blob/master/ cachingbasedFPGrowth.zip. Accessed June 14 2018 5. Chung, H., Nah, Y.: Performance comparison of distributed processing of large volume of data on top of Xen and Docker-based virtual clusters. In: Candan, S., Chen, L., Pedersen, T.B., Chang, L., Hua, W. (eds.) DASFAA 2017. LNCS, vol. 10177, pp. 103–113. Springer, Cham (2017). https://doi.org/10.1007/978-3-31955753-3 7 6. Dean, J., Ghemawat, S.: Mapreduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010) 7. Gassama, A.D.D., Camara, F., Ndiaye, S.: S-FPG: a parallel version of FP-growth algorithm under apache spark. In: IEEE International Conference on Cloud Computing and Big Data Analysis, pp. 98–101 (2017) 8. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: ACM SIGMOD International Conference on Management of Data, pp. 1–12 (2000) 9. Li, C., He, K.: CBMR: an optimized mapreduce for item based collaborative filtering recommendation algorithm with empirical analysis. Concurr. Comput. Pract. Exp. 29(10), 1–7 (2017) 10. Lin, M.Y., Lee, P.Y., Hsueh, S.C.: Apriori-based frequent itemset mining algorithms on mapreduce. In: ICUIMC 2012, pp. 76:1–76:8. ACM, New York (2012) 11. Petridis, P., Gounaris, A., Torres, J.: Spark parameter tuning via trial-and-error. In: Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., Vellasco, M. (eds.) INNS 2016. AISC, vol. 529, pp. 226–237. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-47898-2 24 12. Qiu, H., Gu, R., Yuan, C., Huang, Y.: Yafim: a parallel frequent itemset mining algorithm with spark. In: Parallel and Distributed Processing Symposium Workshops, pp. 1664–1671 (2014) 13. Rathee, S., Kashyap, A.: Adaptive-miner: an efficient distributed association rule mining algorithm on spark. J. Big Data 5(1), 6 (2018) 14. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: International Conference on World Wide Web, pp. 285–295 (2001) 15. Schafer, J.B., Konstan, J., Riedl, J.: Recommender systems in e-commerce. In: ACM Conference on Electronic Commerce, pp. 158–166 (1999) 16. Sethi, K.K., Ramesh, D.: HFIM: a spark-based hybrid frequent itemset mining algorithm for big data processing. J. Supercomput. 73, 1–17 (2017) 17. Wang, G., Xu, J., He, B.: A novel method for tuning configuration parameters of spark based on machine learning. In: IEEE International Conference on High PERFORMANCE Computing and Communications; IEEE International Conference on Smart City; IEEE International Conference on Data Science and Systems, pp. 586–593 (2017)


18. Wang, K., Khan, M.M.H.: Performance prediction for apache spark platform. In: IEEE International Conference on High PERFORMANCE Computing and Communications, 2015 IEEE International Symposium on Cyberspace Safety and Security, and 2015 IEEE International Conference on Embedded Software and Systems, pp. 166–173 (2015) 19. Winlaw, M., Hynes, M.B., Caterini, A., Sterck, H.D.: Algorithmic acceleration of parallel ALS for collaborative filtering: speeding up distributed big data recommendation in spark. In: IEEE International Conference on Parallel and Distributed Systems, pp. 682–691 (2016) 20. Xun, Y., Zhang, J., Qin, X., Zhao, X.: Fidoop-dp: data partitioning in frequent itemset mining on hadoop clusters. IEEE Trans. Parallel Distrib. Syst. 28, 101–114 (2017) 21. Ye, Y., Chiang, C.C.: A parallel apriori algorithm for frequent itemsets mining. In: International Conference on Software Engineering Research, Management and Applications, pp. 87–94 (2006) 22. Yu, K.-M., Zhou, J., Hsiao, W.C.: Load balancing approach parallel algorithm for frequent pattern mining. In: Malyshkin, V. (ed.) PaCT 2007. LNCS, vol. 4671, pp. 623–631. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-739401 63 23. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Usenix Conference on Hot Topics in Cloud Computing, p. 10 (2010) 24. Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016) 25. Zhang, D., Zhang, D., Zhang, D., Zhang, M., Chang, E.Y.: PFP: parallel FP-growth for query recommendation. In: ACM Conference on Recommender Systems, pp. 107–114 (2008) 26. Zhang, F., Liu, M., Gui, F., Shen, W., Shami, A., Ma, Y.: A distributed frequent itemset mining algorithm using spark for big data analytics. Cluster Comput. 18(4), 1493–1501 (2015) 27. Zhou, L., Wang, X.: Research of the FP-growth algorithm based on cloud environments. J. Softw. 9(3), 676 (2014)

Contextual-Field Supported Iterative Representation for Face Hallucination Kangli Zeng1 , Tao Lu1(B) , Xiaolin Li1(B) , Yanduo Zhang1 , Li Peng2 , and Shenming Qu3 1

Hubei Key Laboratory of Intelligent Robot, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China [email protected], [email protected] 2 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China 3 School of Software, He’nan University, Kaifeng 475004, China

Abstract. Face hallucination is a special super-resolution (SR) algorithm that enhances the resolution and quality of low-resolution (LR) facial images. To reconstruct the fine high-frequency information that is lost during image degradation, learning-based face SR methods rely on accurate prior information from training samples. In this paper, we propose a contextual-field supported iterative representation algorithm for face hallucination to discover an accurate prior. Different from traditional local-patch based methods, we use contextual-field supported sampling instead of local receptive-field patch sampling to enrich the prior information. Then, two weighting matrices are introduced to constrain the reconstruction-error term and the representation-coefficient term simultaneously; one matrix ameliorates the heteroscedasticity of real data and the other improves the stability of the solution. Finally, we use iterative representation learning to iteratively update the supported dictionary pairs and their representation coefficients to refine accurate high-frequency information. The experimental results show that the proposed approach outperforms some state-of-the-art face hallucination methods on the FERET and CMU-MIT face databases in terms of both subjective and objective evaluation indexes.

Keywords: Face hallucination · Iterative representation · Contextual information · Dictionary learning

This work is supported by the National Natural Science Foundation of China (61502354, 61501413, 61671332, 41501505, U1404618), the Natural Science Foundation of Hubei Province of China (2018ZYYD059, 2015CFB451, 2014CFA130, 2012FFA099, 2012FFA134, 2013CF125), the Science and Technique Development Program of He’nan (172102210186), Scientific Research Foundation of Wuhan Institute of Technology (K201713), Graduate Education Innovation Foundation of Wuhan Institute of Technology (CX2017069, CX2017070).

1 Introduction

In video surveillance scenarios, face images are often low-resolution (LR) and of poor image quality. Accurate recognition with such LR face images is very challenging, so there is an urgent need to improve the resolution and quality of LR face images. Face hallucination is a special kind of SR algorithm that reconstructs high-resolution (HR) images from one or multiple LR images, and it has important applications in security monitoring, computer vision and other fields.

Recently, learning-based face SR, which mainly learns the mapping relationship between HR and LR images from training samples [9], has attracted more and more attention, because such methods can generate details from the training samples that cannot be found in the LR input. Considering the special characteristics of face images, Baker and Kanade [1] first proposed a learning-based face SR approach for reconstructing HR face images, which learned a prior on the distribution of image gradients at the local patch level. Chang et al. used a neighbourhood embedding method [2] to explore an accurate local prior by linear combination. Yang et al. first introduced a sparse coding scheme [23] into SR to avoid the over-fitting problem. Assuming that position patches share a similar geometric structure in the LR and HR spaces, Ma et al. [12] used position-patches to obtain an accurate prior, representing and reconstructing image patches position by position. Obviously, a specific position-patch carries certain face semantic information, i.e., facial configuration. Jiang et al. [5] proposed a locality-constrained representation (LCR) scheme for face hallucination, which applies locality regularization to the least-squares inversion problem and achieves sparsity and locality. Moreover, to further improve the performance of accurate position-patches, they used an iterative version of LCR [6], which matches the interpolated LR image in HR space and achieves good results. Considering the accuracy of patch-based priors, Zhang et al. [25,26] proposed an iterative collaborative representation SR algorithm, which iteratively improves the visual results with a multi-layer scheme. Timofte et al. [19,20] employed regression-based methods to transform the input LR patches into HR ones. Shi et al. [18] proposed two novel regularization models to deal with the LR face hallucination problem in a high-resolution feature space. Since then, weighted representations [21,24], context-patches [7] and low-rank constraints [10,11] have achieved impressive results. On the other hand, deep-learning based approaches have been proposed for SR [3,4,8]. Cui et al. [3] proposed a deep network cascade (DNC) to perform SR tasks. Dong et al. [4] first used a convolutional neural network (SRCNN) to learn the mapping functions from LR to HR in an end-to-end manner. Kim et al. [8] further increased the network depth with a residual network. Although the above deep-learning based approaches provide new paradigms for the image SR task, the end-to-end "black box" mapping model and stochastic gradient optimization cannot provide a clear physical interpretation of the SR process, which limits their further improvement.

Although the above algorithms achieve excellent SR performance, for the sake of an accurate prior an iterative scheme always refines the solution, which benefits reconstruction performance [16]. Based on this observation, we propose a simple but effective contextual-field supported iterative representation for face hallucination in this paper. We interpolate the LR images to the same size as the HR ones. Then, context patches are sampled to enrich the prior with a larger contextual field. The reconstruction-error and representation-coefficient terms are weighted simultaneously for a precise prior. We then use an iterative framework to refine the high-frequency details of the reconstructed HR output, where the output of the upper layer is treated as the input of the next layer. Finally, a residual learning algorithm is used to stitch the final image. The proposed method obtains excellent subjective and objective SR performance.

Our contributions toward accurate prior information are as follows. First, we use the contextual field to support the prior of a contextual patch, which enriches its representation ability. Second, the reconstruction-error and representation-coefficient terms are weighted simultaneously to eliminate the heteroskedasticity of the original data and to obtain a more stable solution. Third, iterative representation learning is used to iteratively update the reconstructed residual patches to recover the missing high-frequency information for better visual effects.

2 Related Work

2.1 Notations

Let us define $\{A_i\}_{i=1}^{N} \in \mathbb{R}^{p \times q}$ and $\{B_i\}_{i=1}^{N} \in \mathbb{R}^{pt \times qt}$ as the LR and HR training face images, where $t$ is the scale factor and $N$ is the number of training samples. We assume that $Y \in \mathbb{R}^{p \times q}$ is an LR input and that $y_i \in \mathbb{R}^{d \times 1}$ is an LR patch (the patch size is $\sqrt{d} \times \sqrt{d}$ and patches are represented by column vectors). Therefore, the corresponding reconstructed HR image is $X \in \mathbb{R}^{pt \times qt}$, and the HR image patch is $x_i \in \mathbb{R}^{(d \times t^{2}) \times 1}$.

2.2 Face Hallucination Based on Least Squares

Assuming that the HR and LR dictionary pair $D_i^H$ and $D_i^L$ has already been learned and that $y_i$ represents one patch of the LR image, the optimal weights can be solved from the following constrained least-squares fitting problem [13]:

$$\alpha_i^{*} = \arg\min_{\alpha_i} \left\| y_i - D_i^{L} \alpha_i \right\|_2^2 . \tag{1}$$

Because Eq. (1) is a convex optimization problem, it has an analytical solution:

$$\alpha_i^{*} = \left( \left(D_i^{L}\right)^{T} D_i^{L} \right)^{-1} \left(D_i^{L}\right)^{T} y_i . \tag{2}$$

In the image super-resolution setting, according to the manifold-consistency assumption, the corresponding HR patch and LR patch have the same representation coefficients. Therefore, the weight coefficients of the LR patch can be mapped onto the HR patch dictionary and the reconstructed HR patch can be obtained:

$$x_i = D_i^{H} \alpha_i^{*} . \tag{3}$$

After each input LR patch is rebuilt, the final HR image $X$ is generated from the $x_i$ by averaging the pixel values of overlapped regions according to their original positions.

mapped onto the HR patch dictionary and the reconstructed HR patch can be obtained: (3) xi = DiH α∗i . After each input LR patch is rebuilt, the final HR image X can be generated from xi by averaging pixel values of overlapped regions according to the original position.

3 3.1

Contextual-Field Supported Iterative Representation (CSIR) Context Dictionary Learning

Different from traditional position-based patch methods, we use context-patch [15] to get the priori of the image representation. An illustration is shown in Fig. 1. We take the i−th position patch of the same image, and the context increases the local receptive field through its local area to enrich the context information. The position-based patch method only takes its fixed position patch, ignoring the prior information provided by the local area. As shown in Fig. 1 that the local receptive field of context-patch is much larger than that of positionbased patch method.

Fig. 1. Comparison of local receptive fields provided by position-patch and contextpatch. Contextual-field supports more receptive areas than position-patch.

√ √ Let us define context window size is w × w, patch size is d × d. In this larger window (window size greater than or equal to patch size), we use the step length e to sample multiple patches, the number of context patches c can be obtained by: ⎧ √ 2 ⎪ √ ⎪ w − d ⎨ 1+ , w > d (context - patch) e (4) c= ⎪ ⎪ √ ⎩ 1 , w = d (position - patch)

538

K. Zeng et al.

From the upper form, we can see that as the size of the window size w increases, the number of patches will increase exponentially. Therefore, these patches provide more contextual information. After taking the patches from the context, we can get the HR and LR   2 H H ∈ (d×t )×(c×N ) and contextual dictionary pool CiH = cH 1 , c2 , · · · , cc×N   2 C L = cL , cL , · · · , cL ∈ (d×t )×(c×N ) corresponding to i−th patch (LR is i

1

2

c×N

used for representation and HR is used for reconstruction). Here, we interpolate LR into the same size as HR. Since LR and HR images have the same information to a large extent, the modeling and analysis of their residual images is helpful for the reconstruction task, because it shows the differences between the HR and the LR. Therefore, Therefore, we not only use contextual information but also use residual learning to perform better. The contextual information residual dictionary is Ri = CiH − CiL . Therefore, CiL and Ri are used as a representation dictionary and a reconstruction dictionary, respectively. 3.2

Face Hallucination via CSIR

Inspired by deep learning, we develop a method of contextual-field supported iterative representation for face hallucination. The model of we propose is mainly to obtain more accurate weight factors. On the basis of adaptive weighting, an iterative structure is established, which not only can accurately describe the mapping relationship between LR and HR, but also make the weight coefficient more accurate in continuous iteration to improve the performance of the CSIR. Assuming that the dictionaries DiL and DiH has already been obtained, given ∗(s) can be updated by: the input patch yi , the representation weights αi ∗(s)

αi

⎧  (s) (s) T  (s) −1  (s)  L (s) (s) ⎫ ⎪ ⎨ yi(s) − DiL ⎬ Φi yi − Di αi αi ⎪ , (5) = arg min  2 (s) ⎪  (s) (s)  αi ⎪ ⎩ ⎭ +λ Ωi • αi  2

        (s)  (s) (s) where Ωi = diag yi − l1  , · · · , yi − lK  and l1 , l2 , · · · , lK is the 2 2  L (s) L . And we define atom in Di  the reconstruction error as σi = yi − Di αi , L  and their variance var σi Di = Φi . And Φi are independent variables that determines the particular heteroskedasticity and Φi = diag (σi ). Given a LR image, after interpolation, get the i−th position patch. We find K nearest neighbor patches from CiL using K − N N clustering. The K − N N index of the yi is defined as: CK (yi ) = sup port (dist |K ) ,

(6)

where the dist |K denotes the smallest K entries of dist and disti = yi − lj 2 . Here dist is a distance metric between yi and the k−th dictionary atom lj from LR patch pool CiL , we use Euclidean distance in this paper. When the K nearest

Contextual-Field Supported Iterative Representation for Face Hallucination

539

neighbor patches in the LR dictionary DiL is ready, we use index to learn the corresponding residual dictionary DiH from Ri . According to the following formulate, we can obtain the iterative weight coefficient, ∗(s) αi

 =

 (s) T  −1  (s) T  (s) (s) (s) DiL Φi DiL + λ Ωi Ωi



 (s) T  −1 (s) (s) DiL Φi yi .

(7) In our model, the main idea is to get the exact weight coefficient, so the SR problem can be effectively solved and improved in an iterative way. We set the initial input to the interpolated version of the test LR patch. Then, we use the following steps to solve iteratively:  (s)  H (s) (s) (s) , Di , Φ and Ωi : we find the K neighborhood Updating DiL  L (s)  H (s)i (s) patches cluster Di and Di from CiL and Ri . And calculating Φi and (s) Ωi by,    (s−1) (s−1)  (s−1)  (s) (s−1) (s−1) Φi = diag(σi = yi − DiL αi , (8) ), s.t σi (s)

Ωi

        (s−1)  (s−1) = diag yi − l1  , · · · , yi − lK  . 2

∗(s)

2

(9)

∗(s)

Updating αi : calculate αi according to the following formula,   T   L (s−1) T  (s−1) −1  L (s−1) ∗(s) (s−1) (s−1) Di Φi Di + λ Ωi Ωi αi = . −1  (s−1) T  (s−1) (s−1) L Φi Di yi

(10)

(s)  ∗(s) ∗(s) and Updating xi : the residual patch xi is reconstructed with DiH ∗(s) ∗(s) (s+1) and xi = yi , αi ∗(s) (s) (s) xi = Ri αi . (11) Therefore, we use the following formula to find the reconstructed HR image: X=

M    (s) (s) Ri αi + yi∗ ,

(12)

i=1

where yi∗ is the interpolated version of the tested LR yi .

4 4.1

Experimental Results Database

The experiments are conducted on the FERET face databases [14] for demonstrating the performance of the proposed method. The FERET database contains 1400 images from 200 subjects and each subject has seven different images,

540

K. Zeng et al.

including a frontal image, two left sides, two right sides and two expressive images (Fig. 2). In this experiment, 400 images were randomly selected, and two subjects were selected from all subjects. The size of the sample face image is 80 × 80 pixels. All of the 400 images, we randomly selected 360 images (including 180 subjects) as training samples and the remaining 40 as test images. So all test images are not in the training sample. The LR image is formed by smoothing (by a 4 × 4 size average filter) and down sampling (by a factor of 4) from a corresponding HR image. In this paper, we interpolate all LR images to the same size as HR.

Fig. 2. Some training faces in FERET face database (The first row is the HR image, and the second row is the LR image.).

4.2

Parameter Settings

In this experiment, we adjusted all parameters of the proposed method to obtain better results. We set the patch size of the HR image to 12 × 12 pixels, and the overlap between adjacent patches is 4 pixels, then the patch size in the LR image is 3 × 3 pixels and the overlap is 1 pixel (the down-sampling factor is 4). For the window size w of the context information patch and the number of iterations m will be discussed and analyzed as follows:

Fig. 3. Performance of face reconstruction with different values of window size (w) on the FERET dataset.

Contextual-Field Supported Iterative Representation for Face Hallucination

541

The Window Size w of the Context Information Patch: We select the window size w of different context information patches to test the performance of our method, which controls the amount of priori information provided. As shown in Fig. 3, we can clearly see that with the increase of w, more benefits can be obtained. This means that context information is important for image reconstruction. Because it can provide more prior knowledge than position patches. However, as can be seen from Fig. 3, setting w too large will degrade the performance of the algorithm. Therefore, choosing the appropriate context window w = 32, the proposed algorithm can achieve better results. The Number of Iterations s: Moreover, in order to test the effect of the number of iterations s, we give the performance of CSIR with different iterations s. As shown in Fig. 4, we plot the average PSNR and SSIM [22] for all test images. We found: (i) as the number of iterations s increases, the gain of the proposed method is significantly increased, which means that increasing the high frequency information by iteration is very important for improving the performance of the algorithm; (ii) The proposed method needs to iterate several times to achieve convergence, i.e. the number of iterations for our experiments is set to 30, which means there is a lot of room for improvement in obtaining more a priori information.

Fig. 4. The influence of different the number of iterations s on the proposed CSIR method.

4.3

Analysis of Proposed Models

In this section, our approach will be compared with several typical face hallucination method, including LSR [13], WASR [21], LLE [2], LCR [5], CLNE [6], RMHF [18], TLCR [7], SRCNN [4] and VDSR [8]. All of the comparison methods are set to the best performance. PSNR and SSIM are used to evaluate the qualitatively of facia image hallucination. The higher the value of PSNR and SSIM, the better the performance of the reconstruction. As shown in Table 1, we compare PSNR and SSIM of different face hallucination methods. Both LCR and CLNE use position-patch method. Compared with our method, the PSNR of our method is higher than LCR and CLNE 0.94 dB

542

K. Zeng et al.

Table 1. Comparison of PSNR and SSIM for different face hallucination methods. Method LSR

WSR

LLE

LCR

CLNE RMHF TLCR CSIR

Remove border SRCNN VDSR CSIR

PSNR

26.88

SSIM

0.7485 0.7933 0.8019 0.8156 0.8217 0.8273

27.77

27.92

28.17

28.38

28.57

28.85

29.11

29.04

0.8351 0.8441 0.8401

29.25

29.45

0.8465 0.8479

and 0.73 dB, SSIM is higher than LCR and CLNE 0.0285 and 0.0224, respectively. This fully shows that context information patches can provide more useful prior information. Moreover, CLNE and CSIR both use the iterative structure to improve the performance of the algorithm by continuously updating the representation coefficients. Due to the establishment of regularization models in different spaces, CSIR performs better than RMHF in reconstruction performance. Although TLCR and CSIR both use context information patches, CSIR combines contextual information and residual learning to make prior knowledge more accurate than TLCR. In order to be able to make fair comparison, we have also done the remove edge processing when compared with the deep learning method. Compared with VDSR, CSIR has many similarities with it, such as residual learning and deep structure. However, our deep structure and deep learning structure are different. And the PSNR and SSIM of our algorithm are still higher than VDSR 0.20 dB and 0.0014. CSIR combines context residual learning with deep iterative structure to improve the representation ability of the image and effectively improve the performance of the algorithm.

Fig. 5. Some visualized hallucination with different methods on FERET database. From left to right: (a) input, (b) LSR [13], (c) WASR [21], (d) LLE [2], (e) LCR [5], (f) CLNE [6], (g) RMHF [18], (h) TLCR [7], (i) SRCNN [4], (j) VDSR [8], (k) ours and (l) Original HR image.

Contextual-Field Supported Iterative Representation for Face Hallucination

543

Some visualized hallucination are shown in Fig. 5. It can be clearly seen that LSR and WSR has obvious fuzziness in the details of the face. CLNE, LCR and other position-based super-resolution methods have many noises in the eyes, nose, mouth and other parts, so that the face contour is not clear. CSIR can overcome the problem of alignment before using the position-patch, and the facial contour is clearer than the position-based method. Because CSIR introduces residual learning to better recover high-frequency information, it performs better than TLCR. Compared to the deep learning method (SRCNN, VDSR), although there is little difference between PSNR and SSIM, it can be seen from the results of the reconstruction that the methods we propose are clearer in the eyes, nose and other parts. The eleventh column is the hallucination HR face image we proposed, and we can see that CSIR is superior to all comparison methods and produces reasonable results with more facial details. 4.4

Experiments on the Real-World Images

In this section, we tested our proposed method on CMU-MIT face database [17] and use FERET database as training set. As shown in Fig. 6, we have selected the test images on the CMU-MIT database. For face images in real scenes, the image degradation process can not be simply obtained by adding fuzzy and down sampling to the corresponding HR images. Since the FERET database only contains face images and don’t have other background patterns, we need to preprocess the CMU-MIT database. We manually tailor face images to the actual scene images and adjust them to 80 × 80 pixels. Figure 7 shows the results of super-resolution reconstruction of different methods. It can be clearly seen that CSIR not only can produce reasonable results of the face hallucination, but also handle the noise well, which fully demonstrates the effectiveness of our method.

Fig. 6. These images used for testing in the CMU-MIT database.

544

K. Zeng et al.

Fig. 7. The hallucination results of different methods are in CMU-MIT. From left to right: (a) input, (b) LSR [13], (c) WASR [21], (d) LLE [2], (e) LCR [5], (f) CLNE [6], (g) RMHF [18], (h) TLCR [7], (i) SRCNN [4], (j) VDSR [8] and (k) ours.

5

Conclusion

In this paper, we propose a new face hallucination via contextual-field supported iterative representation. For better image representation, we not only increase the local perception area through the context in order to obtain the useful prior information, but also weight the reconstruction error and use the weighted Tikhonov regularization constraint. Furthermore, we update reconstructed HR images iteratively so as to obtain a more accurate weight coefficient matrix and improve the performance of the algorithm. At the same time, residual dictionary learning is used to obtain high frequency information and achieve fast convergence. The results show that the proposed algorithm not only can effectively study the mapping relationship between HR and LR, but also overcome most of the noise and have robustness.

References 1. Baker, S., Kanade, T.: Hallucinating faces. In: Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition 2000, p. 83 (2002) 2. Chang, H., Yeung, D., Xiong, Y.: Super-resolution through neighbor embedding. Proc. Comput. Vis. Pattern Recogn. 1, I-275–I-282 (2004) 3. Cui, Z., Chang, H., Shan, S., Zhong, B., Chen, X.: Deep network cascade for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 49–64. Springer, Cham (2014). https://doi.org/10.1007/ 978-3-319-10602-1 4 4. Dong, C., Chen, C.L., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)

Contextual-Field Supported Iterative Representation for Face Hallucination

545

5. Jiang, J., Hu, R., Han, Z., Lu, T., Huang, K.: Position-patch based face hallucination via locality-constrained representation. In: IEEE International Conference on Multimedia and Expo, pp. 212–217 (2012) 6. Jiang, J., Hu, R., Wang, Z., Han, Z., Ma, J.: Facial image hallucination through coupled-layer neighbor embedding. IEEE Trans. Circuits Syst. Video Technol. 26(9), 1674–1684 (2016) 7. Jiang, J., Yu, Y., Tang, S., Ma, J., Qi, G.J., Aizawa, A.: Context-patch based face hallucination via thresholding locality-constrained representation and reproducing learning. In: IEEE International Conference on Multimedia and Expo, pp. 469–474 (2017) 8. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks, pp. 1646–1654 (2015) 9. Liu, C., Shum, H.Y., Freeman, W.T.: Face Hallucination: Theory and Practice. Kluwer Academic Publishers, Dordrecht (2007) 10. Lu, T., Guan, Y., Chen, D., Xiong, Z., He, W.: Low-rank constrained collaborative representation for robust face recognition. In: IEEE International Workshop on Multimedia Signal Processing, pp. 1–7 (2017) 11. Lu, T., Xiong, Z., Zhang, Y., Wang, B., Lu, T.: Robust face super-resolution via locality-constrained low-rank representation. IEEE Access 5(99), 13103–13117 (2017) 12. Ma, X., Huang, H., Wang, S., Qi, C.: A simple approach to multiview face hallucination. IEEE Signal Process. Lett. 17(6), 579–582 (2010) 13. Ma, X., Zhang, J., Qi, C.: Hallucinating face by position-patch. Pattern Recogn. 43(6), 2224–2236 (2010) 14. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The feret evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 22(10), 1090–1104 (2000) 15. Romano, Y., Elad, M.: Con-patch: when a patch meets its context. IEEE Trans. Image Process. Publ. IEEE Signal Process. Soc. 25(9), 3967–3978 (2016) 16. Romano, Y., Isidoro, J., Milanfar, P.: RAISR: rapid and accurate image super resolution. IEEE Trans. Comput. Imaging 3(1), 110–125 (2017). https://doi.org/ 10.1109/TCI.2016.2629284 17. Rowleys, H.: Neural network-based face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 23–38 (1998) 18. Shi, J., Liu, X., Zong, Y., Qi, C., Zhao, G.: Hallucinating face image by regularization models in high-resolution feature space. IEEE Trans. Image Process. PP(99), 1 (2018) 19. Timofte, R., De, V., Gool, L.V.: Anchored neighborhood regression for fast example-based super-resolution. In: IEEE International Conference on Computer Vision, pp. 1920–1927 (2013) 20. Timofte, R., De Smet, V., Van Gool, L.: A+: adjusted anchored neighborhood regression for fast super-resolution. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9006, pp. 111–126. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16817-3 8 21. Wang, Z., Hu, R., Wang, S., Jiang, J.: Face hallucination via weighted adaptive sparse regularization. IEEE Trans. Circuits Syst. Video Technol. 24(5), 802–813 (2014) 22. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process 13(4), 600–612 (2004)

546

K. Zeng et al.

23. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. Image Process. 19(11), 2861–2873 (2010) 24. Yang, Z., He, P.: Non-local diffusion weighted image super-resolution using collaborative joint information. Exp. Ther. Med. 15(1), 217–225 (2018) 25. Zhang, Y., et al.: Collaborative representation cascade for single-image superresolution. IEEE Trans. Syst. Man Cybern. Syst. PP(99), 1–16 (2017) 26. Zhang, Y., Zhang, Y., Zhang, J., Wang, H., Dai, Q.: Single image super-resolution via iterative collaborative representation. In: Ho, Y.-S., Sang, J., Ro, Y.M., Kim, J., Wu, F. (eds.) PCM 2015. LNCS, vol. 9315, pp. 63–73. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24078-7 7

A Cancelable Multi-Biometric Template Generation Algorithm Based on Bloom Filter

Lin You and Xun Li

College of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
[email protected]

Abstract. To address the security issue of multiple biometric templates in current multi-biometric systems, this paper proposes a cancelable multi-biometric template generation algorithm based on the Bloom filter. Our algorithm uses the XOR operation to fuse the grouped fingerprint binary features and the face binary features into one template at the feature level, then transforms the fusion template based on the irreversibility of the Bloom filter. The cancelability and diversity of the fusion template can be achieved by updating the random matrix. Finally, a traversal matching method is used to calculate the matching score in the encryption domain. The experimental results show that our algorithm can ensure the reliability of the identity authentication and improve the security of the multi-biometric template.

Keywords: Bloom filter · Fingerprint feature · Face feature · Template protection

1 Introduction

Multi-biometric recognition technology can provide more secure and reliable identity authentication [1]. However, there always exist some security risks in biometric-based identification systems. Due to the uniqueness and invariability of human biometrics, once a biometric template is exposed, it is compromised forever. Another problem is that cross-matching across different applications can easily be used to covertly track a user in biometric systems. Especially for multi-biometric recognition systems that store multiple biometric templates of the same user, the loss of the templates is more harmful to the user's privacy. To improve the security of biometric templates, Ratha et al. [2] indicated that an ideal biometric template protection scheme should have diversity, cancelability, and irreversibility besides authentication accuracy.

This research is partially supported by the National Science Foundation of China (No. 61772166, 61272045) and the Key Program of the Nature Science Foundation of Zhejiang province of China (No. LZ17F020002).

Since then, more and more biometric template protection schemes have been proposed. However, most existing biometric protection schemes are designed for a single biometric template. Such schemes can be used to protect each template separately in multi-biometric systems, but their security is worse than that of fusing multiple biometric templates into one encrypted template [3]. In addition, the cancelability of the multi-biometric template is also a major challenge. How to effectively protect multi-biometric templates and make them cancelable is an urgent problem in multi-biometric recognition systems. In this work, we employ the Bloom filter in biometric template protection and propose a cancelable multi-biometric template generation algorithm. We design a grouping combination method and use an XOR operation to fuse the binary fingerprint features and the binary face features into one template at the feature level. The fusion template is then transformed using a Bloom filter. Our feature-level fusion method makes the entire multi-biometric template cancelable by updating the random matrix, and the irreversibility of the Bloom filter mapping ensures the security of the multi-biometric template. The rest of this paper is organized as follows: Sect. 2 reviews related work on biometric template protection based on Bloom filters. Section 3 describes our proposed algorithm in detail. Section 4 describes the experimental results. Section 5 analyzes the security of our algorithm. Finally, concluding remarks are drawn in Sect. 6.

2 Related Work

The Bloom filter was first applied to biometric template protection by Rathgeb et al. [4]. The authors proposed an alignment-free cancelable iris biometric template scheme which adapted a standard Bloom filter into an adaptive Bloom filter. Additionally, they combined the Bloom filter's idea of protecting biometric templates with iris features from both eyes of a single subject, which ensured the diversity of the multi-biometric template to some extent [5]. Subsequently, the authors summed up the previous work and proposed a generic framework for the protection of multi-biometric templates [6], which can fuse different biometric features at the feature level. They used the face and iris as a case study to implement the framework, but the framework did not indicate how the final fusion template can be made cancelable and updated. Based on the work of Rathgeb et al., Gomez-Barrero et al. [7] used Bloom filters to encrypt and compress face binary features. Li et al. [8] proposed a fingerprint protection scheme based on Bloom filters and achieved cancelability, but their scheme required the selection of appropriate reference minutiae, and a change in the reference minutiae positions may cause association errors that degrade the recognition performance. Stokkenes et al. [9] proposed a multi-biometric template protection scheme based on Bloom filters for mobile terminals, which fused face features and periocular region features using weighted score-level fusion. Sadhya et al. [10] XORed the biometrics transformed by the Bloom filter with a Key (a random (0,1) matrix) to ensure the cancelability of the templates. But this scheme heavily relies on the Key: once the Key is compromised, the original biometrics encrypted by the Bloom filter will be exposed and the unlinkability between the templates cannot be effectively ensured.

3 Proposed Algorithm

3.1 Binary Feature Extraction

Fingerprint Binary Feature Extraction. To obtain the fingerprint binary features, we adopt a general method named array mapping [11–13], which maps the relative distance and direction difference of the minutiae onto a 2D array. The specific implementation is as follows: extract the minutiae M = {m_i = (x_i, y_i, θ_i) | i = 1, ..., N_M} from the fingerprint image, where x_i, y_i, and θ_i respectively represent the x-coordinate, the y-coordinate, and the direction of the i-th minutia m_i, and N_M is the number of minutiae. Use Eqs. (1) and (2) to calculate the relative distance d_rk and the direction difference θ_rk between m_r and m_k (m_k ∈ M \ {m_r}, k = 1, ..., N_M − 1). Repeating this for k = 1 to N_M − 1, we obtain a new set of features (d_rk, θ_rk) of the reference minutia m_r.

d_{rk} = \sqrt{(x_r - x_k)^2 + (y_r - y_k)^2}    (1)

\theta_{rk} = |\theta_r - \theta_k|    (2)

Create a 2D array to map the new features. The 2D array model is shown in Fig. 1, where C_X^M and C_Y^M are the length and width of a cell, respectively, and M_X and M_Y are the length and width of the 2D array, respectively. The new features (d_rk, θ_rk) can be mapped onto the 2D array using Eq. (3), where ⌈·⌉ represents the ceil function, and x_rk^T and y_rk^T are the cell indices of the k-th new feature on the horizontal axis M_X and the vertical axis M_Y, respectively. To fix the number of new features (d_rk, θ_rk) mapped onto the 2D array each time, the extracted (d_rk, θ_rk) are sorted by d_rk from largest to smallest, and only the first a new features are mapped onto the 2D array. The mapping is as follows:

\begin{pmatrix} x_{rk}^T \\ y_{rk}^T \end{pmatrix} = \begin{pmatrix} \lceil d_{rk}/C_X^M \rceil \\ \lceil \theta_{rk}/C_Y^M \rceil \end{pmatrix}    (3)

If there are points in a cell, set the value of the cell to 1; otherwise, set it to 0. Read the 2D array from left to right and from top to bottom to obtain a set of binary features f_r^M of m_r, where f_r^M is a binary sequence of length N_T. N_T equals the total number of cells in the 2D array and is divisible by t, where t is a non-zero integer.


Fig. 1. The 2D array model

Let each minutia in the set M \ {m_r} be used as the reference minutia in turn and repeat the same procedure as for m_r to obtain the binary features F^M = {f_i^M | i = 1, ..., N_M} of the fingerprint image, where f_i^M is the binary feature of m_i. According to the above processing, the binary feature size of the fingerprint image is N_M × N_T.

Face Binary Feature Extraction. The basic idea of BioHashing [14] is to multiply the user's biometric vector X ∈ R^Γ by a random matrix R ∈ R^{Γ'×Γ} stored in the user's identity token, that is, to calculate X' = RX and obtain X' = {x'_n | n = 1, ..., Γ'}, where Γ is the biometric dimension and Γ' ≤ Γ. The elements of X' are then quantized into binary values using the preset threshold τ and Eq. (4):

x'_n \mapsto \begin{cases} 0, & x'_n \le \tau \\ 1, & x'_n > \tau \end{cases}, \quad n = 1, \ldots, \Gamma'    (4)

Finally, identity authentication is achieved by comparing the binary sequences. Applying this method to the face image, we obtain the face binary features F^P. The specific method is as follows: extract the first N_P PCA features P = {p_j | j = 1, ..., N_P} of the face image, where N_P = N_T / t, and generate a random matrix R ∈ R^{N_P × N_P} whose entries follow the standard normal distribution, namely R ∼ N(0, 1). By using the BioHashing algorithm, we obtain the cancelable binary features F^P = {f_j^P | j = 1, ..., N_P} of the face image, where f_j^P ∈ {0, 1} and the binary feature size of the face image is 1 × N_P.
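To make the two binarization steps concrete, the following Python sketch maps the relative-distance/direction features of one reference minutia onto the 2D array (Eqs. (1)–(3)) and binarizes a PCA face vector with BioHashing (Eq. (4)). It is only an illustration, not the authors' implementation: the function names are ours, minutiae detection and PCA are assumed to be done elsewhere, and the default parameter values simply follow the experimental settings reported in Sect. 4.

import numpy as np

def fingerprint_binary_feature(minutiae, r, MX=300, MY=360, CX=30, CY=30, a=20):
    """Binary feature f_r^M of one reference minutia m_r (Eqs. (1)-(3))."""
    xr, yr, tr = minutiae[r]
    feats = []
    for k, (xk, yk, tk) in enumerate(minutiae):
        if k == r:
            continue
        d = np.sqrt((xr - xk) ** 2 + (yr - yk) ** 2)   # Eq. (1): relative distance
        t = abs(tr - tk)                               # Eq. (2): direction difference
        feats.append((d, t))
    feats.sort(key=lambda p: p[0], reverse=True)       # keep only the first a features
    rows, cols = int(np.ceil(MX / CX)), int(np.ceil(MY / CY))
    grid = np.zeros((rows, cols), dtype=np.uint8)
    for d, t in feats[:a]:
        ix = min(max(int(np.ceil(d / CX)) - 1, 0), rows - 1)   # Eq. (3): ceil, then 0-based index
        iy = min(max(int(np.ceil(t / CY)) - 1, 0), cols - 1)
        grid[ix, iy] = 1
    return grid.flatten()                              # binary feature of length N_T (= rows * cols)

def biohash_face_feature(pca_vec, R, tau):
    """BioHashing binarization of a face PCA vector (Eq. (4)): X' = R X, then threshold at tau."""
    return (R @ np.asarray(pca_vec) > tau).astype(np.uint8)   # length N_P

# toy usage with random data (directions handled in degrees here so that M_Y = 360 applies)
rng = np.random.default_rng(0)
minutiae = [(rng.integers(0, 300), rng.integers(0, 300), rng.uniform(0, 360)) for _ in range(25)]
f_M = fingerprint_binary_feature(minutiae, r=0)        # length 10 * 12 = 120 = N_T
R = rng.standard_normal((60, 60))                      # N_P = 60, R ~ N(0, 1)
f_P = biohash_face_feature(rng.standard_normal(60), R, tau=0.0)   # the paper uses tau = 200 on real PCA features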

3.2 Feature Level Fusion Method

Divide the processed fingerprint binary features f_i^M into t groups, each denoted v_i^z, z = 1, ..., t. For z = 1 to t, perform the XOR operation as follows:

g_i^z = v_i^z \oplus F^P    (5)

where v_i^z is of length N_P and "⊕" is a bitwise XOR operation. The meaning of Eq. (5) is to XOR each group of fingerprint binary features with the face binary features, so as to obtain the fusion features g_i^z of the z-th group of binary features of the i-th fingerprint minutia and the face binary features, where g_i^z is a binary sequence of length N_P. The fusion features of the binary features of the i-th fingerprint minutia and the face features can be denoted f_i^U = {g_i^z | z = 1, ..., t}.

Fig. 2. The fusion framework of fingerprint and face features

The fusion feature template F^U = {f_i^U | i = 1, ..., N_M} of the fingerprint image and the face image can be obtained by repeating the above method for the binary features of the remaining fingerprint minutiae. The size of F^U is N_M × N_P × t = N_M × N_T, the same as that of the fingerprint binary features. This is the entire process of fusing fingerprint features and face features at the feature level; the framework is shown in Fig. 2.
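A minimal sketch of the feature-level fusion just described, assuming the binary features produced in Sect. 3.1: each fingerprint feature vector of length N_T is split into t groups of length N_P and each group is XORed with the face feature vector (Eq. (5)). The function names are illustrative.

import numpy as np

def fuse_features(f_M, f_P, t=2):
    """Fuse one fingerprint binary feature f_i^M (length N_T) with the
    face binary feature F^P (length N_P = N_T / t) via Eq. (5)."""
    f_M = np.asarray(f_M, dtype=np.uint8)
    f_P = np.asarray(f_P, dtype=np.uint8)
    N_T, N_P = len(f_M), len(f_P)
    assert N_T == N_P * t, "N_T must equal N_P * t"
    groups = f_M.reshape(t, N_P)            # v_i^1, ..., v_i^t
    fused = groups ^ f_P                    # g_i^z = v_i^z XOR F^P, for z = 1..t
    return fused.flatten()                  # f_i^U, same length N_T as f_i^M

def fuse_template(F_M, f_P, t=2):
    """Fusion template F^U: one fused row per fingerprint minutia (size N_M x N_T)."""
    return np.vstack([fuse_features(row, f_P, t) for row in F_M])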

3.3 Feature Mapping Based on Bloom Filter

A Bloom filter is a bit array of length N_B, where all bits are initially set to 0. The fusion features of fingerprint features and face features can be irreversibly mapped based on the Bloom filter. The detailed procedure is as follows: transform f_i^U, from top to bottom and left to right, into a binary matrix FT_i of size w × l, where 2^w ≥ l. Then, each column is transformed by Eq. (6):

h(x) = \sum_{\lambda=0}^{w-1} x_\lambda 2^\lambda    (6)

where x_λ is the λ-th element of the column, so h(x) transforms each column of FT_i into a decimal value. Set an initial Bloom filter b_i of length N_B = 2^w with an index range of [0, 2^w − 1]. According to the result of Eq. (6), let b_i[h(x)] = 1; namely, after each column is transformed into a decimal, the value at the corresponding index of b_i is set to 1. The same index may be mapped multiple times, but only the first mapping is effective. Hence, we obtain the cancelable template B = {b_i | i = 1, ..., N_M} mapped by the Bloom filter, where b_i is a binary sequence of length N_B. The size of B is N_M × N_B. The framework of the feature mapping based on Bloom filter is shown in Fig. 3.
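The column-wise Bloom filter mapping can be sketched as follows: the fused feature f_i^U is reshaped into a w × l binary matrix, each column is interpreted as a w-bit integer h(x) via Eq. (6), and the corresponding position of an initially empty bit array of length N_B = 2^w is set to 1. The column-major reshape order is our assumption about the "top to bottom and left to right" reading.

import numpy as np

def bloom_filter_mapping(f_U, w=6):
    """Map one fused feature f_i^U onto a Bloom filter b_i of length N_B = 2**w."""
    f_U = np.asarray(f_U, dtype=np.uint8)
    l = len(f_U) // w
    FT = f_U[:w * l].reshape(w, l, order="F")        # w x l matrix, filled column by column (assumed)
    b = np.zeros(2 ** w, dtype=np.uint8)
    for col in FT.T:
        h = int(sum(int(bit) << lam for lam, bit in enumerate(col)))   # Eq. (6): sum of x_lambda * 2^lambda
        b[h] = 1                                      # repeated indices simply stay at 1
    return b

# e.g. a 120-bit fused feature gives a 20-column matrix for w = 6 and a 64-bit template b_i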


Fig. 3. The framework of the feature mapping based on Bloom filter

3.4 The Matching Method

Assume that, for the user's registered fingerprint M^E, registered face P^E, query fingerprint M^Q and query face P^Q, we use the same random matrix R to generate the registration fusion template B^E = {b_i^E | i = 1, ..., N_E} and the query fusion template B^Q = {b_j^Q | j = 1, ..., N_Q}, respectively, where N_E and N_Q are the numbers of minutiae of the registered fingerprint and the query fingerprint. Because the number of extracted fingerprint minutiae is not fixed, the elements in B^Q have to be matched with the elements in B^E one by one. The matching method [11] is shown in Fig. 4, and the similarity score Score(B^E, B^Q) is computed as follows:

Step 1: Calculate the distance score dis(b_i^E, b_j^Q) and the similarity SI(b_i^E, b_j^Q) between b_i^E and b_j^Q:

dis(b_i^E, b_j^Q) = \frac{\|b_i^E - b_j^Q\|_2}{\|b_i^E\|_2 + \|b_j^Q\|_2}    (7)

SI(b_i^E, b_j^Q) = 1 - dis(b_i^E, b_j^Q) = 1 - \frac{\|b_i^E - b_j^Q\|_2}{\|b_i^E\|_2 + \|b_j^Q\|_2}    (8)

where \|\cdot\|_2 is the 2-norm and the range of SI(b_i^E, b_j^Q) is [0, 1].

Step 2: The similarity matrix S = {SI_ij} of B^E and B^Q is obtained from Step 1, where SI_ij = SI(b_i^E, b_j^Q). Take the maximum value over i for each j to obtain the maximum similarity set SImax:

SImax(j) = \max_i \{SI_{ij}\}    (9)

Then calculate the mean SImean of all elements in SImax:

SImean = \frac{1}{N_Q} \sum_{j=1}^{N_Q} SImax(j)    (10)

Step 3: Calculate the mean of all elements of SImax greater than SImean to obtain the matching score Score(B^E, B^Q) of B^Q and B^E:

Score(B^E, B^Q) = \frac{1}{N_S} \sum_{\mu=1}^{N_S} S_\mu    (11)

where S_μ is an element of SImax such that S_μ > SImean and N_S is the number of such elements. If Score(B^E, B^Q) ≥ Th, the authentication succeeds; otherwise, it fails, where Th is the preset threshold.
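A sketch of the traversal matching written directly from Eqs. (7)–(11); B_E and B_Q stand for the registration and query templates (one Bloom-filter row per minutia) and Th for the decision threshold. The small epsilon guarding against all-zero rows is our addition.

import numpy as np

def matching_score(B_E, B_Q):
    """Score(B^E, B^Q) computed from Eqs. (7)-(11)."""
    B_E = np.asarray(B_E, dtype=float)
    B_Q = np.asarray(B_Q, dtype=float)
    eps = 1e-12                                   # guards the all-zero row corner case
    # Step 1: SI(b_i^E, b_j^Q) = 1 - ||b_i^E - b_j^Q||_2 / (||b_i^E||_2 + ||b_j^Q||_2)
    S = np.zeros((len(B_E), len(B_Q)))
    for i, be in enumerate(B_E):
        for j, bq in enumerate(B_Q):
            S[i, j] = 1.0 - np.linalg.norm(be - bq) / (np.linalg.norm(be) + np.linalg.norm(bq) + eps)
    # Step 2: SImax(j) = max_i SI_ij, then SImean over all j (Eqs. (9)-(10))
    SImax = S.max(axis=0)
    SImean = SImax.mean()
    # Step 3: mean of the elements of SImax that exceed SImean (Eq. (11))
    above = SImax[SImax > SImean]
    return above.mean() if len(above) else SImean

def authenticate(B_E, B_Q, Th=0.5):
    return matching_score(B_E, B_Q) >= Th         # Th is the preset threshold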

Fig. 4. The matching method

4 Experiments

To evaluate the performance of our algorithm, the simulation experiments use the ORL face database and the FVC2002-DB1 fingerprint database (DB1). Since PCA feature extraction requires training images, we divide the images of each face sample in the ORL face database into two parts: the first 5 are training images and the last 5 are test images. We select the first 40 samples in DB1 to combine with ORL and create a multi-biometric database DB1-ORL. There are 40 samples in the DB1-ORL multi-biometric database. Each sample has 5 fingerprint images and 5 face images, and each face image corresponds to one fingerprint image. The total number of images in DB1-ORL is 400.

4.1 Performance Evaluation

Three performance indices are used for performance evaluation: GAR (Genuine Accept Rate), FAR (False Accept Rate), and EER (Equal Error Rate). In the experiments, the first group of images in each sample of DB1-ORL is selected to generate the registration fusion template, and the latter four groups are used to generate the query fusion templates, respectively. The number of genuine attempts is 160 (40 × 4), and the number of imposter attempts is 1560 (40 × 39).
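For completeness, the three indices can be computed from the genuine and imposter score lists as sketched below. This is a generic sketch: the threshold sweep and the EER approximation at the FAR/FRR crossing are ours and are not described in the paper.

import numpy as np

def far_gar_eer(genuine_scores, imposter_scores, n_steps=1000):
    """FAR/GAR over a threshold sweep, plus the EER where FAR = FRR."""
    genuine = np.asarray(genuine_scores)
    imposter = np.asarray(imposter_scores)
    thresholds = np.linspace(0.0, 1.0, n_steps)
    far = np.array([(imposter >= t).mean() for t in thresholds])   # imposters accepted
    gar = np.array([(genuine >= t).mean() for t in thresholds])    # genuines accepted
    frr = 1.0 - gar
    eer_idx = int(np.argmin(np.abs(far - frr)))
    return far, gar, far[eer_idx]   # EER approximated at the crossing point

# GAR at FAR = 0% (as reported in Table 1) is gar[far == 0].max(), provided some threshold yields FAR = 0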

Fig. 5. Genuine and imposter distributions.

We extract the face features and the fingerprint features in the DB1-ORL database. The range of the fingerprint minutiae number N_M is 21–34, the range of the relative distance d_ik is 0–300, and the range of the direction difference θ_ik is 0–2π. Hence, the first a = 20 (the largest integer less than 21) new features are selected to be mapped onto the 2D array of size M_X × M_Y = 300 × 360. Our experimental results are best when the cell size is C_X^M × C_Y^M = 30 × 30, the threshold τ = 200, the face feature length N_P = 60 and the fingerprint grouping number t = 2; these parameters are therefore used in the following experiments. Figure 5 shows the genuine and imposter distributions when the fusion feature f_i^U is transformed into the binary matrix FT_i with w = 5 or w = 6, respectively.

Table 1. Experimental results (GARs are obtained at FAR = 0%)

Parameter              Fusion template (bits)   GAR (%)   EER (%)
Without Bloom filter   NM × 120                 96.25     0.39
w = 6                  NM × 64                  94.37     0.79
w = 5                  NM × 32                  81.25     1.27

Table 1 shows the EER, GAR (at FAR = 0%) and fusion template size for w = 5, w = 6 and without Bloom filter mapping. Compared with no Bloom filter mapping, the GAR at w = 5 is reduced by 15%. The reason for this decrease is that the Bloom filter mapping loses part of the effective information. However, the GAR of the fusion template mapped by the Bloom filter decreases by only 1.88% at w = 6, which shows that our algorithm still maintains a high recognition rate. Regarding the size of the final template, the Bloom filter mapping compresses the fusion template. Regarding security, the irreversibility of the Bloom filter mapping improves the security of the multi-biometric template at the cost of losing part of the effective information. Overall, the EERs in the three cases are all less than 1.5%, which shows that our algorithm has good overall performance. Figure 6 shows the ROC curves of our algorithm at w = 5, w = 6 and without Bloom filter, respectively.

Fig. 6. ROC curves of our algorithm at w = 5, w = 6 and without Bloom filter

4.2 Cancelability of the Fusion Template

To measure the diversity and cancelability of the fusion template in our algorithm, we set up the following experiment: generate 10 different random matrices R1, ..., R10. Each set of images of each sample in DB1-ORL is denoted T, and T is combined with R1, ..., R10 to generate the corresponding fusion templates TR1, ..., TR10, respectively. TR1 is used as the registration fusion template and TR2, ..., TR10 are used as query fusion templates. We calculate the matching scores of TR2, ..., TR10 against TR1; the results are denoted false imposter matches, and the number of false imposter attempts is 1800 (200 × 9).

Fig. 7. False imposter distribution.

Figure 7 shows the false imposter distributions when the fusion feature f_i^U is transformed into the binary matrix FT_i with w = 5 or w = 6, respectively. It can be seen that the normalized score distribution of the false imposters is very similar to that of the imposters. Hence, our algorithm can generate a new fusion template by updating the random matrix R when the user's fusion template is compromised. A template stolen by an attacker can no longer be used to impersonate the user through the system, and so our algorithm ensures the reliability of the identity authentication and the security of the multi-biometric template.

4.3 Algorithm Performance Comparison

Table 2 shows the performance comparison between our algorithm and an existing multi-biometric template protection method based on Bloom filters. The experimental results are the optimized data. Stokkenes et al. [9] extracted face features and periocular region features and applied weighted comparison-score-level fusion to increase recognition accuracy. Our algorithm extracts fingerprint features and face features and fuses these two biometric features into one template at the feature level. As shown in Table 2, whether a Bloom filter mapping is applied or not, our algorithm is more accurate.

Table 2. Performance comparison with existing multi-biometric template protection based on Bloom filter

Algorithm              With Bloom filter   GAR (%)   FAR (%)   EER (%)
Stokkenes et al. [9]   No                  95.95     0.01      1.12
Stokkenes et al. [9]   Yes                 91.61     0.01      1.38
Proposed               No                  96.25     0         0.39
Proposed               Yes                 94.37     0         0.79

5 Security Analysis

Our algorithm proposes a grouping combination method and uses an XOR operation to fuse the binary fingerprint features and the binary face features into one template at the feature level. According to Eq. (5), if F^P changes, then g_i^z changes. In addition, F^P is XORed bit by bit with each group v_i^z of each f_i^M to obtain g_i^z during feature fusion. When the user updates the random matrix R, F^P changes, and so the entire fusion template is updated. The advantages of this processing are as follows:

Diversity: When users register multiple accounts for different applications (PC or mobile) using the same biometrics, they can update R to obtain multiple multi-biometric templates, which ensures that the templates are unlinkable.

Cancelability: Once the user's multi-biometric template is found to be compromised, a new template can be generated immediately by updating R, and the compromised template can be immediately invalidated.


Irreversibility: According to our algorithm, assume there is a fusion template F^U = {f_i^U | i = 1, ..., N_M} of fingerprint features and face features without Bloom filter mapping. First, f_i^U is transformed into a binary matrix FT_i of size w × l; then FT_i is mapped to b_i by the Bloom filter. The irreversibility of our algorithm lies in the following. On the one hand, the original position of each column of FT_i is hidden: it is hard to infer, from the position of a 1 in b_i, which column of FT_i it was mapped from. Moreover, the high correlation between the biometric features strengthens the difficulty of this inverse operation. If the attacker cannot recover the complete column arrangement of FT_i, the original fusion features cannot be recovered. On the other hand, several columns of FT_i may map to the same position. Since the Bloom filter is a many-to-one mapping, it is hard to infer from the position of a 1 in b_i which columns of FT_i were mapped to that position. The irreversibility of the Bloom filter mapping has been proven by Rathgeb et al. [5].

6 Conclusion

In this work, we propose a cancelable multi-biometric template generation algorithm based on the Bloom filter. In the proposed algorithm, the fingerprint features and the face features are binarized separately, which means that the fingerprint features do not need to be pre-aligned and the face binary features are cancelable. Then, an XOR operation is used to fuse the grouped fingerprint binary features and the face binary features into one template at the feature level, and an irreversible transform is applied to the fusion template using the Bloom filter. This method of feature fusion and transformation makes the entire multi-biometric template cancelable by updating the random matrix (which can be stored in the user's identity token) used for the binarization of face features. Finally, the traversal matching method is used to calculate the matching score between templates so that the entire matching process can be completed in the encryption domain. The experimental results show that the EER without Bloom filter mapping is 0.39%, while the optimal EER with Bloom filter mapping is 0.79%. Hence, the proposed algorithm ensures the reliability of identity authentication and improves the security of the multi-biometric template. On the other hand, our algorithm achieves the cancelability and diversity of the multi-biometric template by means of the random projection used in BioHashing. BioHashing, however, has the disadvantage that the EER rises when the random matrix is stolen. Therefore, in future work, we will consider the effect of random matrix loss on the performance of our algorithm and use improved methods (such as Zheng's novel BioHashing [15]) instead of the BioHashing used here, to further improve the reliability of our algorithm.


References

1. Chen, Y., Yang, J., Wang, C., Liu, N.: Multimodal biometrics recognition based on local fusion visual features and variational Bayesian extreme learning machine. Expert Syst. Appl. 64, 93–103 (2016)
2. Ratha, N.K., Chikkerur, S., Connell, J.H., Bolle, R.M.: Generating cancelable fingerprint templates. IEEE Trans. Pattern Anal. Mach. Intell. 29(4), 561–572 (2007)
3. Lim, M., Verma, S., Mai, G., Yuen, P.C.: Learning discriminability-preserving histogram representation from unordered features for multibiometric feature-fused-template protection. Pattern Recognit. 60, 706–719 (2016)
4. Rathgeb, C., Breitinger, F., Busch, C.: Alignment-free cancelable iris biometric templates based on adaptive bloom filters. In: International Conference on Biometrics, Madrid, pp. 1–8. IEEE (2013)
5. Rathgeb, C., Busch, C.: Cancelable multi-biometrics: mixing iris-codes based on adaptive bloom filters. Comput. Secur. 42(4), 1–12 (2014)
6. Rathgeb, C., Gomez-Barrero, M., Busch, C., Galbally, J., Fierrez, J.: Towards cancelable multi-biometrics based on bloom filters: a case study on feature level fusion of face and iris. In: International Workshop on Biometrics and Forensics, Gjovik, pp. 1–6. IEEE (2015)
7. Gomez-Barrero, M., Rathgeb, C., Galbally, J., Fierrez, J.: Protected facial biometric templates based on local gabor patterns and adaptive bloom filters. In: International Conference on Pattern Recognition, Stockholm, pp. 4483–4488. IEEE (2014)
8. Li, G., Yang, B., Rathgeb, C., Busch, C.: Towards generating protected fingerprint templates based on bloom filters. In: International Workshop on Biometrics and Forensics, Gjovik, pp. 1–6. IEEE (2015)
9. Stokkenes, M., Ramachandra, R., Sigaard, M.K., Raja, K., Gomez-Barrero, M., Busch, C.: Multi-biometric template protection – a security analysis of binarized statistical features for bloom filters on smartphones. In: 2016 Sixth International Conference on Image Processing Theory, Tools and Applications, Oulu, pp. 1–6. IEEE (2016)
10. Sadhya, D., Singh, S.K.: Providing robust security measures to Bloom filter based biometric template protection schemes. Comput. Secur. 67, 59–72 (2017)
11. Lee, C., Kim, J.: Cancelable fingerprint templates using minutiae-based bit-strings. J. Netw. Comput. Appl. 33(3), 236–246 (2010)
12. Yang, W., Hu, J., Wang, S., Stojmenovic, M.: An alignment-free fingerprint bio-cryptosystem based on modified Voronoi neighbor structures. Pattern Recognit. 47(3), 1309–1320 (2014)
13. Sandhya, M., Prasad, M.V.N.K.: k-Nearest Neighborhood Structure (k-NNS) based alignment-free method for fingerprint template protection. In: International Conference on Biometrics, Phuket, pp. 386–393. IEEE (2015)
14. Teoh, A.B.J., Goh, A., Ngo, D.C.L.: Random multispace quantization as an analytic mechanism for biohashing of biometric and random identity inputs. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 1892–1901 (2006)
15. Zheng, Y., Cao, Y., Chang, C.H.: Facial biohashing based user-device physical unclonable function for bring your own device security. In: 2018 IEEE International Conference on Consumer Electronics, Las Vegas, pp. 1–6. IEEE (2018)

Streaming ETL in Polystore Era

Nabila Berkani and Ladjel Bellatreche

Ecole Nationale Supérieure d'Informatique, BP 68M, 16309 Oued-Smar, Alger, Algeria
[email protected]
LIAS/ISAE-ENSMA, Futuroscope, Poitiers, France
[email protected]

Abstract. In today's digital environment, businesses have to access, store and analyze, in a real-time fashion, vast amounts of data issued from streaming graph-structure data sources. To meet these requirements, companies owning the data warehouse (DW) technology have to combine hardware and software solutions to reduce the time latency between a DW and its data sources. The explosion of advanced hardware deployment platforms such as polystores represents an opportunity, as pointed out in recent studies. But deploying a graph-structure DW over a polystore is not a simple task, since it requires two important phases: data partitioning and allocation. We claim that these phases have to be connected to the ETL (Extract, Transform, Load) phase, especially its loading process. This connection questions the initial schedule of the ETL and deployment processes. In this paper, we present a new approach that connects the ETL and deployment processes and challenges their traditional scheduling to meet real-time analysis requirements.

Keywords: RDF · Fragmentation · Allocation · ETL · Polystore

1 Introduction

With the arrival of the Big Data era, several voices have been raised announcing the demise of the DW, in the same way as for relational DBMSs after the appearance of NoSQL. The risk of such a campaign is that it puts aside several interesting research issues that are not yet well studied and solved. The renaming of several thematic conferences such as DaWaK and DOLAP, removing DW and substituting it with Big Data, is an example of this risk. We claim that the Big Data era is a chance for the renaissance of the DW, because it brings several interesting dimensions that have to be exploited in DW design. Some recent research efforts have been made in this direction, such as the incremental incorporation of its V's in DW design (e.g., Variety [1], Value [2]) and the embedding of learning techniques and tools inside the DBMSs hosting DWs [20]. Linked Open Data (an important element of the Big Data landscape) offer additional streaming RDF graph sources feeding the DW [2].


The proliferation of graph-structure data streams has fueled the demand for real-time data analytics [18]. To satisfy this requirement, hardware and software solutions have to be combined in order to reduce the latency time between the data sources and the target DW. From a software perspective, the traditional ETL (Extract, Transform, Load) phase has to be revisited to handle two main issues: (i) the new variety brought by the graph structure of data and (ii) the streaming nature of data sources. The first issue has been addressed in [2], where a meta-model describing graph sources has been proposed; it has also been used as a support to redefine the traditional ETL operations identified in [25]. A couple of studies have been proposed to deal with the second issue, which covers two main aspects: (i) rescheduling the processes of ETL in the context of traditional DWs. In [28], ETL has been turned into ELT: data are loaded into the DW and transformation primitives are then performed inside the DBMS hosting the DW via SQL queries. To speed up these queries, materialized views are selected incrementally as new data arrive. (ii) optimizations of ETL processes performed outside the DBMS, which require main memory (caching) and process management [3,13]. From a hardware perspective, the usage of advanced platforms such as polystores to deploy the DW represents an interesting direction. Polystores have been launched based on M. Stonebraker's principle of "one size does not fit all": platforms with a variety of storage engines have been developed, each one designed for a specialized use case. Several polystores exist: BigIntegrator [31], BigDAWG [7], etc. Setting up a polystore in our context of DW necessitates the resolution of the graph data partitioning (in this paper, we use fragmentation and partitioning interchangeably) and allocation problems, which are formalized as follows: given a DW, a Sparql workload, a polystore with l stores {S1, . . . , Sl} and a set of constraints related to each site such as storage cost, these problems consist in partitioning the DW into N fragments {F1, . . . , FN} and in allocating the obtained fragments over the sites. The selected partitioning and allocation schemes have to optimize the workload and respect the set of constraints. Due to the high interaction of Sparql queries [16], any partitioning algorithm has to consider this interaction. Usually, the ETL and deployment (denoted by D) phases are treated in an isolated way, except for the recent work presented at CIDR 2017 [19] that largely discussed the issue of associating streaming ETL with a polystore in the Internet of Things to reduce data latency. With the combination of software and hardware solutions in mind to reduce latency, the processes used have to be well scheduled and the algorithms optimized with appropriate data structures. We have identified three possible schedules: (1) [(ET ||D) −→ L]: the extraction and transformation processes of ETL and the deployment (partitioning and allocation) are first executed in parallel and then the loading process populates the deployed DW. (2) [ET −→ DL]: the extraction and transformation processes are executed first, then the deployment and loading processes. (3) [D −→ ET L]: a deployment schema of the DW is obtained by executing our partitioning and allocation algorithms,



then the whole ETL processes are executed. These schedules with their corresponding optimizations are detailed in this paper. The paper is organized as follows: Sect. 2 overviews the main existing studies on real time ETL processes and graph-structure data partitioning and allocation. Section 3 details our proposal by presenting all software and hardware solutions dedicated to ETL and deployment processes and including scheduling and optimizations. Section 4 presents our experimental studies. Section 5 concludes the paper.

2 Related Work

This section overviews the most important studies tackling real-time ETL processes and RDF data partitioning and allocation.

Real-Time ETL Process. The main existing studies on real-time ETL are mostly focused on optimization aspects in the context of traditional DWs. These optimizations cover three main points: (i) change data capture (CDC), (ii) join operations and (iii) scheduling ETL flows. CDC consists in identifying, capturing and delivering data as it is inserted, updated, or deleted in the data sources. In [11,12], the traditional loading processes are transformed into incremental loadings. [32] proposes an approach implementing a real-time DW by means of Web services that ensure CDC using the XML language. Join operations are widely used during ETL. [5] focused on the arrival speed of stream data and the management of disk and main-memory accesses to perform these joins, and therefore proposed a MESH join operation in the transformation phase of ETL. For the same purpose, the authors in [24] proposed an algorithm that uses a Semi-Streaming Index Join. Regarding the scheduling of ETL jobs, the authors in [14] used a queue network in which an Active Data Staging Area is built between the sources and the target DW; this area allows managing the different jobs associated with the ETL processes.

RDF Partitioning and Allocation did not get the same attention as traditional data [21]. The existing studies on RDF data partitioning mainly concentrated on the partitioning modes, of which two types exist: hash-based partitioning and graph-oriented partitioning. Trinity.RDF [30] is a distributed in-memory RDF engine that uses the hash partitioning mode to fragment RDF data represented in its native graph form and stored in a key-value store system. SHAPE [17] is based on a semantic hash partitioning approach for RDF data; it starts with a simple hash partitioning and then employs the k-hop strategy proposed in [10]. To ensure the scalability of RDF data partitioning methods, several studies proposed algorithms based on graph theory. [10] uses METIS in order to partition RDF graphs into several fragments. [29] proposed an end-to-end path partitioning scheme, which considers all possible directed paths in the RDF graph; these paths are merged in a bottom-up fashion. A few studies related to the processes of RDF data partitioning and allocation exist, where workload-driven approaches were proposed and implemented in the systems WARP [9] and Partout [8]. The authors


in [22] propose a methodology that considers the processes of partitioning an RDF graph database and allocating its fragments over various sites. This work is largely inspired by traditional distributed databases [21].

3 Our Proposal

Before detailing our proposal, let us present some concepts and definitions related to the inputs of our problem and to the processes of the deployment phase. RDF (https://www.w3.org/RDF/) is a set of 4-tuples < s, o, p, g >, where the subject s has the property p, the value of that property is the object o, and g is the graph label IRI. Note that the graph label g can be omitted, in which case the triples are considered part of the default graph. N-Quads is a line-based, plain-text format for encoding an RDF dataset; it defines an RDF dataset composed of RDF graphs. Sparql (www.w3.org/TR/rdf-SPARQL-query/) is the standard query language for RDF.

Definition 1. An RDF Graph G is a finite set of RDF N-Quads in which every 4-tuple describes a directed edge labeled with p from the node labeled with s to the node labeled with o belonging to G. Subjects s can be URIs or blank nodes, properties p are URIs, while objects o can be URIs, blank nodes, or literals.

Definition 2. A Sparql Query Graph is a finite set of RDF triple patterns; some nodes in a pattern are variables which may appear in multiple patterns.

The global architecture of our proposal, presented in Fig. 1, takes as input various graph-structure data sources (e.g. semantic and graph databases, knowledge graphs, etc.) and integrates them into the DW deployed on a Polystore system. It is composed of three main components: (1) the CDC component, which extracts and retrieves updated data from the integrated sources; (2) the In-memory Store Management component, which is responsible for the partitioning phase and the required transformations of the ETL process using a pipeline strategy; and (3) the Graph Distributed Management component, which is responsible for the allocation of graphs over the polystore and their loading into each site store.
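To make Definitions 1 and 2 concrete, a minimal Python representation of an RDF graph as a set of 4-tuples and of a Sparql query graph as a list of triple patterns (variables prefixed with "?") could look as follows; the representation and the conventional subject-predicate-object-graph field order are ours and serve only as an illustration.

from typing import List, NamedTuple, Optional, Set

class Quad(NamedTuple):
    s: str                      # subject: URI or blank node
    p: str                      # property: URI
    o: str                      # object: URI, blank node, or literal
    g: Optional[str] = None     # graph label IRI; None means the default graph

RDFGraph = Set[Quad]
TriplePattern = tuple           # (s, p, o) where any position may be a variable such as "?x"
SparqlQueryGraph = List[TriplePattern]

# toy instances of Definitions 1 and 2
G: RDFGraph = {
    Quad("ex:alice", "foaf:knows", "ex:bob", "ex:g1"),
    Quad("ex:bob", "ex:wasBornIn", "ex:Cyprus"),
}
q: SparqlQueryGraph = [("?x", "ex:wasBornIn", "ex:Cyprus"),
                       ("?x", "foaf:knows", "?y")]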

3.1 Partitioning and Allocation

In this section, we present the different steps of the fragmentation and allocation strategies. We consider a Sparql workload, given in the form of Sparql query patterns Q = {q1, q2, . . . , qn}. Our goal is to define the strategy to partition an RDF data-set, represented by an RDF graph G, into RDF fragments (sub-graphs). Each RDF fragment will be allocated to a target site. Figure 3 illustrates the strategy applied.



Fig. 1. A general architecture of our proposal.

Hypergraph Partitioning. As already stated, Sparql queries are naturally represented via graphs. Moreover, as the volume of RDF data becomes increasingly large, Sparql queries reference a high number of quads and therefore generate a huge number of connections between patterns, giving rise to the Multi-Query Optimization problem [16]. Managing a scenario with a large number of complex queries with high interaction requires an adequate graph representation for Sparql queries. In this perspective, we propose to use the hypergraph structure for representing Sparql queries; hypergraphs are massively used to design and test integrated circuits [15]. Hypergraphs have the characteristic that their hyperedges are arbitrary sets of vertices (nodes) and can therefore connect several vertices. We thus project Sparql queries onto a hypergraph structure: we map the vertices of the hypergraph to quad patterns, and hyperedges connect quad patterns when a transaction accesses those quad patterns in queries. Figure 2 describes an example of query interactions applied to the 14 LUBM (http://swat.cse.lehigh.edu/projects/lubm/) queries. Partitioning of graphs is a well-studied problem in computer science, and we can therefore leverage existing strategies to do the fragmentation of our query workload represented by a hypergraph. The following steps are applied to the workload Q:
– Step 1: parse the Sparql query workload Q in order to identify the vertices (nodes) and hyperedges. Each Sparql query quad pattern is represented by a vertex, and each operation in a query (join, ...) that connects quad patterns is defined by a hyperedge of the form (subject/object, operation, subject/object). For instance, for the two patterns t1 = (?x yago:wasBornIn yago:Cyprus) and t2 = (?y yago:isLocatedIn ?z), (t1, t2) are the vertices and e1 = (yago:Cyprus, join, ?y) is the hyperedge connecting both vertices using a join operation. The result of this step is a hypergraph HG, having a set of vertices V and a set of hyperedges E.



Fig. 2. Query interactions.


Fig. 3. Fragmentation and allocation strategies.

Each vertex vi ∈ V represents a Sparql query pattern and each hyperedge ei ∈ E connecting vertices represents the operation linking query patterns (such as join and filter).
– Step 2: In this step, we use a hypergraph partitioning algorithm derived from graph theory. We choose hMETIS (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview), a hypergraph partitioning program [15] that divides a hypergraph into k partitions representing sub-hypergraphs having disjoint vertices. We input the Sparql workload Q as a hypergraph to the hMETIS algorithm and specify the desired number of partitions, which corresponds to the number of Polystore sites k. hMETIS then outputs the partition components, each representing a sub-hypergraph.
– Step 3: Having partitioned the workload Q into k sub-hypergraphs, we transform each sub-hypergraph into a labeled graph that represents a Sparql query graph, whose nodes are triple patterns and whose edges are the connections between triple patterns. We adapted the algorithm proposed in SLEMAS [6]. Once the labeled sub-graphs are generated, we use them as Sparql sub-queries to construct RDF fragments: we run each Sparql sub-query on the RDF graph G, and the result of each sub-query is an RDF fragment that is allocated over the stores of our polystore in a round-robin mode.
To evaluate the cost of data partitioning and allocation, we develop mathematical cost models. Let Cost_F and Cost_A be these respective costs, defined as Cost_F = Cost_{hMETIS} + Cost_{HypergraphToGraph}, where Cost_{hMETIS} and Cost_{HypergraphToGraph} represent, respectively, the time spent by hMETIS and by the process of translating a hypergraph into a graph; and Cost_A = \sum_{i=1}^{m} Cost_{A_i}, where Cost_{A_i} is the time taken to allocate the fragment F_i over the Polystore.

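As a sketch of Steps 1 and 2, the workload hypergraph can be built from shared terms between quad patterns and serialized in the plain hMETIS input format (first line: number of hyperedges and number of vertices, then one line per hyperedge listing its 1-based vertex ids), after which the hMETIS binary is run externally with the desired number of parts k. The shared-term heuristic is a simplification of the join/filter analysis of Step 1, and the file layout reflects our reading of the hMETIS manual rather than the authors' code.

from itertools import combinations

def build_hypergraph(queries):
    """Step 1 (simplified): one vertex per quad pattern, one hyperedge per pair of
    patterns that share a variable or a constant join term."""
    vertices, hyperedges = [], []
    for q in queries:
        vertices.extend(q)                       # q is a list of (s, p, o) patterns
    for (i, t1), (j, t2) in combinations(enumerate(vertices), 2):
        if set(t1) & set(t2):                    # shared term => join/filter connection
            hyperedges.append((i + 1, j + 1))    # hMETIS vertex ids are 1-based
    return vertices, hyperedges

def write_hmetis_input(hyperedges, num_vertices, path="workload.hgr"):
    """Step 2: serialize the hypergraph, then run the hMETIS binary on workload.hgr with k parts."""
    with open(path, "w") as f:
        f.write(f"{len(hyperedges)} {num_vertices}\n")
        for edge in hyperedges:
            f.write(" ".join(str(v) for v in edge) + "\n")

queries = [[("?x", "yago:wasBornIn", "yago:Cyprus"), ("?x", "yago:graduatedFrom", "?u")]]
V, E = build_hypergraph(queries)
write_hmetis_input(E, len(V))
# hMETIS then writes a partition file assigning each vertex (quad pattern) to one of the k parts,
# from which the k Sparql sub-queries of Step 3 are reassembled.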

3.2 Real Time ETL Process

We consider as a starting point some data sources having a graph representation and the target DW schemata of the Polystore system (defined using the fragmentation and allocation strategies). An ETL process is usually represented as a directed acyclic graph (DAG) containing a set of nodes (that represent schema attributes or instances) and edges (describing the data flow among the nodes using the RDFS taxonomy). The data flow is the set of operations needed to transform the input data. The operations are applied at the node level and generate new nodes forming an ETL graph. In addition to the traditional ETL operators, some primitives are needed to manage the graph representation of the processed data:
- AddNode(GT, Vj, Ej, Lj): adds node Vj, edge Ej, label Lj to GT.
- UpdateNode(GT, Vj, Ej, Lj): updates node Vj, edge Ej, label Lj in GT.
- RenameNode(GT, Vj, Ej, Lj): renames node Vj, edge Ej, label Lj in GT.
- DeleteNode(GT, Vj, Ej, Lj): deletes node Vj, edge Ej, label Lj from GT.
- SortGraph(GT, Vj, CS): sorts the nodes of GT based on some criterion CS to improve search performance.

Our objective is to facilitate, manage and optimize the design of the ETL process, the deployment phase and the continuous evolution of the DW. For that, we enrich the existing ETL operators with Split, Context and Link operators, elevating the clean-up and deployment of the ETL process to the conceptual level (a sketch of these operators is given after this list):
- Split(G, Gi, Gj, CS): splits G into two sub-graphs Gi and Gj based on CS.
- Link(GT, Vi, Vj, CS): links two nodes Vi and Vj using the rule CS.
- Context(G, GT, CS): extracts from the graph G a sub-graph GT that satisfies the context defined by the restrictions CS using axioms.
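A minimal sketch of how these graph-level operators could be realized over a simple labeled-graph structure; the class, the rule representation (plain Python predicates) and the handling of cross edges in Split are ours and only illustrate the signatures listed above.

class ETLGraph:
    """Labeled graph G_T = (V, E, L) manipulated by the ETL operators."""
    def __init__(self):
        self.nodes = {}          # node id -> label
        self.edges = set()       # (source id, target id, edge label)

    def add_node(self, v, e=None, label=None):        # AddNode(G_T, V_j, E_j, L_j)
        self.nodes[v] = label
        if e is not None:
            self.edges.add(e)

    def delete_node(self, v):                          # DeleteNode(G_T, V_j, E_j, L_j)
        self.nodes.pop(v, None)
        self.edges = {e for e in self.edges if v not in (e[0], e[1])}

    def link(self, vi, vj, rule="relatedTo"):          # Link(G_T, V_i, V_j, CS): rule name is illustrative
        self.edges.add((vi, vj, rule))

    def split(self, predicate):                        # Split(G, G_i, G_j, CS): CS given as a predicate
        gi, gj = ETLGraph(), ETLGraph()
        for v, label in self.nodes.items():
            (gi if predicate(v, label) else gj).nodes[v] = label
        for e in self.edges:                           # cross edges are kept in G_j in this sketch
            target = gi if e[0] in gi.nodes and e[1] in gi.nodes else gj
            target.edges.add(e)
        return gi, gj

    def context(self, predicate):                      # Context(G, G_T, CS): restriction-satisfying sub-graph
        sub, _ = self.split(predicate)
        return sub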

Having described the representation of the ETL operators, we now detail the various components of our solution; the process is depicted in Fig. 4.

Fig. 4. Real time ETL process components


Change Data Capture (CDC) Component. In order to process the updated stream data from the sources, we use the trigger-based capture CDC technique [23], which identifies the deltas of the data sources using triggers implemented in each source participating in the integration. The processed data are usually freely accessible over Sparql endpoints, and the triggers have been implemented using the Sparql query language to discover the necessary information. The extracted data (graphs) are sent to the In-memory Store Management component and are considered as events that trigger the transformation process; each event corresponds to a unit of graph. The left side of Fig. 4 illustrates this mechanism.

In-Memory Store Management Component. This component is dedicated to handling the ETL transformations and the distribution of the resulting ETL graphs to the Polystore system. These two operations are performed according to our three scenarios: [(ET ||D) −→ L], [ET −→ DL] and [D −→ ET L]. The In-memory Store area contains the following zones: (1) B_F&A, a buffer dedicated to the execution of the fragmentation and allocation operations, and (2) memory_server_i (1 ≤ i ≤ n), a memory area dedicated to the ETL transformations according to the n target schemata of the Polystore. Note that the fragmentation and allocation strategy defined in the previous section produces fragmentation and allocation schemes saved at the DBMS catalog level. These schemes are used during the transformation and loading steps of the ETL process. For example, the Oracle graph database offers the MDSYS schema, which allows keeping this information and using it during the ETL process. The real-time ETL process starts when an incoming event from the CDC component triggers it. These events are sent to the appropriate memory zone (server_i). The memory zones are organized as blocks and are handled depending on the incoming events generating a data flow. Each event involves the execution of an ETL operator according to the transformation required by the fragmentation schema. As these events continue to show up in the stream, the corresponding blocks stay in memory and keep updating the ETL flow. After an event has been processed, its block is overwritten by new events. Note that the life cycle of any block, that is, the period between its last update and its eviction from the cache, is less than the real-time window (for example 2 min). Figure 4 illustrates this process. Further, ETL operations are executed in a pipelined order [27] to ensure efficient parallel processing. The ETL process is divided into various sub-processes that can work simultaneously using the Split operator. ETL flows are processed concurrently with a data stream from the consumer operator to the producer. During the implementation, different delta batches are executed at the same time while still guaranteeing the consistency of the system. Some physical operations are required, such as multi-threading, callbacks, cancel, pause and resume. Figure 5 describes an example of our pipeline strategy for the monthly publication process. Suppose the monthly publication fact table (having the schema ...) records the


total publications in conferences and journals. The ETL flow can be flushed by three delta batches, where two sub-processes (triggered by two different events) can be executed concurrently, giving rise to the pipelining of delta batches. The final operation is the union of the resulting sub-processes, which cannot start until both have been finalized. Data pipelining is introduced here to increase the throughput of the ETL flow.

Fig. 5. Example of pipelined execution of ETL flow.
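The event-driven, pipelined execution described above can be sketched with one bounded in-memory queue per memory_server zone and one worker thread per zone, as below. Block eviction, the real-time window and the final union step are simplified, and all names are illustrative rather than taken from the authors' implementation.

import queue
import threading
import time

REAL_TIME_WINDOW = 120          # seconds a block may stay in the in-memory zone (illustrative)

def cdc_listener(events, zones):
    """CDC component: route each incoming delta (sub-graph) to its memory_server zone."""
    for event in events:
        zones[event["target_store"]].put(event)

def etl_worker(zone, load_chunk, stop):
    """memory_server_i: apply the ETL transformation of each block, then hand it to loading."""
    while not stop.is_set() or not zone.empty():
        try:
            event = zone.get(timeout=1)
        except queue.Empty:
            continue
        transformed = {"graph": event["graph"], "ts": time.time()}   # placeholder transformation
        load_chunk(transformed)                                      # eviction = hand-off to the chunk builder
        zone.task_done()

zones = {i: queue.Queue(maxsize=1024) for i in range(2)}             # two target stores
chunks = []
stop = threading.Event()
workers = [threading.Thread(target=etl_worker, args=(z, chunks.append, stop)) for z in zones.values()]
for w in workers:
    w.start()
cdc_listener([{"target_store": 0, "graph": ["quad1"]}, {"target_store": 1, "graph": ["quad2"]}], zones)
for z in zones.values():
    z.join()
stop.set()
for w in workers:
    w.join()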

One of the main challenges is to ensure the fault tolerance of the solution. This can be done by supporting recovery of the memory zones. To do so, the cache-based system appends all stream events (blocks) to their corresponding files on disk. This strategy can be implemented by almost any DBMS. For instance, Oracle uses redo log files to store the logs generated during processing, which provides a write-ahead log for the in-memory operations. When the system restarts, the DBMS reads all persisted files and reconstructs the in-memory zones.

Graph Distributed Management Component. This step comes at the end of the ETL transformations and is responsible for the allocation of the resulting data (graphs) to the Polystore system. Here, the allocation scheme (according to the allocation strategy described in the previous section) is needed to distribute the data (each partitioned and transformed graph) to the appropriate site. Once the graphs have been allocated, the loading phase is executed on the Polystore system. The latter repeatedly merges the sub-graphs generated by the ETL flow into a single graph that we call a chunk (a fixed number of sub-graphs is merged). Each chunk is assigned a unique id, which is mostly used for locating chunks. The creation of chunks happens whenever blocks of ETL events have been completely processed and evicted from the in-memory store. Converting multiple small sub-graphs into a single graph chunk depends on the storage layout of the target Polystore system. For instance, Oracle stores RDF graphs using the N-Quads format based on a data model


store (RDF MODELS, RDF NODES, RDF LINKS, RDF VALUES, ...) where sub-graphs can be merged using union operations on the RDF NODES table.

3.3 Polystore Deployment: REST-Based ETL Process

The deployment of our solution is carried out on a loosely-coupled polystore system [4], which is reminiscent of a multidatabase system. We believe that this approach is the most appropriate for our integration solution using the partitioning and allocation strategies. To achieve our goal, we have implemented a service-based solution: ETL as a Service (ETLaaS) and Polystore as a Service (PaaS). The solution takes as inputs (i) a set of stream data sources referencing a shared ontology or knowledge base and (ii) a set of mappings defined between these sources and the target DW. It outputs a DW distributed storage system (Fig. 6). Each component is implemented as a web service. Since we deal with RDF data coming from the web (ontologies, knowledge bases, . . .), we choose the REST protocol and have implemented our solution using a RESTful API. For data exchange we opted for the JavaScript Object Notation (JSON) format. The communication between the data sources and the target DW is done using JSON objects with a unique identifier through the RESTful HTTP API. Our solution is implemented as a service-oriented architecture (SOA). SOA offers loose coupling of the web services defined above and interaction among them, which fits loosely-coupled polystore systems. It allows the integration of new web services without affecting the existing ones. This provides flexibility in the physical deployment of the DW.
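A toy Flask endpoint illustrating the JSON-over-REST exchange between the ETL service (ETLaaS) and one store of the polystore (PaaS). The routes, field names and the choice of Flask are ours, not the paper's implementation; they only show how graph chunks could be pushed and retrieved as JSON objects over a RESTful HTTP API.

from flask import Flask, jsonify, request

app = Flask(__name__)
CHUNKS = {}        # chunk id -> list of N-Quads (stands in for one store of the polystore)

@app.route("/chunks", methods=["POST"])
def receive_chunk():
    """ETLaaS pushes a graph chunk; the store service keeps it under its unique id."""
    payload = request.get_json(force=True)
    CHUNKS[payload["chunk_id"]] = payload["quads"]
    return jsonify({"status": "loaded", "chunk_id": payload["chunk_id"]}), 201

@app.route("/chunks/<chunk_id>", methods=["GET"])
def get_chunk(chunk_id):
    """Retrieve a previously loaded chunk as a JSON object."""
    return jsonify({"chunk_id": chunk_id, "quads": CHUNKS.get(chunk_id, [])})

if __name__ == "__main__":
    app.run(port=5000)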

4 Experiments

In this section we conduct an experimental study to analyze the effectiveness of our proposal. We start by describing the test environment and the evaluated scenarios, then we present the results obtained and discuss the advantages and disadvantages of the scenarios.

Settings. We used Oracle Semantic Database 12c release 2 as the database backend for each store. RDF data are stored in Quad tables, using a distinct integer for each distinct URI or literal value. B-tree indexes are created on each Quad table on the s, p, o, g columns, and on all two- and three-column combinations, for performance reasons. Moreover, some PL/SQL APIs are invoked after the integration of each data source (load of instances). The SGA and PGA memories are also increased to 8 GB. The ontology of the LUBM benchmark, related to the university domain, is used to generate the schemas of the data sources; it offers fourteen extensional queries representing a variety of properties. We used a real-world data set from the YAGO KB, version 3.0.2, whose architecture is classified



Fig. 6. Polystore deployment of the ETL process.

on themes. Each theme is a set of facts, and a fact is an RDF graph. YAGO defines a context relation between individuals [26], which we used to extract the set of themes related to our study context (the e-learning domain). The resulting contextual YAGO DW contains around 562M Quads. From this contextual set, we generated five data-sets with 120, 280, 420, 720 and 950 universities, respectively, representing the data sources. The five data sources and the DW store schemata have been deployed using the Oracle DBMS. Oracle offers different formats for data loading, such as RDF/XML, N-TRIPLES, N-QUADS, TriG and Turtle; we chose the N-QUADS format to load instances, using Oracle SQL*Loader, into the data sources and the site stores of the DW. Our evaluations were performed on a laptop computer (HP EliteBook 840 G3) with an Intel Core i7 (6th generation) at 3.4 GHz, 8 GB of RAM and a 1 TB hard disk, running Windows 10 64-bit. We use Oracle Database 12c release 2, which offers the RDF Semantic Graph features of Oracle Spatial and Graph. Cytoscape (http://www.cytoscape.org/) is used for visualization. We conducted several experiments to evaluate our proposal. In the first experiment, we evaluate the response time of our proposal compared to our previous work [2], which does not use a partitioning strategy or a memory cache. Note that a memory cache is allocated for the stream processing of the integration process. Figure 7 illustrates our finding that the fragmentation strategy greatly improves the response time of the ETL process, as does the use of the in-memory cache. In the second experiment, we evaluate the performance of the DW construction according to the life-cycle phases. To do so, we use a random workload of LUBM queries.



Figure 8 illustrates the time spent to deploy the DW and run the ETL process according to the different scenarios described above. The deployment and ETL process time basically consists of the fragmentation and allocation time and the execution time of the ETL process that populates the DW site stores. We notice that the [(ET ||D) −→ L] scenario is the optimal one. We can conclude that the parallel execution of the deployment and ETL phases considerably improves the response time. In addition, the graph partitioning using the hMETIS algorithm and our round-robin allocation of fragments are performed in a reasonable time, depending on the size of the data sources.

Fig. 7. ETL performance time.

Fig. 8. Data loading time.

In the third experiment, we compare our partitioning proposal with two approaches, H-RDF-3X, a graph-based partitioning [10], and Partout, a workload-based partitioning [8], on the 14 LUBM benchmark queries. Figure 9 shows the query performance of the different approaches. Generally speaking, we find that our method outperforms the state-of-the-art techniques in most cases. This is because the hMETIS algorithm takes into account the interaction among queries during the partitioning process; hence, each site stores fragments that are semantically linked. Finally, we evaluate the scalability of our partitioning approach by varying the RDF data-set sizes. The results are shown in Fig. 10. Generally, the results show that the duration of partitioning remains reasonable as the size of the data-set increases, which demonstrates the scalability of the partitioning.

Fig. 9. Performance of queries.

Fig. 10. Scalability
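As a rough illustration of how the fragments produced by the hypergraph partitioning can be assigned to site stores, the following Python sketch applies the round-robin allocation mentioned above. It is a simplified sketch under our own assumptions (the fragment identifiers and the number of sites are hypothetical); the hMETIS partitioning step is only represented by its output, a list of fragments.

def round_robin_allocation(fragments, n_sites):
    # Assign fragment i to site (i mod n_sites), so that consecutive
    # fragments produced by the partitioner are spread evenly over the sites.
    allocation = {site: [] for site in range(n_sites)}
    for i, fragment in enumerate(fragments):
        allocation[i % n_sites].append(fragment)
    return allocation

# Hypothetical output of the hMETIS-based partitioning: each fragment groups
# quads that are accessed together by semantically related SPARQL queries.
fragments = ["F0", "F1", "F2", "F3", "F4", "F5", "F6"]
print(round_robin_allocation(fragments, n_sites=4))
# {0: ['F0', 'F4'], 1: ['F1', 'F5'], 2: ['F2', 'F6'], 3: ['F3']}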

5 Conclusion

In this paper, we couple hardware and software solutions to design streaming ETL involving graph-structured data sources, in order to reduce the latency between a DW and its sources. We propose the use of polystore systems as the hardware solution for deploying the target DW, and ETL software techniques to achieve this objective. This coupling has to be well mastered, since the ETL and deployment processes are cumbersome and strongly interact. We first proposed a scalable partitioning algorithm for graph-structured DWs that exploits the interaction among SPARQL queries; it is based on a hypergraph data structure. Then, three scenarios are identified to schedule the ETL and deployment processes, associated with optimizations such as in-memory caching for handling streaming transformations in a pipeline fashion. A polystore deployment solution based on REST with the JSON format is given. The obtained results are encouraging and show the impact of the scheduling of the ETL and deployment processes on reducing latency. The most important results concern the performance of the ETL processes compared to our previous work, thanks to the partitioning, in-memory and pipeline strategies. Currently, we are deploying our solution on a real polystore and studying the scalability of our proposal when increasing the number of queries and the data volume.

References
1. Berkani, N., Bellatreche, L.: A variety-sensitive ETL processes. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10439, pp. 201–216. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64471-4_17
2. Berkani, N., Bellatreche, L., Benatallah, B.: A value-added approach to design BI applications. In: Madria, S., Hara, T. (eds.) DaWaK 2016. LNCS, vol. 9829, pp. 361–375. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43946-4_24
3. Berkani, N., Bellatreche, L., Ordonez, C.: ETL-aware materialized view selection in semantic data stream warehouses. In: RCIS. IEEE (2018)
4. Bondiombouy, C., Valduriez, P.: Query processing in multistore systems: an overview. IJCC 5(4), 309–346 (2016)
5. Bornea, M.A., Deligiannakis, A., Kotidis, Y., Vassalos, V.: Semi-streamed index join for near-real time execution of ETL transformations. In: ICDE, pp. 159–170 (2011)
6. Boukorca, A., Bellatreche, L., Cuzzocrea, A.: SLEMAS: an approach for selecting MV under query scheduling constraints. In: COMAD, pp. 66–73 (2014)
7. Duggan, J., et al.: The BigDAWG polystore system. ACM SIGMOD Rec. 44(2), 11–16 (2015)
8. Galárraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. In: WWW, pp. 267–268. ACM (2014)
9. Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: ICDE Workshops, pp. 1–6 (2013)
10. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)


11. Jörg, T., Deßloch, S.: Towards generating ETL processes for incremental loading. In: IDEAS, pp. 101–110 (2008)
12. Jörg, T., Dessloch, S.: Formalizing ETL jobs for incremental loading of data warehouses. In: BTW, pp. 327–346 (2009)
13. Jörg, T., Dessloch, S.: Near real-time data warehousing using state-of-the-art ETL tools. In: Castellanos, M., Dayal, U., Miller, R.J. (eds.) BIRTE 2009. LNBIP, vol. 41, pp. 100–117. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14559-9_7
14. Karakasidis, A., Vassiliadis, P., Pitoura, E.: ETL queues for active data warehousing. In: IQIS, pp. 28–39 (2005)
15. Karypis, G., Kumar, V.: Multilevel k-way hypergraph partitioning. In: DAC, pp. 343–348 (1999)
16. Le, W., Kementsietsidis, A., Duan, S., Li, F.: Scalable multi-query optimization for SPARQL. In: ICDE, pp. 666–677 (2012)
17. Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endow. 6(14), 1894–1905 (2013)
18. Mayer, R., Mayer, C., Tariq, M.A., Rothermel, K.: GraphCEP: real-time data analytics using parallel complex event and graph processing. In: DEBS, pp. 309–316 (2016)
19. Meehan, J., Aslantas, C., Zdonik, S., Tatbul, N., Du, J.: Data ingestion for the connected world. In: CIDR (2017)
20. Ordonez, C., Johnson, T., Urbanek, S., Shkapenyuk, V., Srivastava, D.: Integrating the R language runtime system with a data stream warehouse. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10439, pp. 217–231. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64471-4_18
21. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer, Heidelberg (2011). https://doi.org/10.1007/978-1-4419-8834-8
22. Peng, P., Zou, L., Chen, L., Zhao, D.: Query workload-based RDF graph fragmentation and allocation. In: EDBT, pp. 377–388 (2016)
23. Ram, P., Do, L.: Extracting delta for incremental data warehouse maintenance. In: ICDE, pp. 220–229 (2000)
24. Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: ICDE, pp. 385–396 (2010)
25. Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Seman. Web 3(4), 1–24 (2007)
26. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)
27. Vassiliadis, P., Simitsis, A.: Near real time ETL. In: Vassiliadis, P., Simitsis, A., et al. (eds.) New Trends in Data Warehousing and Data Analysis, pp. 1–31. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-87431-9
28. Waas, F., Wrembel, R., Freudenreich, T., Thiele, M., Koncilia, C., Furtado, P.: On-demand ELT architecture for right-time BI: extending the vision. IJDWM 9(2), 21–38 (2013)
29. Wu, B., Zhou, Y., Yuan, P., Liu, L., Jin, H.: Scalable SPARQL querying using path partitioning. In: ICDE, pp. 795–806 (2015)


30. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. PVLDB 6(4), 265–276 (2013)
31. Zhu, M., Risch, T.: Querying combined cloud-based and relational databases. In: Cloud and Service Computing (CSC), pp. 330–335. IEEE (2011)
32. Zhu, Y., An, L., Liu, S.: Data updating and query in real-time data warehouse system. In: CSSE, vol. 5, pp. 1295–1297 (2008)

Communication-Aware Prediction-Based Online Scheduling in High-Performance Real-Time Embedded Systems

Baptiste Goupille-Lescar1,2(B), Eric Lenormand1, Nikos Parlavantzas2,3, and Christine Morin2

1 Thales Research and Technology, 1 av. Augustin Fresnel, 91120 Palaiseau, France
{baptiste.goupillelescar,eric.lenormand}@thalesgroup.com
2 Inria, IRISA, 263 av. General Leclerc, 35042 Rennes, France
[email protected]
3 INSA Rennes, IRISA, 263 av. General Leclerc, 35042 Rennes, France
[email protected]
https://www.thalesgroup.com/en
https://www.inria.fr/en/

Abstract. Current high-end, data-intensive real-time embedded sensor applications (e.g., radar, optronics) require very specific computing platforms. The nature of such applications and the environment in which they are deployed impose numerous constraints, including real-time constraints and computing throughput and latency needs. Static application placement is traditionally used to deal with these constraints. However, this approach fails to provide adaptation capabilities in an environment in constant evolution. Through the study of an industrial radar use-case, our work aims at mitigating the aforementioned limitations by proposing a low-latency online resource manager derived from techniques used in large-scale systems, such as cloud and grid environments. The resource manager introduced in this paper is able to dynamically allocate resources to fulfill requests coming from several sensors, making the most of the computing platform while providing guarantees on non-functional properties and Quality of Service (QoS) levels. Thanks to the load prediction implemented in the manager, we are able to achieve an 83% load increase before overloading the platform while reducing the incurred QoS penalty tenfold. Further methods to reduce the impact of the overload, as well as possible future improvements, are proposed and discussed.

Keywords: Embedded systems · Dynamic resource management · Real-time · Scheduling

1 Introduction

Nowadays, with the increasing demand for high-performance computing and smart sensing, high-end embedded system designers are facing an increasing


number of challenges. Indeed, the targeted platforms must respect a great number of non-functional constraints, such as Size, Weight and Power (SWaP), real-time computing and cost constraints. Currently, most embedded systems meet these constraints by the use of dedicated components and static resource allocation approaches based on worst-case scenarios. This method, coupled with the emergence of workloads integrating hard real-time, soft real-time and best-effort applications and the increase of their variability, results in massive over-provisioning and under-utilization of resources. Moreover, while this method allows the design of efficient and reliable systems, it nearly eliminates their adaptation and evolution capabilities by preventing the deployment of highly variable or opportunistic applications for smart sensing. To address these limitations, this paper proposes a smart resource management system, able to fulfill low-latency runtime requests for application execution while providing non-functional guarantees. Timing properties are considered in this paper, while other properties, such as heat dissipation, will be addressed in future work. To achieve high performance gains while guaranteeing timing properties, our contribution is inspired by large-scale resource managers found in cloud or grid infrastructures, making the most of application profiling to enable high-level predictability, low mapping latency as well as high resource utilization. This work is supported by several industrial use-cases, including an Active Electronically Scanned Array (AESA) radar use-case, which is detailed in this paper. To fit the targeted context, a complete simulation framework has been implemented. The results obtained through simulations show that our mapping method yields improvements in both the performance and the predictability of the system. This paper is organized as follows: Sect. 2 provides a quick description of the motivating use-case context. In Sect. 3, related research from both the embedded and large-scale systems communities is presented. Section 4 describes the models and methods used in our approach. In Sect. 5, the simulation framework we created is explained and simulation results are analyzed. Section 6 introduces future work and discusses several open questions. Finally, Sect. 7 draws some conclusions and perspectives.

2 Use-Case Specific Context

Our use-case deals with optimizing the computing resource utilization of a multifunction electronically-steered surface radar and has been characterized with the support of domain experts. This equipment can typically be installed on ground vehicles or on some surface ships such as frigates, with the mission of detecting, tracking and identifying objects of interest. This is done by illuminating narrow angular sectors with one or more sequences of known signals (waveforms) and processing the returned echoes captured on a large number of receivers. Such an observation is commonly referred to as a “dwell”. Dwells can be of several types, depending on their objective (e.g., scanning an unknown sector, tracking a known target) and make use of different waveforms, each with its known duration. The radar antenna is constantly kept busy and scans the whole


radar angular range by successively sending signals with the appropriate waveforms. The processing of received echoes is traditionally split into two distinct phases:
– A front-end phase that applies fixed filtering functions on the digitized received channels. It processes a massive bandwidth of input data and selects a limited set of points of interest by using appropriate detection thresholds. This phase is highly computation-intensive and its computing times are deterministic.
– A back-end phase whose goal is to extract operational information (e.g., position, speed, trajectory, nature of target) from the detection results. Among the functions executed in the back-end are the extraction, which delivers so-called “plots” that characterize one or more targets in precise directions of space, and the tracking, which consolidates plots issued by extraction and builds trajectories of targets.
Considering the regularity of the front-end workload, a near-optimal mapping (i.e., resource allocation) can be found at design time and used during run-time processing. Thus, front-end tasks are not targeted by this study, as their high predictability limits the potential performance gain from the integration of dynamic elements. On the contrary, static design-time mapping methods lose efficiency when applied to functions with data-dependent complexities, which is the case for most back-end applications. Functions like extraction have a largely variable computational cost, depending on the type of dwell as well as the number and configurations of targets that must be discriminated from the front-end detections. The actual run-time of each extraction is only known at the end of its execution; however, this cost is bounded by known minimum/maximum values related to each dwell type. The computing platform used for back-end processing must fit within a specific SWaP budget. In our case, the targeted computing platform consists of 4 racks, each one hosting several boards with several processing elements and a shared memory. All these elements are linked together by an Ethernet network using a star topology, with one central router connected to the routers present in each rack. In addition to SWaP constraints, processing latencies matter, as observing and tracking targets need to keep pace with their kinematics. The combination of variable computation times (and data transfers), a constrained platform and real-time computing renders design-time mapping policies inefficient and drives the need for agile methods. Another goal of the introduction of online resource management in traditionally static systems is to serve

Fig. 1. Application execution timeline


as an enabler for the implementation of new applications with varying resource demands or unknown completion times. To add adaptation capabilities to the system, we experiment with a resource management functionality that receives sporadic sequences of requests, each corresponding to a dwell that will be input to one of the 4 radar antennas. The trigger (see Fig. 1) represents the moment at which the data will be available for the back-end (e.g., extraction) applications, after waveform emission/reception and front-end processing. The resource manager allocates the different back-end computation and communication activities of the dwell application onto the platform resources so that they can be executed before their deadline. For reasons discussed later, this placement decision (Map) is taken just before trigger reception. After having been placed, an application's execution effectively starts only when all its dependencies are met and the targeted processing elements are free, which ideally happens at T = trigger. It then executes for a time depending on the data content before sending the results and freeing the occupied computing and communication resources. Furthermore, note that the processing of separate dwells (addressing different regions in space) can be done independently of each other. Their execution is mandatory (every request must be fulfilled) and non-preemptive. As of now, the resource management system acts purely as an executor and does not have the opportunity to reject or terminate an application. Once a request has been received, the associated application is executed in its entirety.
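For illustration, the following Python sketch models a dwell request as described above (trigger time, deadline, priority, input/output memories, and a bounded but data-dependent cost). The field names and values are our own hypothetical choices, not the authors' actual data structures.

from dataclasses import dataclass

@dataclass
class DwellRequest:
    # Identification and operational context
    app_type: str        # e.g. "surveillance" or "tracking" dwell (hypothetical labels)
    request_id: int
    priority: int        # dynamically attributed (e.g. threat level of a target)
    # Timing
    trigger: float       # time at which front-end data become available (seconds)
    deadline: float      # absolute deadline for the back-end result
    # Data placement
    input_memory: str    # memory holding the front-end output
    output_memory: str   # memory that must receive the back-end result
    # Only min/max bounds of the cost are known before execution
    min_instructions: int
    max_instructions: int

req = DwellRequest("tracking", 42, priority=3, trigger=0.120, deadline=0.180,
                   input_memory="rack1_in", output_memory="rack1_out",
                   min_instructions=2_000_000, max_instructions=9_000_000)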

3 Related Work

As we are addressing high-performance computing problems in an embedded environment, the following related work is divided into two parts: studies targeting embedded systems with SWaP and timing constraints, and studies addressing the placement of similar application models on large-scale systems.

3.1 Embedded Systems

We first take a look at works that consider similar computing platforms with timing constraints. In our study, as in numerous embedded systems design problems, the real-time execution of applications is a major concern. A great number of published works aim at providing timing guarantees for hard real-time applications. Works aiming at aeronautic or automotive certification reject dynamic methods and focus on single-processor architectures, as in [15] or [19], to maximize the system's predictability. In recent years, there has been a growing interest in both multi-processor architectures and mixed-criticality workloads. This led to numerous publications [3] and several European research projects [1,2]. However, these studies, such as [11] or [22], only address design-time schedulability analysis.


In [21], run-time adaptation consists in mapping long-lasting jobs onto a multi-processor system-on-chip architecture by making use of configurations elaborated at design time. However, in our case the run-time management system operates in a non-periodic context by taking very frequent mapping decisions for computing jobs with a very limited duration. Similarly, the authors of [9] use design-time exploration results to dynamically adapt the application run-time setting depending on the context. While both of these works show promising results, our workload's temporal behavior makes it impossible to adapt the resources allocated to an application once it has started. The fact that all applications of our use-case are short-lived and must meet their deadline makes it impossible to modify the set of resources they are deployed on after the applications have started their execution.

3.2 Large-Scale Systems

While studies in the embedded community essentially focus on real-time control workloads or signal/video processing applications, our use-case's workload is much closer to the workflows encountered in large-scale systems, such as cloud computing. With the democratization of cloud computing and an increasing need for efficient scientific simulations, more and more works address the scheduling of workflows in HPC systems [16]. These workflows are often materialized as Directed Acyclic Graphs (DAGs) for which computing and communication resources are allocated until their completion. Among the numerous publications, two main categories of resource management systems can be identified:
– Purely dynamic approaches with close to no prior knowledge of the applications, relying on cloud elasticity to compensate for QoS run-time variation by allocating/releasing and/or migrating Virtual Machines (VMs) [7,27].
– Static methods relying on exhaustive or evolutionary algorithms to find an optimal placement before executing the application. Commonly used methods include genetic, ant colony and Min-Min algorithm variations [4,6,14,23,25].
While both of these approaches have merits and can achieve great results in cloud or grid computing, none are directly applicable to our use-case, for several reasons. First, due to the arrival rate of requests, a placement solution must be found in a few milliseconds, which prevents us from using, for example, genetic or particle swarm algorithms. Furthermore, while some methods using online model derivation, such as [8], can show great results, they are only applicable to long-lasting applications. Then, the use-case's applications being non-preemptive and the resources not being virtual makes it impossible to rely on migration or virtual machine re-dimensioning mechanisms. A fair number of studies try to incorporate predictability via the use of priorities or isolation mechanisms, as in [13,17]. Unfortunately, due to the ever-changing nature of cloud environments, most “real-time” cloud resource managers react to deadline misses and do not prevent them. Thus, while no management system is actually able to provide hard real-time guarantees, there exist


a few satisfying solutions able to accurately provide soft real-time guarantees when targeting data streaming. However, while it is manageable to obtain these timing properties inside a cloud environment, it is noticeably more difficult to do so when considering communications with users, as data have to go through heavy software stacks as well as non-deterministic communication protocols [10]. In [5], while the authors use workload and resource reservation mechanisms similar to ours, they only target bag-of-tasks applications, for which they have control over the parallelism level, do not consider SWaP constraints and can potentially reserve an infinite number of VMs. Finally, while the problem has already been addressed for single-processor systems [18], only a few works actually target non-preemptive resource allocation for stochastic workloads on multi-processor platforms [12,24]. Moreover, they are purely mathematical approaches and totally abstract the computing platform by considering a set of identical unrelated machines. To conclude, while a significant number of works seem to address a similar problem, the specificity of the considered use-case applications and of the computing platform makes it impossible to simply adapt existing solutions.

4 Our Approach

In this section we first describe the overall functioning of our resource manager, which consists of the mapper and predictor components. Afterwards, we describe the proposed mapping heuristics. Our final objective being to maximize the QoS provided to the radar operator, a QoS-aware extension of our method is finally introduced.

4.1 Mapping Process

The overall mapping process is shown in Fig. 2, which serves as a guideline for this section. As explained in Sect. 2, the resource manager receives requests from the radar every few milliseconds (1 in Fig. 2). Each of these requests contains the type of application to execute, its identifier, its priority, the input and output memories and the time at which the data will be available in the input memory. The application's type defines its structure (i.e., a directed acyclic graph) and profiling data, such as the computing time and the input and output data volumes, for which only minimal and maximal values are known. Note that priorities are only partially correlated to the application's type because they are dynamically attributed depending on the context (e.g., the threat level of a target). It is necessary to find a valid mapping solution for these requests before the data is available, to reduce latency as well as to avoid congestion in the input memory. Furthermore, only probabilistic execution times of each computation and communication activity are known when evaluating placement possibilities. Thus, unlike numerous approaches present in the literature, it is necessary to take a decision based not on the current state of resources but on an estimation of their future availability. To this end, the Mapper interacts with the load Predictor


Fig. 2. Simplified model view

(2)(3) to find a suitable placement for both the communication and computation activities of the application (i.e., the nodes of the DAG). Once a decision has been made, the application's activities are sent to their respective hosts (4). They are executed on the platform using data from the radar (5), which determine their actual execution timings. Finally, monitoring signals from the platform (6) are exploited to update the load prediction model.

4.2 Prediction Model

Computing time predictions are handled by our Predictor module. Its role consists in keeping an up-to-date, simplified overview of the state of the platform's processors, links and memories, as well as of the queued activities, in order to provide the Mapper with execution time predictions. Keeping in mind that a valid mapping solution must be found in a few milliseconds, the simulation of the actual computing platform in its entirety is far too computationally intensive to consider in detail while looking for a run-time solution. Indeed, since the platform comprises multiple shared resources (memories, communication links), estimating the impact of the deployment of an application on all the applications already running (or queued) represents a significant amount of computation, impossible to terminate in a few milliseconds on an embedded platform. Thus, the Predictor maintains a simplified model of the real architecture containing every resource (processors, memories, links) as well as the activities queued for execution on each of these resources. The execution time predictions are made without taking into account some of the resource access conflicts happening in the actual platform. The architecture model inside the predictor is kept up to date using both the Mapper's placement decisions and the platform's monitoring signals. The interactions between Mapper, Platform and Predictor can be seen as a MAPE (Monitoring, Analysis, Planning and Execution) control loop often found in the management systems of large-scale infrastructures. Some adjustments are necessary to this loop as,


while it is possible to take a mapping decision at the time a request is received, it may be profitable to wait for a more precise prediction since the application's first activity will only be executed tens or hundreds of milliseconds later, when the data is available. The earlier a decision is taken, the less accurate the execution time prediction is, as we would be making predictions on top of predictions, since the execution timings of an application depend on estimations of other applications' end times. On the contrary, taking a decision at the last minute means a reduced time to evaluate possible solutions. Thus, a trade-off must be found between the number of possible mappings we can evaluate and the accuracy of these evaluations, in order to find the best possible solution. It is important to note that it is possible to make some assumptions on the system's evolution thanks to its quite regular request submission pattern; it would be impossible to anticipate its behavior if it were completely random. The system receives aperiodic sequences of mapping requests with an average interval between them in the range of a few milliseconds. Indeed, the rate at which the mapper receives requests is correlated with the radar antennas' emission rate and, in practical use-cases, the antennas operate close to their maximum capabilities at all times. As mentioned before, their execution times can greatly vary depending on the data to process, which cannot be known before execution. In summary, we are faced with a system whose dynamic behavior enables short-term predictions ranging from a few milliseconds to tens of milliseconds.
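A minimal sketch of such a predictor is given below, under our own assumptions (the resource names and the worst-case cost passed in are illustrative). It keeps, for every resource, the estimated completion time of its last queued activity and answers availability queries for the Mapper, i.e., exactly the simplified view described above rather than a full platform simulation.

# Minimal sketch of the load Predictor: one estimated "free at" time per
# resource (processor, link or memory), updated from mapping decisions and
# corrected by monitoring signals from the platform.
class LoadPredictor:
    def __init__(self, resource_names):
        self.free_at = {name: 0.0 for name in resource_names}

    def predict_completion(self, resource, ready_time, worst_case_duration):
        # The activity starts when both its data and the resource are ready.
        start = max(ready_time, self.free_at[resource])
        return start + worst_case_duration

    def commit(self, resource, predicted_end):
        # Called by the Mapper once a placement decision has been made.
        self.free_at[resource] = max(self.free_at[resource], predicted_end)

    def monitor(self, resource, actual_end):
        # Called when the platform reports the real end time of an activity,
        # correcting optimistic or pessimistic estimates.
        self.free_at[resource] = actual_end

predictor = LoadPredictor(["rack1_cpu0", "rack1_cpu1", "rack1_link"])
end = predictor.predict_completion("rack1_cpu0", ready_time=0.120,
                                   worst_case_duration=0.004)
predictor.commit("rack1_cpu0", end)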

4.3 Proposed Mapping Heuristics

The considered application model is as follows. Each application is seen as a directed acyclic graph APP = (A, D, prio, trigger, dl, input, output) having a priority prio, a memory input at which the necessary data will be available at t = trigger, and a memory output in which the results have to be stored before the deadline dl. A represents a set of computing activities linked by execution-order constraints, or dependencies D. An activity A_n = (vol_comp^max, vol_comp^min, vol_in^max, vol_in^min, vol_out^max, vol_out^min, α) has minimum/maximum values for the number of instructions to execute (vol_comp) and for the input (vol_in) and output (vol_out) data volumes, as well as a memory access ratio α. The memory access ratio α of a computing node determines the minimum memory bandwidth needed by a target processor to perform the computation of this activity at maximum speed. For an application to be schedulable, we assume that ∀APP, T_max < (deadline − trigger), T_max being the sum of the maximum computation times of the activities on the critical path of APP. In other words, applications have slack, meaning most of them can be slightly delayed and still respect their timing constraints. Concerning the architecture, reasoning is made on the architecture model maintained by the predictor and defined as follows. The platform PF = (P, M, L) is seen as a set of processors P, memories M and links L. Each processor P_n possesses a computing capacity expressed in instructions per second in addition to its type, each memory M_n a maximum bandwidth, and each link L_n


a maximum throughput expressed in bytes per second. At run-time, the architecture runs several applications. Processors (resp. communication links) can only serve one computation (resp. communication) at a time. Memories, on the contrary, potentially share their bandwidth between different computations and communications, which may result in slowing down some of them in case of excessive demands.

Static Mapping. This mapping strategy is used as a reference for comparison with the dynamic strategies shown below. In this context, processors are allocated to successive requests in each rack on a systematic round-robin basis, without considering the computing cost or priority of each requested activity.

Prediction-Based Load Balancing. This dynamic strategy consists in allocating at run-time the best communication path and computing elements in the machine, by taking into account an estimated load status of the different resources of the architecture. Its objective is to minimize the end time of the next request to be executed. This is done by investigating all sets of processors SP_n ⊂ P able to host the different computation activities to execute. For each of them, since an application request carries information on the input and output memories, it is possible to identify and generate an adequate set of communication activities to transfer data from the input memory to the memories of the computing activities' hosts and, finally, to the output memory. From this set of activity sets, the mapper then estimates for each of them the worst-case completion time of the application request according to the architecture resources used and their estimated current workload. The predictor is used there, providing for each resource an estimation of the completion time of its last queued activity, or the indication that it has no activity still running. This is then used to evaluate the completion time of the new candidate activity request, and to elect the hosting processor and associated communication path that achieve the earliest predicted completion time. To estimate the worst-case execution time of a computing activity on a processor, this processor's frequency and the maximum possible number of instructions for this type of activity are used. Concerning communication time predictions, it is necessary to take into account the availability of the links on the path in addition to the input and output memory bandwidths. Then, an estimated worst-case communication time is computed using the maximum data volume for this activity and the lowest available bandwidth on the path. Note that in our test bench, the possibility is left to off-load a request from a busy rack to a less busy neighbour rack, when the additional cost of moving data to and from the receiving rack compensates advantageously for long potential delays in the original rack.
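The following Python sketch illustrates the core of this prediction-based load balancing under simplifying assumptions of ours: a single computing activity per request, candidate processors given with their frequency and communication path, and a predictor exposing a predict_completion call like the LoadPredictor sketched at the end of Sect. 4.2. It is not the authors' implementation, only a compact restatement of the selection rule (earliest predicted worst-case completion wins).

def select_host(request, candidates, predictor):
    """Pick the (processor, path) pair with the earliest predicted completion.

    candidates: list of (processor_name, freq_hz, path_links, path_bandwidth)
    request:    object assumed to expose trigger, max_instructions and
                max_data_volume (hypothetical field names).
    """
    best = None
    for proc, freq, links, bandwidth in candidates:
        # Worst-case computation time: max instruction count / processor speed.
        comp_time = request.max_instructions / freq
        # Worst-case communication time: max data volume / lowest bandwidth on path.
        comm_time = request.max_data_volume / bandwidth
        # Ask the predictor when the last link of the path, then the processor,
        # would be free to serve this request (simplified to one query each).
        data_ready = predictor.predict_completion(links[-1], request.trigger, comm_time)
        end = predictor.predict_completion(proc, data_ready, comp_time)
        if best is None or end < best[0]:
            best = (end, proc, links)
    return best  # (predicted_end, processor, communication path)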


QoS-Aware Load Balancing. This strategy is an extension of the previous prediction-based load balancing, whose goal was the reduction of application latencies. While the latter allocates the earliest-finishing resources to individual requests, this strategy now aims at optimizing a QoS indicator based on the global cost of exceeding deadlines. As mentioned before, the activity requests are of different types with different priorities, and thus a variable degree of flexibility can be left to miss deadlines. This is an alternative to the hard deadline model, and it better fits our radar context, which runs activities with variable computing costs and variable requirements on reactivity; as an example, activities related to tracking targets are expected to complete soon enough to avoid losing the target, while others operate on a human time scale. Each type of activity has a penalty factor proportional to the extent of the deadline miss, and null if the deadline is met. The goal is now to keep the accumulated penalty as low as possible while executing the same workload on the same computing platform. The QoS-based strategy thus considers a sequence of n activity requests of potentially different types (and priorities), sorted in trigger time order. We call R0 the request in the queue with the earliest trigger time and Rn the nth request in the queue. While the execution order (i.e., trigger time) is not impacted, the objective is now to determine whether the best mapping found should be used for the next request R0 or for a later, more critical one. The strategy:
– selects the best host processor P0 for R0 and estimates its potential deadline miss and resulting penalty QR0/P0;
– ∀ 0 < i ≤ n, likewise selects the best host processor Pi for Ri and estimates the corresponding penalties; if QR0/P0 + QRi/Pi > QRi/P0 + QR0/Pi and if Ri has a larger priority than R0, processor Pi is allocated to R0, leaving P0 free;
– else, processor P0 is allocated to R0.
This resource allocation strategy can be seen as an extension of the first load-balancing method, as its behaviour will deviate from the non-QoS-aware one only when predicting significant penalty overheads.
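The decision rule above can be restated as the short Python sketch below. It is our own hedged reconstruction (the penalty notation Q_{R/P} is mapped to a predicted_penalty helper that we assume returns the weighted deadline-miss cost of running request R on processor P); the swap condition follows the comparison of combined penalties given in the list.

def choose_processor_for_R0(R0, later_requests, best_proc, predicted_penalty):
    """Decide whether R0 keeps its best processor or yields it to a more
    critical later request Ri, as in the QoS-aware load balancing.

    best_proc(R):            assumed to return the earliest-finishing processor for R
    predicted_penalty(R, P): assumed to return prio_R * max(0, end(R on P) - deadline_R)
    """
    P0 = best_proc(R0)
    for Ri in later_requests:
        Pi = best_proc(Ri)
        keep = predicted_penalty(R0, P0) + predicted_penalty(Ri, Pi)
        swap = predicted_penalty(R0, Pi) + predicted_penalty(Ri, P0)
        # Swap only if it lowers the combined penalty and Ri is more critical.
        if keep > swap and Ri.priority > R0.priority:
            return Pi   # R0 runs on Pi; P0 is left free for Ri
    return P0           # default: R0 keeps the earliest-finishing processor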

5 Evaluation

To evaluate the proposed mapping heuristics, a simulation framework has been created, mostly because the objective is to evaluate the resource management policies themselves, which would be too time-consuming if one had to develop and integrate real-time software for each candidate computing platform. A modular approach has been adopted, allowing us to calibrate the simulation with actual data while having some freedom in the testing done. This section first describes the created simulation environment before discussing some interesting results.

5.1 Simulation Framework

The simulation environment, realized for this study using the Ptolemy II framework [20], is divided into three main components:
– a request generation module
– the resource management system
– a computing platform simulator


Request Generation. A request generation module has been developed to reproduce representative radar scenarios. It emulates the radar management system's request submission pattern, and the generated requests possess properties close to actual ones. As such, it generates dwell requests as if feeding 4 antennas. The types of dwells and the request release dates are randomized to a certain extent, but in a reproducible way. The generator also includes parameters allowing us to represent the complexity of the environment observed by each antenna by manipulating the probability distribution of the actual instruction and communication data volumes of the dwells treated by this antenna. These volumes are sent directly to the simulator (see Vol in Fig. 2) and are not known by the resource manager, which only has knowledge of the request type, min/max volume values, trigger time, deadline and priority.

Resource Manager. The resource manager represents the core of our work and encompasses both the mapper and predictor modules. It contains the implementation of the proposed mapping algorithms and, for comparison purposes, the implementation of a standard static round-robin placement algorithm in addition to the proposed dynamic methods. Moreover, it keeps track of every missed deadline and assigns a penalty to each application execution, calculated as follows: penalty_n = prio_n × (end_time_n − deadline_n) if end_time_n > deadline_n, and 0 otherwise. The applications' priorities being correlated to the operational interest of a dwell, the sum of the penalties suffered during a run represents a relevant estimation of a mapping strategy's impact on the QoS provided to the user while running a scenario.

Platform Simulator. To both implement the full MAPE loop and evaluate the proposed methods, an execution environment was required. In order to have more control over it and flexibility on testing parameters, a computing platform simulator has been designed. However, to retain meaningful evaluation results and feedback to the predictor, this simulator has been modeled on an actual platform. Moreover, to accurately simulate the activities' execution, their memory access volumes as well as interference (i.e., resource access conflicts) with other running activities are represented. Both the platform simulator and the interference model have been validated by domain experts. To obtain the following results, the considered platform contains 4 racks linked via an Ethernet switch, each one hosting 6 dual-processor boards with one shared memory per board. Inside a rack, all boards also communicate via an Ethernet switch. Note that each rack receives input data from the front-end of one antenna, which is stored in an input memory, and must send results to a fixed output memory.
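As a small illustration of the penalty accounting above, the following Python sketch computes the per-execution penalty and accumulates it over a run. The numerical values are hypothetical and it is not the authors' simulator code; it only restates the formula given in the previous paragraph.

def penalty(prio, end_time, deadline):
    # penalty_n = prio_n * (end_time_n - deadline_n) if the deadline is missed,
    # 0 otherwise.
    return prio * (end_time - deadline) if end_time > deadline else 0.0

# Accumulated penalty over a (hypothetical) run: one tuple per executed dwell.
executions = [
    (3, 0.181, 0.180),   # (priority, end_time, deadline) -> small miss
    (1, 0.150, 0.180),   # deadline met, no penalty
    (5, 0.200, 0.180),   # high-priority miss, weighted more heavily
]
total_penalty = sum(penalty(p, e, d) for p, e, d in executions)
missed = sum(1 for p, e, d in executions if e > d)
print(total_penalty, missed)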

5.2 Results Analysis

We now compare the performance of different mapping methods in several operational scenarios as well as the influence of anticipation delay (i.e. the time


between the decision-taking process and the application deployment) on the quality of the results produced. We first identify the four operational scenarios used in the following experiments:
– SCE 1, where all antennas share an equal, average load.
– SCE 2, representing a classic shore surveillance scenario with one antenna (pointing towards the coastline) illuminating a high-complexity environment with a very high number of detections, its 2 neighbors harboring an average load and the last antenna treating only a few detections.
– SCE 3, where 2 antennas face numerous detections and the other 2 an average amount.
– SCE 4, where all antennas have to treat an important quantity of detections.
While SCE 1 and SCE 4 are mostly used as reference points, the second scenario, SCE 2, presents a real operational interest as it represents a traditionally problematic case. Each result presented below comes from the average of 10 simulation runs with different random seed values. During each run, 2000 requests are generated, each request containing 1 computing activity and 1 to 2 communication activities, which translates into an average of 5000 activities mapped per run.

Fig. 3. Latency using static mapping

Fig. 4. Latency using dynamic load balancing

Figure 3 shows the average observed latencies of deployed applications while using a standard round-robin static mapping method. As is clearly visible in SCE 2 and SCE 3, using a pre-determined mapping prevents online resource sharing and can cause a great load imbalance between computing racks. On the other hand, when using the dynamic load-balancing approach described in 4.3.2 (see Fig. 4), we can see that resource sharing has been carried out efficiently, narrowing the load gap between computing racks in both SCE 2 and SCE 3. Moreover, we can observe a global latency reduction of anywhere between 18% and 32% across all scenarios. This is due to the fact that our mapping method avoids the random latency spikes that can happen while using a static mapping method. These spikes can be caused by the execution of a heavy activity on a pre-determined processor that is possibly already busy executing another expensive activity, effectively leading to an increased queuing time. Additionally,


the execution time gains provided by the proposed predictive mapping system are only partly due to the resource sharing between racks: when the inter-rack migration of activities was prevented, a 10 to 20% execution time reduction was still observed.

Table 1. QoS results.

                        Static Mapping   Load-Balancing
SCE 1   penalty         3.21             0
        nb miss         23               0
SCE 2   penalty         69.67            0.17
        nb miss         136.6            3.6
SCE 3   penalty         172.36           2.37
        nb miss         284              28.2
SCE 4   penalty         342.06           31.8
        nb miss         507.4            175

While looking at the QoS provided by both the static and predictive methods in Table 1, one can notice an even greater difference in terms of perceived QoS. In the Static Mapping column we can observe that, even in an average-load scenario (SCE 1), a few latency spikes cause some deadline misses, generating some QoS penalties, while the Load-Balancing method consistently maintains 0 deadline misses. In SCE 2 and SCE 3, massive differences can be seen in the provided QoS, with the load-balancing method only missing 3.6 to 28.2 deadlines on average and the static method between 136.6 and 284. This represents a more than tenfold average QoS improvement under the most realistic operational scenarios. Finally, in SCE 4, where all racks are overloaded, around one fourth of the deadlines are missed with the static mapping and one eleventh with the dynamic method. Note that these results also show that the use of efficient load-balancing helps mitigate the effects of an overload: while the load prediction method only misses three times fewer deadlines, it receives less than 10% of the QoS penalty suffered by the static mapping. Figure 5 shows the impact of the applications' computing load on the amount of QoS penalty received for the different mapping methods on one computing rack. The computation load is normalized to the computation volume at which the first deadline misses (due to latency spikes) are observed while using the static method. It is possible to observe a steep increase in penalty for all mapping methods at around three times this load. At this point we reach the maximum capacity of the computing platform and enter a global overload state. As every application must be executed, numerous computing activities will accumulate inside the processors' execution queues and be executed well after their deadlines, generating a high penalty.


Fig. 5. Impact of computation load on the quality of result

However, when using a dynamic load-balancing technique, the computing volume at which an overload state is entered differs vastly from the one at which the static method overloads. Even when not taking into account the random latency spikes happening for volumes from 1 to 1.5, a constant increase in observed penalty can be noticed for the static mapping from 1.5 onward. On the other hand, when using the load-balancing or QoS-aware mapping methods, the first deadline misses and received penalties only appear at the 2.75 mark. This is a straight 83% increase in the computation volume that the platform can absorb before encountering any QoS issue. The QoS-aware mapping method introduced in 4.3.3 shows results close to the first load-balancing mapping, with an average penalty reduction close to 10%. Moreover, both the QoS-aware and the load-balancing methods tend to mitigate the QoS impact of an overload, as they consistently suffer a lower QoS penalty than the static round-robin technique. As discussed before, the moment at which an application placement decision is taken can greatly influence the quality of the load prediction used to take this decision and, thus, impact the quality of the produced mapping. This is especially true when operating close to the platform's maximal load. For example, when using a 2.5 volume in SCE 3, the average number of missed deadlines can vary from 0 to 38 depending on the anticipation with which the decision is taken. An ideal anticipation of 0 means that the mapping decision is taken instantly at trigger time and yields perfect results when not in an overload state. On the contrary, placing an application as soon as its execution request is received (between 100 and 300 ms before its execution) often yields noticeably inferior results due to more approximate predictions of the platform's future state. The results presented above were obtained using a 10 ms anticipation time, which provides near-optimal results while being a realistic mapping decision delay.

6 Discussion

The results presented in the previous section are encouraging and prove the potential gains of a QoS-aware dynamic resource manager for the specialized and demanding systems that are multi-function radars. However, while this work's context seems very specific, the presented predictive load-balancing mechanism could be adapted to other domains with similar needs, such as high-performance automotive computing platforms. For example, for high-performance sensors, this technique could be used to dynamically select the algorithm variant that best fits the currently observed situation, while maintaining real-time objectives. Moreover, while they operate at completely different scales, cloud environments processing business workflows share some of our use-case constraints [28] and could benefit from an online predictive mapping method. In addition, this approach opens important opportunities for improvement. First of all, the realized resource management system currently stands in a passive position due to its lack of control over incoming requests and its obligation to execute each of them. Thus, it can only increase the computing load supported by the computing platform before reaching an overload state, with no means to prevent it. Two solutions can be envisaged to address this issue. The first one is the implementation of an admission control unit discarding low-priority applications to make room for high-priority ones when anticipating an overload. The other envisaged solution is a second control loop between the resource manager and the radar manager. This loop would receive regular updates of the platform state to help determine which types of waveform it could process efficiently, as well as to exploit free computing opportunities by launching context-aware useful functions and, thus, enhance the QoS provided to the operator. While the results of the QoS-aware mapping method are encouraging, one anticipates that its efficiency will largely depend on the relevant weighting of priorities and deadline penalty factors. These parameters being characterized by the radar management system, future work will include the investigation of the influence of these factors when using real-life workloads. Moreover, the computing time prediction algorithms used in the predictor are limited by the worst-case approach used. We are currently exploring a mixed-criticality stochastic approach aiming at providing an even better trade-off between the QoS provided and the predictability of the system. To explore these ideas, the simulator presented in this paper is being extended by adding support for heterogeneous processing elements as well as the generation of more complex and diverse workloads. This will allow us to finely tune the prediction model to improve the QoS-aware mapping method presented above, as well as to profile the mapping algorithms on the target platform. These profiling data are crucial as they will determine the computing volume limit of the envisaged mapping heuristics. This will directly influence the size and complexity of the application graphs that can be considered for further tests. In addition to that, several other QoS metrics, both radar-specific (false alarm rate) and generic (processor slack time minimization), are being investigated to further improve the quality of the mapping results.


Finally, as a centralized resource manager is currently in use, the scaling of the computing platform might be an issue. As a fully distributed resource management system is hardly conceivable in the studied use-case, a hierarchical manager composed of both a central unit and locally distributed managers will be considered. Such systems have already been successfully implemented in dynamic large-scale environments such as clouds [26].

7 Conclusion

In this paper we presented a dynamic mapping method for real-time application execution on a heavily-constrained embedded architecture. This method was tested on an AESA radar use-case using a custom simulator. We showed that this approach allows us to obtain lower execution latencies than current mapping solutions while maintaining high predictability and allowing gradual performance degradation in overload scenarios. Acknowledgments. This work was made possible thanks to the support of the Surface Radar Business Line of Thales.

References
1. http://www.uni-siegen.de/dreams/home/
2. http://www.certainty-project.eu/
3. Baruah, S., Li, H., Stougie, L.: Towards the design of certifiable mixed-criticality systems. In: 2010 16th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 13–22. IEEE (2010)
4. Braun, T.D., et al.: A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 61(6), 810–837 (2001)
5. Cai, Z., Li, X., Ruiz, R., Li, Q.: A delay-based dynamic scheduling algorithm for bag-of-task workflows with stochastic task execution times in clouds. Futur. Gener. Comput. Syst. 71, 57–72 (2017)
6. Chen, H., Wang, F., Helian, N., Akanmu, G.: User-priority guided min-min scheduling algorithm for load balancing in cloud computing. In: 2013 National Conference on Parallel Computing Technologies (PARCOMPTECH), pp. 1–8. IEEE (2013)
7. Costache, S., Parlavantzas, N., Morin, C., Kortas, S.: Merkat: a market-based SLO-driven cloud platform. In: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), vol. 1, pp. 403–410, December 2013. https://doi.org/10.1109/CloudCom.2013.59
8. De Sensi, D., Torquati, M., Danelutto, M.: A reconfiguration algorithm for power-aware parallel applications. ACM Trans. Archit. Code Optim. 13(4), 43:1–43:25 (2016). https://doi.org/10.1145/3004054
9. Gadioli, D., Palermo, G., Silvano, C.: Application autotuning to support runtime adaptivity in multicore architectures. In: 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 173–180. IEEE (2015)


10. García-Valls, M., Cucinotta, T., Lu, C.: Challenges in real-time virtualization and predictable cloud computing. J. Syst. Arch. 60(9), 726–740 (2014). https://doi.org/10.1016/j.sysarc.2014.07.004
11. Giannopoulou, G., Stoimenov, N., Huang, P., Thiele, L.: Scheduling of mixed-criticality applications on resource-sharing multicore systems. In: 2013 Proceedings of the International Conference on Embedded Software (EMSOFT), pp. 1–15, September 2013. https://doi.org/10.1109/EMSOFT.2013.6658595
12. Gupta, A., Kumar, A., Nagarajan, V., Shen, X.: Stochastic load balancing on unrelated machines. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1274–1285. SIAM (2018)
13. Khemka, B., et al.: Utility maximizing dynamic resource management in an oversubscribed energy-constrained heterogeneous computing system. Sustain. Comput. Inform. Syst. 5, 14–30 (2015). https://doi.org/10.1016/j.suscom.2014.08.001
14. Kousalya, G., Balakrishnan, P., Pethuru Raj, C.: Workflow scheduling algorithms and approaches. In: Automated Workflow Scheduling in Self-Adaptive Clouds. CCN, pp. 65–83. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56982-6_4
15. Li, H., Baruah, S.: An algorithm for scheduling certifiable mixed-criticality sporadic task systems. In: 2010 IEEE 31st Real-Time Systems Symposium (RTSS), pp. 183–192, November 2010. https://doi.org/10.1109/RTSS.2010.18
16. Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput. 13(4), 457–493 (2015)
17. Lucier, B., Menache, I., Naor, J.S., Yaniv, J.: Efficient online scheduling for deadline-sensitive jobs. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 305–314. ACM (2013)
18. Megow, N., Uetz, M., Vredeveld, T.: Models and algorithms for stochastic online scheduling. Math. Oper. Res. 31(3), 513–525 (2006)
19. Nasri, M., Brandenburg, B.B.: Offline equivalence: a non-preemptive scheduling technique for resource-constrained embedded real-time systems (outstanding paper). In: 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 75–86. IEEE (2017)
20. Ptolemaeus, C. (ed.): System Design, Modeling, and Simulation using Ptolemy II. Ptolemy.org (2014). http://ptolemy.org/books/Systems
21. Quan, W., Pimentel, A.D.: A hierarchical run-time adaptive resource allocation framework for large-scale MPSoC systems. Des. Autom. Embed. Syst. 20(4), 311–339 (2016)
22. Ren, J., Phan, L.T.X.: Mixed-criticality scheduling on multiprocessors using task grouping. In: 2015 27th Euromicro Conference on Real-Time Systems (ECRTS), pp. 25–34. IEEE (2015)
23. Rodriguez, M.A., Buyya, R.: Deadline based resource provisioning and scheduling algorithm for scientific workflows on clouds. IEEE Trans. Cloud Comput. 2(2), 222–235 (2014)
24. Skutella, M., Sviridenko, M., Uetz, M.: Unrelated machine scheduling with stochastic processing times. Math. Oper. Res. 41(3), 851–864 (2016)
25. Tang, X., Li, X., Fu, Z.: Budget-constraint stochastic task scheduling on heterogeneous cloud systems. Concurr. Comput. Pract. Exp. 29(19), e4210 (2017)
26. Wang, Z., Su, X.: Dynamically hierarchical resource-allocation algorithm in cloud computing environment. J. Supercomput. 71(7), 2748–2766 (2015). https://doi.org/10.1007/s11227-015-1416-x


27. Warneke, D., Kao, O.: Exploiting dynamic resource allocation for efficient parallel data processing in the cloud. IEEE Trans. Parallel Distrib. Syst. 22(6), 985–997 (2011). https://doi.org/10.1109/TPDS.2011.65
28. Xu, R., Wang, Y., Huang, W., Yuan, D., Xie, Y., Yang, Y.: Near-optimal dynamic priority scheduling strategy for instance-intensive business workflows in cloud computing. Concurr. Comput. Pract. Exp. 29(18), e4167 (2017)

Predicting SDC Vulnerability of Instructions Based on Random Forests Algorithm

LiPing Liu(&), LinLin Ci, and Wei Liu

Computer Department, Beijing Institute of Technology, Beijing, China
[email protected], [email protected], [email protected]

Abstract. Silent Data Corruptions (SDCs) are a serious reliability issue in many domains of computer systems. Selectively protecting the program instructions that have a higher SDC vulnerability is currently one of the research hot spots in the computer reliability field. A number of algorithms have already been presented to tackle this problem. However, many of them require tens of thousands of fault injection experiments, which are highly time and resource intensive. This paper proposes SDCPredictor, a novel solution that identifies SDC-vulnerable instructions based on the random forests algorithm. SDCPredictor is based on static and dynamic features of the program alone, and does not require fault injections to be performed. SDCPredictor selectively protects the most SDC-vulnerable instructions in the program subject to a given performance overhead bound. Our experimental results show that SDCPredictor can obtain higher SDC detection efficiency than previous similar techniques.

Keywords: Fault tolerance · Error detection · Reliability · SDC vulnerability · Random forests

1 Introduction

SEU-induced soft errors are known as one of the major threats to the functionality and reliability of space-borne computers and their host spacecraft. Soft errors may be explicit bit flips in latches or memories, or glitches in combinational logic that can propagate and be captured in latches [1]. An SEU can result in silent data corruption (SDC), which means a wrong outcome of a program without any crash being detected. When an SDC occurs, the program fails without any indication of the failure. This can lead to the error propagating in the system and causing catastrophic effects. Moreover, with the increase in the number of transistors on a chip and the reduction of chip sizes, the transient fault rate seen by software will grow with Moore's Law [2]. Therefore, it is necessary to protect these devices against SDC errors. Conventional hardware-only solutions such as guard banding and hardware redundancy are challenging to apply due to power constraints. As a result, researchers have explored software-based techniques to tolerate hardware faults [3]. Software-based techniques do not require any modification of the microprocessor hardware. In fact, some of these approaches have already been used in mission-critical systems for satellites and space missions [4].


Although software-based approaches such as full duplication are more cost-effective than hardware-based ones, they impose a non-negligible overhead on the programs in terms of execution time and code size. In many cases, this is the main obstacle to the feasibility of software-based techniques. In order to reduce these overheads and to offer more flexibility to designers, recent works have proposed selective hardening based on software [5-7]. Studies have shown that SDCs are caused by errors in a relatively small proportion of programs' data variables [8, 9], and by selectively protecting these SDC-prone variables, one can achieve high coverage against SDCs. However, most prior work has identified SDC-prone variables using fault injection experiments, which are expensive for large applications.

Various efforts have been made to refine the injection framework. CriticalFault [10] applied vulnerability analysis to avoid the injections that result in masking. Relyzer [11] ran fault injections for selected dynamic instruction sequences called "pilots". Since SFI was applied by Relyzer and CriticalFault, the weaknesses of SFI cannot be avoided. SymPLFIED [14] identified SDC-causing instructions by symbolic execution, which covers all SDCs in real executions; however, it was even more time-consuming than fault injection. Shoestring [13] assumed that instructions which impact global memory or produce arguments passed to function calls can incur SDCs. Although the time cost was reduced, it brought a large number of false positives.

The work [16] proposes a configurable protection technique for SDC-causing errors that allows users to trade off performance for reliability. Two models, namely SDCTune and SDCAuto, are built to predict the SDC proneness of a program's data. SDCAuto is built automatically using a machine learning approach known as the Classification and Regression Tree (CART) algorithm. Compared with fault-injection-based methods, SDCAuto can obtain a relatively accurate prediction of the SDC rate of an application and save a lot of time. However, one disadvantage of the CART algorithm is that the tree may grow to be biased if some classes of data dominate. Besides, SDCAuto does not consider the data deviation of the program output, which is important for some soft computing applications. For example, multimedia applications can tolerate blurry decoded images, and machine learning applications can tolerate noise. Such applications can tolerate most hardware errors as long as the erroneous outputs do not deviate significantly from error-free outcomes. Even instructions in the same application with the same SDC rate may cause different data deviations. Obviously, the instruction causing the more serious data deviation should be given priority for protection.

We propose a new configurable protection approach, SDCPredictor, to predict the SDC vulnerability of program instructions. Our goal is to find the subset of instructions which are the most vulnerable to SDC errors for a given performance cost budget. SDCPredictor predicts the SDC vulnerability of program instructions based on the random forests algorithm. Random forests have an effective method for estimating missing data and mechanisms to deal with unbalanced data sets. SDCPredictor predicts the SDC vulnerability of program instructions by analyzing the features of an instruction, without requiring any fault injections to be performed, thus achieving significant time savings. Because the parameters of the prediction model are optimized, SDCPredictor achieves better prediction speed and accuracy than previous methods. The contributions of this work are as follows:


- We develop an intelligent prediction model, SDCPredictor, based on the random forests algorithm, which can predict the SDC vulnerability of program instructions precisely. To the best of our knowledge, we are the first to predict the SDC vulnerability of program instructions using random forests. SDCPredictor considers not only the probability that a fault in an instruction will lead to an SDC, but also the severity of the SDC.
- When building the individual trees of the random forest, we evaluate the quality of each tree to determine whether it should be retained or discarded. This screening process improves the predictive power of SDCPredictor.
- We evaluate the SDC vulnerability prediction accuracy and fault coverage of SDCPredictor on a set of benchmark programs. The experimental results demonstrate that SDCPredictor can obtain higher SDC detection efficiency than previous similar techniques.

The remainder of the paper is organized as follows. In Sect. 2 the related work on identifying SDC-causing instructions is reviewed. In Sect. 3 we describe the proposed approach in detail. Section 4 reports the experimental results we gathered, and finally Sect. 5 draws some conclusions.

2 Related Work

A variety of efforts have been made to identify and protect SDC-causing instructions. To ensure system robustness, prior work has used statistical fault injection (SFI) to model the soft error rate (SER) of targeted systems. CriticalFault [10] proposes a biased injection framework that employs vulnerability analysis to map out relevant faults for stress testing. However, the remaining faults are still too many to be simulated for accurate SDC rate analysis. Relyzer [11] systematically analyzes all fault injection sites in an application for transient faults, and employs fault pruning techniques that prune faults that need detailed study by either predicting their outcomes or showing them equivalent to other faults. SmartInjector [12] first lists all possible faults in an application, and then exploits fault pruning techniques to remove most faults from injection by performing program analysis. SmartInjector also exploits a fault outcome prediction technique to determine the outcome of a simulation before it runs to completion. Although SFI has proven to be effective for identifying SDC-causing instructions, it is extremely time-consuming and unacceptable for large applications.

Another SDC identification method is statistical vulnerability analysis. Shoestring [13] uses a static analysis approach to identify the instructions which are likely to result in SDCs, and employs instruction duplication to protect these instructions. Shoestring only considers the instructions in the backward slices of the instructions which update global variables or the arguments of library calls as the SDC-inducing instructions; the remaining instructions in a program are left unprotected. Although Shoestring only incurs an average performance overhead of 15.8%, the simple heuristic it uses covers only 33.9% of SDCs. SymPLFIED [14] identifies SDC-causing instructions by symbolic execution, which covers all SDCs in real executions. However, it is even more time-consuming than fault injection. The work [15] proposes a software-based method to identify and harden the most vulnerable blocks of a program. Using a genetic algorithm (GA), the proposed method takes the dynamic behavior of the programs into consideration to identify the most vulnerable blocks of a program. However, not all the instructions of vulnerable blocks are SDC-causing instructions and should be protected, as they incur high performance overhead when protected.

In recent years, machine learning based methods have been introduced to identify the SDC-causing instructions. The work [16] proposes a machine-learning-based model, namely SDCAuto, to predict the SDC proneness of a program's data. SDCAuto builds the model automatically through a machine learning algorithm, thus requiring little to no effort on the part of the developer. It performs configurable protection against SDC-causing errors in general purpose applications without using fault injections.

3 Proposed Method

In this section we describe the proposed approach in detail. We first define some terms used in this paper, some of which are drawn from work [16].

Dynamic Dependency Graph: A Dynamic Dependency Graph (DDG) is a directed acyclic graph (V, E) that captures the dynamic dependencies among the values produced in the course of program execution, where V is the set of vertexes and E is the set of edges. In a DDG, a vertex v can be a register, a memory address or even a constant value. An edge e records the instruction (i.e., an operation) and links source operand(s) to destination operand(s).

Data propagation distance: This is the maximum dynamic distance between the def and use of a value. This is denoted as Dis(v).

Fanout: The fanout of a node is the set of all immediate successors of the node in the DDG. In terms of values, it is the set of uses of the value represented by the node. The fanout of a node indicates how many nodes are directly impacted by an error in that node.

Cover: The cover of a node is the number of nodes from which an error can propagate to a given node before causing a crash.

Basic Block: A Basic Block (BB) is a maximal set of ordered non-branching instructions (except in the last instruction) or branch destinations (except in the first instruction) in which the execution always enters at the first instruction and leaves via the last instruction.

SDC coverage: The SDC coverage is defined as the fraction of SDC-causing errors detected by the error detection technique.

SDC proneness per instruction: This is the probability that a fault in instruction I leads to an SDC. This is denoted as P(SDC).

SDC vulnerability per instruction: This is the SDC vulnerability of instruction I. SDC vulnerability concerns not only the SDC proneness, but also the data deviation of the program output. This is denoted as V(SDC).

Dynamic count ratio: This is the ratio of the number of dynamic instances of the instruction executed to the total number of dynamic instructions in the program. This is denoted as D(I).

SDC impact: The SDC impact is defined as the ratio of the number of incorrect program outputs caused by an SDC to the total number of fault-free program outputs during the execution.

In this section, we first extract program features of instructions that correlate with high SDC vulnerability. We then implement fault injection experiments on a small set of benchmark programs to generate a training data set. Finally, we use the training data set to build our models and generate protected code. Figure 1 shows the block diagram of the proposed method. Details of each component are given below.

Fig. 1. Block diagram of the proposed method
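To make the DDG-related terms defined above concrete, the following is a minimal sketch of how fanout and data propagation distance could be derived from a toy dependency graph. The node naming, timestamps and graph contents are hypothetical illustrations, not taken from the paper.

    # Hypothetical DDG: each dynamic value maps to its immediate successors (uses).
    ddg_successors = {
        "r1@t0": ["r2@t3", "r4@t7", "mem0x10@t9"],
        "r2@t3": ["r5@t4"],
        "r4@t7": [],
    }
    # Dynamic position at which each value occurs in the execution trace.
    position = {"r1@t0": 0, "r2@t3": 3, "r5@t4": 4, "r4@t7": 7, "mem0x10@t9": 9}

    def fanout(node):
        # Fanout: number of immediate successors (uses) of the node in the DDG.
        return len(ddg_successors.get(node, []))

    def propagation_distance(node):
        # Data propagation distance: maximum dynamic distance between the def
        # of a value and its uses (0 if the value is never used).
        uses = ddg_successors.get(node, [])
        return max((position[u] - position[node] for u in uses), default=0)

    print(fanout("r1@t0"), propagation_distance("r1@t0"))   # prints: 3 9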

3.1 Feature Extraction

Recent studies show that hardware faults are often derated or masked. A hardware error is said to be derated if it is inherently masked in the system. Different instructions of a program have different error-derating rates [15]; in other words, different instructions of a program have different SDC vulnerability. Fault propagation can be stopped by an instruction either masking the fault, or by crashing the program. Both masking and crashing decrease the probability of an SDC resulting from the instruction that propagates its data to the crashing/masking instruction, as a result of which its SDC proneness is lower. Our fault injection experiments show that the masking of an error at an instruction I can occur due to the following factors: (1) the error at instruction I may be masked by I itself; (2) the error at instruction I may be masked by a successor instruction on path p. Faults that occur in the higher bit positions of operands of memory address calculation instructions are more likely to cause the program to crash. Shift instructions, comparison instructions and logical operation instructions can decrease the SDC proneness of the source operands to a certain extent. We call these instructions SDC-masked instructions.

Our fault injection experiments also show that the SDC proneness of an instruction is not equal to its SDC vulnerability. Figure 2 shows example code based on the Blackscholes benchmark from the PARSEC benchmarks. In Fig. 2, the integer variable named numOptions determines the loop count of the two for loops. A fault that corrupts the value of numOptions in line 1 causes SDC in six variables, while a fault that corrupts the value of numOptions in line 11 causes SDC in only one variable. It is obvious that the assignment instruction in line 1 has a higher SDC vulnerability than the assignment instruction in line 11. However, the SDC proneness of the assignment instruction in line 1 is equal to that of the assignment instruction in line 14. Therefore, the severity of the SDC should be taken into consideration when determining the SDC vulnerability of an instruction.


Fig. 2. Example code of Blackscholes benchmark

Features are extracted according to the above analysis and also based on prior work [12, 13, 16-18]. In total, 62 features are extracted. We categorize these features of instructions into the nine categories shown in Table 1.

(1) Data dependency related features. An error that occurs in one variable can propagate to multiple variables through the data dependencies among the instructions. Variables with a long data propagation distance or a large fanout usually have a higher SDC vulnerability.
(2) Type of end points of data dependency chains related features. The SDC proneness of a variable depends on (a) the fault propagation in its data dependency chain, and (b) the SDC proneness of the end point of that chain.
(3) Memory address calculation related features. Memory address calculation instructions are usually used for pointer dereferences and are likely to cause SDCs and segmentation faults which crash the application.
(4) Sub-word operations related features. Sub-word operations only utilize a fraction of the bits in the incoming values. Thus, faults occurring in the unused bits will be masked, and the SDC proneness of the source operands decreases to a certain extent.
(5) Logical operations related features. Logical operations derate errors that occur in AND operations when the corresponding bit in the other operand is 0, as well as in OR operations when the corresponding bit in the other operand is 1.
(6) Successor instruction related features. If SDC-masked instructions exist among the successor instructions of an instruction I, the SDC proneness of I will be derated.
(7) Code structure related features. SDC-causing code tends to be on the hot paths of the application. BBs with a higher in-degree or within a loop usually tend to be on the hot paths.
(8) Data width related features. Data width is the number of bits in values, and is a major feature affecting the SDC proneness.
(9) Execution time related features. Generally speaking, instructions with a higher dynamic count ratio (DCR) have a higher SDC proneness. Our experimental results show that instructions on a long path of the DDG usually have a higher SDC vulnerability.


Table 1. Some features extracted for model building.

Data dependency related features:
  destination_operand_fanout - the fanout of the destination operand
  destination_operand_cover - the cover of the destination operand
Type of end points of data dependency chains related features:
  is_store - whether the operation is used to write to memory
  is_function_call - whether the operation is a function-call operation
  is_cmp - whether the comparison is made between primitive data
Memory access and addressing related features:
  is_load - whether the operation is used to read from memory
  is_memory_addressing - whether the operation is a memory addressing operation
Sub-word operations related features:
  is_shl - whether the operation is a shift left operation
  is_lshr - whether the operation is a logical shift right operation
  is_ashr - whether the operation is an arithmetic shift right operation
Logical operations related features:
  is_and - whether the operation is a logic "and" operation
  is_or - whether the operation is a logic "or" operation
Successor instruction related features:
  shl_instructions_count - the number of left shift instructions contained in the successor instructions
  shr_instructions_count - the number of right shift instructions contained in the successor instructions
Code structure related features:
  number_of_pred_BBs - number of predecessor BBs
  number_of_suc_BBs - number of successor BBs
  is_within_loop - whether the basic block is within a loop
  is_loop_terminator - whether the result can break a loop execution
  is_accumulative_computation - whether the operation is an accumulative-computation operation
Data width related features:
  data_width_source_operand - the data width of the source operand
  data_width_destination_operand - the data width of the destination operand
Execution time related features:
  dynamic_count_ratio - dynamic count ratio
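For illustration only, a single training record consumed by the later stages can be thought of as a vector over the feature names in Table 1 plus the SDC vulnerability label obtained from fault injection; the concrete values below are invented.

    # Hypothetical (feature vector, label) pair for one static instruction;
    # feature names follow Table 1, all values are made up for illustration.
    training_record = {
        "features": {
            "destination_operand_fanout": 4,
            "destination_operand_cover": 11,
            "is_store": 0,
            "is_load": 1,
            "is_and": 0,
            "shl_instructions_count": 1,
            "number_of_pred_BBs": 2,
            "is_within_loop": 1,
            "data_width_destination_operand": 32,
            "dynamic_count_ratio": 0.012,
        },
        "label": 0.0047,   # measured SDC vulnerability V(SDC) of this instruction
    }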

3.2 Fault Injection and Training Data Generation

The goal of fault injection is to create a training set for the machine learning regression model. The fault injection experiments are conducted using LLFI, a program-level fault injection tool which has been shown to be accurate for measuring SDCs in programs [19]. LLFI works at the LLVM compiler's intermediate code level [9], and allows fault injections to be performed at specific program points and into specific data types. It also enables tracing the propagation of the fault in the program by instrumenting the program at selected points. LLFI is closely integrated with the LLVM compiler, and can hence support a wide variety of programs. We selected a set of 12 benchmarks drawn from the SPEC benchmarks [20], Stanford benchmarks [21], Parboil benchmarks [22] and PARSEC benchmarks [23]. We divide these applications into two groups: one group for training and the other for testing. We chose these benchmarks to represent a wide range of commodity and scientific applications. Tables 2 and 3 illustrate the characteristics of the benchmarks. These benchmarks are compiled by the LLVM compiler with the standard optimization level (-O2). We compile the IR file and feed the produced executable file to LLFI after linking.

Table 2. Characteristics of the training benchmarks

Program       Description                                     Benchmark suite
Bzip2         File compression and decompression program      SPEC benchmarks
Perlbench     SPEC benchmark for perl interpreter             SPEC benchmarks
Blackscholes  Financial analysis                              PARSEC benchmarks
Swaptions     Price portfolio of swaptions                    PARSEC benchmarks
TSP           Solving the classic TSP problem                 Stanford benchmarks
Qsort         Sorting a list of random numbers by quick-sort  Stanford benchmarks
BFS           Breadth-first search                            Parboil benchmarks
LBM           Fluid dynamics                                  Parboil benchmarks

Table 3. Characteristics of the testing benchmarks

Program   Description                           Benchmark suite
Gzip      Compression                           SPEC benchmarks
Ferret    Similarity search                     PARSEC benchmarks
Queens    Solving the classic n-queens problem  Stanford benchmarks
MM        Dense matrix-matrix multiply          Parboil benchmarks

The experiments are carried out on an Intel Core i7 based machine with 8 GB of RAM and a 400 GB hard drive. The machine runs Debian Linux version 6.0.

Previous research results show that SDC proneness is highly influenced by the data dependencies among the instructions and that a considerable number of program instructions have no effect on the program results. In this paper, we use the static-slicing technique [24] to transform a program into an identical but smaller and simpler executable version. The executable slice of a program is a subset of the program instructions that can be executed. First, we run LLFI on each executable slice of a program and select specific instructions as fault injection targets. Second, we use statistical fault injection to implement the fault injection. We corrupt the instruction's source register by flipping a single bit in it, and each bit is flipped one time. In each run, a fault, i.e., a single bit flip, is injected into the source register of exactly one dynamic instance of an instruction, and the outcome of the fault is classified by comparing the final output with the fault-free outcome. The fault-free or baseline outcome is obtained by running the original executable with the same input, but without any injected faults. We classify the outcome into four categories: (1) Crash, meaning that the program threw an exception; (2) SDC, which means the program's output deviated from the fault-free outcome; (3) Hang, which means the program took significantly longer to execute than a fault-free run; and (4) Benign, which means the program completed successfully and its output matched the fault-free outcome. The above outcomes are mutually exclusive and exhaustive. Third, the SDC proneness P(SDC) of each instruction is gathered and computed by Eq. (1):

P(SDC) = \frac{N_{SDC} \cdot D(I)}{N_{fault}}    (1)

where N_{SDC} is the SDC count caused by instruction I, N_{fault} is the total number of initial faults attributed to instruction I, and D(I) is the dynamic count ratio of instruction I. Meanwhile, we gather and compute the average SDC impact of each instruction by Eq. (2):

Impact(I) = \frac{1}{N_{SDC}} \sum_{i=1}^{N_{SDC}} \frac{CIO_i}{CO}    (2)

where CIO_i is the number of incorrect program outputs caused by the i-th SDC, and CO is the total number of program outputs during the execution. In this paper, we treat the store operation as program output. Finally, we obtain the SDC vulnerability of instruction I by the equation Vulnerability(I) = P(SDC) × Impact(I). Thus, the training data set {F, C} is generated, where F is the extracted feature vector and C is the annotated class label (i.e., the SDC vulnerability).
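A minimal sketch, under the definitions above, of how Eqs. (1) and (2) and the final vulnerability label could be computed from logged injection outcomes. The record layout and the numbers in the example call are hypothetical and are not taken from the paper's experiments.

    # outcomes: one entry per injected fault for instruction I
    # incorrect_outputs: for each SDC run, the number of incorrect store outputs observed
    # total_outputs: total number of store outputs in a fault-free run
    # dyn_count_ratio: D(I), the dynamic count ratio of instruction I
    def sdc_vulnerability(outcomes, incorrect_outputs, total_outputs, dyn_count_ratio):
        n_fault = len(outcomes)
        n_sdc = sum(1 for o in outcomes if o == "SDC")
        if n_fault == 0 or n_sdc == 0:
            return 0.0
        p_sdc = n_sdc * dyn_count_ratio / n_fault                           # Eq. (1)
        impact = sum(c / total_outputs for c in incorrect_outputs) / n_sdc  # Eq. (2)
        return p_sdc * impact                                               # Vulnerability(I)

    # Illustrative call: 10 injections, 3 SDCs corrupting 5, 2 and 1 of the
    # program's 100 store outputs, and a dynamic count ratio of 0.02.
    v = sdc_vulnerability(["SDC", "Benign", "Crash", "SDC", "Benign",
                           "Hang", "Benign", "SDC", "Benign", "Crash"],
                          [5, 2, 1], 100, 0.02)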

3.3 Regression Model Training

Training Random Regression Forests
We train the models from a set of training instructions with the above features. The SDC vulnerability of these instructions depends on the extracted features. The full training set has 8146 training instructions, based on which a random regression forest is constructed. The random forest is an ensemble learner consisting of a collection of tree-structured base learners. Each base learner is a classification and regression tree (CART); for regression, each tree individually predicts the target response while the forest predicts the target as the average of the individual tree predictions. Let F = {f_i ∈ R | i = 1, 2, ..., N} denote the extracted features, and let C = {c_1, c_2, ..., c_N} denote the annotated class labels of the training samples (i.e., the SDC vulnerability). We build the trees following the random forest framework [25]. For each tree in the random forest, a subset of samples is randomly chosen by bootstrap from the training samples, while the remainder is used to test the prediction accuracy of the random forest. The random selection of features is done at each node split when building the tree. Typically this setting is √n, where n is the number of features; in our method the number of features selected at each split is 8. Furthermore, once a tree is built, we evaluate its quality to determine whether the tree should be retained or discarded. Only the trees which have sufficiently high accuracy are kept. As mentioned above, the bootstrap method is used to randomly choose a subset of samples from the given training samples to construct a tree. These chosen samples are called in-of-bag (IOB) data, while the remainder is called out-of-bag (OOB) data. The OOB data are utilized to evaluate the tree constructed based on the IOB data. Given a tree regressor h_k(x) built from the k-th training data subset, we define the mean square error (MSE) of the tree h_k(x) as

MSE = \frac{1}{N - N_s} \sum_{i=1}^{N - N_s} (h_k(x_i) - C_i)^2    (3)

where x_i is a sample in the OOB data and C_i is the class label of the sample x_i. Obviously, a tree with a low MSE has high accuracy. Hence, we accept a tree whose MSE on the OOB data is below a prespecified threshold. In this way, the trees constructed in the forest all have a low MSE, and a better random forest is obtained.

Choose the Instructions to Protect and Design Detector
After predicting the SDC vulnerability for each instruction, we choose instructions to minimize the SDC impact subject to a given performance overhead, using a standard dynamic programming algorithm [27]. Once we identify a set of instructions to protect, the next step is to insert error detectors at these instructions. Our detectors are based on duplicating the backward slices of the instructions to protect, similar to prior work [16]. We insert a check immediately after each instruction to be protected, which compares the original value computed by the instruction with the value computed by the duplicated instructions. Any difference between these values is deemed an error detection, and the program is stopped.
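The paper refers to a standard dynamic programming algorithm [27] for this selection step; the following sketch shows one plausible 0/1 knapsack style formulation. The per-instruction vulnerability values, integer overhead costs and the overhead budget are hypothetical placeholders.

    # Pick a subset of instructions maximizing total predicted SDC vulnerability
    # subject to a performance-overhead budget (classic 0/1 knapsack DP).
    def select_instructions(vulns, costs, budget):
        n = len(vulns)
        best = [[0.0] * (budget + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for b in range(budget + 1):
                best[i][b] = best[i - 1][b]
                if costs[i - 1] <= b:
                    cand = best[i - 1][b - costs[i - 1]] + vulns[i - 1]
                    best[i][b] = max(best[i][b], cand)
        # Backtrack to recover the chosen instruction indices.
        chosen, b = [], budget
        for i in range(n, 0, -1):
            if best[i][b] != best[i - 1][b]:
                chosen.append(i - 1)
                b -= costs[i - 1]
        return chosen

    # Hypothetical data: vulnerabilities, integer overhead units, 20-unit budget.
    print(select_instructions([0.05, 0.02, 0.08, 0.01], [8, 5, 12, 3], 20))   # -> [2, 0]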


4 Experimental Evaluation

SDC vulnerability accuracy, SDC coverage, SDC detection efficiency and SDC impact are the key parameters for evaluating our approach, so all of these parameters are measured and reported.

4.1 SDC Vulnerability Accuracy

There are primarily three parameters which can be tuned to improve the predictive power of the random regression forest model: (1) the maximum number of features in an individual tree, (2) the number of trees, and (3) the minimum sample leaf size of an individual tree. We set the value of the first parameter to √n, where n is the total number of features. As for the number of trees, we gradually increased it from 150 with a step size of 5 until the prediction accuracy became stable or started decreasing; finally, this value was determined to be 285. In order to avoid over-fitting, we set the third parameter to 50.

To evaluate the prediction of SDC vulnerability for each instruction, we calculate the average squared error on the testing data set and the accuracy (the percentage of the samples whose SDC vulnerability estimation error is less than 10%) of the SDC vulnerability estimation. Table 4 shows the accuracy and MSE of the SDC vulnerability on the four testing benchmarks. It can be observed that our models are highly accurate in predicting the SDC vulnerability. The high accuracy benefits from the fact that we strengthen the generalization of a tree by choosing features according to their weights when building the trees of the random forest. Besides, the SDC vulnerability of the testing programs is close to the real probability because we use a statistical fault injection method to implement the fault injection. In addition, we optimize the parameters of the random forest, which improves the prediction accuracy. Thus, the model can guide detector placement to obtain high coverage at low performance overheads.

Table 4. The MSE and accuracy of the testing programs

Program   MSE       Accuracy
Gzip      0.00549   88.76%
Ferret    0.00247   94.56%
Queens    0.00178   95.28%
MM        0.00428   90.13%
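As an illustration of the forest construction in Sect. 3.3 combined with the parameter settings above, the sketch below builds bootstrap trees, screens them by their out-of-bag MSE and averages the survivors. scikit-learn is used purely for convenience; the paper does not state its implementation, and the MSE acceptance threshold is a made-up value.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def train_screened_forest(X, y, n_trees=285, mse_threshold=0.01, seed=0):
        # X, y are NumPy arrays of feature vectors and SDC vulnerability labels.
        rng = np.random.default_rng(seed)
        kept = []
        n = len(X)
        for _ in range(n_trees):
            iob = rng.integers(0, n, n)              # bootstrap ("in-of-bag") indices
            oob = np.setdiff1d(np.arange(n), iob)    # out-of-bag indices
            if oob.size == 0:
                continue
            tree = DecisionTreeRegressor(max_features="sqrt",   # roughly sqrt(62) = 8 per split
                                         min_samples_leaf=50)
            tree.fit(X[iob], y[iob])
            oob_mse = float(np.mean((tree.predict(X[oob]) - y[oob]) ** 2))
            if oob_mse <= mse_threshold:             # screen out low-quality trees
                kept.append(tree)
        return kept

    def forest_predict(trees, X):
        # Forest prediction: average of the individual tree predictions.
        return np.mean([t.predict(X) for t in trees], axis=0)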


4.2 SDC Impact

The SDC impact is defined as the ratio of the number of incorrect program outputs caused by an SDC to the total number of fault-free program outputs during the execution. We apply our approach to predict the SDC vulnerability of different instructions so as to satisfy the performance overhead bounds provided by the user. Figure 3 shows the SDC impact obtained by our approach (SDCPredictor) and by SDCAuto for each benchmark under three different performance overhead bounds: 20%, 40% and 60%. As can be seen in Fig. 3, the average SDC impact for SDCPredictor and SDCAuto is 74.70% and 79.31% respectively for the 20% performance overhead bound; the corresponding average SDC impact is 59.20% and 66.30% for the 40% performance overhead bound, and 39.81% and 42.50% for the 60% performance overhead bound. It is obvious that SDCPredictor obtains a lower SDC impact than SDCAuto at the same performance overhead bound. The reason is that SDCPredictor is SDC-impact sensitive: it considers not only the SDC proneness but also the SDC impact of program instructions, and protects instructions with both high SDC proneness and high SDC impact, while SDCAuto considers only the SDC proneness and ignores the SDC severity of instructions. Besides, the prediction model of SDCAuto is built using CART. It is well known that a CART tree can easily become biased, so it is hard for CART to keep robust and stable prediction accuracy. Unlike SDCAuto, the prediction model of SDCPredictor is built using random forests, which hardly cause over-fitting and are less sensitive to noise because they construct a series of tree-based learners. Thus, SDCPredictor can offer more stable and accurate prediction performance than SDCAuto.

4.3 SDC Coverage and Detection Efficiency

The SDC coverage is defined as the fraction of SDC-causing errors detected by our detectors. Figure 4 shows the SDC coverage obtained by our approach and by SDCAuto for each benchmark under three different performance overhead bounds: 20%, 40% and 60%. As can be seen in Fig. 4, the average SDC coverage for SDCPredictor and SDCAuto is 33.08% and 38.10% respectively for the 20% performance overhead bound; the corresponding average SDC coverage is 50.30% and 51.90% for the 40% performance overhead bound, and 66.10% and 65.76% for the 60% performance overhead bound.

As mentioned before, SDC coverage, SDC detection efficiency and SDC impact are the key parameters for evaluating our approach. In the literature [16], the SDC detection efficiency (DE) is defined as the ratio between SDC coverage and performance overhead; however, the SDC impact is not taken into account. We redefine the SDC detection efficiency; the new definition is shown in Eq. (4):

\text{SDC detection efficiency} = \frac{\text{SDC coverage} \times (1 - \text{SDC impact})}{\text{performance overhead}}    (4)

Fig. 3. The comparison of SDC impact under different performance overhead bounds: 20%, 40% and 60%

Fig. 4. The comparison of SDC coverage under different performance overhead bounds: 20%, 40% and 60%

The average SDC detection efficiency for SDCPredictor and SDCAuto is 0.418 and 0.394 respectively for the 20% performance overhead bound; the corresponding average SDC detection efficiency is 0.513 and 0.438 for the 40% performance overhead bound, and 0.663 and 0.630 for the 60% performance overhead bound. Thus, SDCPredictor is comparable to SDCAuto in the SDC coverage obtained, while it has a higher SDC detection efficiency because of its lower SDC impact compared with SDCAuto.
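As a quick sanity check of Eq. (4), plugging in the 20% bound figures reported above for SDCPredictor (coverage 33.08%, impact 74.70%) reproduces the stated efficiency of about 0.418:

    coverage, impact, overhead = 0.3308, 0.7470, 0.20
    print(coverage * (1 - impact) / overhead)   # about 0.418, matching the reported value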

5 Conclusions and Future Research

In this article, a random-forests-based technique for identifying SDC-vulnerable instructions, SDCPredictor, is proposed. SDCPredictor considers not only the probability that a fault in an instruction will lead to an SDC, but also the severity of the SDC. The proposed solution does not require fault injections to predict the SDC vulnerability of each instruction. The experimental results demonstrate that SDCPredictor can obtain higher detection efficiency than previous similar techniques. Research is underway to develop further methods to improve fault coverage and reduce performance overhead. Invariant-based techniques typically have lower overhead than duplication-based techniques, as the assertions consist of much fewer instructions than the entire backward slice of the variables; therefore, the development of invariant-based techniques is a new research topic for our research group.

Acknowledgment. This research was supported by the National Natural Science Foundation of China under grant No. 61370134 and the National High Technology Research and Development Program of China (863 Program) under grant No. 2013AA013901.

References

1. Bhattacharya, K., Ranganathan, N.: RADJAM: a novel approach for reduction of soft errors in logic circuits. In: International Conference on VLSI Design, pp. 453-458 (2009)
2. Racunas, P., Constantinides, K., Manne, S., et al.: Perturbation-based fault screening. In: IEEE International Symposium on High Performance Computer Architecture, pp. 169-180. IEEE Computer Society (2007)
3. Rivers, J.A., et al.: Configurable detection of SDC-causing errors in programs. ACM Trans. Embed. Comput. Syst. 16(3), 88 (2017)
4. Restrepo-Calle, F., Martínez-Álvarez, A., Cuenca-Asensi, S., et al.: Selective SWIFT-R: a flexible software-based technique for soft error mitigation in low-cost embedded systems. J. Electron. Test. 29(6), 825-838 (2013)
5. Chielle, E., Azambuja, J.R., Barth, R.S., et al.: Evaluating selective redundancy in data-flow software-based techniques. Radiation and Its Effects on Components and Systems (2012)
6. Cong, J., Gururaj, K.: Assuring application-level correctness against soft errors. 47(10), 150-157 (2011)
7. Sundaram, A., Aakel, A., Lockhart, D., et al.: Efficient fault tolerance in multi-media applications through selective instruction replication. In: The Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies, pp. 339-346. ACM (2008)
8. Hari, S.K.S., Adve, S.V., Naeimi, H.: Low-cost program-level detectors for reducing silent data corruptions. In: IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 1-12. IEEE (2012)


9. Thomas, A., Pattabiraman, K.: Error detector placement for soft computation. In: IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 1-12. IEEE Computer Society (2013)
10. IEEE: Understanding soft error propagation using efficient vulnerability-driven fault injection. In: IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 1-12. IEEE Computer Society (2012)
11. Hari, S.K.S., Adve, S.V., Naeimi, H., et al.: Relyzer: application resiliency analyzer for transient faults. IEEE Micro 33(3), 58-66 (2013)
12. Li, J., Tan, Q.: SmartInjector: exploiting intelligent fault injection for SDC rate analysis. In: IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, pp. 236-242. IEEE (2013)
13. Feng, S., Gupta, S., Ansari, A., et al.: Shoestring: probabilistic soft error reliability on the cheap. In: Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, pp. 385-396. ACM (2010)
14. Pattabiraman, K., Nakka, N.M., Kalbarczyk, Z.T., et al.: SymPLFIED: symbolic program-level fault injection and error detection framework. IEEE Trans. Comput. 62(11), 2292-2307 (2013)
15. Arasteh, B., Bouyer, A., Pirahesh, S.: An efficient vulnerability-driven method for hardening a program against soft-error using genetic algorithm. Comput. Electr. Eng. 48, 25-43 (2015)
16. Rivers, J.A., et al.: Configurable detection of SDC-causing errors in programs. ACM Trans. Embed. Comput. Syst. 16(3), 88 (2017)
17. Cook, J.J., Zilles, C.: A characterization of instruction-level error derating and its implications for error detection, pp. 482-491 (2008)
18. Laguna, I., Schulz, M., Richards, D.F., et al.: IPAS: intelligent protection against silent output corruption in scientific applications. In: IEEE/ACM International Symposium on Code Generation and Optimization, pp. 227-238. IEEE (2016)
19. Wei, J., Thomas, A., Li, G., et al.: Quantifying the accuracy of high-level fault injection techniques for hardware faults. In: IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 375-382. IEEE Computer Society (2014)
20. Henning, J.L.: SPEC CPU2006 benchmark descriptions. ACM SIGARCH Comput. Archit. News 34(4), 1-17 (2006)
21. Hsueh, M.C., Tsai, T.K., Iyer, R.K.: Fault injection techniques and tools. Computer 30(4), 75-82 (1997)
22. Stratton, J.A., Rodrigues, C., Sung, I.J., et al.: Parboil: a revised benchmark suite for scientific and commercial throughput computing (2012)
23. Bienia, C., Kumar, S., Singh, J.P., et al.: The PARSEC benchmark suite: characterization and architectural implications. In: International Conference on Parallel Architectures and Compilation Techniques, pp. 72-81. IEEE (2017)
24. Weiser, M.: Program slicing. IEEE Trans. Software Eng. SE-10(4), 352-357 (1984)
25. Breiman, L.: Random forests. Mach. Learn. 45(1), 5-32 (2001)
26. Ye, Y., Li, H., Deng, X., et al.: Feature weighting random forest for detection of hidden web search interfaces. J. Comput. Linguist. Chin. Lang. Process. 13(4), 387-404 (2008)
27. Martello, S., Toth, P.: Knapsack Problems. Accessed Nov 1990

Hybrid Cloud Architecture for Cross-Platform Interoperability in Smart Homes

Ming Tao, Chao Qu, Wenhong Wei, Bin Zhou, and Shuqiang Huang

School of Computer Science and Network Security, Dongguan University of Technology, Dongguan 523808, People's Republic of China
{quc,weiwh,zhoub}@dgut.edu.cn
Department of Optoelectronic Engineering, Jinan University, Guangzhou 510632, People's Republic of China

Abstract. With the development and application of Internet of Things (IoT) technology, many home device makers and/or vendors are interested in developing their own smart home solutions. These heterogeneous systems need to be fully interoperable to support the joint and harmonized execution of household operations. However, the heterogeneity of devices, services and communication protocols involved in most of the available solutions developed by different vendors still remains a challenging interoperability issue and is adversely affecting widespread adoption. Hence, it needs to be reasonably solved so that the heterogeneous systems in IoT-enabled smart homes can operate in an optimal fashion. To address this issue of cross-platform interoperability, a hybrid cloud architecture for IoT-enabled smart homes is proposed in this paper, which presents a solid solution to achieve the effective and efficient interoperability of the heterogeneous services and devices from different vendors. Experiments are conducted to demonstrate the performance.

Keywords: Cloud · IoT · Interoperability · Smart home

1 Introduction

Promoted by the advances in emerging Internet of Things (IoT) technology, an ever-growing number of information sources has been fostered in the smart home scenario. These sources are structured in multiple sensing and control platforms/applications connected through several wireless and wireline communication facilities, and the fundamental challenge consists in collecting, integrating, aggregating and processing the huge amount of data originated by these sources in order to transform it into the knowledge needed by the smart services provided in the modern home [1,2]. This may imply managing many heterogeneous devices and protocols/technologies as well as performing cross-platform harmonization of the data they produce, which becomes really feasible only by relying on the virtually unlimited storage and computing resources provided by cloud infrastructures [3]. Furthermore, the virtualization facilities provided by clouds can significantly boost the limited computing capacity of hardware-constrained sensing or actuator devices, making them able to handle the complex processing tasks needed by modern smart home applications [4].

Currently, for various reasons, the vendors of home devices prefer to develop proprietary smart home platforms reflecting their own interests. These platforms often bring their own solutions and service interfaces, such that different communication protocols and standards are typically deployed within each solution. Hence, interconnecting heterogeneous services and devices from different vendors, and providing interoperability across the available platforms, remain the main challenges [5]. Although there are many proposals focusing on heterogeneous systems management and interoperability issues, the heterogeneity of the devices, services, communication protocols, standards and data formats involved in most of the available solutions developed by different vendors still remains a challenging interoperability issue.

To address this issue, a hybrid cloud architecture for IoT-enabled smart homes is developed in this paper to enable effective, efficient and seamless interoperation of heterogeneous devices/services provided by different vendors. In this context, a public cloud based platform is built that provides virtualization of the involved objects and their interfaces and allows their orchestration into generalized on-demand smart home services. In each private cloud platform, the communication and access protocols and standards, as well as the device registration, authentication, management and manipulation methods used, are individuated by the vendor. When the home user wants to manipulate a home device, the operating processes are discussed in two scenarios. Finally, elaborately designed experiments are conducted to demonstrate the effectiveness and efficiency of the proposed hybrid cloud architecture.

The rest of this paper is organized as follows. In Sect. 2, a brief review of the related achievements in the existing proposals is given. In Sect. 3, a hybrid cloud architectural model for cross-platform interoperability in IoT-enabled smart homes is first developed, and then the operating processes of home device manipulation in two scenarios are discussed. In Sect. 4, the experimental setup and the analysis of effectiveness and efficiency are addressed. In Sect. 5, the paper is summarized and concluded.

2 Related Work

Recently, with the emerging IoT and cloud computing technologies, many proposals have focused on smart home design, heterogeneous systems management and interoperability issues. In [4], Soliman et al. presented a smart home approach which consists of embedding intelligence into sensors and actuators by using the Arduino platform, and networking smart things by using ZigBee technology. Ghayvat et al. [6] presented a universal IoT-based smart home model, in which all the home devices and appliances are connected together and the home network is the integration of different wireless technologies. Seo et al. [7] proposed a platform architecture named HePA (Hexagonal Platform Architecture) that is extremely scalable while maintaining the required performance and reflecting the requirements of the complex IoT environment. By integrating IoT and service component technologies, Li et al. [8] presented a smart home system architecture which considers heterogeneous data fusion in IoT, and Tao et al. [9] proposed an ontology-based scheme to address the problem of managing large volumes of heterogeneous data generated by home entities.

To address the heterogeneous systems management and interoperability issues, using SOA (Service Oriented Architecture) and web services to integrate various home services and applications has been investigated as a promising option. In [10], Wu et al. used SOA and mobile-agent (MA) technology to support the interactions between system components, and designed a smart home architecture that is a peer-to-peer (P2P) model based on multiple Open Services Gateway Initiative (OSGi) platforms. Also OSGi-based, Cheng et al. [11] proposed an extensible architecture for heterogeneous smart home systems enabling dynamic integration of devices, services and protocols. In [12], Hamza et al. proposed an architecture which uses open source SOA and the Smart-M3 framework to provide the core technologies enabling interoperability and extendibility, and designed an ontology to enable semantic middleware for integration. By taking into account the distributed nature of the home environment with heterogeneous devices, Perumal et al. [13] presented an integrated approach using the SOAP/XML protocol for implementing effective web-service-enabled smart home management systems.

New architectures and platforms for smart home management, such as the cloud- and IoT-based ones, should be provably scalable, efficient, reliable and secure before starting their large-scale deployment. Existing mechanisms and approaches, however, are not yet fully satisfactory in meeting all these requirements at the same time. There are still some serious challenges, described as follows. Since a number of stakeholders such as device and service vendors are involved in smart home clouds, and there are complex dependencies among these stakeholders as well, global standards are necessary to refrain from incompatibilities and conflicts between private platforms and solutions. However, establishing global standards that break the complexity into smaller pieces and make smart home clouds more compatible and effective remains a challenge. Further efforts on standardization should be conducted to coordinate various kinds of resources, achieving more effective smart home clouds and reducing the number of adaptation and mediation stages. The utility of smart home clouds mainly relies on their scalability in handling a dynamically and time-varying growing number of homes. Apart from handling regular operations of home devices, smart home clouds must be able to face the ever-growing demands for home entertainment and other applications. They also need to provide interoperability among the heterogeneous devices and services from different vendors, such that further and more advanced developments aimed at optimizing the utilization of computing, storage and network resources are needed. Meanwhile, the realization of optimization algorithms that coordinate the private platforms/clouds with the public one to achieve real-time cross-layer data synchronization and minimize the traffic load between layers is necessary as well. In addition, with the launch of new home devices and technologies, designing and developing cost-effective IoT middleware supporting the integration of these newly launched devices and technologies with the existing ones will be challenging.

3 Hybrid Cloud Architecture for Cross-Platform Interoperability

Building a public cloud based platform that provides virtualization of the involved objects and their interfaces, and allows their orchestration into generalized on-demand smart home services, may be an effective strategy for facing the above challenges and avoiding conflicts between the different private platforms characterizing the legacy vendor solutions.

The layered scheme of our proposed hybrid cloud architectural model for IoT-enabled smart homes is shown in Fig. 1. Generally, the defined layers have different functions and the lower layers provide foundational support for the upper layers. By integrating under a common cloud-based platform, various devices that fit into the broader concept of IoT, e.g., household sensors, actuators, controllers, smart home appliances, smart phones, and other Internet-accessed home appliances, interconnect by using the available wireless communication technologies (e.g., Bluetooth, RFID, ZigBee, Wi-Fi, 3/4G, LTE, etc.). Such an architecture enables data collection and exchange among all the home devices in order to provide cheap, secure, real-time and on-demand home services. It also allows the seamless interworking of the legacy platforms (typically private clouds) provided by different smart home service vendors through the aforementioned public cloud layer, by generalizing the scope of each individual service, represented by using an Internet-like structure, and integrating it into a common IoT service fabric for sharing and reuse in multiple operating household contexts. SOA will also be employed here to integrate various kinds of information and connect multiple devices from different vendors seamlessly through the smart home cloud. SOA allows the developers of smart home applications to organize, aggregate and package relevant applications into new advanced and emerging home services. Additionally, a specific middleware stratum can be employed to hide the concrete details of the underlying technologies and to provide support for integrating the specific applications implemented on the smart home cloud. By leveraging such SOA- and IoT-enabled smart home clouds, more and more innovative home services can be developed by device vendors, third-party service providers and government agencies.

Fig. 1. Multi-layer architecture for IoT-enabled smart home data clouds.

In each private cloud platform, the communication and access protocols and standards, as well as the device registration, authentication, management and manipulation methods used, are individuated by the vendor [14]. In the public cloud, some basic public application services are provided, e.g., global identifier allocation and parsing for the registered home devices, and access admission and management for the service platforms of other industries; additionally, providing the virtualized service and device/object interfaces for third-party access to home services and devices, the platform bus implements protocol conversion and addressing operations for all the devices (with their IDs) registered in the platform [15].

In the developed hybrid cloud platform for IoT-enabled smart homes, when the home user wants to manipulate a home device, the operating process should consider the following two scenarios. In the first scenario, where the target device is managed by the associated private cloud, the diagram of the operation process is shown in Fig. 2, and the crucial procedures are described as follows.

I. The certified home user uses the vendor-specific companion App installed on the smart phone to send a corresponding operation command to the associated private platform directly.


Fig. 2. The diagram of operation process in scenario 1.

II. The ID of the target device will first be checked locally in such (presumably private cloud) infrastructure. If the target device is registered in the associated private platform, and the legitimacy of the operation command is verified as positive, the operation command is forwarded to the target device associated with the private platform.
III. After accomplishing the requested manipulation, the target device reports the relevant parameters about its current operating status to the associated private platform, which then forwards them to the platform bus in the public cloud.
IV. Subsequently, the platform bus synchronizes the device status with all the other associated private platforms.

In the second scenario, where the target device is not managed by the associated private cloud, the diagram of the operation process is shown in Fig. 3, and the crucial procedures are described as follows.

I. The certified home user uses the vendor-specific companion App installed on the smart phone to send a corresponding operation command to the associated private platform directly.
II. The ID of the target device will first be checked locally in such infrastructure. If the device is not registered in the associated private cloud, the operation command is forwarded to the platform bus in the public cloud.
III. The platform bus then redirects the operation command to the corresponding private platform by performing the addressing operation with the ID, and the legality of the operation is verified in the associated private platform as well. If the legitimacy of the operation command is positive, the operation command is forwarded to the target device associated with the private platform.


Fig. 3. The diagram of operation process in scenario 2.

IV. After accomplishing the requested manipulation, the target device reports the relevant parameters about its current operating status to the associated private platform, which then forwards them to the platform bus in the public cloud.
V. Finally, the platform bus synchronizes the updated device status in the whole hybrid cloud platform just as specified above.
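A minimal sketch of the command-routing logic behind the two scenarios above. The class and method names are invented for illustration and do not correspond to an actual implementation from the paper; authentication and protocol conversion are reduced to placeholders.

    # Hypothetical routing of an operation command in the hybrid cloud.
    class PrivateCloud:
        def __init__(self, name, bus):
            self.name, self.bus, self.devices = name, bus, {}

        def register(self, device_id, device):
            self.devices[device_id] = device
            self.bus.register(device_id, self)        # public bus keeps the global ID map

        def handle_command(self, device_id, command, authorized=True):
            if device_id in self.devices:             # scenario 1: device managed locally
                if authorized:
                    status = self.devices[device_id](command)
                    self.bus.sync_status(device_id, status)   # report + cross-platform sync
                return
            self.bus.forward(device_id, command)      # scenario 2: hand over to platform bus

    class PlatformBus:
        def __init__(self):
            self.registry, self.status = {}, {}

        def register(self, device_id, cloud):
            self.registry[device_id] = cloud

        def forward(self, device_id, command):
            # Addressing: redirect the command to the private cloud owning the device.
            self.registry[device_id].handle_command(device_id, command)

        def sync_status(self, device_id, status):
            self.status[device_id] = status           # synchronized view for all platforms

    # Usage: a vendor-A app controls a vendor-B device through the platform bus.
    bus = PlatformBus()
    cloud_a, cloud_b = PrivateCloud("A", bus), PrivateCloud("B", bus)
    cloud_b.register("lamp-01", lambda cmd: {"power": cmd})
    cloud_a.handle_command("lamp-01", "on")
    print(bus.status["lamp-01"])   # {'power': 'on'}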

4 Experiments and Analysis

To qualitatively analyze and evaluate the performance of the proposed hybrid cloud architecture in addressing the issue of effective and efficient interoperability, we designed a prototype consisting of an Amazon EC2 based public cloud, a private smart home cloud platform built at Dongguan University of Technology (DGUT), and a private smart home cloud platform authorized by Canbo CO., LTD, China. As shown in Table 1, the two private cloud platforms employ entirely different architectures. The former is built using open-source solutions, i.e., virtualization software (KVM) and management software (OpenStack). The latter is built using VMware solutions, i.e., virtualization software (VMware vSphere) and management software (VMware vCenter). Additionally, the home devices deployed in the former are provided by different vendors and use different network access technologies. The kitchen and bathroom devices deployed in the latter are the independent products of Canbo CO., LTD, but the other kinds of deployed home devices are provided by different vendors and use different network access technologies. The consumer located at DGUT connects to the former private cloud platform via campus wireless network connectivity with a 10 Mbps ~ 50 Mbps data transfer rate. To evaluate the interoperability within the same associated platform or across the heterogeneous platforms in smart home environments, the response time, defined as the maximum execution time taken by system tasks, is used as the evaluation metric of effectiveness and efficiency.

Table 1. Configurations of the two private cloud platforms.

          DGUT                       Canbo
CPU       Intel Xeon E3-1231v3       Intel Xeon E7-4850v3
RAM       16 GB                      128 GB
Storage   1 TB                       15 TB
OS        Ubuntu Server 12.04 LTS    Ubuntu Server 16.04 LTS

Fig. 4. The response time in the first scenario.

In the two scenarios discussed in Sect. 3, a total of 500 testing samples of home device manipulations were performed respectively, and the experimental results are shown in Figs. 4 and 5. In the first scenario, where the consumers and the target home devices are associated with the same private platform, the average response time is 29.38 ms and the sample standard deviation is 0.826 ms. In the second scenario, where the consumers and the target devices are associated with different private platforms, the manipulation commands have to travel through the Internet, so the average response time is 59.38 ms and the sample standard deviation is 2.023 ms. From the experimental results, we can clearly see that, within the proposed hybrid cloud architecture, the measured response times satisfy the requirements of home device manipulation applications. In particular, the results are acceptable for interoperations across heterogeneous platforms as well.
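The averages and sample standard deviations reported above could be obtained from the raw 500 response-time measurements along the following lines; the short sample list is a placeholder, since the paper does not provide its raw data or analysis script.

    import statistics

    def summarize(samples_ms):
        # statistics.stdev uses the n-1 denominator, i.e. the sample standard deviation.
        return statistics.mean(samples_ms), statistics.stdev(samples_ms)

    scenario1 = [29.4, 28.9, 30.1, 29.0, 29.5]   # placeholder for the 500 measured values
    print(summarize(scenario1))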


Fig. 5. The response time in the second scenario.

5 Conclusion

In this paper, to address the issue of cross-platform interoperability caused by the heterogeneity of the available solutions developed by different vendors, a hybrid cloud architecture for IoT-enabled smart homes is proposed. Within this architecture, two different scenarios of home device manipulation are concretely discussed. Finally, the experimental results obtained on the actual platforms demonstrate the effectiveness and efficiency of the proposal. With the ultimate goal of making the home living experience more comfortable and enjoyable, the hybrid cloud platform is expected to be the backbone of the future smart home. Through collaborations among academia, home device companies, cloud service providers, standardization groups, government authorities and law enforcement organizations, as well as various systematic approaches to engineering new architectures and operating schemes, hybrid-cloud enabled solutions can provide promising opportunities for promoting technology innovation in the smart home industry and other industrial businesses, with tremendous benefits to society.

Acknowledgments. This work was supported in part by the Natural Science Foundation of Guangdong Province, China (Grant No. 2018A030313014); Guangdong University Scientific Innovation Project (Grant No. 2017KTSCX178); the outstanding young teacher training program of the Education Department of Guangdong Province (Grant No. YQ2015158); Guangdong Provincial Science & Technology Plan Projects (Grant Nos. 2016A010101035 & 2016A010101034); and the National Natural Science Fund, China (Grant Nos. 61300198 & 61772233).


Conflict-Free Block-with-Stride Access of 2D Storage Structure Rui Song, Guozhao Zeng, Sheng Liu(&), and Haiyan Chen College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China [email protected]

Abstract. Parallel memory modules can be used to increase memory bandwidth and feed a processor with data in the required access patterns. A parallel storage mechanism organized and managed across multiple storage modules suits image and video applications. Previous investigations into data storage schemes achieve continuous conflict-free access by rows, columns or blocks; however, this is not sufficient for some sliding-window applications in video and image processing algorithms (including convolutional neural networks, sub-pixel difference, 2D filtering, etc.), which need conflict-free access by steps during computation and, moreover, have different demands for horizontal and vertical strides in their computing sub-processes. This paper presents a storage scheme that supports aligned conflict-free row access and non-aligned block-with-stride access beginning at any address. Theoretical proofs and experiments verify the correctness of the module address (the module number to which an address is mapped). In the hardware design, it was found that in the typical case there was no path violation, and the area overhead is small. The scheme is suitable for CNN applications, improving the performance of convolution algorithms.

Keywords: Main memory architectures · 2D memory conflicts · Parallel storage scheme

1 Introduction The quality of the storage system design is an important factor that restricts the performance of vector SIMD processors. Because processor speeds are substantially higher than memory speeds, it has been necessary to develop architectural features to support parallelism in the memory subsystems. Parallel memory modules can be used to provide special data patterns and feed the processors with only algorithm-specific data. In specific data patterns [1, 2], accessed data elements are separated by a distance called the stride. Many applications in the fields of digital signal processing and telecommunications benefit from the use of strides. Vector/matrix computation, Fast

This paper is supported by the National Nature Science Foundation of China (No. 61602493, Name Researches on Efficient Parallel Memory Techniques for Wide Vector DSPs). © Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 618–629, 2018. https://doi.org/10.1007/978-3-030-05057-3_46


Fourier Transform (FFT), and the Viterbi algorithm are some examples [3, 4]. However, these only take the horizontal stride problem into account. For the latest applications we should consider both horizontal and vertical strides, for which two-dimensional (2D) storage is more suitable. Parallel memory systems have been studied by many previous investigators. The key to the storage scheme of such a system is the mapping between addresses and physical storage locations in memory. The initial storage scheme is a low-order interleaved (also referred to simply as interleaved) scheme, which maps address a into memory module a mod N, where N is the number of memory modules in the system. The classic linear mapping method proposed by Budnik and Kuck [5] skews the storage scheme of the system. Subsequent studies are based on this and often consider only one-dimensional dynamic storage mechanisms, or two-dimensional dynamic storage that does not take the vertical stride into account. The existing 2D storage mechanisms can achieve continuous conflict-free access by rows, columns or blocks. A classic memory storage scheme from previous investigations is Park [6–8], which can support row, column, block, and diagonal access starting anywhere. The Park scheme leaves part of the memory space unaddressed, thereby avoiding the overly complicated address-calculation circuit and the chip-area waste caused by a prime number of memory modules. In addition, the presence of prime memory modules makes it difficult to implement circular addressing, so the Park scheme is limited. Based on Park, Hong et al. [9] proposed a new scheme; it eliminates the defects of Park but requires more memory and brings a large area overhead. Another well-known scheme is the Bilinear Skewed Parallel Memory (BilisPM) [10], which can simultaneously support conflict-free row, column, block and other access modes, supports circular addressing in both the horizontal and vertical directions, and has a relatively small memory-cell area overhead. The above solutions do not adapt to the latest needs, because none of them considers strides in storage. Although the existing 2D storage mechanisms can achieve continuous conflict-free access by rows, columns or blocks, some sliding-window applications in video and image processing algorithms (including convolutional neural networks, sub-pixel difference, 2D filtering, etc.) need conflict-free access by strides during computation, with different demands for horizontal and vertical strides in their computing sub-processes. In addition, on top of these requirements, guaranteeing conflict-free basic row access is a prerequisite for high-bandwidth data loading and result reading. A well-known existing 2D memory scheme is the set of mapping functions proposed by Liu et al. [11]. In this paper, we consider the various horizontal strides, vertical strides, numbers of strides, their relationship with the number of banks, and so on. We design a unified address mapping scheme and devices to ensure that the data required by each access reside in different storage banks, in order to achieve conflict-free block-with-stride access starting from any address as well as aligned row access. Figure 1 shows one case of aligned row access and unaligned block-with-stride access.
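To see why simple low-order interleaving breaks down for strided access, the short Python sketch below maps addresses to banks with a mod N (the interleaved scheme described above) and counts how many elements of an access pattern land in the same bank; the parameters are illustrative only.

```python
from collections import Counter

def interleaved_bank(addr, num_banks):
    """Low-order interleaved scheme: address a is stored in module a mod N."""
    return addr % num_banks

def conflicts(addresses, num_banks):
    """Extra accesses forced into an already-used bank (0 = conflict-free)."""
    counts = Counter(interleaved_bank(a, num_banks) for a in addresses)
    return sum(c - 1 for c in counts.values() if c > 1)

banks = 8
row_access = list(range(0, 8))            # 8 consecutive elements: conflict-free
strided_access = list(range(0, 64, 8))    # stride equal to the bank count
print(conflicts(row_access, banks), conflicts(strided_access, banks))   # 0 and 7
```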


Fig. 1. Aligned row access and unaligned block-with-stride.

2 Storage Space Mapping Schemes

In a memory bank whose space is X_m × Y_n (X_m and Y_n are integer powers of 2), each element can be represented by coordinates (i, j). Let M and N denote the numbers of elements contained in the data block in the horizontal and vertical directions, respectively, when accessing block-with-stride; both M and N are integer powers of 2, M ≥ N, and M × N equals the SIMD width and the total number of banks of the processor. Let s and h represent the strides in the horizontal and vertical directions, respectively, when striding by block. Furthermore, let n = log2(M × N), and let s be written in the form s = r × 2^s′, where r and 2 are relatively prime. Then the number f(w) of the storage module to which the element with coordinate (i, j) is mapped is:

f(w) = (w + (w / (M × N)) % 2^s′) % (M × N),    if s′ ≤ n
f(w) = (w + (w / (M × N)) / 2^(s′−n)) % (M × N),  if s′ > n    (1)

and w is:

w = i + ((j / h) + (j % 2) × ((i / (M × 2^s′)) % 2) × (N / 4) × 2) × M × 2^s′    (2)

where "/" denotes the quotient (integer division) operation and "%" denotes the remainder operation. It can further be determined that the in-bank address of the element with coordinate (i, j) is:

g(i, j) = i / (M × N) + j × (X_m / (2 × M × N)) + i × (X_m × Y_m / (2 × M × N))    (3)


In order to verify the mapping method proposed in this paper, the following approach is used to prove its correctness. Because of the large number of variables, there are multiple situations, so here we give the proof for M = 4, N = 2. The module number to which position (i, j) is mapped is:

f(w) = (w + (w / 8) % s′) % 8    (4)

and w is:

w = i + ((j / h) % 2) × 4s′    (5)

According to (1), Table 1 shows the values of the horizontal step s and of s′, divided into several groups.
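To make the mapping scheme easier to follow, the short Python sketch below transcribes the module-number and in-bank-address computations of Eqs. (1)–(3) as reconstructed above; because the equations are partly garbled in this copy, treat the sketch as illustrative rather than as the authors' reference implementation. The parameter names follow the notation of this section.

```python
def module_number(i, j, M, N, s_prime, h):
    """Module number f(w) for element (i, j), per Eqs. (1)-(2) as reconstructed."""
    n = (M * N).bit_length() - 1          # n = log2(M*N), M*N a power of two
    w = i + ((j // h) + (j % 2) * ((i // (M * 2 ** s_prime)) % 2) * (N // 4) * 2) * M * 2 ** s_prime
    if s_prime <= n:
        return (w + (w // (M * N)) % (2 ** s_prime)) % (M * N)
    return (w + (w // (M * N)) // (2 ** (s_prime - n))) % (M * N)

def in_bank_address(i, j, M, N, Xm, Ym):
    """In-bank address g(i, j), per Eq. (3) as reconstructed."""
    return i // (M * N) + j * (Xm // (2 * M * N)) + i * (Xm * Ym // (2 * M * N))

# Example: M = 4, N = 2 (the case proved in the text), horizontal stride s = 2 = 1 * 2^1.
print(module_number(i=5, j=3, M=4, N=2, s_prime=1, h=1))
print(in_bank_address(i=5, j=3, M=4, N=2, Xm=64, Ym=64))
```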

Table 1. Corresponding s′ values for different strides (s).

x̂ = (Σ_{i=1..n} x_i) / n,    ŷ^(j) = (Σ_{i=1..n} y_i^(j)) / n    (3)

The RFID tag works in the active state and sends its information to the readers at a frequency of 3 Hz. While the tag carrier is walking in the area of interest, the tag collects the sensor information and transmits it to the readers, which forward the data to the terminal. The transmitted data is denoted as Ds = {Tu, Ts, ax, ay, az}, where Tu is the turning-information flag (initialized to zero) and Ts is the timestamp at which the information was collected. When the tag carrier makes a turn, the IMU captures the corresponding motion and Tu in Ds is changed from zero to one; the turning information is then transmitted to the terminal via the readers. The terminal thereby obtains the time points at which the tag carrier makes turning movements. The absolute position where the tag carrier turns is calculated from the received signal strength. Once a sequence of turning positions has been collected, the corresponding trajectory is obtained from the RSS. We denote the obtained trajectory as Ls. The distances between the alternative locations in Ls and the alternative locations obtained by string matching Lm are compared to get the optimal trajectory. Then we can obtain the real-time location of the tag carrier as in (4).

S_k = Σ_{i=1..N} (L_ki − L_si)² / Σ_{i=1..N} (L_ki − L_si)²    (4)

where i refers to the ith location in the alternative string, k refers to the kth alternative string, S_k is the similarity between Ls and the kth alternative string, L_si is the location obtained from the received signal strength at the ith vertex, and L_ki is the ith vertex of ZN at its physical location. The similarity is calculated according to the least-squares scheme. Through similarity matching between Lm and Ls, we can get the optimal location of similarity S.


S = max_{k ∈ M} S_k    (5)

where S_k refers to the similarity between Ls and the kth alternative string, and M is the number of alternative options obtained after matching against the text map.
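A minimal sketch of this matching step is given below; because the exact normalization in Eq. (4) is ambiguous in this copy, the sketch simply scores each alternative string by its least-squares residual against the RSS-derived turning positions and picks the closest one. The variable names (`alternatives`, `rss_positions`) are illustrative and not from the original implementation.

```python
def best_alternative(alternatives, rss_positions):
    """Pick the alternative turning-point sequence closest (least squares)
    to the positions estimated from received signal strength.

    alternatives  : dict mapping candidate id -> list of (x, y) vertex positions
    rss_positions : list of (x, y) positions estimated from RSS, same length
    """
    def residual(candidate):
        return sum((xk - xs) ** 2 + (yk - ys) ** 2
                   for (xk, yk), (xs, ys) in zip(candidate, rss_positions))

    # Smallest residual = most similar trajectory.
    best_id = min(alternatives, key=lambda k: residual(alternatives[k]))
    return best_id, residual(alternatives[best_id])

# Two hypothetical candidate strings and three RSS-estimated turning points.
alternatives = {"A": [(0, 0), (5, 0), (5, 4)], "B": [(0, 0), (6, 1), (7, 5)]}
rss_positions = [(0.2, 0.1), (5.1, -0.2), (4.8, 4.3)]
print(best_alternative(alternatives, rss_positions))
```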

5 Experiments

5.1 Experiment Setup

In order to verify the effectiveness of the proposed method, the experiment was conducted in the PuDong research center of the Third Research Institute of the Ministry of Public Security in Shanghai. As shown in Fig. 4, five readers were deployed in the area of interest on the fourth floor: readers 1 and 4 were deployed in the hall, reader 2 in room A, reader 3 in room B, and reader 5 in room C. The volunteer carrying the tag walked from the hall to room A and room B, and then returned to the hall. The nRF52832 is used as the chip of both the active RFID tag and the RFID reader, and the MPU6050 6-axis accelerometer/gyroscope chip is used as the IMU.

Fig. 4. The experiment site and the trajectory of tag carriers. The red arrow lines refer to the walking trajectory of volunteers and the arrows refer to the walking directions. The black polygons refer to the active readers and the black square refers to the start and end location. (Color figure online)

The experiment lasted 4 h and involved four volunteers. The volunteers walked through the area of interest with the tags on their thighs near the knees. The acceleration and gyroscope data were collected about three times per second, and the RFID tag sent the data at a frequency of 3 Hz. We then recorded the locations and the relevant information. In total, four user traces along with 86,044 sensor data records were collected.

5.2 Test and Result

We compared the cumulative distribution function of the positioning error of the proposed method with the method in [21], where RFID is used instead of Wi-Fi technology, and with the method in [22], where a tightly coupled fusion of foot-mounted IMU and RFID is proposed.

Fig. 5. The positioning error in experiment site. The original trajectory and the calculated trajectory with the proposed method is depicted in the figure. The red circles refer to the turning point which the volunteers take. (Color figure online)

As shown in Fig. 5, we depict the original trajectory and the trajectory calculated with the proposed method. The positioning error is reduced greatly at the turning points. Because the IMU-aided RSS filtering method is used, the positioning error along the rest of the trajectory is also reduced to about one meter.

Fig. 6. The cumulative distribution functions of three different positioning methods.


As shown in Fig. 6, the accuracy of the proposed algorithm is higher than that of the other two methods from the literature, achieving about 1.2 m on average. The positioning errors are mostly introduced by the width of the corridors, which is not expressed by the characters.

6 Discussions and Conclusions

In this paper, the proposed method maps the physical space into a logical text map, so that indoor positioning in the physical environment is transformed into a logical text-matching process. As shown in the conducted experiments, the accuracy of the proposed algorithm reaches 1.2 m on average. The proposed method was evaluated on a small floor plan; as the positioning area becomes larger and more complex, the advantage of this method will become more and more prominent. The proposed method establishes the relationship between the physical positioning area and the logical text map using an undirected weighted graph. We match the text map against the character string to get several alternative tag locations, and a further comparison is then made between the RSS obtained at the vertices of the alternative tag locations and the fingerprint database to get the most similar location. Finally, we conducted experiments to verify the feasibility of the proposed method and compared it with other localization methods. In future work, we will lay emphasis on symbol expression and text-matching accuracy to achieve better results in complex 2D environments. Acknowledgment. This work has been supported by the Science and Technology Commission of Shanghai Municipality [Grant Nos. 17511106902 and 15DZ1100400].

References 1. Ni, L.M., Liu, Y., Lau, Y.C., Patil, A.: P: Landmarc: indoor location sensing using active rfid. Wireless Netw. 10(6), 701–710 (2004) 2. Yang, L., Chen, Y., Li, X. Y., Xiao, C., Li, M., Liu, Y: Tagoram: real-time tracking of mobile RFID tags to high precision using COTS devices. In: International Conference on Mobile Computing and NETWORKING, pp. 237–248. ACM (2014) 3. Liu, Y., Yang, Z.: Location, localization, and localizability. J. Comput. Sci. Technol. 25(2), 274–297 (2011) 4. Gentile, C., Alsindi, N., Raulefs, R., et al.: Geolocation techniques: principles and applications, 45:10(10), pp. 64–70. Springer Publishing Company, Incorporated (2012) 5. Wu, C., Yang, Z., Liu, Y., & Xi, W: Will: wireless indoor localization without site survey. IEEE Transactions on Parallel & Distributed Systems, 24(4), 839–848. (2013) 6. Sun, W., Liu, J., Wu, C., Yang, Z., Zhang, X., Liu, Y: MoLoc: on distinguishing fingerprint twins. In: IEEE, International Conference on Distributed Computing Systems, vol. 7973, pp. 226–235 (2013) 7. Xiao, Z., Wen, H., Markham, A., Trigoni, N., Blunsom, P., Frolik, J.: Non-line-of-sight identification and mitigation using received signal strength. IEEE Trans. Wireless Commun. 14(3), 1689–1702 (2015)


8. He, S., Chan, S.H.G., Yu, L., Liu, N: Calibration-free fusion of step counter and wireless fingerprints for indoor localization. In: ACM International Joint Conference, pp. 897–908 (2015) 9. Wang, J., Katabi, D.: Dude, where’s my card?: RFID positioning that works with multipath and non-line of sight. In: ACM SIGCOMM 2013 Conference on SIGCOMM, vol. 43, pp. 51–62 (2013) 10. Chen, P. C A: Non-line-of-sight error mitigation algorithm in location estimation. In: Proceedings of IEEE Wireless Communications NETWORKING Conference, vol. 1, pp. 316–320 (1999) 11. Li, X.: An iterative NLOS mitigation algorithm for location estimation in sensor networks. In: Proceedings of 15th IST Mobile Wireless Communication Summit, Miconos, Greece, pp. 1–5. (2006) 12. Guvenc, I., Chong, C.C., Watanabe, F.: NLOS identification and mitigation for UWB localization systems, pp. 1571–1576 (2007) 13. Nawaz, S., Trigoni, N.: Convex programming based robust localization in NLOS prone cluttered environments. In: Proceedings of 10th International Conference IPSN, Chicago, IL, USA, pp. 318–329 (2011) 14. Hilsenbeck, S., Bobkov, D., Schroth, G., Huitl, R., Steinbach, E: Graph-based data fusion of pedometer and WiFi measurements for mobile indoor positioning. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 147–158 (2014) 15. Yang, Z., Feng, X., Zhang, Q.: Adometer: push the limit of pedestrian indoor localization through crowdsourcing. IEEE Trans. Mob. Comput. 13(11), 2473–2483 (2014) 16. Xiao, Z., Wen H., Markham, A., Trigoni, N.: Lightweight map matching for indoor localization using conditional random fields. In: Proceedings of ACM/IEEE IPSN, pp. 131– 142 (2014) 17. Jiménez, A.R., Seco, F., Prieto, J.C., Guevara, J.: Indoor pedestrian navigation using an INS/EKF framework for yaw drift reduction and a foot-mounted IMU. In: The Workshop on Positioning Navigation & Communication, pp. 135–143 (2010) 18. Chen, L.H., Wu, H.K., Jin, M.H., et al.: Intelligent fusion of Wi-Fi and inertial sensor-based positioning systems for indoor pedestrian navigation. Sensors J. IEEE 14(11), 4034–4042 (2014) 19. Rai, A., Chintalapudi, K.K., Padmanabhan, V.N., et al.: Zee: zero-effort crowdsourcing for indoor localization, pp. 293–304 (2012) 20. Evennou, F., Marx, F.: Advanced integration of WIFI and inertial navigation systems for indoor mobile positioning. Hindawi Limited, pp. 1–11 (2006) 21. Chen, L.H., Wu, H.K., Jin, M.H., et al.: Intelligent fusion of Wi-Fi and inertial sensor-based positioning systems for indoor pedestrian navigation. Sensors J. IEEE 14(11), 4034–4042 (2014) 22. Ruiz, A.R.J., Granja, F.S., Honorato, J.C.P., Rosas, J.I.: G.: Accurate pedestrian indoor navigation by tightly coupling foot-mounted IMU and RFID measurements. IEEE Trans. Instrum. Measur. 61(1), 178–189 (2011)

UAV 3D Mobility Model Oriented to Dynamic and Uncertain Environment

Na Wang1, Nan Di2, Fei Dai1, and Fangxin Liu3

1 Army Engineering University of PLA, Nanjing 210007, China [email protected]
2 Institute of China Electronic System Engineering Company, Beijing 100039, China [email protected]
3 Shanghai Branch, Coordination Center of China, National Computer Network Emergency Response Technical Team, Shanghai 201315, China

Abstract. Currently, unmanned aerial vehicle (UAV) swarms are widely used for emergency rescue in disaster areas. In dynamic and uncertain environments, the uneven distribution of events and obstacles seriously affects the efficiency of UAVs' missions and the safety of airborne operations. Traditional UAV mobility models pay more attention to the UAV's own moving rules, so as to make the UAV's flight pattern match real conditions as much as possible, while ignoring the requirements of the UAVs' mission and the uncertainties of the environment. Based on the 3D Visit-Density Gauss-Semi-Markov Mobility (3D-VDGMM) model, this paper proposes a 3D Mobility Model oriented to Dynamic and Uncertain environments (3D-DUMM). 3D-DUMM is tailored to emergency rescue missions while fully considering the dynamically distributed, dense and irregular obstacles in the rescue area. Simulation experiments show that 3D-DUMM can capture uncertain events well and can safely deal with dynamic and complex rescue environments.

Keywords: Three-dimensional mobility model · Dynamic uncertainty · Emergency rescue

1 Introduction

In recent years, serious natural disasters have occurred frequently. The natural disasters such as earthquakes, mudslides, and ice and snow, have the characteristics of strong sudden, high speed of destruction and wide influence. Therefore, it becomes an urgent problem to discover and locate the disasters quickly and transmit the disaster information in real time efficiently. With the development of UAV technology, UAV swarms have gradually been used to solve the problems that many traditional methods cannot effec‐ tively solve, and have played a crucial role in target monitoring and tracking, airspace situational awareness interaction, unmanned aerial vehicle automatic collision avoid‐ ance, air-to-sea-land coordinated operation and other areas [1]. UAVs have the advan‐ tages of flexibility, low cost, small size, high flying altitude, and can adapt to more

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 640–650, 2018. https://doi.org/10.1007/978-3-030-05057-3_48


complex, dynamic and uncertain environments. Therefore, the drone technology has been widely used in the emergency rescue. A reliable, efficient, and real-time UAV swarm depends on a scientific and flexible mobility model to meet the challenges of events capturing, frequent link changes, and complex and dynamic environments. UAV individuals or UAV swarm usually aim at a specific task and a specific area, and then take corresponding actions around specific tasks and specific areas. Therefore, when designing the UAV mobility model, we must consider the UAV’s own motion characteristics, on the one hand. On the other hand, we must fully consider the UAVs’ tasks and the dynamics of the regional environment. The traditional UAV mobility models pay more attention to the UAV’s own movement law, so as to make the UAV’ flight pattern meet real conditions as much as possible, while ignoring the requirements of UAVs’ mission and the challenges of the dynamic and uncertain environment. In this paper, aiming at the scene of emergency rescue, the 3DVDGMM model is improved by introducing the characteristics of tasks and environ‐ mental factors to the Gaussian model, and a UAV three-dimensional mobility model oriented to a dynamic uncertain environment (3D-DUMM) is proposed. The detailed contributions of this article are summarized as follows: (1) Task-oriented UAV mobility modeling is implemented. The mobility model designed in this paper fits the practical scene, and the UAV can dynamically and efficiently capture the target events according to the characteristics of the rescue mission. (2) Overcoming the uncertainties in the environment. This mobility model greatly enhances the maneuverability of drones in the moving process, and can autono‐ mously adjust its state with environmental changes. This dynamic model can expand the application of the UAVs to more fields.

2 Related Works

Currently, researchers have conducted a series of studies on UAVs mobility models in view of the particularity of high speed motion of UAVs. The quality of the mobility model has a direct impact on the routing algorithm and network topology. After analyzing the main factors affecting the quality of mobility model and their relationship, evaluation index system of mobility model and the quantitative evaluation method based on the analytic hierarchy process (AHP) are proposed in [4]. Then, the validity of the evaluation method is validated by taking the classic random mobility model (random Gauss-Markov (GM), random waypoint model (RWP), random walk model (RW), random direction model (RD)) as an example. For the UAVs own motion control, Broyles designed and implemented the three-dimensional airborne network Gauss Markov mobility model (3D-GMM) based on the high dynamic performance of the airborne network in [5]. The model increases the pitch angle parameter for airborne nodes, enable nodes to be more flexible in the processing of aerial maneuvering. Rohrer increased the effectiveness of the assessment on the basis of the 3D-GMM in [6], through emphasizing the adaptability of the node mobility and the diversity of the movement state. Zheng proposed the three-dimensional Gauss smoothing semi Markov mobility


model in [7], taking the actual environment noise and other factors into account, and increased the Gaussian dis-turbaned in the model, but lacking the dealing with the node moving at boundary conditions. Kuiper [8] based on ant colony algorithm and artificial potential energy method, added pheromone and exclusion logic in the UAVs. Effectively solve the problem that network coverage and connectivity cannot optimize at the same time. He inspired by fish and analyzed the way of transforming between nodes in different state, and designed 3D_NMM (3D semi-random node mobility model) in [9], which achieved the effective coverage of regional monitoring target event. The above model only starts from the UAVs’ own movement. During the operation, it needs to provide the specific track point or the direction of flight for UAVs, to guide UAVs to move. Otherwise, these models don’t consider the dynamic change of the environment. Considering the influence of large obstacles in the flying environment on the UAVs motion mode, Regis [10] extended the three movement models (random walk, random direction and Gauss-Markov) to accommodate large obstacles. For the static obstacles encountered in UAV flight, the collision avoidance cone method is designed in [11], to achieve the avoidance of obstacles. A local path planning algorithm based on improved morphine search tree is proposed in [12], which is used to avoid “sudden” obstacles. In [13], the method called the velocity obstacle method can provide the necessary situational awareness for UAVs in a dynamic environment, and can help to generate a conflicting maneuver. In modern emergency rescue work, UAVs need to complete multiple tasks such as rapid response, data collection, relay communication, situation tracking and disaster recon‐ struction. The complexity of the environment and the tedious tasks make flight routes impossible to design in advance. The 3D Visit-Density Gauss-Semi-Markov Mobility (3DVDGMM) [14] was designed to solve this problem in the early stage of our research group. However, many uncertainties, such as static and dynamic obstacles in the environment, pose challenges for the flight safety of UAVs. Existing mobility models don’t work well for this type of scenario. For this reason, this paper designs the 3D Mobility Model for Dynamic and Uncertainty environment (3D-DUMM) based on 3D-VDGMM, which ensures the safety of UAVs while performing the capture of uncertain events.

3 Overview of 3D-VDGMM

Simulating a UAV network scenario requires accurate analysis of the transmission of wireless signals, which in turn demands accurate knowledge of the geographical locations of the mobile nodes. Because field tests are costly, dangerous and hard to reproduce, while the verification of new applications and protocols is mainly completed with simulation technology, a UAV mobility model is needed to imitate actual UAV movement. In [14], we proposed a three-dimensional Gauss semi-Markov mobility model based on visit density (3D-VDGMM). It uses the visit density to measure the UAVs' need to visit different spatial locations, which in turn influences the flight direction of the UAV. For any square S_{i,j,k}, the event density ρ_e equals the total number of events currently in the square, and the UAV density ρ_u equals the number of UAVs currently in the square. The visit density is:

ρ_v = ρ_e / (1 + ρ_u)    (1)

After introducing the access density, the three-dimensional velocity vector of the UAV in 3D-VDGMM is represented as:

x_n = α·x_{n−1} + (1 − α)·x̄ + sqrt(1 − α²)·(−1)^{β_x}·|x_{x_{n−1}, γ_x}|
y_n = α·y_{n−1} + (1 − α)·ȳ + sqrt(1 − α²)·(−1)^{β_y}·|y_{y_{n−1}, γ_y}|
z_n = α·z_{n−1} + (1 − α)·z̄ + sqrt(1 − α²)·(−1)^{β_z}·|z_{z_{n−1}, γ_z}|    (2)

β_{x,y,z} = 0 if ρ_{x+,y+,z+} ≥ ρ_{x−,y−,z−}, and 1 if ρ_{x−,y−,z−} > ρ_{x+,y+,z+}
γ_{x,y,z} = ⌊ log( max(ρ_{x−,y−,z−}, ρ_{x+,y+,z+}) / min(ρ_{x−,y−,z−}, ρ_{x+,y+,z+}) ) ⌋
ρ_{x−,y−,z−} = Σ_{i ∈ Square_negative} ρ_{x_i, y_i, z_i}
ρ_{x+,y+,z+} = Σ_{j ∈ Square_positive} ρ_{x_j, y_j, z_j}    (3)

where β_{x,y,z} and γ_{x,y,z} represent the parameters of the mobility model in the X, Y and Z dimensions, respectively. β_{x,y,z} directly decides the movement direction of the node with probability 1, while γ_{x,y,z} indirectly influences the movement distance of the node by affecting the probability distribution of that distance; both take effect when the node moves to a new position. log(·) is used to compare the magnitude of two numbers, and ⌊·⌋ denotes rounding down. ρ_{x+,y+,z+} and ρ_{x−,y−,z−} denote the sums of the grid densities of the front and rear faces of the node in the X, Y and Z dimensions, where the front is the positive direction of the X, Y and Z axes and the rear is the negative direction. As shown in Fig. 1, ρ_{x+,y+,z+} is the sum of the 9 grid densities of the cube face toward which the red arrow points.

Fig. 1. Monitor areas of UAV in 3D-VDGMM.
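The following Python sketch illustrates the visit-density computation of Eq. (1) and one 3D-VDGMM velocity-component update per Eqs. (2)–(3) as reconstructed above; the random term is modeled here as the absolute value of a Gaussian sample and the logarithm base is taken as 10, both of which are assumptions of this sketch rather than details confirmed by the paper.

```python
import math
import random

def visit_density(num_events, num_uavs):
    """Eq. (1): visit density of a square."""
    return num_events / (1 + num_uavs)

def vdgmm_step(v_prev, v_mean, rho_plus, rho_minus, alpha=0.75):
    """One velocity component update following Eqs. (2)-(3) as reconstructed."""
    beta = 0 if rho_plus >= rho_minus else 1
    hi, lo = max(rho_minus, rho_plus), max(min(rho_minus, rho_plus), 1e-9)
    gamma = math.floor(math.log10(hi / lo))            # log base assumed to be 10
    noise = abs(random.gauss(v_prev, max(gamma, 1)))   # assumed reading of |x_{x_{n-1}, gamma}|
    return alpha * v_prev + (1 - alpha) * v_mean + math.sqrt(1 - alpha ** 2) * (-1) ** beta * noise

# Densities summed over the 9 squares of the front/rear faces (made-up numbers).
print(vdgmm_step(v_prev=20.0, v_mean=15.0,
                 rho_plus=visit_density(3, 0), rho_minus=visit_density(1, 2)))
```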

4 Problem Description and Model Design

4.1 Problem Description and Conditional Assumptions In the emergency rescue scenario, the main task of UAV is the disaster monitoring and information transmission. Therefore, the disaster information and the terminal that needs to transmit information can be defined as an event. Due to the different levels of disasters and the number of terminals, the quantity of incidents generated in different regions at different times varies from each other. UAVs need to complete the capture of events as much as possible, while avoiding collisions with uncertain obstacles during the move‐ ment. Different from the general scenario of 3D-VDGMM, in the emergency rescue scene, events are mainly distributed in the two-dimensional plane of the ground. There‐ fore, this paper first abstracts the task area into network G, and makes the following definitions and assumptions:

Definition 1: U = {u_1, u_2, …, u_n} denotes the set of UAVs existing in the network, where n is the number of UAVs.
Definition 2: T = {t_1, t_2, …, t_k} denotes the set of events existing in the network, where k is the total number of randomly occurring events.
Assumption 1: Divide the network G into small squares. The monitoring range of each node is the 9 squares centered on the square onto which the UAV is vertically projected (Fig. 2).


Fig. 2. Schematic diagram of node access density in 2D plane.

Assumption 2: Due to constraints of UAV capability and mission, the UAV node's flying speed v must satisfy v ∈ [Vmin, Vmax].
Assumption 3: The UAV can perform horizontal all-round sensing and can sense events and obstacles as well as the distance to obstacles.


Assumption 4: The horizontal cross section of all obstacles in the dynamic uncertain environment is circular.
Assumption 5: No failure occurs during the task execution of the UAVs.

4.2 Task-Based UAV Movement Model

Because the events in the emergency rescue scene are mainly distributed on the ground and the UAV nodes only need to follow the natural Gauss-Markov motion in the vertical direction, the 3D-VDGMM is first reduced to a two-dimensional planar movement model. At this moment, the area that each UAV node needs to directly perceive and calculate is the nine squares centered on the vertical projection point of the UAV on the ground (Fig. 2). The positive forward direction of the UAV node in the two-dimensional plane is given by the sum of the three grid densities pointed to by ρ_{x+} and ρ_{y+} in Fig. 2, and the opposite direction by ρ_{x−} and ρ_{y−}. Therefore, Formula (2) and Formula (3) can be rewritten as:

x_n = α·x_{n−1} + (1 − α)·x̄ + sqrt(1 − α²)·(−1)^{β_x}·|x_{x_{n−1}, γ_x}|
y_n = α·y_{n−1} + (1 − α)·ȳ + sqrt(1 − α²)·(−1)^{β_y}·|y_{y_{n−1}, γ_y}|
z_n = α·z_{n−1} + (1 − α)·z̄ + sqrt(1 − α²)·z_{z_{n−1}}    (4)

{

0, 𝜌x+ ≥ 𝜌x− { 1, 𝜌x− > 𝜌x+ 0, 𝜌y+ ≥ 𝜌y− βy = 1, ⌊ 𝜌(y− > 𝜌y+ ) ⌋ max 𝜌x− , 𝜌x+ = log ( ) min 𝜌x− , 𝜌x+ ⌊ )⌋ ( max 𝜌y− , 𝜌y+ = log ) ( min 𝜌y− , 𝜌y+ 𝜌x+ = 𝜌1 + 𝜌2 + 𝜌3 𝜌x− = 𝜌7 + 𝜌8 + 𝜌9 𝜌y+ = 𝜌3 + 𝜌6 + 𝜌9 𝜌y− = 𝜌1 + 𝜌4 + 𝜌7 βx =

(5)

The definitions of β and γ are the same as formula (2) and (3). β determines the direction in which the UAV should fly, and γ determines the speed (distance) at which the UAV should fly. The specific motion of the UAV node is as follows: (1) Calculate the access density dv of 9 squares in the monitoring area according to formula (1); (2) Compare the access density of 9 squares in the monitoring area. If its vertical projection corresponds to the largest grid density, the node will not perform any

646

N. Wang et al.

operation, end the current movement, maintain hovering state at the original grid position, and carry out more detailed monitoring activities; (3) If the density of its own projection grid is not the largest, the next move position of the node will be calculated according to Formula (4) and (5). 4.3 3D-DUMM Model Section 4.2 improves the 3D-VDGMM model into a two-dimensional planar mobility model for emergency rescue scenarios, solving the problem of how the mobility model is task-oriented and improves event capture efficiency. However, in the emergency rescue scenario, the dynamic uncertainty of the environment directly affects the flight safety of the UAV swarm, so the mobility model must take it into consideration. The dynamic uncertainty environment referred to in this paper is mainly the obstacle affecting flight safety that exist in the rescue area, i.e. those obstacles whose height exceeds the UHV minimum flight height Hmin. In the 3D-VDGMM, a moving influence, is added as a Gaussian factor to the UAV’s original Gauss-Markov movement model, and the influence of the visit density on the UAV is positive, which means it will attract UAVs to be close to the location with high access density. Similarly, when considering obstacles in a dynamic uncertain environment, obstacles can also be added to the GaussMarkov model in a certain form to make them play a negative role for UAVs, which guides UAVs to be far away from the obstacles. Firstly, we expand Formula (4) to: ⎧ x = 𝛼x + (1 − 𝛼)̄x + 𝜇 𝜌(𝛽 , 𝛾 ) + (1 − 𝜇 )d(d ) n−1 x ( x x) x) ( x) ( ⎪ n 𝛾y + 1 − 𝜇y d dy ⎨ yn = 𝛼yn−1 + (1 − 𝛼)̄y + 𝜇y 𝜌 𝛽y ,√ ⎪ zn = 𝛼zn−1 + (1 − 𝛼)̄z + (1 − 𝛼 2 )zxn−1 ⎩

(6)

{ ( ) √ | | 𝜌 𝛽x , 𝛾x = (1 − 𝛼 2 )(−1)𝛽x |xxn−1 ,𝛾x | | | ) √ ( | | 𝜌 𝛽y , 𝛾y = (1 − 𝛼 2 )(−1)𝛽y |yxn−1 ,𝛾y | | |

(7)

( ) ) ( Where 𝜌 𝛽x , 𝛾x and 𝜌 𝛽y , 𝛾y are the Gaussian influence factors of the access density ( ) ( ) applied to the UAV nodes in the x and y directions respectively, d dx and d dy are the influence factors applied to the UAV nodes in the x and y direction for obstacles, dx and dy are coordinates relative to the obstacle, 𝜇x and 𝜇y are priority factors in the x and y directions, and they satisfy:

⎧ 1, d > dmax ⎪ 𝜇 = ⎨ 𝛿, dmin ≤ d ≤ dmax ⎪ 0, d < dmin ⎩

(8)

UAV 3D Mobility Model

647

That is, when the distance between the UAV and the obstacle in the x or y direction is less than dmin, the UAV enters the mode of obstacle avoidance priority. At this time, the mobility model only considers the influence of the obstacle regardless of the access density; when this distance is greater than dmax, the UAV enters the access priority mode. At this time, the mobility model only considers the impact of the access density in this direction regardless of the obstacle. When this distance is between dmin and dmax, UAV enters the hybrid mode where the mobility model considers the impact of access density and obstacles simultaneously. Considering Formula (6), (7) and (8) comprehensively, we can find that the problem ( ) ( ) turns into how to set the function d dx or d dy and how to select the parameter δ. Since dx and dy are coordinates of UAV nodes with respect to obstacle respectively, that is ( ) ( ) dx , dy = xn−1 − ox , yn−1 − oy , the direction of dx and dy is the direction in which the obstacle is applied to the UAV. With reference to the artificial potential field model, it is clear that the closer the UAV is to the obstacle, the greater the force is to distance it ( ) exerts. Therefore, taking the x direction as an example, the d dx can be defined in combination with the away direction:

| | | | | | ( ) √ d x 2 | | d dx = (1 − 𝛼 ) x |dx | || x , 1 || | | | n−1 2 | dx | |

(9)

dx 1 determines the direction in which the UAV should fly, while 2 |d | dx | x| determines the speed (distance) at which the UAV should fly. In conclusion, each UAV in the swarm updates the moving position according to Formula (6), (7), (8) and (9), which means it can realize a UAV three-dimensional mobility model oriented to a dynamic uncertain environment. In the formula,

5

Simulation Results and Analysis

The experimental area is set to three-dimensional in which the length, width and height are 10 km respectively, and the side length of each square is 100 m, The initial speed of the node is a uniform random variable between 50 m/s and 100 m/s with a maximum speed of 100 m/s. Nodes update their mobile status every 1 s time interval. First, the 3D-DUMM model was simulated in the task-oriented efficiency. The experiments were performed using one, two, and three UAVs, respectively, and the UAV flight time was 300 s. There are 100 events randomly distributed in the experimental area. The maximum number of event captures per unit time for each UAV is 1 and the event disappears when the event is captured by a UAV. The experimental results are shown in Figs. 3 and 4(a).

648

N. Wang et al.

(a)

(b)

Fig. 3. Trajectory of 1 and 2 UAV in accordance with the 3D-DUMM model.

(a)

(b)

Fig. 4. (a) Trajectory of 3 UAV in accordance with the 3D-DUMM model, (b) Trajectory of the UAV under dynamical and uncertain environment.

Figures 3(a), (b) and 4(a) show traces of event capture by 1 UAV, 2 UAVs, and 3 UAVs in a 3D-DUMM model, respectively. From the above three figures, it can be found that based on the mobility model proposed in this paper, the trajectories of UAVs are basically the same as the event’s distribution. As a result, it can achieve a higher event capturing rate during the task. During the flying process of UAVs, there are few phenomena such as sharp turns (except boundary conditions), and the trajectory is rela‐ tively smooth, which is closed to the actual movement law of UAV. In order to verify the ability to adapt to the dynamic and uncertain environment through 3D-DUMM, we further added obstacles under the same experimental condi‐ tions. The experimental results are shown in Fig. 4(b). Figure 4(b) shows the trajectory of 1 UAV according to the 3D-DUMM model with random events and obstacles in the experimental area. From Fig. 4(b), it can be found that, on the one hand, UAVs are highly sensitive to events and can pass through the areas

UAV 3D Mobility Model

649

with more events during the task. On the other hand, when UAVs encounter obstacles, they can make reasonable decisions between obstacles and events, and choose to capture more events on the basis of avoiding collisions with obstacles. In order to quantitatively analyze the efficiency of 3D-DUMM in task execution, the comparisons on event capturing efficiency for 3D-DUMM, 3D-VDGMM, and 3DGMM are shown in Fig. 5.

Fig. 5. Relationship between event capturing rate and running time under three different models.

Figure 5 shows the relationship between event capturing rate and running time when UAVs perform tasks under three different mobility models. From Fig. 5, it can be found that in 3D-DUMM and 3D-VDGMM, the event capturing rate is higher relative to 3DGMM. And they can capture 80% of total events in about 100 s, while 3D-GMM can only captures 50% of events during 200 s. This is because both 3D-DUMM and 3DVDGMM have added event traction to the model and have a high sensitivity to events. However, 3D-GMM only considers the law of UAV movement and has no sensitivity to events, so the process of capturing events is relatively random.

6

Conclusion

In order to effectively complete the rescue mission, the UAVs must be able to safely deal with complex environments first. Based on the 3D-VDGMM proposed in the earlier part of our research group, this paper fully considers the requirements of the rescue mission and the uncertainty of the environment, and designs 3D-DUMM in the further. Comprehensively combining the Gaussian factors of visit density and obstacles, the next direction of UAV can be calculated. Which ensures that UAVs can effectively capture

650

N. Wang et al.

target events while safely flying. The results of simulation show the validity and ration‐ ality of the model, and provide supports to evaluating the performance of three-dimen‐ sional environment UAV network. Next, detailed research on the data dissemination of UAV network based on 3D-DUMM will be conducted.

References 1. Erturk, M., Haque, J., Arslan, H.: Challenges of aeronautical data networks. In: Proceedings of IEEE Aerospace Conference, Montana, pp. 1–7, March 2010 2. Bujari, A., Calafate, C.T., Cano, J.C., et al.: Flying ad-hoc network application scenarios and mobility models. Int. J. Distrib. Sens. Netw. 13(10), 155014771773819 (2017) 3. Zaouche, L., Natalizio, E., Bouabdallah, A.: ETTAF: efficient target tracking and filming with a flying ad hoc network. In: International Workshop on Experiences with the Design and Implementation of Smart Objects, pp. 49–54. ACM (2015) 4. Sheng, Z., Ming-hui, Y., Yi, H., et al.: An exploration of evaluation of mobility model based on analytic hierarchy process in opportunistic network. J. Nanchang Hangkong Univ. Nat. Sci. 31(3), 15–22 (2017) 5. Broyles, D., Jabbar, A., Sterbenz, D.: Design and analysis of a 3-D Gauss Markov mobility model for highly-dynamic airborne networks. In: International Telemetering Conference, Las Vegas, NV, October 2009 6. Rohrer, J.P.: AeroRP performance in highly-dynamic airborne networks using 3D GaussMarkov mobility model. In: MILCOM 2011 Military Communications Conference, pp. 834– 841 (2011). ISSN 2155-7578, ISBN 9781467300797 7. Zheng, B., Zhang, H.Y., Huang, G.C., et al.: Design and implemention of a 3-D smooth mobility mode. J. Xidian Univ. 38(6), 179–184 (2011) 8. Kuiper, E., Nadjm-Tehrani, S.: Mobility models for UAV group reconnaissance applications. In: International Conference on Wireless and Mobile Communications, p. 33. IEEE (2006) 9. He, M., Chen, Q.L., Chen, X.L., et al.: Fish swarm inspired Ad hoc networks node random mobility optimization model in 3D environment. Chin. J. Sci. Instrum. 35(12), 2826–2834 (2014) 10. Regis, P.A., Bhunia, S., Sengupta, S.: Implementation of 3D obstacle compliant mobility models for UAV networks in ns-3, pp. 124–131 (2016) 11. Belkhouche, F., Bendjilali, B.: Reactive path planning for 3-D autonomous vehicles. IEEE Trans. Control Syst. Technol. 20(1), 249–256 (2012) 12. Yi, Z., Fan-yu, D., Yuan, L.: A local path planning algorithm based on improved morphin search tree. Electr. Opt. Control 23(7), 15–19 (2016) 13. Jenie, Y.I., Van Kampen, E.J., De Visser, C.C., et al.: Three-dimensional velocity obstacle method for UAV deconicting maneuvers. In: AIAA Guidance, Navigation and Control Conference, AIAA 2015-0592. AIAA Kissimmee (2015) 14. Zhang, G.M., Wang, N., Wang, R., et al.: UAV 3D mobility model based on visit density. J. Beijing Univ. Posts Telecommun. 40(s1), 112–116 (2017)

Acquiring Hidden Space via Modifying Block Bitmap for Android Devices Wang Lianfang1, Huang Hong2, Li Yuanzhang1, and Zhang Li3(&) 1

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China [email protected] 2 Troops 61516 of Chinese People’s Liberation Army, Beijing 100094, China 3 School of Computer and Information Technology, Nanyang Normal University, Nanyang 473061, China [email protected]

Abstract. Mobile devices are widely used to process sensitive data. In certain situations, the sensitive data must be hidden rather than be encrypted. The existing approaches of setting up hidden space are not suitable for advanced Ext4 file system because they may require external storage device. To address the issue, we propose a novel method to establish the hidden space in Ext4 via artificially modifying the block bitmaps. To further improve usefulness of our method, we modify the multiply bits of the block bitmaps one time by creating a “host file” rather than by bit. This method is lightweight and does not require modifying the linux kernel and has no effect on the normal operations of the operating system. To validate the method performance, distributions of hidden spaces under different storage capacity are conducted. The results show that our method is effective and reliable. Keywords: Hidden space

 Ext4  Block bitmap

1 Introduction With the popularity of smartphones and other mobile computing devices, the amount of personal data stored in mobile devices has increased. According to a CNNIC report [10], the number of Internet users in China reached 772 million as of Dec. 2017, and the proportion of mobile Internet users were as high as 97.5%. At the same time, the majority of the people would like to share the information and data by Internet; it also brings security issues such as privacy leaks. People are accustomed to store personal data (photos, videos, documents, passwords, etc.) in mobile devices. These sensitive data may become the targets of attacks. Encryption [2] can be utilized for protecting those sensitive data. However, with the development of forensic technology, encrypted data can be easily found by forensic tools, and it is difficult to prevent encrypted data from being cracked or being modified. Data hiding technology can store data in the redundant area of the storage medium. It makes hidden data difficult to find and ensures data security compared with encryption technology. © Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 651–660, 2018. https://doi.org/10.1007/978-3-030-05057-3_49

652

W. Lianfang et al.

Data hiding may be implemented in different abstraction layers [3], such as hardware layer, operating system layer, and application layer [4–6]. This paper focuses on data hiding in OS layer. With the wide use of the Android operating system on mobile devices, the huge security threats and data leakage risks become more serious. It is necessary to study how to hide data in the Ext4 file system because Ext4 has now become the default file system for most Android systems. The traditional hidden data acquisition methods read data from the file and directory slack space [7], preserved areas [8], and the file system timestamps [9] in the Ext4 file system. These methods have a common drawback: the obtained hidden space is too small. They cannot be applied in some real-world scenarios. For example, human rights worker takes the large amounts of video in conflicts areas and stores these data in hidden spaces of file system [10]. Knut et al. proposed a method to hide large amounts of data in journaling file system [11], but the method isn’t applied to the ext4 file system; PANG et al. implemented the StegFS system based on Linux kernel 2.4, but the system can only be used in ext2 [12]; Adam Skillen et al. designed the Mobiflage system [13] to hide encrypted volumes, but it requires an external storage space and is not applicable to some mobile models. In order to ensure the security and the feasibility of data hiding [14, 15], this paper proposes a new method of acquiring large-capacity hidden data in on the Ext4 file system. This method can modify the corresponding block bitmaps to acquire hidden space without affecting the normal operation of the operating system, when a large “host file” is created in the data partition. Our contributions in this paper include: 1. We analyze the characteristics of the Ext4 file system, and explore the mapping relationship and structural association between metadata and data blocks such as block bitmap and inode. 2. We propose a new method to acquiring a hidden space via modifying block bitmap. The rest of the paper is organized as follows. Section 2 presents the background. In Sect. 3, we explore methods for acquiring hidden spaces and describe the system design. In Sect. 4, we discuss the implementation for Android. Section 5 presents the results of functional evaluation and performance evaluation. Section 6 is the conclusion and future work.

2 Background 2.1

Overview of the Ext4 File System

A storage device is divided into an array of logical blocks in Ext4 with the default block size of 4 K. These blocks are organized into block groups and each block group consists of metadata blocks and data blocks as displayed in Fig. 1 [16]. An inode contains the metadata of a file, such as timestamps, user and group permissions, as well as pointers to data blocks [17]. A superblock contains metadata of file system and a group descriptor stores metadata of a particular block group. In addition, each block group has a data block bitmap, which records the allocation of data blocks in this group.

Acquiring Hidden Space via Modifying Block Bitmap

653

super block

Partition

group descriptors

boot loader

reserved group descriptors

block group 0

block bitmap

block group 1

inode bitmap inode table

...

data block ...

block group N

data block

Fig. 1. Ext4 file system structure.

Ext4 uses an extended extent tree to store the file’s logical location on disk, rather than the triple indirect pointer used in previous versions. After obtaining the inode number of the file, the logical location of the file can be found according to the extent tree. 2.2

Extent Tree

The extent tree has two types of node, including index node and leaf node. Each node stores a number of 12-byte data items. Extent_node Ext4_inode

Index_node Extent_header Extent_Index

Extent_header

Extent_Index

A

Extent_header Extent

B

Extent Extent

C

Extent

Extent_Index

D Extent_header Extent

E

Extent Extent

Host file

F

Extent

G H

Fig. 2. The Ext4 extent tree structure.

654

W. Lianfang et al.

There are three types of data items: header, index, and extent. Each node has a fixed header. The index node consists of a number of index items. Each index item points to a leaf node or index. There are several extents after the header in the leaf node. Each extent locates a continuous data space on the hard disk. If the size of a file exceeds the value that an extent can represent, or if there are discrete blocks, multiple extents need to be used. The root of the extent tree is stored in the i_block field of the inode. It is an array of 60 bytes and can hold one header and four extents at most. All the data of a file can be acquired by the extent tree traversal. The data blocks in a “host file” can be cited by the extent tree as shown in Fig. 2.

3 Design 3.1

Block Bitmap

A partition is divided into standard-sized blocks and then the block bitmap is used to record each block usage. That is to say, if the block has already been allocated, the corresponding bitmap bit is set to 1, otherwise it is set to 0 (see Fig. 3). data blocks 1 1 0 0 0 0 0 0

1 1 0 0 0 0 0 0

1 1 0 0 0 0 0 0

1 1 0 0 0 0 0 0

1 1 0 0 0 0 0 0

free block

1 1 0 0 0 0 0 0

1 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0

bitmap block

allocated block

Fig. 3. Bitmap mapping block.

We can use this structure to implement hidden data space acquisition. Specifically, the bitmap bits of some unallocated blocks can be artificially set to 1. Thus, these blocks can be regarded as the allocated ones from the operating system’s point of view, so the data stored in them can be hidden. Figure 4 show the process of hiding data by modifying the block bitmap. However, there are several difficulties to be considered: 1. For different storage devices, the unallocated blocks are likely to be different. Finding the right data block for each device will take a long time. This method isn’t flexible.

Acquiring Hidden Space via Modifying Block Bitmap

655

2. In order to set up a large hidden space, such as 4G, 1048576 bitmap bits need to be modified and it is easy to make mistakes. 3. To ensure the security of the hidden spaces, they should be scattered in the storage device rather than a continuous area. This is a huge challenge for directly modifying the bitmap information.


Fig. 4. Acquiring hidden space.

However, the storage characteristics of the Ext4 file system help us solve the above problems. In the Ext4 file system, each file is stored in a series of data blocks, and the data of a file larger than one block may be scattered across the storage area. Based on this fact, this paper proposes an effective method that creates a “host file” to serve as a hidden data space. When a file is created, enough data blocks are allocated to store the file data, and their block bitmap bits are all set to 1, indicating that these blocks have been allocated. As long as the file exists, these blocks will not be reallocated, so we can store hidden data in these areas. The rest of this section shows how to find these hidden spaces from the file name of the “host file”.

3.2 Hidden Space Calculation

The method proposed in this paper sets up hidden space by creating a “host file” that serves as a data carrier.

(Flowchart: read the offset of the inode table from the block group descriptor; read the inode table and then the extent header; while eh_depth ≠ 0, read ei_leaf from the ext4_extent_idx and descend; when eh_depth = 0, read ee_start and ee_len from the ext4_extent, calculate the offset, and keep it.)

Fig. 5. The process of hidden space calculation.


When a “host file” is created, Ext4 checks the block bitmap and allocates free blocks to the file, modifying the block bitmap accordingly. We need to calculate the addresses of these data blocks and their lengths, and record them as the offsets of the hidden spaces. The calculation process is shown in Fig. 5.

Step 1: the offset of the inode table is read from the block group descriptor. The block group descriptor table is stored in the first logical block, and each block group descriptor occupies 32 bytes. The 4-byte bg_inode_table field stores the first block number of the inode table. The offset of the target inode is then calculated using Eq. (1), in which the values of s_block_size and s_inode_size are read from the superblock.

offset_i_tb = bg_inode_table × s_block_size + (i_num − 1) × s_inode_size    (1)

Step 2: the extents that point to data blocks are read by analyzing the inode table. If the value of the eh_depth field in the extent header is not 0, the next node of the extent tree is read, until eh_depth is 0, which indicates that the extents point directly to data blocks. A large file may be highly fragmented and therefore needs an extent tree to map its data blocks into different block groups.

Step 3: we generate a two-dimensional array that stores the value of the ee_start field and the corresponding length (ee_len). Based on this array, the hidden spaces can be set up. Figure 6 shows an example of the array that stores the offsets of the hidden spaces.

Address  12288  4096  1640448  1644544  1738752  1742848  1800192  1808384  1837056
Length   2048   8192  2048     6144     2048     2048     26624    30720    30720

Fig. 6. Array that stores offsets of the hidden space.
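To make the three steps concrete, the following sketch (ours, not the authors' blc tool) computes the inode-table offset of Eq. (1) and walks the extent tree depth-first, collecting (ee_start, ee_len) pairs like those in Fig. 6. It reuses the on-disk structures sketched in Sect. 2.2; read_block() is a hypothetical helper that reads one file-system block from the raw device, and the superblock values are assumed to have been parsed already.

#include <stdint.h>

extern uint64_t s_block_size;       /* block size, from the superblock        */
extern uint64_t s_inode_size;       /* inode size, from the superblock        */
extern uint64_t bg_inode_table;     /* first block of the group's inode table */
extern void read_block(uint64_t block_no, void *buf);   /* hypothetical raw read */

/* Eq. (1): byte offset of inode number i_num inside this block group. */
uint64_t inode_offset(uint64_t i_num) {
    return bg_inode_table * s_block_size + (i_num - 1) * s_inode_size;
}

/* Depth-first walk of an extent-tree node; records (start, length) pairs. */
void collect_extents(const void *node, uint64_t offsets[][2], int *count) {
    const struct ext4_extent_header *h = node;
    if (h->eh_depth == 0) {                        /* leaf: extents map data blocks */
        const struct ext4_extent *e = (const struct ext4_extent *)(h + 1);
        for (int i = 0; i < h->eh_entries; i++) {
            uint64_t start = ((uint64_t)e[i].ee_start_hi << 32) | e[i].ee_start_lo;
            offsets[*count][0] = start;            /* address of the hidden area   */
            offsets[*count][1] = e[i].ee_len;      /* length in blocks             */
            (*count)++;
        }
    } else {                                       /* index node: descend further  */
        const struct ext4_extent_idx *ix = (const struct ext4_extent_idx *)(h + 1);
        for (int i = 0; i < h->eh_entries; i++) {
            uint32_t child[1024];                  /* 4 KB buffer, assumes 4 KB blocks */
            uint64_t leaf = ((uint64_t)ix[i].ei_leaf_hi << 32) | ix[i].ei_leaf_lo;
            read_block(leaf, child);
            collect_extents(child, offsets, count);
        }
    }
}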

4 Implementation

In this article, the process of setting up hidden space can be divided into three steps. First, we create an empty file in the Ext4 file system as a “host file”; second, we calculate the addresses of the blocks occupied by the file and save them in a two-dimensional array; finally, we delete the inode of the file to hide the “host file”.

Table 1. The functions of the process.

Step  Parameter  Function
1     -n a       Create a file of size a (GB) in the /data directory, named hidden.txt
2     -f i_n     Display all the extent trees occupied by the file with inode number i_n, and save the information of the extent trees into an array
3     -d i_n     Conceal the “host file”
      -w n       Write the hidden blocks with the number n
      -b N       Display all contents of the Nth block in hexadecimal


We wrote a JNI program named blc with different parameters (see Table 1), which is compiled and run on a mobile device. In addition, two further programs were written to evaluate the performance of blc. Ext4 is a journaling file system, so user operations are recorded in the journal. To prevent the hidden space from being overwritten, we run the JNI program in recovery mode.

4.1 Create “Host File”

The user enters the command blc -n a, and the program runs the Linux shell command dd if=/dev/zero of=/data/text.txt bs=1G count=a to create a “host file” of a GB. This makes the method of setting up hidden space efficient and flexible, because the input parameter a can be changed: the user can set up a large-capacity hidden space when the value of a is large.

4.2 Save Hidden Space

The user changes the working directory to /data and enters the command ls -i to get the inode number of the text.txt file, denoted i_n. The position of the inode table is then calculated. After the command blc -f i_n is executed, the extent header is read to determine the type of each extent node: if the node is an index, the program continues reading the next node of the extent tree; otherwise the starting block number and the length of the data blocks are stored in the array.

4.3 Conceal “Host File”

The command ls -lt is used to display the basic information of the “host file” named text.txt in the current directory (Fig. 7). However, the file should be hidden for safety reasons. To achieve this goal, we delete the inode of the file rather than the file itself, so as to keep its data blocks invisible. When the command blc -d i_n is executed, the directory containing text.txt is located; then the i_nth index node is found and the file name stored in it is cleared. Thus, the “host file” disappears from anyone else's point of view.

dreamlte:/data # ls -lt
total 62925296
-rw-rw---- 1 root sdcard_rw ... 2018-05-20 17:02 text.txt

Fig. 7. Basic information of the “host file” in the directory.
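For illustration only, concealment could operate on Ext4's on-disk directory entries roughly as sketched below. The structure follows the ext4_dir_entry_2 layout in the Ext4 documentation [12]; the scanning helper is our assumption of how a name might be cleared, since the paper does not detail the internals of blc -d.

#include <stdint.h>
#include <string.h>

/* On-disk Ext4 directory entry (variable-length records within a directory block). */
struct ext4_dir_entry_2 {
    uint32_t inode;          /* inode number of the entry              */
    uint16_t rec_len;        /* length of this record in bytes         */
    uint8_t  name_len;       /* length of the file name                */
    uint8_t  file_type;
    char     name[255];      /* file name (not NUL-terminated on disk) */
};

/* Scan one directory block and blank the name of the matching entry,
 * so a plain `ls` no longer shows the host file. Returns 1 on success. */
int hide_dir_entry(uint8_t *dir_block, uint32_t block_size, const char *fname) {
    uint32_t off = 0, len = (uint32_t)strlen(fname);
    while (off + 8 <= block_size) {
        struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)(dir_block + off);
        if (de->rec_len == 0) break;                     /* malformed block */
        if (de->name_len == len && memcmp(de->name, fname, len) == 0) {
            memset(de->name, 0, de->name_len);           /* clear the name  */
            de->name_len = 0;
            return 1;
        }
        off += de->rec_len;                              /* next entry      */
    }
    return 0;
}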


5 Evaluation

In this section, the effectiveness and performance of the hidden space approach are evaluated. A Samsung S8 development phone (Android 7.0, kernel 4.4.13, 64 GB internal storage) is used as the test environment.

5.1 Functional Evaluation

In order to allow the user to manage the hidden space from the operating system, there are two prerequisites: one is to acquire root privilege on the operating system, and the other is to disable the SELinux security system. Theoretically, the operating system will not overwrite the artificially hidden data in normal day-to-day use, because the bitmap bits of these hidden spaces remain 1. We test whether the hidden data spaces may be overwritten during normal operation. The experimental steps are as follows, and the result is shown in Fig. 8.

dreamlte:/data/local/tmp # ./blc -b 594124
[0000] ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
[0016] ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
[0032] ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
[0048] ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee

Fig. 8. Data read from the hidden spaces.

1. Set up the hidden space with the proposed method.
2. Execute the command blc -w 238 to fill the blocks in the hidden spaces with the value 238 (0xee), and restart the system.
3. Repeatedly execute cp /storage/fill.mp4 /data/i.mp4 to write a 1 GB media file into the storage device until the file system is full of data, then restart the system.
4. Execute the command blc -b N (where N is an arbitrary block number in the hidden spaces) to read data from the hidden spaces and compare it with the written data.

The result shows that the user's normal operations (including restarting the system many times) do not overwrite the hidden space data.

5.2 Performance Evaluation

The size of the hidden spaces depends on the total capacity of the file system and the number of free blocks. In order to examine the distribution of the hidden spaces, we write test data into a file system with a total capacity of 54 GB in 5 groups, leaving available space of 40 GB, 30 GB, 20 GB, 10 GB, and 5 GB respectively. By examining the superblock, we can see that the total number of blocks in the file system is 14,296,059 and that they are divided into 437 block groups. After that, we set up a 4 GB hidden space using our method 5 times.



Fig. 9. The distribution of the hidden space.

Figure 9 shows the distribution of the hidden spaces on the storage device, in which the block group numbers increase from left to right and from bottom to top. To reduce the performance loss of the file system due to fragmentation, the block allocator tries to keep all the blocks of each file within the same group, which reduces seek times. From Fig. 9, we can see that consecutive block groups are allocated to a large file as much as possible. Once the available storage space is limited, the large file is divided into multiple discrete areas.

6 Conclusion and Future Work

In this paper, we analyze the characteristics of the Ext4 file system and then propose a flexible method to set up a large-capacity hidden space based on its file allocation mechanism. This method does not require modification of the Linux kernel and has little effect on the normal operation of the operating system. A JNI program is developed to implement the method. The experimental results show that the hidden spaces are not overwritten by the operating system and that no data is lost after the mobile phone is restarted. Our proposed method still has several limitations that prevent it from being applied in real scenarios:

1. The established hidden spaces are not easy to delete at the user level.
2. The “host file”, although concealed, is still easy to detect with file system checking software.

In the future, we will try to address these limitations and then apply the established hidden space in real scenarios. At the same time, we will consider setting up multiple host files to increase security.


Acknowledgments. This work was supported in part by the key scientific research program of He’nan Education Department of China (No. 61361166006).

References

1. CNNIC Homepage. http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201803/t20180305_70249.htm
2. Pang, H., Tan, K.L., Zhou, X.: StegFS: a steganographic file system. In: 2003 Proceedings of International Conference on Data Engineering, pp. 657–667 (2003)
3. Göbel, T., Baier, H.: Anti-forensics in ext4: on secrecy and usability of timestamp-based data hiding. Dig. Investig. 24, S111–S120 (2018)
4. Zhang, X., Tan, Y., Zhang, C., Xue, Y., Li, Y., Zheng, J.: A code protection scheme by process memory relocation for android devices. Multimedia Tools Appl. 77(9), 11137–11157 (2018)
5. Xiao, Y., et al.: A high-performance hierarchical snapshot scheme for hybrid storage systems. Chin. J. Electr. 27(1), 76–85 (2018)
6. Xiao, Y., Zhang, C., Xue, Y., Zhu, H., Li, Y., Tan, Y.: An extra-parity energy saving data layout for video surveillance. Multimedia Tools Appl. 77, 4563–4583 (2018)
7. Skillen, A., Mannan, M.: On implementing deniable storage encryption for mobile devices (2013)
8. Carrier, B.: File System Forensic Analysis. Addison-Wesley Professional, Boston (2005)
9. Piper, S., Davis, M., Manes, G., Shenoi, S.: Detecting hidden data in Ext2/Ext3 file systems. In: Pollitt, M., Shenoi, S. (eds.) DigitalForensics 2005. ITIFIP, vol. 194, pp. 245–256. Springer, Boston, MA (2006). https://doi.org/10.1007/0-387-31163-7_20
10. Sun, Z., Zhang, Q., Li, Y., Tan, Y.: DPPDL: a dynamic partial-parallel data layout for green video surveillance storage. IEEE Trans. Circ. Syst. Video Technol. 28(1), 193–205 (2018)
11. Eckstein, K., Jahnke, M.: Data hiding in journaling file systems. In: Refereed Proceedings of the Digital Forensic Research Workshop, DFRWS 2005, Astor Crowne Plaza, New Orleans, Louisiana, USA, pp. 595–599, August 2005
12. Wong, D.J.: Ext4 Disk Layout - Ext4 Wiki (2016). https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. Accessed 1 Oct 2017
13. Forensic Research Workshop, DFRWS 2005, Astor Crowne Plaza, New Orleans, Louisiana, USA, pp. 595–599, August 2005
14. Xue, Y., Tan, Y., Liang, C., Zhang, C., Zheng, J.: An optimized data hiding scheme for deflate codes. Soft. Comput. 22(13), 4445–4455 (2018)
15. Yu, X., Tan, Y., Sun, Z., Liu, J., Liang, C., Zhang, Q.: A fault-tolerant and energy-efficient continuous data protection system. J. Ambient Intell. Humanized Comput. (2018). http://dx.doi.org/10.1007/s12652-018-0726-2
16. Neuner, S., Voyiatzis, A.G., Schmiedecker, M., et al.: Time is on my side: steganography in filesystem metadata. Dig. Investig. 18, S76–S86 (2016)
17. Fairbanks, K.D.: An analysis of Ext4 for digital forensics. Dig. Investig. 9, S118–S130 (2012)

Interest Relevance-Based Caching Design in Content-Centric Networking

Guozhi Zhang1,2, Jiqiang Liu1, Xiaolin Chang1(✉), and Yang Yang1

1 Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing 100044, China
{zhangguozhi,jqliu,xlchang,16112082}@bjtu.edu.cn
2 The School of Computer and Engineering, Northwest Normal University, Lanzhou 730070, China

Abstract. Among the existing Content-Centric Networking (CCN) caching schemes, the most important category is popularity-based schemes, which perform better than non-popularity-based ones in terms of cache-hits. However, these existing popularity-based caching schemes assume that they provide services for a single type of application and that content requests (interests) conform to a Zipf-like distribution. Although a Zipf-like request distribution has been validated in many network applications, this distribution may not exist at the node level in CCN when there are multiple types of upper-level applications in the network. Once the Zipf-like traffic feature becomes less obvious, the existing popularity-based caching schemes cannot work well. Therefore, how to predict the content requests (interests) for each node becomes a key problem of caching design. In this paper, we use the application-level relevance of interests to assist the caching design, rather than just relying on names. We propose a scheme (named ICDCS) based on interest/content tag analysis, in which the contents/interests produced by multiple types of upper-level applications carry tags as part of their names, and a measuring mechanism is designed to count tags and predict the trend of interests. Our scheme can be well combined with existing approaches and improve their caching performance. Simulations over various system parameters are done to validate the effectiveness and efficiency of ICDCS.

Keywords: Tagging system · Content-centric networking · Cache allocation strategy · Naming design

1 Introduction

One of the fascinating features of Content-Centric Networking (CCN) [1] is in-network caching, which uses router/node caches to trade storage for communication overhead. The key idea of in-network caching is to predict the demand of consumers correctly and then cache content in the "best" way at the locations users are interested in. Among the existing CCN caching schemes, popularity-based schemes perform better than non-popularity-based ones in terms of cache-hits. The popularity-based schemes work well only in the scenario where the more popular a content is, the more frequently it is requested in the time dimension [2]. There are two major problems with these existing popularity-based caching schemes:

(1) They assumed that request (named interest) arrivals for a content in a region (e.g., an AS) follow a Zipf distribution [3] at the file level. In addition, they assumed that the Zipf-like traffic pattern still exists in CCN. The latter assumption is reasonable only when there is a single type of application in the network. When multiple types of applications co-exist in the network, the latter assumption does not hold, even if the content request arrivals of each application still conform to a Zipf-like distribution at the file level. The traffic mix at the chunk level makes the CCN traffic distribution more flat and complex [4]. Namely, it is hard, if not impossible, to analyze CCN traffic patterns in the scenario where there are multiple types of applications.

(2) They measured the popularity of a content by relying on the CCN naming design. Some schemes applied a flat naming design, in which different interest packets are considered the same if and only if their request prefixes are the same. However, there may exist potential relationships between contents with different names. Some other existing schemes applied a hierarchical naming design [5]. In this design, to decide whether different interest packets are the same, the nodes need to be able to parse the name structure, and even the semantic relations between names. However, this hierarchical naming design faces difficulties in terms of efficiency and effectiveness.

These discussions suggest that the existing popularity-based caching strategies may not work effectively and efficiently in CCN. This is also validated by our later simulations, in which the traffic of various flows is mixed and chunked. This paper considers the scenario where multiple types of upper-level applications co-exist [4]. Therefore, knowledge of the upper-level applications' traffic characteristics at the node level will help the caching decision. The analysis in the previous paragraphs suggests the weakness of the naming system in obtaining traffic characteristics at the node level. Thus, a new method for marking traffic features should be explored. All the above analysis forms the motivation of the work in this paper.

We propose a tag-based caching allocation scheme (named ICDCS, Interest Characterize-tags Distribution based Cache-decision Scheme), which exploits tag information to predict content requests. Tag information in traffic is generated at consumers or producers according to pre-defined tags. Simulations over various system parameters are done to validate the effectiveness and efficiency of ICDCS. We summarize the major contributions as follows.

(1) We design the ICDCS caching scheme, which consists of three components: (i) a counting Bloom filter-based tag counting component, which aims to reduce the popularity calculation overheads; (ii) a fitness quantifier, which uses cosine distance to quantify the suitability of content; and (iii) piggybacking and path collaboration, which aim to reduce communication overheads. Each component works in linear time; therefore, ICDCS has low overheads.


(2) We explore a tagging system to assist the caching decision in ICDCS. Tag information in traffic helps predict the trend of interests and helps nodes evaluate whether a content is suitable to be cached. To the best of our knowledge, we are the first to introduce a tagging system into CCN caching design. Note that exploiting a tagging system can alleviate the contradiction between caching design and naming design, and it also brings a new perspective to other mechanism designs such as CCN routing and retrieval.

Note that this paper applies a well-developed collaborative tagging system [6] to generate appropriate tags for content. This tagging system is a powerful tool with which users can tag content while they upload a file. It is widely deployed in the Web and Online Social Networks (OSNs) for content recommendation, classification, and prediction with very low overheads. In our later simulations, 20 tags are randomly selected and assigned to each topic. When a consumer/provider requests a content, he/she selects 4–7 tags uniformly from the pre-defined 20 tags.

The rest of the paper is organized as follows. Section 2 gives the background and related work. Section 3 describes the ICDCS scheme. Section 4 presents the simulation results. Section 5 concludes the paper.

2 Background and Related Work

Because content requests (interests) have temporal and spatial features (such as following Zipf-like distributions), many studies (see [7]) have attempted to exploit these features to design caching strategies. This has made popularity-based strategies the most important type of caching strategy. In these strategies, it is assumed that all interests follow the Zipf distribution, defined as P(x) = C / x^α, where α reflects the concentration of the distribution.

Table 1. The distribution assumption in some studies

Reference            Distribution assumption
Wang et al. [8]      Zipf with α ≤ 1
Pacifici et al. [9]  Zipf with α = 1
Duan et al. [10]     Zipf with α = *
Wang et al. [11]     Zipf with α = *
*: The value cannot be obtained from the paper.
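For reference, a small sketch (ours) of the normalized Zipf popularity used throughout this literature: the content of rank x is requested with probability P(x) = C/x^α, with C chosen so that the probabilities sum to 1.

#include <math.h>

/* Fill p[0..n-1] with Zipf probabilities for ranks 1..n and exponent alpha:
 * P(x) = C / x^alpha, with C chosen so the probabilities sum to 1. */
void zipf_popularity(double *p, int n, double alpha) {
    double c = 0.0;
    for (int x = 1; x <= n; x++)
        c += 1.0 / pow((double)x, alpha);   /* 1/C = sum over 1/x^alpha */
    for (int x = 1; x <= n; x++)
        p[x - 1] = (1.0 / pow((double)x, alpha)) / c;
}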

In-network caching is designed as a fundamental function of the router in CCN. Each node receives content requests (named interests) and forwards them, and it processes returned contents and determines whether to cache them. Based on studies of traditional networks [12–14], popularity-based caching design has also become one of the important approaches in CCN [3]. The common feature of these studies is to draw on the popularity models of traditional networks and try to cache the popular content on path nodes; their differences lie in the factors used and the popularity measurement methods.


In all these popularity-based caching designs, there is a strong assumption: the popularity patterns observed in traditional networks still exist in CCN. Table 1 lists some of the premises and parameters used in their cache models. These studies assumed that requests conform to a Zipf-like distribution, but with different exponent parameters. However, when the soundness of this assumption is in question, the basis for the design of these strategies is shaken.

3 ICDCS Caching Scheme

This section presents the details of ICDCS. For convenience, Table 2 lists the symbols and notation used later.

Table 2. Symbols and notations

Notation    Description
G           Graph of the topology
E           The edge set of G
V           The node set of G
i           The prefix i of a content/interest
I_i         The interest with prefix i
C_i         The content with prefix i
|C_i|       The volume of content C_i
P_{x→y}     The path from x to y
|P_{x→y}|   The length of P_{x→y}

3.1 Core Idea of ICDCS

Content requests are closely related to each other at the application and social level; that is, when consumers in an area request certain content within a certain period, other relevant content may also be requested. This is mainly because these contents belong to the same topic: they have similar content characteristics or social attributes, because they are closely related to the events that generate them. Is there a way for us to find and count the implicit relevance between requests rather than rely solely on name statistics? In this work we design a simple method based on tag vector distance to calculate the caching fitness of a content at a node. Figure 1 shows the working process of a router/node deploying ICDCS. The core of ICDCS includes: (i) when a node receives an interest/content, it counts tags and maintains its own Tag Counting Component (TCC); and (ii) in the caching decision step, router x decides whether to cache the content according to its TCC. The details of the three key components of ICDCS are given in the following.

3.2 Tag Counting Component Based on Counting Bloom Filter

To count the tags carried by interests and calculate the content fitness effectively, it is necessary to design an efficient and low-cost tag counting component. The Counting Bloom Filter (CBF) is an efficient data structure with a low probability of false positives. Based on the CBF, we implement the TCC to achieve tag recording, retrieval, and counting. As shown in Fig. 2, the CBF is the core of the TCC, and the counting ability of the TCC depends on the size of the CBF. In our simulation, the TCC can count more than 10,000 tags, each with up to 2^32 counts over a period. Note that the fitness quantifier discussed below also depends on the CBF.
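As an illustration (ours, not the authors' implementation), a minimal counting Bloom filter for tag counting might look as follows; the number of counters and hash functions are arbitrary choices.

#include <stdint.h>

#define CBF_SIZE   65536        /* number of counters (assumption)       */
#define CBF_HASHES 4            /* number of hash functions (assumption) */

typedef struct { uint32_t cnt[CBF_SIZE]; } cbf_t;

/* Simple FNV-1a hash, salted per hash-function index. */
static uint32_t cbf_hash(const char *tag, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    for (; *tag; tag++) { h ^= (uint8_t)*tag; h *= 16777619u; }
    return h % CBF_SIZE;
}

/* Record one occurrence of a tag carried by an interest. */
void cbf_add(cbf_t *f, const char *tag) {
    for (uint32_t i = 0; i < CBF_HASHES; i++)
        f->cnt[cbf_hash(tag, i)]++;
}

/* Approximate count of a tag: the minimum over its counters. */
uint32_t cbf_count(const cbf_t *f, const char *tag) {
    uint32_t min = UINT32_MAX;
    for (uint32_t i = 0; i < CBF_HASHES; i++) {
        uint32_t c = f->cnt[cbf_hash(tag, i)];
        if (c < min) min = c;
    }
    return min;
}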

Fig. 1. The working process of a node

3.3 Fitness Quantifier

ICDCS uses the distance between a content C_i received at node x and the tags of all interests received by this node in the past period to measure the similarity between C_i and the received interests. Cosine distance is used to compute this similarity (also called the fitness degree) between the content and the received interests. Cosine distance captures the difference in direction but is insensitive to absolute values; therefore, it can absorb errors caused by non-uniform measurement criteria. Before explaining how to calculate the fitness degree D_x(C_i) of C_i at node x, some notation is presented. TS^x = {t_1^x, ..., t_K^x} denotes the tag space at node x, where t_i^x is a tag and K denotes the size of the tag space. Let T(C_i) = {t_1^i, ..., t_n^i} ⊆ TS^x be the tag set of content C_i, where n is the number of tags of C_i. TCT^x = {⟨t_1^x, tc_1^x⟩, ..., ⟨t_K^x, tc_K^x⟩} denotes the interest tag counting table of node x, where tc_i^x is the counting value of tag t_i^x; this value is also considered the weight of tag t_i^x. Accordingly, T(C_i) = {t_1^i, ..., t_n^i} can be represented as TCT(C_i) = {⟨t_1^i, 1⟩, ..., ⟨t_n^i, 1⟩}, because the tags carried by a content cannot be repeated. With these definitions, we define D_x(C_i) as:


D_x(C_i) = cos(TCT^x, TCT(C_i)) = ( Σ_{t_i ∈ TS^x ∩ T(C_i)} tc_i ) / √( n · Σ_{t_i ∈ TS^x ∩ T(C_i)} (tc_i)² )    (1)

The greater D_x(C_i), the more likely C_i is to be requested. Note that D_x(C_i) can have different meanings, depending on the meanings of the tags. For instance, in OSN-like applications, tags can represent the social relations or semantics of content; in geography-related applications, they can reflect the geographic location of content; in video sharing applications, they can reflect the description of content.
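A sketch (ours) of the fitness computation of Eq. (1), reusing the counting Bloom filter sketch from Sect. 3.2 as the tag counting table: the numerator sums the counts of the content's tags seen at the node, and the denominator normalizes by the content's n tags and the magnitude of those counts.

#include <math.h>

/* Fitness degree D_x(C_i) of a content whose tags are tags[0..n-1],
 * following Eq. (1): sum of counts over sqrt(n * sum of squared counts). */
double fitness_degree(const cbf_t *tct, const char *const tags[], int n) {
    double sum = 0.0, sum_sq = 0.0;
    for (int j = 0; j < n; j++) {
        double c = (double)cbf_count(tct, tags[j]);  /* tc_i for this tag */
        sum += c;
        sum_sq += c * c;
    }
    if (sum_sq == 0.0) return 0.0;       /* no overlap with received interests */
    return sum / sqrt((double)n * sum_sq);
}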

Fig. 2. The node structures

3.4 Piggybacking of Decision Information

D_x(C_i) is computed along the path from providers to consumers. This value suffices for making the decision, since the caching scope is limited to the path. The process includes two steps:

(i) The node obtains the cumulative fitness Σ D_w(C_i) carried by the received interest; that is, when a node receives an interest, it perceives the interest of all downstream nodes along the path. MD_x(C_i) = Σ_w D_w(C_i) / |P^V_{x→y}(i)| denotes the average fitness of C_i along the path P^V_{x→y}(i). Here, |P^V_{x→y}(i)| is the length of P^V_{x→y}(i), w is a node on P^V_{x→y}(i), and D_w(C_i) is the fitness of C_i at w. When node x receives C_i, it first calculates D_x(C_i) and MD_x(C_i); then D_x(C_i) and MD_x(C_i) are compared to decide whether to cache the content.


(ii) Once a node receives an interest and the response conditions are met, it returns the content, in which the decision information can be carried. The downstream nodes then decide whether to cache the content according to the threshold. Let l denote the length of P^V_{x→y}(i). {v | D_v(C_i) ≥ β · MD_x(C_i)} denotes the node set of duplicates of C_i, where β is a parameter used to balance the quantity and the cost of duplicates. To increase randomness and improve the effectiveness of the algorithm, if a node belongs to {v | D_v(C_i) ≥ β · MD_x(C_i)}, the content is cached with probability P = l/L, where L is the network diameter being evaluated.

ICDCS produces some communication and computation overheads while bringing a series of advantages, but the overheads are small. In our later simulation, we select 4–7 tags as the set of characteristics, and each tag occupies 8 bytes; thus the maximum tag cost of each content is 56 bytes.
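Putting the pieces together, a hedged sketch (ours) of the per-node caching decision: the packet carries the accumulated fitness of downstream nodes, and a node caches the content when its own fitness is at least β times the path average, with probability l/L.

#include <stdlib.h>

/* Decide whether node x caches content C_i.
 * d_x        : fitness D_x(C_i) at this node
 * sum_d_down : cumulative fitness carried in the packet (sum over downstream nodes)
 * path_len   : l = |P_{x->y}(i)|, number of hops seen so far (>= 1)
 * beta       : decision threshold; L: network diameter (assumed known). */
int cache_decision(double d_x, double sum_d_down, int path_len,
                   double beta, int L) {
    double md = sum_d_down / path_len;            /* MD_x(C_i): average fitness    */
    if (d_x < beta * md)                          /* below threshold: do not cache */
        return 0;
    double p = (double)path_len / (double)L;      /* cache with probability l/L    */
    return ((double)rand() / RAND_MAX) < p;
}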

4 Simulation and Performance Evaluation

This section evaluates the capability of ICDCS by comparison with IFDD [2], LCE, and POPULAR on ndnSIM [15]. LCE is the default non-popularity-based caching scheme in ndnSIM. POPULAR is a simple popularity-based mechanism proposed in this paper: each node counts the received interest prefixes for a period and then sorts these prefixes in descending order; when a content is received and its prefix is in the highest range, it is cached, otherwise it is discarded or cached with a certain probability.

Fig. 3. The topology of simulations

We first build a traffic model to emulate OSN-like applications [16]. Two OSN-like features are embodied in this model. (i) Burst requests for all contents of a certain topic. We use a set of tags (named a topic) to describe a group of contents. The topic popularity varies over time, resulting in changes of request frequency for the contents related to this topic. Contents belonging to the same topic are generated by different producers, but their requests are related by the tags, and they have clustered requesting characteristics over time. In this way, traffic generated by different applications is related due to the topic. To make a reasonable comparison of the four schemes, content requests for the same topic follow the Zipf distribution. (ii) User preferences for contents of different topics. This is achieved by setting different request probabilities for the same topic for users in different regions. Consumers and producers apply this traffic model and the collaborative tagging system to generate traffic and add tags.

Figure 3 shows the topology used in our simulations. Consumers and producers are on the edge of the network, and routers are in the network core. Each node is configured to produce 500 contents. Figures 4, 5 and 6 show the results in scatter plots to intuitively reflect the distribution of cache-hits. Each point in the figures represents the cache-hits in the router caches during the content requests.

4.1 Impact of Popularity Distribution on Cache-Hits

This sub-section compares the cache-hits of ICDCS, IFDD [2], LCE, and POPULAR by varying the request popularity distribution in terms of interest dispersion, in the traditional scenario and in our scenario respectively. Here, interest dispersion refers to the exponent parameter of the Zipf-like distribution. By the traditional scenario, we mean that there is only one content producer and all content requests conform to the Zipf distribution. By our scenario, we mean that there are many producers (20 in Fig. 3) and the requests for the same producer follow the Zipf distribution. In addition, all content traffic is mixed at the chunk level.

Fig. 4. Popularity distribution vs. Cache-hits

Figure 4(a) and (b) show the cache-hit differences in the traditional scenario and in our scenario, respectively. We observe that ICDCS achieves better cache-hits than IFDD, LCE, and POPULAR in both types of scenarios. The change of traffic in our scenario causes a significant drop in cache-hits for all strategies, but ICDCS still performs best. When the interest dispersion is increased, the improvement in ICDCS's cache-hit performance is more obvious than that of the other three; ICDCS is more sensitive to interest dispersion. The main reason is that, under our assumption, the increase in the popularity of a single type of content is decentralized throughout the network, but ICDCS can exploit the inherent relations of these contents rather than just their names.

4.2 Impact of Cache Capacity on Cache-Hits

This sub-section investigates the impact of cache capacity on cache-hits. As is known, a good caching strategy should be highly sensitive to capacity; that is, caching performance should improve simply by increasing the cache capacity. The results in Fig. 5 indicate that, as the cache capacity increases, the cache-hits of ICDCS increase the most.

Fig. 5. Cache capacity vs. Cache-hits

4.3 Performance of Caching Schemes at Different Decision Thresholds

In the previous description, the content is cached at the nodes {v | D_v(C_i) ≥ β · MD_x(C_i)}, where β is the decision threshold. Figure 6 presents the results obtained by varying β from 1 to 10, suggesting that β is a decisive factor for cache-hits. When β is increased but stays below a certain value, the cache-hits do not drop significantly and may even increase. When β reaches a certain value (5 in Fig. 6), the cache-hits decrease rapidly. That is, keeping the number of replicas at a reasonable level greatly improves network performance. The investigation of the reason behind this is left for future work.

Fig. 6. Cache decision threshold vs. Cache-hits


5 Conclusion and Future Work

This paper explores CCN traffic features at the mixed chunk level. A simplified traffic model is then built, in which popularity-based caching strategies cannot perform well. We further use the social attributes of content to design a cache allocation scheme based on a tagging system. The simulation results show that the proposed scheme performs better than the existing caching schemes. Future work includes investigating the usefulness of tagging systems for information retrieval and other application functions.

Acknowledgements. This work was supported in part by the Natural Science Foundation of China under Grants 61672092 and 61572066, and in part by the Fundamental Research Funds for the Central Universities of China under Grant 2018JBZ103.

References

1. Jacobson, V., Smetters, D.K., Thornton, J.D., Plass, M.F., Briggs, N.H., Braynard, R.L.: Networking named content. In: International Conference on Emerging Networking Experiments and Technologies, pp. 117–124 (2009)
2. Zhang, G., Liu, J., Chang, X., Chen, Z.: Combining popularity and locality to enhance in-network caching performance and mitigate pollution attacks in content-centric networking. IEEE Access 5, 19012–19022 (2017)
3. Ioannou, A., Weber, S.: A survey of caching policies and forwarding mechanisms in information-centric networking. IEEE Commun. Surv. Tutorials 18, 2847–2886 (2016)
4. Fricker, C., Robert, P., Roberts, J., Sbihi, N.: Impact of traffic mix on caching performance in a content-centric network. In: 2012 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 310–315. IEEE (2012)
5. Quan, W., Xu, C., Guan, J., Zhang, H.: Scalable name lookup with adaptive prefix bloom filter for named data networking. IEEE Commun. Lett. 18, 102–105 (2014)
6. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. (TOIS) 22, 5–53 (2004)
7. Din, I.U., Hassan, S., Khan, M.K., Guizani, M., Ghazali, O., Habbal, A.: Caching in information-centric networking: strategies, challenges, and future research directions. IEEE Commun. Surv. Tutorials 20, 1443–1474 (2018)
8. Wang, Y., Li, Z., Tyson, G., Uhlig, S.: Design and evaluation of the optimal cache allocation for content-centric networking. IEEE Trans. Comput. 65, 95–107 (2016)
9. Pacifici, V., Dán, G.: Coordinated selfish distributed caching for peering content-centric networks. IEEE/ACM Trans. Netw. 24, 1–12 (2016)
10. Duan, J., Wang, X., Xu, S.Z., Liu, Y.N., Xu, C., Zhao, G.F.: Cache scheme based on prefetch operation in ICN. PLoS One 11, e0158260 (2016)
11. Wang, S., Bi, J., Wu, J., Vasilakos, A.V.: CPHR: in-network caching for information-centric networking with partitioning and hash-routing. IEEE/ACM Trans. Netw. 24, 1 (2015)
12. Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: evidence and implications. Proc. IEEE INFOCOM 1, 126–134 (1999)
13. Gu, Y., Chen, L., Tang, K.M.: A load balancing method under zipf-like requests distribution in DHT-based P2P network systems. In: 2009 International Conference on Web Information Systems and Mining, WISM 2009, pp. 656–660 (2009)


14. Mangili, M., Martignon, F., Capone, A.: Performance analysis of content-centric and content-delivery networks with evolving object popularity. Comput. Netw. 94, 80–98 (2015)
15. Afanasyev, A., Moiseenko, I., Zhang, L.: ndnSIM: NDN simulator for NS-3. University of California, Los Angeles, Technical report 4 (2012)
16. Cisco Visual Networking Index: Forecast and Methodology, 2016–2021. White Paper, Cisco, San Jose, CA, USA (2016)

Author Index

Abawajy, Jemal II-630 Adhikari, Binod Kumar IV-628 Ai, Haojun III-44, III-218, III-229, IV-326 Ali, Saqib III-118, IV-399 Alienin, Oleg I-483 Alkasem, Ameen III-432 Allombert, Victor III-159 Alzubi, Jafar A. II-130 Amar, Mohamed Abdellahi I-437 Bai, Guangwei III-142, IV-431 Bai, Lanhua III-385 Bao, Weidong III-293 Bao, Xianyue IV-204 Bellalouna, Monia I-437 Bellatreche, Ladjel III-560 Ben Maaouia, Omar II-388 Benoit, Anne III-175 Berkani, Nabila III-560 Bhoiwala, Jaina IV-62 Bi, Chongke I-621 Bi, Wei II-582 Bouazizi, Emna I-75 Cai, Bo III-218, III-229 Cai, Hongming II-611 Cai, Miao I-267 Cai, Xiaojun III-370 Cai, Yiqiao II-76, II-308 Cai, Yuan II-187 Cai, Yujie IV-3 Cai, Zhicheng III-519 Cao, Ning I-524 Cérin, Christophe II-388, III-103 Chai, Xin III-132 Chang, Xiaolin III-661 Che, Ming II-551 Chen, Chao II-272 Chen, Deng III-257 Chen, Fei I-3, IV-461 Chen, Gan II-538 Chen, Haiyan III-618 Chen, Hanwen III-257 Chen, Jiankang IV-121

Chen, Jianxi I-562 Chen, Jixiao I-323 Chen, Lidong I-89 Chen, Long I-393, I-636, II-90, III-3, III-355, III-460


Dong, Yong-Qiang I-34 Dong, Zhibin II-60 Du, Haiwen I-524 Du, Jiayi I-138 Du, Qingfeng III-59, IV-560 Du, Xin I-122 Du, Yu I-229 Du, Yunfei I-122, II-114 Dulin, IV-511 Dutta, Pramit IV-62 Elhabob, Rashad IV-220 Eltayieb, Nabeil IV-220 Faiz, Sami I-75 Fan, Jianhua IV-549 Fan, Jianxi II-3, III-342 Fan, Sijiang IV-341 Fan, Weibei II-3 Fan, Yibo IV-3 Fang, Junbin IV-447 Feng, Dan I-562, I-608, II-445 Feng, Hao I-200 Feng, Yilin I-421 Fkaier, Hazem II-388 Fu, Haohuan III-504 Fu, Min II-445 Gan, Lin III-504 Gan, Yong IV-600 Gang, Peng I-483 Gao, Ce II-611 Gao, Chongzhi IV-249 Gao, Cuiying I-59 Gao, Hepeng II-199 Gao, Xingkun I-593 Gava, Frédéric III-72, III-159 Geng, Yangyang I-636 Gheisari, Mehdi II-130, IV-303 Ghorbani, Hamidreza IV-303 Gong, Liangyi IV-234 Gordienko, Nikita I-483 Gordienko, Yuri I-483 Gotewal, Krishan Kumar IV-62 Goupille-Lescar, Baptiste III-575 Gu, Jingyun I-59 Gu, JingZi I-184 Gu, Liang I-34 Gu, Yingjie IV-538

Gu, Yiren III-142 Gu, Zhuoer II-272 Guan, Lele II-263 Gui, Ruowei IV-538 Gui, Xiaolin IV-538 Guo, Bing II-507 Guo, Dongchao II-32 Guo, Lin IV-628 Guo, Mengying III-17 Guo, Ning IV-447 Guo, Peiming IV-21 Guo, Yang I-138, III-196 Gupta, Sarthak IV-62 Hamdi, Sana I-75 Han, Jizhong II-226 Han, Kun III-249 Han, Liangliang III-44 Han, Wencong IV-249 Han, Yuejuan III-342 Han, Zengyi II-596 Han, Zhijie II-3 Hassan, Alzubair IV-220 He, Bin IV-72 He, Keqing II-46 He, Lei III-311 He, Ligang I-19, II-272 He, QiLin IV-399 He, Wei III-280 He, Weichao III-89 He, Ximing IV-341 He, Yu III-59, IV-560 Hong, Huang III-651 Hong, Xiaoyu III-186 Hu, Bin II-105, III-239 Hu, Cheng I-378 Hu, Jingjing IV-249 Hu, Shengjie I-215 Hu, Songlin IV-12 Hu, Xiaoteng I-284 Hu, Xiaoyan I-153 Hu, Yi II-551 Hu, Yujia III-142, IV-431 Huang, Baoqi II-175, III-476, III-489 Huang, Hao I-267 Huang, Haojun II-32 Huang, Libo IV-341 Huang, Linpeng I-578, II-354 Huang, Liping I-323, II-199 Huang, Liusheng I-636, III-311


Huang, Shuqiang III-608 Huang, Weimin I-498 Huang, Weiyi IV-326 Huang, Xiaomeng III-504 Huang, Xuan I-593 Huang, Yanyu IV-85 Huang, Yujie IV-3

Jain, Swati IV-62 Jemni, Mohamed II-388 Ji, Xiang II-187 Ji, Xuan III-28 Ji, Yiming I-621 Jia, Bing II-175, II-199, II-378, III-476, III-489 Jia, Chunfu II-462, IV-98, IV-178 Jia, Haipeng II-338 Jia, Huiwen IV-600 Jia, Menghan I-106 Jia, Zhiping III-370, IV-617 Jiajia, Sun IV-260 Jiang, Huiwen II-287 Jiang, Peng I-19 Jiang, Yuli III-209 Jiang, Yunpeng II-596 Jiang, Zoe L. II-248, IV-358, IV-374, IV-389, IV-415, IV-447 Jiao, Libo II-32, IV-527 Jin, Hai I-3 Jin, Yabin IV-447 Jin, Yufei II-491 Jinbao, Gao IV-260 Jing, Minge IV-3 Khaznaji, Walid I-437 Kuang, Di I-122 Kuang, Yuyu III-257 Leclercq, Étienne III-103 Lee, Dong Hoon IV-85 Lefèvre, Laurent III-175 Lei, Jing IV-193 Lei, Lixia I-498 Lenormand, Eric III-575 Li, Bingyao I-284 Li, Bo I-184, IV-85 Li, Chenyang II-263 Li, Dongsheng I-106

Li, Fagen IV-220 Li, Guohui II-507 Li, Hang IV-178 Li, Hongwei III-89 Li, Hui III-280, IV-34 Li, Jian III-402 Li, Jianjun II-507 Li, Jiaxing I-393 Li, Jin IV-374 Li, Jing II-76, II-308 Li, Jingyi I-393 Li, Jinyang III-402 Li, Jun III-218 Li, Junyu II-272 Li, Lianmeng I-284 Li, Lingzhi IV-72 Li, Min IV-234 Li, Peng II-3 Li, Pengze II-18 Li, Qinan II-378 Li, Qingzhong II-18 Li, Qiushi I-545 Li, Qun IV-166 Li, Shengguo I-242 Li, Shuang II-417 Li, Tiantian II-105 Li, Tianyou II-354 Li, Tingting III-28 Li, Tong IV-98 Li, Tongfang III-270 Li, Wanpeng IV-288 Li, Wei I-215 Li, Wenfa II-297 Li, Wenjuan IV-481 Li, Wenwu III-196 Li, Wuyungerile II-378, III-476 Li, Xiaofang II-60 Li, Xiaoguo IV-138 Li, Xiaolin III-534 Li, Xiaoyan III-342 Li, Xin II-567, II-630 Li, Xinxin I-122 Li, Xu IV-399 Li, Xuan IV-374 Li, Xun III-547 Li, Yalin III-239 Li, Yinan II-90 Li, Yunchun I-215 Li, Zhang III-651 Li, Zhaoyang IV-234


Li, Zhen II-524 Li, Zhihao II-338 Lianfang, Wang III-651 Liang, Jiahao IV-643 Liang, Yanhua III-28 Liao, Liang III-44, IV-326 Lin, Cheng-Kuan II-3, II-145 Lin, Xiaotong II-160 Lin, Yong II-319 Lin, Yongzheng I-257 Liu, Anran IV-109 Liu, Chunbo IV-178 Liu, Duan III-519 Liu, Fang I-358 Liu, Fangxin II-367, III-640 Liu, Gang I-138 Liu, Guanjun IV-47 Liu, Hengchang III-402 Liu, Hong I-257 Liu, Hongwei III-432, IV-389 Liu, Jianwei IV-549 Liu, Jie I-242 Liu, Jing III-249, IV-611 Liu, Jingning I-562 Liu, Jingyu III-132 Liu, Jiqiang III-661 Liu, Joseph K. IV-85 Liu, Ke IV-617 Liu, Lei II-18, III-280 Liu, LiPing III-593 Liu, Lu III-132 Liu, Mengdi IV-495 Liu, Minmin II-175 Liu, Qi III-402 Liu, Sheng III-618 Liu, Shenming I-267 Liu, Wantao IV-12 Liu, Wei III-593 Liu, Weijie I-406 Liu, Wenbin II-596 Liu, Wenguo I-608 Liu, Wenjie IV-341, IV-538 Liu, Wenwen II-238 Liu, Xiaoguang II-238 Liu, Ximeng IV-204 Liu, Xuefeng IV-193 Liu, Yan III-270 Liu, Yanyan I-524 Liu, Yin IV-12 Liu, Yunan I-449

Liu, Zechao IV-374 Liu, Zheli IV-85, IV-234 Liu, Zhiwei I-498 Long, Xin III-460 Lu, Kejie IV-72 Lu, Liu I-299 Lu, Tao II-432, III-257, III-534 Lu, Xiaoxiao II-551 Lu, Yutong II-114 Luo, Gang IV-495 Luo, Jun I-229 Luo, Qi I-621 Luo, Wei II-76, II-308 Lv, Haitao III-270 Lv, Jiazhuo IV-447 Lv, Peipei I-34 Lv, Siyi IV-85 Lv, Tongtong II-462, IV-98 Lv, Weiqiang II-538 Lyu, Yongqiang II-32, IV-527

Ma, Bin I-284 Ma, Jingyi III-311 Ma, Kun I-257, IV-166 Ma, Sheng III-196, IV-341 Ma, Xinwei III-453 Ma, Yong I-449, II-462 Madsen, David IV-481 Maharjan, Ramesh IV-628 Mai, Xiaoming I-508 Mao, Rui I-19 Marquer, Yoann III-72 Mei, Linjun I-562 Mei, Songzhu I-106 Meng, Dan I-184 Meng, Guozhu III-417 Meng, Weizhi IV-481 Meng, Xiangfei II-524 Menouer, Tarek III-103 Miao, Qing II-175 Miao, Xuzhi IV-591 Mo, Xiuliang IV-234 Morin, Christine III-575

Naiouf, Marcelo I-310 Ngoko, Yanik II-388 Nie, Ming I-508 Ning, Yang III-402


Orgerie, Anne-Cécile III-175

Parlavantzas, Nikos III-575 Pavliuchenko, Ivan I-483 Pei, Qingqi IV-193 Peng, Junjie II-538 Peng, Li III-534 Peng, Lizhi I-46, IV-109 Peng, Su IV-270 Peng, Ziwei II-114 Peters, Dennis K. I-498 Pousa, Adrián I-310 Qian, Zhuzhong II-567 Qiang, Chenyi I-406 Qiao, Xueming I-524 Qiao, Yu II-46 Qikun, Zhang I-299 Qin, Xiaolin II-630 Qin, Yongrui II-130 Qin, Zhen II-248 Qiu, Juan III-59 Qiu, Yunzhou III-630 Qu, Chao III-186, III-608 Qu, Shenming III-534 Qu, Xiaoping III-270 Raïs, Issam III-175 Raju, Daniel IV-62 Rastogi, Naveen IV-62 Ren, Fangyuan IV-581 Ren, Rui II-611 Ren, Shenyuan I-19, II-272 Ren, Shiqiang III-257 Ren, Wei II-582 Rojbi, Anis I-483 Rokovyi, Oleksandr I-483 Rudoff, Andy II-354 Sanz, Victoria I-310 Shang, Zhaohui II-551 Shankaran, Rajan IV-495 Shao, Chi II-76 Shao, Wei IV-178 Shao, Xuexiao III-239 Sharma, Priyanka IV-62 Shen, Hang III-142, IV-431 Shi, Yuliang IV-109, IV-166

Si, Lei I-378 Song, Lan I-498 Song, Rui III-618 Stirenko, Sergii I-483 Stones, Rebecca J. II-238 Su, Wei II-60 Sun, Chao II-524, II-551 Sun, Guanchao I-215 Sun, Jizhou I-621, II-551 Sun, Qiao I-465 Sun, Rujun II-212 Sun, Wenhai IV-193 Sun, Wenli IV-415 Sun, Xiaoshan III-402 Sun, Yongxiong III-28 Tan, Jiayu IV-153 Tan, Shichong IV-511 Tan, Wen-tao IV-573 Tang, Jie I-421, I-593 Tang, Shanjiang I-284, II-524 Tang, Wenjuan II-417 Tang, Xuehai II-226 Tang, Yi II-160 Tao, Ming III-186, III-608 Tao, Xiaoling IV-511 Tesson, Julien III-159 Tian, Miaomiao IV-461 Tian, Yu III-489 Tong, Hai III-142 Tseng, Yu-Chee II-145 Wan, Jianyi I-449 Wang, Bei I-532 Wang, Chongjun I-421 Wang, Chundong IV-234 Wang, Dianxin II-263 Wang, En II-596 Wang, Gang II-238 Wang, Gongming II-297 Wang, Guijuan III-342 Wang, Guojun III-118, IV-303, IV-399 Wang, Haibo III-311 Wang, Hao I-621 Wang, Haobo I-89 Wang, Hong III-209, IV-573 Wang, Hui II-329 Wang, Ji III-293 Wang, Jiaming II-432


Wang, Jian IV-153 Wang, Jianfeng IV-511 Wang, Jin IV-72 Wang, Jinlong IV-611 Wang, Junjie III-402 Wang, Junxiu II-378 Wang, Kai IV-527 Wang, Liangyuan II-630 Wang, Lidan II-145 Wang, Lina I-406 Wang, Lingyan IV-461 Wang, Lu-tong IV-573 Wang, Mingwen I-449 Wang, Na II-367, III-640 Wang, Peng II-226 Wang, Qiang IV-270 Wang, Qinglin I-242 Wang, Ruchuan II-3 Wang, Rui III-370 Wang, Shanshan IV-109, IV-166 Wang, Shu III-249 Wang, Songyun II-567 Wang, Tianjing III-142, IV-431 Wang, Tianyu IV-617 Wang, Weiping I-184 Wang, Wenjun II-329 Wang, Xiao II-338 Wang, Xiaodong I-168 Wang, Xiaofen IV-288 Wang, Xin III-504 Wang, Xiyang II-212 Wang, Xuan IV-358, IV-374, IV-415, IV-447 Wang, Yifeng III-44 Wang, Yingtao II-538 Wang, Yiqi II-199 Wang, Yu IV-481 Wang, Zhi IV-178 Wang, Zhiying IV-341 Wang, Zhongyue II-462 Wei, Wenhong III-608 Wei, Yu IV-85 Wen, Shiqi III-229 Wen, Tangliu I-498 Wu, Changmao I-168, I-465 Wu, Chao I-545 Wu, Chuxin IV-389 Wu, Duanwei II-308 Wu, Gangshan I-593 Wu, Hao I-508

Wu, Huan II-491 Wu, Hui I-508 Wu, Jianhua I-89 Wu, Jianping IV-21 Wu, Jiaxi II-160 Wu, Jie III-630 Wu, Jigang I-393, II-90, III-3, III-355, III-460 Wu, Qing II-477 Wu, Song I-3 Wu, Weigang I-122, II-287, II-401 Wu, Wenhua III-249 Wu, Yang IV-617 Wu, Yulin IV-358, IV-374, IV-415, IV-447 Wu, Zhendong IV-313 Wu, Zhiwei II-477

Author Index

Xu, Zichen I-59 Xu, Zifeng IV-270 Xue, Jingting III-385 Xue, Ruini II-60

Yuan, Sihao III-270 Yu-an, Tan IV-260 Yuanzhang, Li I-299, III-651 Yue, Yinliang I-89

Yaghoubi, Ali II-130 Yan, Fang IV-249 Yan, Jie II-524 Yan, Longchuan IV-12 Yan, Qiben IV-166 Yan, Zijie II-401 Yang, Cheng III-293 Yang, Funing I-323, II-199 Yang, Guangwen III-504 Yang, Guowu IV-288 Yang, Haomiao III-89 Yang, Jianping IV-617 Yang, Jinzhe III-504 Yang, Peng I-34 Yang, Ru I-378 Yang, Weiyong I-267 Yang, Wenjun IV-234 Yang, Xudong III-17 Yang, Yang III-661 Yang, Yingyi I-508 Yang, Yongjian I-323, II-199, II-596 Yang, Yuhong IV-326 Ye, Shibei IV-461 Ye, Yan I-122, II-401 Yin, Ao II-248, IV-358 Yin, Chao III-270 Yin, Hao II-32, IV-527 Yin, Kanglin III-59 Yin, Yifeng IV-600 Ying, Zuobin IV-204 Yiu, S. M. IV-374, IV-415, IV-447 You, Lin III-547, IV-643 You, Lujin II-538 Yu, Ce I-284, I-621, II-524, II-551 Yu, Liang I-200 Yu, Qiankun III-355 Yu, Rongwei I-406 Yu, Wenping III-327 Yu, Xiao III-249 Yu, Xiaomei III-209 Yu, Xiao-mei IV-573 Yu, Yating IV-447 Yu, Zhang I-299, IV-260 Yuan, Jiabin II-567 Yuan, Ruifen III-186

Zeng, Guozhao III-618 Zeng, Kangli III-534 Zeng, Lingfang I-562, I-608 Zeng, Wei I-483 Zeng, Xiaoyang IV-3 Zhai, Yujia III-28 Zhai, Yutong I-636 Zhang, Bo I-621 Zhang, Changyou I-465 Zhang, Chunkai II-248, IV-358, IV-374 Zhang, Congpin I-168 Zhang, Guomin II-367 Zhang, Guozhi III-661 Zhang, Huayu IV-34 Zhang, Jianzhong II-491, III-327 Zhang, Jie III-186 Zhang, Jinchao I-184 Zhang, Jun IV-415, IV-495 Zhang, Keli II-248, IV-358 Zhang, Kun I-257 Zhang, Lei I-337 Zhang, Lufei II-212 Zhang, Ning III-417 Zhang, Peng IV-389, IV-415 Zhang, Qipeng II-354 Zhang, Quan IV-3 Zhang, Quanxin IV-249 Zhang, Ran III-280 Zhang, Shuang I-532 Zhang, Shukui IV-72 Zhang, Song II-105 Zhang, Wei III-118, III-257 Zhang, Weidong I-337 Zhang, Xiaofei II-145 Zhang, Xiaosong IV-288 Zhang, Xing II-248, IV-358 Zhang, Xinxiang III-3 Zhang, Yanduo II-432, III-257, III-534 Zhang, Yanhua IV-600 Zhang, Yaocheng II-582 Zhang, Yaoxue I-545, II-417 Zhang, Yiming I-106 Zhang, Yinghui III-453, IV-581 Zhang, Yu II-491, IV-121 Zhang, Yuan III-385


Zhang, Yunquan II-338 Zhang, Zhiwei IV-511 Zhang, Zhiyong III-370, IV-617 Zhang, Zhiyuan II-226 Zhang, Ziyao III-132 Zhang, Zonghua IV-495 Zhao, Chuan IV-109 Zhao, Fangyuan II-319 Zhao, Hainan IV-374, IV-415 Zhao, Jiangfan III-453, IV-581 Zhao, Jianhui III-218, III-229 Zhao, Kaichuan II-417 Zhao, Long III-489 Zhao, Mengying IV-617 Zhao, Ming I-358 Zhao, Yang II-524 Zhao, Yi II-46 Zhao, Yingliang IV-538 Zheng, Dong III-453, IV-581 Zheng, Jun II-263 Zheng, Lijuan IV-611 Zheng, Shengan I-578 Zheng, Wantong II-462 Zheng, Wei III-402 Zheng, Xi III-417, IV-495 Zheng, Xiangwei II-105, III-239 Zheng, Xiaokun III-453, IV-581 Zheng, Yongqing II-18 Zheng, Yuehui III-519

Zhong, Hong IV-204, IV-461 Zhou, Bin III-608 Zhou, Chunyang II-507 Zhou, Fucai IV-270 Zhou, Guangpeng II-18 Zhou, Huabing III-257 Zhou, Jingya III-342, IV-72 Zhou, Qixian III-89 Zhou, Rang IV-288 Zhou, Tao III-476 Zhou, Wenan I-229 Zhou, Xinyu I-449 Zhou, Yong III-257 Zhou, Yueyue IV-138 Zhou, Yuezhi I-545, II-417 Zhou, Yukun II-445 Zhou, Zhanyong IV-341 Zhu, Dongjie I-524 Zhu, Likun IV-234 Zhu, Minghua III-630 Zhu, Tianqing II-582 Zhu, Xiaomin III-293 Zhu, Xingyu III-519 Zhu, Zhuo I-323, II-199 Zhuang, Yuehui II-477 Zou, Xueqiang II-491 Zucheng, Huang I-299 Zuo, Decheng III-432 Zuo, Wan Li IV-628
